Summary
Background
Interpretation of plasma metanephrines and methoxytyramine to assess the likelihood of phaeochromocytoma/paraganglioma (PPGL) during screening can be challenging. This study (study period: 2021–2023) introduces new methods to select machine-learning (ML) models and to evaluate the derived probability-scores, with the aim of improving interpretation of laboratory results.
Methods
ML models were trained and internally tested on data from 2046 patients with and without PPGL, using the following features: age, pre-test risk of PPGL, plasma metanephrines and methoxytyramine. External validation involved a second cohort of 1641 patients with and without PPGL. Three processes were used to select and evaluate the best model: concordance of models with human intelligence; intra- and inter-laboratory variability in derived probability-scores; and comparison of the selected model's scores with predictions of ten clinical care specialists before and after provision of those scores.
Findings
External validation established similarly excellent diagnostic performance for all five best ML models according to areas under ROC curves (0.988–0.994) and balanced accuracies (0.958–0.981). Probability-scores of the models, however, varied widely and were poorly correlated. The additional selection processes indicated that an artificial-network model was superior to, and more robust than, the others. Predictions of disease likelihood by specialists, assigned to six categories ranging from "disease highly unlikely" to "disease clear", varied widely for individual patients. Within each of the six predictive categories, median probability-scores of the artificial-network model were 70-, 175-, 59-, 15-, 3.5- and 1.7-fold higher (P < 0.0001) in patients with than without PPGL. This superiority of probability-scores over the variable predictions by specialists remained evident after specialists were tasked to modify their predictions according to those scores.
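For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how an area under the ROC curve, a balanced accuracy and a fold difference in median probability-scores can be computed. It is not the study's code: the function names, the 0.5 threshold and the eight labelled scores are all assumptions made up for illustration, and the study's actual models, cut-offs and data are not reproduced here.

```python
from statistics import median


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels, scores, threshold=0.5):
    """Mean of sensitivity and specificity at an assumed score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def median_fold_difference(labels, scores):
    """Ratio of the median score in patients with disease to that in patients without."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    return median(pos) / median(neg)


# Hypothetical data: 1 = PPGL, 0 = no PPGL; scores are invented probabilities.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.70, 0.40, 0.45, 0.10, 0.05, 0.02]

auc = roc_auc(labels, scores)                   # 0.9375 for this toy data
bacc = balanced_accuracy(labels, scores)        # 0.875 for this toy data
fold = median_fold_difference(labels, scores)   # approximately 10
```

In this toy example the two summary metrics look strong even though one patient with disease scores below one patient without, which is the kind of detail that a comparison of individual probability-scores can expose and that summary metrics alone do not.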
Interpretation
This study employed several novel processes to establish an ML model with probability-scores superior to predictions of disease likelihood by specialists. However, the negligible improvement in interpretations by specialists after provision of probability-scores indicates that providing such scores alone is insufficient to improve decision-making.
Funding
Deutsche Forschungsgemeinschaft.