Figure 1. Cumulative frequency histogram of area under the receiver operating characteristic curve (AUC) values for mortality.
Figure 2. Area under the receiver operating characteristic curve (AUC) values for predictive tools that were examined in 4 or more assessments (n = number of assessments) with 95% confidence intervals (CIs). Summary results of AUC and 95% CIs are provided using random effects meta-analysis. APACHE II indicates Acute Physiology And Chronic Health Evaluation II; CLIP, Cancer of the Liver Italian Program; CTP, Child-Turcotte-Pugh; CURB-65, confusion–blood urea nitrogen–respiratory rate–blood pressure–age ≥65 years; JIS, Japan Integrated Staging; MELD, Model for End-Stage Liver Disease; NT-pro-BNP, N-terminal-pro-B-type natriuretic peptide; PSI, Pneumonia Severity Index; SAPS, Simplified Acute Physiology Score; SOFA, Sequential Organ Failure Assessment. To obtain further information about the specific studies that contribute AUC estimates to each predictive tool shown in this figure, please contact the authors or consult the eTables and eReferences.
Siontis GCM, Tzoulaki I, Ioannidis JPA. Predicting Death: An Empirical Evaluation of Predictive Tools for Mortality. Arch Intern Med. 2011;171(19):1721-1726. doi:10.1001/archinternmed.2011.334
Author Affiliations: Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece (Drs Siontis, Tzoulaki, and Ioannidis); Department of Epidemiology and Biostatistics, Imperial College of Medicine, London, England (Drs Tzoulaki and Ioannidis); the Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts University School of Medicine, Boston, Massachusetts (Dr Ioannidis); the Department of Epidemiology, Harvard School of Public Health, Boston (Dr Ioannidis); and the Stanford Prevention Research Center, Stanford University School of Medicine, Stanford, California (Dr Ioannidis).
Background The ability to predict death is crucial in medicine, and many relevant prognostic tools have been developed for application in diverse settings. We aimed to evaluate the discriminating performance of predictive tools for death and the variability in this performance across different clinical conditions and studies.
Methods We used Medline to identify studies published in 2009 that assessed the accuracy (based on the area under the receiver operating characteristic curve [AUC]) of validated tools for predicting all-cause mortality. For tools where accuracy was reported in 4 or more assessments, we calculated summary accuracy measures. Characteristics of studies of the predictive tools were evaluated to determine if they were associated with the reported accuracy of the tool.
Results A total of 94 eligible studies provided data on 240 assessments of 118 predictive tools. The AUC ranged from 0.43 to 0.98 (median [interquartile range], 0.77 [0.71-0.83]), with only 23 of the assessments reporting excellent discrimination (10%) (AUC, >0.90). For 10 tools, accuracy was reported in 4 or more assessments; only 1 tool had a summary AUC exceeding 0.80. Established tools showed large heterogeneity in their performance across different cohorts (I² range, 68%-95%). Reported AUC was higher for tools published in journals with lower impact factor (P = .01), in studies with larger sample size (P = .01), and for tools that aimed to predict mortality among the highest-risk patients (P = .002) and among children (P < .001).
Conclusions Most tools designed to predict mortality have only modest accuracy, and there is large variability across various diseases and populations. Most proposed tools do not have documented clinical utility.
The ability to predict death accurately is crucial for conveying information to patients about their future; for making sound medical decisions for management, treatment, and prevention; and for having realistic expectations. Evidence suggests that physicians perform poorly in predicting when patients will die.1,2 However, numerous models have been developed to predict mortality in diverse settings.3-5
Herein we aim to empirically evaluate the ability of available predictive tools (multivariate or single variables) to predict the risk of death accurately for diverse conditions and populations. We assess how accurately and consistently these tools perform to help understand their potential clinical utility.
To evaluate recently published studies that assessed the accuracy (discrimination) of tools to predict mortality, we searched Medline for studies published in 2009 by using the Clinical Queries tool. For more details on our search strategy and data extraction, see the eAppendix.
We included studies of any design published in 2009 that assessed the accuracy of tools to predict mortality (either single predictors or multivariable models); included assessment of accuracy based on the area under the receiver operating characteristic curve (AUC) (also known as the C statistic or C index); and focused on all-cause death as the primary outcome. The AUC6-9 is the most commonly used metric for assessing the accuracy of predictive tools.10 The AUCs can be compared across different tools, whereas relative risk metrics depend on the units in which they are expressed and cannot directly compare predictive tools expressed in different units of measurement.11
We excluded studies that only had data on the development of a new predictive tool or validated the predictive tool in the same cohort where it was developed because new, nonvalidated predictive tools are likely to have inflated estimates of accuracy.12-14 We also excluded articles that did not provide primary data (eg, reviews) and studies where death was part of a composite outcome or was determined as cause-specific (rather than all-cause) mortality.
When there were several eligible predictive tools and/or they assessed the ability to predict death at different lengths of follow-up in the same cohort, each proposed predictive tool and each time of follow-up assessment was included separately. For example, one study examined 2 predictive tools (Multidimensional Prognostic Index [MPI] and Pneumonia Severity Index [PSI]) for a total of 6 assessments at 3 different follow-up periods (1, 6, and 12 months) (see reference S47 in eReferences).
The full text of the eligible studies and any supplementary materials were scrutinized to extract information on study design, characteristics of the cohort (prevalence of specific diseases), characteristics of the predictive tool and data on calibration,15 reclassification,16-18 and accuracy. For each study, we recorded the journal impact factor per the Institute for Scientific Information.19 Calibration examines whether the risk prediction is equally good for patients at different levels of risk or there is a lack of fit. Reclassification examines whether the predictive tool helps classify patients in different, more appropriate risk categories compared with what could be done without its knowledge or compared with some other model. Accuracy is assessed by the AUC.
AUC values were summarized as mean (SD) or median (interquartile range [IQR]). An AUC of 1 indicates perfect discrimination, while an AUC of 0.5 indicates discrimination no better than chance. While there are no absolute thresholds, usually an AUC of greater than 0.80 is considered to show very good discrimination, and AUC greater than 0.90 suggests excellent discrimination.9
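As an illustration of what the AUC measures, the following minimal sketch computes it as the proportion of (death, survivor) pairs in which the patient who died received the higher predicted risk, counting ties as one half. The function names and the label wording are hypothetical and are not taken from the study; only the 0.80 and 0.90 cutoffs come from the text above.

```python
def auc(scores, died):
    """Rank-based AUC: chance that a randomly chosen patient who died
    has a higher score than a randomly chosen survivor (ties count 0.5)."""
    deaths = [s for s, d in zip(scores, died) if d]
    survivors = [s for s, d in zip(scores, died) if not d]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for s_death in deaths:
        for s_surv in survivors:
            if s_death > s_surv:
                wins += 1.0
            elif s_death == s_surv:
                wins += 0.5
    return wins / (len(deaths) * len(survivors))


def discrimination_label(value):
    """Apply the conventional cutoffs quoted in the text (hypothetical labels)."""
    if value > 0.90:
        return "excellent"
    if value > 0.80:
        return "very good"
    return "below the 0.80 cutoff"
```

For example, with illustrative scores `[0.9, 0.4, 0.6, 0.2]` and outcomes `[1, 1, 0, 0]`, three of the four death-survivor pairs are ordered correctly, so `auc` returns 0.75.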
For predictive tools where there was more than 1 assessment available, we noted the range of AUC values. For predictive tools with at least 4 data sets where both the AUC and corresponding 95% confidence intervals (CIs) were available, we summarized the AUC estimates using random-effects models, weighting the AUC of each data set by the inverse of the sum of the between- and within-study variances.20-22 We quantified the heterogeneity in AUC values by the I² metric and its 95% CI. The I² metric takes values between 0% and 100%, and it is independent of the number of data sets (50%-75% indicates moderate heterogeneity, while >75% indicates very large heterogeneity).23
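The pooling step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' STATA code: it assumes the DerSimonian-Laird estimator for the between-study variance (the text does not name the estimator), and it assumes each reported 95% CI is symmetric on the AUC scale so that the within-study standard error is the CI width divided by 2 × 1.96. The function name and the returned keys are hypothetical.

```python
import math


def random_effects_summary(aucs, ci_lows, ci_highs):
    """Pool AUC estimates with inverse-variance random-effects weights.

    Assumptions (not from the source): DerSimonian-Laird tau^2 and
    symmetric normal-theory 95% CIs on the AUC scale.
    """
    k = len(aucs)
    if k < 2 or len(ci_lows) != k or len(ci_highs) != k:
        raise ValueError("need at least 2 data sets with matching CIs")
    # Within-study variance recovered from the 95% CI width.
    within = [((hi - lo) / (2 * 1.96)) ** 2 for lo, hi in zip(ci_lows, ci_highs)]
    w = [1.0 / v for v in within]
    fixed = sum(wi * a for wi, a in zip(w, aucs)) / sum(w)
    # Cochran's Q and the DerSimonian-Laird between-study variance.
    q = sum(wi * (a - fixed) ** 2 for wi, a in zip(w, aucs))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Weight = inverse of (within-study + between-study variance).
    w_star = [1.0 / (v + tau2) for v in within]
    summary = sum(wi * a for wi, a in zip(w_star, aucs)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    # Heterogeneity as a percentage of total variation, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "summary": summary,
        "ci_low": summary - 1.96 * se,
        "ci_high": summary + 1.96 * se,
        "tau2": tau2,
        "i2": i2,
    }
```

When the input AUC estimates agree closely, the between-study variance collapses to 0 and the result matches a fixed-effect pooled estimate; widely scattered estimates with narrow CIs push the heterogeneity percentage toward 100%.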
We compared the AUC values among prespecified subgroups based on prevalence of disease and predictive tool characteristics using 1-way analysis of variance for categorical variables and the Spearman correlation coefficient for continuous variables. Analyses were performed with STATA software, version 10.0 (StataCorp LP, College Station, Texas).
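For the continuous variables, the Spearman coefficient is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a plain-Python illustration of that calculation; the helper names are hypothetical, and the study's own analyses were run in STATA rather than with this code.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (sx * sy)
```

A negative coefficient between, say, journal impact factor and reported AUC would correspond to an inverse association of the kind tested in these subgroup analyses.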
Overall, 544 items were retrieved from Medline, of which 235 were reviewed in full text. Of those, 94 articles (eReferences) were deemed eligible (eFigure). The interrater agreement (between G.C.M.S. and I.T.) for the selection of the eligible studies had a κ value of 0.86.
These 94 manuscripts presented data on 240 assessments (224 multivariate models and 16 single predictors) of the accuracy of 118 predictive tools. Characteristics of studies and predictive tool assessments are listed in eTable 1. Most of the studies were performed in the United States or Europe, had a prospective cohort design, and pertained to acute disease conditions. Cardiovascular, critical-illness, infectious, gastroenterology-related, and malignant diseases accounted for 83% of the cohorts, but many other diseases were also assessed (eTables 1, 2, and 3). The median (IQR) sample size for the assessments was 502 (185-2016); the median (IQR) number of deaths was 71 (32-157); the median (IQR) proportion of deaths was 14% (5%-29%); and the median (IQR) death rate was 13% (4%-44%) per month. Among the whole data set (94 studies), in only 1 study (S85 in eReferences) did the investigators review and abstract patient data blinded to patients' hospital course and clinical status (eTable 1). For 78 studies, the percentage of losses to follow-up was available: 70 studies reported no losses, while for the rest loss was generally low (median [IQR] loss to follow-up, 3.5% [1.25%-10.25%]).
Overall, 110 different predictive models and 8 different predictors were examined in the 240 assessments. The most commonly evaluated models included the Acute Physiology And Chronic Health Evaluation (APACHE) II model (n = 19) and the MELD score (Model for End-Stage Liver Disease) (n = 17) (Table 1). The predictive models included a wide range of variables (eTable 2). The number of variables in the models ranged from 2 to 30, and the median (IQR) number was 6 (4-12). All of the identified single predictors were biomarkers (eTable 3).
Calibration of the predictive tools was examined in fewer than half of the included studies (n = 45; 48%), mainly by using the Hosmer-Lemeshow statistic (n = 35; 78%) and observed/predicted ratio (n = 5; 11%). Results were available in 44 studies (105 predictive tool assessments), indicating lack of fit for 8 studies (17 predictive tools).
Only 1 study (S83 in eReferences) examined reclassification analysis by means of the net reclassification improvement and the integrated discrimination index. This study investigated the added predictive value of radiographic ascites over and above the MELD-Na score in patients with cirrhosis.
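For readers unfamiliar with the net reclassification improvement, the sketch below shows the categorical form as it is commonly defined: the net proportion of patients who died and moved to a higher risk category, plus the net proportion of survivors who moved to a lower one. It is an illustration only; how study S83 computed the metric is not described here, and the function name and the integer coding of risk categories are assumptions.

```python
def net_reclassification_improvement(old_category, new_category, died):
    """Categorical NRI for a new model versus an old one.

    Categories are assumed to be integers ordered from lowest to
    highest risk (hypothetical coding chosen for this example).
    """
    up_event = down_event = n_event = 0
    up_nonevent = down_nonevent = n_nonevent = 0
    for old, new, d in zip(old_category, new_category, died):
        if d:
            n_event += 1
            up_event += new > old
            down_event += new < old
        else:
            n_nonevent += 1
            up_nonevent += new > old
            down_nonevent += new < old
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need at least one death and one survivor")
    # Upward moves are correct for deaths; downward moves for survivors.
    event_part = (up_event - down_event) / n_event
    nonevent_part = (down_nonevent - up_nonevent) / n_nonevent
    return event_part + nonevent_part
```

The result ranges from -2 to 2; a value of 0 means the new model reclassifies patients no better, on balance, than the old one.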
The AUC values ranged from 0.43 to 0.98 (Figure 1), and the median (IQR) AUC value was 0.77 (0.71-0.83). A total of 95 of the AUC values were higher than 0.80 (very good discrimination) (40%), but only 23 were higher than 0.90 (excellent discrimination) (10%).
The AUC data for all predictive tools with 2 or more assessments are listed in Table 1. For each of these 34 tools, the range of AUC estimates was large, sometimes spanning the spectrum from inaccurate to excellent accuracy. The median AUC values suggested modest accuracy. For only 2 predictive tools (Clinical Risk Index for Babies [CRIB] II [S25 and S27 in eReferences] and Pediatric death prediction model [S92 in eReferences]), the median AUC value suggested excellent accuracy (AUC, 0.91 and 0.92, respectively), but this was based on only 2 assessments of each tool. Four or more assessments of the accuracy of a predictive tool were available for only 9 tools (APACHE, MELD, SOFA [Sequential Organ Failure Assessment], CTP [Child-Turcotte-Pugh], SAPS [Simplified Acute Physiology Score] II, PSI, CLIP [Cancer of the Liver Italian Program], CURB-65 [confusion–blood urea nitrogen–respiratory rate–blood pressure–age ≥65 years], JIS [Japan Integrated Staging]) and 1 biomarker (NT-pro-BNP [N-terminal-pro-B-type natriuretic peptide]). Using random effects meta-analysis, we found that the summary AUC estimates for these 10 tools ranged between 0.73 and 0.84 (Figure 2). For each of the 9 multivariable tools, there was marked heterogeneity of AUC values across diverse settings and studies (heterogeneity I² estimates in AUC ranged from 68% to 95%). The 95% CIs of the I² were also consistent with a large or very large heterogeneity. For NT-pro-BNP, the I² estimate was 25%. Meta-analyses retaining only the longest follow-up assessment when several follow-up assessments were available from the same study showed similar results (all changes in summary AUC estimates were <5% compared with the primary analysis including all data).
As listed in Table 2, predictive tools published in journals of lower impact factor had higher reported AUC estimates than those published in journals of higher impact factor. Predictive tools were more accurate in predicting mortality when a smaller proportion of study participants died. The AUC values were also higher in pediatric than in adult populations. Finally, studies with larger sample size tended to have higher AUC values than smaller studies.
There was no evidence that study design (retrospective vs prospective), area of origin, disease status, clinical condition examined, death rate per month, loss to follow-up, or number of variables included in the predictive tool were associated with the AUC values (data not shown).
Our systematic evaluation of a large number of seemingly well-validated predictive tools reported in the recent literature shows that these tools are not very accurate and that there is wide variation in their predictive accuracy for death. Most of the tools included in our analysis are not sufficiently accurate for wide use in clinical practice. Moreover, calibration was assessed in fewer than half of the tools, and of those tested, several showed lack of fit, meaning that prediction was not equally good for patients at different levels of risk. Studies published in journals with lower impact factor tended to show better AUC values, while tools performed better when they tried to predict death only for the highest-risk patients.
For a proposed predictive tool to be useful in clinical practice, there are several prerequisites. The tool must be validated in populations other than the one in which it was developed; it should be reproducible; and it should have good accuracy and calibration. Such a predictive tool can make accurate predictions in diverse settings across the range of both low- and high-risk patients. Few tools for predicting risk of death currently fit these criteria. Even tools that meet these criteria may not necessarily result in improvement in patient management and outcomes. This depends on whether effective, feasible interventions are available, the use of which is based on accurate knowledge of patient risk. However, reclassification, the ability to reclassify individuals into more appropriate risk categories where different actions/interventions might be indicated, is almost never assessed in the current literature of death prediction. Moreover, randomized trials on the use of predictive models, the ultimate proof of benefit, are few and difficult to conduct. Finally, clinicians are unlikely to use complex tools that require collection of extensive information, including data derived from expensive tests. It is possible that other predictive tools, based on far more limited clinical data, may perform equally well or better. In our empirical evaluation, models with more variables did not seem to perform clearly better than models with few variables.
Some characteristics of predictive tools were significantly associated with higher AUC estimates. For example, tools performed better when they tried to predict death only for the highest-risk patients. Excellent performance was seen in a small number of pediatric tools, while performance was substantially worse in predictive tools for adults. Larger studies tended to have slightly higher AUC estimates. These associations are exploratory and should be viewed with caution.
In our evaluation we focused on validated tools. However, even for some of the most widely applied predictive tools (such as APACHE II, MELD score, and SAPS II), we found great within-tool variability in accuracy across different studies and clinical settings. The observed variation in accuracy for the same predictive tool may be partly ascribed to selective analysis and reporting in studies of predictive tools, which may lead to exaggerated estimates of predictive discrimination in some studies. Efforts at standardization of reporting are important in this regard.24,25 The inverse correlation between journal impact factor and reported AUC that we observed may represent lower methodologic quality with spuriously high reported predictive performance in some articles published in journals with low impact factor.26 Moreover, studies often test predictive tools in populations that are very different from the one for which the model was developed and for a wide range of outcomes. This may further contribute to the variability seen in their discriminatory performance.
Some limitations should be mentioned. Our empirical assessment was restricted to studies published during a single year. An effort to appraise the entire predictive literature would be a task requiring extensive international effort by hundreds of researchers, much as the Cochrane Collaboration has done for clinical trials. Moreover, we included only studies dealing with prediction of all-cause death, and we did not evaluate the accuracy of tools designed to predict other outcomes. However, death from any cause is a common outcome with great clinical impact, and it is possible to standardize unambiguously. Finally, we considered only predictive studies that assessed accuracy using the AUC. However, AUC is not the only metric to assess predictive ability,27 and like any single metric, it can have limitations.16,28-30 For example, the AUC does not provide information on the actual predicted probabilities, and it does not convey the exact risk distribution in the respective study population. Also, improvements in AUC are more difficult in the high-range values than when AUC is closer to 0.50.6 Nevertheless, AUC is a very useful metric16,30 and is the most widely used standardized metric in the predictive literature.
Given the very wide variability in the AUC, even for the same predictive tool, we believe that systematic efforts are needed to organize and synthesize the predictive literature, such as those proposed by the Cochrane Prognosis Methods Group. Such efforts are needed to enhance the evidence derived from predictive research and to establish standard methods for developing, evaluating, reporting,31,32 and eventually adopting new predictive tools in clinical practice. Clinicians should be cautious about adopting new, initially promising predictive tools, especially complex ones based on expensive measurements that have not been extensively validated and shown to be consistently useful in practice.
Correspondence: John P. A. Ioannidis, MD, DSc, Stanford Prevention Research Center, Stanford University School of Medicine, 251 Campus Dr, MSOB X306, Stanford, CA 94305 (email@example.com).
Accepted for Publication: May 9, 2011.
Published Online: July 25, 2011. doi:10.1001/archinternmed.2011.334
Author Contributions: Dr Ioannidis had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: Siontis, Tzoulaki, and Ioannidis. Acquisition of data: Siontis, Tzoulaki, and Ioannidis. Analysis and interpretation of data: Siontis, Tzoulaki, and Ioannidis. Drafting of the manuscript: Siontis, Tzoulaki, and Ioannidis. Critical revision of the manuscript for important intellectual content: Siontis, Tzoulaki, and Ioannidis. Statistical analysis: Siontis, Tzoulaki, and Ioannidis. Administrative, technical, and material support: Siontis, Tzoulaki, and Ioannidis. Study supervision: Ioannidis.
Financial Disclosure: None reported.