AUC indicates area under the curve; CART, Cardiac Arrest Risk Triage; MEWS, Modified Early Warning Score; NEWS, National Early Warning Score; PACS, Singapore-based Patient Acuity Category Scale; RAPS, Rapid Acute Physiology Score; REMS, Rapid Emergency Medicine Score.
eTable 1. List of Candidate Variables and Their Definitions
eTable 2. Varying Cutoffs of Predicted Risk Based on the SERP-30d, MEWS, NEWS, CART, RAPS, and REMS, the Proportion of Patients Stratified for 30-Day Mortality, and the Corresponding Sensitivity, Specificity, Positive and Negative Predictive Values on the Testing Cohort
eTable 3. Varying Cutoffs of Predicted Risk Based on the SERP-2d, SERP-7d, SERP-30d, MEWS, NEWS, CART, RAPS, and REMS, the Proportion of Patients Stratified for 2-Day Mortality, and the Corresponding Sensitivity and Specificity on the Testing Cohort
eFigure 1. Flowchart of the AutoScore Framework
eFigure 2. Parsimony Plot of the Number of Variables Vs AUC Values for (A) SERP-2d, (B) SERP-7d, and (C) SERP-30d on the Validation Cohort
eFigure 3. Number of Cases Versus Score Value on the Testing Cohort for (A) SERP-2d and (B) SERP-30d
eFigure 4. Observed Mortality Rate Versus Score Value on the Testing Cohort for (A) SERP-2d and (B) SERP-30d
eMethods. Description of the AutoScore Method
Customize your JAMA Network experience by selecting one or more topics from the list below.
Identify all potential conflicts of interest that might be relevant to your comment.
Conflicts of interest comprise financial interests, activities, and relationships within the past 3 years including but not limited to employment, affiliation, grants or funding, consultancies, honoraria or payment, speaker's bureaus, stock ownership or options, expert testimony, royalties, donation of medical equipment, or patents planned, pending, or issued.
Err on the side of full disclosure.
If you have no conflicts of interest, check "No potential conflicts of interest" in the box below. The information will be posted with your response.
Not all submitted comments are published. Please see our commenting policy for details.
Xie F, Ong MEH, Liew JNMH, et al. Development and Assessment of an Interpretable Machine Learning Triage Tool for Estimating Mortality After Emergency Admissions. JAMA Netw Open. 2021;4(8):e2118467. doi:10.1001/jamanetworkopen.2021.18467
How does an interpretable machine learning triage tool for estimating mortality perform in a cohort of individuals admitted to the hospital from the emergency department compared with other clinical scores?
In this cohort study, the parsimonious and point-based Score for Emergency Risk Prediction was more accurate in identifying patients who died within 2, 7, or 30 days of admissions than other point-based clinical scores.
These results suggest that the Score for Emergency Risk Prediction tool shows promise for triaging patients admitted from the emergency department according to mortality risk.
Triage in the emergency department (ED) is a complex clinical judgment based on the tacit understanding of the patient’s likelihood of survival, availability of medical resources, and local practices. Although a scoring tool could be valuable in risk stratification, currently available scores have demonstrated limitations.
To develop an interpretable machine learning tool based on a parsimonious list of variables available at ED triage; provide a simple, early, and accurate estimate of patients’ risk of death; and evaluate the tool’s predictive accuracy compared with several established clinical scores.
Design, Setting, and Participants
This single-site, retrospective cohort study assessed all ED patients between January 1, 2009, and December 31, 2016, who were subsequently admitted to a tertiary hospital in Singapore. The Score for Emergency Risk Prediction (SERP) tool was derived using a machine learning framework. To estimate mortality outcomes after emergency admissions, SERP was compared with several triage systems, including Patient Acuity Category Scale, Modified Early Warning Score, National Early Warning Score, Cardiac Arrest Risk Triage, Rapid Acute Physiology Score, and Rapid Emergency Medicine Score. The initial analyses were completed in October 2020, and additional analyses were conducted in May 2021.
Main Outcomes and Measures
Three SERP scores, namely SERP-2d, SERP-7d, and SERP-30d, were developed using the primary outcomes of interest of 2-, 7-, and 30-day mortality, respectively. Secondary outcomes included 3-day mortality and inpatient mortality. The SERP’s predictive power was measured using the area under the curve in the receiver operating characteristic analysis.
The study included 224 666 ED episodes in the model training cohort (mean [SD] patient age, 63.60 [16.90] years; 113 426 [50.5%] female), 56 167 episodes in the validation cohort (mean [SD] patient age, 63.58 [16.87] years; 28 427 [50.6%] female), and 42 676 episodes in the testing cohort (mean [SD] patient age, 64.85 [16.80] years; 21 556 [50.5%] female). The mortality rates in the training cohort were 0.8% at 2 days, 2.2% at 7 days, and 5.9% at 30 days. In the testing cohort, the areas under the curve of SERP-30d were 0.821 (95% CI, 0.796-0.847) for 2-day mortality, 0.826 (95% CI, 0.811-0.841) for 7-day mortality, and 0.823 (95% CI, 0.814-0.832) for 30-day mortality and outperformed several benchmark scores.
Conclusions and Relevance
In this retrospective cohort study, SERP had better prediction performance than existing triage scores while maintaining easy implementation and ease of ascertainment in the ED. It has the potential to be widely applied and validated in different circumstances and health care settings.
Triage in the emergency department (ED) for admission and appropriate level of hospital care is a complex clinical judgment based on the tacit understanding of the patient’s likely short-term course, availability of medical resources, and local practices.1,2 Besides triage categories, early warning scores are also used to identify patients at risk of having adverse events. One such example is the Cardiac Risk Assessment Triage (CART) score,3 which calculates a score based on a patient’s vital signs, indicating their risk for cardiac arrest, subsequent transfer to the intensive care unit, and mortality.4
To date, few studies5-8 have investigated variables of short-term and long-term mortality among the general ED population, using the limited data available at the point of triage. Most ED-specific scores are targeted toward specific conditions, such as the quick Sepsis-Related Organ Failure Assessment for infection and sepsis,5,6 CART for cardiac conditions, or Predicting Mortality in the Emergency Department for elderly populations.7,8 Several general purpose scores have been adopted by the ED, such as the Modified Early Warning Score (MEWS) and Acute Physiology and Chronic Health Evaluation (APACHE) II score. However, MEWS has only moderate predictive capabilities, with an area under the curve (AUC) of 0.71,9 and APACHE II requires laboratory variables unavailable at the point of triage.10 In the fast-paced ED environment, a scoring tool needs to be accurate and straightforward.
To address the need for a risk tool appropriate to the ED workflow, we developed the Score for Emergency Risk Prediction (SERP) using a general-purpose machine learning–based scoring framework named AutoScore.11 The resulting tool was compared in a test set to the current triage system used in Singapore, the Patient Acuity Category Scale (PACS),12 and several established early warning or triage scores.
We performed a retrospective cohort study of patients seen in the ED of Singapore General Hospital (SGH). Singapore is a city-state in Southeast Asia with a rapidly aging society13; currently, approximately 1 in 5 Singaporeans are 60 years or older.14 The SGH is the largest and oldest public tertiary hospital in Singapore. The SGH ED receives more than 120 000 visits and has 36 000 inpatient admissions annually. The electronic health record (EHR) data were obtained from Singapore Health Services and analyzed. This study was approved by Singapore Health Services’ Centralized Institutional Review Board, and a waiver of consent was granted for EHR data collection and analysis because of the retrospective nature of the study. All data were deidentified. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.15
All patients visiting the SGH ED from January 1, 2009, until December 31, 2016, who were subsequently admitted, were included. We denote these included episodes as emergency admissions. Patients younger than 21 years were excluded. We also excluded noncitizen patients who might not have complete medical records. These admission episodes from January 1, 2009, to December 31, 2015, were randomly split into 2 nonoverlapping cohorts: a training cohort (80%) and a validation cohort (20%). The admission episodes in 2016 were assigned to the testing cohort. This sequential testing design was chosen to be more consistent with future application scenarios and evaluate whether population shift would influence the model’s performance.
The primary outcomes used to develop and test the tool were 2-, 7-, and 30-day mortality, defined as deaths within 2, 7, and 30 days after emergency admission, respectively. Three SERP scores, namely SERP-2d, SERP-7d, and SERP-30d, were developed using the corresponding primary outcome. We also tested the performance of those clinical scores on the secondary outcomes, including inpatient mortality, defined as deaths in the hospital, and 3-day mortality, defined as deaths within 72 hours after the time of admission. Death records were obtained from the national death registry and were matched to specific patients in the EHR.
We extracted data from the hospital’s EHR through the SingHealth Electronic Health Intelligence System. Patient details were deidentified, complying with Health Insurance Portability and Accountability Act regulations. Comorbidities were obtained from hospital diagnosis and discharge records in the preceding 5 years before patients’ index emergency admissions. They were extracted from the International Classification of Diseases, Ninth Revision (ICD-9) and International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10),16 globally used diagnostic tools for epidemiology and clinical purposes. We preselected candidate variables available in the ED before hospital admission to ensure that SERP was clinically useful and valid for early risk stratification of patients in the ED. Candidate variables included demographic characteristics, administrative variables, medical history in the preceding year, vital signs, and comorbidities. The list of candidate variables is given in the eTable 1 in the Supplement. Comorbidity variables were defined according to the Charlson Comorbidity Index. We used the algorithms developed and updated by Quan et al17 for the linkage between Charlson Comorbidity Index and ICD-9/ICD-10 codes.
The data were analyzed using R software, version 3.5.3 (R Foundation for Statistical Computing). The initial analyses were completed in October 2020, and additional analyses were conducted in May 2021. Baseline characteristics of the study population were analyzed on all 3 cohorts (training, validation, and testing). In the descriptive summaries, numbers (percentages) were reported for categorical variables. For continuous variables, means (SDs) were reported. During the analysis, the value for a vital sign would be considered as an outlier and set to missing if it were beyond the plausible physiologic ranges based on clinical knowledge. For example, any value of vital signs below 0, heart rate above 300/min, respiration rate above 50/min, systolic blood pressure above 300 mm Hg, diastolic blood pressure above 180 mm Hg, or oxygen saturation as measured by pulse oximetry above 100% was deemed an outlier. Subsequently, all missing values were imputed using the median value of the training cohort.
We implemented the AutoScore,11 a machine learning–based clinical score generation algorithm, to derive the SERP scoring models. AutoScore combines machine learning and logistic regression, integrates multiple modules of data manipulation, and automates the development of parsimonious sparse-score risk models for predefined outcomes. In addition, it enables users to build interpretable clinical scores quickly and seamlessly, which can be easily implemented and validated in clinical practice. The training cohort was used for the generation of the tentative SERP models using AutoScore framework. The validation cohort was used to evaluate multiple candidate SERP models for parameter tuning and model selection. Then, we calculated the performance metrics of the final SERP model based on the testing cohort. Finally, we used the primary outcomes for model derivation and applied primary and secondary outcomes for model evaluation. The implementation details and methodologic descriptions are elaborated in eFigure 1 and the eMethods in the Supplement.
After model derivation, the predictive performance of the final SERP scores was reported based on the testing cohort, where bootstrapped samples were applied to calculate 95% CIs. Each of the SERP breakdowns was allocated a score that reflected the magnitude of disturbance to each variable. The individual scores were then summed to derive the aggregated SERP score for risk stratification of outcomes. The predictive power of SERP was measured using the AUC in the receiver operating characteristic (ROC) analysis. Sensitivity, specificity, positive predictive value, and negative predictive value were calculated under the optimal threshold, defined as the point nearest to the upper-left corner of the ROC curve. The metrics calculated under different thresholds were also compared to evaluate predictive performance. By using the same testing cohort, we compared the 3 SERP scores with PACS, MEWS,18 the National Early Warning Score (NEWS),19 CART,3 the Rapid Acute Physiology Score (RAPS),20 and the Rapid Emergency Medicine Score (REMS),21 in estimating multiple mortality outcomes in this study.
Between January 1, 2009, and December 31, 2015, a total of 280 833 individual admission episodes were assessed, including 224 666 in the training cohort (mean [SD] patient age, 63.60 [16.90] years; 113 426 [50.5%] female) and 56 167 in the validation cohort (mean [SD] patient age, 63.58 [16.87] years; 28 427 [50.6%] female). In addition, 42 676 admission episodes in the year 2016 were included in the testing cohort (mean [SD] patient age, 64.85 [16.80] years; 21 556 [50.5%] female) (Figure 1). The mortality rates observed in the training cohort were 0.8% at 2 days, 2.2% at 7 days, and 5.9% at 30 days. The ethnic compositions were similar to the population norm (74.3% for Chinese, 12.9% for Malay, 10.0% for Indian, and 2.8% for others). A total of 39 548 episodes (17.6%) were triaged as PACS 1, and 128 644 episodes (57.3%) were triaged as PACS 2. Table 1 indicates that patient characteristics in the training and validation cohorts were similar in terms of age, sex, racial and ethnic compositions, and other characteristics. Compared with those in the training and validation cohorts, patients in the testing cohort were slightly older and had a higher risk of triage to PACS 1, with more people having comorbidities of dementia, diabetes, and kidney diseases. The patients in the testing cohort also had marginally lower mortality rates and higher numbers of emergency admissions or operations in the past year. This difference likely reflects the population shift and improvements in health care over time.
AutoScore was used to select the most discriminative variables from all 26 candidate variables (eTable 1 in the Supplement). Parsimony plots (ie, model performance vs complexity) based on the validation set were used for determining the choice of variables (eFigure 2 in the Supplement). We chose 6 variables as the parsimonious choice for SERP-2d and SERP-30d, whereas SERP-7d with 5 variables achieved a good balance in the parsimony plot. Five variables were chosen by all 3 SERP scores, including age, heart rate, respiration rate, diastolic blood pressure, and systolic blood pressure. These selected variables highlighted the importance of vital signs in risk-triaging patients in emergency settings. As seen from eFigure 2 in the Supplement, when more variables were added to the scoring model, the performance was not markedly improved.
The SERP scores derived based on primary outcomes (2-, 7-, and 30-day mortality) were tabulated in Table 2. All 3 scores summed from their included variables ranged from 0 to approximately 60. We used the testing cohort to evaluate the performance of the SERP scores. eFigure 3 in the Supplement depicts the distribution of episodes at different score intervals, which had near-normal distribution. For SERP-2d and SERP-30d, most patients had a risk score between 16 and 24, and few patients had scores under 9 or above 40. As seen in eFigure 4 in the Supplement, the observed mortality rate increased as our risk scores increased in the testing cohort. In terms of different components of SERP, when age was younger than 30 years, its corresponding risk (quantified as points) was the lowest; when age was older than 80 years, the risk was the highest. Likewise, when a reported diastolic blood pressure was between 50 and 94 mm Hg, the corresponding risk was the lowest, and when it was lower than 49 mm Hg, the risk was the highest. Thus, SERP scores had varying points for each component according to the outcomes of interest.
The performance of the SERP scores and other clinical scores as assessed by ROC analysis in the testing cohort are reported in Table 3 and Figure 2. SERP had promising discriminatory capability in estimating all mortality-related outcomes. The SERP-30d achieved the best performance for short-term and long-term mortality prediction, with an AUC of 0.821 (95% CI, 0.796-0.847) for 2-day mortality, an AUC of 0.826 (95% CI, 0.811-0.841) for 7-day mortality, an AUC of 0.823 (95% CI, 0.814-0.832) for 30-day mortality, and an AUC of 0.810 (95% CI, 0.799-0.821) for inpatient mortality. eTables 2 and 3 in the Supplement summarize the predictive performance of the SERP scores and their comparators on 30- and 2-day mortality risk estimation, respectively.
This cohort study developed parsimonious and point-based SERP scores based on 2-, 7-, and 30-day mortality for risk-stratifying patients after emergency admissions. SERP scores were more accurate in identifying patients who died during short- or long-term care than other point-based clinical tools (ie, PACS, NEWS, MEWS, CART, RAPS, and REMS). A previous study22 developed a model for inpatient mortality using variables including basic demographic, administrative, and clinical information acquired in the ED. Despite the model showing good discriminative performance, the need to use a computer with 19 variables to calculate a score limited its applicability and interpretability. Instead, SERP is an additive, point-based triage tool, making it simple, quick to calculate, transparent, and interpretable. Moreover, SERP has the advantage of easy implementation, enabling its wide application in different real-world circumstances.
Among the 3 SERP scores, SERP-30d achieved satisfactory performance on short-term (eg, 2- or 3-day) mortality and relatively long-term (eg, 30-day) mortality risk estimation. Several possible reasons exist for SERP-30d to excel. The 2-day mortality rate in our cohort was as low as 0.8%. Thus, SERP-2d was developed based on highly imbalanced data, for which the abundance of samples from the majority class (survival group) could overwhelm the minority class in predictive modeling. As a comparison, 30-day mortality contained all 2- or 7-day mortality cases and was more prevalent at a rate of 5.9%, making the SERP-30d score more reliable and accurate. Our results reaffirmed the value of 30-day mortality as an essential indicator for the ED.23-25 Besides vital signs, the SERP-30d score included comorbidities, the importance of which has been demonstrated in several studies.26-29 For example, Chu et al26 reported the contribution of patient comorbidity to short- and long-term mortality. Among all 3 SERP scores, age was selected as a key variable through a data-driven process, which aligns with the evidence on the vital role of age among ED patients.27-29
The SERP scores could provide an objective measure during ED triage to estimate a patient’s mortality risk. Although physicians can generally ascertain the severity of a patient’s acute condition and the threat to life, their decisions are often subjective and depend on an individual’s experience and knowledge. In a study30 of elderly patients, although physicians could estimate the 30-day mortality during the consultation, they missed 4 of every 5 deaths, with a sensitivity of 20% only. Like the Emergency Severity Index,31 some triage scores may achieve better performance in risk estimation but require some subjective variables. Some recent studies32-34 highlighted the role of data-driven, objective clinical decision tools to help physicians rethink and reassess the triage process in the ED. Because our SERP scores only comprise objective elements, they can be easily computed by trained medical assistants or integrated into an existing hospital EHR, without the need for professional medical personnel. Therefore, one can rapidly estimate a patient’s risk of death without adversely affecting ED workloads, which is important in the fast-paced ED environment and other heterogenous emergency care systems run by generalists rather than emergency medicine specialists. Given our tool’s purpose as an adjunct to clinical acumen during the consultation, such a risk stratification tool would conceivably be used when a physician plans to admit a patient and considers the level of service that might be appropriate for that individual.
Ultimately, the most important unanswered question is whether SERP can improve outcomes in actual clinical practice. To address this, prospective studies are needed to validate its real-world predictive capabilities and determine appropriate thresholds to stratify the ED population into various risk categories. In addition, given the strength of SERP as a simple and interpretable scoring tool, further assessments could be performed to evaluate the more intangible aspects of score implementation.35,36 Such measures would include determining SERP’s long-term sustainability, overall cost-effectiveness, and physician-perceived acceptability. These future assessments might lend credence to SERP as an effective and accurate tool for decision-making within the ED.
This study has several strengths. First, machine learning–based variable selection by AutoScore11 can efficiently filter out redundant information to achieve a sparse solution. Sanchez-Pinto et al37 also suggested that variable selection plays an essential role in reducing the complexity of prediction models without compromising their accuracy, especially when facing a large number of candidate features extracted from EHRs.38 Likewise, Liu et al39 demonstrated that more variables did not necessarily lead to better prediction of adverse cardiac events. The second strength of SERP is the size of the data set that was used to derive the risk scores. This data set is one of the largest used to generate a point-based triage model, with a cohort of more than 300 000 emergency admissions during 8 years, obtained from a large tertiary hospital. Third, the SERP scores consistently performed well in the testing cohort, even with changes in patient characteristics, outcome prevalence, and clinical practices amid the continuously evolving clinical environment.40
This study also has several limitations. First, the data set used in this study was based on EHR data of routinely collected variables. Thus, some variables, such as socioeconomic status, were not used in SERP score development. Second, because this was a single-center study at a tertiary hospital, the performance of SERP scores may vary in different settings. Third, our ED cohort accounted for only ED admissions, which might influence score generalizability when applying the scores to a general ED population.
SERP is a parsimonious and point-based scoring tool for triaging patients in the ED. In this cohort study, SERP performed better in comparison with existing triage scores and has the advantage of easy implementation and ease of ascertainment at ED presentation. SERP scores have the potential to be widely used and validated in different circumstances and health care settings. Following the clinical application of SERP in ED triage, more tailored scores can be derived in various clinical areas through the machine learning–based AutoScore framework in the future.
Accepted for Publication: May 20, 2021.
Published: August 27, 2021. doi:10.1001/jamanetworkopen.2021.18467
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Xie F et al. JAMA Network Open.
Corresponding Author: Nan Liu, PhD, Programme in Health Services and Systems Research, Duke–National University of Singapore Medical School, 8 College Rd, Singapore 169857, Singapore (email@example.com).
Author Contributions: Drs Chakraborty and Liu are co–primary authors. Mr Xie and Dr Liu had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Xie, Ong, Liew, Tan, Ho, Nadarajan, Low, Goldstein, Chakraborty, Liu.
Acquisition, analysis, or interpretation of data: Xie, Ong, Liew, Tan, Low, Kwan, Goldstein, Matchar, Chakraborty, Liu.
Drafting of the manuscript: Xie, Ong, Liew, Tan, Ho, Liu.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Xie, Ong, Liew, Goldstein, Chakraborty.
Obtained funding: Xie, Low, Goldstein, Liu.
Administrative, technical, or material support: Xie, Ong, Tan, Low, Liu.
Supervision: Ong, Ho, Low, Matchar, Chakraborty, Liu.
Conflict of Interest Disclosures: Dr Ong reported having a patent for system and method of determining a risk score for triage (US 13/791,764 US 2014/0257122 A1) and a patent for method of predicting acute cardiopulmonary events and survivability of a patient (US 8668644 US 8,932,220 B2). Dr Liu reported receiving grants from Duke–National University of Singapore Medical School, the Estate of Tan Sri Khoo Teck Puat, and Singapore National Medical Research Council during the conduct of the study. No other disclosures were reported.
Funding/Support: This research received funding from grant Duke-NUS-KP(Coll)/2018/0011B from Duke–National University of Singapore Medical School and the Estate of Tan Sri Khoo Teck Puat under the Khoo Pilot Award (collaborative). This study was also supported by grant CGDec19F04 from the Singapore National Medical Research Council under the PULSES Center Grant.
Role of the Funder/Sponsor: The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.