Figure 1.  Approach to Autonomously Ordering Tests in an Emergency Department (ED) Using Machine Learning Medical Directives (MLMDs)

Standard ED workflow vs MLMD augmentation of preexisting ED workflows with enabling aspects of clinical automation. With MLMDs, patients for whom the directive is activated have immediate testing ordered before being seen by a clinician. When the directive is not activated, patients proceed to the current standard of care pathway and wait for clinician assessment before testing is ordered. Overtesting can be addressed proactively by ensuring model decision thresholds yield high positive predictive values and low false-positive rates. This model threshold approach inevitably produces false-negative cases, but simultaneously allows for true automation of test ordering for a subset of patients as a result of maintaining a high positive predictive value. When false-negative cases occur, if the MLMD is not activated, the patient travels through the standard of care ED process. This dual pathway for streamlining care for patients identified by MLMDs and sending those not identified back into the typical workflow can allow for clinical automation in the ED for common presenting signs and symptoms without risking missed diagnoses or overtesting. EHR indicates electronic health record.

Figure 2.  Area Under the Receiver Operator Curve for Each Machine Learning–Based Directive Use Case With Corresponding Model Operating Thresholds for Top-Performing Models

Top-performing models were those with the highest positive predictive value (PPV). Neural network (NN) models obtained the highest PPVs across all use cases: abdominal ultrasonography (true-positive rate [TPR], 0.10; false-positive rate [FPR], 0.0006; PPV, 0.86) (A), electrocardiogram (TPR, 0.60; FPR, 0.003; PPV, 0.84) (B), urine dipstick (TPR, 0.30; FPR, 0.004; PPV, 0.91) (C), and testicular ultrasonography (TPR, 0.40; FPR, 0.0003; PPV, 0.88) (D). The corresponding operating thresholds (gray dots) are displayed for each NN model. Model thresholds can be adjusted such that the true-positive rate is increased to capture more positive cases; however, this comes at the expense of additional false-positive results and potential for overtesting.

Figure 3.  Feature Importance Assessment Using Shapley Additive Explanations (SHAP) Values

The top 20 features for each model are ranked. Blue represents low values (or 0 for a binary feature that is not present) and red high values (or 1 for a binary feature that is present). Individual patient-level explainability was also computed using SHAP values (eFigure 1 in the Supplement). CSN indicates an EHR encounter number that is ordered based on time of patient arrival; CTAS4, Canadian Triage and Acuity Scale score 4; UTI, urinary tract infection.

a Concept unique identifier–coded feature input that organizes free-text symptoms into higher-level groupings; it does not represent the electronic health record diagnosis label, which is not used as a feature input into our models.

Figure 4.  Error Analysis Stratified by Sex for Top-Performing Machine Learning–Based Directive (MLMD) Models Using Pearson χ2 Test

A, Overall false-positive rates. B, Subgroup error analysis by age for urine dipstick testing. C, Subgroup error analysis by age for abdominal ultrasonography testing. ECG indicates electrocardiogram.

Table.  MLMD Summary Statistics and Model Performancea
Original Investigation
Emergency Medicine
March 16, 2022

Assessment of Machine Learning–Based Medical Directives to Expedite Care in Pediatric Emergency Medicine

Author Affiliations
  • 1Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  • 2The Hospital for Sick Children, Toronto, Ontario, Canada
  • 3Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
  • 4DATA Team, Techna Institute, University Health Network, Toronto, Ontario, Canada
  • 5Vector Institute, Toronto, Ontario, Canada
  • 6Canadian Institute for Advanced Research, Toronto, Ontario, Canada
JAMA Netw Open. 2022;5(3):e222599. doi:10.1001/jamanetworkopen.2022.2599
Key Points

Question  Can machine learning–based medical directives (MLMDs) be used to autonomously order testing at triage for common pediatric presentations in the emergency department?

Findings  This decision analytical model study analyzing 77 219 presentations of children to an emergency department found that the best-performing MLMD models obtained high area under the receiver operator curve and positive predictive values across 6 pediatric emergency department use cases. Implementing MLMDs at these thresholds may help streamline care for 22.3% of all patient visits.

Meaning  The findings of this study suggest MLMDs can autonomously order diagnostic testing for pediatric patients at triage with high positive predictive values and minimal overtesting; model explainability can be provided to clinicians and patients regarding why a test is ordered, allowing for transparency and trust to be built with artificial intelligence systems.

Abstract

Importance  Increased wait times and long lengths of stay in emergency departments (EDs) are associated with poor patient outcomes. Systems to improve ED efficiency would be useful. Specifically, minimizing the time to diagnosis by developing novel workflows that expedite test ordering can help accelerate clinical decision-making.

Objective  To explore the use of machine learning–based medical directives (MLMDs) to automate diagnostic testing at triage for patients with common pediatric ED diagnoses.

Design, Setting, and Participants  Machine learning models trained on retrospective electronic health record data were evaluated in a decision analytical model study conducted at the ED of the Hospital for Sick Children, Toronto, Canada. Data were collected on all patients aged 0 to 18 years presenting to the ED from July 1, 2018, to June 30, 2019 (77 219 total patient visits).

Exposure  Machine learning models were trained to predict the need for urinary dipstick testing, electrocardiogram, abdominal ultrasonography, testicular ultrasonography, bilirubin level testing, and forearm radiographs.

Main Outcomes and Measures  Models were evaluated using area under the receiver operator curve, true-positive rate, false-positive rate, and positive predictive values. Model decision thresholds were determined to limit the total number of false-positive results and achieve high positive predictive values. The time difference between patient triage completion and test ordering was assessed for each use of MLMD. Error rates were analyzed to assess model bias. In addition, model explainability was determined using Shapley Additive Explanations values.

Results  A total of 42 238 boys (54.7%) were included in model development; mean (SD) age of the children was 5.4 (4.8) years. Models obtained high area under the receiver operator curve (0.89-0.99) and positive predictive values (0.77-0.94) across each of the use cases. The proposed implementation of MLMDs would streamline care for 22.3% of all patient visits and make test results available earlier by 165 minutes (weighted mean) per affected patient. Model explainability for each MLMD demonstrated clinically relevant features having the most influence on model predictions. Models also performed with minimal to no sex bias.

Conclusions and Relevance  The findings of this study suggest the potential for clinical automation using MLMDs. When integrated into clinical workflows, MLMDs may have the potential to autonomously order common ED tests early in a patient’s visit with explainability provided to patients and clinicians.

Introduction

Overcrowding in emergency departments (EDs) and prolonged wait times are common challenges associated with poor health outcomes globally.1,2 A retrospective cross-sectional study reported that few EDs achieve recommended wait times,3 and innovative strategies to improve patient flow are required to address these challenges.

The typical pathway for patients with stable vital signs in an ED involves a triage assessment followed by transfer to a waiting area. As ED assessment rooms become available, patients are moved into the department and then wait until they can be assessed by a health care practitioner (HCP). From here, an HCP will order tests to rule in or out suspected differential diagnoses as needed. This process triggers another sequence of waiting for the test to be conducted and for results to be processed before further reassessment and treatment are provided.

One strategy for increasing patient flow through an ED is the use of nursing medical directives at triage. Medical directives allow nurses to follow protocols to order investigations at the time of initial triage for patients with specific presentations. These directives allow testing to be completed while patients are waiting, making test results available at the time of their initial assessment by an HCP. In doing so, diagnosis and disposition planning can be streamlined, which may reduce the length of stay for many patients.4-7 Nursing medical directives at triage are common in adult EDs but limited in the total number that can be implemented at any one time.8-10 In pediatric EDs, concerns of overtesting and the relatively invasive nature of some tests in children have been barriers to their increased use.11,12 For example, obtaining a urine sample from an adult is typically simple; however, negotiating with a child to urinate into a cup can be far more time-consuming or require a urinary catheter procedure. In these cases, initiating testing immediately after triage can accelerate care. The use of nursing medical directives is also subject to human practice variation, is inconsistently applied, and has been shown to be ineffective in improving outcomes compared with computerized workflows for some disease presentations.13-16

With the proliferation of electronic health record (EHR) systems and the large amount of data available, there exists an opportunity to develop automated machine learning (ML)–based medical directives (MLMDs) to improve patient flow through EDs and overcome some of the limitations inherent to traditional medical directives. Machine learning–based models trained with triage data can be used to predict the need for specific testing in a patient and autonomously order the appropriate confirmatory tests before assessment by an HCP (Figure 1). This process achieves the same value gained from nursing medical directives with the added benefit of not being subject to human practice variation, maintaining performance during high patient volumes, and having an ability to alter model behavior without the need for retraining nursing staff as policies and practices change. Machine learning–based medical directives also avoid HCP alert fatigue by automating the ordering of investigations rather than providing traditional decision support alerts.17 This function can be scaled to include a large number of testing use cases beyond what can typically be managed by triage nurses and is a generalizable approach given the homogeneous nature of triage data globally for children and adults.

The automation enabled by MLMDs allows for testing to be predicted and ordered early in the patient journey while they are typically waiting for evaluation by an HCP (Figure 1). With this automated process, HCPs have test results available at the time of initial patient assessment, allowing for improved clinical decision-making and streamlined care.

Methods
Data Selection

In this retrospective decision analytical model study, data were collected from an EHR system at the Hospital for Sick Children, a tertiary care hospital in Toronto, Canada, for all patients presenting to the ED from July 1, 2018, to June 30, 2019 (77 219 total patient visits), with research ethics board approval. Missing values for continuous features were imputed using mean values, and missing values for categorical features were imputed with the most frequent category. Detailed ED descriptive statistics, including data missingness, can be found in eFigure 2 in the Supplement. Training (55%), validation (15%), and held-out test (30%) sets were generated in a time-based sequential fashion rather than by random splitting, to mimic how the models would have performed if used prospectively during the study period. The use cases considered were abdominal ultrasonography, electrocardiogram, urine dipstick, testicular ultrasonography, bilirubin level testing, and forearm radiographs. Clinical access to these tests was available 24 hours a day. The number of tests and associated diagnoses for each ML directive are reported in the Table. Together, 22.3% of all patients observed in the ED within the 12-month period received 1 or more of these tests.
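As a rough illustration of this preprocessing, the sketch below performs a time-based sequential split and imputes using training-set statistics only (a standard precaution; the article does not specify where imputation statistics were fit). The names `visits`, `triage_time`, `continuous_cols`, and `categorical_cols` are hypothetical placeholders, not the study's actual schema.

```python
# Assumed: `visits` is a pandas DataFrame of ED visits with a 'triage_time'
# column; `continuous_cols` and `categorical_cols` list the feature columns.
visits = visits.sort_values("triage_time")

n = len(visits)
train = visits.iloc[: int(0.55 * n)].copy()             # earliest 55% of visits
val = visits.iloc[int(0.55 * n): int(0.70 * n)].copy()  # next 15%
test = visits.iloc[int(0.70 * n):].copy()               # held-out final 30%

for col in continuous_cols:
    fill = train[col].mean()                # mean for continuous features
    for split in (train, val, test):
        split[col] = split[col].fillna(fill)
for col in categorical_cols:
    fill = train[col].mode().iloc[0]        # most frequent category
    for split in (train, val, test):
        split[col] = split[col].fillna(fill)
```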

Model input features included the following for each patient: heart rate, respiratory rate, oxygen saturation, blood pressure, body temperature, patient weight and age, presenting symptoms, Canadian Triage and Acuity Scale score, date and time of triage, preferred language, distance from hospital to home address, and free-text triage notes; data on race and ethnicity were not collected at the time of the study. Patient triage notes were processed using the QuickUMLS18 module in Python to extract concept unique identifier codes. The 2019AB release of the Unified Medical Language System database was used.19 Patients with either no triage note or no concept unique identifier terms in their note were encoded with a missing token. After data preprocessing, a total of 6513 model input features were used.
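A minimal sketch of the triage-note extraction step using the QuickUMLS Python package is shown below; the index path and example note are placeholders, and the matcher options shown are package defaults rather than the study's settings.

```python
from quickumls import QuickUMLS

# Placeholder path to a local QuickUMLS index built from the 2019AB UMLS release.
matcher = QuickUMLS("path/to/quickumls/index")

triage_note = "crampy abdominal pain since this morning, vomited twice, no fever"
matches = matcher.match(triage_note, best_match=True, ignore_syntax=False)

# Each match carries a concept unique identifier (CUI); patients with no note
# or no matched terms would instead be encoded with a missing token.
cuis = {m["cui"] for span in matches for m in span} or {"<MISSING>"}
```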

Our models predict whether a patient will have a diagnosis associated with the tests described above. For example, the forearm radiograph model is trained to predict whether a patient will have 1 of the following final diagnoses: buckle fracture, fracture of the radius, fracture of the ulna, or wrist fracture. The diagnosis predictions are then mapped to their corresponding test (a forearm radiograph), rather than predicting the test directly, as illustrated in the sketch below. This approach helped us avoid the negative effect on model performance caused by noise within the data due to HCP practice variation and bias when ordering tests. Diagnoses for each patient were obtained from the final diagnosis selected by physicians at the time a patient's EHR file was completed (eTable in the Supplement). The study was approved by the Hospital for Sick Children with a waiver of informed consent. This study followed the relevant portions of the Consolidated Health Economic Evaluation Reporting Standards (CHEERS) reporting guideline for decision analytical model studies.
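In code, the diagnosis-to-test mapping might look like the following; the diagnosis strings come from the forearm radiograph example above, but the dictionary itself is illustrative rather than the study's actual lookup.

```python
# Illustrative diagnosis-to-test lookup for one use case (forearm radiograph).
DIAGNOSIS_TO_TEST = {
    "buckle fracture": "forearm radiograph",
    "fracture of radius": "forearm radiograph",
    "fracture of ulna": "forearm radiograph",
    "wrist fracture": "forearm radiograph",
}

def test_for(predicted_diagnosis: str) -> str | None:
    """Return the test an MLMD would order for a predicted diagnosis, if any."""
    return DIAGNOSIS_TO_TEST.get(predicted_diagnosis)
```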

MLMD Model Training

Logistic regression, random forest, and fully connected feed-forward deep artificial neural network (NN) models were trained with class weighting. Logistic regression models were fit to training data using a ridge regularization penalty. The random forest models were developed by selecting ideal hyperparameters informed by a random grid search approach using 5-fold cross-validation on training data. Our NN models consisted of 5 fully connected layers using a rectified linear activation function for the first 4 layers and a sigmoid activation function in the final output layer. All NN models were optimized using stochastic gradient descent with a learning rate of 1e-4, a weight decay of 1e-6, a momentum of 0.9, and ridge regularization. A binary cross-entropy loss function was used for all NN models. Early stopping was implemented to help prevent model overfitting. A sketch of this NN configuration follows.
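The following PyTorch sketch assembles the NN as described (5 fully connected layers, ReLU then sigmoid, SGD with lr 1e-4, weight decay 1e-6, momentum 0.9, class-weighted binary cross-entropy, early stopping). The layer width, positive-class weight, patience, and the `X_train`/`y_train`/`X_val`/`y_val` tensors are assumptions not specified in the article.

```python
import torch
import torch.nn as nn

class MLMDNet(nn.Module):
    """5 fully connected layers: ReLU on the first 4, sigmoid on the output."""
    def __init__(self, n_features: int, hidden: int = 256):  # width assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # output = P(test needed)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def weighted_bce(pred, target, pos_weight=20.0):  # pos_weight assumed
    # Class weighting: up-weight the rare positive class in the BCE loss.
    weights = torch.ones_like(target) + (pos_weight - 1.0) * target
    return nn.functional.binary_cross_entropy(pred, target, weight=weights)

model = MLMDNet(n_features=6513)
# Weight decay implements the ridge (L2) regularization described above.
opt = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-6, momentum=0.9)

best_val, patience, bad = float("inf"), 5, 0  # patience assumed
for epoch in range(200):
    model.train()
    opt.zero_grad()
    loss = weighted_bce(model(X_train), y_train)  # tensors assumed: float 0/1 labels
    loss.backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = weighted_bce(model(X_val), y_val).item()
    if val_loss < best_val:  # early stopping on validation loss
        best_val, bad = val_loss, 0
    else:
        bad += 1
        if bad >= patience:
            break
```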

MLMD Model Evaluation

All model-related outcome metrics were generated using a held-out test set (Table). Models were evaluated using area under the receiver operator curve (AUROC), true-positive rate (TPR), false-positive rate (FPR), and PPV. True-positive (TP), false-positive (FP), true-negative, and false-negative (FN) cases were defined as logical functions as follows: TP indicates positive prediction when a test is completed OR an associated differential diagnosis is present; true negative, negative prediction when no test is completed AND no associated differential diagnosis is present; FP, positive prediction when no test is completed AND no associated differential diagnosis is present; and FN, negative prediction when a test is completed OR an associated differential diagnosis is present.
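These logical definitions translate directly into code; the small function below is a literal restatement of them, where a case counts as an actual positive when a test was completed OR an associated differential diagnosis was present.

```python
def outcome_label(pred_positive: bool, test_completed: bool, dx_present: bool) -> str:
    """Classify one prediction per the logical definitions above."""
    actual_positive = test_completed or dx_present  # test OR associated diagnosis
    if pred_positive:
        return "TP" if actual_positive else "FP"
    return "FN" if actual_positive else "TN"
```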

Final outcome metrics factor in both whether a diagnosis is present and whether an HCP ordered an associated test. This is included because a negative test result is often clinically important in an ED setting (eg, a forearm radiograph to rule out fracture after a hard fall). Therefore, the outcome metrics generated consider both final diagnosis and physician test ordering.

After model training, decision thresholds for each MLMD model were determined to limit the total number of FPs and achieve high PPVs. High PPVs are essential because our end goal is autonomous ordering of testing. As a consequence, TPRs are relatively reduced overall. The amount of excess testing that would potentially occur at the selected decision threshold is also computed as (TP + FN + FP) / (TP + FN). In addition, the time difference between when a patient completed triage and when testing was ordered was assessed for each of the MLMD use cases to measure the potential acceleration in care.
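A sketch of this threshold selection and the excess-testing calculation follows; `y_score`, the boolean arrays, the target PPV of 0.85, and the early stop at the first constraint violation are all illustrative assumptions.

```python
import numpy as np

def pick_threshold(y_score, test_completed, dx_present, target_ppv=0.85):
    """Lowest threshold (highest TPR) whose PPV still meets the target."""
    actual = test_completed | dx_present
    chosen = None
    for t in np.sort(np.unique(y_score))[::-1]:  # sweep strictest to loosest
        pred = y_score >= t
        tp = int(np.sum(pred & actual))
        fp = int(np.sum(pred & ~actual))
        if tp + fp and tp / (tp + fp) >= target_ppv:
            chosen = t  # keep loosening while the PPV constraint holds
        else:
            break  # assumes roughly monotone PPV; stops at first violation
    return chosen

def excess_testing(tp, fn, fp):
    # (TP + FN + FP) / (TP + FN): total tests ordered across both pathways
    # relative to tests actually needed; values near 1.0 mean little overtesting.
    return (tp + fn + fp) / (tp + fn)
```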

Model Explainability

Shapley Additive Explanations (SHAP) values were computed for the top-performing models to allow for model explainability.20,21 SHAP values help quantify the contribution of each feature toward an MLMD model prediction. These findings can be used to help determine model validity by analyzing whether our MLMD models are identifying relevant features that are in line with our own clinical judgment when ordering tests. We also use the same method to determine model explainability for each patient’s prediction (eFigure 1 in the Supplement). By computing explainability on a patient-to-patient basis, our models can begin to autonomously communicate to HCPs and patients why a test is being ordered in reference to their personal symptoms and features.
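A minimal sketch of this computation with the shap package's DeepExplainer is shown below; `model`, `X_background`, `X_test_tensor`, and `feature_names` are assumed to come from the training pipeline sketched earlier.

```python
import shap

# Assumed: `model` is the trained PyTorch NN and `X_background` is a tensor of
# representative training rows that DeepExplainer uses as a reference set.
explainer = shap.DeepExplainer(model, X_background)
shap_values = explainer.shap_values(X_test_tensor)  # per-feature contributions

# Global feature-importance summary (the Figure 3 view); per-patient
# explanations (the eFigure 1 view) come from individual rows of shap_values.
shap.summary_plot(shap_values, X_test_tensor.numpy(), feature_names=feature_names)
```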

Statistical Analysis

To uncover potential sex biases within our models, we used the Pearson χ2 test to look at the difference in error rates across sex and age. Error rates were defined as the rate of FP predictions over the total number of positives. With 2-tailed, unpaired testing, a significant difference (P < .05) in these error rates based on the χ2 test suggests that there may be a bias.
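Concretely, the comparison can be run as a 2 × 2 contingency test; the arrangement below is one plausible reading of the error-rate definition above, and the count variables are hypothetical names for FP errors and total positives per sex.

```python
from scipy.stats import chi2_contingency

# 2x2 table: rows = sex; columns = (FP errors, remaining positives).
table = [
    [fp_boys, pos_boys - fp_boys],
    [fp_girls, pos_girls - fp_girls],
]
chi2, p_value, dof, expected = chi2_contingency(table)
possible_bias = p_value < 0.05  # a significant difference suggests possible bias
```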

Results
MLMD Model Outcome Metrics

Data on 42 238 boys (54.7%) and 34 981 girls (45.3%) were included in model development; mean (SD) age was 5.4 (4.8) years. Deep NN models in general attained the highest absolute PPVs across each of the different MLMD use cases, although all 3 considered methods performed well overall. All models achieved high AUROCs as noted in the Table and Figure 2. The model for predicting bilirubin level testing obtained the best performance, with an AUROC of 0.99, a PPV of 0.94, and a TPR of 0.90. The most useful MLMD use case was the urinary testing NN model, given the high frequency of urinary testing completed in the ED combined with the model's PPV of 0.90 at a TPR of 0.30. This finding equates to potentially automating urine testing for approximately 2900 patients (3.8% of all patients in the ED) and a potential reduction of 183 minutes in waiting for these patients at the current model threshold.

Potential Time Savings Assessment

To quantify the initial association of MLMDs with wait times for patients, we measured the time between triage completion (ie, the time when an MLMD would activate) and when a test order is made in the EHR by an HCP. As noted in the Table, this time difference represents the potential efficiency gained within this segment of the patient journey if the directive is ordered at triage. Using these data, the weighted mean reduction was approximately 165 minutes per patient when the MLMD was activated. Many factors are associated with efficiency within an ED, and the bottlenecks that affect patient flow shift over time. A prospective time- and activity-based costing analysis to thoroughly evaluate autonomous ML systems is a next step we will take to fully evaluate the advantages of MLMDs.

Model Explainability

As seen in the SHAP value feature importance plots in Figure 3, each of the deep NN models is affected by clinically relevant features related to the differential diagnoses associated with the MLMD tests. For example, in the abdominal ultrasonography plot, when patients lack abdominal pain, the SHAP value is low and therefore reduces the model’s likelihood of ordering an abdominal ultrasonography. When concern for appendicitis is present in the triage note, the SHAP value is higher and pushes the model toward ordering an abdominal ultrasonography (Figure 3).

Sex Bias Analysis

Because race and ethnicity information was not collected at the Hospital for Sick Children, our ability to evaluate model bias was limited to age and sex. We assessed whether the number of FP errors in boys vs girls within our test set, relative to the total number of positive cases for each sex, differed significantly (Figure 4). When comparing FP error rates between boys and girls using χ2 tests, a significant difference of 0.04 was found for the urine dipstick model and a significant difference of 0.021 for the abdominal ultrasonography model (Figure 4A). Differential diagnoses and testing frequencies are expected to vary by age groupings in pediatric medicine.22 Given this expectation, further error analysis was completed for these 2 use cases comparing FP errors in boys and girls by age group, as seen in Figure 4B and C. The only significant difference in FP errors was found for girls aged 2 to 10 years in the urine dipstick MLMD model. Although a significant P value indicates a difference in the distributions, a nonsignificant one can also indicate insufficient data to detect a difference; we therefore intend to review bias features as more data become available.

Discussion

Machine learning medical directives are a balanced, generalizable approach to autonomously ordering testing in an ED while simultaneously avoiding overtesting. These models are not designed to identify all patients who require a specific test, but rather to enable clinical automation for a subset of patients identified by the model at a high PPV, so as to avoid FP results. Patients who require a test but are not identified simply undergo the standard of care pathway in the ED (Figure 1). Our dual pathway system can streamline care for a large proportion (22.3%) of patients in the ED, while the traditional pathway remains to safely ensure cases are not being missed.

We expect that as more time passes and we have more training data, MLMD systems could streamline care for a progressively larger number of patients in the ED. The number of use cases may also grow beyond what is typically managed by ED triage nurses, thus adding value to sites with and without preexisting medical directive protocols. Identifying testing needs for patients at the start of their ED visit can also assist with administrative planning for key stakeholders, such as diagnostic imaging departments, that often face staffing challenges when responding to surges in ED imaging requests. Models can also be used to trigger the automated delivery of targeted health education directly to patients and families in real time to help build health literacy and awareness of what to expect along their journey in the ED. Most importantly, this work helps to provide a framework for how any ED with an EHR can approach augmenting patient care with ML and clinical automation.

Determining the ideal operating threshold for any given MLMD model requires consideration of both patient-centric and system-level risks as well as costs that may occur as a result of FP testing.23,24 For example, autonomously ordering radiograph-based tests with high FP rates will not be deemed acceptable owing to radiation exposure. Autonomously ordering abdominal ultrasonography, despite being a radiation-free test, also cannot tolerate high FP rates owing to the relatively high resource use and costs associated with obtaining and interpreting the scans.

Explainability of ML decision-making on an individual patient level for each prediction is also an important goal to achieve, as HCPs and patients/caretakers need transparency as to why a particular test is being completed.25,26 The SHAP values are one option for computing how each patient’s specific feature inputs contribute to a model’s prediction and can thus be used to generate real-time explanations for HCPs and families (eFigure 1 in the Supplement). With this method, we can overcome the traditional black box limitations of not knowing why an ML model has made a decision and use transparent, patient-centric artificial intelligence solutions in the ED.

Before mass acceptance of clinical MLMD models, bias assessments must be completed.27 In particular, ensuring equity of model performance among different subgroups is critical, as there is a risk that ML systems will propagate systemic bias forward.28,29 Without information related to race and ethnicity, our bias assessment was focused on sex and did not show any significant difference in error rates, except in our urine dipstick testing model. A significant difference was found in FP error rates for urine dipstick testing, with more errors identified in the results for girls than boys between the ages of 2 and 10 years. Although the number of FP results was low overall, the pursuit of equitable model performance is a priority. Adjusting model class weighting for girls vs boys, training models specifically for each sex, and exploring further statistical analysis on the distributions of features for each sex and their influence on model output will be explored in future work.

We are beginning to collect race and ethnicity data in our hospital, and this information will be used for bias analysis in subsequent data sets. The addition of nonbinary gender labels to EHRs is another essential step required to promote further bias assessments and diverse model equity. We will also look at longer-term outcomes for patients to evaluate any inherent clinician biases present within our training data that are subsequently recapitulated in any MLMD workflow.

Our next steps also include human factors and human-computer interaction testing for patients and HCPs, leading to a clinical trial of MLMDs in a subset of patients.30 This prospective work will be important not only for building patient and HCP confidence in the use of autonomously acting models in health care, but also for informing legal, ethical, and regulatory bodies on associated policy development. Such work will be required before model acceptance at a scale beyond the context of research studies.

Limitations

The study has limitations. It was conducted in a single hospital ED spanning only 1 year and served as an initial study assessing the utility of ML models in predicting testing needs for common ED presentations. To confirm generalizability, prospective validation is required using data over multiple years from a variety of hospital sites. The association of model drift with overall performance, along with the frequency of model retraining to maintain performance, has yet to be determined and will require a multiyear data set. In addition, a cost-effectiveness analysis is required before implementation beyond clinical trials.

Conclusions

The findings of this study suggest that segments of health care in EDs can be automated, adding both efficiency and consistency to the way care is delivered to patients at scale. This service can be achieved through the development of MLMDs programmed to have high PPVs and low FPRs. When integrated into clinical workflow using an augmented dual-pathway system, automation can be achieved without overtesting.

Article Information

Accepted for Publication: January 6, 2022.

Published: March 16, 2022. doi:10.1001/jamanetworkopen.2022.2599

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2022 Singh D et al. JAMA Network Open.

Corresponding Author: Michael Brudno, PhD, Department of Computer Science, University of Toronto, King’s College Road, Pratt Building, Room 286C, Toronto, ON M5S 3G4, Canada (brudno@cs.toronto.edu).

Author Contributions: Dr Singh and Mr Drysdale had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Singh, Nagaraj, Drysdale, Fischer, Brudno.

Acquisition, analysis, or interpretation of data: Singh, Nagaraj, Mashouri, Goldenberg, Brudno.

Drafting of the manuscript: Singh, Nagaraj, Drysdale, Fischer, Brudno.

Critical revision of the manuscript for important intellectual content: Singh, Nagaraj, Mashouri, Goldenberg, Brudno.

Statistical analysis: Singh, Nagaraj, Mashouri, Goldenberg.

Obtained funding: Singh, Goldenberg, Brudno.

Administrative, technical, or material support: Singh, Drysdale, Fischer, Brudno.

Supervision: Fischer, Goldenberg, Brudno.

Conflict of Interest Disclosures: None reported.

Funding/Support: Funding was provided by the Canadian Institutes of Health Research and Genome Canada (Dr Brudno) and the SickKids Foundation (Drs Singh and Goldenberg). Drs Goldenberg and Brudno are Canadian Institute for Advanced Research artificial intelligence chairs.

Role of the Funder/Sponsor: The funding organizations had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Additional Contributions: We thank members of the Centre for Computational Medicine at the Hospital for Sick Children for useful discussions and computational resources.

Additional Information: The software code used to complete this study can be found on the following publicly available GitHub repository: https://github.com/DrDevSK/MLMD.

References
1. Doan Q, Wong H, Meckler G, et al; Pediatric Emergency Research Canada (PERC). The impact of pediatric emergency department crowding on patient and health care system outcomes: a multicentre cohort study. CMAJ. 2019;191(23):E627-E635. doi:10.1503/cmaj.181426
2. Bernstein SL, Aronsky D, Duseja R, et al; Society for Academic Emergency Medicine, Emergency Department Crowding Task Force. The effect of emergency department crowding on clinically oriented outcomes. Acad Emerg Med. 2009;16(1):1-10. doi:10.1111/j.1553-2712.2008.00295.x
3. Horwitz LI, Green J, Bradley EH. US emergency department performance on wait time and length of visit. Ann Emerg Med. 2010;55(2):133-141. doi:10.1016/j.annemergmed.2009.07.023
4. Deforest EK, Thompson GC. Advanced nursing directives: integrating validated clinical scoring systems into nursing care in the pediatric emergency department. Nurs Res Pract. 2012;2012:596393. doi:10.1155/2012/596393
5. Zemek R, Plint A, Osmond MH, et al. Triage nurse initiation of corticosteroids in pediatric asthma is associated with improved emergency department efficiency. Pediatrics. 2012;129(4):671-680. doi:10.1542/peds.2011-2347
6. Kinlin LM, Bahm A, Guttmann A, Freedman SB. A survey of emergency department resources and strategies employed in the treatment of pediatric gastroenteritis. Acad Emerg Med. 2013;20(4):361-366. doi:10.1111/acem.12108
7. Health Quality Ontario. Under pressure: emergency department performance in Ontario. 2016. Accessed January 29, 2022. http://www.hqontario.ca/portals/0/Documents/system-performance/under-pressure-report-en.pdf
8. Dewhirst S, Zhao Y, MacKenzie T, Cwinn A, Vaillancourt C. Evaluating a medical directive for nurse-initiated analgesia in the emergency department. Int Emerg Nurs. 2017;35:13-18. doi:10.1016/j.ienj.2017.05.005
9. Cabilan CJ, Boyde M. A systematic review of the impact of nurse-initiated medications in the emergency department. Australas Emerg Nurs J. 2017;20(2):53-62. doi:10.1016/j.aenj.2017.04.001
10. Ho JKM, Chau JPC, Cheung NMC. Effectiveness of emergency nurses’ use of the Ottawa Ankle Rules to initiate radiographic tests on improving healthcare outcomes for patients with ankle injuries: a systematic review. Int J Nurs Stud. 2016;63:37-47. doi:10.1016/j.ijnurstu.2016.08.016
11. Hwang CW, Payton T, Weeks E, Plourde M. Implementing triage standing orders in the emergency department leads to reduced physician-to-disposition times. Adv Emerg Med. 2016;2016:7213625. doi:10.1155/2016/7213625
12. Størdal K, Wyder C, Trobisch A, Grossman Z, Hadjipanayis A. Overtesting and overtreatment—statement from the European Academy of Paediatrics (EAP). Eur J Pediatr. 2019;178(12):1923-1927. doi:10.1007/s00431-019-03461-1
13. Donofrio JJ, Horeczko T, Kaji A, Santillanes G, Claudius I. Most routine laboratory testing of pediatric psychiatric patients in the emergency department is not medically necessary. Health Aff (Millwood). 2015;34(5):812-818. doi:10.1377/hlthaff.2014.1309
14. Li P, To T, Parkin PC, Anderson GM, Guttmann A. Association between evidence-based standardized protocols in emergency departments with childhood asthma outcomes: a Canadian population-based study. Arch Pediatr Adolesc Med. 2012;166(9):834-840. doi:10.1001/archpediatrics.2012.1195
15. Saddler N, Harvey G, Jessa K, Rosenfield D. Clinical decision support systems: opportunities in pediatric patient safety. Curr Treat Options Pediatr. 2020;6(6):1-11. doi:10.1007/s40746-020-00206-3
16. Kwok R, Dinh M, Dinh D, Chu M. Improving adherence to asthma clinical guidelines and discharge documentation from emergency departments: implementation of a dynamic and integrated electronic decision support system. Emerg Med Australas. 2009;21(1):31-37. doi:10.1111/j.1742-6723.2008.01149.x
17. Bell LM, Grundmeier R, Localio R, et al. Electronic health record–based decision support to improve asthma care: a cluster-randomized trial. Pediatrics. 2010;125(4):e770-e777. doi:10.1542/peds.2009-1385
18. Soldaini L, Goharian N. QuickUMLS: a fast, unsupervised approach for medical concept extraction. Presented at: Medical Information Retrieval (MedIR) Workshop, SIGIR; 2016. Accessed January 29, 2022. http://medir2016.imag.fr/data/MEDIR_2016_paper_16.pdf
19. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(database issue):D267-D270. doi:10.1093/nar/gkh061
20. Shrikumar A, Greenside P, Kundaje A. Learning important features through propagating activation differences. In: 34th International Conference on Machine Learning; August 6-11, 2017; Sydney, NSW, Australia.
21. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Preprint. Posted online November 25, 2017. arXiv:1705.07874
22. Williams K, Thomson D, Seto I, et al; StaR Child Health Group. Standard 6: age groups for pediatric trials. Pediatrics. 2012;129(suppl 3):S153-S160. doi:10.1542/peds.2012-0055I
23. Yu KH, Kohane IS. Framing the challenges of artificial intelligence in medicine. BMJ Qual Saf. 2019;28(3):238-241. doi:10.1136/bmjqs-2018-008551
24. Char DS, Abràmoff MD, Feudtner C. Identifying ethical considerations for machine learning healthcare applications. Am J Bioeth. 2020;20(11):7-17. doi:10.1080/15265161.2020.1819469
25. Cutillo CM, Sharma KR, Foschini L, Kundu S, Mackintosh M, Mandl KD; MI in Healthcare Workshop Working Group. Machine intelligence in healthcare—perspectives on trustworthiness, explainability, usability, and transparency. NPJ Digit Med. 2020;3:47. doi:10.1038/s41746-020-0254-2
26. Tonekaboni S, Joshi S, McCradden MD, Goldenberg A. What clinicians want: contextualizing explainable machine learning for clinical end use. Preprint. Posted online May 2019. arXiv:1905.05134. https://arxiv.org/abs/1905.05134
27. Wiens J, Saria S, Sendak M, et al. Do no harm: a roadmap for responsible machine learning for health care. Nat Med. 2019;25(9):1337-1340. doi:10.1038/s41591-019-0548-6
28. Obermeyer Z, Mullainathan S. Dissecting racial bias in an algorithm that guides health decisions for 70 million people. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
29. Pantell MS, Adler-Milstein J, Wang MD, Prather AA, Adler NE, Gottlieb LM. A call for social informatics. J Am Med Inform Assoc. 2020;27(11):1798-1801. doi:10.1093/jamia/ocaa175
30. Nagaraj S, Harish V, McCoy LG, et al. From clinic to computer and back again: practical considerations when designing and implementing machine learning solutions for pediatrics. Curr Treat Options Pediatr. 2020;6:336-349. doi:10.1007/s40746-020-00205-4