eFigure 1. Article Search Methods (AI indicates artificial intelligence; EHR, electronic health record)
Loftus TJ, Tighe PJ, Filiberto AC, et al. Artificial Intelligence and Surgical Decision-making. JAMA Surg. 2020;155(2):148–158. doi:10.1001/jamasurg.2019.4917
Surgeons make complex, high-stakes decisions under time constraints and uncertainty, with significant effect on patient outcomes. This review describes the weaknesses of traditional clinical decision-support systems and proposes that artificial intelligence should be used to augment surgical decision-making.
Surgical decision-making is dominated by hypothetical-deductive reasoning, individual judgment, and heuristics. These factors can lead to bias, error, and preventable harm. Traditional predictive analytics and clinical decision-support systems are intended to augment surgical decision-making, but their clinical utility is compromised by time-consuming manual data management and suboptimal accuracy. These challenges can be overcome by automated artificial intelligence models fed by livestreaming electronic health record data with mobile device outputs. This approach would require data standardization, advances in model interpretability, careful implementation and monitoring, attention to ethical challenges involving algorithm bias and accountability for errors, and preservation of bedside assessment and human intuition in the decision-making process.
Conclusions and Relevance
Integration of artificial intelligence with surgical decision-making has the potential to transform care by augmenting the decision to operate, informed consent process, identification and mitigation of modifiable risk factors, decisions regarding postoperative management, and shared decisions regarding resource use.
Surgeons make complex, high-stakes decisions when offering an operation, addressing modifiable risk factors, managing complications and optimizing resource use, and conducting an operation. Diagnostic and judgment errors are the second most common cause of preventable harm incurred by surgical patients.1 Surgeons report that lapses in judgment are the most common cause of their major errors.2 Surgical decision-making is dominated by hypothetical-deductive reasoning and individual judgment, which are highly variable and ill-suited to remedy these errors. Traditional clinical decision support tools, such as the National Surgical Quality Improvement Program (NSQIP) Surgical Risk Calculator, can reduce variability and mitigate risks, but their clinical adoption is hindered by suboptimal accuracy and time-consuming manual data acquisition and entry requirements.3-8
Although decision-making is one of the most difficult and important tasks that surgeons perform, there is a relative paucity of research investigating surgical decision-making and strategies to improve it. The objectives of this review are to describe challenges in surgical decision-making, review traditional clinical decision-support systems and their weaknesses, and propose that artificial intelligence models fed with livestreaming electronic health record (EHR) data would obviate these weaknesses and should be integrated with bedside assessment and human intuition to augment surgical decision-making.
PubMed and Cochrane Library databases were searched from their inception to February 2019 (eFigure in the Supplement). Articles were screened by reviewing their abstracts for the following criteria: (1) published in English, (2) published in a peer-reviewed journal, and (3) primary literature or a review article. Articles were selected for inclusion by manually reviewing abstracts and full texts for these criteria: (1) topical relevance, (2) methodologic strength, and (3) novel or meritorious contribution to existing literature. Articles of interest cited by articles identified in the initial search were reviewed using the same criteria. Forty-nine articles were included and assimilated into relevant categories (Table 1).1-49
The quality of surgical decision-making is influenced by patient values and emotions, patient-surgeon interactions, decision-making volume and complexity, time constraints, uncertainty, hypothetical-deductive reasoning, and individual judgment. There are effective and ineffective methods for dealing with each of these factors, which lead to positive and negative outcomes, respectively (Figure 1).
In the hypothetical-deductive decision-making model that dominates surgical decision-making, initial patient presentations are assessed to develop a list of possible diagnoses that are differentiated by diagnostic testing or response to empirical therapy. This depends on the surgeon’s ability to form a complete list of all likely diagnoses, all life-threatening diagnoses, and all unlikely diagnoses that may be considered if the initial workup excludes other causes. It also requires recognition of strengths and limitations of available tests. Once the diagnosis is established, the surgeon must recommend a plan using sound judgment. Each step introduces variability and opportunities for error.40
Patient values are individualized by nature, precluding the creation of a criterion standard of optimal decision-making. Understanding and incorporating these values is essential to an effective shared decision-making process.50 This may be accomplished by simply asking patients and caregivers about their goals of care and what they value most in life. Shared decision-making improves patient satisfaction and compliance and may reduce costs associated with undesired tests and treatments. However, patients, caregivers, and clinicians often misunderstand one another, their goals may differ, and patients and caregivers are often expected to make decisions with limited background knowledge and no medical training.13,33,50 Surgical diseases may evoke fear and anger, which influence perceptions of risks and benefits.23,51 Emotions surrounding an acute surgical condition may also create a sense of urgency and pressure on surgeons to perform futile operations.37
Surgical decision-making is often hindered by uncertainty owing to missing or incomplete data. This occurs when decisions regarding an urgent or emergent condition must be made before all relevant data can be gathered and analyzed. Nonurgent decisions may be hindered by time constraints and uncertainty owing to sheer decision-making volume, the time-consuming nature of manual data acquisition, and team dynamics. Academic intensivists make approximately 56 patient care and resource use decisions per day.36 In an assessment of medical student and resident intensive care unit (ICU) patient presentations, potentially important data were omitted from 157 of 157 presentations.10 Even when data collection and analysis are complete, high decision-making volume begets decision fatigue, manifesting as procrastination, less persistence when facing adversity, decreased physical stamina, and lower quality and quantity of mathematic calculations.49 These impairments are exacerbated by acute and chronic sleep deprivation, which occurs in as many as two-thirds of all acute care surgeons taking in-house call.52,53 For a surgical oncologist with a busy outpatient clinic, automated production of prognostic data from artificial intelligence models could improve efficiency and preserve face-to-face patient-surgeon interactions by obviating manual data acquisition and entry into prognostic models.
When facing time constraints and uncertainty, decision-making may be influenced by heuristics or cognitive shortcuts.54,55 Heuristics may lead to bias or predictable and systematic cognitive errors, as described in Table 2.16,35
Decision aids provide specific patient populations with background information, options for diagnosis and treatment, risks and benefits for each option, and outcome probabilities. In a systematic review44 including 31 043 patients facing screening or treatment decisions, patients exposed to decision aids felt more knowledgeable and played a more active role in the decision-making process. In a systematic review of 17 studies investigating decisions made by surgical patients, decision aids were associated with more knowledge regarding treatment options, preference for less invasive treatments, and no observable differences in anxiety, quality of life, morbidity, or mortality.30 However, because decision aids apply to heterogeneous patient populations with 1 common clinical presentation or choice, they do not consider individual patient physiology and risk factors.
Traditional prognostic scoring systems use regression modeling on aggregate patient populations to identify static variable risk factor thresholds, which are applied to individual patients. For example, elevated serum levels of C-reactive protein (CRP) are associated with anastomotic leak after colorectal surgery. A meta-analysis43 found that the optimal postoperative day 3 CRP cutoff value was 172 mg/L (to convert to nanomoles per liter, multiply by 9.524). This is easy to apply at the bedside but does not accurately reflect pathophysiology. Serum CRP has a relatively constant half-life, and its production is directly associated with inflammation along a continuum.56 If 4 different patients have CRP levels of 10 mg/L, 171 mg/L, 173 mg/L, and 1000 mg/L 3 days after a colectomy, few clinicians would group these patients according to the 172 mg/L cutoff. At this cutoff, the negative predictive value was 97%, such that a low value usually excludes a leak, but the positive predictive value was only 21%.
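The limitation of a single cutoff can be made concrete. The sketch below computes predictive values from a 2 × 2 table; the cohort size, leak rate, sensitivity, and specificity are hypothetical illustration values, not the meta-analysis data, chosen only to show how a dichotomized test can pair a high negative predictive value with a low positive predictive value when the outcome is rare.

```python
def predictive_values(tp, fp, tn, fn):
    """Positive and negative predictive value from a 2x2 confusion table."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

# Hypothetical cohort of 1000 post-colectomy patients with a 5% leak rate,
# dichotomized at a single CRP cutoff (illustrative numbers only).
leaks, no_leaks = 50, 950
sensitivity, specificity = 0.84, 0.80
tp = round(leaks * sensitivity)        # leaks above the cutoff
fn = leaks - tp                        # leaks missed by the cutoff
tn = round(no_leaks * specificity)     # non-leaks below the cutoff
fp = no_leaks - tn                     # false alarms above the cutoff

ppv, npv = predictive_values(tp, fp, tn, fn)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")  # → PPV = 0.18, NPV = 0.99
```

As in the published CRP data, a value below the cutoff is reassuring, but a value above it is far from diagnostic.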
Most diseases are not driven by a single physiologic parameter; therefore, prognostic scoring systems often incorporate multiple parameters for tasks such as measuring illness severity and predicting stroke and severe gastrointestinal bleeding.24,45,57 Parametric regression prognostic scoring systems assume that relationships among input variables are linear.22,29 When the relationships are nonlinear, the scoring system is similar to a coin toss.11
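The coin-toss behavior of a linear model on nonlinear relationships can be demonstrated on synthetic data. In this sketch the outcome depends on an interaction between two inputs (an XOR-style pattern chosen purely for illustration): logistic regression, which assumes a linear decision boundary, hovers near chance, while a tree ensemble captures the pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Two synthetic "physiologic" inputs whose interaction, not either alone,
# determines the outcome (an XOR-style nonlinearity).
X = rng.normal(size=(2000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

linear_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
forest_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print(f"logistic regression {linear_acc:.2f} (near chance), random forest {forest_acc:.2f}")
```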
To facilitate clinical adoption, prognostic scoring systems have been implemented as online risk calculators. The NSQIP Surgical Risk Calculator is a prominent example. Calculator use may increase the likelihood that patients will participate in risk-reduction strategies such as prehabilitation.3 However, input variables must be entered manually, and its predictive accuracy is suboptimal, especially for nonelective operations, representing opportunities for improvement.4-7
In 1970, William B. Schwartz published a Special Article in the New England Journal of Medicine stating, “Computing science will probably exert its major effects by augmenting and, in some cases, largely replacing the intellectual functions of the physician.”58 Despite extraordinary advances in computer technology, this vision has not been realized. Several factors may contribute. Traditional clinical decision-support systems require time-consuming manual data acquisition and entry, which impairs their adoption.8,33 Even the most successful and widely used static variable cutoff values do not accurately represent individual patient pathophysiology, as reflected by their suboptimal accuracy.34,43,56 Parametric regression equations also fail to represent the complex, nonlinear associations among input variables, further limiting the accuracy of traditional multivariable regression models.22,29 The weaknesses of traditional approaches may be overcome by artificial intelligence models fed with livestreaming intraoperative and EHR data to augment surgical decision-making through preoperative, intraoperative, and postoperative phases of care (Figure 2).
Artificial intelligence refers to computer systems that mimic human cognitive functions such as learning and problem-solving. In the broadest sense, a computer program using simple decision tree functions can mimic human intelligence. However, artificial intelligence usually refers to computer systems that learn from raw data with some degree of autonomy, as occurs with machine learning, deep learning, and reinforcement learning (Figure 3). Whereas traditional clinical decision-support systems use rules to generate codes and algorithms, artificial intelligence models learn from examples. Herein lies the strength of artificial intelligence for predictive analytics in medicine: human disease is simply too broad and complex to be explained and interpreted by rules.59,60
Machine learning is a subfield of artificial intelligence in which a computer system performs a task without explicit instructions. Supervised machine learning models require human domain expertise and computer engineering to design handcrafted feature extractors capable of transforming raw data into desired representations. The algorithm learns associations between input data and prescribed output categories. Once trained, a supervised model is capable of classifying new unseen input data. With unsupervised techniques, input data have no corresponding annotated output categories; the algorithm creates its own output categories according to the structure and distribution of the input data. This approach allows discovery of patterns and phenotypes that were unrecognized prior to model development.
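The supervised/unsupervised distinction can be sketched in a few lines. The data below are synthetic, and the two algorithms (logistic regression, k-means) are illustrative choices, not the methods of any particular study.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic two-phenotype cohort (illustrative, not clinical data).
X, y = make_blobs(n_samples=300, centers=2, random_state=0)

# Supervised: learn a mapping from input data to annotated output labels,
# then classify new unseen input.
clf = LogisticRegression(max_iter=1000).fit(X, y)
new_patient = [[0.0, 2.0]]
predicted_label = clf.predict(new_patient)[0]

# Unsupervised: no labels; the algorithm forms its own groups from the
# structure and distribution of the inputs alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```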
Machine learning has been used to accurately predict sepsis, in-hospital mortality, and acute kidney injury using intraoperative time-series data.9,21,27,32 Each machine learning algorithm has distinct advantages and disadvantages for different tasks such that performance depends on fit between algorithm and task. To capitalize on this phenomenon, SuperLearner ranks a set of candidate algorithms by their performance and applies an optimal weight to each, creating ensemble algorithms that can accurately predict transfusion requirements and mortality among trauma patients.20,28,38,39 In supervised and unsupervised machine learning, input features must be handcrafted using domain knowledge. In deep learning, features are extracted by the model itself.
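The ensemble idea can be sketched as follows. This is a simplified weighting scheme on synthetic data, not the published SuperLearner implementation, which selects and optimizes candidate weights more formally; here each candidate is scored out-of-fold and weighted by relative performance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0),
              GaussianNB()]

# Rank candidates by out-of-fold performance, then weight each accordingly
# (a crude stand-in for SuperLearner's formal weight optimization).
scores = np.array([cross_val_score(m, X, y, cv=5).mean() for m in candidates])
weights = scores / scores.sum()

fitted = [m.fit(X, y) for m in candidates]
ensemble_prob = sum(w * m.predict_proba(X)[:, 1] for w, m in zip(weights, fitted))
ensemble_pred = (ensemble_prob > 0.5).astype(int)
```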
Deep learning is a subfield of machine learning in which computer systems learn and represent highly dimensional data by adjusting weighted associations among input variables across a layered hierarchy of neurons or artificial neural network. Early warning systems that alert clinicians to unstable vital signs illustrate data dimensionality. As the number of vital sign data sources increases linearly, the combinations of alarm parameters that trigger early warning system alarms increase exponentially, resulting in frequent false alarms. Even without a corresponding exponential increase in observations, data are highly dimensional when many variables are used to represent a single patient or event, especially when the number of patients or events in the data set is relatively low, producing unique and rare mixtures of data. Prediction models are less effective when classifying mixtures of data that are rare or absent in the development or training data set. The ability of deep models to represent highly dimensional data is important to their application to surgical decision-making.
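The exponential growth in alarm-state combinations can be illustrated with simple arithmetic, using the simplifying assumption that each data source is either "normal" or "alarming":

```python
# If each of n vital-sign data sources can independently be "normal" or
# "alarming", the number of possible alarm-state combinations is 2**n:
# sources are added linearly, but combinations grow exponentially.
for n_sources in (2, 4, 8, 16):
    print(n_sources, "sources ->", 2 ** n_sources, "combinations")
```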
In deep models, the initial input and final output layers are connected by hidden layers containing hidden nodes. Each hidden node is assigned a weight that is influenced by previous layers, affects the output from that neuron, and has the potential to affect the outcome classification of the entire network. An algorithm optimizes and updates weights as the model is trained to achieve the strongest possible association between input and output layers. This structure allows accurate representation of chaotic and nonlinear yet meaningful relationships among input features. Deep models automatically learn optimal feature representations from raw data without handcrafted feature engineering, providing a logistical advantage over machine learning models that require time-intensive feature engineering.61 Automatic feature extraction also promotes discovery of novel patterns and phenotypes that may have been overlooked by handcrafted feature selection techniques.
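The weighted hidden layer and the optimization step can be sketched directly. The network below is a deliberately tiny illustration in plain NumPy (one hidden layer, a single gradient update of the output weights), not a practical deep model; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                # 64 patients, 3 input features
y = (X[:, 0] * X[:, 1] > 0).astype(float)   # outcome driven by an interaction

# Hidden layer: each hidden node carries weights that shape its output and,
# through the output layer, can affect the classification of the network.
W1 = rng.normal(size=(3, 8)) * 0.5          # input -> hidden weights
W2 = rng.normal(size=(8,)) * 0.5            # hidden -> output weights

def forward(X):
    hidden = np.tanh(X @ W1)                      # hidden-node activations
    prob = 1.0 / (1.0 + np.exp(-(hidden @ W2)))   # output probability
    return hidden, prob

def cross_entropy(prob):
    return -np.mean(y * np.log(prob) + (1 - y) * np.log(1 - prob))

# One optimization step: update the output weights along the gradient to
# strengthen the association between input and output layers.
hidden, prob = forward(X)
loss_before = cross_entropy(prob)
W2 -= 0.1 * (hidden.T @ (prob - y)) / len(y)
loss_after = cross_entropy(forward(X)[1])
```

A training loop repeats this update (and backpropagates through the hidden layer as well) until the loss stops improving.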
Clinical applications of deep learning benefit from the ability to include multiple different types and sources of data as inputs for a single model, including wearable sensors and cameras capturing patient movements and facial expressions with computer vision, an artificial intelligence subfield in which deep models use pixels from images and videos as inputs.60,62,63 Deep models have successfully performed patient phenotyping, disease prediction, and mortality prediction tasks.19,26,41,64 When applied to the same variable set used to calculate Sequential Organ Failure Assessment (SOFA) scores, deep models outperform traditional SOFA modeling in predicting in-hospital mortality for ICU patients.42 Preliminary data suggest that deep models are theoretically capable of accurately predicting risk for perioperative and postoperative complications and augmenting recommendations for operative management and the informed consent process. Despite their utility for predictive analytics, deep learning only provides outcome probabilities that loosely correspond to specific decisions and actions. In contrast, reinforcement learning is well suited to support specific decisions made by patients, caregivers, and surgeons.
Reinforcement learning is an artificial intelligence subfield in which computer systems identify actions yielding the highest probability of an outcome. Reinforcement models can be trained by series of trial and error scenarios, exposing the model to expert demonstrations, or a combination of these strategies. This occurs in a Markov decision process framework, consisting of a set of states, a set of actions, the probability that a certain action in a certain state will lead to a new state, and the reward that results from the new state. Using this framework, the system creates a policy that identifies the choice or action with the highest probability of a desired outcome, assessing total rewards attributable to multiple actions performed over time and the relative importance of present and future rewards, facilitating application of reinforcement learning to clinical scenarios that evolve over time.
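The Markov decision process framework can be sketched on a toy problem. The states, actions, transition probabilities, and rewards below are hypothetical, chosen only to show how value iteration over that framework yields a policy; real clinical models are far larger and learned from data.

```python
# Toy Markov decision process: a set of states, a set of actions,
# transition probabilities, and rewards (all values hypothetical).
states = ["stable", "deteriorating", "recovered"]
actions = ["observe", "operate"]

# transitions[state][action] -> list of (probability, next_state, reward)
transitions = {
    "stable": {
        "observe": [(0.8, "recovered", 1.0), (0.2, "deteriorating", -1.0)],
        "operate": [(0.9, "recovered", 0.5), (0.1, "deteriorating", -1.0)],
    },
    "deteriorating": {
        "observe": [(0.3, "recovered", 1.0), (0.7, "deteriorating", -1.0)],
        "operate": [(0.8, "recovered", 0.5), (0.2, "deteriorating", -1.0)],
    },
    "recovered": {a: [(1.0, "recovered", 0.0)] for a in actions},
}

gamma = 0.9  # discount: relative importance of present vs future rewards

# Value iteration: estimate the total discounted reward attainable
# from each state under the best available action.
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])
                for a in actions)
         for s in states}

# The policy selects, in each state, the action with the highest
# expected total discounted reward.
policy = {s: max(actions,
                 key=lambda a: sum(p * (r + gamma * V[s2])
                                   for p, s2, r in transitions[s][a]))
          for s in states}
```

Under these invented numbers, the policy recommends observation while the patient is stable and operation once deterioration begins.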
Reinforcement learning has been used to recommend optimal fluid resuscitation and vasopressor administration strategies for patients with sepsis.31 Ninety-day mortality was lowest when care provided by clinicians was concordant with model recommendations. Reinforcement learning has also been used to recommend basal and bolus insulin administration for virtual patients with type 1 diabetes.46 The algorithm performed as well as standard intermittent self-monitoring and continuous glucose monitoring methods, but with fewer episodes of hypoglycemia. Similar methods could be applied to augment the decision to operate.
The Health Information Technology for Economic and Clinical Health Act of 2009 incentivized adoption of EHR systems.65 Within 6 years, more than 4 of 5 US hospitals adopted EHRs.66 The volume of data generated by EHRs is staggering and will likely increase over time. Approximately 153 billion GB of data were generated in 2013, with projected growth of 48% per year.67 This data volume is ideal for artificial intelligence models, which thrive on large data sets.
Because EHRs are continuously updated as patient data become available, artificial intelligence models can provide real-time predictions and recommendations. Works published within the last year demonstrate the feasibility of this approach. The MySurgeryRisk platform uses EHR data for 285 variables to predict 8 postoperative complications with an area under the curve (AUC) of 0.82-0.94 and to predict mortality at 1, 3, 6, 12, and 24 months with an AUC of 0.77-0.83.15 Electronic health record data feed the algorithm automatically, obviating manual data search and entry and overcoming a major obstacle to clinical adoption. In a prospective study, the algorithm predicted postoperative complications with greater accuracy than physicians.17
To optimize clinical utility and facilitate adoption, automated model outputs could be provided to mobile devices. This would require several elements that communicate with one another reliably and efficiently, including robust quality filters, a public key infrastructure, and encryption that can only be deciphered by the intended receiver.68 Model outputs could be provided to mobile devices equipped with an appropriate REST API client-server relationship and security clearance or through Google Cloud Messaging. To our knowledge, automated surgical risk predictions with mobile device outputs have not yet been reported. However, efforts to use manual data entry to feed machine learning models for surgical risk prediction on mobile devices have been successful.14
Human intuition seems to arise from dopaminergic limbic system neurons that modify their connections with one another when a certain pattern or situation leads to a reward or penalty such as pleasure or pain.69,70 Subsequently, similar patterns or situations evoke positive and negative emotions, or gut feelings, which are powerful and effective decision-making tools. In a sentinel investigation12 of intuitive decision-making, participants drew cards from 1 of 4 decks for a cash reward. Two decks were rigged to be advantageous and 2 were rigged to be disadvantageous. Participants demonstrated measurable anxiety and perspiration when reaching for a disadvantageous deck after drawing only 10 cards and began to favor the advantageous decks after 50 cards, yet they could not consciously explain the differences between decks until they had drawn 80 cards. Similar phenomena occur in fight-or-flight survival responses, naval warfare, and financial decision-making.71,72 Intuition can also identify patients with life-threatening conditions that would be underappreciated by traditional clinical parameters alone.47,48
To produce models that may be integrated with any EHR in any setting, data must be standardized. The Fast Healthcare Interoperability Resources framework establishes standards for health information exchange using a set of universal components assembled into systems that facilitate data sharing across EHRs and cloud-based communications. In addition, Epic, the EHR vendor with dominant market share, holds exclusive rights to develop new functions within its system. To avoid legal conflicts, virtual models can live outside the EHR.15 However, this requires technology infrastructure that is not currently available in all clinical settings.
Diligent clinicians and informed patients will want to know why a computer program made a certain prediction or recommendation. Several techniques address this challenge, including attention mechanisms that reveal periods during which model inputs contributed disproportionately to the output, plotting pairwise similarities between data points to display phenotypic clusters, and training models on labeled patient data and then a linear gradient-boosting tree so that the model will assign relative importance to patient data input features.18,42,73
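The last of these interpretability techniques can be sketched with a boosted-tree model that assigns relative importance to input features. The feature names and the synthetic outcome below are hypothetical illustration choices, not data from the cited studies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Hypothetical inputs: in this synthetic cohort, only "lactate"
# actually drives the outcome.
features = ["age", "lactate", "heart_rate"]
X = rng.normal(size=(1000, 3))
y = (X[:, 1] > 0.5).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Relative importance the boosted trees assign to each input feature,
# which a clinician could inspect to see what drove a prediction.
importance = dict(zip(features, model.feature_importances_))
```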
If model inputs are flawed or model outputs are not carefully monitored by data scientists and interpreted by astute clinicians, many patients could be harmed in a short time frame. Artificial intelligence models trained on erroneous or misrepresentative data are likely to obscure the truth. Because studies with positive results are more likely to be submitted and published, artificial intelligence literature may be overly optimistic. Prior to clinical implementation, machine and deep learning models must be rigorously analyzed in a retrospective fashion and externally validated to ensure generalizability. Performing a stress test of artificial intelligence models by simulating erroneous and rare model inputs and assessing how the model responds may allow clinicians to better understand how and why failures occur. Initial prospective implementation should occur on a small scale under close monitoring, similar to phase 1 and 2 clinical trials for experimental medications, with analysis of how decision-support tools affect decisions across populations and among individual patients.74 In cooperation with the International Medical Device Regulators Forum, the US Food and Drug Administration created the Software as Medical Device category and developed a voluntary Software Precertification Program to aid health care software developers in creating, testing, and implementing Software as Medical Device. Medicolegal regulation of Software as Medical Device is not rigidly defined.
When algorithms are trained on data sets that are influenced by bias, algorithm outputs will likely reflect similar bias. In 1 prominent example, a model designed to augment judicial decision-making by predicting the likelihood of crime recidivism demonstrated predilection for racial/ethnic discrimination.75 When data used to train an algorithm are predominantly derived from patient populations with different demographics than the patient for whom the algorithm is applied, accuracy may suffer. For example, the Framingham Heart Study primarily included white participants. A model trained on these data may reflect racial and ethnic bias because associations between cardiovascular risk factors and events differ by race and ethnicity.25 Accountability for errors poses another challenge. Our justice system is well equipped to address scenarios in which an individual clinician is responsible for making an errant decision, but it may prove difficult to assign blame to a computer program and its developers.
Surgical decision-making is impaired by time constraints, uncertainty, complexity, decision fatigue, hypothetical-deductive reasoning, and bias, leading to preventable harm. Traditional decision-support systems are compromised by time-consuming manual data entry and suboptimal accuracy. Automated artificial intelligence models fed with livestreaming EHR data can address these weaknesses. Successful integration of artificial intelligence with surgical decision-making would require data standardization, advances in model interpretability, careful implementation and monitoring, attention to ethical challenges, and preservation of bedside assessment and human intuition in the decision-making process. Artificial intelligence models must be rigorously analyzed in a retrospective fashion with robust external validation prior to prospective clinical application under the close scrutiny of astute clinicians and data scientists. Properly applied, artificial intelligence has the potential to transform surgical care by augmenting the decision to operate, the informed consent process, identification and mitigation of modifiable risk factors, recognition and management of complications, and shared decisions regarding resource use.
Corresponding Author: Azra Bihorac, MD, MS, Precision and Intelligent Systems in Medicine, Division of Nephrology, Hypertension, and Renal Transplantation, Department of Medicine, University of Florida Health, PO Box 100224, Gainesville, FL 32610-0224 (email@example.com).
Accepted for Publication: August 24, 2019.
Published Online: December 11, 2019. doi:10.1001/jamasurg.2019.4917
Author Contributions: Dr Bihorac had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Loftus, Tighe, Efron, Mohr, Rashidi, Bihorac.
Acquisition, analysis, or interpretation of data: Filiberto, Brakenridge, Mohr, Upchurch.
Drafting of the manuscript: Loftus, Efron, Rashidi.
Critical revision of the manuscript for important intellectual content: Tighe, Filiberto, Efron, Brakenridge, Mohr, Upchurch, Bihorac.
Obtained funding: Brakenridge, Rashidi.
Administrative, technical, or material support: Loftus, Tighe, Efron, Brakenridge, Mohr, Upchurch.
Supervision: Tighe, Filiberto, Efron, Rashidi, Upchurch, Bihorac.
Conflict of Interest Disclosures: Dr Tighe reported grants from the National Institutes of Health during the conduct of the study. Dr Rashidi reported patents to Method and Apparatus for Pervasive Patient Monitoring pending and Systems and Methods for Providing an Acuity Score for Critically Ill or Injured Patients pending. Dr Bihorac reported grants from the National Institutes of Health and the National Science Foundation during the conduct of the study; in addition, Dr Bihorac has a patent to Systems and Methods for Providing an Acuity Score for Critically Ill or Injured Patients pending. No other disclosures were reported.
Funding/Support: Dr Efron was supported by R01 GM113945-01 from the National Institute of General Medical Sciences (NIGMS). Drs Bihorac and Rashidi were supported by R01 GM110240 from the NIGMS. Drs Bihorac and Efron were supported by P50 GM-111152 from the NIGMS. Dr Rashidi was supported by CAREER award NSF-IIS 1750192 from the National Science Foundation, Division of Information and Intelligent Systems, and by the National Institute of Biomedical Imaging and Bioengineering (grant R21EB027344-01). Dr Tighe was supported by R01GM114290 from the NIGMS. Dr Loftus was supported by a postgraduate training grant (T32 GM-008721) in burns, trauma, and perioperative injury from the NIGMS.
Role of the Funder/Sponsor: The National Institute of General Medical Sciences, National Science Foundation, and the National Institute of Biomedical Imaging and Bioengineering had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.