Assessment of a Simulated Case-Based Measurement of Physician Diagnostic Performance

Key Points

Question: Can automated scoring by an online case-based simulator be used as a valid measure of diagnostic performance?

Findings: This cohort study found that health care professionals with more experience and training demonstrated higher diagnostic performance scores, as measured on an online case simulator, The Human Diagnosis Project (Human Dx). Attending physicians were the most efficient and accurate in diagnostic performance compared with residents, interns, and medical students.

Meaning: Online case-based physician performance measurement has the potential to be a practical and scalable method for the assessment of diagnostic performance.


Introduction
The Institute of Medicine, in its 2015 report Improving Diagnosis in Health Care, 1 estimated that most people will experience at least 1 diagnostic error in their lifetime and showed that "getting the diagnosis right" is a crucial component of effective health care. Autopsy, in-hospital adverse event monitoring, and malpractice claim data all demonstrate unacceptably high rates of diagnostic errors, often resulting in significant morbidity and mortality. [2][3][4] Accrediting bodies, such as the Accreditation Council for Graduate Medical Education and the American Board of Internal Medicine, emphasize the foundational importance of diagnosis and clinical reasoning; these are intimately related to the core competencies of medical knowledge and patient care. 5

Foundational research on clinical reasoning found that no single superior reasoning strategy differentiated novice from expert clinicians. 6 Diagnostic expertise likely requires distinction in multiple cognitive domains: illness scripts and pattern recognition, decision trees, Bayesian reasoning, basic science, and physiology knowledge. 7

Possibly because of the multidimensional cognitive approach required for expert diagnostic performance, in both undergraduate and graduate medical education, diagnostic performance is typically inferred rather than measured; impressions of evaluating faculty are based on direct, and more often indirect, supervision of clinical behavior. 8,9 These observations are necessarily subjective and influenced by myriad biases, including context and content specificity, frame of reference and personal characteristics of the observer, personal knowledge, and experience. 10 As defined by the Institute of Medicine, diagnostic error is a "failure to establish an accurate and timely explanation of the patient's health problem…," 1 so a combination of accuracy and efficiency can be considered the hallmark of optimal diagnostic performance.
An ideal objective measure of diagnostic performance would (1) incorporate these 2 foundational components of diagnosis (accuracy and efficiency), (2) generate an assessment in real time with immediate feedback, (3) span a variety of content and contexts, and (4) be valid across experience levels, from students to practicing physicians. Case simulations that reveal information sequentially have the potential to measure both accuracy and efficiency, as they can assess not only whether the final diagnosis is correct (accuracy) but also whether the clinician is able to arrive at the correct diagnosis with less information (efficiency) (Figure). We conducted a study to validate the diagnostic performance scores derived from The Human Diagnosis Project (Human Dx).

Design and Participants
This retrospective cohort study used data from individuals who are registered with Human Dx. Most users are attending physicians, residents, and medical students. For this study, international users were not included, as nomenclature of training level can vary. We also excluded residents and attending physicians from fields other than internal medicine, family medicine, and emergency medicine.
Basic access to Human Dx is free to any medical professional; upon registration, users create a profile in which they self-report their name, specialty, training level, and institutional affiliation. Use of deidentified user data for research purposes is part of Human Dx's terms of service. Additional information about Human Dx can be found in the eAppendix in the Supplement. This study was approved by the Johns Hopkins University institutional review board.

Our analyses used Global Morning Report (GMR) cases, which are created and reviewed by contributors with expertise in medical education. Cases cover a range of inpatient and outpatient presentations from general adult medicine and subspecialty disciplines. Case analyses were completed on all GMR cases solved from January 21, 2016, through January 15, 2017.
To allow for solver familiarity with the Human Dx scoring system and a technology learning curve, solvers with fewer than 2 case solves were excluded from analysis. Furthermore, only credible solve attempts were used in our analyses by excluding solves in which there was no attempt to create a differential diagnosis prior to revealing the last finding (1% of solves).
The raw data collected on each case solve include the ranked differential diagnosis at each step of the case, mapped to major medical ontologies, including the World Health Organization's International Classification of Diseases.

Diagnostic Performance Outcomes
Two key metrics of diagnostic performance were analyzed for each user relative to other solvers on the same case: efficiency, a percentile score calculated from the proportion of findings revealed before the user first included the correct diagnosis in his or her differential diagnosis; and accuracy, analyzed in 2 ways: (1) a percentile calculated from how high on the differential the correct diagnosis appeared at the end of the case and (2) a binary measure, with credit given for having the correct diagnosis in the first position of the user's differential diagnosis at the end of the case (ie, with all findings revealed).

Figure. A case simulation on The Human Diagnosis Project includes a brief initial description in the case title and initial features of the presentation, listed under "case" (in the example shown, a 65-year-old man presenting to the emergency department with fever, malaise, and a progressive rash of 2 weeks' duration). A differential diagnosis is free texted by the user into the "assessment" box. Several findings can then be revealed with mouse clicks: "reveal past medical Hx," "reveal social Hx," and "reveal diagnostic." The differential diagnosis can be reordered using the ladder next to the text and added to using the "add working diagnosis" feature. BP indicates blood pressure; DRESS, drug reaction with eosinophilia and systemic symptoms; HR, heart rate; Hx, history; RR, respiratory rate; SpO2, peripheral capillary oxygen saturation; VS, vital signs.
Because high efficiency can be achieved with low accuracy (thinking of a diagnosis with less information but not placing it high on the final differential) and high accuracy can occur with low efficiency (not thinking of a diagnosis until all suggestive data are provided but placing it high on the final differential), a third, composite metric, Diagnostic Acumen Precision Performance (DAPP), was developed. It is calculated as the equally weighted average of the accuracy and efficiency percentiles for each solve attempt. This calculation for DAPP was conceived a priori, and it is consistent with the Institute of Medicine's emphasis on both timely and accurate diagnosis.
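As a minimal sketch of how these scores combine, the snippet below computes a within-case percentile and averages the 2 percentiles into DAPP. The half-credit-for-ties percentile convention and all input values are assumptions for illustration; the platform's exact formula is not specified here.

```python
def percentile_among_solvers(value, others, lower_is_better=True):
    """Percentile of one solve relative to other solvers on the same case.

    Assumed convention: full credit for each solver outperformed,
    half credit for each tie (an illustrative choice, not Human Dx's
    published formula).
    """
    if lower_is_better:
        better = sum(1 for v in others if value < v)
    else:
        better = sum(1 for v in others if value > v)
    ties = sum(1 for v in others if v == value)
    return 100.0 * (better + 0.5 * ties) / len(others)

# Efficiency input: proportion of findings revealed before the correct
# diagnosis first appeared on the differential (lower is better).
# Accuracy input: final rank of the correct diagnosis (lower is better).
# All numbers are invented.
eff_pct = percentile_among_solvers(0.40, [0.40, 0.60, 0.80, 1.00])  # 87.5
acc_pct = percentile_among_solvers(1, [1, 1, 2, 3])                 # 75.0
dapp = (eff_pct + acc_pct) / 2  # equally weighted average -> 81.25
```

Note that a solver can score high on one percentile and low on the other, which is precisely the imbalance the equally weighted composite is designed to capture.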

Establishing Validity Evidence for Human Dx
The research team assembled for this study has expertise in medical education, clinical excellence, diagnostic reasoning, and assessment. Furthermore, in considering the ways in which to measure and score diagnostic performance on the Human Dx platform, we presented ideas at institutional research conferences and met with consultants, both of which resulted in iterative revisions. These steps, along with a comprehensive literature review, confer content validity evidence on the scoring system. 11 With respect to internal structure validity, we expected that the diagnostic performance score would need to discriminate between those who had completed training, with the most clinical experience and presumably the most knowledge (attending physicians), and novices (medical students). The inclusion of only highly rated GMR cases for this study, judged to be particularly clear, corroborates response process validity evidence.

Statistical Analysis
Descriptive characteristics, including means and standard errors for the applicable variables, were computed. Linear mixed models were used to compare accuracy, efficiency, and DAPP among solvers of different levels of training (attending, resident, intern, and medical student) and affiliated academic institution (top 25 ranking for National Institutes of Health [NIH] grant funding or not and top 25 U.S. News & World Report [USNWR]-ranked medical school or not), with random case and solver effects. No fixed effects other than solver tenure were adjusted for. The models were fitted using restricted maximum likelihood, and nominal P values were calculated using t statistics. For binary accuracy, we used generalized linear mixed models with a logit link and random case effects; these models were fitted using pseudolikelihood, and nominal P values were calculated using t statistics. To adjust for multiple comparisons, we used the Tukey-Kramer method for pairwise comparisons among solver groups.

To assess internal consistency between accuracy and efficiency, we calculated the Cronbach α. 14 The intraclass correlation coefficient (ICC) for DAPP was calculated as the ratio of the variance between solvers to the sum of the variance between solvers and the residual variance, both estimated from a random-effects model with a random solver effect. 15 Because the ICC for a single solve was low and not reflective of the design of the platform, we then used the Spearman-Brown prophecy formula to calculate the ICC when 10 solves were averaged. 16 The Bonferroni correction was used for the top 25 USNWR and top 25 NIH funding institutional analyses.
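The reliability step can be illustrated with the standard one-way random-effects ICC and the Spearman-Brown prophecy formula; the variance components below are invented for illustration, not the study's estimates.

```python
def icc_single(var_between_solvers, var_residual):
    # One-way random-effects ICC for a single solve:
    # between-solver variance divided by total variance.
    return var_between_solvers / (var_between_solvers + var_residual)

def spearman_brown(icc1, k):
    # Reliability of the mean of k solves (Spearman-Brown prophecy formula).
    return k * icc1 / (1 + (k - 1) * icc1)

# Invented variance components:
icc1 = icc_single(var_between_solvers=5.0, var_residual=45.0)  # 0.10
icc10 = spearman_brown(icc1, k=10)                             # ~0.53
```

Averaging over 10 solves raises a modest single-solve reliability substantially, which is consistent with the platform's emphasis on repeated solves rather than single attempts.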
Results
Residents, interns, and medical students had similar efficiency performance (Table 3).

Internal Consistency
The internal consistency between accuracy and efficiency using the Cronbach α was acceptable (attending: 0.688; resident: 0.644; intern: 0.623; and medical student: 0.753). 16 The ICCs for the DAPP scores were fair to good according to conventional standards when averaged over 10 solves.
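For reference, the Cronbach α for the 2-item case (accuracy and efficiency percentiles, aligned by solver) follows the standard formula; the scores below are invented for illustration and are not the study's data.

```python
from statistics import pvariance

def cronbach_alpha(items):
    # items: one list of scores per item, aligned by solver.
    # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - sum_item_var / pvariance(totals))

# Invented accuracy and efficiency percentiles for 4 solvers:
accuracy = [80.0, 60.0, 90.0, 40.0]
efficiency = [70.0, 55.0, 85.0, 50.0]
alpha = cronbach_alpha([accuracy, efficiency])  # ~0.95 for these data
```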

Discussion
Using a data set of online case simulations, we have established validity evidence for a novel measure of diagnostic performance that yields automated scoring in real time. Evaluating more than 11 023 case simulations, we found that those with more clinical experience have higher scores on 2 key components of diagnostic performance, accuracy and efficiency. Key features of these case simulations include accessibility (can be solved on a variety of online devices, including tablets, desktops, and smartphones), peer-reviewed cases, brevity (cases averaged <3 minutes to solve), computerized summative scoring of open-ended responses, and immediate feedback on performance.

Limitations
Several limitations of this study should be considered. First, although poor-quality solve attempts were dropped from the data set, effort by users was likely variable; this is not unexpected, as the software is currently set up as a low-stakes, self-directed learning experience. We have no reason to believe that attention to detail by any particular group (eg, students) was consistently lower enough to translate into a systematic bias. Second, the demographic data, including institutional affiliation, are self-reported by users enrolling on Human Dx. Although at the time of the data pull most users had been using Human Dx for less than 12 months, users who did not update their level of training on the platform as they advanced could be misclassified. Third, unlike actual patient encounters, in which clinical information must be gathered and synthesized and a diagnosis pursued, the case simulations used in our analyses provide predetermined clinical data; consequently, our assessment of diagnostic performance may not correlate with bedside diagnostic skill. Fourth, Human Dx participation is voluntary, which could result in selection bias and limit generalizability. Additionally, because GMR cases are created for teaching purposes, they often contain a pathognomonic finding as the final clue, which may explain the absence of larger differences in accuracy performance across users. Finally, given that the Human Dx platform is virtual and automated, there may be performance bias in favor of more technologically adept individuals. In an attempt to minimize any such bias, initial solve attempts were excluded for all users.

Conclusions
Diagnostic acumen is paramount in providing optimal patient care. Rather than attempting to measure abstract reasoning processes, this analysis focused on concrete and actionable assessment of diagnostic performance outcomes-accuracy and efficiency in diagnosis. The online case simulations used in this analysis permitted rapid measurement of 2 critical components of diagnosis to validly assess performance. Consistent with deliberate practice, this technology provides immediate feedback based on performance, offers case-specific teaching points, and quantitatively compares performance to that of all other solvers; these features are extremely rare in medical education. 27 The platform is currently being used by medical trainees and physicians electively as part of their self-directed learning plan; this formative assessment has the potential to assist with professional development. This advance may represent an effective step in improving diagnosis in health care through robust measurement of diagnostic performance.