The Institute of Medicine recently highlighted that physician diagnostic error is common and information technology may be part of the solution.1 Given advancements in computer science, computers may be able to independently make accurate clinical diagnoses.2 While studies have compared computer vs physician performance for reading electrocardiograms,3 the diagnostic accuracy of computers vs physicians remains unknown. To fill this gap in knowledge, we compared the diagnostic accuracy of physicians with computer algorithms called symptom checkers.
Symptom checkers are websites and apps that help patients with self-diagnosis. After answering a series of questions, the user is given a list of rank-ordered potential diagnoses generated by a computer algorithm. Previously, we evaluated the diagnostic accuracy of 23 symptom checkers using 45 clinical vignettes.4 The vignettes included the patient’s medical history and had no physical examination or test findings. In this study we compared the diagnostic performance of physicians with symptom checkers for those same vignettes using a unique online platform called Human Dx.
Human Dx is a web- and app-based platform on which physicians generate differential diagnoses for clinical vignettes. Since 2015, Human Dx has been used by over 2700 physicians and trainees from 40 countries who have addressed over 100 000 vignettes.
The 45 vignettes, previously developed for the systematic assessment of online symptom checkers,4 were disseminated by Human Dx between December 2015 and May 2016 to internal medicine, family practice, or pediatrics physicians who did not know which vignettes were part of the research study. There were 15 high-, 15 medium-, and 15 low-acuity condition vignettes and 26 common and 19 uncommon condition vignettes.4 Physicians submitted free-text ranked differential diagnoses for each vignette. Each vignette was solved by at least 20 physicians.
Because physicians provided free-text responses, 2 study physicians (S.N. and D.M.L.) hand-reviewed the submitted diagnoses and independently determined whether the participant listed the correct diagnosis first or within the top 3 diagnoses. Interrater agreement was high (Cohen κ, 96%), and a third study physician (A.M.) resolved discrepancies (n = 60).
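For readers interested in how such an interrater statistic can be computed, the following is a minimal sketch (not the authors' actual code) using Python's scikit-learn; the two rating vectors are hypothetical stand-ins for the 2 reviewers' judgments of whether the correct diagnosis appeared in a participant's top 3.

```python
# Minimal sketch of computing Cohen's kappa for two raters' judgments.
# The rating vectors below are hypothetical; the study's actual data are not shown.
from sklearn.metrics import cohen_kappa_score

# 1 = reviewer judged the correct diagnosis to be listed in the top 3, 0 = not listed
rater_sn = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]   # hypothetical judgments by reviewer 1
rater_dml = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]  # hypothetical judgments by reviewer 2

kappa = cohen_kappa_score(rater_sn, rater_dml)
print(f"Cohen kappa: {kappa:.2f}")  # in the study, discrepancies went to a third reviewer
```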
We used χ2 tests of significance to compare physicians’ performance across vignette categories. Physician diagnostic accuracy was compared with previously reported symptom checker accuracy for these same vignettes using 2-sample tests of proportion.4 The study was deemed exempt by Harvard’s institutional review board, and participants were not compensated.
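As an illustration only, the two kinds of tests described above can be run with standard Python statistics libraries; the counts and denominators below are hypothetical placeholders, not the study data.

```python
# Sketch of the statistical comparisons described above, using hypothetical counts.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# Chi-square test: correct-first-diagnosis counts by vignette acuity (hypothetical).
# Rows: correct first / not correct first; columns: high, medium, low acuity.
acuity_table = np.array([[250, 220, 200],
                         [ 80, 110, 130]])
chi2, p_acuity, dof, _ = chi2_contingency(acuity_table)
print(f"chi2 = {chi2:.2f}, p = {p_acuity:.4f}")

# 2-sample test of proportions: physicians vs symptom checkers, correct diagnosis listed first.
# Counts and sample sizes are placeholders, not the study's actual denominators.
correct = np.array([650, 350])   # correct-first counts: physicians, symptom checkers
totals = np.array([900, 1030])   # evaluations contributed by each group
z, p_prop = proportions_ztest(correct, totals)
print(f"z = {z:.2f}, p = {p_prop:.4f}")
```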
Of the 234 physicians who solved at least 1 vignette, 211 (90%) were trained in internal medicine and 121 (52%) were fellows or residents (Table 1).
Physicians listed the correct diagnosis first more often across all vignettes compared with symptom checkers (72.1% vs 34.0%, P < .001) as well as in the top 3 diagnoses listed (84.3% vs 51.2%, P < .001) (Table 2).
Physicians were more likely to list the correct diagnosis first for high-acuity vignettes (vs low-acuity vignettes) and for uncommon-condition vignettes (vs common-condition vignettes). In contrast, symptom checkers were more likely to list the correct diagnosis first for low-acuity vignettes and common-condition vignettes (Table 2).
In what we believe to be the first direct comparison of diagnostic accuracy, physicians vastly outperformed computer algorithms (84.3% vs 51.2% with the correct diagnosis in the top 3 listed).4 Despite physicians’ superior performance, they provided an incorrect diagnosis in about 15% of cases, similar to prior estimates (10%-15%) of physician diagnostic error.5 While this project compared diagnostic performance, future work should test whether computer algorithms can augment physician diagnostic accuracy.6
Key limitations include our use of clinical vignettes, which likely do not reflect the complexity of real-world patients and did not include physical examination or test results. Physicians who chose to use Human Dx may not be a representative sample of US physicians and may therefore differ in diagnostic accuracy. Symptom checkers are only 1 type of computer diagnostic tool, and other tools may have superior performance.
Corresponding Author: Ateev Mehrotra, MD, MPH, Health Care Policy, Harvard Medical School, 180 Longwood Ave, Boston, MA 02115 (mehrotra@hcp.med.harvard.edu).
Published Online: October 10, 2016. doi:10.1001/jamainternmed.2016.6001
Author Contributions: Dr Mehrotra had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: All authors.
Acquisition, analysis, or interpretation of data: Semigran, Levine, Nundy.
Drafting of the manuscript: Semigran.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Semigran.
Administrative, technical, or material support: Levine, Mehrotra.
Study supervision: Nundy, Mehrotra.
Conflict of Interest Disclosures: Dr Nundy is an equity holder of The Human Diagnosis Project, the creators of Human Dx. No other disclosures are reported.
1. The National Academies of Sciences, Engineering, and Medicine. Improving Diagnosis in Health Care. Washington, DC: The National Academies Press; 2015.
2. Topol EJ. The Future of Medicine Is in Your Smartphone. The Wall Street Journal. January 9, 2015; The Saturday Essay.
3. Poon K, Okin PM, Kligfield P. Diagnostic performance of a computer-based ECG rhythm algorithm. J Electrocardiol. 2005;38(3):235-238.
4. Semigran HL, Linder JA, Gidengil C, Mehrotra A. Evaluation of symptom checkers for self diagnosis and triage: audit study. BMJ. 2015;351:h3480.
6. Bond WF, Schwartz LM, Weaver KR, Levick D, Giuliano M, Graber ML. Differential diagnosis generators: an evaluation of currently available computer programs. J Gen Intern Med. 2012;27(2):213-219.