Figure. Study Design: Description of how data were used to construct the model, how subsets were labeled, and where metrics were calculated.

Table. Examples of Predictions on Various Snippets.
Research Letter
March 25, 2019

Automatically Charting Symptoms From Patient-Physician Conversations Using Machine Learning

Author Affiliations: Google LLC, Mountain View, California
JAMA Intern Med. 2019;179(6):836-838. doi:10.1001/jamainternmed.2018.8558

Automating clerical aspects of medical record keeping through speech recognition during a patient’s visit1 could allow physicians to dedicate more time directly to patients. We assessed the feasibility of using machine learning to automatically populate a review of systems (ROS) of all symptoms discussed in an encounter.

Methods

We used 90 000 human-transcribed, deidentified medical encounters described previously.2 We randomly selected 2547 encounters, drawn from primary care and selected medical subspecialties, for labeling of 185 symptoms by scribes. The remaining encounters were used for unsupervised training of our model, a recurrent neural network3,4 of a type commonly used for language understanding. We reported model details previously.5
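
The architecture itself is described in the cited modeling paper.5 As a rough illustration only, the sketch below shows a minimal recurrent tagger of the general kind used for such tasks: a bidirectional GRU encoder over word embeddings with per-symptom classification heads. The names, dimensions, and status encoding are hypothetical assumptions for this sketch, not the study's model, and the relevance output is folded into a single status head for brevity.

```python
# Minimal sketch of a recurrent symptom tagger (hypothetical; not the study's model).
# Assumes PyTorch. Vocabulary size, embedding/hidden sizes, and the status encoding
# are placeholders chosen for illustration.
import torch
import torch.nn as nn

NUM_SYMPTOMS = 185  # symptoms labeled by scribes (per the Methods)
NUM_STATUSES = 3    # assumed encoding: not mentioned / experienced / not experienced

class SymptomTagger(nn.Module):
    def __init__(self, vocab_size=30_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.symptom_head = nn.Linear(2 * hidden_dim, NUM_SYMPTOMS)                # which symptoms are mentioned
        self.status_head = nn.Linear(2 * hidden_dim, NUM_SYMPTOMS * NUM_STATUSES)  # status of each symptom

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded snippet of conversation turns
        embedded = self.embed(token_ids)
        _, hidden = self.rnn(embedded)                # hidden: (2, batch, hidden_dim)
        summary = torch.cat([hidden[0], hidden[1]], dim=-1)
        symptom_logits = self.symptom_head(summary)   # multilabel symptom presence
        status_logits = self.status_head(summary).view(-1, NUM_SYMPTOMS, NUM_STATUSES)
        return symptom_logits, status_logits

# Forward pass on a dummy batch of 2 snippets, 40 tokens each.
model = SymptomTagger()
dummy = torch.randint(1, 30_000, (2, 40))
symptom_logits, status_logits = model(dummy)
print(symptom_logits.shape, status_logits.shape)  # torch.Size([2, 185]) torch.Size([2, 185, 3])
```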

Because some mentions of symptoms were irrelevant to the ROS (eg, a physician mentioning “nausea” as a possible adverse effect), scribes assigned each symptom mention a relevance to the ROS, defined as being directly related to a patient's experience. Scribes also indicated whether the patient experienced the symptom. A total of 2547 labeled transcripts were randomly split into training (2091 [80%]) and test (456 [20%]) sets.
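
A minimal sketch of one plausible representation of a scribe label and a transcript-level random split follows; the field names and the randomization procedure are assumptions for illustration, since the study reports only the resulting counts.

```python
# Sketch of a scribe label record and an 80/20 transcript-level split.
# Field names and the randomization are assumptions for illustration; the study
# reports only the resulting counts (2091 training, 456 test).
import random
from dataclasses import dataclass

@dataclass
class SymptomLabel:
    transcript_id: str
    symptom: str           # one of the 185 labeled symptoms
    relevant_to_ros: bool  # directly related to the patient's experience
    experienced: bool      # whether the patient reported experiencing it

random.seed(0)  # reproducibility of this illustration only
transcript_ids = [f"t{i:04d}" for i in range(2547)]  # placeholder IDs
random.shuffle(transcript_ids)
cut = int(round(0.8 * len(transcript_ids)))
train_ids, test_ids = transcript_ids[:cut], transcript_ids[cut:]
print(len(train_ids), len(test_ids))  # a plain 80/20 cut; the study's exact counts differ slightly
```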

From the test set, we selected 800 snippets containing at least 1 of 16 common symptoms that would be included in the ROS and asked 2 scribes to independently assess how likely they would be to include the initially labeled symptom in the ROS. When both said “extremely likely,” we defined the symptom as “clearly mentioned.” All other symptom mentions were considered “unclear.”
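
The agreement statistic reported in the Results (Cohen κ) and the “clearly mentioned” determination can be computed as sketched below. The example ratings and the rating categories beyond “extremely likely” are invented for illustration; only the “extremely likely” category is described in the text.

```python
# Sketch of the inter-rater agreement and "clearly mentioned" determination.
# The example ratings and the full rating scale are hypothetical.
from sklearn.metrics import cohen_kappa_score

rater_a = ["extremely likely", "likely", "extremely likely", "unlikely", "likely"]
rater_b = ["extremely likely", "extremely likely", "likely", "unlikely", "likely"]

kappa = cohen_kappa_score(rater_a, rater_b)  # Cohen kappa over the two scribes' ratings
clearly_mentioned = [
    a == "extremely likely" and b == "extremely likely"
    for a, b in zip(rater_a, rater_b)
]
print(f"kappa = {kappa:.2f}, clearly mentioned = {sum(clearly_mentioned)}/{len(clearly_mentioned)}")
```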

The input to the machine learning model was a sliding window of 5 conversation turns (snippets), and its output was each symptom mentioned, its relevance, and whether the patient experienced it. We assessed sensitivity and positive predictive value across the entire test set. We additionally calculated the sensitivity of identifying the symptom and the accuracy of correct documentation in clearly vs unclearly mentioned symptoms. The Figure outlines the study design. The study was exempt from institutional review board approval because of the retrospective, deidentified nature of the data set; the snippets presented in this manuscript are synthetic, modeled after real spoken-language patterns, and are not from the original data set and contain no data derived from actual patients.
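
A minimal sketch of the 5-turn sliding window is shown below, assuming a stride of 1 turn and a simple speaker-prefixed turn format; the dialogue lines are invented for illustration, consistent with the synthetic snippets noted above.

```python
# Sketch of building model inputs as a sliding window of 5 conversation turns.
# Stride and turn format are assumptions; the dialogue is invented for illustration.
def sliding_snippets(turns, window=5):
    """Yield overlapping snippets of `window` consecutive turns."""
    for start in range(len(turns) - window + 1):
        yield turns[start:start + window]

conversation = [
    "DR: What brings you in today?",
    "PT: I've had a cough for about a week.",
    "DR: Any fever or chills?",
    "PT: No fever, but some chills at night.",
    "DR: Any shortness of breath?",
    "PT: A little when I climb stairs.",
]
for snippet in sliding_snippets(conversation):
    text = " ".join(snippet)  # each joined snippet would be tokenized and passed to the model
    print(text[:60], "...")
```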

Results

In the test set, there were 5970 symptom mentions. Of these, 4730 (79.3%) were relevant to the ROS, and of the relevant mentions, 3510 (74.2%) were experienced by the patient.

Across the full test set, the sensitivity of the model to identify symptoms was 67.7% (5172/7637) and the positive predictive value of a predicted symptom was 80.6% (5172/6417). We show examples of snippets and model predictions in the Table.
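
The aggregate figures follow directly from the counts in the text; a brief check of the arithmetic, labeling the denominators as implied by the reported ratios:

```python
# Reproducing the reported aggregate metrics from the counts given in the text.
true_positives = 5172            # symptom mentions the model identified correctly
sensitivity_denominator = 7637   # all labeled mentions evaluated (per the reported ratio)
ppv_denominator = 6417           # all mentions the model predicted (per the reported ratio)

sensitivity = true_positives / sensitivity_denominator  # ≈ 0.677
ppv = true_positives / ppv_denominator                  # ≈ 0.806
print(f"sensitivity = {sensitivity:.1%}, positive predictive value = {ppv:.1%}")
```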

From human review of the 800 snippets, slightly less than half of symptom mentions were clear (387/800 [48.4%]), with fair agreement between raters on the likelihood of including a symptom as initially labeled in the ROS (κ = 0.32; P < .001). For clearly mentioned symptoms, the sensitivity of the model was 92.2% (357/387); for unclearly mentioned symptoms, it was 67.8% (280/413).

The model accurately documented (ie, correctly identified the symptom, correctly classified its relevance to the note, and correctly assigned whether it was experienced) 87.9% (340/387) of symptoms mentioned clearly and 60.0% (248/413) of those mentioned unclearly.
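
For completeness, the subgroup figures in this and the preceding paragraph follow from the reported counts:

```python
# Reproducing the clear vs unclear subgroup figures from the counts in the text.
subgroups = {
    "clearly mentioned":   {"n": 387, "identified": 357, "documented": 340},
    "unclearly mentioned": {"n": 413, "identified": 280, "documented": 248},
}
for name, counts in subgroups.items():
    sens = counts["identified"] / counts["n"]   # sensitivity of identification
    acc = counts["documented"] / counts["n"]    # accuracy of correct documentation
    print(f"{name}: sensitivity = {sens:.1%}, accurate documentation = {acc:.1%}")
```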

Discussion

Previous discussions of autocharting take for granted that the same technologies that work on our smartphones will work in clinical practice. By adapting such technology to a simple ROS autocharting task, we report a key challenge not previously considered: a substantial proportion of symptoms are mentioned so vaguely that even human scribes do not agree on how to document them. Encouragingly, the model performed well on clearly mentioned symptoms, but its performance dropped substantially on unclearly mentioned ones. Solving this problem will require precise, though not necessarily jargon-heavy, communication. Further research will be needed to assist clinicians with more meaningful tasks such as documenting the history of present illness.

Article Information

Corresponding Author: Alvin Rajkomar, MD, Google LLC, 1600 Amphitheatre Pkwy, Mountain View, CA 94043 (alvinrajkomar@google.com).

Accepted for Publication: December 8, 2018.

Published Online: March 25, 2019. doi:10.1001/jamainternmed.2018.8558

Open Access: This article is published under the JN-OA license and is free to read on the day of publication.

Author Contributions: Dr Rajkomar had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study concept and design: All authors.

Acquisition, analysis, or interpretation of data: Rajkomar, Kannan, Chen, Vardoulakis, Chou.

Drafting of the manuscript: Kannan, Chen, Rajkomar, Vardoulakis.

Critical revision of the manuscript for important intellectual content: Rajkomar, Kannan, Chen, Chou, Cui, Dean.

Statistical analysis: Rajkomar, Kannan, Chen, Vardoulakis.

Obtained funding: Chou.

Administrative, technical, or material support: Chou.

Study supervision: Rajkomar, Chou, Cui, Dean.

Conflict of Interest Disclosures: All authors are employed by and own stock in Google. In addition, as part of a broad-based equity portfolio intending to mirror the US and International equities markets (eg, MSCI All Country World, Russell 3000), Jeff Dean holds individual stock positions in many public companies in the health care and pharmacological sectors, and also has investments in managed funds that also invest in such companies, as well as limited partner and direct venture investments in private companies operating in these sectors. All other health care–related investments are managed by independent third parties (institutional managers) with whom Jeff Dean has no direct contact and over whom Jeff Dean has no control. The authors have a patent pending for the machine learning tool described in this study. No other conflicts are reported.

Additional Contributions: We thank Kathryn Rough, PhD, and Mila Hardt, PhD, for helpful discussions on the manuscript; Mike Pearson, MBA, Ken Su, MBA, MBH, and Kasumi Widner, MS, for data collection; Diana Jaunzeikare, BA, Chris Co, PhD, Daniel Tse, MD, and Nina Gonzalez, MD, for labeling; Linh Tran, PhD, Nan Du, PhD, Yu-hui Chen, PhD, Yonghui Wu, PhD, Kyle Scholz, BS, Izhak Shafran, PhD, Patrick Nguyen, PhD, Chung-cheng Chiu, PhD, and Zhifeng Chen, PhD, for helpful discussions on modeling; and Rebecca Rolfe, MSc, for illustrations. All individuals work at Google. They were not compensated outside of their normal duties for their contributions.

References
1.
Verghese A, Shah NH, Harrington RA. What this computer needs is a physician: humanism and artificial intelligence. JAMA. 2018;319(1):19-20. doi:10.1001/jama.2017.19198
2.
Chiu C-C, Tripathi A, Chou K, et al. Speech Recognition for Medical Conversations. In: Interspeech 2018. ISCA; 2018. https://www.isca-speech.org/archive/Interspeech_2018/abstracts/0040.html. Accessed December 8, 2018.
3.
Sutskever I, Vinyals O, Le QV. Sequence to Sequence Learning with Neural Networks. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ, eds. Advances in Neural Information Processing Systems. Vol 27. 2014. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf. Accessed December 8, 2018.
4.
Cho K, van Merriënboer B, Gülçehre Ç, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics; 2014:1724-1734.
5.
Kannan A, Chen K, Jaunzeikare D, Rajkomar A. Semi-supervised Learning for Information Extraction from Dialogue. In: Interspeech 2018. ISCA; 2018. https://www.isca-speech.org/archive/Interspeech_2018/abstracts/1318.html. Accessed December 8, 2018.