Analysis of Errors in Dictated Clinical Documents Assisted by Speech Recognition Software and Professional Transcriptionists

IMPORTANCE Accurate clinical documentation is critical to health care quality and safety. Dictation services supported by speech recognition (SR) technology and professional medical transcriptionists are widely used by US clinicians. However, the quality of SR-assisted documentation has not been thoroughly studied.
OBJECTIVE To identify and analyze errors at each stage of the SR-assisted dictation process.
DESIGN, SETTING, AND PARTICIPANTS This cross-sectional study collected a stratified random sample of 217 notes (83 office notes, 75 discharge summaries, and 59 operative notes) dictated by 144 physicians between January 1 and December 31, 2016, at 2 health care organizations using Dragon Medical 360 | eScription (Nuance). Errors were annotated in the SR engine–generated document (SR), the medical transcriptionist–edited document (MT), and the physician’s signed note (SN). Each document was compared with a criterion standard created from the original audio recordings and medical record review.


Introduction
Clinical documentation is among the most time-consuming and costly aspects of using an electronic health record (EHR) system.1,2 Speech recognition (SR), the automatic translation of voice into text, has been a promising technology for clinical documentation since the 1980s. A recent study reported that 90% of hospitals plan to expand their use of SR technology.3 There are 2 primary ways that SR can assist the clinical documentation process. In this study, we evaluated back-end SR (Figure, A), in which physicians' dictations are captured and converted to text by an SR engine. The SR-generated text is edited by a professional medical transcriptionist (MT), then sent back to the physician for review. The other type is commonly called front-end SR (Figure, B). Here, physicians dictate directly into free-text fields of the EHR and edit the transcription before saving the document.
A recent study in Australia evaluated the type and prevalence of errors in documents created using a front-end SR system and those created using a keyboard and mouse.4 A higher prevalence of errors was found in notes created with SR, both overall and across most error types included in the analysis.6,7 However, to our knowledge, the quality and accuracy of clinical documents created using back-end SR have not been thoroughly studied.
Medical errors largely result from failed communication.8 Clinical documentation is essential for communication of a patient's diagnosis and treatment and for care coordination between clinicians. Documentation errors can put patients at significant risk of harm.9 An analysis of medical malpractice cases found that incorrect information (eg, faulty data entry) was the top EHR-related contributing factor, contributing to 20% of reviewed cases.10,11,14,15 As more medical institutions adopt SR software, we need to better understand how it can be used safely and efficiently.
In this study, we analyzed errors at different processing stages of clinical documents collected from 2 institutions using the same back-end SR system. We hypothesized that error rates would be highest in original SR transcriptions, lower in notes edited by transcriptionists, and lower still in physicians' signed notes (SN). We also expected significant differences in mean error rates between note types and between notes created by physicians of different specialties. We expected no significant difference in mean error rates among physicians of different sexes or from different institutions.

Figure legend: There are 2 primary ways that speech recognition (SR) can assist the clinical documentation process. In back-end SR (A), clinicians' dictations, the audio original (AO), are captured and converted to text by an SR engine. The SR-generated text is edited by a professional medical transcriptionist (MT), then sent back to the clinician for review and signing, which produces the signed note (SN). In front-end SR (B), clinicians dictate directly into free-text fields of the electronic health record and edit the transcription.

Methods
This cross-sectional study was conducted and reported following the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.16 Approval for this study was obtained from the Partners Human Research Committee and the Colorado Multiple Institutional Review Board. The study was determined by both institutional review boards to meet the criteria for a waiver of informed consent. Analysis was conducted from June 15, 2016, to November 17, 2017.

Clinical Setting and Data Collection
This study used 217 documents dictated between January 1 and December 31, 2016. We reviewed each note at the 4 main processing stages of dictation. These were (1) our own transcription of the original audio recording (used as the criterion standard, described in the following section), (2) the note generated by the SR engine of the vendor transcription service (SR note), (3) the note following revision by a professional MT (MT note), and (4) the note after having been reviewed and signed by the physician (SN).

Criterion Standard, Annotation Schema, and Annotation Process
To create the criterion standard for each note, a PharmD candidate or medical student, under the supervision of 2 practicing physicians, created a transcription of the note while listening to the original audio and using the MT note as a reference. The audio was played repeatedly, at different speeds, to ensure the transcription's accuracy. Medical record review was conducted to validate notes' content, such as by referring to a patient's structured medication list to verify a medication order that was partially inaudible in the original audio recording.
A team of clinical informaticians, computational linguists, and clinicians developed a schema for identifying and classifying errors iteratively over multiple annotation rounds. The schema includes 12 general types (eg, insertion), 14 semantic types (eg, medication), and a binary classification of clinical significance.17 An error was considered clinically significant if it could plausibly change a note's interpretation, thereby potentially affecting a patient's future care either directly (eg, by influencing clinical decisions or treatment options) or indirectly (eg, by causing billing errors or affecting litigation proceedings). The complete annotation schema is shown in Table 1. The schema includes brief descriptions of each type of error. Each error type is also accompanied by 1 or more examples found during the course of our annotation.
We used Knowtator,18 an open-source annotation tool, to annotate notes at each stage. Two annotators (1 computational linguist and 1 medical student) independently annotated the SR-transcribed, transcriptionist-edited, and signed versions of each note for errors. Each document was further annotated for the presence or absence of the following changes: automatic abbreviation expansion by the SR system, disfluencies or misspoken words on the part of the dictating physician, stylistic changes (eg, rewording a grammatically incorrect sentence) by the transcriptionist and the signing physician, rearranging of the note's content by the transcriptionist and the physician, and the addition and removal of content by the physician prior to signing. Two practicing physicians independently evaluated errors for clinical significance, and disagreements were reconciled through discussion.

Measures
We determined the time required to dictate a note, along with each note's turnaround time and physician review time. We defined turnaround time as the length of time between the original dictation's completion and when the transcriptionist-revised document was sent back to the EHR. Physician review time was the length of time between when the transcription was returned to the physician and when the physician signed the note.
For each version of each note, we analyzed the differences between that note and the corresponding criterion standard note. We determined the error rate (ie, the number of errors per 100 words), the median error rate with interquartile ranges, the mean number of errors per note, the frequency of each error type (the number of errors of a specific type divided by the total number of errors), and the percentage of notes containing at least 1 error. We conducted these analyses for all errors and for just those errors that were found to be clinically significant. Throughout our analyses, a document's error rate is defined as the total number of errors it contains (or, equivalently, the total number of insertions, deletions, and substitutions) divided by the number of words in the corresponding criterion standard.
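The error rate definition above (insertions, deletions, and substitutions divided by the number of words in the criterion standard) can be illustrated with a minimal Python sketch. Note the assumptions: the study counted errors through manual annotation, whereas this sketch approximates the count automatically with a word-level edit distance, and the function names `count_word_errors` and `error_rate_per_100_words` are hypothetical, not taken from the study.

```python
def count_word_errors(reference, hypothesis):
    """Smallest number of word insertions, deletions, and substitutions
    that turns `reference` into `hypothesis` (word-level edit distance).
    Assumption: whitespace tokenization; the study annotated errors by hand."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] is the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word deleted
                curr[j - 1] + 1,     # hypothesis word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate_per_100_words(reference, hypothesis):
    """Errors per 100 words, normalized by the criterion standard length."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("criterion standard must contain at least one word")
    return 100.0 * count_word_errors(reference, hypothesis) / n_words


if __name__ == "__main__":
    # Example pair taken from the annotation schema (number error).
    standard = "the patient is a 17-year-old female"
    sr_output = "the patient is a 70-year-old female"
    print(count_word_errors(standard, sr_output))
    print(error_rate_per_100_words(standard, sr_output))
```

Normalizing by the criterion standard rather than by the transcribed note keeps the denominator fixed across the SR, MT, and SN versions of the same note, so the three rates are directly comparable.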
We calculated interannotator agreement using a randomly selected subset of 33 notes, which included 7 SR-transcribed notes and 26 transcriptionist-edited notes, considering these stages' variations in error complexity (eg, transcriptionists' edits often involve subtle rewordings, which must be distinguished from true errors). Agreement was defined as the percentage of errors for which both annotators selected the same general and semantic type. For each error, we required only that the spans selected by each annotator overlap with one another to some degree, rather than requiring exact span matches. For clinical significance, agreement was defined as the percentage overlap between the 2 physicians' classifications.

Table 1 footnotes:
a A number error is a more specific type of insertion, deletion, or substitution.
b A punctuation error is a more specific type of insertion or substitution.
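The overlap-based agreement measure described in the Measures text can be sketched as follows. This is only an illustration under stated assumptions: the `ErrorAnnotation` record, the character-offset representation of spans, and the greedy first-overlap pairing are all hypothetical choices, since the article does not describe how overlapping annotations were paired.

```python
from typing import NamedTuple


class ErrorAnnotation(NamedTuple):
    """Hypothetical record for one annotated error."""
    start: int          # character offset where the span begins (inclusive)
    end: int            # character offset where the span ends (exclusive)
    general_type: str   # eg, "insertion"
    semantic_type: str  # eg, "medication"


def spans_overlap(a, b):
    """True if the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def type_agreement(annotator_1, annotator_2):
    """Return (agreement, n_matched) for errors found by both annotators.

    Two annotations are paired when their spans overlap to any degree.
    Agreement is the fraction of pairs with the same general and semantic
    type. Assumption: each annotation is paired at most once, greedily,
    with the first unused overlapping annotation from the other annotator.
    """
    used = set()
    matched = 0
    agreed = 0
    for a in annotator_1:
        for idx, b in enumerate(annotator_2):
            if idx in used or not spans_overlap(a, b):
                continue
            used.add(idx)
            matched += 1
            if (a.general_type, a.semantic_type) == (b.general_type, b.semantic_type):
                agreed += 1
            break
    if matched == 0:
        return None, 0
    return agreed / matched, matched
```

Errors found by only 1 annotator are left out of the denominator here, which mirrors the definition of agreement as a percentage of errors for which both annotators made a type selection.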

Statistical Analysis
Analyses were conducted in R statistical software (R Project for Statistical Computing),19 with t tests used to identify significant differences in mean error rates at each stage by sex, specialty, and note type. For comparisons involving more than 2 groups (eg, specialty), each group's mean error rate was compared with that of all other groups combined. We calculated the Pearson correlation coefficient (r) to measure the strength of associations between error rate and physician age and between error rate and document length. We considered 2-sided P values of less than .05 to be statistically significant.
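As a rough illustration of the comparisons described in the Statistical Analysis text, the following Python sketch computes a one-vs-rest t statistic for each group and a Pearson correlation coefficient. The study itself used R; the choice of the unequal-variance (Welch) form of the t test is an assumption, as the article does not state which variant was used, and the function names are hypothetical. P values are not computed because that requires the t distribution, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's t test.

    Assumption: unequal variances. Each sample needs at least 2 values,
    and the samples must not both have zero variance.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    var_mean_a = variance(sample_a) / n_a  # squared standard error of mean A
    var_mean_b = variance(sample_b) / n_b  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_mean_a + var_mean_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (var_mean_a + var_mean_b) ** 2 / (
        var_mean_a ** 2 / (n_a - 1) + var_mean_b ** 2 / (n_b - 1)
    )
    return t, df


def one_vs_rest(error_rates_by_group):
    """Compare each group's error rates with all other groups combined."""
    results = {}
    for group, rates in error_rates_by_group.items():
        rest = [
            rate
            for other, other_rates in error_rates_by_group.items()
            if other != group
            for rate in other_rates
        ]
        results[group] = welch_t(rates, rest)
    return results


def pearson_r(xs, ys):
    """Pearson correlation coefficient between paired observations."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)
```

Passing a dictionary that maps each specialty to its list of per-note error rates to `one_vs_rest` yields one (t, degrees of freedom) pair per specialty, each comparing that specialty with all others pooled.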

Results
The effect of human review on note accuracy becomes more pronounced when considering just those errors that are clinically significant, rather than treating all errors as equally meaningful. Prior to human revision, 138 of 217 notes (63.6%) had at least 1 clinically significant error, with a mean (SD) of 2.2 (2.7) errors per note. After being edited by an MT, 32 notes (14.7%) had clinically significant errors, and only 17 SNs (7.8%) contained such errors. However, the proportion of errors involving clinical information increased from 15.8% to 26.9% after transcriptionist revision, although it decreased slightly to 25.9% in SNs. Similarly, the proportion of errors that were clinically significant increased from 5.7% in the original SR transcriptions to 8.9% after being edited by an MT, then decreased to 6.4% in SNs.
Table 3 shows the number and proportion of each error type across the 3 processing stages for each note type and for all notes combined. At all stages, deletion was the most prevalent general type (34.7%), followed by insertion (27.0%). The most frequent semantic type was general English. Medication was the most common clinical semantic type in the original SR transcriptions, while diagnosis was most common in the transcriptionist-edited and signed versions.
Transcriptionists made stylistic changes to 180 (82.9%) of the 217 notes and rearranged the contents of 37 notes (17.1%), usually at the request of the dictating physician. Physicians made stylistic changes to 71 notes (32.7%) and rearranged the contents of 8 notes (3.7%). Finally, there were 59 notes (27.2%) to which the signing physician added information, and 37 (17.1%) from which the physician deleted information.
Errors in original SR transcriptions occurred at similar frequencies for male and female physicians, with 7.5 and 7.7 mean errors per 100 words, respectively (difference, 0.2%; 95% CI, −1.2% to 1.6%; P = .78). A modest negative correlation was observed between age and error rate in original SR transcriptions (r = −0.20; 95% CI, −0.35 to −0.04; P = .01), with average error rates decreasing as physician age increased.

Discussion
This study is among the first, to our knowledge, to analyze errors at the different processing stages of documents created with a back-end SR system. We defined a comprehensive schema to systematically classify and analyze errors across multiple note types. The comparatively large sample and the variety of clinicians and hospitals represented increase the robustness of our findings vs those of previous studies. Of 33 studies included in 2 recent systematic reviews of SR use in health care,5,7 most evaluated the productivity of SR-assisted dictation compared with traditional transcription and typing.32,33 In contrast, our findings are based on dictations from 144 physicians across a wide range of clinical settings and at 2 geographically distinct institutions.
Speech recognition technology is being adopted at increasing rates at health care institutions across the country owing to its many advantages. Documentation is one of the most time-consuming parts of using EHR technology, and SR technology promises to improve documentation efficiency and save clinicians time. In back-end systems, SR software automatically converts clinicians' dictations to text that MTs can quickly review and edit, reducing turnaround time and increasing productivity; however, it should be noted that turnaround times are typically stipulated in the contract with the transcriptionist vendor and may vary widely for this reason. Additionally, some notes remained unsigned for weeks or months, although they can still be viewed by other EHR users during this time. Many hospitals are adopting front-end dictation systems, where clinicians must review and edit their notes themselves, either as they dictate or at a later time. Clinicians face pressure to decrease documentation time and often only superficially review their notes before signing them.9 Fully shifting the editing responsibility from transcriptionists to clinicians may lead to increased documentation errors if clinicians are unable to adequately review their notes.
Basma et al32 reported that SR-generated breast imaging reports were 8 times more likely than conventionally dictated reports (23% vs 4% before adjusting for confounders) to contain major errors that could affect the understanding of a report or alter patient care. Our study also identified errors involving clinical information that could have such unintended impacts. For example, we found an SN that incorrectly listed a patient as having a "grown mass" instead of a "groin mass" because of an uncorrected error in the original SR transcription. We also found evidence suggesting some clinicians may not review their notes thoroughly, if they do so at all. Transcriptionists typically mark portions of the transcription that are unintelligible in the original audio recording with blank spaces (eg, ??__??), which the physician is then expected to fill in. However, we found 16 SNs (7.4%) that retained these marks, and in 3 instances, the missing word was discovered to be clinically significant. While additional medical record review found no evidence for the persistence of these omissions in subsequent documentation, such a risk still exists. Although adoption of SR technology is intended to ease some of the burden of documentation, the fact that even readily apparent gaps at times remain uncorrected raises concerns about whether physicians have sufficient time and resources to review their dictated notes, even to a superficial degree. As previously mentioned, a recent study in Australia reported that emergency department clinicians needed 18% more time for documentation when using SR than when using a keyboard and mouse.4 The authors also observed 4.3 times as many errors in SR-generated documents compared with those created with a keyboard and mouse. We observed a similar trend; the SR transcriptions we reviewed had a mean (SD) of 7.4 (4.8) errors per 100 words, while in an earlier study we found errors in typed notes at a rate of 0.45 errors per 100 words. However, SR technology is continually improving, while clinicians' skills with and attention to keyboard and mouse documentation may not be improving at a similar rate.

JAMA Network Open | Health Informatics
In general, health information technology and the EHR have introduced a number of potential sources for error. A recent study found higher rates of errors in the EHR than in paper records, possibly attributable to EHR-specific functionality such as templates and the ability to copy and paste text.34 Taken together, these findings demonstrate the necessity of further studies investigating clinicians' use of and satisfaction with SR technology, its ability to integrate with clinicians' existing workflows, and its effect on documentation quality and efficiency compared with other documentation methods. In addition, these findings indicate a need not only for clinical quality assurance and auditing programs, but also for clinician training and education to raise awareness of these errors and strategies to reduce them.

Limitations
The notes in our analysis were all created using the same back-end SR service. Furthermore, while larger in scale than many previous studies, our analysis was still conducted on a relatively small set of notes created in a limited number of clinical settings. As such, our findings may not be generalizable to SR-assisted documentation as a whole. Additionally, sex and specialty information was unavailable in the data sources to which we had access for 10 and 26 physicians, respectively. These missing data may limit our ability to draw conclusions about the effect these characteristics may have on error rates.
Despite the iterative testing and revision that preceded the annotation schema's finalization, there are some additional error types we may wish to include in subsequent work. For example, the lack of a body location semantic type resulted in some confusion over how errors involving these words should be annotated, potentially leading to inconsistent annotations. In some cases, it may have been useful to divide an existing type into more granular subtypes. In particular, the stop word semantic type, which was included to distinguish short, frequently used words from other general English terms, may have inadvertently masked the true prevalence of highly specific but still commonly observed errors, such as those involving pronouns (eg, he or she) or negations.
Because of the time-intensive nature of the annotation task, we calculated interannotator agreement using only a small subset (33 of 651 [5.0%]), rather than requiring both individuals to annotate the full set of notes. This subset also included primarily notes that had been edited by MTs (26 of 33 [78.7%]), owing to the fact that errors in these notes are often more difficult to identify and may generate more disagreement.

Future Directions
These findings lay the groundwork for many subsequent research activities. First, the developed schema can be used to annotate more notes, obtained from a wider variety of clinical domains, to create a robust corpus of errors in clinical documents created with SR technology. The benefits of such a corpus are considerable. Not only will it allow for more reliable error prevalence estimates, but it can also serve as training data for the development of an automatic error detection system. With the rapid adoption of SR in clinical settings, there is a need for automated methods based on natural language processing for identifying and correcting errors in SR-generated text. Such methods are

Figure. Stages of Back-End and Front-End Dictation

The study documents, dictated between January 1 and December 31, 2016, came from hospitals at 2 health care organizations: Partners HealthCare System in Boston, Massachusetts, and University of Colorado Health System (UCHealth) in Aurora, Colorado. Both organizations use Dragon Medical 360 | eScription (Nuance). Because hospitals use dictation for different note types, we collected a stratified random sample based on the different note types dictated at each hospital. The sample includes 44 operative notes, 83 office notes, and 40 discharge summaries from Partners HealthCare and 15 operative notes and 35 discharge summaries from UCHealth. We collected data for dictating physicians' age, sex, and specialty.

Table 1. Annotation Schema (continued)
Each row lists the error type, its description, and an example (AO, audio original; SR, speech recognition output; MT, medical transcriptionist-edited note).
SR: driving an Academy and hit another car
General type
Nonsense: A substitution that is so far off that it is unclear which category (if any) it falls under. Example: AO: follow up in 3 to 5 d; SR: neck veins are evaluated
Number (a): Any error involving a number, whether it is written as a digit (2) or as a word (two). Example: AO: the patient is a 17-year-old female; SR: the patient is a 70-year-old female
Punctuation (b): A period, comma, or other punctuation mark was present where it should not have been. Example: AO: at discharge she had no flank tenderness; SR: at discharge. She had no flank tenderness
Semantic type
General English: Any English words that do not fit into the categories below. Example: AO: which she would otherwise forget; SR: which she would otherwise for gas that
Stop word: Common English words.17 Example: AO: intermittent pain under the right breast; SR: intermittent pain in the right breast
Medication: Medication names and dose information. Example: AO: initiated on lamotrigine therapy; SR: initiated on layman will try therapy
Diagnosis: Any words that are part of a specific medical diagnosis. Example: AO: Dengue; SR: DKA [diabetic ketoacidosis]
MT: no foreign ??__?? was identified
[59-1911] words) per document. The average dictation duration was 5 minutes, 46 seconds (median [range], 4 minutes, 45 seconds [21 seconds to 31 minutes, 35 seconds]). The average turnaround time was 3 hours, 37 minutes (median [range], 1 hour, 1 minute [2 minutes to 38 hours, 45 minutes]). The average physician review time was 4 days, 13 hours, 16 minutes (median [range], 23 hours, 25 minutes [0 minutes to 146 days, 4 hours, 54 minutes]).
There were 329 errors in the 33-note subset. For the 171 errors that were identified by both annotators, interannotator agreement was 71.9%. Each of the annotators failed to identify a mean (SD) of 21.7% (1.5%) of the errors that were annotated by the other. Of the errors identified by only 1 annotator, 32 errors (20.3%) pertained to clinical information; the remaining 126 (79.7%) involved minor changes to general English words. Agreement for clinical significance was 85.7%.
Examples of errors of each type that were identified in this data set can be found in Table 1. Detailed results of our error analysis are shown in Table 2 and Table 3. Errors were prevalent in original SR transcriptions, with an overall mean (SD) error rate of 7.4% (4.8%). The rate of errors decreased substantially following revision by MTs, to 0.4%. Errors were further reduced in SNs, which had an overall error rate of 0.3%. The number of notes containing at least 1 error also decreased with each processing stage. Of the 217 original SR transcriptions, 209 (96.3%) had errors.

Table 2. Summary of Error Rates by Note Type and Processing Stage
Abbreviations: IQR, interquartile range; MT, medical transcriptionist-edited notes; SN, signed notes; SR, automatic transcriptions by the speech recognition system.

Table 3. Error Types in Dictated Notes by Note Type and Processing Stage
a Percentages are equal to the number of errors of each type divided by the total number of errors; percentages may not sum to 100 because of rounding.
b Includes suffix, prefix, dictionary, and spelling errors.
c Includes general English, stop word, and date errors.
d Includes patient or physician identifier, interpretation, psychological test, and ??? (unintelligible or otherwise unclassifiable) errors.

Table 4. Mean Error Rates Compared by Institution, Note Type, Physician Sex, and Specialty
Abbreviations: MT, medical transcriptionist-edited notes; SN, signed notes; SR, automatic transcriptions by the speech recognition system.
a Number omitted to preserve deidentification.
b Physician sex was missing for 10 notes.
c Physician specialty was missing for 26 notes.