Comparison of International Classification of Diseases and Related Health Problems, Tenth Revision Codes With Electronic Medical Records Among Patients With Symptoms of Coronavirus Disease 2019

Key Points
Question: Do International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) codes accurately capture presenting symptoms of fever, cough, and dyspnea among patients being tested for coronavirus disease 2019 (COVID-19)?
Findings: In this cohort study, an electronic medical record review of 2201 patients tested for COVID-19 between March 10 and April 6, 2020, found that ICD-10 codes had poor sensitivity and negative predictive value for capturing fever, cough, and dyspnea.
Meaning: These findings suggest that symptom-specific ICD-10 codes do not accurately capture COVID-19–related symptoms and should not be used to populate symptoms in electronic medical record–based cohorts.


Introduction
Health care organizations need rapid access to high-quality, multicenter data to support scientific discovery during the pandemic of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of coronavirus disease 2019 (COVID-19). Electronic medical record (EMR) data could be repurposed to populate COVID-19 registries and surveillance systems. Several organizations are moving quickly to aggregate EMR data across multiple institutions to meet data needs. 1 However, some critical data elements specific to COVID-19 may be unreliably captured by standard terminologies used in EMRs. The International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) is a widely used terminology in which each code represents a clinical concept. 2 Some codes may lack accuracy for the intended condition, 3,4 a challenge that is germane to COVID-19-related symptoms. The goal of this project was to compare ICD-10 codes with manual EMR review in capturing symptoms of fever, cough, and dyspnea among patients being tested for SARS-CoV-2 infection.

Methods
This cohort study was approved by the University of Utah institutional review board, which waived the requirement for informed consent because the study was retrospective and posed no more than minimal risk to participants. This study follows the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline for cohort studies. 5 Candidate patients for this analysis included 9355 patients tested for SARS-CoV-2 at University of Utah Health from March 10 to April 6, 2020. University of Utah Health is a tertiary academic health care system in the Mountain West that includes inpatient care and regional community clinics. The system maintains an operational dashboard of all patients tested for SARS-CoV-2. These analyses were built off this dashboard, linking the medical record numbers to the Enterprise Data Warehouse (EDW) to capture ICD-10 billing codes. The EDW aggregates data across the health system, to create a central resource for operations and research. We included anyone who was tested at our center, regardless of where the test was conducted (eg, emergency department, drive-through, or inpatient). Symptoms were not a prerequisite for testing, because institutional policy changed during the study period from only testing symptomatic patients with a known exposure to SARS-CoV-2 to testing any patient with suspected SARS-CoV-2 infection. Patients were tested using direct quantitative reverse transcriptase-polymerase chain reaction detection of SARS-CoV-2 RNA, predominantly from nasopharyngeal swabs. Serology-based testing was not used during the study period.

Review Classification Process
The symptoms of interest were fever, cough, and dyspnea, which are common in COVID-19. 6-8 A convenience sample of 2201 patient EMRs was reviewed. Early in the pandemic, a REDCap registry 9 was prepopulated with tested patients in nearly real time, including the text from clinical notes during the 24-hour period before or after the time of the test. By default, in Python statistical software, the patients were sorted by EMR number. Each patient's EMR was reviewed by 1 of 7 reviewers, and each symptom was labeled as present, absent, or unmentioned; these labels served as the reference standard. After the initial review phase on March 31, 2020, we calculated the proportion of patients reviewed per day among all tested patients; additional patients were reviewed as needed to achieve approximately 20% reviewed per day (range, 18%-50%). Fewer patients (171) were reviewed for the period after March 31 through April 6, 2020; these patients were randomly selected from the registry.
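The per-day review proportion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reviewed_proportion_by_day` and the list-of-dates inputs are hypothetical and are not taken from the study's code.

```python
from collections import Counter
from datetime import date


def reviewed_proportion_by_day(tested_dates, reviewed_dates):
    """Return {test date: fraction of that day's tested patients who were reviewed}.

    tested_dates: one date per tested patient (hypothetical input layout).
    reviewed_dates: test date of each patient whose EMR was reviewed.
    """
    tested = Counter(tested_dates)
    reviewed = Counter(reviewed_dates)
    # Days with no reviewed patients get a proportion of 0.0.
    return {day: reviewed.get(day, 0) / n for day, n in sorted(tested.items())}


if __name__ == "__main__":
    d1, d2 = date(2020, 3, 10), date(2020, 3, 11)
    # Illustrative counts only, not study data.
    print(reviewed_proportion_by_day([d1] * 10 + [d2] * 5, [d1] * 2 + [d2]))
```

Days whose proportion falls below the target (approximately 20% in the text) would then be candidates for additional review.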
for dyspnea) were selected on the basis of the specifications suggested by the National COVID Cohort Collaborative. 10 The asterisk (*) denotes that any code starting with the specified alphanumeric sequence would be included (eg, R06.03 is included for dyspnea). Using this approach, the following codes were present in our data: R50.9, R50.81, R05, R06.02, R06.00, R06.03, and R06.09. Patients with at least 1 code in a given category were classified as having the symptom according to ICD-10 code.
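The wildcard rule can be illustrated with a simple prefix match. In this sketch the prefixes (`R50` for fever, `R05` for cough, `R06.0` for dyspnea) are an assumption inferred from the codes listed above, and the names `SYMPTOM_PREFIXES` and `symptoms_from_codes` are hypothetical.

```python
# Assumed prefixes, inferred from the codes reported in the text;
# they are not a verbatim copy of the study's specification.
SYMPTOM_PREFIXES = {
    "fever": ("R50",),
    "cough": ("R05",),
    "dyspnea": ("R06.0",),
}


def symptoms_from_codes(codes):
    """Classify a patient as having each symptom if at least 1 code matches."""
    normalized = [code.strip().upper() for code in codes]
    return {
        symptom: any(code.startswith(prefixes) for code in normalized)
        for symptom, prefixes in SYMPTOM_PREFIXES.items()
    }


if __name__ == "__main__":
    # R06.03 matches the dyspnea prefix; Z20.828 matches none of the three.
    print(symptoms_from_codes(["R06.03", "Z20.828"]))
```

A patient with no matching code is classified as not having the symptom according to ICD-10, which is where false negatives arise when the symptom is documented only in the clinical note.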

Statistical Analysis
Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated by comparing ICD-10 codes with the reference standard; 95% CIs were calculated by bootstrapping the point estimates using random resampling with replacement to create 1000 samples of the same size as the original group. The empirical bootstrap distribution was then used to calculate the 95% CIs for each performance characteristic. Symptom present was compared with symptom absent or unmentioned, combined. Unmentioned symptoms are more likely to be absent than present, but given the uncertainty, we performed a sensitivity analysis in which unmentioned symptoms were assumed to be present.
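The calculation can be sketched as follows. This is a minimal illustration, not the study's code: the function names are hypothetical, the percentile method for the interval is an assumption (the text says only that the empirical bootstrap distribution was used), and the `unmentioned_as_present` flag mirrors the sensitivity analysis described above.

```python
import random

METRICS = ("sensitivity", "specificity", "ppv", "npv")


def to_binary(labels, unmentioned_as_present=False):
    """Map reviewer labels ('present', 'absent', 'unmentioned') to booleans."""
    positive = {"present", "unmentioned"} if unmentioned_as_present else {"present"}
    return [label in positive for label in labels]


def _ratio(num, den):
    return num / den if den else float("nan")


def performance(reference, coded):
    """Compare ICD-10-based classification (coded) with the reference standard."""
    tp = sum(1 for r, c in zip(reference, coded) if r and c)
    fp = sum(1 for r, c in zip(reference, coded) if not r and c)
    fn = sum(1 for r, c in zip(reference, coded) if r and not c)
    tn = sum(1 for r, c in zip(reference, coded) if not r and not c)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def bootstrap_ci(reference, coded, n_boot=1000, alpha=0.05, seed=0):
    """Approximate percentile bootstrap CI (an assumed choice of interval method)."""
    rng = random.Random(seed)
    pairs = list(zip(reference, coded))
    n = len(pairs)
    draws = {name: [] for name in METRICS}
    for _ in range(n_boot):
        # Resample patients with replacement, same size as the original group.
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        ref_s, code_s = zip(*sample)
        for name, value in performance(ref_s, code_s).items():
            if value == value:  # skip NaN from an empty denominator
                draws[name].append(value)
    ci = {}
    for name, values in draws.items():
        if not values:
            ci[name] = (float("nan"), float("nan"))
            continue
        values.sort()
        last = len(values) - 1
        ci[name] = (values[int(alpha / 2 * last)], values[int((1 - alpha / 2) * last)])
    return ci
```

Calling `performance(to_binary(labels), coded)` gives the primary analysis; passing `unmentioned_as_present=True` to `to_binary` gives the sensitivity analysis.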
Performance characteristics were calculated overall and stratified by subgroups, including SARS-CoV-2 test result, sex, age group (<50, 50-64, and >64 years), and inpatient status. SARS-CoV-2 test results and demographic characteristics were captured through routine documentation for clinical care and were extracted from documented values in the EDW. Patients could be classified as inpatient status in 1 of 2 ways: the test was performed in an inpatient unit or the patient was hospitalized within 14 days of the testing period. We chose this approach because testing frequently occurs in drive-through units, with distinct visit numbers. However, if a patient is ill enough, they may be hospitalized soon after under a different visit number. Clinically, these patients should be classified as inpatients, as a marker of disease severity, which was the rationale for our approach.
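The 2-way inpatient rule can be expressed as a small function. This sketch assumes that "within 14 days" means an admission on or after the test time and no later than 14 days afterward; the function name and arguments are hypothetical.

```python
from datetime import datetime, timedelta


def is_inpatient(test_time, tested_in_inpatient_unit, admission_times, window_days=14):
    """Return True if the patient counts as an inpatient for the analysis.

    Rule 1: the test was performed in an inpatient unit.
    Rule 2: the patient was hospitalized within the window after testing
            (assumed here to run from the test time to test time + window_days).
    """
    if tested_in_inpatient_unit:
        return True
    window_end = test_time + timedelta(days=window_days)
    return any(test_time <= admitted <= window_end for admitted in admission_times)


if __name__ == "__main__":
    test = datetime(2020, 3, 15, 10, 0)
    # Drive-through test followed by admission 5 days later under another visit.
    print(is_inpatient(test, False, [datetime(2020, 3, 20, 8, 0)]))
```

Matching on the patient rather than the visit number is what allows a drive-through test and a later admission to be linked.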
We compared the observed number of false-positive, false-negative, true-positive, and true-negative ICD-10-based classifications for each subgroup using 1-sided Pearson χ² tests. P < .05 was considered significant, and all analyses were performed using Python statistical software version 3.6 (Python). Data analysis was performed in April 2020.
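As an illustration of the subgroup comparison, the Pearson χ² statistic for a table of classification counts (one row per subgroup; columns for true-positive, false-positive, true-negative, and false-negative) can be computed as below. The function name `pearson_chi2` is hypothetical, and the sketch stops at the statistic and degrees of freedom; a library routine such as `scipy.stats.chi2_contingency` would also return the P value.

```python
def pearson_chi2(table):
    """Return (statistic, degrees of freedom) for an r x c table of counts.

    Assumes every row total and column total is positive, so that all
    expected counts are nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of subgroup and classification.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


if __name__ == "__main__":
    # Illustrative counts only (TP, FP, TN, FN per subgroup), not study data.
    print(pearson_chi2([[30, 5, 200, 40], [20, 4, 150, 60]]))
```

For 2 subgroups and 4 classification categories the test has 3 degrees of freedom.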

Results
High false-negative rates were the main contributor to poor ICD-10 code performance. The proportion of patients with a false-negative ICD-10 code result ranged from 35.8% for fever among patients older than 64 years to 54.5% for fever among patients who tested positive for SARS-CoV-2 infection.

Discussion
Symptoms are an essential part of data collection for SARS-CoV-2 and COVID-19 surveillance and research, but symptom-specific ICD-10 codes lack sensitivity and fail to capture many patients with relevant symptoms; the false-negative rate is unacceptably high. Common data models and other aggregation tools rely heavily on ICD-10 codes to capture clinical concepts; inaccuracy has implications for any downstream scientific discovery or surveillance. 10,11 For example, symptom surveillance could be important to detect subsequent waves of COVID-19, similar to the US Outpatient Influenza-Like Illness Surveillance Network. 12 A substantial number of patients would be missed if ICD-10 codes were used for this task.
ICD-10 codes are known to lack accuracy for clinical diagnoses and concepts. For example, ICD-10 codes perform poorly in identifying patients with atrial fibrillation, with a sensitivity of 88% and a specificity of 42%. 4 Similar inaccuracies have been reported for other conditions, such as stroke and acute kidney injury. 13,14 Our work represents clinician documentation of symptoms, and clinicians may not document all symptoms for all patients, particularly when patient volume is high or in drive-through testing scenarios. In other words, clinician documentation is not necessarily the "gold standard," but rather a reference standard. Other strategies include checklist-type data entry to support standardized data collection or capturing symptoms directly from the patient. Several public health agencies are developing smartphone applications that allow people to report symptoms directly to appropriate officials. 15 For health care systems, patient-reported outcomes may allow more reliable symptom capture, without reliance on billing codes or clinician documentation. 16
Our findings highlight the importance of quality control in COVID-19 data aggregation, which has become increasingly important with recent high-profile journal retractions. 17 Critical data elements require careful validation to ensure that discoveries translate into effective interventions that reduce morbidity and mortality. As with many aspects of this pandemic, we must pay careful attention to socioeconomically vulnerable populations, including racial minorities, rural patients, and low-income patients, for whom the gap between ICD-10 coding and clinical reality could be greater. 18,19

Limitations
This study has limitations that should be considered. Our study included only a single center; other centers may have different ICD-10 performance characteristics. Our study also uses data from early in the pandemic, and performance characteristics could change over time. Furthermore, as noted earlier, clinicians may not document all symptoms in every case. Although we did not adjust for multiple comparisons, ICD-10 code performance is so poor that adjustment is unlikely to alter the interpretation of these results. Each case was reviewed by a single individual; because of the low complexity of the studied concepts (presence or absence of fever, cough, and dyspnea), a single-reviewer system is likely sufficient in this context. In addition, the reviewed cases were not selected randomly but rather in nearly real time as the pandemic situation evolved. This approach could introduce a bias but, again, given how poorly the codes perform, we doubt that a randomly selected sample would alter the results. Still, future studies should prespecify a plan for data validation, with a focus on sampling racial and ethnic minorities to ensure generalizable results.

Conclusions
Rapid access to well-characterized, large SARS-CoV-2 and COVID-19 cohorts is critical for scientific discovery. ICD-10 codes are a standard terminology and are attractive for data aggregation because they are uniformly used among health care systems. However, these codes perform poorly in capturing COVID-19-related symptoms. Our findings highlight the critical need for meticulous data validation to feed multicenter registries built from EMRs. Reliable, accurate data are the foundation of scientific discovery; the right data lead to the right solutions.