Bennett et al1 present the first study to use the National COVID Cohort Collaborative (N3C) database. They studied factors associated with COVID-19 severity using machine learning (ML) methods to predict severe disease outcomes with the goal of informing clinical decision-making and health policy. The strongest ML predictor of severity was pH level at hospital admission. A multivariable logistic regression model identified associations between age, male sex, liver disease, dementia, African American and Asian race, and obesity with severe outcomes.
The N3C aggregates electronic health record (EHR) data for patients who test positive for COVID-19 and controls who test negative from multiple health systems across the US. Data are translated into a common data model and stored in a central repository accessible to authorized researchers. A collaboration of this scale required navigating substantial technical, legal, security, privacy, governance, and analytic challenges, some of which are unique to the fragmented US health system. The study demonstrates how a large-scale centralized resource can facilitate application of ML methods and rapid analytic iteration.2 The N3C represents an important achievement and contribution to COVID-19 research.
Aggregating EHR data requires caution to ensure researchers match the data and methods to the intended uses. Appropriate research use of EHR data within a single institution requires local expertise and deep content knowledge to understand the often idiosyncratic way that EHR data are captured and stored. Multisite EHR-based research can amplify known challenges of using EHR data, such as cross-institutional differences in data capture and workflows, use of standards, and missingness.3 Combining data across institutions creates complex issues related to standardization, obfuscates site heterogeneity, and, most importantly, disconnects the data from the people who know them best.
The N3C has an impressive data ingestion pipeline, but the nature of EHR data makes it exceedingly difficult to identify and control for underlying cross-site heterogeneity. Characterizing and describing the heterogeneity of a multisite data resource requires substantial and ongoing investigation that cannot be automated. Reliable standardization requires deep content knowledge from each site and ongoing curation to establish consistent meaning across sites and identify heterogeneity (eg, differences in the administration and documentation of COVID-19 test results or handling of nondefinitive laboratory test results) that may affect whether a site or entire data resource is appropriate to address specific questions.
Local differences in how data are stored are a major barrier to standardization. Laboratory tests provide a case in point. Despite the use of Logical Observation Identifiers Names and Codes, many institutions apply idiosyncratic labels to specify subtly different laboratory tests, so that cross-institutional analyses require manual adjudication and coordination by local experts working together across institutions to make the laboratory result data usable for multi-institutional research.4 These laboratory data issues can remain insidiously hidden from end users and left unaddressed in multisite studies. Moreover, laboratory test results demonstrate considerable ascertainment bias because practitioners may have unique criteria (often unknown to the researcher) regarding which laboratory tests are ordered in which settings and for which patients, potentially creating a situation in which data within systems and across systems are not missing at random.5 As a result, the ordering of a test and its result can be markers for the presence of an outcome of interest rather than predictors. In our experience with numerous multisite data networks, each data refresh can require weeks or months of collaboration between the coordinating center and site data scientists to create a reasonably curated data set that meets minimum standards. Even then, some data domains (eg, laboratory test results, prescribing patterns, and vital signs) remain unusable for multisite research without substantial additional data quality review and standardization.
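The ascertainment problem described above can be made concrete with a toy simulation (the numbers are entirely hypothetical and not drawn from the N3C): if clinicians order a test more often for sicker patients, the mere presence of a result predicts the outcome even when the measured value carries no signal at all.

```python
import random

random.seed(0)

# Hypothetical cohort: the severe outcome occurs in 20% of patients.
n = 100_000
patients = [{"severe": random.random() < 0.20} for _ in range(n)]

# Ordering is informative: clinicians test 90% of severe patients but
# only 30% of nonsevere patients. The test value itself is pure noise.
for p in patients:
    ordered = random.random() < (0.90 if p["severe"] else 0.30)
    p["result"] = random.gauss(0, 1) if ordered else None  # None = never ordered

tested = [p for p in patients if p["result"] is not None]
untested = [p for p in patients if p["result"] is None]

rate_tested = sum(p["severe"] for p in tested) / len(tested)
rate_untested = sum(p["severe"] for p in untested) / len(untested)

# The gap arises purely from who was tested, not from any test value:
# the data are missing not at random.
print(f"severe outcome rate among tested patients:   {rate_tested:.2f}")
print(f"severe outcome rate among untested patients: {rate_untested:.2f}")
```

Under these assumed ordering rates, roughly 43% of tested patients have the severe outcome versus about 3% of untested patients, so a model given only "a result exists" would appear predictive despite the value being noise.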
Data missingness presents a further challenge to EHR-based research. The experience of DeLozier et al6 modeling COVID-19 data at a single institution revealed that the pandemic created difficulties because it brought an influx of new patients who lacked the retrospective data ordinarily used to determine comorbidities; this limitation is highlighted in the study by Bennett et al,1 in which only 49% of patients had enough data on preexisting health conditions to allow calculation of comorbidities. The pandemic also compromised electronic data capture because paper forms were used early in the pandemic to record demographic and health characteristics of patients undergoing SARS-CoV-2 testing. Some variables may appear to be missing because they are recorded in variously coded fields and/or clinical notes. For example, patients with coded data will have supplemental oxygen documented in the data repository, whereas patients with information recorded only in clinical notes will look as though they did not receive supplemental oxygen. These documentation issues can introduce ascertainment bias that changes over time and are incredibly difficult to identify without local support and intensive and ongoing data curation. The current study highlights the issue because more than half of the ML model inputs were missing at rates of greater than 30%, including pH level (a main ML predictor of severity), which was 75% missing; some missing values were imputed, introducing additional concerns regarding whether the imputed values were based on site-specific or database-wide information.
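The imputation concern can be illustrated with a hypothetical two-site example (illustrative values only, not N3C data): when sites differ systematically, filling gaps with a database-wide mean pulls one site's imputed values toward the other site's distribution, whereas site-specific imputation preserves each site's own profile.

```python
import statistics

# Hypothetical pH values from two sites with systematically different
# populations; None marks a missing measurement.
site_a = [7.30, 7.32, None, 7.28, None]   # sicker inpatient population
site_b = [7.42, None, 7.44, 7.40, 7.41]   # healthier outpatient population

def impute(values, fill):
    """Replace missing entries with a single fill value."""
    return [fill if v is None else v for v in values]

observed = [v for v in site_a + site_b if v is not None]
pooled_mean = statistics.mean(observed)                      # database-wide mean
mean_a = statistics.mean(v for v in site_a if v is not None) # site A's own mean

# Database-wide imputation drags site A's missing values upward,
# toward the healthier site's distribution.
site_a_pooled = impute(site_a, pooled_mean)
# Site-specific imputation keeps site A's values near its own center.
site_a_local = impute(site_a, mean_a)

print(f"pooled fill value:  {pooled_mean:.3f}")
print(f"site A's own mean:  {mean_a:.3f}")
```

In this sketch the pooled fill value (about 7.37) sits well above site A's own mean (7.30), so database-wide imputation would make site A's sicker patients look systematically healthier than they are.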
The N3C dashboard7 helps illustrate some of the difficulties of understanding data combined across institutions. The data set has information on 5 million patients (1.2 million who tested positive for COVID-19), 2.6 billion laboratory results, 949 million medication records, and 257.8 million visits across 50 sites. These numbers raise questions regarding data documentation and transformation. Do patients really have 520 laboratory results, 190 medication records, and 52 health care visits per person since 2018? Or perhaps there is variation in how sites extracted and delivered data to the N3C. Given the database’s scale and the intended simultaneous use by multiple researchers, we encourage the N3C to describe how each data refresh at each site is curated and the associated cross-institutional findings to help investigators appropriately use the data.
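The back-of-the-envelope arithmetic behind these questions is simple; a minimal sketch using the dashboard totals cited above:

```python
# Per-patient averages implied by the N3C dashboard totals quoted above.
patients = 5_000_000
lab_results = 2_600_000_000
medication_records = 949_000_000
visits = 257_800_000

labs_per_patient = lab_results / patients           # 520.0
meds_per_patient = medication_records / patients    # ~189.8
visits_per_patient = visits / patients              # ~51.6

print(f"{labs_per_patient:.0f} laboratory results per patient")
print(f"{meds_per_patient:.0f} medication records per patient")
print(f"{visits_per_patient:.0f} health care visits per patient")
```

Averages this high over roughly a three-year window are what prompt the question of whether some sites delivered records far beyond the others, or extracted data differently.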
Responsibly using multisite data resources requires a deep understanding of the strengths and limitations of the data sources. The N3C data are well suited for studies that focus on observations during an inpatient stay and those that describe patients cross-sectionally. However, patients commonly move across health care systems, which creates discontinuity in data capture that makes EHR data poorly suited for studies that need to follow up patients across care settings and over time (eg, EHR data are not well suited to assess rates of subsequent hospitalization because discharged patients may not return to the same hospital).8 Sequelae of post–acute SARS-CoV-2 infection are another example of a topic poorly suited for EHR data because longitudinal capture will be incomplete.
The N3C is promoting transparency through use of open science principles and other community engagement activities. We encourage the N3C investigators to augment those efforts with transparency focused on practitioner-researchers who need information, such as study protocols, data curation processes, database metrics, design diagrams, and analytic specifications.9 As the N3C moves to a high-throughput clinical and epidemiologic research platform, adherence to reporting principles and guidelines from across disciplines will help build trust.
The N3C is a tribute to the dedication of individuals and institutions working together to solve difficult problems. The collaborative created a resource with potential to address important questions about COVID-19. Realizing that value requires viewing the data with caution. We have enumerated several potential issues with combining EHR data across sites, and it is difficult to know in advance which data issues will be problematic for which study; further studies are needed to understand the contribution of local expertise to robust and reproducible research. Getting the greatest value from the N3C will require a skeptical eye, transparency, and an acknowledgment of the limitations of EHR-based research and multisite data resources.
Published: July 13, 2021. doi:10.1001/jamanetworkopen.2021.17175
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Brown JS et al. JAMA Network Open.
Corresponding Author: Jeffrey S. Brown, PhD, Harvard Pilgrim Health Care Institute, 401 Park Dr, Boston, MA 02494 (email@example.com).
Conflict of Interest Disclosures: None reported.
Brown JS, Bastarache L, Weiner MG. Aggregating Electronic Health Record Data for COVID-19 Research—Caveat Emptor. JAMA Netw Open. 2021;4(7):e2117175. doi:10.1001/jamanetworkopen.2021.17175