On the Usage of Combined Data Structures to Study COVID-19 in Understudied Populations | Health Informatics | JAMA Network Open | JAMA Network
Invited Commentary
Health Informatics
June 11, 2021

On the Usage of Combined Data Structures to Study COVID-19 in Understudied Populations

Author Affiliations
  • 1Precision Health Informatics Section, Center for Precision Health Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland
JAMA Netw Open. 2021;4(6):e2112874. doi:10.1001/jamanetworkopen.2021.12874

Bourgeois et al1 demonstrate the utility of electronic health record (EHR) data structures for systematically studying an otherwise understudied population during an ongoing pandemic. They also provide an example of how to perform such analyses using data stored in different data models across different countries. Through the Consortium for Clinical Characterization of COVID-19 by EHR (4CE), data from 27 hospitals in 6 countries (drawn from a larger consortium of 351 hospitals in 7 countries) were combined to study COVID-19–associated clinical outcomes in the pediatric population, uncovering findings of elevated markers of inflammation, evidence of abnormalities in coagulation, cardiac arrhythmias, viral pneumonia, and respiratory failure. This work adds to the knowledge of the manifestations of COVID-19 in children and youth that have been previously studied in systematic reviews.2,3

The ongoing COVID-19 pandemic has affected how we live and work in unprecedented ways. From widespread stay-at-home orders and mask mandates to working and instructing our children from home, the COVID-19 outbreak has profoundly affected every part of the globe. One notable facet of this pandemic is the amount of research that has been undertaken to address COVID-19. As of April 2021, a PubMed search for COVID-19 returned more than 110 000 indexed results, illustrating the vast amount of COVID-19–related research undertaken in the past several months. Much of this research is cooperative among disparate entities; from the public release of the initial genetic sequence of SARS-CoV-2 to the collaborative development of vaccines, collaboration has the potential to save lives in a timely manner.

Collaborative research at a national or global scale can assist not only in the development of therapeutics but also in understanding the natural course of disease in otherwise understudied populations. For example, as Bourgeois et al1 note, much remains unknown about how COVID-19 affects children and youth, in part because of the challenges associated with including minors in clinical trials. In a narrative synthesis of pediatric COVID-19 evidence, Mehta et al2 note that clinical data are scarce in much of the current literature, which points to a much-needed area of study among pediatric patients with COVID-19. Despite these difficulties, it is important to study the clinical course of disease in all populations. One natural data source for studying clinical outcomes among understudied populations is EHRs, which comprise structured and unstructured data collected as part of routine clinical care.

Bourgeois et al1 write that “sites executed queries on local clinical data warehouses containing patient-level EHR data. To construct the required data files, sites used the Informatics for Integrating Biology & the Bedside platform, the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), Epic Clarity, or other clinical data warehouses.”1 EHR standards, such as OMOP, provide a common, standardized abstraction that translates site-specific medical concepts to a shared vocabulary so that researchers may leverage data from multiple contributing sources. For example, in the case of OMOP, site-specific source condition concepts (eg, International Statistical Classification of Diseases and Related Health Problems, Tenth Revision [ICD-10], or even nonstandard hospital-specific codes) are converted to the Systematized Nomenclature of Medicine (SNOMED) ontology as a common vocabulary. However, merging data sources at the row level at such a scale is inherently difficult because individual countries may have differing data-sharing policies, and there is still heterogeneity among the CDMs adopted by each health care system, as is evident in the article by Bourgeois et al.1
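
The idea of translating site-specific codes to a common vocabulary can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OMOP tooling itself: the mapping table, the `LOCAL_SITE_A` vocabulary, the `PNA-V` code, and the `STD:` concept labels are all hypothetical placeholders (they are not real SNOMED identifiers), and the ICD-10 codes are included purely as examples.

```python
from dataclasses import dataclass

# Hypothetical mapping table: (source vocabulary, source code) -> standard concept.
# The "STD:" labels are illustrative placeholders, not real SNOMED identifiers.
SOURCE_TO_STANDARD = {
    ("ICD10", "J12.9"): "STD:viral_pneumonia",
    ("ICD10", "J96.0"): "STD:acute_respiratory_failure",
    ("LOCAL_SITE_A", "PNA-V"): "STD:viral_pneumonia",  # made-up hospital code
}


@dataclass(frozen=True)
class ConditionRecord:
    patient_id: str
    vocabulary: str
    code: str


def harmonize(records):
    """Split records into (patient_id, standard concept) pairs and unmapped records."""
    mapped, unmapped = [], []
    for record in records:
        concept = SOURCE_TO_STANDARD.get((record.vocabulary, record.code))
        if concept is None:
            # Keep unmapped records visible rather than silently dropping them.
            unmapped.append(record)
        else:
            mapped.append((record.patient_id, concept))
    return mapped, unmapped
```

In this sketch, two sites that record viral pneumonia with different source codes end up with the same standard concept, which is what allows their data to be analyzed together. Returning the unmapped records separately is a design choice that makes gaps in the mapping easy to audit.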

The combination and standardization of EHR data to study COVID-19 is becoming more common. In addition to the 4CE consortium that contributed to the study by Bourgeois et al,1 there are other examples of research initiatives using CDMs that have specifically addressed COVID-19–related outcomes.4 These resources have the potential to allow for scientific study in understudied populations. Notably, the National COVID Cohort Collaborative (N3C) “aims to aggregate and harmonize EHR data across clinical organizations in the United States, and is a novel partnership that includes the Clinical and Translational Science Awards Program hubs (60 institutions), the National Center for Advancing Translational Science, the Center for Data to Health and the community.”5 Specifically, N3C contains combined EHR information, including medications, procedures, and conditions, resulting in a data set that comprised more than 1.4 billion rows and more than 200 000 patients with COVID-19 as of November 11, 2020. The initiative states that its primary features “are national collaboration and governance, regulatory strategies, COVID-19 cohort definitions via community-developed phenotypes, data harmonization across 4 CDMs, and development of a collaborative analytics platform to support deployment of novel algorithms of data aggregated from the United States.”5 Additionally, the All of Us Research Program (All of Us),6 which aims to recruit at least 1 million people to further precision medicine, is another example of a large-scale cohort containing EHR information in understudied populations. To address the pandemic, All of Us administered the COVID Participant Experience (COPE) survey from 2020 to early 2021.7 Both N3C and All of Us implemented cloud-based data access models, allowing researchers to work with a central access point to the combined data.

While data standardization moves toward ubiquity, it is necessary to ensure research quality by adopting standards of practice for how we combine EHR data across sites and by recognizing which kinds of analyses are valid when using combined data. In the study by Bourgeois et al,1 the authors overcome the problem of combining data from differing CDMs by settling on the site-specific extraction of ICD-10 code counts (some of which are additionally subject to bucketing restrictions to preserve privacy). Although the reporting of binned counts precludes row-level analysis that can adjust for confounding variables, such as EHR length, the method illustrates the delicate trade-off necessary when considering analyses at this scale. While it is necessary to extend and improve upon foundational computational infrastructure to enable rapid scientific inquiry, especially for understudied groups such as the pediatric population, the maintenance of data security, privacy, and research integrity is a natural consideration. Bourgeois et al1 demonstrate that proper planning and effective definition of relevant quantities can lead to important findings in an understudied population while maintaining such standards.
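
The trade-off described above can be made concrete with a minimal Python sketch of site-level count extraction followed by pooling. It is an illustration under stated assumptions and not the procedure used by Bourgeois et al: the function names, the `min_cell` threshold, and the choice to mask small counts as `None` are all hypothetical, since the exact bucketing rules are not described here.

```python
from collections import Counter


def site_code_counts(patient_codes, min_cell=10):
    """Count distinct patients per ICD-10 code at one site.

    patient_codes maps patient_id -> iterable of codes. Counts below
    min_cell (a hypothetical threshold) are masked as None so that
    small cells never leave the site.
    """
    counts = Counter()
    for codes in patient_codes.values():
        counts.update(set(codes))  # each patient counts once per code
    return {code: (n if n >= min_cell else None) for code, n in counts.items()}


def pool_sites(site_tables):
    """Sum unmasked counts across sites and record how many sites masked each code."""
    totals, masked_sites = Counter(), Counter()
    for table in site_tables:
        for code, n in table.items():
            if n is None:
                masked_sites[code] += 1
            else:
                totals[code] += n
    return dict(totals), dict(masked_sites)
```

Only the aggregated tables cross site boundaries in this sketch, so no patient-level rows are shared. The cost is the one noted above: once counts are pooled, there is no way to adjust for patient-level confounders, and masked cells make the pooled totals lower bounds rather than exact figures.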

Article Information

Published: June 11, 2021. doi:10.1001/jamanetworkopen.2021.12874

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Schlueter DJ. JAMA Network Open.

Corresponding Author: David Jeffrey Schlueter, PhD, Precision Health Informatics Section, Center for Precision Health Research, National Human Genome Research Institute, National Institutes of Health, 50 S Dr, Bethesda, MD 20892 (david.schlueter@nih.gov).

Conflict of Interest Disclosures: None reported.

Funding/Support: This research was supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.

Role of the Funder/Sponsor: The funder had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Disclaimer: The opinions expressed here are those of the author(s) and do not necessarily represent the view or policies of the institution to which they are affiliated.

1. Bourgeois FT, Gutiérrez-Sacristán A, Keller MS, et al; Consortium for Clinical Characterization of COVID-19 by EHR (4CE). International analysis of children hospitalized with COVID-19: leveraging 4CE electronic health record data across 27 hospitals in 6 countries. JAMA Netw Open. 2021;4(6):e2112596. doi:10.1001/jamanetworkopen.2021.12596
2. Mehta NS, Mytton OT, Mullins EWS, et al. SARS-CoV-2 (COVID-19): what do we know about children? a systematic review. Clin Infect Dis. 2020;71(9):2469-2479. doi:10.1093/cid/ciaa556
3. Perikleous E, Tsalkidis A, Bush A, Paraskakis E. Coronavirus global pandemic: an overview of current findings among pediatric patients. Pediatr Pulmonol. 2020;55(12):3252-3267. doi:10.1002/ppul.25087
4. Dagliati A, Malovini A, Tibollo V, Bellazzi R. Health informatics and EHR to support clinical research in the COVID-19 pandemic: an overview. Brief Bioinform. 2021;22(2):812-822. doi:10.1093/bib/bbaa418
5. Haendel MA, Chute CG, Bennett TD, et al; N3C Consortium. The National COVID Cohort Collaborative (N3C): rationale, design, infrastructure, and deployment. J Am Med Inform Assoc. 2021;28(3):427-443. doi:10.1093/jamia/ocaa196
6. Denny JC, Rutter JL, Goldstein DB, et al; All of Us Research Program Investigators. The “All of Us” research program. N Engl J Med. 2019;381(7):668-676. doi:10.1056/NEJMsr1809937
7. All of Us Research Program. Coronavirus. Accessed March 23, 2021. https://www.joinallofus.org/coronavirus