Invited Commentary
Health Informatics
August 31, 2018

Subject Matter Knowledge in the Age of Big Data and Machine Learning

Author Affiliations
  • 1 Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina
  • 2 Center for Predictive Medicine, Duke Clinical Research Institute, Durham, North Carolina
  • 3 Department of Civil and Environmental Engineering, Duke University Pratt School of Engineering, Durham, North Carolina
  • 4 Department of Medicine, Duke University School of Medicine, Durham, North Carolina
JAMA Netw Open. 2018;1(4):e181568. doi:10.1001/jamanetworkopen.2018.1568

The traditional paradigm of clinical research involves the analysis of well-curated data sets. In its ideal form, the theoretical underpinnings of associations between exposures and outcomes are tested by collecting and analyzing data against a priori hypotheses. As clinical research catches up with other fields and finds itself immersed in the era of big data, the opportunity to apply more computational and data-driven techniques increases. While these techniques date back to neural networks proposed in the 1950s, it is only with recent advances in computing hardware that their full potential has been realized. Machine learning and, most recently, deep learning have become the standard bearers for modern computational methods. These approaches were first used in nonmedical fields where data were readily available, and now they are leveraged to conduct clinical research. Currently, deep learning models are being developed to analyze clinical notes,1 assess radiologic images,2 and predict clinical outcomes.3

The study by Maharana and Nsoesie4 shows how deep learning methods can be used to discover features of the built environment (ie, both natural and modified elements of the physical environment). The authors analyzed 150 000 satellite images from 6 US cities. They used convolutional neural networks, a form of deep learning, to extract features of interest. They then related these features to obesity rates in different neighborhoods. In their analysis, they identified features previously known to be associated with obesity, such as green spaces,5 as well as features not previously reported, such as the prevalence of pet stores.

As deep learning methods become more ingrained in clinical research, important questions arise regarding the role of subject matter expertise. Can we simply feed any and all available data into machine learning algorithms and obtain reasonable insights? What is the best way to generate new hypotheses? Do big data make experimentation unnecessary?

Our belief is that as deep learning becomes more powerful, subject matter expertise will become more, not less, essential. It is necessary to have an intricate understanding of the relationships being modeled, lest we make erroneous “discoveries.” As Maharana and Nsoesie note, their identified relationships are not necessarily causal relationships and are potentially confounded by socioeconomic factors. Therefore, care must be taken not to overinterpret the results. Even so, in the same way a biomarker may serve as a useful indicator of disease risk, these neighborhood factors can serve as a valuable indicator of health outcomes. As we move into an age of clinical research based on electronic health records, the opportunity to integrate information from external sources, particularly those that speak to patients’ social environment, becomes more important. A significant body of literature supports the link between positive and negative aspects of the built environment and the health of its residents.6 To advance the field, we need approaches to identify specific aspects of the neighborhood that may have an impact on health; methods to include longitudinal changes in exposure, outcome, and individual-level residence; and a more nuanced understanding of the interaction of multiple aspects of the built environment that may have an impact on health. To this end, there is a growing consensus across the spectrum of stakeholders, including academic researchers, health care professionals, and payers, that to fully understand the components of healthy living, electronic health records must incorporate aspects of the neighborhood.7

More generally, this work points to how big data and machine learning can be integrated into clinical research. There is a wealth of publicly available information to be explored. This includes information not only on the built environment but also on climate and weather data available through government sources and health behavior data available through social media. Integrating these sources with clinical data can provide new insights into the health consequences of air pollution,8 identify drug interactions via user web searches,9 or, in the present case, elucidate the relationship between the built environment and obesity. By integrating sophisticated analytic techniques such as deep learning with subject matter knowledge, we increase the opportunity to uncover more intricate relationships.

However, this does not mean analysis alone can provide all of the answers. At their core, these analytic techniques only point to features, and providing meaning to them requires subject matter insight. While there have been efforts to provide interpretability to machine learning models,10 these analyses will likely always require humans to understand them. As the authors suggest in their discussion, a naive analysis of the data could lead to conclusions that are not supported by the data and that mistake markers of obesity for causal factors.

So, what is the role of big data and machine learning in clinical research? As the study by Maharana and Nsoesie highlights, big data and machine learning have become integral to the discovery and hypothesis generation process. By using an agnostic, data-driven approach and “discovering” features already known, Maharana and Nsoesie illustrate the value of using machine learning to develop hypotheses. Going forward, it is likely that machine learning methods will be integral to discovering features associated with disease—likely features never previously suspected. However, the work does not end here. We will still need well-designed studies and experiments as well as well-curated data sets to confirm those insights.

Article Information

Published: August 31, 2018. doi:10.1001/jamanetworkopen.2018.1568

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2018 Goldstein BA et al. JAMA Network Open.

Corresponding Author: Benjamin A. Goldstein, PhD, Department of Biostatistics and Bioinformatics, Duke University, 2424 Erwin Rd, Ste 11041, Durham, NC 27705 (ben.goldstein@duke.edu).

Conflict of Interest Disclosures: Dr Carlson reported grants from the National Institutes of Health, Stylli Translational Neuroscience Award, and Marcus Foundation. No other disclosures were reported.

References

1. Chen MC, Ball RL, Yang L, et al. Deep learning to classify radiology free-text reports. Radiology. 2018;286(3):845-852. doi:10.1148/radiol.2017171115
2. McBee MP, Awan OA, Colucci AT, et al. Deep learning in radiology [published online March 29, 2018]. Acad Radiol. doi:10.1016/j.acra.2018.02.018
3. Xiao C, Ma T, Dieng AB, Blei DM, Wang F. Readmission prediction via deep contextual embedding of clinical concepts. PLoS One. 2018;13(4):e0195024. doi:10.1371/journal.pone.0195024
4. Maharana A, Nsoesie EO. Use of deep learning to examine the association of the built environment with prevalence of neighborhood adult obesity. JAMA Netw Open. 2018;1(4):e181535. doi:10.1001/jamanetworkopen.2018.1535
5. Lachowycz K, Jones AP. Greenspace and obesity: a systematic review of the evidence. Obes Rev. 2011;12(5):e183-e189. doi:10.1111/j.1467-789X.2010.00827.x
6. Northridge ME, Sclar ED, Biswas P. Sorting out the connections between the built environment and health: a conceptual framework for navigating pathways and planning healthy cities. J Urban Health. 2003;80(4):556-568. doi:10.1093/jurban/jtg064
7. Hughes LS, Phillips RL Jr, DeVoe JE, Bazemore AW. Community vital signs: taking the pulse of the community while caring for patients. J Am Board Fam Med. 2016;29(3):419-422. doi:10.3122/jabfm.2016.03.150172
8. Di Q, Wang Y, Zanobetti A, et al. Air pollution and mortality in the Medicare population. N Engl J Med. 2017;376(26):2513-2522.
9. White RW, Tatonetti NP, Shah NH, Altman RB, Horvitz E. Web-scale pharmacovigilance: listening to signals from the crowd. J Am Med Inform Assoc. 2013;20(3):404-408. doi:10.1136/amiajnl-2012-001482
10. Zeng J, Ustun B, Rudin C. Interpretable classification models for recidivism prediction. J R Stat Soc Ser A Stat Soc. 2017;180(3):689-722. doi:10.1111/rssa.12227