The traditional paradigm of clinical research involves the analysis of well-curated data sets. In its ideal form, the theoretical underpinnings of associations between exposures and outcomes would be evaluated by collecting and analyzing data to evaluate a priori hypotheses. As clinical research catches up with other fields and finds itself immersed in the era of big data, the opportunity to apply more computational and data-driven techniques increases. While these techniques date back to neural networks proposed in the 1950s, it is only with recent advances in computing hardware that their full potential has been realized. Machine learning and, most recently, deep learning have become the standard bearers for modern computational methods. These approaches were first used in nonmedical fields where data were readily available, and now they are leveraged to conduct clinical research. Currently, deep learning models are being developed to analyze clinical notes,1 assess radiologic images,2 and predict clinical outcomes.3
The study by Maharana and Nsoesie4 shows how deep learning methods can be used to discover features of the built environment (ie, both natural and modified elements of the physical environment). The authors analyzed 150 000 satellite images from 6 US cities. They used convolutional neural networks—a form of deep learning—to extract features of interest. They then related these features to obesity rates in different neighborhoods. In their analysis, they identified features previously known to be associated with obesity, such as green spaces,5 and identified features not previously identified, such as prevalence of pet stores.
As deep learning methods become more ingrained in clinical research, important questions arise regarding the role of subject matter expertise. Can we simply feed any and all available data into machine learning algorithms and obtain reasonable insights? What is the best way to generate new hypotheses? Do big data make experimentation unnecessary?
Our belief is that as deep learning becomes more powerful, subject matter expertise will become more, not less, essential. It is necessary to have an intricate understanding of the relationships being modeled, lest we make erroneous “discoveries.” As Maharana and Nsoesie note, their identified relationships are not necessarily causal relationships and are potentially confounded by socioeconomic factors. Therefore, care must be taken not to overinterpret any results. Even so, in the same way a biomarker may serve as a useful indicator of disease risk, these neighborhood factors can serve as a valuable indicator of health outcomes. As we move into an age of clinical research based on electronic health records, the opportunity to integrate information from external sources, particularly those that speak to patients’ social environment, becomes more important. A significant body of literature supports the link between positive and negative aspects of the built environment and the health of its residents.6 To advance the field, we need approaches to identify specific aspects of the neighborhood that may have an impact on health, methods to include longitudinal changes in exposure, outcome, and individual-level residence, and a more nuanced understanding of the interaction of multiple aspects of the built environment that may have an impact on health. To this end, there is a growing consensus across the spectrum of stakeholders, including academic researchers, health care professionals, and payers, that to fully understand the components of healthy living, electronic health records must incorporate aspects of the neighborhood.7
More generally, this work points to how big data and machine learning can be integrated into clinical research. There is a wealth of publicly available information to be explored. This includes information not only on the built environment but also on climate and weather data available through government sources and health behavior data available through social media. Integrating these sources with clinical data can provide new insights into the health consequences of air pollution,8 identify drug interactions via user web searches,9 or, in the present case, elucidate the relationship between the built environment and obesity. By integrating sophisticated analytic techniques such as deep learning with subject matter knowledge, we increase the opportunity to uncover more intricate relationships.
However, this does not mean analysis alone can provide all of the answers. At their core, these analytic techniques only point to features, and providing meaning to them requires subject matter insight. While there have been efforts to provide interpretability to machine learning models,10 these analyses will likely always require humans to understand them. As the authors suggest in their discussion, a naive analysis of the data could lead to conclusions that are not supported by the data and that mistake markers of obesity for causal factors.
So, what is the role of big data and machine learning in clinical research? As the study by Maharana and Nsoesie highlights, big data and machine learning have become integral to the discovery and hypothesis generation process. By using an agnostic, data-driven approach and “discovering” features already known, Maharana and Nsoesie illustrate the value of using machine learning to develop hypotheses. Going forward, it is likely that machine learning methods will be integral to discovering features associated with disease—likely features never previously suspected. However, the work does not end here. We will still need well-designed studies and experiments as well as well-curated data sets to confirm those insights.
Published: August 31, 2018. doi:10.1001/jamanetworkopen.2018.1568
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2018 Goldstein BA et al. JAMA Network Open.
Corresponding Author: Benjamin A. Goldstein, PhD, Department of Biostatistics and Bioinformatics, Duke University, 2424 Erwin Rd, Ste 11041, Durham, NC 27705 (email@example.com).
Conflict of Interest Disclosures: Dr Carlson reported grants from the National Institutes of Health, Stylli Translational Neuroscience Award, and Marcus Foundation. No other disclosures were reported.
Goldstein BA, Carlson D, Bhavsar NA. Subject Matter Knowledge in the Age of Big Data and Machine Learning. JAMA Netw Open. 2018;1(4):e181568. doi:10.1001/jamanetworkopen.2018.1568