Invited Commentary
March 8, 2019

Using Machine Learning to Identify Heterogeneous Effects in Randomized Clinical Trials—Moving Beyond the Forest Plot and Into the Forest

Author Affiliations
  • 1Department of Biostatistics & Bioinformatics, Duke University School of Medicine, Durham, North Carolina
  • 2Quantitative Sciences Unit, Stanford University School of Medicine, Palo Alto, California
JAMA Netw Open. 2019;2(3):e190004. doi:10.1001/jamanetworkopen.2019.0004

In the reporting of randomized clinical trial results, it is standard practice to show a forest plot, which indicates potential effect heterogeneity across subgroups. In many respects, the hope is that the analysis is null, allowing one to report the average treatment effect and create uniform treatment recommendations. In truth, it is rare for a treatment effect to be perfectly homogeneous. Often such heterogeneity consists of quantitative interactions: interactions where the effects across subgroups go in the same direction but simply have different magnitudes.1 In these cases, the ability to detect an interaction is merely a function of sample size (ie, power), and one can argue that the heterogeneity is not really of interest. However, there are also examples of true qualitative interactions: effects where 1 subgroup has either no treatment effect or an effect that goes in the opposite direction.2
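The distinction between the 2 kinds of interaction can be made concrete with a small sketch. The function below is a hypothetical illustration, not a method from the cited work: it looks only at the direction of point estimates (log hazard ratios) and ignores sampling uncertainty, so it is not a substitute for a formal test of qualitative interaction.

```python
from math import log


def classify_interaction(hazard_ratios, tol=1e-9):
    """Label subgroup hazard ratios by the direction of their log effects.

    Hypothetical helper for illustration only: it uses point estimates
    and ignores confidence intervals entirely.
    """
    logs = [log(hr) for hr in hazard_ratios]
    directions = set()
    for value in logs:
        if abs(value) <= tol:
            directions.add(0)    # no treatment effect (hazard ratio of 1)
        elif value > 0:
            directions.add(1)    # hazard ratio above 1
        else:
            directions.add(-1)   # hazard ratio below 1
    if len(directions) > 1:
        # A subgroup with no effect, or an effect in the opposite direction.
        return "qualitative"
    if max(logs) - min(logs) > tol:
        # Same direction everywhere, different magnitudes.
        return "quantitative"
    return "homogeneous"


print(classify_interaction([0.6, 0.8]))  # same direction, different size
print(classify_interaction([0.7, 1.4]))  # opposite directions
```

In practice, the estimates in each subgroup carry uncertainty, which is why apparent sign changes in small subgroups need formal testing before they are interpreted.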

Identification of heterogeneous treatment effects typically involves 2 separate but associated goals: (1) identification of subgroups with different treatment effects and (2) estimation of individual treatment effects, ie, individual risk prediction. Machine learning methods have recently been developed and applied to health data to address each of these goals.3,4 Broadly, machine learning refers to methods that make no a priori assumptions about the form of the model. Instead, the algorithm (ie, machine) is able to find the best model (ie, learn). While such approaches often require more care through the use of training and testing sets, they also allow the investigator to discover things that may not be known otherwise.

Scarpa et al5 present a study that applies machine learning to detect subgroups with different treatment effects. The authors applied the relatively new approach of causal forest analysis,6 an extension of the now classic random forests algorithm. In causal forest analyses, many decision trees are constructed by dividing the data into subgroups of individuals with similar treatment effect estimates. The final nodes of each tree, referred to as the leaves, each constitute a subpopulation with a similar event rate. For example, a leaf may be men aged 60 years or older who smoke and whose blood pressure is higher than 140 mm Hg. By examining many such leaves, the forest is able to identify subgroups with different event rates. One can consider these different subgroups as representative of complex interactions.
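To illustrate what a single leaf represents, the sketch below encodes the example leaf described above as a rule and compares event rates between arms inside and outside it. This is not the forest algorithm itself, which learns its splits from the data across many trees. Here the rule is fixed by hand, the `Participant` fields are hypothetical names, and the 6 records are made-up toy data.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    male: bool
    age: int
    smoker: bool
    sbp: float     # baseline systolic blood pressure, mm Hg
    treated: bool  # True = intervention arm, False = control arm
    event: bool    # True = outcome event occurred


def in_example_leaf(p):
    """Men aged 60 years or older who smoke and have blood pressure > 140 mm Hg."""
    return p.male and p.age >= 60 and p.smoker and p.sbp > 140


def risk_difference(participants):
    """Event rate in treated minus event rate in control; None if an arm is empty."""
    treated = [p.event for p in participants if p.treated]
    control = [p.event for p in participants if not p.treated]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)


# Made-up toy records, for illustration only.
cohort = [
    Participant(True, 65, True, 150, True, True),
    Participant(True, 70, True, 145, True, False),
    Participant(True, 62, True, 160, False, False),
    Participant(True, 68, True, 142, False, False),
    Participant(False, 55, False, 130, True, False),
    Participant(False, 58, False, 128, False, True),
]

leaf = [p for p in cohort if in_example_leaf(p)]
rest = [p for p in cohort if not in_example_leaf(p)]
print("leaf:", risk_difference(leaf))
print("rest:", risk_difference(rest))
```

A forest repeats this kind of within-leaf comparison over many data-driven partitions, which is how it can surface subgroups defined by several characteristics at once.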

Scarpa et al5 applied causal forest analysis to the Systolic Blood Pressure Intervention Trial (SPRINT). The SPRINT results indicated that, among patients without diabetes, it is better to target systolic blood pressure control to a lower target (<120 mm Hg) than a higher target (<140 mm Hg).7 Moreover, based on a forest plot, the original trial did not report any subgroup heterogeneity. In their analysis, Scarpa et al5 did find effect heterogeneity. The causal forest analysis identified 5250 subgroups across the collection of trees. Of these, the majority (4911 [94%]) exhibited differential treatment benefits, ie, quantitative interactions. However, 339 (6%) of the subgroups identified exhibited harm from intensive treatment, ie, qualitative interactions. Most of these subgroups were associated with baseline risk factors for kidney and cardiovascular disease. After validation, including controls for multiple testing, the authors identified 1 subgroup that was associated with adverse events of treatment: current smokers with a baseline systolic blood pressure greater than 144 mm Hg. While most individuals benefited from treatment, these individuals had a greater risk of events (hazard ratio, 10.6; 95% CI, 1.3-86.1). Essentially, this suggests a 3-way interaction among smoking status, baseline systolic blood pressure, and treatment: an association that one would not know to look for and that could easily be masked in a forest plot. It is noteworthy that, based on the original forest plot, smokers alone had a slightly elevated but not significant event rate (hazard ratio, 1.65; 95% CI, 0.84-3.26), while those with higher systolic blood pressure had a nonsignificant treatment effect (hazard ratio, 0.75; 95% CI, 0.51-1.10). This is suggestive of the uncovered interaction. While this is inherently a post hoc analysis, given the degree of validation performed, the results also should not be easily dismissed.
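The figures quoted above can be checked with a few lines of arithmetic. The sketch below confirms the subgroup counts and percentages, then recovers an approximate standard error for the log hazard ratio from the reported interval. That second step assumes the interval is a symmetric Wald-type interval on the log scale, which the text does not state, so the derived quantities are rough.

```python
from math import erf, exp, log, sqrt

# Subgroup counts reported in the text.
total, benefit, harm = 5250, 4911, 339
pct_benefit = round(100 * benefit / total)
pct_harm = round(100 * harm / total)

# Reported hazard ratio and 95% CI for the validated subgroup.
hr, lower, upper = 10.6, 1.3, 86.1

# Assumption: the CI is log(hr) +/- 1.96 * SE on the log scale.
se_log_hr = (log(upper) - log(lower)) / (2 * 1.96)

# Under that assumption the interval should be centred on the estimate.
center = exp((log(lower) + log(upper)) / 2)

# Distance of the estimate from the null (hazard ratio of 1), in SEs,
# and the matching two-sided normal tail probability.
z = log(hr) / se_log_hr
p_two_sided = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

print(benefit + harm == total, pct_benefit, pct_harm)
print(round(se_log_hr, 2), round(center, 1), round(z, 2))
```

Under this assumption, the standard error of the log hazard ratio is about 1.07, which is large. The interval excludes 1, but the size of the harm is estimated very imprecisely, consistent with the commentary's call for prospective validation.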

This work highlights the role that machine learning can play in the analysis of clinical trials. Finding effect heterogeneity is a notoriously difficult statistical problem.8 Nonetheless, it is of great research interest to fully interrogate clinical trials to understand how effects may look in different subpopulations.9 As the authors show, machine learning can aid in these assessments. However, this does not mean that machine learning can replace thoughtful study design.10 Even with validation, any assessment of effect heterogeneity is prone to false discoveries, since the randomization is effectively broken within subgroups chosen after the fact. To validate identified subgroups, researchers have suggested checking treated vs control balance of baseline variables within each subgroup and have proposed a method that may identify balanced subgroups using matching plus classification and regression trees.3 Ultimately, prospective and targeted assessments will be necessary to truly validate these findings.
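The balance check mentioned above can be sketched with a standardized mean difference for 1 baseline variable within an identified subgroup. This is a generic illustration and not the matching-based procedure from reference 3. The function name and the toy ages are hypothetical, and the threshold of 0.1 is a common convention rather than a value taken from the text.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def standardized_mean_difference(treated_values, control_values):
    """Difference in arm means divided by the pooled standard deviation.

    Each arm needs at least 2 values so that a sample variance exists.
    """
    diff = mean(treated_values) - mean(control_values)
    pooled_sd = sqrt((variance(treated_values) + variance(control_values)) / 2)
    if pooled_sd == 0:
        # No spread in either arm: balanced only if the means coincide.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / pooled_sd


# Made-up baseline ages within one hypothetical subgroup.
treated_age = [70, 72, 74]
control_age = [60, 62, 64]

smd = standardized_mean_difference(treated_age, control_age)
# An absolute value above about 0.1 is often read as meaningful imbalance.
print(smd, abs(smd) > 0.1)
```

A large imbalance within a subgroup suggests that the treated and control participants there are no longer comparable, so an apparent subgroup effect may reflect baseline differences rather than treatment.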

One of the promises of personalized medicine is that treatments will be tailored to one’s particular set of clinical characteristics. Randomized clinical trials, as the current criterion standard for evaluation of treatment effects, have an important role to play in the realization of that vision. While forest plots are a reasonable first attempt to detect treatment heterogeneity, one should not enter into such an analysis hoping to conclude that the average treatment effect is sufficient. Instead, a paradigm shift is needed in which we embrace the underlying treatment heterogeneity and hope to discover subgroups that may or may not benefit from the therapy. Methods like causal forest analyses should be embraced, as they provide investigators with tools to find such effects. While this will ultimately make the provisioning of therapy more challenging, it also has the potential to make it more effective.

Article Information

Published: March 8, 2019. doi:10.1001/jamanetworkopen.2019.0004

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2019 Goldstein BA et al. JAMA Network Open.

Corresponding Author: Benjamin A. Goldstein, PhD, Department of Biostatistics & Bioinformatics, Duke University School of Medicine, 2424 Erwin Rd, Ste 9023, Durham, NC 27705 (ben.goldstein@duke.edu).

1. Gail M, Simon R. Testing for qualitative interactions between treatment effects and patient subsets. Biometrics. 1985;41(2):361-372. doi:10.2307/2530862
2. Mahaffey KW, Wojdyla DM, Carroll K, et al; PLATO Investigators. Ticagrelor compared with clopidogrel by geographic region in the Platelet Inhibition and Patient Outcomes (PLATO) trial. Circulation. 2011;124(5):544-554. doi:10.1161/CIRCULATIONAHA.111.047498
3. Rigdon J, Baiocchi M, Basu S. Preventing false discovery of heterogeneous treatment effect subgroups in randomized trials. Trials. 2018;19(1):382. doi:10.1186/s13063-018-2774-5
4. Lu M, Sadiq S, Feaster DJ, Ishwaran H. Estimating individual treatment effect in observational data using random forest methods. J Comput Graph Stat. 2017;27(1):209-219. doi:10.1080/10618600.2017.1356325
5. Scarpa J, Bruzelius E, Doupe P, Le M, Faghmous J, Baum A. Assessment of risk of harm associated with intensive blood pressure management among patients with hypertension who smoke: a secondary analysis of the Systolic Blood Pressure Intervention Trial. JAMA Netw Open. 2019;2(3):e190005. doi:10.1001/jamanetworkopen.2019.0005
6. Wager S, Athey S. Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc. 2018;113(523):1228-1242. doi:10.1080/01621459.2017.1319839
7. Wright JT Jr, Williamson JD, Whelton PK, et al; SPRINT Research Group. A randomized trial of intensive versus standard blood-pressure control. N Engl J Med. 2015;373(22):2103-2116. doi:10.1056/NEJMoa1511939
8. Kent DM, Steyerberg E, van Klaveren D. Personalized evidence based medicine: predictive approaches to heterogeneous treatment effects. BMJ. 2018;363:k4245. doi:10.1136/bmj.k4245
9. Goldstein BA, Phelan M, Pagidipati NJ, Holman RR, Pencina MJ, Stuart EA. An outcome model approach to transporting a randomized controlled trial results to a target population. https://arxiv.org/ftp/arxiv/papers/1806/1806.09692.pdf. Accessed February 13, 2019.
10. Goldstein BA, Carlson D, Bhavsar NA. Subject matter knowledge in the age of big data and machine learning. JAMA Netw Open. 2018;1(4):e181568. doi:10.1001/jamanetworkopen.2018.1568