Grieve et al1 investigate a statistically challenging question: estimating heterogeneous treatment associations. This question is at the core of the precision medicine movement, so methods like these are likely to become more common in the literature. This commentary does not engage the results of this particular study; instead, it aims to help readers become more comfortable critically engaging the statistical analysis in the study by Grieve et al.1
The need for new types of evidence (eg, personalized medicine’s need for estimating differential responses to treatment) requires sophisticated statistical methods. These new methods can seem opaque and can leave readers feeling absolved of their responsibility to engage critically. Although these sophisticated methods involve more steps, the fundamentals of good causal inference have not changed. The key question in critically engaging a study is still why we should believe that the groups being contrasted will help us understand the outcomes of treatment. This question can usually be decomposed into 2 parts. First, does the study design make the groups as similar as possible in terms of pretreatment covariates (ie, “control”)? Second, is there good reason to believe that the sorting of patients into the contrasting groups was done in ways that are unlikely to bias the estimate of association (ie, “randomness”)? Sections 1 and 4 in a 2012 study by Rosenbaum2 provide a wonderfully readable account of why these 2 questions are so central in assessing causal connections. To help us judge whether the groups are similar enough in the observed covariates, analysts usually provide a balance table that summarizes the means of covariates in the treatment and control groups. Grieve et al1 provide this covariate summary in the first table in their article. The second question is usually addressed by pointing to some source of “randomness” (the quotation marks indicate that I am using the term in its causal inference sense rather than in its most rigorous sense) in the assignment to the contrast groups. Unsurprisingly, the second question is the crux of most debates in observational studies and is where this commentary will spend most of its time.
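To make the balance-table idea concrete, here is a minimal sketch (not from the study itself; the covariates, group sizes, and data are hypothetical) of the standardized mean difference, a summary statistic commonly reported in balance tables; a conventional rough rule of thumb is that an absolute value below 0.1 indicates adequate balance on that covariate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical pretreatment covariates for the two contrast groups.
treated = {"age": rng.normal(65, 10, 200), "severity": rng.normal(6.0, 2, 200)}
control = {"age": rng.normal(64, 10, 300), "severity": rng.normal(6.2, 2, 300)}

def smd(a, b):
    """Standardized mean difference: mean difference in pooled-SD units."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

for name in treated:
    print(f"{name}: SMD = {smd(treated[name], control[name]):+.3f}")
```

Raw means (as in the study's first table) answer the same question less directly; the SMD puts every covariate on a common, unitless scale so imbalances can be compared across covariates.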
Grieve et al1 use an instrumental variable (IV) study design—a kind of natural experiment—that explicitly identifies a source of randomness in the assignment to treatment or control and then uses this randomness to reassure the reader that the sorting of patients was not systematically biased. One of the clearest discussions of the evidentiary strength of a natural experiment design comes from a 2014 study by Zubizarreta et al.3
Grieve et al1 use near-far matching to hone their natural experiment. The “near” part of near-far matching refers to matching patients based on their pretreatment covariates so that each patient has at least 1 patient in the contrasting group who is near in observed covariates (ie, “they look alike”). The “far” part of near-far matching refers to the preference that matched sets be constructed so that patients have quite different values of the IV; this notion is perhaps easiest to understand by using the authors’ example. In their study, near-far matching would tend to match 2 patients who looked alike when the care team was considering admission to the intensive care unit (ICU); however, one patient presented when there were many beds available in the ICU, while the other patient presented when there were no beds available in the ICU. The usefulness of designing a study in this way is that, if the 2 patients deviate in the care they receive, we might reasonably point to the availability of the beds as instrumental in how the patients’ type of care was determined, and we are thus less concerned that some unobserved, prognostically important difference between the 2 patients gave rise to the difference in care type. Near-far matching is similar to traditional matching but departs by incorporating an explicit measurement of randomness (ie, the “instrumental variable”), whereas traditional matching hinges on the assumption (also known as “hope”) that the unobserved variables that led to the treatment type acted randomly. The use of an explicit measurement of randomness is core to IV procedures.4
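The two competing goals of near-far matching can be sketched as an assignment problem. This is a toy illustration, not the authors' algorithm: the data are simulated, the single covariate and bed-count IV are hypothetical, and the trade-off weight `lam` is an arbitrary choice for the sketch. The cost of pairing two patients rewards covariate closeness (“near”) and IV separation (“far”); an optimal pairing then minimizes total cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Hypothetical data: 5 patients who presented when ICU beds were scarce
# (low IV) and 5 who presented when beds were plentiful (high IV).
# 'cov' is a pretreatment severity score; 'iv' is free ICU beds.
cov_a, iv_a = rng.normal(0, 1, 5), rng.integers(0, 3, 5)   # scarce-bed group
cov_b, iv_b = rng.normal(0, 1, 5), rng.integers(8, 12, 5)  # plentiful-bed group

# Near-far cost: penalize covariate distance ("near" -- want it small),
# reward IV separation ("far" -- want it large). lam trades off the two.
lam = 0.2
cost = (np.abs(cov_a[:, None] - cov_b[None, :])
        - lam * np.abs(iv_a[:, None] - iv_b[None, :]))

rows, cols = linear_sum_assignment(cost)  # cost-minimizing pairing
for i, j in zip(rows, cols):
    print(f"pair: severity {cov_a[i]:+.2f} vs {cov_b[j]:+.2f}, "
          f"beds {iv_a[i]} vs {iv_b[j]}")
```

Each printed pair is the toy analogue of the authors’ example: two patients who look alike clinically but faced very different bed availability at presentation.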
For the rest of this commentary, I focus on how to critically interrogate the IV approach and its assumptions. First, it is important to state that a system that is not well understood is not the same as a system that functions randomly. The reader benefits from the analyst’s use of an IV because the source of randomness is explicitly stated and can be critically interrogated. Let us interrogate the IV used in the study by Grieve et al.1 (Just to make sure you do not feel like there is some magic going on here: although the following critiques hopefully make sense from the context, the 3 points about to be raised are derived from the IV assumptions. Critical engagement is easiest if you are familiar with the assumptions on which the method relies. See the study by Baiocchi et al4 for more detailed discussions of the assumptions and common challenges to them.) In the study by Grieve et al,1 it makes sense that bed availability is likely to change the chances that a given patient will receive care in the ICU. This means the IV is likely useful in understanding how patients ended up getting the type of care we observed (a necessary assumption of the IV design). But our first challenge is whether bed availability is really “random”; that is, if knowing the number of beds available tells you something about the patient, then there are serious issues, and we cannot use bed availability as an IV. A second concern is whether bed availability itself directly changes the care the patient receives. Let us formulate these 2 concerns into context-specific challenges. Challenge 1: the availability of beds may vary in a systematic way over the course of a week (eg, perhaps patients tend to be discharged at high rates before the weekend), and the time of the week may also tell you about the patient (eg, weekend days may have more alcohol-related issues than weekdays). This kind of challenge calls into question the “randomness” that the IV is purported to have.
Challenge 2: perhaps the level of care changes due to the number of beds available (eg, as the number of beds occupied increases, the level of care delivered in the ICU goes down). This is a challenge to the “exclusion restriction” assumption (ie, the assumption that the IV is associated with the outcome only through its ability to change the treatment received).
A word about identifying these challenges: finding a conceivable challenge is not the same thing as destroying the validity of an analysis. It provides no benefit to the literature for a critic to postulate a challenge and then strut about as if an enemy has been vanquished. A useful critical assessment will offer a well-articulated challenge that (1) discusses how frequently the challenge occurs in the real world (eg, Is the challenge a rarity in practice? Are there conditions under which it happens quite a bit?) and (2) speculates about how impactful the challenge is (eg, Is this challenge roughly as impactful as the treatment under assessment?). Taken together, these 2 parts of the challenge should help readers judge whether the challenge biases the study away from the null or toward the null. Forming a well-articulated challenge is not trivial; if methodology is not your bailiwick, then it may require reaching out to a methodological expert (eg, your best biostatistician buddy) to help you judge how impactful the challenge is (this interdependency is not a failure of our system; instead, it is an example of Émile Durkheim’s “organic solidarity”). Once the challenges are formulated, it is the researchers’ job to muster counterarguments. Sometimes these counterarguments can be formalized, and the researchers can provide sensitivity analyses that use the data to quantify how much the challenge could affect their conclusions. For a general introduction to sensitivity analyses, see the 2013 study by Liu et al5; for a sensitivity analysis specific to near-far matching, see the 2010 study by Baiocchi et al.6 Other times, a challenge can be legitimately addressed by expert knowledge of how the system under study works in practice. Either way, the analysis is further grounded in its context, and the argumentation is improved.
The local IV (LIV) is another level of complexity on top of the usual IV study design. Grieve et al1 are right to deploy the LIV here because it has a lot of promise in the example under study. If one can find a reliable IV, then heterogeneous treatment associations are obtainable. The authors provide one of the literature’s clearest descriptions of how an LIV works. If you read through their description and feel comfortable, then you likely understand LIVs well enough to critique the plausibility of the required assumptions. If you read their excellent description of LIVs in the article and think you need more, then as a first approximation much of the intuition behind how an LIV works can be captured by considering what it would be like if several similar randomized clinical trials were run. In this hypothetical scenario, the difference between the trials is that each trial tries to recruit patients with different levels of clinical equipoise. Consider a study that targets patients whose characteristics mean they are at perfect clinical equipoise, namely, that any care team would be indifferent between putting these types of patients into treatment vs control. With this kind of patient, it would take little encouragement to get the patients to change their treatment type. Consequently, for example, this hypothetical study might not need to pay that much for patients to change their type of care. In contrast, consider a study that targets a patient population that many (although not all) care teams believe would benefit from treatment over control. A study with this kind of target population might need to encourage participants quite a bit to let the researchers assign the patient’s type of care: perhaps this hypothetical study would require paying patients quite a lot to allow themselves to be randomized.
The level of encouragement offered by these contrasting studies corresponds to the strength of the IV and is tied to whether the patients in the study will comply with the random assignment the researchers are using to get unbiased estimation of the treatment outcome. The LIV design looks at subgroups of patients with different levels of the IV and contrasts these subgroups in a way that (if the IV assumptions hold) is analogous to the hypothetical randomized trials described above. With this analogy for LIVs in mind, think of the differences that would emerge from these contrasting studies. Different kinds of patients are being enrolled in these studies. There might be quite different responses to the treatment (ie, heterogeneity of treatment outcome across the studies) and different adherence rates. More is going on with LIVs, but this analogy provides most of the intuition. Critically engaging the LIV in the study by Grieve et al,1 one might reasonably wonder whether challenge 2 (ie, that the quality of care in the ICU changes as the number of beds available changes) is an issue, and if so, then the LIV may be even more problematic.
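The building block behind this intuition is the Wald estimator, which an LIV applies locally across levels of the instrument. The simulation below is hypothetical and deliberately simplified (a binary instrument rather than a bed count, one latent severity variable, an arbitrary effect structure); it is meant only to show why the IV contrast recovers the effect among the marginal patients whose care was shifted by the instrument, while a naive treated-vs-untreated comparison is confounded by unobserved severity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical simulation: z = 1 when an ICU bed is free (the IV),
# u = unobserved severity, d = admitted to ICU, y = outcome.
z = rng.integers(0, 2, n)
u = rng.normal(0, 1, n)

# Very sick patients (u > 1) are admitted regardless of beds; for
# patients nearer equipoise, bed availability tips the decision.
d = ((u > 1.0) | ((z == 1) & (u > -0.5))).astype(float)

# Treatment helps sicker patients more (heterogeneous effect 1 + 0.5u),
# and severity also worsens the outcome directly (confounding).
y = d * (1.0 + 0.5 * u) + 0.3 * u + rng.normal(0, 1, n)

# Naive comparison: confounded, because admitted patients are sicker.
naive = y[d == 1].mean() - y[d == 0].mean()

# Wald estimator: effect among patients whose care the instrument
# shifted (here, those with -0.5 < u <= 1.0).
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

print(f"naive estimate: {naive:.2f}, Wald IV estimate: {wald:.2f}")
```

In this toy world the effect among the instrument-shifted patients is about 1.1, and the Wald estimate lands near it; an LIV pushes this one step further, tracing out how such local estimates change as the instrument (eg, bed availability) varies, which is exactly why challenge 2 matters more for the LIV.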
The article by Grieve et al1 describes a well-designed study. The authors have anticipated the usual challenges and provided reasoning and data for why one should be confident in the reliability of their design. This is a methodologically challenging study, requiring a high level of sophistication to execute properly. But the use of these sophisticated methods does not diminish the reader’s critical role: as the sophistication of the study design increases, many of the subtleties of the analysis move beyond the nontechnical readership’s purview, but the fundamentals have not changed. Although this study by Grieve et al1 may appear different on the surface, these new methods rely on many of the same principles of causal inference; the nonstatistical reader has the insights necessary to engage critically with the most important aspects of the analysis.
Published: February 15, 2019. doi:10.1001/jamanetworkopen.2018.7698
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2019 Baiocchi M. JAMA Network Open.
Corresponding Author: Michael Baiocchi, PhD, Stanford Prevention Research Center, 1265 Welch Rd, Stanford, CA 94305 (email@example.com).
Conflict of Interest Disclosures: None reported.
Baiocchi M. Which Deteriorating Ward Patients Benefit From Transfer to the Intensive Care Unit? Critically Engaging Methods in a Well-Designed Natural Experiment. JAMA Netw Open. 2019;2(2):e187698. doi:10.1001/jamanetworkopen.2018.7698