To paraphrase Yogi Berra and perhaps others,1 prediction is hard, especially about the future. Chaudhary and colleagues2 therefore should be commended for producing a moderately accurate prediction model of sustained postoperative opioid use. Consistent with current standards,3 they transparently reported the model development process and coefficients. They also translated the model into a practical and accessible scoring system, the Stopping Opioids After Surgery (SOS) score, making it more likely to be used for discharge planning. Others who develop prediction models should emulate these features.
Clinical prediction models like the SOS score2 are intended to inform treatment decisions for individual patients. Therefore, the same rules of evidence and skepticism should apply as for all health care interventions. Because of the complex and technical nature of prediction model development and evaluation, being a critical user of these models is often more difficult than producing the models in the first place. Overall accuracy statistics are important, but they are only a small fraction of what the critical user should consider. In this short commentary, I offer only 3 of the many important questions critical users should ask before using or implementing the SOS score or any other clinical prediction model.
Does the Outcome Have the Same Meaning for All Patients?
Prediction model development and validation studies are observational cohort designs. Using preexisting or newly collected data, individuals who are at risk of the outcome are observed until the outcome occurs for some fraction of the cohort. Then baseline characteristics of the cohort are used to develop and validate a statistical prediction model. In this context, “at risk of” means “not yet experienced.” For example, people who have already died should not be included in a cohort study to develop prediction models of 30-day mortality. Because the dead are not at risk of dying, the meaning of the outcome would differ across cohort members depending on their baseline status (dying vs staying dead). However, in other settings, the appropriate cohort exclusion criteria can be less clear.
In the study by Chaudhary et al,2 the outcome of sustained postoperative opioid use is unambiguously a new and undesirable outcome for patients who were opioid naive or who had only minor opioid exposures in the 6 preoperative months. However, the same outcome is more difficult to interpret for the cohort members who had persistent opioid use in the preoperative period. The outcome could represent an escalation of opioid use (ie, a bad outcome), maintenance of preoperative levels (ie, not changed by surgery), or continued use at a reduced dose (ie, a good, if not ideal, outcome). For this group, the outcome is underspecified because it may not be new (ie, these patients were not truly at risk of it) or necessarily undesirable. Changing the outcome definition to an escalation of opioid use, or developing separate models and outcomes for different cohorts, could ensure that everyone in the development cohort was at risk of the outcome and that it has the same meaning for everyone.
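To make the distinction concrete, the sketch below (in Python) illustrates one way the 3 trajectories described above could be separated for patients with persistent preoperative opioid use. The field names (preop_daily_mme, postop_daily_mme) and the 20% change threshold are hypothetical illustrations, not definitions taken from the SOS study.

```python
import pandas as pd

def classify_trajectory(preop_daily_mme: float, postop_daily_mme: float,
                        threshold: float = 0.20) -> str:
    """Label postoperative opioid use relative to the preoperative baseline."""
    if postop_daily_mme > preop_daily_mme * (1 + threshold):
        return "escalation"   # a bad outcome
    if postop_daily_mme < preop_daily_mme * (1 - threshold):
        return "reduction"    # a good, if not ideal, outcome
    return "maintenance"      # not changed by surgery

# Three illustrative patients, all with persistent preoperative use (30 MME/day).
cohort = pd.DataFrame({"preop_daily_mme": [30, 30, 30],
                       "postop_daily_mme": [60, 31, 10]})
cohort["trajectory"] = [classify_trajectory(pre, post)
                        for pre, post in zip(cohort["preop_daily_mme"],
                                             cohort["postop_daily_mme"])]
print(cohort)
```

Under an outcome definition restricted to escalation, only the first trajectory would count as the undesirable event.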
Is the Model Accurate for Subgroups?
There is active debate among prediction model developers about the trade-offs between using cohorts that are very diverse in terms of patient and procedure characteristics (eg, all surgical procedures) vs more narrow and specific cohorts (eg, Medicare patients undergoing knee replacement). For example, the American College of Surgeons National Surgical Quality Improvement Program surgical risk calculator4 is reported to have generally good accuracy across a very wide range of procedures and specialties. However, other researchers5,6 have found that this tool has low accuracy when tested for specific patients and procedures that differ from the overall sample in terms of the prevalence of inputs and outcomes.
A simple thought experiment shows why this is not surprising. Imagine a cohort similar to the SOS cohort2 but composed of only 2 distinct groups of roughly equal size (eg, opioid naive and chronic opioid users) with markedly different frequencies of the outcome (2.7% vs 30.5% in this hypothetical case). An area under the curve (AUC) value is a measure of discrimination, or the probability that a randomly chosen person who experienced the outcome will have a higher predicted probability than a randomly selected person who did not experience the outcome. An AUC value of 0.50 signifies no discrimination or random chance. An AUC value greater than 0.80 is considered very good to excellent. For our hypothetical 2-group cohort, a prediction model that includes only a group membership indicator would have an AUC value of approximately 0.75, even though it has no predictive value within each group. This means that a model with excellent overall accuracy might have no accuracy for the subgroup of patients or procedures in a particular practice.
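The arithmetic behind this thought experiment is easy to verify by simulation. The Python sketch below assumes 2 equally sized groups and the hypothetical outcome rates of 2.7% and 30.5% used above (none of these values are taken from the SOS study itself); group membership alone produces an overall AUC near 0.75 while offering no discrimination within either group.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(seed=0)
n_per_group = 50_000  # assumed equal group sizes; the overall AUC shifts if the split changes

# The only "predictor" is group membership (0 = opioid naive, 1 = chronic use).
group = np.repeat([0, 1], n_per_group)
p_outcome = np.where(group == 0, 0.027, 0.305)  # hypothetical outcome rates
outcome = rng.binomial(1, p_outcome)

# Overall discrimination looks moderate to good...
print(f"Overall AUC: {roc_auc_score(outcome, group):.3f}")  # approximately 0.75

# ...but within each group the predictor is constant and cannot discriminate.
for g, label in [(0, "opioid naive"), (1, "chronic use")]:
    subset = group == g
    within_auc = roc_auc_score(outcome[subset], group[subset])  # constant score -> 0.50
    print(f"{label}: outcome rate {outcome[subset].mean():.3f}, "
          f"within-group AUC {within_auc:.2f}")
```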
The SOS model2 is reported to have moderate overall accuracy (AUC = 0.76), but this appears to be driven primarily by distinguishing people according to whether they were preoperative long-term opioid users. Although the scoring system described by Chaudhary et al2 ranges from 0 to 100, by my calculations, the maximum SOS score is actually 72 points: 36 points for being a long-term opioid user and 36 points for having the highest possible risk for all other predictors combined, because, for example, patients cannot occupy more than 1 age category. It is currently unknown how well the model performs within each group of opioid users (naive, short-term use, and long-term use). Because most of the sample was naive to opioids before surgery (54.9%),2 it is important to check that the model can discriminate among them. Including interaction terms could help determine whether the coefficients for other model terms (eg, age or depression) need to differ depending on preoperative opioid use status; if so, the interaction terms should be included to improve both the overall and within-group accuracy. An alternative solution, as already suggested, is to develop a separate model for preoperative long-term opioid users. For the same reason, the heterogeneity of the category “major surgery,” which includes procedures known to differ in terms of persistent surgical pain and postoperative opioid use, might also be worth exploring through interaction terms and stratified analyses. This is an example of why the critical user of prediction models should look for evidence that the reported overall accuracy of a model still applies to subgroups of patients with very different input or outcome prevalence.
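As an illustration of what such subgroup checks might look like in practice, the hedged Python sketch below simulates a cohort in which one predictor's effect differs by preoperative opioid use status, then compares overall and stratified AUCs for logistic models fit with and without interaction terms. All variable names, prevalences, and coefficients are invented for illustration and do not come from the SOS model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(seed=1)
n = 20_000
df = pd.DataFrame({
    "opioid_status": rng.choice(["naive", "short_term", "long_term"],
                                size=n, p=[0.55, 0.25, 0.20]),
    "age_65_plus": rng.binomial(1, 0.4, size=n),
    "depression": rng.binomial(1, 0.2, size=n),
})
# Simulated outcome in which depression matters more for opioid-naive patients.
logit = (-3.5
         + 2.5 * (df["opioid_status"] == "long_term")
         + 0.3 * df["age_65_plus"]
         + np.where(df["opioid_status"] == "naive", 0.8, 0.1) * df["depression"])
df["sustained_use"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

main_effects = smf.logit(
    "sustained_use ~ C(opioid_status) + age_65_plus + depression", data=df).fit(disp=0)
with_interactions = smf.logit(
    "sustained_use ~ C(opioid_status) * (age_65_plus + depression)", data=df).fit(disp=0)

for label, model in [("main effects", main_effects),
                     ("with interactions", with_interactions)]:
    pred = model.predict(df)
    print(f"{label}: overall AUC = {roc_auc_score(df['sustained_use'], pred):.3f}")
    for status, sub in df.groupby("opioid_status"):
        within = roc_auc_score(sub["sustained_use"], pred[sub.index])
        print(f"   {status}: within-group AUC = {within:.3f}")
```

If the within-group AUCs improve materially when interactions are added, that would be evidence that separate coefficients (or separate models) per opioid-use stratum are worth the added complexity.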
Does the Model Give Users New Information?
Beyond accuracy, the ultimate test of the effectiveness of clinical prediction models is whether patients have better outcomes in health care systems once the model outputs are delivered to those in a position to interpret and act on them. As I have previously noted,7 the path from accurate predictions to real benefits to patients is long and demanding. One step along the path is that models must give users new information. If the main information contained in the SOS model2 is that people who are long-term users of opioids preoperatively are much more likely to be long-term users of opioids postoperatively, then it is important to test whether formalizing this knowledge through a clinical decision support tool will translate to better clinical care and patient outcomes. If users already know and act on this risk factor, then formalizing the model into clinical decision support tools is less likely to be effective.
Developing and validating accurate clinical prediction models is hard, as is being a critical user of prediction model research. I think that the developers of the SOS model2 are on the right track and may benefit from refining the cohort and/or outcome definitions before testing the clinical decision support tool’s effectiveness in improving discharge planning and patient outcomes. Prediction models are meant to change clinical practice and, therefore, must be afforded the same skepticism and rules of evidence applied to other interventions. The 3 questions posed here may help critical users to evaluate other prediction models and the clinical decision support tools they inform.
Published: July 10, 2019. doi:10.1001/jamanetworkopen.2019.6661
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2019 Harris AHS. JAMA Network Open.
Corresponding Author: Alex H. S. Harris, PhD, MS, 795 Willow Rd (MPD: 152), Menlo Park, CA 94025 (alexander.harris2@va.gov).
Conflict of Interest Disclosures: None reported.
Funding/Support: This work was supported in part by the US Department of Veterans Affairs Health Services Research and Development Service (grant RCS14-232).
Role of the Funder/Sponsor: The US Department of Veterans Affairs Health Services Research and Development Service had no role in the preparation, review, or approval of the manuscript or the decision to submit the manuscript for publication.
Disclaimer: The views expressed do not reflect those of the US Department of Veterans Affairs or other institutions.
Additional Information: A data simulation is available from the author on request.
References
3. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD). Ann Intern Med. 2015;162(10):735-736. doi:10.7326/L15-5093-2
4. Bilimoria KY, Liu Y, Paruch JL, et al. Development and evaluation of the universal ACS NSQIP surgical risk calculator: a decision aid and informed consent tool for patients and surgeons. J Am Coll Surg. 2013;217(5):833-842.e3. doi:10.1016/j.jamcollsurg.2013.07.385
6. Golden DL, Ata A, Kusupati V, et al. Predicting postoperative complications after acute care surgery: how accurate is the ACS NSQIP surgical risk calculator? Am Surg. 2019;85(4):335-341.