Invited Commentary
Health Informatics
April 15, 2021

Algorithmovigilance—Advancing Methods to Analyze and Monitor Artificial Intelligence–Driven Health Care for Effectiveness and Equity

Author Affiliations
  • Regenstrief Institute Inc, Indiana University School of Medicine, Indianapolis
JAMA Netw Open. 2021;4(4):e214622. doi:10.1001/jamanetworkopen.2021.4622

In recent years, the use of machine learning and other artificial intelligence approaches applied to increasingly rich and accessible health data sets to develop algorithms that guide and support health care has grown rapidly.1 As they make their way into practice, such algorithms have the potential to fundamentally transform how health care decisions are made and, therefore, how patients are diagnosed and treated.2 While such approaches hold great promise for enabling more precise, accurate, timely, and even fair decision-making when properly developed and applied, there is also growing evidence that systematic biases can lead to unintended and even severe consequences.3,4 Mirroring disparities and inequities inherent in our society and health system,5 such biases can be present not only in the underlying data used to develop algorithms but also in how algorithmic interventions are deployed.

Elsewhere in JAMA Network Open, Park and colleagues6 present findings from a study evaluating different approaches to the debiasing of health care algorithms developed to predict postpartum depression (PPD) among a cohort of pregnant women with Medicaid coverage. The researchers, from IBM Research, leveraged the IBM MarketScan Medicaid Database, a deidentified, individual-level claims data set covering approximately 7 million Medicaid enrollees across multiple states, to derive their algorithms. They started by developing 2 sets of machine learning models trained to predict 2 outcomes: (1) diagnosis or treatment for PPD and (2) postpartum mental health service utilization. Their initial, risk-adjusted generalized linear models for each outcome demonstrated a notable difference between groups when race was binarized, with White patients having twice the predicted likelihood of being diagnosed with PPD compared with Black patients and a significantly higher likelihood of utilizing mental health services. However, as the authors point out, current evidence indicates that while PPD incidence is higher in women with low socioeconomic status, such as the Medicaid enrollees studied, PPD rates are similar across racial and ethnic groups. Therefore, the algorithmically predicted differences derived from the real-world data set are likely the result of underlying disparities, including lower rates of diagnosis and treatment among Black patients.

The authors next examined the relative effectiveness of 3 different algorithmic debiasing methods. These included a preprocessing method called reweighing, an in-processing method called prejudice remover for logistic regression, and an approach called fairness through unawareness, in which the models are retrained without the race variable.6 They measured the differences using 2 metrics of fairness: disparate impact (DI), a ratio of estimated favorable outcomes between privileged and underprivileged groups, and equal opportunity difference (EOD), a comparison of true-positive rates between groups. While algorithmic prediction and debiasing performance differed across the models and methods tested, a key finding is that all of the debiasing approaches demonstrated some improvement toward fairness relative to the nondebiased models. Nevertheless, it was also notable that, at least in this particular study, the approach of simply disregarding race in the model was not as effective as the other debiasing approaches, likely due to the presence of correlated variables.
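To make the 2 fairness metrics concrete, the following is a minimal Python sketch, not the code used by Park and colleagues. It assumes binary predictions and a single binary group split. The direction of the DI ratio (underprivileged over privileged) and of the EOD difference (underprivileged minus privileged) follows a common convention and is an assumption, since the description above does not fix it. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def _rate(flags: Sequence[int]) -> float:
    """Share of 1s in a list of 0/1 flags; an empty group is an error."""
    if not flags:
        raise ValueError("group has no eligible records")
    return sum(flags) / len(flags)


def disparate_impact(y_pred, group, privileged) -> float:
    """Ratio of favorable-prediction rates: underprivileged / privileged.

    Assumes the privileged group has at least one favorable prediction;
    otherwise the ratio is undefined (division by zero).
    """
    priv = [p for p, g in zip(y_pred, group) if g == privileged]
    unpriv = [p for p, g in zip(y_pred, group) if g != privileged]
    return _rate(unpriv) / _rate(priv)


def equal_opportunity_difference(y_true, y_pred, group, privileged) -> float:
    """Difference in true-positive rates: underprivileged - privileged."""
    priv = [p for t, p, g in zip(y_true, y_pred, group)
            if t == 1 and g == privileged]
    unpriv = [p for t, p, g in zip(y_true, y_pred, group)
              if t == 1 and g != privileged]
    return _rate(unpriv) - _rate(priv)


# Toy data (hypothetical): group "A" is treated as privileged.
group = ["A", "A", "A", "A", "B", "B", "B", "B"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

di = disparate_impact(y_pred, group, privileged="A")
eod = equal_opportunity_difference(y_true, y_pred, group, privileged="A")
print(f"DI = {di:.3f}, EOD = {eod:.3f}")
```

Under these conventions, a DI of 1 and an EOD of 0 indicate parity between groups. In the toy data, group B receives favorable predictions at one-third the rate of group A, and its true-positive rate is 0.5 lower.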

While more research is certainly needed to validate and determine which debiasing methods are most appropriately applied to particular settings and conditions, this study suggests that debiasing can address underlying disparities represented in data used to develop and operationalize predictive algorithms. Moreover, while the authors set out to test debiasing approaches, they did so precisely because they anticipated that there would likely be inherent biases in algorithms based on even large, well-characterized data sets like the one used. This demonstrates another important but perhaps less obvious point: whether inherent biases are anticipated by the developers or not,3 evaluation and monitoring of health care algorithms for effectiveness and equity are necessary and, indeed, an ethical imperative.

As algorithmic health care continues to advance, there are myriad reasons why we must concern ourselves with the systematic evaluation and ongoing monitoring of algorithms that drive care. In that sense, this study represents an important step toward a sorely needed set of approaches and tools for that evaluation and monitoring.

Just as we would not think of deploying a new pharmaceutical or device into practice without first ensuring its efficacy and safety in the populations to which it will be applied, so too must we recognize the reality that algorithms have the potential for both great benefit and harm and, therefore, require study. Indeed, the processes in place for evaluating and validating therapeutics can offer a useful analogy for the systematic evaluation of artificial intelligence–enabled health care interventions.7 Building on that analogy, an approach akin to pharmacovigilance is called for. Algorithmovigilance can be defined as the scientific methods and activities relating to the evaluation, monitoring, understanding, and prevention of adverse effects of algorithms in health care. While such activities are certainly relevant during algorithmic development and prior to initial deployment, they must not be limited to that phase of algorithmic use in health care.

Perhaps to an even greater degree than for pharmaceuticals, ongoing evaluation of algorithms used in practice is critical. The way algorithms are developed and deployed in practice can change their anticipated impacts and potentially lead to unintended adverse effects. This has to do with the variability with which algorithms make their way into practice. Beyond concerns shared with pharmaceuticals and devices (eg, their differential effects given individual variation among people), algorithms are also subject to different applications based on factors that often do not apply to pharmaceuticals or devices. These include how they are deployed, who interacts with them, and when in the workflow this takes place.

Beyond the technical considerations related to initial algorithmic performance and validation, an algorithm's performance often changes as it is deployed against different data, in different settings, and at different times. Moreover, how algorithms are used involves human-computer interactions that add another level of variation and complexity that can change the algorithm’s performance, including how algorithmic outputs are interpreted by different users, the lack of trust in a black-box algorithm, and concerns about becoming overly reliant on the output such that automation is used without critical thought. These and other systemic factors, many difficult to anticipate, can alter the performance characteristics of algorithms and lead to adverse effects and harm to patients and populations. Given the rapidly accelerating pace of change, demand, and promise for care that is guided by adaptive and computable algorithms,8 the inherent and systemic inequities that exist in our health care system,5 and the potential for unintended harm, it is imperative that we continue to develop, test, and disseminate tools and capabilities that enable systematic surveillance and vigilance in the development and application of algorithms in health care.
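As one illustration of what ongoing surveillance could look like, the sketch below tracks the true-positive rate per group in each reporting period and flags periods in which the gap between groups exceeds a threshold. This is a hypothetical example, not a method described in the commentary or in the cited study. The record layout, the function names, and the 0.2 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def tpr_by_period_and_group(records):
    """Compute true-positive rates keyed by (period, group).

    records: iterable of (period, group, y_true, y_pred) with 0/1 labels.
    Only records with y_true == 1 contribute to the rate.
    """
    hits = defaultdict(int)
    positives = defaultdict(int)
    for period, grp, y_true, y_pred in records:
        if y_true == 1:
            positives[(period, grp)] += 1
            hits[(period, grp)] += y_pred
    return {key: hits[key] / positives[key] for key in positives}


def flag_periods(tpr, max_gap=0.2):
    """Return sorted periods where the TPR gap between groups exceeds max_gap.

    max_gap is an arbitrary illustrative threshold, not a clinical standard.
    """
    by_period = defaultdict(dict)
    for (period, grp), rate in tpr.items():
        by_period[period][grp] = rate
    flagged = []
    for period in sorted(by_period):
        rates = list(by_period[period].values())
        if max(rates) - min(rates) > max_gap:
            flagged.append(period)
    return flagged


# Toy monitoring log (hypothetical): performance for group "B" degrades in Q2.
records = [
    ("2021-Q1", "A", 1, 1), ("2021-Q1", "A", 1, 1),
    ("2021-Q1", "B", 1, 1), ("2021-Q1", "B", 1, 1),
    ("2021-Q2", "A", 1, 1), ("2021-Q2", "A", 1, 1),
    ("2021-Q2", "B", 1, 1), ("2021-Q2", "B", 1, 0),
    ("2021-Q2", "B", 0, 0),
]

tpr = tpr_by_period_and_group(records)
flagged = flag_periods(tpr)
print(flagged)
```

In practice, the choice of metric, grouping, reporting interval, and alert threshold would need to be justified for the specific algorithm and setting. The sketch only shows that such checks can be run repeatedly after deployment rather than once before it.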

Article Information

Published: April 15, 2021. doi:10.1001/jamanetworkopen.2021.4622

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Embi PJ. JAMA Network Open.

Corresponding Author: Peter J. Embi, MD, MS, Regenstrief Institute Inc, Indiana University School of Medicine, 1101 W 10th St, Indianapolis, IN 46202 (pembi@regenstrief.org).

Conflict of Interest Disclosures: None reported.

References

1. Yu KH, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng. 2018;2(10):719-731. doi:10.1038/s41551-018-0305-z
2. Lindsell CJ, Stead WW, Johnson KB. Action-informed artificial intelligence—matching the algorithm to the problem. JAMA. 2020;323(21):2141-2142. doi:10.1001/jama.2020.5035
3. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
4. Parikh RB, Teeple S, Navathe AS. Addressing bias in artificial intelligence in health care. JAMA. 2019;322(24):2377-2378. doi:10.1001/jama.2019.18058
5. Bailey ZD, Feldman JM, Bassett MT. How structural racism works—racist policies as a root cause of US racial health inequities. N Engl J Med. 2021;384(8):768-773. doi:10.1056/NEJMms2025396
6. Park Y, Hu J, Singh M, et al. Comparison of methods to reduce bias from clinical prediction models of postpartum depression. JAMA Netw Open. 2021;4(4):e213909. doi:10.1001/jamanetworkopen.2021.3909
7. Park Y, Jackson GP, Foreman MA, Gruen D, Hu J, Das AK. Evaluating artificial intelligence in medicine: phases of clinical research. JAMIA Open. 2020;3(3):326-331. doi:10.1093/jamiaopen/ooaa033
8. Petersen C, Smith J, Freimuth RR, et al. Recommendations for the safe, effective use of adaptive CDS in the US healthcare system: an AMIA position paper. J Am Med Inform Assoc. 2021;28(4):677-684. doi:10.1093/jamia/ocaa319