Key Points
Question  How does the performance of different methods to reduce bias for clinical prediction algorithms compare when measured by disparate impact and equal opportunity difference?
Findings  In a cohort study of 314 903 White and 217 899 Black pregnant female individuals with Medicaid coverage, application of a reweighing method was associated with a greater reduction in algorithmic bias between White and Black individuals in postpartum depression and mental health service utilization prediction than simply excluding race from the prediction models.
Meaning  Researchers should examine clinical prediction models for bias stemming from the underlying data and consider methods to mitigate that bias.
Abstract
Importance  The lack of standardized methods to reduce bias in clinical algorithms presents various challenges in providing reliable predictions and in addressing health disparities.
Objective  To evaluate approaches for reducing bias in machine learning models using a real-world clinical scenario.
Design, Setting, and Participants
Health data for this cohort study were obtained from the IBM MarketScan Medicaid Database. Eligibility criteria were as follows: (1) Female individuals aged 12 to 55 years with a live birth record identified by delivery-related codes from January 1, 2014, through December 31, 2018; (2) greater than 80% enrollment through pregnancy to 60 days post partum; and (3) evidence of coverage for depression screening and mental health services. Statistical analysis was performed in 2020.
Exposures
Binarized race (Black individuals and White individuals).
Main Outcomes and Measures
Machine learning models (logistic regression [LR], random forest, and extreme gradient boosting) were trained for 2 binary outcomes: postpartum depression (PPD) and postpartum mental health service utilization. Risk-adjusted generalized linear models were used for each outcome to assess potential disparity in the cohort associated with binarized race (Black or White). Methods for reducing bias, including reweighing, Prejudice Remover, and removing race from the models, were examined by analyzing changes in fairness metrics compared with the base models. Baseline characteristics of female individuals at the top-predicted risk decile were compared for systematic differences. Fairness was measured by disparate impact (DI; a value of 1 indicates fairness) and equal opportunity difference (EOD; a value of 0 indicates fairness).
Results
Among 573 634 female individuals initially examined for this study, 314 903 were White (54.9%), 217 899 were Black (38.0%), and the mean (SD) age was 26.1 (5.5) years. The risk-adjusted odds ratio comparing White participants with Black participants was 2.06 (95% CI, 2.02-2.10) for clinically recognized PPD and 1.37 (95% CI, 1.33-1.40) for postpartum mental health service utilization. Taking the LR model for PPD prediction as an example, reweighing reduced bias as measured by improved DI and EOD metrics from 0.31 and −0.19 to 0.79 and 0.02, respectively. Removing race from the models had inferior performance for reducing bias compared with the other methods (PPD: DI = 0.61; EOD = −0.05; mental health service utilization: DI = 0.63; EOD = −0.04).
Conclusions and Relevance
Clinical prediction models trained on potentially biased data may produce unfair outcomes on the basis of the chosen metrics. This study’s results suggest that performance varied depending on the model, outcome label, and method for reducing bias. This approach toward evaluating algorithmic bias can serve as an example for the growing number of researchers who wish to examine and address bias in their data and models.
Introduction
Prediction models are increasingly used in clinical decision-making to inform personalized or precision medicine. However, recent studies of machine learning and artificial intelligence applications reveal algorithmic biases that have substantial consequences for many people,1-4 particularly adverse outcomes on the health and well-being of racial and ethnic minorities.5-7 Bias in quantitative health care research refers to noncausal associations, skewed population selection, or statistical estimation errors; algorithmic bias usually refers to disparity observed in prediction model outcomes with respect to certain demographic features.
Researchers have formulated algorithmic fairness definitions and developed bias mitigation methods.8-11 However, gaps between such advances and solutions for algorithmic bias in clinical prediction persist because of technical difficulties, complexities of high dimensional health data, lack of knowledge of underlying causal structures, and challenges to algorithm appraisal.12,13 Few examples in health care to date use methods to reduce bias. Influential work by Obermeyer et al7 suggests a relabeling approach, replacing erroneous or biased target outcomes of machine learning models with alternatives. Although a relatively simple and effective solution, the authors caution that it requires deep domain knowledge and the ability to iteratively assess models, which may not be feasible if the prediction target is unmodifiable, or the alternative outcome measure is not widely recorded.
In this work, we used postpartum depression (PPD) to assess fairness and bias mitigation methods in clinical prediction. PPD affects 1 in 7 women who give birth, and early detection has substantial implications for maternal and child health.14 Incidence is higher among women with low socioeconomic status, such as Medicaid enrollees.15 Despite prior evidence indicating similar PPD rates across racial and ethnic groups, underdiagnosis and undertreatment has been observed among minorities on Medicaid.16 Varying rates of reported PPD reflect complex dynamics of perceived stigma, cultural differences, patient-physician relationships, and clinical needs in minority populations. The objective of this study of PPD is to examine data and machine learning prediction models for bias and to introduce and discuss assumptions and challenges of bias assessment and mitigation.
Methods
Data and Study Population
The cohort was constructed using the IBM MarketScan Medicaid Database (2014-2018), containing deidentified, individual-level claim records from approximately 7 million Medicaid enrollees in multiple states. Eligible participants were female individuals aged 12 to 55 years with a live birth record (diagnosis and procedure codes in eTable 1 in the Supplement). We required greater than 80% enrollment for the duration of pregnancy (273 days for full-term births or 245 days for preterm births)17 plus 60 days postdelivery, determined by hospital discharge date or earliest claim date with a delivery code for nonhospital births. The rationale for the 80% enrollment threshold is that not all women who become eligible for Medicaid were enrolled from the beginning of pregnancy. We excluded patients with dual eligibility or without mental health coverage and limited the final cohort to the first eligible pregnancies of White and Black female participants (eFigure 1 in the Supplement). This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline. This study was conducted using preexisting, deidentified claims data and, as such, did not require institutional review board approval or informed patient consent, in accordance with 45 CFR §46.
Outcomes
Two sets of models with different outcomes were trained. The first set predicted clinically recognized PPD (recorded International Classification of Diseases codes for major depressive disorders, other mood disorders, or PPD) or filled antidepressant prescription in the 60-day postpartum period. The second set predicted postpartum mental health service utilization (defined as ≥1 inpatient or outpatient encounter with mental health primary diagnosis). Although likely associated with the first outcome, we anticipated that utilization would capture aspects of bias not directly associated with diagnosis or treatment of PPD. After assessing the data for presence of bias, we applied methods for reducing bias to these 2 prediction models.
We use the term bias to refer to the disparity observed in both underlying data and prediction model outcomes trained with the data. We define disparity as discrepancies in measures of interest unexplained by clinical need, in line with the Institute of Medicine’s definition.18 Risk factor–adjusted associations between race and outcomes were assessed, and persistent discrepancy after adjustment was considered to reflect disparity, following the interpretation in prior work.19
We introduce fairness concepts for algorithmic bias assessment in Table 1. In this study, race is a protected attribute, with White being the privileged value as opposed to Black. The favorable label is a predicted positive outcome for either PPD or mental health service use because we assume it will lead to limited resource access. We assess group fairness rather than individual fairness, meaning that in a fair world certain statistical measures should be equivalent between groups. Because definitions of fairness vary, researchers should select the most appropriate way to quantify bias given the study context. We focus on 2 metrics used for binary classification that capture different aspects of bias. Disparate impact (DI) is the ratio of the means of the predicted favorable outcome between unprivileged and privileged groups.20 Although conceptually simple, DI is limited because a DI value of 1 does not necessarily represent fairness. For example, comparable DI values close to 1 with different accuracy can lead to an unfair distribution of benefits by producing more error in one group than the other. Equal opportunity difference (EOD) attempts to address this by comparing true-positive rates (sensitivity).21 We chose to compare true-positive rate rather than false-positive rate because a positive outcome is more desirable in our study.
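To make the two metrics concrete, here is a minimal sketch of how DI and EOD can be computed from binary predictions. This is an illustrative reimplementation of the published definitions, not code from the study; the group coding (1 = privileged, 0 = unprivileged) is a convention chosen here.

```python
import numpy as np

def disparate_impact(y_pred, group):
    """Ratio of mean predicted favorable outcome, unprivileged / privileged.
    group: 1 = privileged (here, White), 0 = unprivileged (here, Black).
    A value of 1 indicates fairness under this metric."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 0].mean() / y_pred[group == 1].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Difference in true-positive rates (sensitivity), unprivileged minus
    privileged. A value of 0 indicates fairness under this metric."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def tpr(g):
        # sensitivity within group g: mean prediction among true positives
        return y_pred[(group == g) & (y_true == 1)].mean()
    return tpr(0) - tpr(1)
```

A DI well below 1 and a negative EOD, as reported for the base models in the Results, both indicate that the unprivileged group receives the favorable label less often.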
Algorithmic debiasing—mathematical techniques for bias mitigation—can be classified into 3 types (preprocessing, in-processing, and postprocessing) depending on where they act in the modeling pipeline. In these methods, bias generally refers to the statistical association between the protected attribute and the predicted outcome. Our approach was 3-fold (Table 1). First, we used reweighing,22 a preprocessing method that applies different weights to each group-label combination according to the conditional probability of the label by race. This can be considered a cost-sensitive learning method in which the error for certain group-label combinations becomes more expensive through greater weight. Second, we used an in-processing method called Prejudice Remover for logistic regression,23 where prejudice likewise refers to this statistical association. By adding a regularization term to the objective function, the race feature becomes less influential in the final prediction. Finally, we trained the models without the race variable for comparison, an approach known as fairness through unawareness. This technique was used in the biased commercial algorithm for population health management.7
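For illustration, the reweighing step can be sketched as a plain reimplementation of the Kamiran-Calders weighting rule: each (group, label) cell receives the weight that would make group and label statistically independent in the weighted data. This is not the study's code; a library implementation (eg, AIF360's Reweighing) could be used instead.

```python
import numpy as np

def reweighing_weights(group, label):
    """Kamiran-Calders reweighing: each (group, label) cell gets weight
    P(group) * P(label) / P(group, label), so that group and label become
    statistically independent in the weighted training data."""
    group, label = np.asarray(group), np.asarray(label)
    weights = np.empty(len(label), dtype=float)
    for g in np.unique(group):
        for y in np.unique(label):
            cell = (group == g) & (label == y)
            observed = cell.mean()            # P(group = g, label = y)
            if observed > 0:
                expected = (group == g).mean() * (label == y).mean()
                weights[cell] = expected / observed
    return weights
```

The returned vector can be passed as `sample_weight` to a classifier's `fit` method (eg, in scikit-learn), which is what makes this a cost-sensitive approach: underrepresented group-label cells receive weights greater than 1, so errors on them become more expensive.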
We also attempted a relabeling approach suggested in previous work7 by using an alternative outcome that is less sensitive to racial bias. Because of the highly correlated nature of health care data and low prevalence of candidate labels such as emergency mental health visits, we were unable to find a neutral target regarding race, so we applied the 3 methods described already to the secondary outcome of mental health utilization.
Statistical Analysis
Descriptive statistics were used to compare population characteristics. A generalized linear model with logit link was used to obtain the adjusted odds ratio (OR) for race with respect to each outcome. For the prediction tasks, 2014 to 2017 data were split into training, validation, and test sets in a ratio of 5:3:2. Three commonly used classifiers were fit: logistic regression (LR), random forest (RF), and extreme gradient boosting (XGB). Features included demographic characteristics, pregnancy outcomes, baseline comorbidities, medication use, and health care utilization. Base models, models without race, and models trained on reweighed data were compared for balanced accuracy and fairness metrics. We report balanced accuracy rather than area under the curve because, in practice, classification will be performed at a fixed cutoff. We present results from LR below and results from XGB and RF in the Supplement. Finally, the trained LR classifier for PPD was evaluated using 2018 data as a hypothetical deployment scenario limited to high-risk patients in the top decile of predicted risk. We compared the characteristics of these high-risk participants to identify systematic differences by race. eFigure 2 in the Supplement illustrates the overall approach.
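The split-and-train workflow can be sketched as follows on synthetic data. The feature columns, effect sizes, and random seeds are all hypothetical; the actual predictors are the demographic, clinical, and utilization variables described above, and the reweighing weights from the preprocessing step would be supplied via `sample_weight` at fitting time.

```python
# Synthetic end-to-end sketch: 5:3:2 split, then a logistic regression
# evaluated by balanced accuracy at a fixed classification cutoff.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
clinical = rng.normal(size=(n, 5))        # stand-in clinical features
race = rng.integers(0, 2, size=n)         # 1 = White, 0 = Black
y = (clinical[:, 0] + 0.5 * race + rng.normal(size=n) > 0.5).astype(int)
X = np.column_stack([clinical, race])

# 5:3:2 split: carve off 20% for test, then 3/8 of the remainder for validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=3 / 8, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_bacc = balanced_accuracy_score(y_test, clf.predict(X_test))
```

With n = 1000 this yields training, validation, and test sets of exactly 500, 300, and 200 observations, matching the 5:3:2 ratio.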
The significance threshold was set at P < .05, but there was no explicit hypothesis testing. Statistical analysis was performed using Python 3.7 (Python Software Foundation) in 2020.
Results
Among 573 634 female individuals initially examined for this study, the mean (SD) age was 26.1 (5.5) years and eligibility criteria yielded 314 903 White participants (54.9%) and 217 899 Black participants (38.0%). Both groups had similar mean (SD) age (White participants: 26.0 [5.4] years; Black participants: 26.2 [5.5] years) and enrollment in managed Medicaid, but White women had higher prevalence of psychiatric comorbidity and Black women had worse pregnancy outcomes over the study period (Table 2), consistent with published findings. Notable differences were observed in baseline prevalence of preterm births (24 055 White patients [7.6%] vs 22 051 Black patients [10.1%]; standardized difference, 0.1), preeclampsia (15 529 White patients [4.9%] vs 15 841 Black patients [7.3%]; standardized difference, 0.1), depression (55 543 White patients [17.6%] vs 23 175 Black patients [10.6%]; standardized difference, 0.2), bipolar disorders (12 660 White patients [4.0%] vs 5199 Black patients [2.4%]; standardized difference, 0.1), opioid use disorders (13 499 White patients [4.3%] vs 1451 Black patients [0.7%]; standardized difference, 0.2), and anxiety-related disorders (40 661 White patients [12.9%] vs 12 005 Black patients [5.5%]; standardized difference, 0.3), as well as in baseline service use levels, which were generally lower among Black women (see eTable 2 in the Supplement for baseline characteristics).
White women were more likely than Black women to be diagnosed with PPD (52 370 White patients [16.6%] vs 15 410 Black patients [7.1%]; standardized difference, 0.3) and to have at least 1 postpartum mental health visit (34 044 White patients [10.8%] vs 12 612 Black patients [5.8%]; standardized difference, 0.2). After adjusting for risk factors, including baseline comorbidity and utilization, White women were twice as likely to be evaluated for and diagnosed with PPD (odds ratio [OR], 2.06; 95% CI, 2.02-2.10) and more likely to use mental health services (OR, 1.37; 95% CI, 1.33-1.40) in the postpartum period (Table 3). To the extent that our model can adjust for confounding, the lower odds among Black women are at least partly associated with underlying disparity, based on prior evidence.
Prediction and debiasing performance differed across models, methods, and outcomes (eTable 3 in the Supplement). The trained LR classifier had moderate test set balanced accuracy for predicting PPD development (0.73) and mental health service utilization (0.78). Before reweighing, PPD predictions from the base model had a DI of 0.31 and an EOD of −0.19 (Figure 1). After removing race, the DI value was 0.61 and the EOD value was −0.05; after reweighing (with race in the model), DI improved to 0.79 and EOD improved to 0.02. Similarly, for predicting mental health service use, the base model had a DI of 0.45 and an EOD of −0.11. Removing race from the model improved these metrics to 0.63 for DI and −0.04 for EOD, whereas reweighing improved them to 0.85 and −0.02, respectively. We underscore that debiasing through reweighing did not compromise model accuracy (balanced accuracy of 0.72 for PPD and 0.78 for utilization prediction). In contrast, Prejudice Remover had lower balanced accuracy (0.68 for PPD and 0.72 for utilization prediction) and, although it achieved DI values comparable to those of reweighing, performed worse on EOD. Results from XGB and RF were qualitatively similar (eFigure 3 and eFigure 4 in the Supplement, respectively). Race, the fourth most important feature in the XGB model for PPD prediction, was no longer among the 10 most important features after reweighing (eFigure 5 in the Supplement).
In the resource-limited deployment scenario, the pretrained LR model with reweighing retained the level of balanced accuracy (0.73) with improved metrics (DI = 0.68; EOD = −0.02) compared with the base model (accuracy = 0.74; DI = 0.32; EOD = −0.20), reducing the gap in the proportion of predicted positive outcomes from 12.6% to 5.3% (Figure 2). In the high-risk subgroup predicted by the base model, Black women appeared to have worse pregnancy outcomes and higher comorbidity (eTable 4 in the Supplement). Debiased LR with reweighing added 1431 Black women to the high-risk subgroup and reduced differences in most risk factors examined.
Discussion
Our study represents a systematic effort to assess and mitigate racial bias using observational data. We found that in a Medicaid claims-based pregnancy cohort, White race was associated with greater PPD incidence and mental health service use after adjusting for confounding factors. Comparable PPD rates across races in population surveys suggest that the increased rates likely reflect underlying disparities in timely evaluation, screening, and symptom detection among Black women. Machine learning models trained on these data produced unequal predictions for both outcomes according to the chosen fairness metrics and favored White women, who are already at an advantage for diagnosis and treatment. Among those predicted to be at similarly high risk, Black women appeared to have worse health status than White women. Simply disregarding race in model training was inferior, in most aspects, to applying algorithmic debiasing methods, likely because of the presence of correlated variables.
Obermeyer et al7 explained that possible mechanisms of bias for cost outcome are differential access to care and “direct discrimination,” which was mostly resolved by relabeling. Medicaid enrollment means that most women in our study had equal access to health care in terms of insurance coverage. Differences in structural barriers to care, perceptions, expectations, and health system interactions in perinatal and mental health care are the more likely sources of disparity.
Our study showed that selection of debiasing method and fairness metric is associated with differences in the results. Reweighing improved metrics without compromising accuracy, whereas Prejudice Remover performed less reliably. Interestingly, the latter performed worse with EOD, which measures the difference in true-positive rate. It is possible that masking the race variable during the model fitting process succeeded in improving only one type of metric, not both. On the basis of the results, we prefer reweighing because it preserves the original training data, unlike some other methods, and performs well across experimental settings. Whether this holds true in other cases should be examined in future research. There is no universal measure of fairness, and existing metrics often focus on specific scenarios. It is also well known that some metrics conflict with one another and are impossible to satisfy simultaneously.24,25 Therefore, the choice of metric should be context specific and cost sensitive to different error types and trade-offs between sensitivity and specificity. Metrics focused on true-positive rate would be appropriate for screening algorithms because they put a higher penalty on missing potential cases. For algorithms classifying patients for invasive procedures, one may want to weigh false-positive results more heavily to prevent harm.
We assume no causal effect of race on outcomes other than through racial disparities.16 This assumption may not hold if there are mechanisms that predispose Black women to a lower probability of PPD or if residual confounding can explain the different rates. Prior evidence is inconclusive,16,26-28 but overall indicates equal or higher prevalence of PPD symptoms among Black women. Complete knowledge of causal structures and data generation would be required to rule out confounding, but it is unlikely that a factor uncorrelated with health disparity could independently account for the observed association after risk adjustment. We additionally assume that equal proportions of the predicted positive outcome indicate fairness (DI = 1). This assumption would be false if other factors could explain the difference in the probability of the predicted label.29 In practice, it is challenging to predefine such factors without generating a self-fulfilling prophecy. For example, different PPD rates from a predictor such as location can wrongfully justify a DI value greater than 1 if that predictor reflects underlying disparity.
Importantly, we have limited insight into the reliability of the race variable and what it captures, a prerequisite for using it in predictive models without worsening bias. We focused on race because pregnant women enrolled in Medicaid share similar demographic and socioeconomic backgrounds, but the logic easily extends to other attributes used in clinical predictions. For example, without knowing what gender represents (biological vs self-identified) and the causal pathways between gender and the target outcome, models using gender can behave unexpectedly. It is important to understand the interplay between race and factors outside of medicine, and how this contributes to medical findings used to build well-meaning but potentially erroneous algorithms. Considering these limitations, our approach should be viewed not as a yardstick for fairness but as a hypothesis-generating tool to detect algorithmic bias.
It is imperative to acknowledge that data sources and algorithms represent just one dimension of bias, because overreliance on technical metrics can bring unexpected consequences.13 Prediction algorithms almost always intend to influence the clinical decision-making process, and the biases of those developing and using them may have a greater impact on whether they ultimately perpetuate inequality.30,31 For example, a clinician may recognize the potential disparity in diagnosis and intentionally compensate for the algorithmic bias. If the model is then debiased without this information, the result could be a reversal of bias. The impact of human behavior and clinical judgment must be taken into consideration.
Strengths and Limitations
Our study is not without limitations. Although MarketScan data cover a wide range of US health systems, findings from the Medicaid claims data may not be generalizable to non-Medicaid populations. However, the ability to study postpartum mental health in a vulnerable population is a key strength of this study, and this population has less variability in socioeconomic and health care access–related factors than a commercial insurance cohort, reducing confounding concerns. Confirmatory studies in other populations, including non-Medicaid or commercially insured patients, would be desirable. It is, however, important to note that publicly available data sets often lack race and ethnicity information. Because of limited postpartum Medicaid coverage, follow-up was only 60 days. Evidence suggests that the majority of PPD cases are identified within 6 weeks of delivery, but some are captured later or via maternal depressive symptom screening during pediatric visits and thus are missing from our data. Claims data lack clinical details and are known for incomplete and imprecise coding. Differential misclassification of diagnoses by race could potentially alter the study conclusions but cannot be verified with available resources. This study examined only the data and algorithmic aspects of bias. A follow-up study incorporating human decisions and patient outcomes is needed to show that debiasing indeed reduces disparity. In addition to the possible violation of assumptions discussed already, the fairness metrics we examined are not causal in nature. Causal interpretation of fairness is an active field of study that we plan to explore further.
Our work makes several contributions to the field of clinical algorithmic fairness as one of the first few systematic analyses of bias assessment and mitigation for clinical prediction algorithms.7,32 Our results also suggest that algorithms trained and debiased on historical data can have comparable performance in a prospective deployment setting, albeit a hypothetical one. Rather than using all features in the data, we use carefully curated features based on recommendations by researchers in the field.5 We rely on real-world data instead of simulated or synthetic data, to provide a clinically meaningful application and practical case study.
Conclusions
Observational health data are increasingly used for machine learning prediction models. In addition to the usual concerns of noncausal associations and selection bias, algorithmic bias should be better recognized and addressed in both research and applications. Proper use of algorithmic debiasing can help clinical researchers use machine learning models more effectively and more fairly.
Article Information
Accepted for Publication: February 9, 2021.
Published: April 15, 2021. doi:10.1001/jamanetworkopen.2021.3909
Open Access: This is an open access article distributed under the terms of the CC-BY-NC-ND License. © 2021 Park Y et al. JAMA Network Open.
Corresponding Author: Yoonyoung Park, ScD, Center for Computational Health, IBM Research, 75 Binney St, Cambridge, MA 02142 (email@example.com).
Author Contributions: Dr Park had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Park, Hu, Singh, Das.
Acquisition, analysis, or interpretation of data: Park, Hu, Singh, Sylla, Dankwa-Mullan, Koski.
Drafting of the manuscript: Park, Singh, Sylla, Das.
Critical revision of the manuscript for important intellectual content: Park, Hu, Singh, Dankwa-Mullan, Koski, Das.
Statistical analysis: Park, Singh.
Administrative, technical, or material support: Hu, Das.
Supervision: Hu, Das.
Conflict of Interest Disclosures: None reported.
References
Verma S, Rubin J. Fairness definitions explained. FairWare '18: Proceedings of the International Workshop on Software Fairness. Published May 2018. Accessed February 26, 2021. doi:10.1145/3194770.3194776
Friedler SA, Scheidegger C, Venkatasubramanian S, Choudhary S, Hamilton EP, Roth D. A comparative study of fairness-enhancing interventions in machine learning. FAT* '19: Proceedings of the Conference on Fairness, Accountability, and Transparency. 2019:329-338. doi:10.1145/3287560.3287589
Gress-Smith JL, Luecken LJ, Lemery-Chalfant K, Howe R. Postpartum depression prevalence and impact on infant health, weight, and sleep in low-income and ethnic minority women and infants. Matern Child Health J. 2012;16(4):887-893. doi:10.1007/s10995-011-0812-y
Margulis AV, Setoguchi S, Mittleman MA, Glynn RJ, Dormuth CR, Hernández-Díaz S. Algorithms to estimate the beginning of pregnancy in administrative databases. Pharmacoepidemiol Drug Saf. 2013;22(1):16-24. doi:10.1002/pds.3284
Institute of Medicine Committee on Understanding and Eliminating Racial and Ethnic Disparities in Health Care. Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. National Academies Press; 2003.
Feldman M, Friedler SA, Moeller J, Scheidegger C, Venkatasubramanian S. Certifying and removing disparate impact. KDD '15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. August 10-13, 2015; Sydney, Australia. doi:10.1145/2783258.2783311
Hardt M, Price E, Srebro N. Equality of opportunity in supervised learning. arXiv. Published October 7, 2016. Accessed February 26, 2021. https://arxiv.org/abs/1610.02413
Calders T, Kamiran F, Pechenizkiy M. Building classifiers with independency constraints. ICDM Workshops: IEEE International Conference on Data Mining. 2009:13-18; Miami, Florida.
Kamishima T, Akaho S, Asoh H, Sakuma J. Fairness-aware classifier with prejudice remover regularizer. In: Flach PA, De Bie T, Cristianini N, eds. Machine Learning and Knowledge Discovery in Databases: ECML PKDD 2012. Lecture Notes in Computer Science, vol 7524. Springer; 2012.
Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R. Fairness through awareness. Paper presented at: 3rd Innovations in Theoretical Computer Science Conference; January 8-10, 2012; Cambridge, MA. Accessed February 26, 2021. https://dl.acm.org/doi/10.1145/2090236.2090255
Makhlouf K, Zhioua S, Palamidessi C. On the applicability of ML fairness notions. arXiv. Published June 30, 2020. Accessed February 26, 2021. https://arxiv.org/abs/2006.16745
Liu CH, Tronick E. Rates and predictors of postpartum depression by race and ethnicity: results from the 2004 to 2007 New York City PRAMS survey (Pregnancy Risk Assessment Monitoring System). Matern Child Health J. 2013;17(9):1599-1610. doi:10.1007/s10995-012-1171-z
Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A. Algorithmic decision making and the cost of fairness. Paper presented at: 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13-17, 2017; Halifax, Nova Scotia. Accessed February 26, 2021. https://dl.acm.org/doi/10.1145/3097983.3098095
Singh M, Ramamurthy KN. Understanding racial bias in health using the Medical Expenditure Panel Survey data. arXiv. Published November 4, 2019. Accessed March 1, 2021. https://arxiv.org/abs/1911.01509