Comparison of Methods to Reduce Bias From Clinical Prediction Models of Postpartum Depression | JAMA Network Open
Figure 1.  Comparison of Bias Metrics Before and After Debiasing

Comparison of bias metrics in the test data set using base model, model without race variable, debiased model through reweighing, and debiased model through Prejudice Remover (logistic regression). The reference value for unbiasedness is 1.0 for disparate impact (A) and 0 for equal opportunity difference (B).

Figure 2.  Number of Patients With Predicted High Risk of Postpartum Depression in Deployment Data Set Before and After Debiasing

The logistic regression classifiers trained, validated, and tested on 2014 to 2017 data (532 802 patients) were deployed on a new set of data from 2018 (80 136 patients). Assuming that patient care resources are available for 10% of the total patient population, the model classifies 7166 White women (15.2%) and 848 Black women (2.6%) as high risk before debiasing. After debiasing (reweighing), 5735 White women (12.2%) and 2279 Black women (6.9%) are classified as high risk.

Table 1.  Concepts and Terminology in Algorithmic Fairness Research
Table 2.  Selected Characteristics of Pregnant Women Enrolled in Medicaid (2014-2018)
Table 3.  Age-Adjusted and Fully Adjusted ORs for White vs Black Participants
1. Angwin J, Larson J, Mattu S, Kirchner L. Machine bias. ProPublica. May 23, 2016. Accessed July 31, 2020. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
2. Koenecke A, Nam A, Lake E, et al. Racial disparities in automated speech recognition. Proc Natl Acad Sci U S A. 2020;117(14):7684-7689. doi:10.1073/pnas.1915768117
3. Datta A, Tschantz MC. Automated experiments on ad privacy settings. Proc Priv Enh Technol. 2015;1:92-112. doi:10.1515/popets-2015-0007
4. Buolamwini J, Gebru T. Gender shades: intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research. Published 2018. Accessed February 26, 2021. http://proceedings.mlr.press/v81/buolamwini18a.html
5. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med. 2018;178(11):1544-1547. doi:10.1001/jamainternmed.2018.3763
6. Vyas DA, Eisenstein LG, Jones DS. Hidden in plain sight—reconsidering the use of race correction in clinical algorithms. N Engl J Med. 2020;383(9):874-882. doi:10.1056/NEJMms2004740
7. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
8. Verma S, Rubin J. Fairness definitions explained. FairWare '18: Proceedings of the International Workshop on Software Fairness. Published May 2018. Accessed February 26, 2021. doi:10.1145/3194770.3194776
9. Friedler SA, Choudhary S, Scheidegger C, Hamilton EP, Venkatasubramanian S, Roth D. A comparative study of fairness-enhancing interventions in machine learning. FAT* '19: Proceedings of the Conference on Fairness, Accountability, and Transparency. 2019:329-338. doi:10.1145/3287560.3287589
10. Bellamy RKE, Mojsilovic A, Nagar S, et al. AI Fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM J Res Dev. 2019;63(4-5):4:1-4:15. doi:10.1147/JRD.2019.2942287
11. Menon AK, Williamson RC. The cost of fairness in classification. arXiv. Published May 25, 2017. Accessed February 26, 2021. https://arxiv.org/abs/1705.09055
12. Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring fairness in machine learning to advance health equity. Ann Intern Med. 2018;169(12):866-872. doi:10.7326/M18-1990
13. McCradden MD, Joshi S, Mazwi M, Anderson JA. Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020;2(5):e221-e223. doi:10.1016/S2589-7500(20)30065-0
14. Wisner KL, Chambers C, Sit DK. Postpartum depression: a major public health problem. JAMA. 2006;296(21):2616-2618. doi:10.1001/jama.296.21.2616
15. Gress-Smith JL, Luecken LJ, Lemery-Chalfant K, Howe R. Postpartum depression prevalence and impact on infant health, weight, and sleep in low-income and ethnic minority women and infants. Matern Child Health J. 2012;16(4):887-893. doi:10.1007/s10995-011-0812-y
16. Kozhimannil KB, Trinacty CM, Busch AB, Huskamp HA, Adams AS. Racial and ethnic disparities in postpartum depression care among low-income women. Psychiatr Serv. 2011;62(6):619-625. doi:10.1176/ps.62.6.pss6206_0619
17. Margulis AV, Setoguchi S, Mittleman MA, Glynn RJ, Dormuth CR, Hernández-Díaz S. Algorithms to estimate the beginning of pregnancy in administrative databases. Pharmacoepidemiol Drug Saf. 2013;22(1):16-24. doi:10.1002/pds.3284
18. Institute of Medicine Committee on Understanding and Eliminating Racial and Ethnic Disparities in Health Care. Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. National Academies Press; 2003.
19. VanderWeele TJ, Robinson WR. On the causal interpretation of race in regressions adjusting for confounding and mediating variables. Epidemiology. 2014;25(4):473-484. doi:10.1097/EDE.0000000000000105
20. Feldman M, Friedler SA, Moeller J, Scheidegger C, Venkatasubramanian S. Certifying and removing disparate impact. KDD '15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 10-13, 2015; Sydney, Australia. doi:10.1145/2783258.2783311
21. Hardt M, Price E, Srebro N. Equality of opportunity in supervised learning. arXiv. Published October 7, 2016. Accessed February 26, 2021. https://arxiv.org/abs/1610.02413
22. Calders T, Kamiran F, Pechenizkiy M. Building classifiers with independency constraints. ICDM Workshops: IEEE International Conference on Data Mining; 2009; Miami, Florida:13-18.
23. Kamishima T, Akaho S, Asoh H, Sakuma J. Fairness-aware classifier with prejudice remover regularizer. In: Flach PA, De Bie T, Cristianini N, eds. Machine Learning and Knowledge Discovery in Databases: ECML PKDD 2012. Lecture Notes in Computer Science, vol 7524. Springer; 2012.
24. Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R. Fairness through awareness. Paper presented at: 3rd Innovations in Theoretical Computer Science Conference; January 8-10, 2012; Cambridge, MA. Accessed February 26, 2021. https://dl.acm.org/doi/10.1145/2090236.2090255
25. Makhlouf K, Zhioua S, Palamidessi C. On the applicability of ML fairness notions. arXiv. Published June 30, 2020. Accessed February 26, 2021. https://arxiv.org/abs/2006.16745
26. Howell EA, Mora PA, Horowitz CR, Leventhal H. Racial and ethnic differences in factors associated with early postpartum depressive symptoms. Obstet Gynecol. 2005;105(6):1442-1450. doi:10.1097/01.AOG.0000164050.34126.37
27. Gavin AR, Melville JL, Rue T, Guo Y, Dina KT, Katon WJ. Racial differences in the prevalence of antenatal depression. Gen Hosp Psychiatry. 2011;33(2):87-93. doi:10.1016/j.genhosppsych.2010.11.012
28. Liu CH, Tronick E. Rates and predictors of postpartum depression by race and ethnicity: results from the 2004 to 2007 New York City PRAMS survey (Pregnancy Risk Assessment Monitoring System). Matern Child Health J. 2013;17(9):1599-1610. doi:10.1007/s10995-012-1171-z
29. Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A. Algorithmic decision making and the cost of fairness. Paper presented at: 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13-17, 2017; Halifax, Nova Scotia. Accessed February 26, 2021. https://dl.acm.org/doi/10.1145/3097983.3098095
30. Kusner MJ, Loftus JR. The long road to fairer algorithms. Nature. 2020;578(7793):34-36. doi:10.1038/d41586-020-00274-3
31. Veinot TC, Mitchell H, Ancker JS. Good intentions are not enough: how informatics interventions can worsen inequality. J Am Med Inform Assoc. 2018;25(8):1080-1088. doi:10.1093/jamia/ocy052
32. Singh M, Ramamurthy KN. Understanding racial bias in health using the Medical Expenditure Panel Survey data. arXiv. Published November 4, 2019. Accessed March 1, 2021. https://arxiv.org/abs/1911.01509
    Original Investigation
    Health Informatics
    April 15, 2021

    Comparison of Methods to Reduce Bias From Clinical Prediction Models of Postpartum Depression

    Author Affiliations
    • 1Center for Computational Health, IBM Research, Cambridge, Massachusetts
    • 2Center for Computational Health, IBM TJ Watson Research Center, Yorktown Heights, NY
    • 3IBM Research, IBM TJ Watson Research Center, Yorktown Heights, NY
    • 4IBM Watson Health, Cambridge, Massachusetts
    JAMA Netw Open. 2021;4(4):e213909. doi:10.1001/jamanetworkopen.2021.3909
    Key Points

    Question  How does the performance of different methods to reduce bias for clinical prediction algorithms compare when measured by disparate impact and equal opportunity difference?

    Findings  In a cohort study of 314 903 White and 217 899 Black female pregnant individuals with Medicaid coverage, application of a reweighing method was associated with a greater reduction in algorithmic bias for postpartum depression and mental health service utilization prediction between White and Black individuals than simply excluding race from the prediction models.

    Meaning  Researchers should examine clinical prediction models for bias stemming from the underlying data and consider methods to mitigate the bias.

    Abstract

    Importance  The lack of standards in methods to reduce bias for clinical algorithms presents various challenges in providing reliable predictions and in addressing health disparities.

    Objective  To evaluate approaches for reducing bias in machine learning models using a real-world clinical scenario.

    Design, Setting, and Participants  Health data for this cohort study were obtained from the IBM MarketScan Medicaid Database. Eligibility criteria were as follows: (1) Female individuals aged 12 to 55 years with a live birth record identified by delivery-related codes from January 1, 2014, through December 31, 2018; (2) greater than 80% enrollment through pregnancy to 60 days post partum; and (3) evidence of coverage for depression screening and mental health services. Statistical analysis was performed in 2020.

    Exposures  Binarized race (Black individuals and White individuals).

    Main Outcomes and Measures  Machine learning models (logistic regression [LR], random forest, and extreme gradient boosting) were trained for 2 binary outcomes: postpartum depression (PPD) and postpartum mental health service utilization. Risk-adjusted generalized linear models were used for each outcome to assess potential disparity in the cohort associated with binarized race (Black or White). Methods for reducing bias, including reweighing, Prejudice Remover, and removing race from the models, were examined by analyzing changes in fairness metrics compared with the base models. Baseline characteristics of female individuals at the top-predicted risk decile were compared for systematic differences. Fairness metrics were disparate impact (DI; 1 indicates fairness) and equal opportunity difference (EOD; 0 indicates fairness).

    Results  Among 573 634 female individuals initially examined for this study, 314 903 were White (54.9%), 217 899 were Black (38.0%), and the mean (SD) age was 26.1 (5.5) years. The risk-adjusted odds ratio comparing White participants with Black participants was 2.06 (95% CI, 2.02-2.10) for clinically recognized PPD and 1.37 (95% CI, 1.33-1.40) for postpartum mental health service utilization. Taking the LR model for PPD prediction as an example, reweighing reduced bias as measured by improved DI and EOD metrics from 0.31 and −0.19 to 0.79 and 0.02, respectively. Removing race from the models had inferior performance for reducing bias compared with the other methods (PPD: DI = 0.61; EOD = −0.05; mental health service utilization: DI = 0.63; EOD = −0.04).

    Conclusions and Relevance  Clinical prediction models trained on potentially biased data may produce unfair outcomes on the basis of the chosen metrics. This study’s results suggest that the performance varied depending on the model, outcome label, and method for reducing bias. This approach toward evaluating algorithmic bias can be used as an example for the growing number of researchers who wish to examine and address bias in their data and models.

    Introduction

    Prediction models are increasingly used in clinical decision-making to inform personalized or precision medicine. However, recent studies of machine learning and artificial intelligence applications reveal algorithmic biases that have substantial consequences for many people,1-4 particularly adverse effects on the health and well-being of racial and ethnic minorities.5-7 Bias in quantitative health care research refers to noncausal associations, skewed population selection, or statistical estimation errors; algorithmic bias usually refers to disparity observed in prediction model outcomes with respect to certain demographic features.

    Researchers have formulated algorithmic fairness definitions and developed bias mitigation methods.8-11 However, gaps between such advances and solutions for algorithmic bias in clinical prediction persist because of technical difficulties, the complexities of high-dimensional health data, lack of knowledge of underlying causal structures, and challenges to algorithm appraisal.12,13 Few examples in health care to date use methods to reduce bias. Influential work by Obermeyer et al7 suggests a relabeling approach, replacing erroneous or biased target outcomes of machine learning models with alternatives. Although relabeling is a relatively simple and effective solution, the authors caution that it requires deep domain knowledge and the ability to iteratively assess models, which may not be feasible if the prediction target is unmodifiable or the alternative outcome measure is not widely recorded.

    In this work, we used postpartum depression (PPD) to assess fairness and bias mitigation methods in clinical prediction. PPD affects 1 in 7 women who give birth, and early detection has substantial implications for maternal and child health.14 Incidence is higher among women with low socioeconomic status, such as Medicaid enrollees.15 Despite prior evidence indicating similar PPD rates across racial and ethnic groups, underdiagnosis and undertreatment have been observed among minorities on Medicaid.16 Varying rates of reported PPD reflect the complex dynamics of perceived stigma, cultural differences, patient-physician relationships, and clinical needs in minority populations. The objective of this study of PPD was to examine data and machine learning prediction models for bias and to introduce and discuss assumptions and challenges of bias assessment and mitigation.

    Methods
    Data and Study Population

    The cohort was constructed using the IBM MarketScan Medicaid Database (2014-2018), containing deidentified, individual-level claim records from approximately 7 million Medicaid enrollees in multiple states. Eligible participants were female individuals aged 12 to 55 years with a live birth record (diagnosis and procedure codes in eTable 1 in the Supplement). We required greater than 80% enrollment for the duration of pregnancy (273 days for full-term births or 245 days for preterm births)17 plus 60 days postdelivery, determined by hospital discharge date or earliest claim date with a delivery code for nonhospital births. The rationale for 80% enrollment is that not all women who became eligible for Medicaid were enrolled from the beginning of pregnancy. We excluded patients with dual eligibility or without mental health coverage and limited the final cohort to the first eligible pregnancies from White and Black female participants (eFigure 1 in the Supplement). This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline. This study was conducted using a preexisting, deidentified claims database and, as such, did not require institutional review board approval or informed patient consent, in accordance with 45 CFR §46.

    Experiment Setting

    Two sets of models with different outcomes were trained. The first set predicted clinically recognized PPD (recorded International Classification of Diseases codes for major depressive disorders, other mood disorders, or PPD) or a filled antidepressant prescription in the 60-day postpartum period. The second set predicted postpartum mental health service utilization (defined as ≥1 inpatient or outpatient encounter with a mental health primary diagnosis). Although the second outcome is likely associated with the first, we anticipated that utilization would capture aspects of bias not directly associated with the diagnosis or treatment of PPD. After assessing the data for the presence of bias, we applied methods for reducing bias to these 2 prediction models.

    Bias Assessment

    We use the term bias to refer to the disparity observed in both underlying data and prediction model outcomes trained with the data. We define disparity as discrepancies in measures of interest unexplained by clinical need, in line with the Institute of Medicine’s definition.18 Risk factor–adjusted associations between race and outcomes were assessed, and persistent discrepancy after adjustment was considered to reflect disparity, following the interpretation in prior work.19

    We introduce fairness concepts for algorithmic bias assessment in Table 1. In this study, race is the protected attribute, with White being the privileged value as opposed to Black. The favorable label is a predicted positive outcome for either PPD or mental health service use because we assume it will lead to access to limited care resources. We assess group fairness rather than individual fairness, meaning that certain statistical measures should be equivalent between groups in a fair world. The variety of fairness definitions means that researchers should select the most appropriate way to quantify bias given the study context. We focus on 2 metrics used for binary classification that capture different aspects of bias. Disparate impact (DI) is the ratio of the mean predicted favorable outcome in the unprivileged group to that in the privileged group.20 Although conceptually simple, DI is limited because a DI value of 1 does not necessarily represent fairness; for example, comparable DI values close to 1 achieved at different accuracies can lead to an unfair distribution of benefits by producing more errors in one group than in the other. Equal opportunity difference (EOD) attempts to address this by comparing true-positive rates (sensitivity).21 We chose to compare true-positive rather than false-positive rates because a positive outcome is more desirable in our study.
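Both metrics can be computed directly from model outputs. The following is a minimal sketch in Python (the study's analysis language); the function names and the boolean encoding of the privileged group are illustrative assumptions, not the study's actual code.

```python
import numpy as np

def disparate_impact(y_pred, privileged):
    """Ratio of mean predicted-favorable-outcome rates: unprivileged / privileged.

    privileged: boolean array, True for the privileged group (here, White).
    A value of 1.0 indicates parity; values below 1 mean the unprivileged
    group is predicted positive less often.
    """
    return y_pred[~privileged].mean() / y_pred[privileged].mean()

def equal_opportunity_difference(y_true, y_pred, privileged):
    """Difference in true-positive rates (sensitivity): unprivileged - privileged.

    0 indicates parity; negative values mean true cases in the unprivileged
    group are detected less often.
    """
    tpr_unpriv = y_pred[(~privileged) & (y_true == 1)].mean()
    tpr_priv = y_pred[privileged & (y_true == 1)].mean()
    return tpr_unpriv - tpr_priv
```

For binary 0/1 predictions, the mean over each group is exactly the predicted-positive rate (for DI) or the true-positive rate when restricted to actual cases (for EOD).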

    Bias Mitigation

    Algorithmic debiasing—mathematical techniques for bias mitigation—can be classified into 3 types (preprocessing, in-processing, and postprocessing) depending on where in the modeling pipeline they act. In these methods, bias generally refers to the statistical association between the protected attribute and the predicted outcome. Our approach was 3-fold (Table 1). First, we used reweighing,22 a preprocessing method that applies a different weight to each group-label combination according to the conditional probability of the label given race. This can be considered a cost-sensitive learning method in which the error for certain group-label combinations becomes more expensive through greater weight. Second, we used an in-processing method called Prejudice Remover for logistic regression.23 Here, prejudice likewise refers to the statistical association between the protected attribute and the outcome; by adding a regularization term to the objective function, the race feature becomes less influential in the final prediction. Finally, for comparison, we trained the models without the race variable, an approach known as fairness through unawareness. This technique was used in the biased commercial algorithm for population health management.7
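As an illustration of the reweighing step, the following is a minimal sketch of the weight formula from the reweighing method22 (the function name and array encoding are hypothetical; this is not the study's actual implementation). Each instance in group g with label y receives weight P(G=g)·P(Y=y)/P(G=g, Y=y), so that group and label are statistically independent in the weighted training data.

```python
import numpy as np

def reweigh_weights(group, label):
    """Instance weights w(g, y) = P(G=g) * P(Y=y) / P(G=g, Y=y).

    Underrepresented group-label cells (observed less often than expected
    under independence) receive weights > 1, making their errors costlier.
    """
    group = np.asarray(group)
    label = np.asarray(label)
    w = np.zeros(len(group), dtype=float)
    for g in np.unique(group):
        for y in np.unique(label):
            cell = (group == g) & (label == y)
            if cell.any():
                # expected cell probability under independence / observed cell probability
                w[cell] = (group == g).mean() * (label == y).mean() / cell.mean()
    return w
```

The resulting weights can then be supplied as sample weights when fitting a classifier (e.g., the sample_weight argument accepted by scikit-learn estimators), which is what makes this a form of cost-sensitive learning.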

    We also attempted a relabeling approach suggested in previous work7 by using an alternative outcome that is less sensitive to racial bias. Because of the highly correlated nature of health care data and low prevalence of candidate labels such as emergency mental health visits, we were unable to find a neutral target regarding race, so we applied the 3 methods described already to the secondary outcome of mental health utilization.

    Statistical Analysis

    Descriptive statistics were used to compare population characteristics. A generalized linear model with logit link was used to obtain adjusted odds ratio (OR) for race with respect to outcomes. For prediction tasks, 2014 to 2017 data were split into train, validation, and test sets in a ratio of 5:3:2. Three commonly used classifiers were fit: logistic regression (LR), random forest (RF), and extreme gradient boosting (XGB). Features included demographic characteristics, pregnancy outcomes, baseline comorbidities, medication use, and health care utilization. Base models, models without race, and models trained on reweighed data were compared for balanced accuracy and fairness metrics. We report on balanced accuracy rather than area under the curve because, in practice, classification will be performed at a fixed cutoff. We present results from LR below and results from XGB and RF in the Supplement. Finally, the trained LR classifier for PPD was evaluated using 2018 data as a hypothetical deployment scenario limited to high-risk patients in the top decile of predicted risk. We compared the characteristics of these high-risk participants to identify systemic differences by race. eFigure 2 in the Supplement illustrates the overall approach.
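The pipeline above can be sketched end to end with scikit-learn on synthetic data; the feature construction, cohort size, and random seed below are illustrative assumptions, not the study's actual cohort or covariates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # stand-in features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # stand-in outcome label

# 5:3:2 train/validation/test split
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)

# Fit LR; a reweighed run would pass the weights via fit(..., sample_weight=w)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
bal_acc = balanced_accuracy_score(y_test, clf.predict(X_test))

# Deployment scenario: flag the top decile of predicted risk as high risk
risk = clf.predict_proba(X_test)[:, 1]
high_risk = risk >= np.quantile(risk, 0.9)
```

Balanced accuracy (the mean of sensitivity and specificity) is evaluated at a fixed classification cutoff, matching the rationale given above for preferring it over area under the curve.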

    The significance threshold was set at P < .05, but there was no explicit hypothesis testing. Statistical analysis was performed using Python 3.7 (Python Software Foundation) in 2020.

    Results

    Among 573 634 female individuals initially examined for this study, the mean (SD) age was 26.1 (5.5) years, and eligibility criteria yielded 314 903 White participants (54.9%) and 217 899 Black participants (38.0%). Both groups had similar mean (SD) age (White participants: 26.0 [5.4] years; Black participants: 26.2 [5.5] years) and enrollment in managed Medicaid, but White women had higher prevalence of psychiatric comorbidity and Black women had worse pregnancy outcomes over the study period (Table 2), consistent with published findings. Notable differences were observed in baseline prevalence of preterm births (24 055 White patients [7.6%] vs 22 051 Black patients [10.1%]; standardized difference, 0.1), preeclampsia (15 529 White patients [4.9%] vs 15 841 Black patients [7.3%]; standardized difference, 0.1), depression (55 543 White patients [17.6%] vs 23 175 Black patients [10.6%]; standardized difference, 0.2), bipolar disorders (12 660 White patients [4.0%] vs 5199 Black patients [2.4%]; standardized difference, 0.1), opioid use disorders (13 499 White patients [4.3%] vs 1451 Black patients [0.7%]; standardized difference, 0.2), anxiety-related disorders (40 661 White patients [12.9%] vs 12 005 Black patients [5.5%]; standardized difference, 0.3), and in baseline service use levels that were generally lower among Black women (see eTable 2 in the Supplement for baseline characteristics).

    White women were more likely than Black women to be diagnosed with PPD (52 370 White patients [16.6%] vs 15 410 Black patients [7.1%]; standardized difference, 0.3) and to have at least 1 postpartum mental health visit (34 044 White patients [10.8%] vs 12 612 Black patients [5.8%]; standardized difference, 0.2). After adjusting for risk factors, including baseline comorbidity and utilization, White women were twice as likely to be evaluated for and diagnosed with PPD (odds ratio [OR], 2.06; 95% CI, 2.02-2.10) and more likely to use mental health services (OR, 1.37; 95% CI, 1.33-1.40) in the postpartum period (Table 3). To the extent that our model can adjust for confounding, the lower odds among Black women at least partly reflect underlying disparity, consistent with prior evidence.

    Prediction and debiasing performance differed across models, debiasing methods, and outcomes (eTable 3 in the Supplement). The trained LR classifier had moderate test set balanced accuracy for predicting PPD development (0.73) and mental health service utilization (0.78). Before reweighing, PPD predictions from the base model had a DI of 0.31 and an EOD of −0.19 (Figure 1). After removing race, the DI value was 0.61 and the EOD value was −0.05; after reweighing (with race in the model), DI improved to 0.79 and EOD improved to 0.02. Similarly, for predicting mental health service use, the base model had a DI of 0.45 and an EOD of −0.11; removing race from the model improved these metrics to 0.63 and −0.04, respectively, and reweighing improved them to 0.85 and −0.02. Notably, debiasing through reweighing did not compromise model accuracy (balanced accuracy of 0.72 for PPD and 0.78 for utilization prediction). In contrast, Prejudice Remover had lower balanced accuracy (0.68 for PPD and 0.72 for utilization prediction) and, although its DI was comparable to that achieved with reweighing, its EOD was worse. Results from XGB and RF were qualitatively similar (eFigure 3 and eFigure 4 in the Supplement, respectively). Race, the fourth most important feature in the XGB model for PPD prediction, was no longer among the 10 most important features after reweighing (eFigure 5 in the Supplement).

    In the resource-limited deployment scenario, the pretrained LR model with reweighing retained its balanced accuracy (0.73) with improved fairness metrics (DI = 0.68; EOD = −0.02) compared with the base model (accuracy = 0.74; DI = 0.32; EOD = −0.20), reducing the gap in the proportion of predicted positive outcomes from 12.6% to 5.3% (Figure 2). In the high-risk subgroup predicted by the base model, Black women appeared to have worse pregnancy outcomes and higher comorbidity (eTable 4 in the Supplement). The debiased LR model with reweighing added 1431 Black women to the high-risk subgroup and reduced differences in most risk factors examined.

    Discussion

    Our study represents a systematic effort to assess and mitigate racial bias using observational data. We found that in a Medicaid claims-based pregnancy cohort, White race was associated with greater PPD incidence and mental health service use after adjusting for confounding factors. Comparable PPD rates across races in population surveys suggest that the increased rates likely reflect underlying disparities in timely evaluation, screening, and symptom detection among Black women. Machine learning models trained on these data produced unequal predictions for both outcomes according to the chosen fairness metrics and favored White women, who are already at an advantage for diagnosis and treatment. Among those predicted to be at similarly high risk, Black women appeared to have worse health status than White women. Simply disregarding race in model training was inferior, in most respects, to applying algorithmic debiasing methods, likely because of the presence of associated variables.

    Obermeyer et al7 explained that possible mechanisms of bias for cost outcome are differential access to care and “direct discrimination,” which was mostly resolved by relabeling. Medicaid enrollment means that most women in our study had equal access to health care in terms of insurance coverage. Differences in structural barriers to care, perceptions, expectations, and health system interactions in perinatal and mental health care are the more likely sources of disparity.

    Our study showed that selection of debiasing method and fairness metric is associated with differences in the results. Reweighing improved metrics without compromising accuracy, whereas Prejudice Remover performed less reliably. Interestingly, the latter performed worse with EOD, which measures the difference in true-positive rate. It is possible that masking the race variable during the model fitting process succeeded in reducing only one type of metric, not both. On the basis of the results, we prefer reweighing because it preserves the original training data, unlike some other methods, and performs well across experimental settings. Whether this holds true in other cases should be examined in future research. There is no universal measure of fairness, and existing metrics often focus on specific scenarios. It is also well-known that some metrics conflict with one another and are impossible to satisfy simultaneously.24,25 Therefore, the choice of metric should be context specific and cost sensitive to different error types and trade-offs between sensitivity and specificity. Metrics focused on true-positive rate would be appropriate for screening algorithms because they put a higher penalty on missing potential cases. For algorithms classifying patients for invasive procedures, one may want to weigh false-positive results more heavily to prevent harm.

    We assume no causal effect of race on outcomes other than through racial disparities.16 This assumption may not hold if there are mechanisms that predispose Black women to a lower probability of PPD or if residual confounding can explain the different rates. Prior evidence is inconclusive16,26-28 but overall indicates equal or higher prevalence of PPD symptoms among Black women. Ruling out confounding requires complete knowledge of the causal structure and data generation process, but it is unlikely that a factor uncorrelated with health disparity could independently account for the observed association after risk adjustment. We additionally assume that equal proportions of the predicted positive outcome indicate fairness (DI = 1). This assumption would be false if other factors could explain the difference in the probability of the predicted label.29 In practice, it is challenging to predefine such factors without generating a self-fulfilling prophecy. For example, different PPD rates across a predictor such as location could wrongfully justify a DI value greater than 1 if location itself reflects underlying disparity.

    Importantly, we have limited insight into the reliability of the race variable and what it captures, a prerequisite for using it in predictive models without worsening bias. We focused on race because pregnant women enrolled in Medicaid share similar demographic and socioeconomic backgrounds, but the logic extends readily to other attributes used in clinical prediction. For example, without knowing what gender represents (biological vs self-identified) and the causal pathways between gender and the target outcome, models using gender can behave unexpectedly. It is important to understand the interplay between race and factors outside of medicine, and how this interplay shapes the medical findings used to build well-meaning but potentially erroneous algorithms. Given these limitations, our approach should be viewed not as a yardstick for fairness but as a hypothesis-generating tool to detect algorithmic bias.

    It is imperative to acknowledge that data sources and algorithms represent just one dimension of bias, because overreliance on technical metrics can bring unexpected consequences.13 Prediction algorithms almost always intend to influence the clinical decision-making process, and the biases of those developing and using them may have a greater impact on whether the algorithms ultimately perpetuate inequality.30,31 For example, a clinician may recognize the potential disparity in diagnosis and intentionally compensate for the algorithmic bias; if the model is then debiased without accounting for this behavior, the result could be a reversal of bias. The impact of human behavior and clinical judgment must be taken into consideration.

    Strengths and Limitations

    Our study is not without limitations. Although MarketScan data cover a wide range of US health systems, findings from the Medicaid claims data may not be generalizable to non-Medicaid populations. However, the ability to study postpartum mental health in a vulnerable population is a strength of this study, and this population has less variability in socioeconomic and health care access-related factors than a commercially insured cohort, reducing confounding concerns. Confirmatory studies in other populations, including non-Medicaid or commercially insured patients, would be desirable; it is, however, important to note that publicly available data sets often lack race and ethnicity information. Because of limited postpartum Medicaid coverage, follow-up was limited to 60 days. Evidence suggests that the majority of PPD cases are identified within 6 weeks of delivery, but some are captured later or through maternal depressive symptom screening during pediatric visits and are thus missing from our data. Claims data lack clinical detail and are known for incomplete and imprecise coding. Differential diagnostic misclassification by race could potentially alter the study conclusions but cannot be verified with available resources. This study examined only the data and algorithmic aspects of bias; a follow-up study incorporating human decisions and patient outcomes is needed to show that debiasing indeed reduces disparity. In addition to the possible violation of assumptions discussed already, the fairness metrics we examined are not causal in nature. Causal interpretation of fairness is an active field of study that we plan to explore further.

    Our work makes several contributions to the field of clinical algorithmic fairness as one of the first few systematic analyses of bias assessment and mitigation for clinical prediction algorithms.7,32 Our results also suggest that algorithms trained and debiased on historical data can have comparable performance in a prospective deployment setting, albeit a hypothetical one. Rather than using all features in the data, we used carefully curated features based on recommendations by researchers in the field.5 We relied on real-world rather than simulated or synthetic data to provide a clinically meaningful application and practical case study.

    Conclusions

    Observational health data are increasingly used for machine learning prediction models. In addition to the usual concerns of noncausal associations and selection bias, algorithmic bias should be better recognized and addressed in both research and applications. Proper use of algorithmic debiasing can help clinical researchers use machine learning models more effectively and more fairly.

    Article Information

    Accepted for Publication: February 9, 2021.

    Published: April 15, 2021. doi:10.1001/jamanetworkopen.2021.3909

    Open Access: This is an open access article distributed under the terms of the CC-BY-NC-ND License. © 2021 Park Y et al. JAMA Network Open.

    Corresponding Author: Yoonyoung Park, ScD, Center for Computational Health, IBM Research, 75 Binney St, Cambridge, MA 02142 (yoonyoung.park@ibm.com).

    Author Contributions: Dr Park had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

    Concept and design: Park, Hu, Singh, Das.

    Acquisition, analysis, or interpretation of data: Park, Hu, Singh, Sylla, Dankwa-Mullan, Koski.

    Drafting of the manuscript: Park, Singh, Sylla, Das.

    Critical revision of the manuscript for important intellectual content: Park, Hu, Singh, Dankwa-Mullan, Koski, Das.

    Statistical analysis: Park, Singh.

    Administrative, technical, or material support: Hu, Das.

    Supervision: Hu, Das.

    Conflict of Interest Disclosures: None reported.

    References
    1. Angwin J, Larson J, Mattu S, Kirchner L. Machine bias. ProPublica. May 23, 2016. Accessed July 31, 2020. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
    2. Koenecke A, Nam A, Lake E, et al. Racial disparities in automated speech recognition. Proc Natl Acad Sci U S A. 2020;117(14):7684-7689. doi:10.1073/pnas.1915768117
    3. Datta A, Tschantz MC. Automated experiments on ad privacy settings. Proc Priv Enh Technol. 2015;1:92-112. doi:10.1515/popets-2015-0007
    4. Buolamwini J, Gebru T. Gender shades: intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research. Published 2018. Accessed February 26, 2021. http://proceedings.mlr.press/v81/buolamwini18a.html
    5. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med. 2018;178(11):1544-1547. doi:10.1001/jamainternmed.2018.3763
    6. Vyas DA, Eisenstein LG, Jones DS. Hidden in plain sight—reconsidering the use of race correction in clinical algorithms. N Engl J Med. 2020;383(9):874-882. doi:10.1056/NEJMms2004740
    7. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
    8. Verma S, Rubin J. Fairness definitions explained. FairWare '18: Proceedings of the International Workshop on Software Fairness. Published May 2018. Accessed February 26, 2021. doi:10.1145/3194770.3194776
    9. Friedler SA, Choudhary S, Scheidegger C, Hamilton EP, Venkatasubramanian S, Roth D. A comparative study of fairness-enhancing interventions in machine learning. ACM FAT. 2019:329-338. doi:10.1145/3287560.3287589
    10. Bellamy RKE, Mojsilovic A, Nagar S, et al. AI Fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM J Res Dev. 2019;63(4-5):4:1-4:15. doi:10.1147/JRD.2019.2942287
    11. Menon AK, Williamson RC. The cost of fairness in classification. arXiv. Published May 25, 2017. Accessed February 26, 2021. https://arxiv.org/abs/1705.09055
    12. Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring fairness in machine learning to advance health equity. Ann Intern Med. 2018;169(12):866-872. doi:10.7326/M18-1990
    13. McCradden MD, Joshi S, Mazwi M, Anderson JA. Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020;2(5):e221-e223. doi:10.1016/S2589-7500(20)30065-0
    14. Wisner KL, Chambers C, Sit DK. Postpartum depression: a major public health problem. JAMA. 2006;296(21):2616-2618. doi:10.1001/jama.296.21.2616
    15. Gress-Smith JL, Luecken LJ, Lemery-Chalfant K, Howe R. Postpartum depression prevalence and impact on infant health, weight, and sleep in low-income and ethnic minority women and infants. Matern Child Health J. 2012;16(4):887-893. doi:10.1007/s10995-011-0812-y
    16. Kozhimannil KB, Trinacty CM, Busch AB, Huskamp HA, Adams AS. Racial and ethnic disparities in postpartum depression care among low-income women. Psychiatr Serv. 2011;62(6):619-625. doi:10.1176/ps.62.6.pss6206_0619
    17. Margulis AV, Setoguchi S, Mittleman MA, Glynn RJ, Dormuth CR, Hernández-Díaz S. Algorithms to estimate the beginning of pregnancy in administrative databases. Pharmacoepidemiol Drug Saf. 2013;22(1):16-24. doi:10.1002/pds.3284
    18. Institute of Medicine Committee on Understanding and Eliminating Racial and Ethnic Disparities in Health Care. Unequal Treatment: Confronting Racial and Ethnic Disparities in Health Care. National Academies Press; 2003.
    19. VanderWeele TJ, Robinson WR. On the causal interpretation of race in regressions adjusting for confounding and mediating variables. Epidemiology. 2014;25(4):473-484. doi:10.1097/EDE.0000000000000105
    20. Feldman M, Friedler SA, Moeller J, Scheidegger C, Venkatasubramanian S. Certifying and removing disparate impact. KDD '15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 10-13, 2015; Sydney, Australia. doi:10.1145/2783258.2783311
    21. Hardt M, Price E, Srebro N. Equality of opportunity in supervised learning. arXiv. Published October 7, 2016. Accessed February 26, 2021. https://arxiv.org/abs/1610.02413
    22. Calders T, Kamiran F, Pechenizkiy M. Building classifiers with independency constraints. ICDM Workshops: IEEE International Conference on Data Mining. 2009:13-18. August 6-9, 2009; Miami, Florida.
    23. Kamishima T, Akaho S, Asoh H, Sakuma J. Fairness-aware classifier with prejudice remover regularizer. In: Flach PA, De Bie T, Cristianini N, eds. Machine Learning and Knowledge Discovery in Databases: ECML PKDD 2012. Lecture Notes in Computer Science, vol 7524. Springer; 2012.
    24. Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R. Fairness through awareness. Paper presented at: 3rd Innovations in Theoretical Computer Science Conference; January 8-10, 2012; Cambridge, MA. Accessed February 26, 2021. https://dl.acm.org/doi/10.1145/2090236.2090255
    25. Makhlouf K, Zhioua S, Palamidessi C. On the applicability of ML fairness notions. arXiv. Published June 30, 2020. Accessed February 26, 2021. https://arxiv.org/abs/2006.16745
    26. Howell EA, Mora PA, Horowitz CR, Leventhal H. Racial and ethnic differences in factors associated with early postpartum depressive symptoms. Obstet Gynecol. 2005;105(6):1442-1450. doi:10.1097/01.AOG.0000164050.34126.37
    27. Gavin AR, Melville JL, Rue T, Guo Y, Dina KT, Katon WJ. Racial differences in the prevalence of antenatal depression. Gen Hosp Psychiatry. 2011;33(2):87-93. doi:10.1016/j.genhosppsych.2010.11.012
    28. Liu CH, Tronick E. Rates and predictors of postpartum depression by race and ethnicity: results from the 2004 to 2007 New York City PRAMS survey (Pregnancy Risk Assessment Monitoring System). Matern Child Health J. 2013;17(9):1599-1610. doi:10.1007/s10995-012-1171-z
    29. Corbett-Davies S, Pierson E, Feller A, Goel S, Huq A. Algorithmic decision making and the cost of fairness. Paper presented at: 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13-17, 2017; Halifax, Nova Scotia. Accessed February 26, 2021. https://dl.acm.org/doi/10.1145/3097983.3098095
    30. Kusner MJ, Loftus JR. The long road to fairer algorithms. Nature. 2020;578(7793):34-36. doi:10.1038/d41586-020-00274-3
    31. Veinot TC, Mitchell H, Ancker JS. Good intentions are not enough: how informatics interventions can worsen inequality. J Am Med Inform Assoc. 2018;25(8):1080-1088. doi:10.1093/jamia/ocy052
    32. Singh M, Ramamurthy KN. Understanding racial bias in health using the Medical Expenditure Panel Survey data. arXiv. Published November 4, 2019. Accessed March 1, 2021. https://arxiv.org/abs/1911.01509