Duke University Medical Center (DUMC) is located in central Durham County. Redder colors indicate poorer nSES, while bluer colors indicate better nSES. The northern parts of the county are fairly rural, while the central parts, where nSES is lower, are more urban.
With the exception of outpatient visits, those in the lowest neighborhood quartiles have the quickest time to the event. Quartile 1 indicates lower nSES, while quartile 4 indicates better nSES. Those in areas with lower nSES have quicker time to events than those in areas with higher nSES.
While event rates are relatively low, those in the lowest neighborhood quartiles have the quickest time to the event. Quartile 1 indicates lower nSES, while quartile 4 indicates better nSES. Those in areas with lower nSES have quicker time to events than those in areas with higher nSES.
eTable 1. Demographic and Clinical Characteristics of Patients
eTable 2. Model Fit Results Using Cross-Validation in the Training Data
eTable 3. Model Fit Results Using Principal Components of ACS Data
eFigure. Discrete Time Prediction Results
Bhavsar NA, Gao A, Phelan M, Pagidipati NJ, Goldstein BA. Value of Neighborhood Socioeconomic Status in Predicting Risk of Outcomes in Studies That Use Electronic Health Record Data. JAMA Netw Open. 2018;1(5):e182716. doi:10.1001/jamanetworkopen.2018.2716
What is the added predictive value of neighborhood socioeconomic status when predicting health outcomes and use of health care services with data from the electronic health record?
In this cohort study, the predictive value of neighborhood socioeconomic status varied by outcome of interest. When added to electronic health record variables, neighborhood socioeconomic status did not improve predictive performance for any outcome.
These results suggest that information about the neighborhood in which a person lives may not contribute much more to risk prediction than information already within electronic health record data.
Data from electronic health records (EHRs) are increasingly used for risk prediction. However, EHRs do not reliably collect sociodemographic and neighborhood information, which has been shown to be associated with health. The added contribution of neighborhood socioeconomic status (nSES) in predicting health events is unknown and may help inform population-level risk reduction strategies.
To quantify the association of nSES with adverse outcomes and the value of nSES in predicting the risk of adverse outcomes in EHR-based risk models.
Design, Setting, and Participants
Cohort study in which data from 90 097 patients 18 years or older in the Duke University Health System and Lincoln Community Health Center EHR from January 1, 2009, to December 31, 2015, with at least 1 health care encounter and residence in Durham County, North Carolina, in the year prior to the index date were linked with census tract data to quantify the association between nSES and the risk of adverse outcomes. Machine learning methods were used to develop risk models and determine how adding nSES to EHR data affects risk prediction. Neighborhood socioeconomic status was defined using the Agency for Healthcare Research and Quality SES index, a weighted measure of multiple indicators of neighborhood deprivation.
Main Outcomes and Measures
Outcomes included use of health care services (emergency department and inpatient and outpatient encounters) and hospitalizations due to accidents, asthma, influenza, myocardial infarction, and stroke.
Among the 90 097 patients in the training set of the study (57 507 women and 32 590 men; mean [SD] age, 47.2 [17.7] years) and the 122 812 patients in the testing set of the study (75 517 women and 47 295 men; mean [SD] age, 46.2 [17.9] years), those living in neighborhoods with lower nSES had a shorter time to use of emergency department services and inpatient encounters, as well as a shorter time to hospitalizations due to accidents, asthma, influenza, myocardial infarction, and stroke. The predictive value of nSES varied by outcome of interest (C statistic ranged from 0.50 to 0.63). When added to EHR variables, nSES did not improve predictive performance for any health outcome.
Conclusions and Relevance
Social determinants of health, including nSES, are associated with the health of a patient. However, the results of this study suggest that information on nSES may not contribute much more to risk prediction above and beyond what is already provided by EHR data. Although this result does not mean that integrating social determinants of health into the EHR has no benefit, researchers may be able to use EHR data alone for population risk assessment.
Electronic health records (EHRs) have become an important component of clinical practice. However, a key limitation of EHRs when used for research purposes is that they do not reliably collect sociodemographic and neighborhood information, which has long been recognized to be strongly associated with health.1 Social and behavior measures linked to clinical variables within EHRs may improve clinical care and population health while also helping to inform population-level risk reduction strategies.2
Data from EHRs have been used extensively to develop risk models.3 Several studies have shown that linking neighborhood socioeconomic status (nSES) indicators with disease risk factors improves the accuracy of models in predicting disease outcomes.4,5 For instance, adding nSES indicators improves the accuracy of the Framingham risk score in the estimation of coronary heart disease risk.6,7 To our knowledge, there are few systematic studies assessing the value of nSES indicators in the prediction of diverse clinical events. In the present study, we supplemented individual EHR data with nSES data from the American Community Survey (ACS). We emphasize that our goal is not to assess whether nSES is associated with health outcomes—it undoubtedly is8—but to assess whether knowledge of nSES improves the prediction of health outcomes. Specifically, we sought to determine whether census tract–level nSES indicators are associated with poor health outcomes, whether census tract–level nSES data alone or in concert with EHR data can improve risk prediction beyond current models by using EHR data, and which elements in EHR indicators can serve as proxies for census tract–level nSES measures.
Clinical data were derived from the EHR system of Duke University Health System (DUHS), which consists of 2 community hospitals, 1 large referral hospital, and a network of outpatient clinics. It is estimated that 85% of the residents of Durham County, North Carolina, receive their primary care from DUHS.9 We developed a data mart consisting of local patients by selecting those with an address in Durham County between January 1, 2009, and December 31, 2015, following the Patient-Centered Clinical Research Network Common Data Model, version 3.0, and adding custom fields, such as address and insurance status.10 We supplemented these data with EHR records from the Lincoln Community Health Center, a federally qualified health care facility serving a primarily underserved population in Durham County. All of the patients from the Lincoln Community Health Center were Durham residents. This study was approved by the Duke University School of Medicine institutional review board, which also granted a waiver of informed consent for this study because this is a secondary data analysis. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.
We divided our cohort into training and testing sets. The index date (ie, time zero) for the training set was January 1, 2009, and the index date for the testing set was January 1, 2012. To be eligible at the index time point, patients had to be age 18 years or older, have at least 1 health care encounter in the year prior to the index date, and be a Durham County resident at their last encounter. This protocol allowed us to characterize local patients who were actively seeking care at DUHS. The data mart contained encounter data through December 31, 2016.
We chose a broad range of outcomes based on the use of services (emergency department and inpatient and outpatient encounters) and hospitalizations due to accidents, asthma, influenza, myocardial infarction, and stroke. These clinical outcomes were chosen for their known association with nSES.11-17 Cause-specific hospitalizations were defined via discharge diagnosis. Patients were censored at their last encounter date in the data mart or 3 years after the index date (December 31, 2011, for the training set; December 31, 2014, for the testing set), whichever came first. Because patients had potential follow-up through the end of 2016, we had a “burnout” period when we could properly capture the censoring date.18
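The censoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names `censor_date` and `follow_up` are hypothetical, the administrative end of follow-up is taken to be the day before the third anniversary of the index date, and the index date is assumed not to fall on February 29.

```python
from datetime import date, timedelta


def censor_date(index_date, last_encounter, horizon_years=3):
    """Earlier of the last encounter and the administrative end of follow-up."""
    # Assumption: follow-up ends the day before the horizon-year anniversary,
    # e.g. an index date of 2009-01-01 gives an end date of 2011-12-31.
    anniversary = index_date.replace(year=index_date.year + horizon_years)
    admin_end = anniversary - timedelta(days=1)
    return min(last_encounter, admin_end)


def follow_up(index_date, last_encounter, event_date=None, horizon_years=3):
    """Return (days_of_follow_up, event_observed) for one patient."""
    end = censor_date(index_date, last_encounter, horizon_years)
    # An event counts only if it falls inside the follow-up window.
    if event_date is not None and index_date <= event_date <= end:
        return (event_date - index_date).days, True
    # Otherwise the patient is censored at the end of the window.
    return (end - index_date).days, False
```

The returned pair (duration, event indicator) is the usual input format for time-to-event methods such as Kaplan-Meier curves.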
We abstracted 41 baseline predictors from our data mart that are commonly available in EHR systems, including measures of demographics, comorbidities, laboratory tests, medications, and use of health care services (eTable 1 in the Supplement). We used encounter data from the year prior to the index dates (ie, 2008 for the training set and 2011 for the testing set) to define predictor values. We presumed that the absence of a measurement (eg, no International Classification of Diseases, Ninth Revision code for diabetes) indicated that the individual did not have the condition. Because not all patients had all laboratory tests performed, instead of imputing missing values, we simply used the number of times the test was administered, a metric that has been shown to be predictive of outcomes.19
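As a rough illustration of using test counts in place of imputed laboratory values, the sketch below tallies how many times each test was administered per patient. The function name, the record layout (one `(patient_id, test_name)` tuple per administered test), and the test names are hypothetical and are not taken from the study's data mart.

```python
from collections import Counter


def lab_count_features(patient_ids, lab_records, test_names):
    """Count how often each laboratory test was administered to each patient.

    lab_records holds one (patient_id, test_name) tuple per administered test.
    Patients who never had a given test receive a count of 0 for it.
    """
    counts = Counter((pid, test) for pid, test in lab_records)
    # Counter returns 0 for missing keys, so untested patients get zeros.
    return {
        pid: {test: counts[(pid, test)] for test in test_names}
        for pid in patient_ids
    }
```

A patient with no recorded tests then contributes a row of zeros rather than missing values, which avoids the need for imputation.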
To define nSES, we extracted data from the 2010 ACS. The ACS is a rolling survey of the US population that gathers information, such as ancestry, educational level, income, language proficiency, migration, disability, employment, and housing characteristics, across 1298 variables.20 The ACS releases estimates at the regional, state, and county level every year, and data at the census-tract and block-group levels are available every 5 years. For our study, a patient’s address at the index date was used to identify their census tract. Census tracts are small geographical units of approximately 4000 residents. Durham County has 73 census tracts. To calculate nSES, we used the Agency for Healthcare Research and Quality (AHRQ) SES index.21 The index is a weighted combination of the percentage of households with a mean number of 1 person or more per room, the median value of owner-occupied dwelling, the percentage unemployed, percentage living below the poverty level, the median household income, the percentage 25 years or older with a bachelor’s degree or higher, and the percentage 25 years or older with less than a 12th-grade education. It is scaled to the US population to lie between 0 and 100, with a higher number indicative of greater neighborhood deprivation. Previous studies have used this index to represent a geographical area–based measure of the socioeconomic deprivation experienced according to neighborhood.22-25
The characteristics of the patients were summarized by county-level quartiles of the nSES score. Categorical variables were presented as frequencies, and continuous variables were presented as mean (SD) values. We assessed the amount of variation within nSES explained by EHR data by regressing nSES onto all the EHR data and calculating R2 statistics. To evaluate the differences in time to events based on nSES, we fit Kaplan-Meier curves stratified on quartiles of nSES. We assessed differences via a log-rank test. We next tried to determine how adding nSES to the EHR data affects risk prediction. To derive our prediction model, we used random survival forest (RSF).26 The random forests method is an extension of classification and regression trees, which combines multiple trees via a process called bagging (bootstrap aggregation) to create a more robust predictor.27,28 The RSF is an application of random forests to time-to-event data. In brief, RSF (and random forests) provides a nonparametric means of developing predictive models. Its primary value is that it allows one to model both nonlinear and heterogeneous (interaction) effects. This is a more robust model than the standard Cox proportional hazards regression model. Using the training data, we first trained an RSF model using only the EHR data. Next, we fit a second model including nSES as an additional predictor. We used the test data to assess the predictive performance of both models. We calculated C statistics appropriate for time-to-event data and compared them using the permutation test.29,30 All P values were from 2-sided tests, and results were deemed statistically significant at P < .05. 
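For readers unfamiliar with Kaplan-Meier curves, the product-limit estimate can be sketched as follows. This is a minimal pure-Python illustration, not the software used in the study; the function name `kaplan_meier` is hypothetical. In a stratified analysis it would be applied separately to the patients in each nSES quartile.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times: follow-up duration for each patient.
    events: True if the event was observed, False if the patient was censored.
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after t.
        n_at_risk -= n_leaving
    return curve
```

Censored patients contribute to the risk set up to their censoring time but never lower the curve themselves.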
The C statistic, also termed concordance statistic or c-index, is analogous to the area under the curve and is a global measure of model discrimination.31 Discrimination refers to the ability of a risk prediction model to separate patients who develop a health outcome from patients who do not develop a health outcome.31 Effectively, the C statistic is the probability that a model will assign a higher risk score to a patient who develops the outcome of interest than to a patient who does not develop the outcome of interest.
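The pairwise interpretation of the C statistic can be made concrete with a small sketch. The code below implements a simple Harrell-type concordance index for right-censored data; it is an assumption that this form is adequate for illustration, and it is not necessarily the estimator computed in the study. The function name is hypothetical, pairs with tied follow-up times are skipped for simplicity, and tied risk scores count as half concordant.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-type concordance index for right-censored data.

    A pair is comparable when the patient with the shorter follow-up time
    had an observed event. The pair is concordant when that patient also
    has the higher predicted risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to a model that ranks patients no better than chance, and a value of 1.0 corresponds to perfect ranking.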
For sensitivity analysis, we used cross-validation within the training set based on the RSF to assess the added value of nSES. We also used a more general parameterization of the ACS variables. We performed a principal components analysis and selected the components that explained at least 95% of the variance. All statistical analyses were performed in R, version 3.1.4 (The R Foundation for Statistical Computing). We used the package randomForestSRC to build the RSF model, and we used the package survAUC to calculate the C statistic.32,33
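One plausible reading of the component selection rule is to keep the smallest number of leading components whose cumulative share of variance reaches 95%. The sketch below assumes that reading and assumes the component variances (eigenvalues) have already been computed; the function name is hypothetical.

```python
def components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components reaching the variance threshold.

    eigenvalues: variance explained by each principal component.
    threshold: required cumulative share of total variance (0.95 = 95%).
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)
```

The retained components would then replace the single nSES index as predictors in the model.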
We identified 90 097 eligible patients for the training data and 122 812 eligible patients for the test data. The demographics and clinical characteristics of these patients, stratified by nSES quartiles and training or test set, are shown in eTable 1 in the Supplement, with a reduced set of demographics in Table 1. The population in the training data set was predominately female (57 507 [63.8%]) and black (37 774 [41.9%]), with a mean (SD) age of 47.2 (17.7) years. Similar characteristics were seen in the testing data set (75 517 [61.5%] female; 48 766 [39.7%] black; mean [SD] age, 46.2 [17.9] years). Patients living in neighborhoods in a lower nSES quartile were more likely to be younger, black, have public insurance, and experience more clinical health care encounters than those in a higher nSES quartile. Clinically, those in a lower nSES quartile were also more likely to have more comorbidities, take more medications, and undergo more laboratory tests. Figure 1 displays the spatial distribution of nSES across 73 census tracts of Durham County. Overall, nSES ranged from a scaled value of 37% to 74%. The northern parts of Durham County are quite rural, and the central parts are fairly urban.
Next, we assessed differences in time to health outcomes based on nSES. Figure 2 and Figure 3 show Kaplan-Meier plots for the 8 different outcomes. (The eFigure in the Supplement provides risk set information.) The log-rank test was significant for all outcomes. For all outcomes except outpatient encounters, those in lower nSES neighborhoods had shorter times to events; for outpatient encounters, individuals in neighborhoods with a higher nSES had a shorter time to the next appointment.
Finally, we assessed the added predictive value of nSES to clinical variables readily available in the EHR. Table 2 shows the C statistics for the 8 outcomes based on EHR data alone, nSES information alone, and EHR data and nSES information combined. The predictive value of nSES varies by different outcomes of interest. Although nSES was moderately predictive for most outcomes (C statistic ranged from 0.50 to 0.63), it did not improve predictive performance for any outcome when added to EHR variables.
To understand the lack of added predictive value better, we regressed nSES onto the EHR variables, estimating the coefficient of determination (R2). All EHR data explained 31.2% of the variability in nSES, while demographic factors alone (age, sex, race/ethnicity, and insurance status) explained 28.7% of the variance, suggesting that a moderate amount of the variation in nSES is explained by demographic factors alone.
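The coefficient of determination reported here is the share of variance in nSES that the fitted regression accounts for. A minimal sketch of that calculation, given observed nSES values and fitted values from any regression, is shown below; the function name is hypothetical and the regression fit itself is assumed to have been done elsewhere.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_observed = sum(observed) / len(observed)
    # Total variation of the outcome around its mean.
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    # Variation left unexplained by the fitted values.
    ss_residual = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1 - ss_residual / ss_total
```

Comparing the value from a model with all EHR variables against the value from a model with demographic factors alone shows how much of the explained variation comes from demographics.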
In our sensitivity analysis, both the use of the estimate based on the RSF within the training data and principal components to represent ACS data provided similar results (eTables 2 and 3 in the Supplement). We also hypothesized that nSES information would be more predictive for long-term outcomes compared with short-term outcomes. When we examined the added value of nSES for 30-day, 90-day, 180-day, 1-year, 2-year, and 3-year time horizons, we found that nSES did not improve prediction over longer-term horizons (eFigure in the Supplement).
Our study found that, while the risk of clinical outcomes differs based on nSES, and although nSES is moderately predictive of clinical outcomes, nSES does not meaningfully improve risk prediction of clinical events above and beyond what is easily extractable from the EHR. A primary explanation for this finding could be that, at least in our population, demographic characteristics are highly associated with nSES. In our study, knowledge of a patient’s age, sex, race/ethnicity, and insurance status explained more than 28% of the variability in nSES. For comparison, it is typical for the coefficient of determination to be less than 10% in clinical studies. To our knowledge, this study is one of the first to broadly assess the added value of nSES in a large, population-based risk prediction study using data from the EHR.
There has been increasing emphasis on the use of data from the EHR for population health.34 There is potential to use these data to understand the health of communities through activities such as disease surveillance and population risk assessment, especially when medical centers, such as DUHS, are the primary health care facility in a community.35 This use is increasingly salient amid changes to patient reimbursement in which medical centers are becoming financially responsible for managing the health of their patient populations.36 One of the concerns with EHR data is that they lack important contextual information regarding patients’ social environments.37 To this end, widely available nSES data may be linked to patients’ EHRs.
The goal of identifying neighborhoods with greater health care needs is to deploy pragmatic interventions, such as patient navigators, social workers, or access to telemedicine, which can target high-risk populations. To quantify nSES, we used data available from the ACS to calculate the AHRQ nSES index. Others have used the AHRQ nSES index to assess outcomes, such as prevalence of chronic disease and risk of hospital readmission, and, similar to our study, they found that lower nSES was associated with poorer health outcomes.23,24 In our study, we explored the effect that different measures of nSES may have on our results through a sensitivity analysis that used principal components analysis, which was conducted on all variables present in the ACS data set to identify constructs that may have better discriminatory characteristics than the AHRQ risk score alone. We did not see any appreciable differences in C statistics when we used the principal components analysis–derived constructs compared with the AHRQ risk score.
It is well known that neighborhoods are significantly associated with the health of their residents through physical and social attributes.8 The mechanisms by which neighborhoods are associated with health include increased stress level, decreased physical activity, and poor nutrition, which in turn affect both proximal risk factors, such as blood pressure, diabetes control, and inflammation, and distal health outcomes, such as cardiovascular disease.8 The democratization of neighborhood-level contextual data, the ability to link these data to the EHR, and the ability to target population-level interventions to high-risk areas have resulted in a resurgence in research related to neighborhoods and health. Our results support prior research in this area by showing that patients who live in areas with lower nSES have poorer health outcomes than patients who live in areas with higher nSES.38-40 As an extension of this finding, we examined the importance of nSES in risk prediction across multiple health and service-use outcomes and found little added value for the risk prediction models within our population. This area of research has not been extensively studied; however, prior studies may help place our results in context. Fiscella and colleagues41 showed that adding individual-level SES (ie, educational level and income) to the Framingham risk score improved calibration of the risk model for coronary heart disease, but not discrimination, while reducing bias in risk prediction for coronary heart disease for those with lower socioeconomic status. They did not use nSES measures.
In a separate study of 1178 consecutive patients 65 years or younger who were discharged from 8 hospitals in central Israel, the C statistic for predicting mortality after myocardial infarction significantly improved from 0.72 to 0.76 (P < .001) after socioeconomic status measures, including nSES, were added to the basic prediction model. The study used an index developed by the Israel Central Bureau of Statistics, which may not be generalizable to other populations, and the extended model included both individual-level socioeconomic status and nSES predictors.42
In a study of 109 793 patients from the Cleveland Clinic Health System, Dalton and colleagues43 showed that the pooled cohort equation risk model predicted events associated with atherosclerotic cardiovascular disease with greater discrimination among individuals living in more affluent communities, as defined using the neighborhood disadvantage index, than among individuals living in poorer neighborhoods. These results may suggest that the predictive ability of nSES might depend on the nSES index used and the population within which it is applied.
This study has some strengths. Durham County is a diverse county with both wealthy and poor residents as well as both urban and rural neighborhoods. We were able to use our large sample size and relatively long follow-up to quantify outcomes with low event rates. There are also important limitations to our study. These clinical data are from one geographical region, and it is possible that, in a region with different demographic characteristics, the R2 would be lower, allowing for greater contribution of nSES in risk prediction. However, insurance status alone had an R2 of 12.5%. In addition, our models were developed and validated using EHR data from a single institution (DUHS and Lincoln Community Health Center share a common EHR system). Patients who received care at different institutions would be missed. We also do not have data on health care received outside DUHS or Lincoln Community Health Center by the patients included in our study. In addition, we used only 1 primary parameterization of nSES: the AHRQ neighborhood deprivation index. It is possible that other measures, such as the Gini index, would have yielded greater added value.44 That being said, our more agnostic principal components analysis yielded similar results. Finally, although RSF is a robust model algorithm capable of finding complex effects, it is possible that another modeling approach would have yielded different results.45
This work reaffirms that the social environment is associated with health outcomes. However, these results suggest that information about the environment in which a person lives may not contribute much more to population risk assessment than is already provided by EHR data. Although this result does not mean that integrating social determinants of health into the EHR has no benefit, researchers may be able to use EHR data alone for population risk assessment.
Accepted for Publication: July 17, 2018.
Published: September 21, 2018. doi:10.1001/jamanetworkopen.2018.2716
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2018 Bhavsar NA et al. JAMA Network Open.
Corresponding Author: Benjamin A. Goldstein, PhD, Department of Biostatistics and Bioinformatics, Duke University School of Medicine, 2424 Erwin Rd, Durham, NC 27705 (email@example.com).
Author Contributions: Dr Goldstein had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Dr Bhavsar and Ms Gao are co–first authors.
Concept and design: Bhavsar, Pagidipati, Goldstein.
Acquisition, analysis, or interpretation of data: Bhavsar, Gao, Phelan, Goldstein.
Drafting of the manuscript: Bhavsar, Gao, Goldstein.
Critical revision of the manuscript for important intellectual content: Bhavsar, Phelan, Pagidipati, Goldstein.
Statistical analysis: Bhavsar, Gao, Phelan, Goldstein.
Obtained funding: Bhavsar, Goldstein.
Administrative, technical, or material support: Bhavsar, Goldstein.
Supervision: Pagidipati, Goldstein.
Conflict of Interest Disclosures: None reported.
Funding/Support: Research reported in this publication was supported by National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) of the National Institutes of Health (NIH) career development award K25 DK097279 (Dr Goldstein) and NIDDK of the NIH award P30DK096493 (Dr Bhavsar). This publication was made possible (in part) by grant UL 1TR001117 from the National Center for Advancing Translational Sciences, a component of the NIH, and NIH Roadmap for Medical Research. Data from the Southeastern Diabetes Initiative was supported in part by grant 1C1CMS331018-01-00 from the Department of Health and Human Services, Centers for Medicare & Medicaid Services, and in part by the Bristol-Myers Squibb Foundation Together on Diabetes program.
Role of the Funder/Sponsor: The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Disclaimer: The contents of this publication are solely the responsibility of the authors and have not been approved by the Department of Health and Human Services or the Centers for Medicare and Medicaid Services. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of the National Center for Advancing Translational Sciences or National Institutes of Health.