Development of a Synthetic Population Model for Assessing Excess Risk for Cardiovascular Disease Death

This decision analytical model describes the use of a semisynthetic population to identify the distribution of excess cardiovascular death risk and its correlation with social and biological risk factors.


eMethods. Description of Synthetic Population Creation
A synthetic population with demographics and disease characteristics was built from the synthetic population used by the FRED modeling and simulation platform (Figure 1). 1 The FRED population was generated by a census-based iterative proportional fitting methodology that results in a geospatially realistic population that accurately represents the demographics and household structure of the US population. 2 Counts of insured individuals with claims-based evidence of Type II diabetes, hyperlipidemia, hypertension, and combinations of these conditions were provided by three major insurance organizations in the Allegheny County area. The covered population included private insurance, Medicare, and Medicaid enrollees. Eligible enrollees were Allegheny County residents enrolled for at least 90 continuous days under one of the three health plans in 2015 (Jan 1 - Dec 31). Medical claims data was provided for 30-86% of the population in 97.2% of census tracts. The proportion of eligible enrollees varied by census tract but overall accounted for approximately 60 percent of the Allegheny County population. The percentage of persons in each census tract who were insured by each insurer, and by the three insurers overall, was calculated, but the causes of lack of insurance are not available. At least one census tract with a low reported percentage of insurance coverage is the location of 2 major universities and is therefore the residence of a large number of students from other locations, who may have insurance through other carriers; it is not possible to quantify this from the available data. Because the insurers providing data included the largest Medicaid insurance provider in the area, Medicaid enrollees are presumed to be adequately represented in the claims data.
Other possible reasons for low coverage in a census tract include cost of insurance, perceived lack of need for insurance, distrust of the system, and lack of access to or knowledge about available insurance programs, but the effect of these or other factors on insurance uptake at the census tract level is beyond the scope of this study. Further, some proportion of the population likely had insurance with other insurers that cover small portions of the county population, and this coverage may not be distributed evenly across the population. This could affect the results and is a limitation of the study. Census tract levels for each disease were stratified by gender and age range category (1-17, 18-44, 45-64, 65-84, and 85+).
Disease claim counts were used to assign levels of diabetes, hypertension and hyperlipidemia and combinations of these conditions to the FRED population on a census tract basis. It is not possible to connect a diagnostic variable (such as diabetes) directly to an individual in the claims data. When the number of agents per census tract exceeded the population covered by insurer data, agent diabetes, hypertension and hyperlipidemia status were assigned by randomly drawing an individual with matching demographics from the National Health and Nutrition Examination Survey (NHANES). 3 Plausible values for individual height, cholesterol, high density lipoproteins, blood pressure and history of stroke and prior myocardial infarction were obtained from NHANES, by matching to the synthetic population's agent-level demographics and disease status. When agent diabetes, hypertension and hyperlipidemia status were assigned from NHANES, the NHANES values for individual height, cholesterol, high density lipoproteins, blood pressure and history of stroke and prior myocardial infarction from that NHANES individual were used to assign those variables to that agent. Data from the National Health Interview Survey was used to assign smoking status to agents based on demographics. 4 Rates were obtained by summing counts per tract and dividing by population. Each agent was assigned a five-year risk of death due to CVD using a published risk equation (see eAppendix 1).
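The NHANES-based assignment described above amounts to drawing one random donor record with matching demographics and copying its values to the agent. The following Python sketch illustrates that idea only; it is not the study's code. The matching keys, field names, and survey records are hypothetical, since the source states only that donors were matched on demographics (and on disease status for the biomarker draw).

```python
import random
from collections import defaultdict

# Hypothetical matching keys and field names (assumptions, not from the source).
MATCH_KEYS = ("sex", "age_group")
COPY_FIELDS = ("diabetes", "hypertension", "hyperlipidemia", "systolic_bp", "total_chol")


def build_donor_index(donors, keys=MATCH_KEYS):
    """Group survey records by the values of their matching keys."""
    index = defaultdict(list)
    for record in donors:
        index[tuple(record[k] for k in keys)].append(record)
    return index


def assign_from_donor(agent, index, rng, keys=MATCH_KEYS, fields=COPY_FIELDS):
    """Return a copy of the agent with all fields copied from ONE random matching donor."""
    pool = index.get(tuple(agent[k] for k in keys), [])
    if not pool:
        return dict(agent)  # no matching donor: leave the agent unchanged
    donor = rng.choice(pool)
    updated = dict(agent)
    updated.update({f: donor[f] for f in fields})
    return updated


# Made-up survey records, for illustration only.
donors = [
    {"sex": "F", "age_group": "45-64", "diabetes": True, "hypertension": True,
     "hyperlipidemia": False, "systolic_bp": 142, "total_chol": 210},
    {"sex": "F", "age_group": "45-64", "diabetes": False, "hypertension": False,
     "hyperlipidemia": True, "systolic_bp": 118, "total_chol": 245},
    {"sex": "M", "age_group": "65-84", "diabetes": False, "hypertension": True,
     "hyperlipidemia": True, "systolic_bp": 150, "total_chol": 230},
]
agent = {"id": 1, "sex": "F", "age_group": "45-64"}
index = build_donor_index(donors)
filled = assign_from_donor(agent, index, random.Random(0))
```

Copying every field from a single donor, rather than sampling each variable separately, keeps the correlations between disease status and biomarkers that exist in the donor record.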

eAppendix 1. Description and Evaluation of Algorithm Used for Prediction of Cardiovascular Disease Death Rate
To predict risk of death from cardiovascular disease, this study used a risk score that was developed using data from eight randomized clinical trials for treatment of hypertension. 5 Development of the risk score related individual characteristics to risk of death from cardiovascular disease using a multivariate Cox model. A risk score was developed from 11 factors: age, sex, systolic blood pressure, serum total cholesterol concentration, height, serum creatinine concentration, cigarette smoking, diabetes, left ventricular hypertrophy, history of stroke, and history of myocardial infarction. The risk score is an integer, with points added for each factor according to its association with risk. This risk score algorithm was chosen in part because the majority of individual-level characteristics needed for the prediction were available in the FRED synthetic population, could be added from the insurer claims data available for this project, or could be distributed in the population in a realistic way by choosing random similar individuals from NHANES or NHIS. Creatinine values were not available, so 2 points were added to the risk score for all agents, as suggested by the risk calculator developers. Left ventricular hypertrophy status was also not available, so it was not used in the calculation. Risk was scaled to four years to match data and was summed over each census tract. Agents 18 years old or younger were assigned zero risk.
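The last two steps above (rescaling each agent's risk to four years and summing over census tracts) can be sketched as follows. The source does not say how the rescaling was done; this sketch assumes a constant hazard over the period, which is only one plausible choice (simple proportional scaling would be another). The field names `age`, `tract`, and `risk_5y` are hypothetical.

```python
from collections import defaultdict


def scale_risk(p, from_years=5.0, to_years=4.0):
    """Rescale a cumulative risk between horizons.

    ASSUMPTION: constant hazard, so survival over t years is S(1)**t.
    The source does not state the scaling method actually used.
    """
    return 1.0 - (1.0 - p) ** (to_years / from_years)


def expected_deaths_by_tract(agents):
    """Sum four-year risks per tract; agents aged 18 or younger contribute zero."""
    totals = defaultdict(float)
    for a in agents:
        risk = 0.0 if a["age"] <= 18 else scale_risk(a["risk_5y"])
        totals[a["tract"]] += risk
    return dict(totals)


# Made-up agents, for illustration only.
agents = [
    {"tract": "A", "age": 70, "risk_5y": 0.10},
    {"tract": "A", "age": 12, "risk_5y": 0.00},
    {"tract": "B", "age": 55, "risk_5y": 0.02},
]
expected = expected_deaths_by_tract(agents)
```

The per-tract sum of individual risks is the expected number of CVD deaths in that tract over the period, which can then be converted to a rate by dividing by the tract population.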
Difference between expected and observed CVD death risk was approximately normally distributed by Shapiro-Wilk normality test (W = 0.99219, p-value = 0.06335) after removal of 2 outliers (eFigure 1). Average difference between expected and observed CVD death risk was close to 0 (-40, SD 524) and approximately evenly distributed around 0, but with 2 notable outliers (eFigure 2). Linear regression was used to evaluate the reliability of the algorithm used for prediction of cardiovascular disease (CVD) death risk. 6 Regression of observed CVD death risk on expected risk gave an intercept not significantly different from 0 (0.0013, CI [-0.0014, 0.0041], p = 0.384) and a slope close to 1 (0.94, CI [0.75, 1.12], p < 0.001), with an adjusted R-squared of 0.214 and an F-statistic of 95.87 on 1 and 348 DF (p < 0.001). The plot of residuals versus fitted values did not show any pattern (eFigure 3A), and the normal Q-Q plot showed that residuals were normally distributed, with the exception of 2 outliers (eFigure 3B). The scale-location plot indicated limited unequal variance (eFigure 3C). All points were within the curved lines in the plot of residuals vs leverage, although the outliers were close to the 0.5 line (eFigure 3D). Based on these metrics, this method was considered to provide an acceptable estimate of population risk.
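The calibration check above, regressing observed on expected risk and inspecting intercept, slope, adjusted R-squared, and the F-statistic, can be reproduced with ordinary least squares. The sketch below is a minimal standard-library version, not the study's code. The confidence intervals use a normal quantile rather than the exact t quantile, an approximation that is reasonable only for large samples such as the 348 residual degrees of freedom reported here; the function name and the example data are made up.

```python
from statistics import NormalDist, fmean


def ols_summary(x, y, level=0.95):
    """Simple linear regression of y on x with approximate confidence intervals."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    df = n - 2                      # residual degrees of freedom
    mse = sse / df
    se_slope = (mse / sxx) ** 0.5
    se_intercept = (mse * (1.0 / n + mx ** 2 / sxx)) ** 0.5
    # ASSUMPTION: normal quantile in place of the t quantile (large-sample approximation).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    r2 = 1.0 - sse / sst
    return {
        "intercept": intercept,
        "slope": slope,
        "intercept_ci": (intercept - z * se_intercept, intercept + z * se_intercept),
        "slope_ci": (slope - z * se_slope, slope + z * se_slope),
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / df,
        "f_stat": (sst - sse) / mse,  # 1 numerator degree of freedom
        "df": (1, df),
    }


# Made-up expected (x) and observed (y) values, for illustration only.
expected_risk = [1.0, 2.0, 3.0, 4.0, 5.0]
observed_risk = [1.1, 1.9, 3.2, 3.9, 5.1]
fit = ols_summary(expected_risk, observed_risk)
```

A well-calibrated predictor gives an intercept interval that contains 0 and a slope interval that contains 1, which is the pattern reported above.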

eFigure 1. Histogram of Difference Between Expected and Observed CVD Death Rate
Difference between expected and observed CVD death risk per 100,000 was approximately normally distributed, after removal of 2 outliers. Normal curve plotted in red.  Census tracts for which a majority of determinants were missing were omitted from the study (n=5).

Data Description and Limitations
Data was collected in 2016-2017 but in some cases was for prior periods, as noted. Data was provided as percentages or counts per census tract and was not available at finer granularity. Two variables had a large proportion of missing data (eTable 1), and it was not possible either to obtain the missing values or to determine why they were missing. The level of missing data was low for the most part (of 20 determinants, 17 had 5 or fewer tracts with missing data). Different variables were generally not missing for the same tracts.
Percent vacant property estimates were produced by the US Postal Service. Vacant property data is routinely collected by mail carriers for addresses no longer receiving mail due to vacancy and is reported quarterly at the census tract level in the United States, along with counts of total mailing addresses. Data used in this study was aggregated to Allegheny County census tracts.
Location information for all supermarkets and convenience stores in Allegheny County was produced using the Allegheny County Fee and Permit Data for 2016. Fee and permit data were also used to generate the number of fast food restaurants (a restaurant that has more than one location in the county but no alcohol permit) and the number of restaurants per census tract (ACHD Fee and Permit data, 2016). Census tract level counts of Allegheny County fast food establishments were obtained by exporting all chain restaurants without an alcohol permit from the County's Fee and Permit System. Chain restaurants include both local and national chains (including locally owned national chains) so long as one or more establishments are in operation within the County. While access to supermarkets and an excess of fast food establishments are believed to affect nutrition and therefore health, that effect was not apparent in this dataset. Most tracts had 0 or 1 supermarket, and this study did not include analysis of access to transportation to supermarkets.
Fast food establishments were concentrated in the downtown area, where a large number of individuals work but few reside, and in the university areas, where there are likewise many workers who reside elsewhere as well as many students, who are at lower risk for cardiovascular disease.
Poor Housing Conditions is an estimate of the percent of distressed housing units in each census tract and was prepared using data from the American Community Survey and the Allegheny County Property Assessment database (https://data.wprdc.org/dataset/property-assessments). The estimate was produced by the Allegheny County Reinvestment Fund with the Allegheny County Department of Economic Development.
Obesity rates for each census tract in Allegheny County were obtained from a published study. 7 The rates were estimated using statistical modeling techniques, in which the obesity rate of a demographically similar census tract was applied to each Allegheny County tract. 7
Census tract walk scores were produced by Walk Score (https://www.walkscore.com), whose patented system measures the walkability of any address. For each 2010 census tract centroid, Walk Score analyzed walking routes to nearby amenities. Points were awarded based on the distance to amenities in each category. Amenities within a 5-minute walk (0.25 miles) are given maximum points. A decay function is used to give points to more distant amenities, with no points given after a 30-minute walk. Walk Score also measures pedestrian friendliness by analyzing population density and road metrics such as block length and intersection density. Data sources include Google, Education.com, Open Street Map, the U.S. Census, Localeze, and places added by the Walk Score user community.
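The distance weighting described above for Walk Score (full points within a 5-minute walk, decaying to zero at a 30-minute walk) can be illustrated with a simple weighting function. The actual Walk Score decay function is proprietary and is not given in the source; the linear decay below is a stand-in chosen only for illustration. The 1.5-mile cutoff is derived here by applying the stated pace (0.25 miles per 5 minutes) to 30 minutes and is also an assumption.

```python
FULL_CREDIT_MILES = 0.25  # 5-minute walk, as stated in the text
# ASSUMPTION: a 30-minute walk at the same pace, i.e. 0.25 miles * (30 / 5) = 1.5 miles.
ZERO_CREDIT_MILES = FULL_CREDIT_MILES * (30 / 5)


def amenity_weight(distance_miles):
    """Fraction of maximum points for an amenity at the given walking distance.

    ASSUMPTION: linear decay between the two thresholds; the real decay
    function is not specified in the source.
    """
    if distance_miles <= FULL_CREDIT_MILES:
        return 1.0
    if distance_miles >= ZERO_CREDIT_MILES:
        return 0.0
    span = ZERO_CREDIT_MILES - FULL_CREDIT_MILES
    return (ZERO_CREDIT_MILES - distance_miles) / span


# Example distances (miles) from a tract centroid to hypothetical amenities.
weights = [amenity_weight(d) for d in (0.1, 0.875, 2.0)]
```

Under these assumptions an amenity halfway between the two thresholds (0.875 miles) receives half of the maximum points.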
While walk scores indicate the ability of residents to walk to amenities, the probability that individuals increase their fitness levels by walking to them is highly variable.
Homicide counts were obtained from the Department of Human Services. Homicides were often found to be recorded at the location of the hospital where the affected individual would have died, so this variable was considered noninformative.
The following census tract level data was obtained from the American Community Survey, US Census,
We performed an analysis of Global Moran's I to assess spatial autocorrelation of difference between expected and observed CVD death risk at the census tract level. 8 Randomization with 999 permutations gave a pseudo p-value of 0.001, rejecting the null hypothesis that the distribution of difference was random in the county (eFigure 4, A and B). We further performed Local Indicators of Spatial Association (LISA) analysis to identify regions of clustering. Some areas of high-high and low-low clusters were identified, supporting the hypothesis that there was a degree of clustering of high and low risk census tracts within the county. Further analysis of spatial autocorrelation was beyond the scope of this study.
Determinants include (in order from top to bottom): percent high school graduates; percent households below poverty level (LowIncome); median age; percent unemployed; percent uninsured; percent vacant housing; walk score; food desert (based on number of supermarkets); percent households with no access to vehicle. Regression line in red.
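The Global Moran's I permutation test described above can be sketched in a few lines. This is an illustrative standard-library version, not the software used in the study. The spatial weights are assumed to be binary contiguity weights supplied as a dictionary (the source does not state the weighting scheme), the test is one-sided for positive autocorrelation, and the example values are made up. The pseudo p-value uses the common convention (m + 1) / (permutations + 1), where m is the number of permuted statistics at least as large as the observed one; under this convention the smallest value obtainable with 999 permutations is 0.001.

```python
import random


def morans_i(values, weights):
    """Global Moran's I; weights maps (i, j) index pairs to spatial weight w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(weights.values())
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


def permutation_pvalue(values, weights, n_perm=999, seed=0):
    """One-sided pseudo p-value from random relabelling of values across locations."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Six hypothetical tracts in a row; each tract neighbors the next (binary weights).
chain = {}
for i in range(5):
    chain[(i, i + 1)] = 1.0
    chain[(i + 1, i)] = 1.0
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
observed_i, pseudo_p = permutation_pvalue(values, chain)
```

In the example the values increase steadily along the row of tracts, so neighboring tracts have similar values and Moran's I is positive.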