Comparison of Use of Health Care Services and Spending for Unauthorized Immigrants vs Authorized Immigrants or US Citizens Using a Machine Learning Model

This cross-sectional study examines health care utilization and expenditures that are attributable to unauthorized immigrants vs authorized immigrants or US-born individuals.


eMethods. Machine Learning Model and Analysis
The machine learning (ML) procedure consisted of the following steps. First, respondents to the L.A.FANS were asked whether they were born in the US, were a US citizen, were a permanent resident (possessed a green card), had been granted asylum, or had a visa to stay in the US. Visa holders were further asked whether the visa had expired or was still valid. We classified respondents as unauthorized if they had an expired visa or did not answer "yes" to any of the prior questions. Respondents were classified as authorized immigrants if they were naturalized citizens, permanent residents, refugees, or holders of an unexpired visa. Respondents who were born in the US were classified as US-born individuals. The L.A.FANS surveyed 1,603 US-born respondents and 2,179 non-US-born immigrants (771 unauthorized and 1,408 authorized). After listwise deletion, there were 1,464 (43.7%) US-born respondents, 1,234 (36.9%) authorized immigrants, and 649 (19.4%) unauthorized immigrants.
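The classification rules above can be sketched as a small function. This is an illustrative reading of the text, not the authors' code: the function name, argument names, and the order in which the rules are checked are assumptions, and the actual L.A.FANS variable names are not given in the source.

```python
def classify_status(born_in_us=False, citizen=False, permanent_resident=False,
                    asylum=False, has_visa=False, visa_expired=False):
    """Assign one of three status groups from the survey answers.

    All argument names are hypothetical stand-ins for the L.A.FANS items.
    Assumption: the rules are checked in the order in which the text lists them.
    """
    if born_in_us:
        return "US-born"
    # Naturalized citizens, permanent residents, and those granted asylum
    # are treated as authorized immigrants.
    if citizen or permanent_resident or asylum:
        return "authorized"
    # A visa counts toward authorized status only if it has not expired.
    if has_visa and not visa_expired:
        return "authorized"
    # Expired visa, or no "yes" answer to any of the prior questions.
    return "unauthorized"


print(classify_status(has_visa=True, visa_expired=True))
print(classify_status(permanent_resident=True))
```

How a respondent with conflicting answers (for example, a citizen who also reports an expired visa) was handled is not stated in the text; the sketch resolves such cases in favor of the first matching rule.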
Second, in predicting unauthorized versus authorized status, the ML model used the following variables: sex, age in years, race/ethnicity, marital status (married versus not married), education status (less than high school, high school, college), language of interview (Spanish versus English/Other), poverty status, and occupation. Race/ethnicity included white, black, Latino or Other. Poverty status was based on having household income above versus below 100% of the federal poverty line. Occupation included the following categories: management business & financial, professional & related, service, sales & related occupation, office & administration, farming fishing & forestry, construction extraction & maintenance, production transportation & material moving, other, and not in labor force. These variables were selected based on a review of measures that were commonly defined across the L.A.FANS and MEPS databases (see below).
Third, we fit a random forest classifier to the L.A.FANS dataset described above (3,348 observations). The model was programmed in Python 3.7 using scikit-learn, a library of tools for machine learning analyses in Python. To test model performance, we used K-fold cross-validation (CV): the L.A.FANS dataset was subdivided into 5 folds, which were used as testing sets for the ML model. K-fold CV suggested that the model had 90.5% accuracy (standard deviation [SD] = 2.1%) and 93.9% specificity (SD = 2.3%) in predicting unauthorized versus authorized immigration status (eTable 1). Sensitivity (76.1%, SD = 6.2%) and precision (75.7%, SD = 4.7%) were comparable. The F-measure was 75.7 (SD = 3.6).
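A minimal sketch of this evaluation step with scikit-learn follows. It runs on synthetic data generated by `make_classification`, not on the L.A.FANS data, and the hyperparameters (`n_estimators=100`, shuffled folds, the random seeds) are assumptions; the source does not report the settings that were used. Specificity is computed as recall on the negative (authorized) class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the predictors; label 1 plays the role of "unauthorized".
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.65, 0.35], random_state=0)

scoring = {
    "accuracy": "accuracy",
    "sensitivity": "recall",                                 # recall on class 1
    "specificity": make_scorer(recall_score, pos_label=0),   # recall on class 0
    "precision": "precision",
    "f_measure": "f1",
}

# Five folds, each used once as the testing set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
results = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Mean and SD of each metric across the five folds.
summary = {name: (results["test_" + name].mean(), results["test_" + name].std())
           for name in scoring}
for name, (mean, sd) in summary.items():
    print("{}: {:.3f} (SD {:.3f})".format(name, mean, sd))
```

Reporting the mean and SD across folds mirrors the way the metrics are presented in the text; because the F-measure is averaged per fold, it need not equal the harmonic mean of the averaged precision and sensitivity.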
Finally, we applied the random forest classifier to predict the unauthorized status of MEPS respondents, using the MEPS measures that correspond to the predictors in the ML model. The 2016-2017 MEPS includes 12,554 non-US-born respondents. Respondents with missing data for the above variables were listwise deleted, which resulted in 12,120 non-US-born respondents (3.5% missing). A total of 36,341 respondents were US-born. Listwise deletion resulted in 35,079 US-born respondents (3.5% missing).
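The reported shares of missing data can be checked directly from the counts above. The helper name `percent_dropped` is introduced here for illustration only.

```python
def percent_dropped(n_before, n_after):
    """Share of respondents removed by listwise deletion, in percent."""
    return 100.0 * (n_before - n_after) / n_before


# Counts as reported for the 2016-2017 MEPS.
non_us_born = percent_dropped(12554, 12120)
us_born = percent_dropped(36341, 35079)

print("non-US-born: {:.1f}% dropped".format(non_us_born))
print("US-born: {:.1f}% dropped".format(us_born))
```

Both groups come out at roughly 3.5% after rounding to one decimal place, in line with the figures in the text.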
Sensitivity analyses included examining alternative machine learning models. eTable 1 compares accuracy, specificity, sensitivity, precision, and F-measure for the random forest classifier, decision tree classifier, and boosted classifier (AdaBoost) models. These are widely used non-parametric machine learning approaches for classifying observations into two groups. The random forest classifier has better metrics than the other approaches; for example, its F-measure is 75.7 compared to 74.5 for the boosted model and 73.0 for the decision tree classifier. However, the differences are not statistically significant.
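A comparison of this kind could be set up as in the sketch below, which scores the three model families on the same five folds. The data are synthetic and all settings are scikit-learn defaults or assumptions; the source does not describe the configuration of the alternative models or the test used to assess significance, so none is shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the L.A.FANS data).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Reuse the same folds for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
f_scores = {name: cross_val_score(model, X, y, cv=cv, scoring="f1")
            for name, model in models.items()}

for name, scores in f_scores.items():
    print("{}: F-measure {:.3f} (SD {:.3f})".format(name, scores.mean(), scores.std()))
```

Fixing the folds across models keeps fold-to-fold variation from being mistaken for a difference between classifiers.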
We compared the percentage of respondents classified as unauthorized, and their sociodemographic characteristics, from our ML-based approach with corresponding estimates from the Pew Research Center. 1,6-8,13 The ML approach is more conservative in classifying respondents as unauthorized, accounting for 1.2% (2.6% unweighted) of MEPS respondents versus 3.2% in the latest (2017) data reported by the Pew Research Center. 1 Reasons for this difference are unclear, and it may be reasonable to assume that the true percentage lies within this range.
In our opinion, there is no authoritative source of data on unauthorized immigrants; imputation has therefore been an important technique for researchers studying this population. There is a general lack of detailed information on this group, and substantial limitations are associated with all attempts to characterize it. A comparison of our ML-based estimates with alternative estimates of the unauthorized population should be interpreted in the context of the limitations associated with all of the approaches used. With that caveat, we compare the characteristics of unauthorized immigrants from our approach with those from two other well-known sources.
The first source, a 2018 report by the Pew Research Center (PRC), uses the residual approach with data from 2016. This approach starts with the weighted total foreign-born population in the US and then uses US Department of Homeland Security data on lawful immigrant admissions to estimate the total number of unauthorized immigrants in the US. The PRC then probabilistically assigns unauthorized immigration status to respondents in either the American Community Survey or the Current Population Survey based on an "individual's demographic, social, economic, geographic and family characteristics in numbers that agree with the initial residual estimates for the estimated lawful immigrant and unauthorized immigrant populations in the survey". 2
The second source is the Migration Policy Institute (MPI). Its methodology imputes unauthorized status in the 2012-2016 American Community Survey using the 2008 Survey of Income and Program Participation (SIPP). 3
In the table below, our ML approach suggests that 49.4% of unauthorized immigrants are male, compared to 54% for the PRC and 53% for the MPI approaches. Both the ML and PRC approaches suggest that unauthorized immigrants skew toward younger ages, with 80.8% and 69% aged 18-44 for ML and PRC, respectively. Approximately half of unauthorized immigrants have less than a high school education (53.7% for ML; 47.0% for MPI). However, the ML approach suggests a substantially higher rate of poverty (40.6%) than the MPI approach (28%).

Demographics
Machine