Development and Validation of a Machine Learning Model Using Administrative Health Data to Predict Onset of Type 2 Diabetes

Key Points

Question  Can a machine learning model trained on routinely collected administrative health data be used to accurately predict the onset of type 2 diabetes at the population level?

Findings  In this decision analytical model study of 2.1 million residents of Ontario, Canada, a machine learning model was developed with high discrimination, population-level calibration, and calibration across population subgroups.

Meaning  Study results suggest that machine learning and administrative health data can be used to create population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions for diabetes prevention.


eFigure 1. Overview of our Approach
The data were accessed at ICES, an independent, non-profit research institute whose legal status under Ontario's health information privacy law allows it to collect and analyze healthcare and demographic data, without consent, for health system evaluation and improvement. The diverse data sources at ICES were linked using unique encoded identifiers from the Registered Persons Database (RPDB), a central registry from the province's universal single-payer healthcare system. This identifier enables linkage across datasets, as depicted in eFigure 1, and provides basic demographic information, including sex, age and geographical residence, that we used in our model. eTable 1 compares the two data types, AHD and EMR. The study by Razavian et al. is the closest to ours, as it also leverages administrative health data 1 . We chose two recently published studies applying machine learning to diabetes as examples of studies using EMR data 2,3 . EMR data are typically more complete on laboratory values, and are sometimes restricted to patients with non-missing values. They may also contain family history or lifestyle variables. AHD are characterized by an abundance of claims and their associated diagnosis codes. They also usually contain drug history and laboratory values, but each data source is typically very sparse, with very few variables that are never missing across patients. AHD may also include geographical or socioeconomic variables, as is the case in our study. While setting up a system to collect and build EMRs can be a years-long process 4 , AHD form the first layer of healthcare data and are collected automatically in single-payer healthcare systems and health insurance systems.
For each patient, we extracted data on healthcare utilization and services accessed from the following sources: physician and emergency claims from the Ontario Health Insurance Plan (OHIP), hospitalization history from the Discharge Abstract Database (DAD), emergency services from the National Ambulatory Care Reporting System (NACRS), and prescription medication claims for individuals aged 65 years or older and those receiving social assistance. Diabetes-related laboratory test results were obtained from the Ontario Laboratory Information System (OLIS). The Ontario portion of the Immigration, Refugees and Citizenship Canada (IRCC) database was used to identify immigration status and country of birth. Neighbourhood-level measures of socioeconomic status were obtained from the 2001, 2006 and 2011 Canadian censuses. Finally, patient deaths that occurred during the observation period were identified from the Office of the Registrar General-Deaths (ORG-D) database. A detailed description of all the data sources can be found in Table S2.
The health event data from the observation window were used to extract features summarizing a patient's history at that point in time. We found that a two-year window was sufficient to obtain the necessary information on the predictor features. As shown in eFigure 1, the extracted features were then fed into the model, and instance-level diabetes onset predictions were generated for the target window. Although it leads to a lower incidence rate than larger windows, we found that a three-month target window had greater predictive performance than a six-month or one-year window. Furthermore, a three-month target window allows continuous predictions to be generated every three months from the patient's health history, giving finer granularity in patient monitoring. If diabetes onset occurs after the end of the current target window, the target window of a subsequent instance captures it.
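The instance-generation scheme described above (a two-year observation window paired with a three-month target window, rolled forward every three months) can be sketched as follows. This is an illustrative simplification, not the actual ICES pipeline; the function name and the flat list of index dates are assumptions for the example.

```python
from datetime import date, timedelta

OBSERVATION_DAYS = 2 * 365   # two-year observation window
TARGET_DAYS = 91             # roughly three-month target window

def make_instances(index_dates, onset_date):
    """For each prediction (index) date, pair the preceding two-year
    observation window with the following three-month target window.
    The instance is positive if onset falls inside the target window;
    instances after a positive window are discarded, since the
    patient has already been diagnosed."""
    instances = []
    for t in index_dates:
        obs_start = t - timedelta(days=OBSERVATION_DAYS)
        target_end = t + timedelta(days=TARGET_DAYS)
        label = int(onset_date is not None and t < onset_date <= target_end)
        instances.append((obs_start, t, target_end, label))
        if label == 1:  # stop generating instances after onset
            break
    return instances
```

With index dates every three months, a patient contributes one instance per date until onset is captured, after which no further instances are generated.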
To determine the 5-year diabetes onset label for a given patient, we used the Hux algorithm in its 2016 version. 6 The Hux algorithm is a validated method used to build the Ontario Diabetes Dataset (ODD) registry, reaching a sensitivity of 86% and a specificity of 97% when compared with physician labels. We augmented this algorithm with a criterion on HbA1c values: the earliest date on which a patient had an HbA1c reading greater than or equal to 6.5% (when such a reading was available for the patient) was set as the onset date. When both HbA1c and Hux onset dates existed, we took the earlier of the two as the onset date. If a target window contains diabetes incidence, all subsequent target windows are discarded as irrelevant, since the patient has already been diagnosed.

The OHIP claims database contains information on inpatient and outpatient services provided to Ontario residents eligible for the province's publicly funded health insurance system by fee-for-service health care practitioners (primarily physicians), as well as shadow billings for those paid through non-fee-for-service payment plans. The main data elements include patient and physician identifiers (encrypted), code for the service provided, date of service, associated diagnosis, and fee paid. We also extracted OHIP emergency claims data using the OHIP Emergency Services (ERCLAIM) dataset, which uses a macro to extract emergency claims from OHIP claims (one record per emergency service). Finally, we used derived cohorts for six chronic diseases: asthma, congestive heart failure, hypertension, Crohn's disease, myocardial infarction and rheumatoid arthritis. These datasets contain yearly binary flags for both prevalence and incidence of the associated disease for each patient.
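The onset-date rule described above (the earlier of the Hux-algorithm date and the first HbA1c ≥ 6.5% date, using whichever exists) reduces to a few lines; the function name and ISO-date-string representation are illustrative assumptions.

```python
def onset_date(hux_date=None, first_high_hba1c_date=None):
    """Combined diabetes onset date: the earlier of the Hux-algorithm
    onset date and the date of the first HbA1c reading >= 6.5%,
    restricted to whichever of the two exist. Returns None when the
    patient has no onset under either criterion."""
    candidates = [d for d in (hux_date, first_high_hba1c_date) if d is not None]
    return min(candidates) if candidates else None
```

ISO-formatted date strings compare correctly lexicographically, so the same logic works for date objects or "YYYY-MM-DD" strings.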

eTable 1. Comparing Electronic Medical Records vs Administrative Health Data
We list all the datasets used as input features in our model and their role in the administrative health data at ICES.

eMethods 2. Feature Engineering
We did not perform any pre-processing of continuous variables, except for laboratory results. Laboratory results can be reported in different units, such as mg/L and g/L, and we standardized the unit before feature extraction. One-hot encoding was used for all categorical variables, and we discarded categories that appeared with a frequency of less than 1%. Removing infrequent categories significantly reduced the feature size and improved model generalization. As reported in previous studies, we also found that events in the observation window occur in highly irregular patterns. 7 Patients would typically have clusters of activity (multiple doctor/ER visits, laboratory tests, etc.) followed by quiet periods with few events. To summarize these patterns we performed various aggregations over different time intervals within the two-year observation window. For time aggregation, we counted events over the last month, quarter, 6 months, year, etc.
For event aggregation we combined events of the same type such as doctor visits by physician specialty and prescription medication by drug type. This double aggregation resulted in features such as "number of ophthalmologist visits in the last month" and "total quantity of drug X prescribed in the last year". We found such features to be highly informative for onset prediction.
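The double aggregation over time horizons and event types can be sketched as a counting pass over a patient's event stream. The event-type strings and horizon names here are illustrative, not the actual feature vocabulary.

```python
from collections import Counter

# Look-back horizons (in days) within the two-year observation window.
HORIZONS = {"1m": 30, "3m": 91, "6m": 182, "1y": 365, "2y": 730}

def aggregate_events(events, index_day):
    """Count events of each type over several look-back horizons.
    `events` is a list of (day, event_type) pairs, e.g.
    (700, 'visit:ophthalmology') or (650, 'rx:drug_X'), with days
    measured on the same axis as `index_day` (the prediction date).
    Produces features like 'visit:ophthalmology_last_1m'."""
    features = Counter()
    for day, etype in events:
        age = index_day - day
        for name, horizon in HORIZONS.items():
            if 0 <= age < horizon:
                features[f"{etype}_last_{name}"] += 1
    return dict(features)
```

Each event contributes to every horizon it falls within, so a visit last week is counted in the one-month, quarterly, six-month, one-year and two-year features simultaneously.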
During feature selection we adopted a greedy approach, computing multiple combinations of time and event aggregations. These candidate features were then incrementally added to the model and retained only if the validation-set performance improved. Throughout this process, to prevent any optimistic bias, the test set remained untouched and was used only to compute the test performance of the final model.
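A minimal sketch of this greedy forward selection is below. The `evaluate` callback stands in for training a model and scoring it on the validation set (e.g. by AUC); both names are assumptions for illustration.

```python
def greedy_select(candidate_groups, evaluate, base=()):
    """Greedy forward selection over groups of engineered features.
    `evaluate(features)` returns validation-set performance for a
    model trained on `features`; a candidate group is kept only if
    it improves on the best score so far. The held-out test set is
    never consulted during this loop."""
    selected = list(base)
    best = evaluate(selected)
    for group in candidate_groups:
        trial = selected + list(group)
        score = evaluate(trial)
        if score > best:
            selected, best = trial, score
    return selected, best
```

In practice `evaluate` would retrain the gradient-boosted model each time, so candidate groups are coarse (whole time/event aggregation families) rather than single columns.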
In addition to event aggregation, we included other features that summarize a patient's recent medical history. To capture recurrence frequency, we computed the time between consecutive events as well as the time since the most recent event. The goal was to detect whether certain events are becoming more frequent or occur with a specific time pattern.
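For one event type, these recurrence features reduce to gap statistics over the event dates; the feature names below are illustrative.

```python
def recurrence_features(event_days, index_day):
    """Summarize recurrence of a single event type: time since the
    most recent event, and the mean gap between consecutive events,
    to capture whether events are becoming more frequent."""
    days = sorted(d for d in event_days if d <= index_day)
    if not days:
        return {"days_since_last": None, "mean_gap": None}
    gaps = [b - a for a, b in zip(days, days[1:])]
    return {
        "days_since_last": index_day - days[-1],
        "mean_gap": sum(gaps) / len(gaps) if gaps else None,
    }
```

Comparing the most recent gaps against the mean gap is one way to flag an accelerating pattern of visits or tests.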
Moreover, we compared each patient's event history with the histories of patients in the same sex, age and immigration-status groups. Within-group comparisons can identify "outlier" patients whose condition trajectory deviates significantly from that of their peers. 8 Feature selection here was performed in the same greedy fashion, by incrementally adding subsets of features to the model. After multiple rounds of feature selection we obtained a set of approximately 300 features that maximized the validation AUC, and used this set for all further experiments.
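One simple form of such a within-group comparison is a peer-group z-score on any aggregated feature; this is a sketch of the idea, not the exact statistic used in the study.

```python
from statistics import mean, stdev

def group_zscore(value, peer_values):
    """Compare a patient's feature value (e.g. yearly visit count)
    with the values of patients in the same sex/age/immigration-status
    group. A large absolute z-score flags an 'outlier' trajectory
    relative to the peer group."""
    mu, sigma = mean(peer_values), stdev(peer_values)
    return (value - mu) / sigma if sigma > 0 else 0.0
```

Computed per group and per feature, these scores become additional model inputs rather than a separate outlier filter.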

eMethods 3. Model Development and Evaluation
To find optimal hyperparameters we performed grid search, first specifying ranges for each hyperparameter and then exhaustively evaluating points selected from those ranges. After grid search we selected the following settings: a tree depth of 10, learning rate of 0.05, minimum child weight of 50, α = 0.3, γ = 0.1, λ = 0.5, column sample by tree of 0.8 and column sample by level of 0.9 (relevant XGBoost parameter documentation can be found here: https://xgboost.readthedocs.io/en/latest/parameter.html). Since the incidence rate for onset is typically lower than 1%, we under-sampled negative instances by a factor of up to 20x to balance the training data. 9,10,11 After training, the output probabilities from the model were re-calibrated using the approach proposed by Dal Pozzolo et al. 12

We reported the feature contributions with Shapley values (see eFigure 3). 13,14 Specifically, after the model was trained, to estimate the contribution of each type of feature, we averaged absolute Shapley values over a sample of 10,000 instances selected at random from the test cohort. These feature contributions vary from one instance to another, and we found that a sample of 10,000 was large enough to obtain stable Shapley values. eFigure 3 displays the total contribution from each data type, computed with average absolute Shapley values. Our vastly heterogeneous input data can be organized into eight broad categories: demographics (stationary data such as country of birth or native language, but also age, or landing date in Canada for immigrants), routine diagnosis codes and history, laboratory values, geographical information (including latitude and longitude), yearly flags for history of chronic diseases (i.e. asthma, hypertension, congestive heart failure, chronic obstructive pulmonary disease, Crohn's disease and arthritis), prescription history, information on the specialty of each doctor encounter, and hospitalizations.
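The selected settings map onto XGBoost's parameter names as below, and the undersampling correction of Dal Pozzolo et al. has a closed form: if negatives are kept with probability β during undersampling, a score p_s from the model trained on the balanced data maps back to p = β·p_s / (β·p_s + 1 − p_s). The parameter dictionary is transcribed from the text; the calibration function is a sketch of the cited method, with `beta` as the retained-negative rate.

```python
# XGBoost hyperparameters selected by grid search (library names).
XGB_PARAMS = {
    "max_depth": 10,          # tree depth
    "learning_rate": 0.05,
    "min_child_weight": 50,
    "alpha": 0.3,             # L1 regularization (alpha)
    "gamma": 0.1,             # minimum split loss (gamma)
    "lambda": 0.5,            # L2 regularization (lambda)
    "colsample_bytree": 0.8,
    "colsample_bylevel": 0.9,
}

def calibrate(p_s, beta):
    """Map a probability from a model trained on negatively
    under-sampled data back to the original class distribution
    (Dal Pozzolo et al.): negatives kept with probability beta,
    so beta = 0.05 corresponds to 20x under-sampling."""
    return beta * p_s / (beta * p_s + 1.0 - p_s)
```

With beta = 1 (no undersampling) the score is unchanged; smaller beta shrinks the raw scores toward the true, much lower incidence rate.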
Demographics dominate among all data categories, due to the strong contribution of age and related features such as year of birth and age at landing date in Canada for immigrants. We also note the strong contribution of diagnosis history, and moderate contribution of laboratory results. These two data types have the highest frequency among our non-stationary input data. In contrast, extreme and rarer health events such as hospitalizations or ambulatory care usage contribute less to model predictions.