Predictive Value of Early Autism Detection Models Based on Electronic Health Record Data Collected Before Age 1 Year

Key Points
Question: Can autism be detected from routine electronic health records (EHRs) with clinically meaningful accuracy before age 1 year?
Findings: In this diagnostic study of 45 080 children, the accuracy of EHR-based early autism detection models at age 30 days was competitive with caregiver surveys collected at ages 18 to 24 months. Model accuracy improved further by age 1 year.
Meaning: These findings suggest that EHR-based autism detection could be integrated with caregiver surveys to improve the accuracy of early autism screening.

including ADHD and intellectual disability. To explore the latter, we altered the cases in the training and validation sets to include all individuals with at least one documented autism code. We refer to this as the weak phenotype in contrast to the stronger phenotype previously described (see Case Definitions and Cohort Selection).

Performance measures
The following measures were used during model development or model evaluation. Please note that in this section and elsewhere, we use risk in a mathematical sense to denote a patient's instantaneous probability of diagnosis per unit time given that they were not diagnosed previously. The term is not meant to imply that autism or an autism diagnosis is negative.

Area under the receiver operating characteristic curve (AUC) and AUCt
The area under the receiver operating characteristic curve (AUC) assesses the model's ability to distinguish between cases and controls (in this case, children later diagnosed with autism and children not diagnosed). It quantifies the probability that model-predicted risk for a randomly selected child who was later diagnosed with autism is higher than model-predicted risk for a randomly selected child who was not, which is equal to the area under the sensitivity versus specificity curve. Since lifetime diagnosis status is highly uncertain for patients with short follow-up, we primarily report the AUCt rather than the AUC, where AUCt is defined as the AUC when limiting negative cases to individuals followed for at least t years.
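The AUC and AUCt definitions above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the toy data, and the convention that tied scores count as one half are all introduced here for illustration.

```python
def auc(case_scores, control_scores):
    """Probability that a random case outscores a random control.

    Ties are counted as one half (an assumption of this sketch).
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def auc_t(scores, is_case, followup_years, t):
    """AUC after dropping controls followed for fewer than t years."""
    cases = [s for s, y in zip(scores, is_case) if y]
    controls = [
        s
        for s, y, f in zip(scores, is_case, followup_years)
        if not y and f >= t
    ]
    return auc(cases, controls)


# Hypothetical toy data: two cases and three controls.
scores = [0.9, 0.4, 0.7, 0.2, 0.5]
is_case = [True, True, False, False, False]
followup_years = [3, 5, 2, 9, 10]

auc_all = auc_t(scores, is_case, followup_years, t=0)  # all controls kept: 4/6
auc_8 = auc_t(scores, is_case, followup_years, t=8)    # two controls kept: 3/4
```

In the toy data, the control with only 2 years of follow-up is dropped at t=8, because its lack of a diagnosis is the least certain.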

Average positive predictive value (AP) and APt
The average positive predictive value (PPV), also known as the average precision (AP), is a conservative estimate of area under the PPV versus sensitivity curve. It quantifies the average PPV across a range of prediction thresholds with varying sensitivity. Compared to the AUC, the AP is preferred when the number of cases is much smaller than the number of controls. Similar to the AUC, we primarily report the APt rather than the AP, where APt is defined as the AP when limiting negative cases to individuals followed for at least t years.
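The AP and APt can be sketched the same way. The version below uses the step-wise sum in which PPV is recorded each time a case is encountered while moving down the ranked list. It assumes no tied scores, and the function names and toy data are illustrative rather than taken from the study.

```python
def average_precision(scores, labels):
    """Step-wise average precision; assumes no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_cases = sum(1 for _, y in ranked if y)
    true_positives = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            true_positives += 1
            total += true_positives / rank  # PPV at this sensitivity level
    return total / n_cases


def average_precision_t(scores, is_case, followup_years, t):
    """AP after dropping controls followed for fewer than t years."""
    kept = [
        (s, y)
        for s, y, f in zip(scores, is_case, followup_years)
        if y or f >= t
    ]
    return average_precision([s for s, _ in kept], [y for _, y in kept])


# Hypothetical toy data: two cases and three controls.
scores = [0.9, 0.4, 0.7, 0.2, 0.5]
is_case = [True, True, False, False, False]
followup_years = [3, 5, 2, 9, 10]

ap_all = average_precision(scores, is_case)                      # (1/1 + 2/4) / 2
ap_8 = average_precision_t(scores, is_case, followup_years, 8)   # (1/1 + 2/3) / 2
```

Because dropping short-follow-up controls raises the effective prevalence, APt values at different t are best compared relative to that prevalence rather than directly.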

Concordance Index (CI)
The concordance index (CI) quantifies the probability that model-predicted risk for a randomly selected pair of individuals is consistent with observed diagnosis and censoring times [6]. A given pair of individuals contributes to the CI only if (a) diagnosis is observed for both of them, or (b) diagnosis is observed at age t in only one individual, but the other is followed beyond age t without being diagnosed.
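The pair-selection rule above can be sketched as follows. This is a simplified illustration, not the study's code: it treats a pair as concordant when the individual diagnosed earlier has the higher predicted risk, counts tied risks as one half, and skips pairs with identical times. Those conventions and all names are assumptions of the sketch.

```python
def concordance_index(risk, time, diagnosed):
    """Fraction of comparable pairs ordered consistently with predicted risk.

    time[i] is the age at diagnosis if diagnosed[i], else the age at last
    follow-up (censoring).
    """
    concordant = 0.0
    comparable = 0
    n = len(risk)
    for i in range(n):
        if not diagnosed[i]:
            continue  # a pair needs at least one observed diagnosis
        for j in range(n):
            # j was either diagnosed later than i, or followed beyond
            # time[i] without a diagnosis.
            if time[j] > time[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical toy data: individuals 0 and 1 are diagnosed, 2 and 3 censored.
risk = [0.5, 0.3, 0.6, 0.1]
time = [2.0, 5.0, 4.0, 3.0]
diagnosed = [True, True, False, False]

ci = concordance_index(risk, time, diagnosed)  # 2 of 3 comparable pairs
```

In the toy data, individual 1 (diagnosed at age 5) and individual 2 (censored at age 4) do not form a comparable pair, because it is unknown whether individual 2 would have been diagnosed before age 5.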
Although the concordance index was used to select our final prediction model, as previously described, the AUCt and APt were chosen as our primary evaluation measures due to their higher clinical relevance and interpretability. To contextualize these measures, we also report the corresponding effective prevalence after controls followed for fewer than t years were excluded. The sensitivity of these measures to the threshold t was explored by calculating AUCt and the corresponding receiver operating characteristic (i.e., sensitivity versus 1 − specificity) curve; and APt and the corresponding PPV versus sensitivity curve. The AUC8 and AP8 were explored in greatest depth; the selection of this cutoff reflects a compromise between (a) ensuring children have been followed long enough for diagnosis to be likely, were it to occur; and (b) maintaining a large evaluation sample covering a wide range of birth years. High sensitivity and high specificity operating points were selected to achieve 90% sensitivity and specificity, respectively, based on an 8-year follow-up threshold. 1-calibration [7], which is based on a Hosmer-Lemeshow test statistic, was used to quantify correspondence between model-predicted risk and true diagnosis rates for all models at ages 4 and 8 years.
All performance measures were evaluated on the full test set as well as in subgroups defined by demographic variables (sex, race, ethnicity) and two other factors of interest. The first of these factors, low birth weight, was designed to investigate the degree to which model predictions and performance were driven by factors related to premature birth. This factor was defined as positive for all individuals whose earliest recorded birth weight was below the 5th percentile based on World Health Organization growth charts available from the National Center for Health Statistics. The second of these factors indicates whether individuals were born before 2013. This factor was designed to investigate whether performance differed between individuals whose data was extracted from the current (2013 to present) DUHS EHR versus older, legacy systems.

Description of cohort
Cohort demographics and rates of each neurodevelopmental condition are compared between autism cases and controls in sTable 1. Among the autism cases, there were 738 males and 186 females (79.9% male). All four groups of neurodevelopmental conditions other than autism included in the analysis occurred at higher rates among the autism cases compared to the controls (p < 0.001): there were 266, 44, 7, and 766 autism cases with co-occurring ADHD, intellectual disability, genetic neurodevelopmental conditions, and other neurodevelopmental conditions, respectively.
There were group differences in the number of encounters between autism cases and controls throughout the first year of life (see sFigure 2). The median number of encounters before 30 days was 3 in both groups, but the 75th percentile was higher among autism cases (5) versus controls (4), and the difference between distributions was statistically significant (p=0.030). The number of encounters in all other windows (30-60 days, 60-90 days, 90-180 days, 180-270 days, 270-360 days) was higher among autism cases than controls (p<0.001).
Autism cases made up 2.0% of the individuals included in our analysis. Adjusting for right censoring, the estimated cumulative rate of autism diagnosis was 0.1% at age 2, 0.6% at age 3, 1.2% at age 4, 2.0% at age 6, 2.5% at age 8, 3.1% at age 10, and 3.4% at age 12 (see sFigure 3).
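The text does not state which estimator was used to adjust for right censoring. One common choice is one minus the Kaplan-Meier survival estimate, and the sketch below assumes that choice purely for illustration. The function name and toy data are hypothetical.

```python
def km_cumulative_incidence(time, diagnosed, ages):
    """One minus the Kaplan-Meier survival estimate at each requested age.

    time[i] is the age at diagnosis if diagnosed[i], else the age at last
    follow-up. Individuals censored at an event time are counted as still
    at risk at that time.
    """
    event_times = sorted({t for t, d in zip(time, diagnosed) if d})
    result = []
    for age in ages:
        survival = 1.0
        for event_time in event_times:
            if event_time > age:
                break
            at_risk = sum(1 for t in time if t >= event_time)
            events = sum(
                1 for t, d in zip(time, diagnosed) if d and t == event_time
            )
            survival *= 1.0 - events / at_risk
        result.append(1.0 - survival)
    return result


# Hypothetical toy data: 5 children, 2 diagnosed (at ages 2 and 4).
time = [1, 2, 3, 4, 5]
diagnosed = [False, True, False, True, False]

rates = km_cumulative_incidence(time, diagnosed, ages=[2, 4])
naive_rate = sum(diagnosed) / len(diagnosed)  # ignores censoring
```

In the toy data the censoring-adjusted rate by age 4 (0.625) exceeds the naive fraction of diagnosed children (0.4). This is the same direction as the gap between the 2.0% raw case fraction and the higher adjusted cumulative rates reported at later ages.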

Prediction performance over time
Sensitivity to the length of required follow-up among controls is shown for the 30-day models (sFigure 5) and the 360-day models (sFigure 6). At 30 days, the AUCt ranged from an AUC4 of 0.688 to an AUC10 of 0.801, and the APt ranged from an AP4 of 0.110 (2.3-fold increase over effective prevalence) to an AP10 of 0.530 (3.2-fold increase over effective prevalence). At 360 days, the AUCt ranged from an AUC4 of 0.701 to an AUC10 of 0.826, and the APt ranged from an AP4 of 0.160 (3.4-fold increase over effective prevalence) to an AP10 of 0.606 (3.6-fold increase over effective prevalence). When varying the required follow-up length, the number of cases included in the evaluation (N=363) was unchanged, but the number of controls was 7638, 6537, 5373, 4428, 3615, 2868, and 2173 at t values of 4 to 10 years, respectively.
Sensitivity and PPV at our high (90%)-specificity and very high (97%)-specificity operating points, as well as specificity and PPV at our high (90%)-sensitivity operating points, are summarized in sTable 2, while the operating points themselves are depicted in sFigure 7. At the high-specificity operating points, sensitivity ranged from 0.452 at 60 days to 0.482 at 270 days, and PPV ranged from 0.226 at 90 days to 0.239 at 270 days. At the very high-specificity operating points, sensitivity ranged from 0.256 at 90 days to 0.292 at 30 days, and PPV ranged from 0.360 at 90 days to 0.393 at 30 days. Finally, at the high-sensitivity operating points, specificity ranged from 0.362 at 30 days to 0.396 at 270 days, and PPV ranged from 0.085 at 30 days to 0.089 at 270 days.
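As a worked illustration of how effective prevalence changes with the follow-up threshold, the sketch below uses the case and control counts listed above. It assumes that effective prevalence is simply cases divided by cases plus retained controls, and that the fold increase is AP divided by that prevalence. The AP value in the example is hypothetical, and the sketch is not expected to reproduce the reported fold increases exactly.

```python
N_CASES = 363
# Controls remaining at each follow-up threshold t (years), from the text.
CONTROLS_BY_T = {4: 7638, 5: 6537, 6: 5373, 7: 4428, 8: 3615, 9: 2868, 10: 2173}


def effective_prevalence(n_cases, n_controls):
    """Share of cases once short-follow-up controls are excluded."""
    return n_cases / (n_cases + n_controls)


def fold_increase(ap, prevalence):
    """AP relative to the AP expected from random guessing (the prevalence)."""
    return ap / prevalence


PREVALENCE_BY_T = {
    t: effective_prevalence(N_CASES, n) for t, n in CONTROLS_BY_T.items()
}

# Hypothetical AP of 0.45 evaluated against the t=8 effective prevalence.
example_fold = fold_increase(0.45, PREVALENCE_BY_T[8])
```

Effective prevalence rises as t increases because controls are removed while the number of cases stays fixed, which is why APt is reported alongside its fold increase.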
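One way to obtain a high-specificity operating point is to pick the risk threshold from the controls' score distribution and then read off sensitivity and PPV. The sketch below illustrates this under assumptions made here (positive prediction when the score is strictly above the threshold, hypothetical names and toy scores); it is not a description of how the study's thresholds were actually chosen.

```python
import math


def threshold_for_specificity(control_scores, target):
    """Lowest observed control score giving specificity >= target.

    A prediction is positive when score > threshold.
    """
    ordered = sorted(control_scores)
    # Number of controls that must be predicted negative.
    k = math.ceil(target * len(ordered))
    return ordered[k - 1]


def sensitivity_specificity_ppv(case_scores, control_scores, threshold):
    """Confusion-matrix summaries at a given threshold."""
    tp = sum(1 for s in case_scores if s > threshold)
    fn = len(case_scores) - tp
    fp = sum(1 for s in control_scores if s > threshold)
    tn = len(control_scores) - fp
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return tp / (tp + fn), tn / (tn + fp), ppv


# Hypothetical toy scores: 4 cases and 10 controls.
case_scores = [0.30, 0.60, 0.90, 0.70]
control_scores = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.80]

threshold = threshold_for_specificity(control_scores, target=0.90)
sens, spec, ppv = sensitivity_specificity_ppv(
    case_scores, control_scores, threshold
)
```

The toy PPV is far higher than the reported values because the toy prevalence (4 of 14) is far higher than the prevalence in the study's test set.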
Model calibration at ages 4 and 8 years is shown for the 30-day and 360-day models in sFigure 8. The corresponding 1-calibration statistics [7] are included in sTable 2.

Prediction among those with other neurodevelopmental conditions
Figure 3 further illustrates the effect of other neurodevelopmental conditions on correct identification of autism cases and controls over time. At high specificity operating points, detection of autism cases was highest among those with a neurodevelopmental condition other than ADHD both at 30 days (56.1%) and by 360 days (68.2%). Approximately half (46.7% and 50.0%, respectively) of those with co-occurring ADHD and those without any other neurodevelopmental condition were detected by 360 days. False positive rates at 30 days were highest among controls with a neurodevelopmental condition other than ADHD (16.7%) followed by those with ADHD (13.6%) and those with neither (7.2%). Across the full test set, 59.8% of the autism cases were detected by 360 days (i.e., sensitivity of combined models), and 81.5% of controls were predicted negative at all time points (i.e., specificity of combined models).
At very high specificity operating points, detection of autism cases was again highest among those with a neurodevelopmental condition other than ADHD both at 30 days (38.8%) and by 360 days (51.9%). Rates of detection at 30 days were higher for those with co-occurring ADHD (16.2%) compared to those without any other neurodevelopmental condition (13.6%), but this trend reversed by 360 days (19.0% versus 22.8%, respectively). False positive rates at 30 days were highest among controls with a neurodevelopmental condition other than ADHD (6.7%) followed by those with ADHD (4.1%) and those with neither (1.5%). Across the full test set, 38.8% of the autism cases were detected by 360 days (i.e., sensitivity of combined models), and 94.3% of controls were predicted negative at all time points (i.e., specificity of combined models).

Performance in subgroups
AUC8 was higher in females (0.794) than in males (0.748) (see Figure 4). AP8 was higher in males (0.527) than in females (0.372), but the fold increase in AP8 over autism prevalence was higher in females (9.1) than in males (3.7). Among all racial groups represented in the test set (>15 individuals), AUC8 ranged from 0.753 (American Indian / Alaskan Native) to 0.857 (Unknown Race), and was higher among White individuals (0.825) than among Black (0.781) or Asian (0.805) individuals (see sFigure 10). AP8 and the fold increase in AP8 over autism prevalence were lowest among Asian individuals (0.332 and 3.7, respectively). Both AUC8 and AP8 were higher in Hispanic individuals (0.859 and 0.620, respectively) than in non-Hispanic individuals (0.798 and 0.466, respectively; see sFigure 11).
There were 29 autism cases (8.0%) and 1030 controls (5.8%) in the test set (χ² p-value=0.105) whose earliest recorded weight was below the 2nd percentile. AUC8 was higher among individuals whose earliest recorded weight was below the 2nd percentile (0.913) compared to others (0.801), but the fold increase in AP8 over autism prevalence was lower (3.5 versus 5.3). These trends were similar among the 38 autism cases (10.5%) and 1640 controls (9.3%) in the test set (χ² p-value=0.497) whose earliest recorded weight was below the 5th percentile (see sFigure 12). AUC8 was higher among individuals whose earliest recorded weight was below the 5th percentile (0.883) compared to others (0.798), but the fold increase in AP8 over autism prevalence was similar (4.7 versus 4.8).

Feature importance
Among the different feature (i.e., predictor) groups, laboratory measurements had the greatest total influence on model predictions across all time points, ranging from 31.6% of predictions explained at 180 days to 33.8% at 270 days (see Figure 5). Procedures had the second greatest influence, peaking at 23.6% at 90 days and declining to 19.2% at 360 days. Diagnoses became more important over time (r=0.857; p=0.029), accounting for 11.0% of predictions at 30 days up to 19.6% at 360 days. Demographics became less important over time (r=-0.952, p=0.003), accounting for 11.2% of predictions at 30 days down to 5.9% at 360 days. Inpatient encounters also became more important over time (r=0.881, p=0.020). Other changes over time were not statistically significant (p>0.05).
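A group-level influence percentage of this kind can be computed by summing the mean absolute SHAP values of the features in each group and dividing by the grand total. The sketch below assumes that definition, which the text does not spell out. It uses only the six top predictors reported in this section, and the assignment of each predictor to a group and the dictionary keys are assumptions made here. Because most features are omitted, the resulting shares will not match the reported group percentages.

```python
from collections import defaultdict

# (assumed feature group, mean |SHAP|) for six reported predictors.
MEAN_ABS_SHAP = {
    "count_blood_glucose": ("laboratory", 0.038),
    "count_basic_metabolic_panel": ("laboratory", 0.034),
    "male_sex": ("demographics", 0.030),
    "count_ccs_255": ("diagnoses", 0.020),
    "count_complete_blood_count": ("laboratory", 0.017),
    "count_other_diagnostic_procedures": ("procedures", 0.015),
}


def group_shares(mean_abs_shap):
    """Percentage of total mean |SHAP| attributed to each feature group."""
    totals = defaultdict(float)
    for group, value in mean_abs_shap.values():
        totals[group] += value
    grand_total = sum(totals.values())
    return {group: 100.0 * v / grand_total for group, v in totals.items()}


SHARES = group_shares(MEAN_ABS_SHAP)
```

Normalizing by the grand total makes the shares sum to 100%, which allows groups with different numbers of features to be compared across time points.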
The importance of specific predictors at each time point (see sFigures 14-16) shows that on average, the single most influential predictor was the count of blood glucose measurements (|SHAP|=0.038) followed by the count of basic metabolic panels (|SHAP|=0.034), male sex (|SHAP|=0.030), the count of diagnosis codes associated with administrative and social admissions (CCS category 255; |SHAP|=0.020), the count of complete blood counts (|SHAP|=0.017), and the count of other diagnostic procedures (e.g., interview, evaluation, consultation; |SHAP|=0.015).

Effect of diagnosis criteria
Performance was also evaluated for a second set of models trained with the weak autism phenotype (≥1 autism-related ICD code; see Model Development and Evaluation). In addition to the 363 individuals in the test set meeting autism diagnosis criteria, there were 56 individuals in the test set who did not meet criteria but did satisfy the weak phenotype. Risk predicted by models trained on the weak phenotype was strongly correlated with risk predicted by our primary models across all time points (correlation coefficients of 0.959, 0.943, 0.936, 0.944, 0.927, and 0.937 at 30, 60, 90, 180, 270, and 360 days, respectively).
The two sets of models had similar performance (see sFigure 17) when discriminating between autism cases and controls (p>0.5) except at 30 days, when discrimination performance was higher for the model trained with the weak phenotype (AUC8=0.813) compared to the corresponding, primary 30-day model (AUC8=0.794; p=0.001). The two sets of models also had similar performance when discriminating between non-autism cases meeting the weak phenotype and controls (p>0.5) except at 30 days, when discrimination performance was higher for the model trained with the weak phenotype (AUC8=0.833) compared to the corresponding, primary 30-day model (AUC8=0.791; p=0.025). None of the models trained with either phenotype was able to effectively distinguish between autism cases and those not meeting full criteria but satisfying the weak phenotype (AUC8<0.512 for all models), and differences between models trained with the weak versus full phenotype were not statistically significant (p>0.05).

Prediction of cases identified by chart review
Of the 309 participants evaluated by chart review, 74 were later determined to have an autism diagnosis. Of these 74, 52 met our autism computable phenotype (sensitivity=70.3%), and an additional 6 met the weak phenotype (≥1 autism-related diagnosis code). Of the remaining 236 without an autism diagnosis, only 3 met our computable phenotype (specificity=98.7%).
A total of 79 of these 309 participants were in the test set, including 23 determined by chart review to have an autism diagnosis. Model AUC when discriminating between these 23 cases and the other 56 controls was 0.630 at 30 days, 0.643 at 60 days, 0.666 at 90 days, 0.609 at 180 days, 0.585 at 270 days, and 0.605 at 360 days.
At 30 days (see sFigure 18), model-predicted risk among individuals determined to have an autism diagnosis was higher for those who also satisfied the computable phenotype compared to those who did not (p=0.036). However, at 360 days, this was no longer true: among those with an autism diagnosis, model-predicted risk was higher for those not satisfying the computable phenotype (p=0.039).

Conditions. This figure is analogous to Figure 2, but shows prediction performance at 360 days rather than at 30 days. Prediction performance is shown for individuals later diagnosed with (a) ADHD, (b) another neurodevelopmental condition, or (c) neither. In each of these groups, cases were defined as children later meeting our autism computable phenotype, and controls were defined as children followed through age 8 but not meeting our phenotype. The top panels summarize the number of cases and controls in each group (top left) and the relationship between autism prevalence and average positive predictive value of model-based prediction (top right). The bottom panels show the tradeoff between sensitivity and specificity (bottom left) and sensitivity and positive predictive value (bottom right) in each group.

eFigure 10. Sensitivity of Prediction Performance Stratified by Race to Follow-up Threshold. The panels show the effect of varying the required follow-up threshold t from 4 to 10 years when evaluating differences in performance of the 360-day models between racial groups. The panels show the AUC8 (top), AP8 (middle), and AP8 divided by autism prevalence (bottom) in each group. The dotted lines indicate performance associated with random guessing (i.e., no information).

eFigure 11. Prediction Performance Stratified by Low Birth Weight. Performance when discriminating between children later diagnosed with autism and children not diagnosed through age 8 among individuals whose earliest recorded weight fell below (True) versus above (False) the 5th percentile based on World Health Organization growth charts. The panels show the AUC8 (top), AP8 (middle), and AP8 divided by autism prevalence (bottom) in each group. The dotted lines indicate performance associated with random guessing (i.e., no information).

eFigure 16. Effect of Training Phenotype. Performance (AUC8) when discriminating between three groups of individuals: (a) those diagnosed with autism, per our computable phenotype (cases); (b) those with at least one documented autism-related ICD code, but not meeting our phenotype (weak phenotype); and (c) those without a documented autism-related ICD code and followed through at least age 8 (controls). The panels show discrimination of cases versus controls (left), weak phenotype cases versus controls (center), and cases versus weak phenotype cases.