Wearable Biosensing to Predict Imminent Aggressive Behavior in Psychiatric Inpatient Youths With Autism

This prognostic study investigates the use of wearable biosensors and machine learning for predicting aggressive behaviors in youths with autism who are psychiatric inpatients.


Introduction
Autism is one of the most common childhood disorders (occurring in 1 in 36 children). 1 [8][9][10] Several factors make it difficult for youths with autism to regulate their emotions and self-report their internal states 11; 30% to 40% are minimally verbal, 12 and those who are fluently verbal often have poor emotional insight and self-awareness. 13 This predicament can make aggressive behavior unpredictable and thus dangerous, creating a barrier to accessing the community, therapy services, clinicians, and educational placements. Families report that aggressive behavior increases their stress, isolation, and financial burden and decreases available support options because they fear putting their child with autism into environments that may result in unexpected aggressive behavior. 14,15 Aggressive behavior also affects support professionals, leading to compensatory payments for injury, increased numbers of sick days, and higher turnover rates. 16,17 This challenging situation can demoralize parents and clinicians, accelerate negative patient trajectories, and lead to homebound or residential living placement care, collectively decreasing quality of life while increasing costs.
Peripheral physiology is a promising objective indicator of aggressive behavior. 18,19 While significant heterogeneity in individuals with autism exists, atypical autonomic reactivity is a common feature [20][21][22] and can putatively occasion maladaptive behavior when demands exceed an individual's coping ability. 23,24 In prior work, 25 we recorded peripheral physiological (cardiovascular and electrodermal activity) and motion (accelerometry) signals from a wearable biosensor worn by 20 youths with autism (ages 6-17 years; 75% male; 85% minimally verbal) during naturalistic observation sessions with concurrent behavioral coding in a single specialized inpatient psychiatry unit. Our objective was to test the hypothesis that we could use preceding physiological changes to predict aggression toward others (ATO) before it occurred. Using ridge-regularized logistic regression (LR), we demonstrated that we could predict ATO 1 minute before it occurred using 3 minutes of preceding biosensor data with a mean area under the receiver operating characteristic curve (AUROC) of 0.71 for a population model (PM) and 0.84 for person-dependent models (PDMs). This study extends that initial research by replicating our approach in 70 independent participants with autism across 4 psychiatric inpatient units and expanding prediction evaluation to include self-injury and emotion dysregulation (ED; ie, tantrums and meltdowns).

Methods
This prognostic study designed to update and validate a predictive model is reported following the

Study Participants
We enrolled 86 psychiatric inpatients not included in our previous studies serially at 4 clinical inpatient sites participating in the AIC (eMethods 1 in Supplement 1). Inclusion criteria included confirmation of autism via research-reliable administration of the Autism Diagnostic Observation Schedule-2 (ADOS-2) 26 and parent-reported, staff-reported, or staff-observed physical aggression or self-injurious behavior (SIB). ADOS-2 is a semistructured autism diagnostic observation. In each of 4 developmental and language-level-dependent modules, a protocol of social presses is administered by a trained examiner, and then behavioral items relevant to autism spectrum disorders (ASDs) are scored as a standardized metric of severity ranging from 1 to 10 (with 1 indicating no ASD features and 10 indicating severe ASD symptoms). Exclusion criteria included not having a parent proficient in English or prisoner status for the individual with autism.

JAMA Network Open | Statistics and Research Methods
Wearable Biosensing to Predict Aggressive Behavior in Inpatient Youths With Autism

Data Collection
Race and ethnicity were self-reported. Available race categories in the database were American Indian or Alaskan Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, White, and other. Available ethnicity categories were Hispanic or Latino or not Hispanic or Latino. Race and ethnicity were assessed and included to describe sample characteristics.
Intellectual disability was evaluated using the Leiter International Performance Scale-Third Edition (Leiter-3), a widely used standardized (normative mean [SD] score, 100 [15]) measure of nonverbal intellectual functioning designed to assess attention, cognition, and memory. It is administered without vocal instructions and does not require reading, writing, or verbal responses. Naturalistic observational coding sessions were performed by research staff. At the same time, inpatient study participants with autism wore the commercially available and regulatory-compliant E4 biosensor (Empatica, Inc) on their nondominant wrist. The E4 records changes in peripheral autonomic (blood volume pulse and electrodermal activity) and motion activity (3-axis acceleration) (eMethods 2 in Supplement 1).
Research staff conducted observations with minimal interference in participant daily inpatient routines, which included academic lessons; behavioral, occupational, speech, and milieu therapies; meals; and free time. Research staff coded targeted aggressive behavior (ie, SIB, ED, and ATO) (see eMethods 3 in Supplement 1 for operational definitions) episode start (onset) and stop (offset) times within the observation period (eMethods 12 and eFigure 1 in Supplement 1) using a custom mobile application time-synchronized to the internal clock of the biosensor worn by participants.

Statistical Analysis
After time-series feature extraction and data preprocessing (eMethods 4-5 in Supplement 1), we used ridge-regularized LR, support vector machines (SVMs), and neural networks (NNs) with extracted time-series features as input variables to make binary aggressive behavior predictions over time (eMethods 6-8 in Supplement 1). Specifically, at every time t, the classifier estimated the likelihood of aggressive behavior, indicated by the label l, in an upcoming time range (t to t + τ_f), using features extracted in a previous time range (t − τ_p to t). We also included the SD of each extracted feature in all prediction models. We use augmented feature vectors (AFVs) to refer to feature vectors that include aggressive behavior observations and time since the last aggressive behavior, in contrast to feature vectors (FVs), which do not include such labels. Data were analyzed from March 2020 through October 2023.
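The windowing scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study pipeline: the function name `make_samples`, the one-value-per-second signal, and the use of only a window mean and SD as features are all hypothetical simplifications (the actual features are described in eMethods 4-5 in Supplement 1). The 15-second decision step follows the Classification Strategies section.

```python
from statistics import mean, pstdev

def make_samples(signal, episode_times, tau_p, tau_f, step=15):
    """Build (time, features, label) triples from a 1 Hz signal.

    signal        -- list of floats, one value per second (assumed sampling rate)
    episode_times -- set of seconds during which a coded behavior is ongoing
    tau_p, tau_f  -- past and future window lengths, in seconds
    step          -- seconds between successive decisions
    """
    samples = []
    for t in range(tau_p, len(signal) - tau_f + 1, step):
        past = signal[t - tau_p:t]              # data in [t - tau_p, t)
        features = [mean(past), pstdev(past)]   # window mean and SD (illustrative)
        # Label is 1 if any behavior is coded in the upcoming window [t, t + tau_f).
        label = int(any(s in episode_times for s in range(t, t + tau_f)))
        samples.append((t, features, label))
    return samples
```

With a 600-second signal, one episode covering seconds 300 to 309, and both windows set to 60 seconds, only the decision times whose future window overlaps the episode receive a positive label.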

Classification Strategies
We performed aggressive behavior prediction using PMs and PDMs, wherein data were processed every 15 seconds for decision-making (eMethods 9 in Supplement 1). A single classifier was trained in PM using the entire data set, which included all participants and sessions. Individual classifiers were trained across sessions from a single participant in a PDM. In both models, 1 to 3 minutes of prior data (τ_p) were used to make predictions 1 to 3 minutes into the future (τ_f) given that this temporal range accommodated the briefest individual observational coding session in our corpus.
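The difference between the two strategies is only in how training data are pooled, which the following sketch makes concrete. The record layout (dictionaries with `participant`, `features`, and `label` keys) and both function names are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

def population_training_set(records):
    """PM: pool samples from all participants and all sessions."""
    return [(r["features"], r["label"]) for r in records]

def person_dependent_training_sets(records):
    """PDM: one training set per participant, pooled across that person's sessions."""
    per_person = defaultdict(list)
    for r in records:
        per_person[r["participant"]].append((r["features"], r["label"]))
    return dict(per_person)
```

A PM therefore sees every sample once, whereas each PDM sees only the subset belonging to its participant, which is why PDMs are more sensitive to the amount of data available per person.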
To achieve PM individualization, we explored the application of pseudolabeling as a domain adaptation (DA) technique comprising 2 phases: pseudolabeling and model training (eMethods 10 in Supplement 1). By using this iterative process, we aimed to enhance the adaptability and accuracy of PMs, ultimately improving their individualization for specific people.
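A generic form of the 2-phase pseudolabeling loop can be sketched as below. This is an assumption-laden illustration of self-training in general, not the procedure in eMethods 10: the confidence margin, the number of rounds, the gradient-descent settings, and all function names are hypothetical, and the small ridge-penalized logistic regression is written out only to keep the example self-contained.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # w holds one weight per feature followed by the bias term.
    return sigmoid(sum(a * b for a, b in zip(w, x)) + w[-1])

def fit_lr(X, y, l2=0.01, lr=0.5, epochs=500):
    """Ridge-penalized logistic regression fit by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        for j in range(len(w)):
            penalty = l2 * w[j] if j < len(w) - 1 else 0.0  # bias is not penalized
            w[j] -= lr * (grad[j] / n + penalty)
    return w

def pseudolabel_adapt(X_pop, y_pop, X_person, rounds=3, margin=0.3):
    """Phase 1: pseudolabel confident person samples. Phase 2: retrain. Repeat."""
    w = fit_lr(X_pop, y_pop)
    for _ in range(rounds):
        X_new, y_new = [], []
        for x in X_person:
            p = predict_proba(w, x)
            if abs(p - 0.5) >= margin:   # keep only confident predictions
                X_new.append(x)
                y_new.append(int(p > 0.5))
        if not X_new:
            break
        w = fit_lr(X_pop + X_new, y_pop + y_new)
    return w
```

The design choice worth noting is the confidence threshold: admitting only high-confidence pseudolabels limits the risk of reinforcing the population model's own mistakes on the new individual.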


Model Validation
We used AUROC values as our primary model performance metrics in all experiments. ROC curves plot the probability of false alarms vs the likelihood of detecting an event of interest at varying prediction thresholds. An ideal classifier presents a probability of detection equal to 1 for a probability of a false alarm equal to zero, and thus an AUROC of 1. For PDMs, we report mean AUROCs across all individuals and data splits. We computed ROCs and AUROCs for each discrete aggressive behavior in the multiclass setting. When evaluating DA, we calculated the median AUROC increase across participants.
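For readers less familiar with the metric, the AUROC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name `auroc` is illustrative, and the pairwise loop is chosen for clarity rather than speed.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, scores of 0.1, 0.4, 0.35, and 0.8 with labels 0, 0, 1, and 1 give 3 correctly ordered pairs out of 4, that is, an AUROC of 0.75; a classifier that assigns the same score to every sample yields 0.5.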

Training and Testing Data-Splitting Methods
To avoid overfitting and to assess internal and external validity, we split the data 3 ways: session splits (SSs), k-fold cross-validation (CV) with leave-individuals-out (LIO), and CV with leave-sessions-out (LSO). Unless otherwise stated, we used 5 folds with 5 repetitions when using CV. We used SS in PMs and PDMs, CV with LIO only in PM, and CV with LSO only in PDM. For SS, we split each session in 2, using the first 80% of each session to construct the training set and reserving the remaining 20% for testing (eMethods 11 in Supplement 1). For experiments with CV, we computed 95% CIs for AUROCs obtained for all CV splits.
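Two of these splitting schemes can be sketched in a few lines. The function names, the seeded shuffle, and the round-robin fold assignment are assumptions for illustration; eMethods 11 in Supplement 1 remains the authoritative description.

```python
import random

def session_split(session_samples, train_fraction=0.8):
    """SS: the first 80% of a session (in time order) trains, the rest tests."""
    cut = int(len(session_samples) * train_fraction)
    return session_samples[:cut], session_samples[cut:]

def leave_individuals_out_folds(participant_ids, k=5, seed=0):
    """LIO: partition participants into k folds so nobody is in both train and test."""
    ids = sorted(set(participant_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]   # round-robin assignment to folds
    return [(sorted(set(ids) - set(fold)), sorted(fold)) for fold in folds]
```

Repeating the CV (for example, 5 repetitions) would amount to calling `leave_individuals_out_folds` with different seeds. Splitting by participant rather than by sample is what lets LIO estimate performance on people the model has never seen.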

Experiments
We conducted 7 main experiments to evaluate AFVs and FVs (eMethods 12 in Supplement 1). Except for experiments 5 and 6, all investigations were performed with target aggressive behaviors combined into a single label (ie, CMB). An additional eighth experiment evaluated semisupervised DA for model individualization for different τ_p and τ_f settings with AFVs and FVs (eMethods 13 in Supplement 1). A post hoc analysis examined changes in AUROC as a function of observation duration and mean aggressive behavior frequency, duration, and intensity (eMethods 14 in Supplement 1).

Data Collected
Of 86 participants enrolled, 16 individuals were not included in data analysis (18.6%) because they could not wear the physiological biosensor (8 individuals) or were discharged before an observation could be made (8 individuals). Common reasons stated by clinical staff for participants not being able to wear the sensor were tactile sensitivity and general behavioral noncompliance. There were 70 remaining study participants (mean [range; SD] age, 11.9 [5-19; 3.5] years; Table 1; see eMethods 15 in Supplement 1 for additional participant demographics). Our study population sex demographics were commensurate with the sex distribution of autism, wherein males are 4 times as likely as females to receive a diagnosis. 1 Nearly half of the population (32 individuals [45.7%]) was minimally verbal (ADOS-2 module 1 or 2), and 30 individuals (42.8%) had an intellectual disability (mean [SD] Leiter-3 global IQ score, 72.96 [26.12]). Participant length of inpatient hospital stay ranged from 8 to 201 days (mean [SD] length of stay, 37.28 [33.95] days).
After a brief desensitization protocol that involved gradually increasing exposure to the biosensor, 70 participants tolerated wearing the E4, and we obtained usable data for all participants.
We collected 429 independent naturalistic observational coding sessions (median [IQR], 5 [1-7] sessions/participant) totaling 497 hours (median [IQR]

f The Emotion Dysregulation Inventory is an informant questionnaire that assesses poor emotion regulation and is validated for general community and clinical populations, as well as youths with autism spectrum disorders. Response options ask respondents to rate each item on a 5-point Likert scale from "not at all" to "very severe" for observed functioning over the past 7 days. It includes 2 scales: Reactivity, which captures rapidly escalating, intense, and poorly regulated negative emotion and is available as a 24-item form or 7-item short form, and Dysphoria, a 6-item measure of low positive affect, sadness, and general unease.

Abbreviations: ATO, aggression toward others; ED, emotion dysregulation; SIB, self-injurious behavior.
a The bottom row displays the total overall sessions.

Experiment Outcomes
The 8 experiments were performed with different values of τ_p and τ_f. Results are summarized in Figure 1 and Figure 2 and eResults 1, eFigure 1, and eFigure 2 in Supplement 1. Figure 1A to D presents mean AUROCs across all tested values of τ_p and τ_f (∈ {60, 120, 180} seconds). We observed considerable performance improvements when using AFVs for all experiments, especially in the onset scenario.
Within the combined or binary classification setting (experiments 1-4), PM with SSs (experiment 1) produced the best results, followed by PM with LIO (experiment 3), using AFVs-onset. Experiments 5 and 6 leveraged PM with LIO and PDM with LSO, respectively, with a τ_p of 180 seconds and considered multiclass SIB, ED, and ATO predictions. Results are summarized in Figure 1B. Compared with experiment 1 and 3 results, SVMs performed poorly even for AFVs-offset.
Among aggressive behaviors, SIB was most detectable with AFVs, while ATO was most detectable using FVs. For PDM with LSO (experiment 6) models, LR produced the best results, achieving a mean AUROC of 0.74 (95% CI, 0.73-0.75) for combined behaviors at a τ_f of 180 seconds. Similar to what we found in experiment 2 and 4 results, we hypothesize that smaller data sets for NNs and SVMs may have been associated with lower AUROCs. Moving to individual classes, we observed higher AUROCs for SIB (0.69; 95% CI, 0.63-0.75) and ED (0.62; 95% CI, 0.55-0.69) than ATO (0.56; 95% CI, 0.52-0.60).
We also evaluated the test-retest reliability of LR in PM for the CMB scenario with CV on SS using AFVs-onset, given that it was generally the best-performing model across experiments. We computed AUROC statistics (median and IQR) for a given session across individuals and CV splits, summarized in boxplots shown in eFigure 3 in Supplement 1. Given that the number of sessions varied by participant, we also computed AUROCs by groups of 3 consecutive sessions, depicted in eFigure 3 in Supplement 1. Although a decreasing AUROC trend was apparent in both plots, grouping sessions yielded more stable results, with the median (IQR) AUROC across session groups ranging from 0.64 (0.62-0.72) for sessions 22 to 26 to 0.80 (0.77-0.82) for sessions 1 to 3.
Finally, we analyzed the use of semisupervised DA to mitigate the lack of data available for PDMs. In this experiment, we focused on LR owing to its overall reliability and considered FVs and AFVs-onset. Figure 1D presents median AUROC differences from the final individualized LR model after DA and the initial PM AUROC across varying training and test splits (eResults 2 and eFigure 4 in Supplement 1) at different τ_p and τ_f values. Across scenarios, we observed a noticeable increase in AUROC for the general PM. The best performance was achieved for a τ_p of 180 seconds and τ_f of 60 seconds, with median (IQR) AUROC improvements of 14.48 (11.37-17.08) and 11.03 (8.60-16.92) for FVs and AFVs-onset, respectively. The median (IQR) AUROC improvement for a τ_p of 180 seconds and τ_f of 180 seconds was 5.27 (2.05-7.18) and 6.38 (2.46-9.10) for FVs and AFVs-onset, respectively.
For models with a τ_f of 180 seconds, the maximum median (IQR) improvement in AUROC of 7.32 (5.13-9.85) was observed when τ_p was 120 seconds for FVs. We highlight that these AUROC improvements were computed with respect to the initial PM model performance on individual data. As discussed in experiment 1, PMs with LIO and AFVs-onset had higher overall performance.
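The summary statistic used for the DA experiment, a median (IQR) of per-participant AUROC changes, can be sketched as follows. The function name and the "inclusive" quartile convention are assumptions (the source does not state which quantile method was used), and the example values in the note below are invented for illustration only.

```python
from statistics import median, quantiles

def improvement_summary(auroc_before, auroc_after):
    """Median and quartiles of per-participant AUROC change (after minus before)."""
    diffs = [after - before for before, after in zip(auroc_before, auroc_after)]
    # Quartile convention is an assumption; other methods give slightly different IQRs.
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    return median(diffs), (q1, q3)
```

For 5 hypothetical participants whose AUROCs change by 0.10, 0.05, 0.15, 0.02, and 0.08, this returns a median of about 0.08 with quartiles of about 0.05 and 0.10. Reporting the median rather than the mean keeps a few participants with unusually large gains or losses from dominating the summary.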

Discussion
Our experiments in this prognostic study found that machine learning combined with wearable biosensing and time-stamped mobile behavior annotation data could be used to predict SIB, ED, and ATO in a sizable sample of youths with autism who were psychiatric inpatients. Our determination of best classifier performance considers the pattern of results observed across all experiments. While SVM produced the best single AUROC, it did not perform as consistently well across all experiments as LR. Hence, we considered LR our best-performing classifier. LR predictions yielded a mean AUROC of 0.80 3 minutes before aggressive behavior onset. Furthermore, our experiments demonstrated generalizability and reliability when we assessed model performance using LIO, LSO, and test-retest reliability approaches.
We also observed that knowledge regarding recent aggressive behavior was associated with improved prediction performance. Additionally, our results illustrated the ability to discriminate between different types and intensities of aggressive behavior; however, collapsing these 3 behaviors into 1 class was associated with better detection performance and lower false-positive rates. Of 3 aggressive behaviors evaluated, SIB was the most predictable. However, this may be a power issue given that SIB episodes were 1.93 and 6.43 times more frequent in our sample than ED and ATO episodes, respectively.
We also noticed a general decrease in performance when we extracted augmented features based on aggressive behavior offsets and removed feature vectors that fell within an aggressive behavior period, suggesting that ground truth behavior labels and reinforcement learning may be promising areas to explore further in this domain. Using biosensor acceleration as a proxy for intensity, we found that mean high-intensity aggressive behaviors were more predictable than mean mid- and low-intensity episodes. Finally, our results suggest that domain adaptation may be associated with improved individualized PM prediction performance.

Limitations
Our study has several limitations, including restricted participant demography and geography and nonuniform frequency and duration of observed aggressive behavior across participants. More extensive trials with a more diverse participant population, including individuals in the outpatient setting, will be required to evaluate the generalizability of our results and to establish further the relative association of the amount of training data with prediction performance.
In future work, we will explore more advanced machine learning methods to improve prediction performance. Although NNs and SVMs are popular alternatives to linear models, our practical experience with these classifiers suggests high probabilities of overfitting or results that do not outperform simpler methods, such as LR. However, a promising addition may be person-dependent, nonhomogeneous point-process priors that enable longer prediction windows into the future. Modeling different physiological response profiles and estimating their differential likelihood of future aggressive behavior using Markov models may also be a fruitful future direction.

Conclusion
This prognostic study sought to define an ecologically valid approach for identifying objective indicators of impending aggressive behaviors in youths with autism who were psychiatric inpatients.
8][29] Our findings may lay the groundwork for developing just-in-time adaptive

Figure 1. Mean Area Under the Receiver Operating Characteristic Curves (AUROCs) Across Time Parameter Values

Table 1. Sample Characteristics (continued)
a Module 1 is for individuals who do not possess consistent verbal communication skills. Module 2 is for individuals who have few communication skills. Module 3 is for individuals who are verbally fluent and capable of playing with age-appropriate toys and games. Module 4 is for individuals who are verbally fluent and seem to play with toys appropriate for individuals older than their age. Each ADOS-2 module is scored as a standardized metric of severity ranging from 1 to 10 (with 1 indicating no autism spectrum disorder features and 10 indicating severe autism spectrum disorder symptoms).
b The Leiter-3 is a standardized IQ battery scored with a normative mean of 100 and SD of 15.
c The Child Behavior Checklist is a 113-item parent-report measure of children's psychiatric and behavioral functioning. It provides composite scale t scores for internalizing and externalizing problems. Higher scores indicate more problems in these areas; t scores have a mean of 50 and an SD of 10.
d The Vineland Adaptive Behavior Scales, Second Edition is a standardized measure of adaptive functioning for individuals of any age. A primary parent or caregiver with knowledge of the child's everyday routines and skills was asked to complete the parent or caregiver rating form. The Adaptive Behavior Composite score combines results from communication, daily living skills, socialization, and motor skills domains to provide an overall score of the child's functioning level. Lower scores indicate greater impairment in adaptive functioning; v scale scores have a mean of 15 and an SD of 3.
e The Aberrant Behavior Checklist is an informant rating instrument empirically derived by principal component analysis. It contains 58 items that resolve onto 5 subscales. Subscales and respective number of items include irritability (15 items), lethargy and social withdrawal (16 items), stereotypic behavior (7 items), hyperactivity and behavioral noncompliance (16 items), and inappropriate speech (4 items).

Table 2. Participant-Level Descriptive Statistics

performance across PM experiments. They were often better than SVM for scenarios with FVs and AFV-offset. The performance of PM with LIO was similar to that of PM with SS. For instance, for a τ_p of 180 seconds, a τ_f of 60 seconds, and AFVs-offset, we obtained similar AUROCs

Figure 1A depicts the mean AUROC across all τ_p and τ_f for experiments 1 to 4. SVMs achieved superior performance for AFVs-onset. LR and NN performed similarly, with an advantage for LR in PDM with LSO. However, all methods performed poorly for PDM with LSO and better for PDM with SS. See eFigure 2 in Supplement 1 for AUROC curves for the best-performing NN, LR, and SVM for PM with SS and a τ_f of 180 seconds. Estimating more exact compromises between true and false rates requires joint analysis with clinicians and caregivers and is beyond the scope of this study.

Operational Definitions of Target Aggressive Behaviors
eMethods 4. Time-Series Feature Extraction
eMethods 5. Data Preprocessing
eMethods 6. Logistic Regression Classifier Model
eMethods 7. Support Vector Machine Classifier Model
eMethods 8. Neural Network Classifier Model
eMethods 9. Classification Strategies
eMethods 10. Domain Adaptation of Population Models
eMethods 11. Training-Testing Data-Splitting Methods
eMethods 12. Experiments and Performance Assessments
eMethods 13. Domain Adaptation Train and Test Methods
eFigure 1. Aggressive Behavior Episode Timeline
eMethods 14. Aggressive Behavior Intensity Analysis Methods
eMethods 15. Additional Study Participant Demographics
eResults 1. Classification Results for Experiments 1-7
eFigure 2. Best-Performing Classifiers for Population Model With Session Splits Using Augmented Feature Vectors-Onset for 180-s Times
eFigure 3. Best Overall Performing Classifier Across Sessions
eResults 2. Results for Experiment 8 (Domain Adaptation)
eReferences.