Y represents a binary response (0 = no, 1 = yes) to the question, “Has patient i engaged with the platform at time t?” where t takes values from 1 to 14. K is a multinomial latent variable per patient, with the number of classes set between 2 and 10; π is the starting probability; a_br is the class (K)–dependent probability of transition from state b to state r; and q_s and q'_s are the emission probabilities of observing engagement with a section (s) conditioned on the hidden state, where q_s applies when the hidden state takes a value of 1 and q'_s applies when the hidden state takes a value of 0. When sections data are used, emission probabilities are considered distinct for each section.
Panel A shows the proportion of users who are engaged for each class at each time point over a 14-week time period. Panels B-F show the proportion of users who use each of the program sections.
Interactions based on core modules completed (A), mean time spent per week on the program (B), and mean number of sessions per week (sum of session duration) (C). Error bars indicate 95% CIs.
eMethods. Detailed Methods
eTable 1. List of Tools Available on the SilverCloud Health iCBT Platform
eTable 2. Description of Sections Users Engaged With Over a 14 Week Period on the SilverCloud Platform
eTable 3. Estimated Class-Specific Change in PHQ-9 Over Time
Chien I, Enrique A, Palacios J, et al. A Machine Learning Approach to Understanding Patterns of Engagement With Internet-Delivered Mental Health Interventions. JAMA Netw Open. 2020;3(7):e2010791. doi:10.1001/jamanetworkopen.2020.10791
Can machine learning techniques be used to identify heterogeneity in patient engagement with internet-based cognitive behavioral therapy for symptoms of depression and anxiety?
In this cohort study using data from 54 604 individuals, 5 heterogeneous subtypes were identified based on patient engagement with the online intervention. These subtypes were associated with different patterns of patient behavior and different levels of improvement in symptoms of depression and anxiety.
The findings of this study suggest that patterns of patient behavior may elucidate different modalities of engagement, which can help to conduct better triage for patients to provide personalized therapeutic activities, helping to improve outcomes and reduce the overall burden of mental health disorders.
The mechanisms by which engagement with internet-delivered psychological interventions are associated with depression and anxiety symptoms are unclear.
To identify behavior types based on how people engage with an internet-based cognitive behavioral therapy (iCBT) intervention for symptoms of depression and anxiety.
Design, Setting, and Participants
Deidentified data on 54 604 adult patients assigned to the Space From Depression and Anxiety treatment program from January 31, 2015, to March 31, 2019, were obtained for probabilistic latent variable modeling using machine learning techniques to infer distinct patient subtypes, based on longitudinal heterogeneity of engagement patterns with iCBT.
A clinician-supported iCBT-based program that follows clinical guidelines for treating depression and anxiety, delivered on a web 2.0 platform.
Main Outcomes and Measures
Log data from user interactions with the iCBT program to inform engagement patterns over time. Clinical outcomes included symptoms of depression (Patient Health Questionnaire-9 [PHQ-9]) and anxiety (Generalized Anxiety Disorder-7 [GAD-7]); PHQ-9 cut point greater than or equal to 10 and GAD-7 scores greater than or equal to 8 were used to define depression and anxiety.
Patients spent a mean (SD) of 111.33 (118.92) minutes on the platform and completed 230.60 (241.21) tools. At baseline, mean PHQ-9 score was 12.96 (5.81) and GAD-7 score was 11.85 (5.14). Five subtypes of engagement were identified based on patient interaction with different program sections over 14 weeks: class 1 (low engagers, 19 930 [36.5%]), class 2 (late engagers, 11 674 [21.4%]), class 3 (high engagers with rapid disengagement, 13 936 [25.5%]), class 4 (high engagers with moderate decrease, 3258 [6.0%]), and class 5 (highest engagers, 5799 [10.6%]). Estimated mean decrease (SE) in PHQ-9 score was 6.65 (0.14) for class 3, 5.88 (0.14) for class 5, and 5.39 (0.14) for class 4; class 2 had the lowest rate of decrease at −4.41 (0.13). Compared with PHQ-9 score decrease in class 1, the Cohen d effect size (SE) was −0.46 (0.014) for class 2, −0.46 (0.014) for class 3, −0.61 (0.021) for class 4, and −0.73 (0.018) for class 5. Similar patterns were found across groups for GAD-7.
Conclusions and Relevance
The findings of this study may facilitate tailoring interventions according to specific subtypes of engagement for individuals with depression and anxiety. Informing clinical decision needs of supporters may be a route to successful adoption of machine learning insights, thus improving clinical outcomes overall.
The World Health Organization defines health as a state of complete physical, mental, and social well-being and not merely the absence of disease or infirmity.1 Mental disorders present a substantial burden for good health as they have deleterious effects on the individual, society, and the worldwide economy,2-4 making their prevention and treatment a public health priority.5-7
Responding to the demand for accessible and sustainable mental health care services, internet-delivered psychological interventions offer access to evidence-based treatment and positive clinical outcomes while maintaining quality of care and reducing costs.8,9 Extensive research has reported possible effectiveness of these interventions for treating psychological disorders.9-13 However, more complete understanding of the clinical use of digital therapy programs requires further research.14-16 Most previous studies explored the association between use of the interventions and outcomes, relying on single metrics, such as raw use counts.17,18 Other studies suggest that single metrics are unlikely to sufficiently capture associations between engagement and outcomes, especially when compared with other factors, such as the actual level of attention or interactivity during an intervention.19,20 Thus, identifying different behavioral patterns of engagement and linking these patterns to clinical outcomes offer new opportunities for personalizing treatment delivery to reduce nonadherence to therapy and enhance possible effectiveness.20,21
The aim of this study was to examine whether different types of patient behaviors manifest in the way people engage with an internet-based cognitive behavioral therapy (iCBT) intervention for symptoms of depression and anxiety. We used machine learning to build a probabilistic graphical modeling framework to understand longitudinal patterns of engagement with iCBT.22-24 We hypothesized that these patterns would allow us to infer distinct, heterogeneous patient behavior subtypes. We further hypothesized that these subtypes are associated with the intervention’s success of improving mental health and that different subtypes of engagement are associated with differences in clinical outcomes.
We used clinical measures and behavioral engagement data from SilverCloud Health. SilverCloud Health is an evidence-based, online, self-administered platform that delivers iCBT alongside feedback from trained human supporters.25,26 We used deidentified data from 67 468 patients on the Space From Depression and Anxiety treatment program between January 31, 2015, and March 31, 2019. We removed 12 864 individuals who had no supporter assigned and restricted analysis to the remaining 54 604 patients who viewed the program content at least once. The program consists of 8 core modules covering the CBT principles for treating symptoms of depression and anxiety. Content is delivered using textual and audiovisual materials, interactive tools, and personal stories. The platform includes several interactive tools, such as journal, quizzes, mood trackers, and other CBT-based exercises. Human supporters provide guidance to patients in the first 8 weeks of treatment. Further details of the platform and tools are available in the eMethods and eTable 1 in the Supplement. Data analysis was carried out between April 1 and October 31, 2019. All users provided written or oral consent for their anonymized data to be used in routine evaluations for service monitoring and improvement. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline for cohort studies. Per the Common Rule, institutional review board review was not required for this study, which used deidentified publicly available data.
We assessed symptoms of depression using the Patient Health Questionnaire-9 (PHQ-9) and symptoms of anxiety using the Generalized Anxiety Disorder-7 (GAD-7). To define depression and anxiety cases, we used cut points of greater than or equal to 10 on the PHQ-9 score27 and greater than or equal to 8 on the GAD-7 score.28 These measures were collected at baseline and during routine outcome monitoring (biweekly) up to 14 weeks. A decrease of greater than or equal to 6 points on PHQ-9 and greater than or equal to 4 points on GAD-7 represents clinical improvement. The rate of reliable improvement was calculated based on the Reliable Change Index (RCI) using Jacobson and Truax criteria.29
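The thresholds above, together with the standard Jacobson and Truax formulation of the RCI (change divided by the standard error of the difference), can be sketched as follows. This is an illustrative sketch rather than the study's code: the function names are hypothetical, and the test-retest reliability passed in by the caller is an assumed input that the article does not report.

```python
import math

# Thresholds stated in the text.
PHQ9_CASE_CUTOFF = 10   # PHQ-9 >= 10 defines a depression case
GAD7_CASE_CUTOFF = 8    # GAD-7 >= 8 defines an anxiety case
PHQ9_IMPROVEMENT = 6    # decrease of >= 6 points counts as clinical improvement
GAD7_IMPROVEMENT = 4    # decrease of >= 4 points counts as clinical improvement


def is_case(phq9, gad7):
    """Return (depression_case, anxiety_case) flags for one assessment."""
    return phq9 >= PHQ9_CASE_CUTOFF, gad7 >= GAD7_CASE_CUTOFF


def clinically_improved(baseline, follow_up, min_decrease):
    """True when the score dropped by at least min_decrease points."""
    return (baseline - follow_up) >= min_decrease


def reliable_change_index(baseline, follow_up, sd_baseline, reliability):
    """Jacobson-Truax RCI: observed change over the SE of the difference.

    reliability is the measure's test-retest reliability (assumed input).
    """
    se_measurement = sd_baseline * math.sqrt(1.0 - reliability)
    se_difference = math.sqrt(2.0) * se_measurement
    return (follow_up - baseline) / se_difference


def reliably_improved(baseline, follow_up, sd_baseline, reliability, z=1.96):
    """True when the decrease exceeds what measurement error alone would explain."""
    return reliable_change_index(baseline, follow_up, sd_baseline, reliability) <= -z
```

For example, `reliably_improved(16, 9, sd_baseline=5.81, reliability=0.84)` uses the baseline PHQ-9 SD reported in this article with a purely hypothetical reliability of 0.84.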
We defined 2 types of engagement: whether a patient used the program in a given week (yes/no) and whether a patient used a particular section of the program in a given week. There are 14 sections in total as described in eTable 2 in the Supplement.
Comparing these 2 modalities of engagement allowed us to meaningfully operationalize a measure of engagement: Is it more important that patients engage with the program, or does engagement with specific treatment sections also matter? Distinguishing the iCBT components that patients engage with may allow us to better motivate engagement with therapy based on increased understanding of their behavior.
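The 2 engagement measures defined above can be illustrated with a minimal encoding sketch. It assumes, hypothetically, that platform logs have already been reduced to (week, section) pairs per patient, and that any section use in a week counts as program use in that week; the actual log format and preprocessing are not described in this article.

```python
N_WEEKS = 14      # observation window stated in the text
N_SECTIONS = 14   # number of program sections stated in the text


def encode_engagement(events, n_weeks=N_WEEKS, n_sections=N_SECTIONS):
    """Turn one patient's (week, section) events into both engagement measures.

    events: iterable of (week, section) pairs, with week in 1..n_weeks and
            section in 0..n_sections-1 (hypothetical input format).
    Returns (any_use, by_section):
      any_use[t]       = 1 if the patient used the program in week t+1
      by_section[t][s] = 1 if the patient used section s in week t+1
    """
    by_section = [[0] * n_sections for _ in range(n_weeks)]
    for week, section in events:
        # Ignore events outside the observation window or with unknown sections.
        if 1 <= week <= n_weeks and 0 <= section < n_sections:
            by_section[week - 1][section] = 1
    any_use = [int(any(row)) for row in by_section]
    return any_use, by_section
```

The binary vector corresponds to the first measure (weekly use, yes/no) and the week-by-section matrix to the second (weekly use of each section).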
We developed a probabilistic latent variable model to infer distinct subtypes of patients based on their interaction with the depression and anxiety program over 14 weeks, which was the median time spent with the treatment.
We assumed n latent classes, which represent these subtypes/patterns of engagement. The size and number of these latent classes were unknown a priori, but we learned the optimal number of latent classes (between 2 and 10) by comparing model loss functions. We explored 2 probabilistic model formulations to infer the number of latent subtypes that best encapsulate longitudinal heterogeneity of patterns of engagement: a mixture of hidden Markov models (Figure 1; eMethods in the Supplement) and a latent variable mixed model. For the hidden Markov model, we assumed that weekly observed engagement is encoded by a latent state while the conditional dependence structure on the longitudinal sequence of states is governed by an overarching discrete latent state that represents prototypical patterns of engagement over time (Figure 1). The latent variable mixed model assumes that behavior within a latent class depends on a time parameter rather than on dynamic transitions between states.
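To make the mixture of hidden Markov models more concrete, the sketch below computes the likelihood of one patient's weekly engagement sequence with the forward algorithm and then the posterior probability of each latent class. It is a simplified reading of Figure 1, not the authors' implementation: it assumes a binary hidden state per week, class-dependent transition probabilities, and starting and emission probabilities shared across classes, and all function and parameter names are hypothetical. Fitting the parameters (for example, by expectation-maximization) is not shown.

```python
def sequence_likelihood(obs, start, trans, emit):
    """Forward algorithm for a 2-state HMM with Bernoulli emissions per section.

    obs:   list over weeks; each entry is a list of 0/1 values, one per section
    start: [P(z_1 = 0), P(z_1 = 1)]
    trans: trans[b][r] = P(z_t = r | z_{t-1} = b)
    emit:  emit[z][s] = P(y_ts = 1 | z_t = z)
    """
    def emission(z, y_week):
        # Sections are treated as independent given the hidden state.
        p = 1.0
        for s, y in enumerate(y_week):
            p *= emit[z][s] if y else 1.0 - emit[z][s]
        return p

    alpha = [start[z] * emission(z, obs[0]) for z in (0, 1)]
    for y_week in obs[1:]:
        alpha = [
            emission(r, y_week) * sum(alpha[b] * trans[b][r] for b in (0, 1))
            for r in (0, 1)
        ]
    return sum(alpha)


def class_posterior(obs, class_weights, start, trans_by_class, emit):
    """P(class | observed sequence) when each class has its own transitions."""
    joint = [
        weight * sequence_likelihood(obs, start, trans, emit)
        for weight, trans in zip(class_weights, trans_by_class)
    ]
    total = sum(joint)
    return [j / total for j in joint]
```

With 14 weeks and 14 sections the raw probabilities become very small, so a practical implementation would normally work in log space or rescale the forward variables at each step.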
We assessed whether there were specific iCBT sections to which different subgroups had a particular affinity and whether subgroups were associated with different profiles of PHQ-9 and GAD-7 using longitudinal linear regression techniques. Tool use was evaluated within the first 2 weeks on the program to investigate whether we could identify early predictors of subsequent engagement patterns of how people used the program. These patterns of tool use were not used to identify the engagement classes themselves, but served as an external validation to assess whether different classes based on program engagement alone could also indicate what tools people in particular subgroups were more likely to interact with. Analysis was carried out using Python, version 3.7 (Python Software Foundation), and Stata, version 15.1 (StataCorp LLC). Findings were considered significant at P ≤ .05.
We used data from a total of 54 604 patients. Patients spent a mean (SD) of 111.33 (118.92) minutes on the program, used 230.60 (241.21) tools, and had baseline scores of 12.96 (5.81) in PHQ-9 and 11.85 (5.14) in GAD-7. Across all patients, clinical scores improved over 14 weeks by a mean (SD) of 4.29 (4.90) for PHQ-9 and 4.01 (4.61) for GAD-7.
Figure 2 shows the distribution of engagement over time based on results from a hidden Markov model, which gave the best model fit. The model identified 5 subtypes of engagement (latent classes) based on patient interaction with sections over time rather than engagement as a binary measure at each week. Figure 3 illustrates characteristics of different subtypes in terms of the number of modules completed, mean time spent, and number of sessions per week. Table 1 reports the odds ratio (OR) of each class compared with all other classes using a particular tool conditioned on all other tools used in the first 2 weeks.
Class 1 (low engagers: 19 930 [36.5%]) had the lowest probability of engagement with a steady dropout over time. They spent less time on the program, used fewer sessions, and completed fewer modules compared with all other classes. This class interacted more with the review page where users can communicate with their supporter. With regard to early use, when compared with all other groups, low engagers were more likely to use sections associated with mood monitoring and worry (Worry Tree: OR, 1.26; 95% CI, 1.04-1.52; P = .02; My Worries: OR, 1.34; 95% CI, 1.12-1.61; P = .002; Anxious Thoughts and Worry Quiz: OR, 1.45; 95% CI, 1.21-1.74; P < .001; Mood Monitor: OR, 1.23; 95% CI, 1.05-1.45; P = .01; and Understanding My Situation: OR, 1.22; 95% CI, 1.04-1.42; P = .001), but less likely to do activities such as Activity Scheduling (OR, 0.57; 95% CI, 0.48-0.68; P < .001) and Activities List (OR, 0.58; 95% CI, 0.49-0.69; P < .001).
Class 2 (late engagers: 11 674 [21.4%]) had initially low engagement with a slower rate of disengagement over time. During the first 2 weeks, they were less likely to use the To-do list at the end of modules and less likely to look at content such as Hierarchy of Fears (OR, 0.56; 95% CI, 0.34-0.91; P = .02), Worry Tree (OR, 0.80; 95% CI, 0.66-0.98; P = .03), and Graded Exposure Quiz (OR, 0.58; 95% CI, 0.38-0.93; P = .02). However, late engagers were more likely to engage with sections associated with Anxiety: Myths and Facts (OR, 1.67; 95% CI, 1.30-2.16; P < .001), Staying in the Present (OR, 1.17; 95% CI, 1.00-1.36; P = .04), and Stress Response (OR, 1.40; 95% CI, 1.00-1.96; P = .04).
Class 3 (high engagers with rapid disengagement: 13 936 [25.5%]) had the sharpest rate of disengagement despite initial high engagement. Patients were more likely to engage with Take Home Points. They were associated with early higher use of tools included in the core modules Understanding Feelings, Spotting Thoughts, Challenging Thoughts, and Boosting Behaviour (eg, Thoughts, Feelings, and Behaviour cycles: OR, 1.31; 95% CI, 1.16-1.47; P < .001). However, high engagers with rapid disengagement were also the least likely to engage with the Activity Goals (OR, 0.39; 95% CI, 0.24-0.64; P < .001).
Class 4 (high engagers with moderate decrease: 3258 [6.0%]) had a constantly high probability of engaging with the program. Along with class 5, class 4 undertook significantly more sessions, spent more time on the program, and did more modules and tools per week compared with all other classes. They also engaged more with Progress Points, Profile, and Review sections. They were more likely to interact with goal-based tools and quizzes that require more introspection (What's Your Lens? What's Your Thinking Style?: OR, 1.28; 95% CI, 1.02-1.60; P = .03; Core Beliefs Quiz: OR, 1.93; 95% CI, 1.19-3.16; P = .008; and Graded Exposure Quiz: OR, 1.62; 95% CI, 1.06-2.48; P = .03), but less likely to use sections containing more reading material (Depression: Myths and Facts: OR, 0.44; 95% CI, 0.27-0.74; P = .002; and Anxiety: Myths and Facts: OR, 0.40; 95% CI, 0.24-0.66; P < .001).
Class 5 (highest engagers: 5799 [10.6%]) had the highest probability of engagement throughout the time spent on the program. They had significantly more interaction with the journal and were more likely to engage with the sections Progress Points and Take Home Points. Early use indicated ORs above 1 for most modules. The distinguishing feature of this subgroup is that they used Anxious Thoughts and Worries less (OR, 0.82; 95% CI, 0.68-0.99; P = .04), but interacted more with Sleeping Tips (OR, 1.47; 95% CI, 1.16-1.86; P = .002) and Relaxation (OR, 2.12; 95% CI, 1.57-2.86; P < .001), which are sections that require unlocking by a human supporter.
The rate of reliable improvement for class 1 was 39.5%; class 2, 54.8%; class 3, 58.0%; class 4, 58.8%; and class 5, 66.9%. We assumed that PHQ-9 scores were missing at random conditional on individual engagement class membership. To test this missingness assumption and ensure robustness, we restricted analysis to users who had completed 3 or more PHQ-9 assessments (n = 31 466). We found consistent estimates in terms of the estimated longitudinal mean improvement in symptoms of depression and anxiety within each subgroup (eTable 3 in the Supplement). Table 2 summarizes the estimated class-specific change in PHQ-9 and GAD-7 scores over a 14-week period. Patients in class 3 spent less time on the program than those in classes 4 and 5 but had significantly greater weekly change in PHQ-9 scores: the estimated mean (SE) decrease in PHQ-9 score was 6.65 (0.14) for class 3 compared with 5.88 (0.14) for class 5 and 5.39 (0.14) for class 4. Patients in class 2 had the lowest initial PHQ-9 score (mean PHQ-9, 12.43 [0.07]) and the lowest rate of improvement compared with all other groups, with an estimated mean (SE) change of −4.41 (0.13). Compared with the change in PHQ-9 score in class 1, the Cohen d effect size (SE) was statistically significant at −0.46 (0.014) for class 2, −0.46 (0.014) for class 3, −0.61 (0.021) for class 4, and −0.73 (0.018) for class 5. Similar patterns were found for GAD-7 (Table 2), with class 3 showing the biggest estimated improvement in GAD-7 score after 14 weeks (estimated mean decrease of 6.36 [0.13]), and class 2 showing the lowest improvement (estimated mean change of −4.18 [0.52]).
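As a reminder of what a Cohen d effect size expresses, the sketch below computes d from 2 group means with a pooled SD, along with a common large-sample approximation of its SE. It illustrates the general technique only; the article does not state how its effect sizes were derived from the longitudinal models, so this should not be read as the authors' calculation, and the function names are hypothetical.

```python
import math


def cohen_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference between groups a and b using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


def cohen_d_se(d, n_a, n_b):
    """Common large-sample approximation of the standard error of d."""
    return math.sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
```

A negative d in this convention means that group a had the more negative mean change, that is, a larger decrease in symptom score than the reference group b.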
In this study, we used probabilistic graphical modeling to identify heterogeneous subtypes of patient engagement with iCBT for symptoms of depression and anxiety. We identified 5 distinct subtypes of users based on program use over 14 weeks. The use patterns of these subtypes suggest that clinical outcomes obtained from interactions with treatment were not always proportional to time spent on the program. Class 4 engaged more in goal-based activities and mood tracking and accessed many core modules, whereas class 5 participants were less likely to access core modules, but used relaxation and mindfulness tools. Patients in class 3 were more likely to complete content belonging to key components of CBT (ie, cognitive restructuring and behavioral activation) within the first 2 weeks on the program. These insights may facilitate tailoring of interventions for specific subtypes of engagement. For example, we may be able to front-load specific recommendations of content associated with improved therapy engagement and clinical outcomes for patients within particular subtypes. Such patterns may elucidate different modalities of engagement that can help us to better triage patients for different therapy modules or activities.
The observed changes in PHQ-9 and GAD-7 scores suggest a dodo bird verdict, where all classes of engagement have won and therefore all show some level of clinical improvement.29 Thus, even low engagers show a level of engagement that leads to positive outcomes.14,16 However, reliable clinical improvement for the low engagers was less than noted for those who engaged more with the program and its treatment components. We observed an incremental increase in reliable improvement for patients across the different classes from class 1 (39.5%) to class 5 (66.9%). This increase suggests that offering more personalized interventions may ameliorate these observed differences in clinical improvement rates and increase the possibility of reliable improvement for all.30,31
Previous research has focused on the association between simple metrics of mental health intervention use and outcomes. Raw counts of use can serve as a first step toward exploring user behaviors within these programs, but they are not able to account for the complexity and diversity in the type of content.32-34 As a result of the limited perspective of these associations, there has been a call for more sophisticated analytical approaches to better understand how intervention use is associated with clinical outcomes.18,21 This call motivated our present study to assess whether there are groups more likely to respond to the intervention and whether these groups differ in use.14 A strength of our study is that it included what is, to our knowledge, the largest real-world patient data set of its kind based on over 3 million data points from 54 604 patients from mental health services. Our results suggest that engagement based on iCBT section use in each week gave us a better fit than using engagement as a single binary measure at each week (whether a patient used the program in a given week). Therefore, effective engagement may not be determined merely by absolute engagement with the program, but also by what particular sections or elements a patient engages with. This in turn supports the idea of active treatment ingredients,34 that is, the components of any treatment that have been empirically supported and may affect therapeutic change. It is precisely these active components that combine to create a coherent treatment.35
Future research will need to identify how these machine learning–generated insights about different types of patient engagement and outcome trajectories translate into actionable steps for personalizing the content and delivery of online therapy programs. In this regard, we suggest that the human supporters of internet-delivered interventions, whose role is to provide patients regularly with feedback, continued encouragement, and content recommendations, are well positioned to be the recipients of generated machine learning outputs, providing them with additional information about their patients’ engagement and progress with the treatment. As part of their existing practices, these supporters already review patients’ activities weekly via platform use metrics (ie, number of logins, page views, and clinical scores), as well as notes that patients enter into their journal or are sent directly to the supporters. Supporters may also benefit from machine learning insights about the behavioral patterns and predicted outcomes for their patients to make more-informed choices in how they personalize treatment recommendations and what they choose to communicate in their feedback. Simultaneously, having a human-in-the-loop and domain expert to evaluate how system-generated insights are to be interpreted foregrounds human agency and leaves the supporters in control and accountable for decisions made on the back of machine learning insights,36,37 thus presenting a responsible avenue for introducing machine learning into sensitive domains, such as mental health care. Future research at the intersections of machine learning, interaction design, and mental health requires care in identifying how additional subtype information can best fit within existing information review practices for enabling supporters to provide more personalized, effective feedback.
This study has limitations. One limitation is that machine learning cannot determine whether there are distinct associations in engagement across sex, different times of year (seasonality effects), or other sociodemographic characteristics as this information is not present in this deidentified data set. Although we are basing our discussion on a robust data set and methods, we have been conservative in our conclusions for clinical implications for iCBT service delivery. Drawing firmer conclusions would require further context and replication of the methods on distinct data sets with sociodemographic and clinical covariates to achieve a more complete understanding of the type of profile being considered and what its clinical relevance for iCBT treatment provision might be. These data may help to establish priorities for treatment delivery based on individual needs that promote positive engagement and patient behavior toward realizing improved clinical outcome, and thereby can support any underlying mechanisms that are not ordinarily or immediately observable without machine learning approaches.
In this study, we have defined a path toward triaging patients within an early stage in the intervention to possibly determine whether modifications or additions to the intervention need to take place within or outside the treatment program for maximizing therapeutic benefits. The identification of engagement subtypes may create opportunities for improved intervention strategies, with more tailored, personalized approaches to the delivery of content. By identifying subtypes of patients as opposed to individual level responses, we gain insights regarding typical groups of heterogeneous behavior. Identifying such heterogeneous subtypes may enable us to target distinct groups in a more meaningful way by providing different levels of support or additional treatment content modules.
Accepted for Publication: April 22, 2020.
Published: July 17, 2020. doi:10.1001/jamanetworkopen.2020.10791
Open Access: This is an open access article distributed under the terms of the CC-BY-NC-ND License. © 2020 Chien I et al. JAMA Network Open.
Corresponding Author: Danielle Belgrave, PhD, Microsoft Research Cambridge, 21 Station Rd, Cambridge CB1 2FB, United Kingdom (email@example.com).
Author Contributions: Mr Keegan and Dr Belgrave had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Ms Chien and Drs Enrique, Palacios, and Regan contributed equally to the study. Drs Thieme, Richards, Doherty, and Belgrave contributed equally to the study.
Concept and design: Chien, Tschiatschek, Nori, Doherty, Belgrave.
Acquisition, analysis, or interpretation of data: Chien, Enrique, Palacios, Regan, Keegan, Carter, Thieme, Richards, Doherty.
Drafting of the manuscript: Chien, Enrique, Keegan, Tschiatschek, Thieme, Richards, Doherty, Belgrave.
Critical revision of the manuscript for important intellectual content: Chien, Enrique, Palacios, Regan, Carter, Nori, Richards, Doherty, Belgrave.
Statistical analysis: Chien, Palacios, Regan, Belgrave.
Administrative, technical, or material support: Regan, Keegan, Belgrave.
Supervision: Enrique, Tschiatschek, Nori, Thieme, Richards, Doherty, Belgrave.
Conflict of Interest Disclosures: Dr Enrique, Dr Palacios, Mr Keegan, and Dr Richards are employees of SilverCloud Health. Dr Doherty is a cofounder of SilverCloud Health and has a minority shareholding in the company. No other disclosures were reported.
Additional Information: Dr Tschiatschek completed this work while employed at Microsoft Research, Cambridge, United Kingdom.