Development and Validation of Computerized Adaptive Assessment Tools for the Measurement of Posttraumatic Stress Disorder Among US Military Veterans

Key Points
Question: Can rapid, psychometrically sound adaptive diagnostic screening and dimensional severity measures be developed for posttraumatic stress disorder?
Findings: In this diagnostic study including 713 US military veterans, the Computerized Adaptive Diagnostic–Posttraumatic Stress Disorder measure was shown to have excellent diagnostic accuracy. The Computerized Adaptive Test–Posttraumatic Stress Disorder also provided valid severity ratings and demonstrated convergent validity with the Post-Traumatic Stress Disorder checklist for Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition.
Meaning: In this study, the Computerized Adaptive Diagnostic–Posttraumatic Stress Disorder and Computerized Adaptive Test–Posttraumatic Stress Disorder measures appeared to provide valid screening diagnoses and severity scores, with substantial reductions in patient and clinician burden.


Introduction
Posttraumatic stress disorder (PTSD) in US military veterans is recognized as one of the signature injuries of the conflicts in Iraq and Afghanistan. Fulton et al 1 conducted a meta-analysis of 33 studies published between 2007 and 2013 and estimated PTSD prevalence among Operations Enduring Freedom and Iraqi Freedom veterans at 23%. Disease burden associated with PTSD is also notable among veterans from previous conflicts. Magruder and colleagues 2 estimated the temporal course of PTSD among Vietnam veterans and identified 5 mutually exclusive groups (ie, no PTSD, early recovery, late recovery, late onset, and chronic). Based on these findings, the authors suggested that PTSD remains "a prominent issue" for many who served. 2(p2) Among adults in the US without a history of military service, lifetime incidence of PTSD is estimated at 6.8%, 3 with women being twice as likely as men to be diagnosed with the condition. 3,4 Provision of evidence-based treatment for those with PTSD is contingent on accurate identification. Traditionally, this identification has required the use of measures developed using classical test theory (ie, summing responses to a fixed set of items). 5 Limitations of classical test theory are amplified when measuring complex conditions, such as PTSD. 5 Diagnostically, criterion A events of PTSD include "exposure to actual or threatened death, serious injury, or sexual violence." 6(p271) Such exposure can be secondary to directly experiencing, witnessing, learning that the event occurred to a close family member or friend, and/or experiencing repeated or extreme exposure to aversive details regarding 1 or more traumatic events. Symptom-based criteria include intrusive symptoms (eg, distressing memories of the events), avoidance of stimuli (eg, people and/or places that remind the affected person of the events), and negative alterations in cognitions and mood associated with the events (eg, feeling detached from others). 6 As would be expected based on the above-stated criteria, individuals with PTSD experience a wide range of symptoms with varying severity.
Using latent profile analysis, Jongedijk et al 7 identified 3 classes of individuals among Dutch veterans with PTSD, including average, severe, and highly severe symptom severity classes. Among trauma-exposed, inner-city primary care patients, Rahman et al 8 examined data to assess associations between PTSD subclasses and major depressive disorder.
The investigators identified 4 subclasses, including high severity and comorbidity, moderate severity, low PTSD and high depression, and resilient. These findings highlight the need to identify strategies capable of measuring complex traits.
One alternative to administering traditional assessment measures is computerized adaptive testing (CAT), in which a person's initial item responses are used to determine a provisional estimate of their standing on the measured trait, which is then used for the selection of subsequent items, 9 thereby increasing the precision of measurement and accuracy of diagnostic screening and minimizing clinician and patient burden. 10 For complex disorders, such as PTSD, in which items are selected from distinct yet related subdomains (eg, exposure, negative alteration in mood and/or cognition, alteration in arousal and/or reactivity, avoidance, and intrusion), selection of items is based on multidimensional rather than unidimensional item response theory (IRT). 11 Adaptive diagnosis and measurement are fundamentally different. In measurement (ie, CAT), the objective is to move the items to the severity level of the patient. In computerized adaptive diagnosis (CAD), we move the items to the tipping point between a positive and negative diagnosis. 12 Criterion measures (eg, the Structured Clinical Interview for DSM-5 [SCID], 13 the Clinician-Administered PTSD Scale for DSM-5 [CAPS-5], 14 and the PTSD Checklist for DSM-5 [PCL-5] 15 ) can be used to validate the CAT, but these tools are not used to derive a CAT. By contrast, CAD is based on machine-learning models for supervised learning (eg, random forest). We can use the same set of symptom items as the CAT to derive a CAD, but here we need an external criterion, such as the CAPS-5, to train the machine-learning model. CAD adaptively derives a binary screening diagnosis with an associated level of confidence, and CAT derives a dimensional severity measure that can be used to assess the severity of the underlying disorder and change in severity over time. CAD and CAT are complementary but are fundamentally different in theory and application. To do large-scale screening and measurement of PTSD, both measures are needed.
Evidence for other mental health conditions (ie, depression, 16 anxiety, 17 mania/hypomania, 18 psychosis, 19 suicide risk, 20 and substance use disorders 21 ) indicates that one can create large item banks (hundreds of items for a given disorder), from which a small optimal subset of items can be adaptively administered for a given individual with no or minimal loss of information, yielding a substantial reduction in patient and clinician burden while maintaining high sensitivity and specificity for diagnostic categorization, as well as high correlation with extant self- and clinician-rated symptom severity standard measures. For CAD, see Gibbons et al. 12 Eisen et al 24 developed a CAT for PTSD using multidimensional IRT. Initially, the investigators conducted a systematic review of PTSD instruments to identify items representing each of the 3 symptom clusters (reexperiencing, avoidance, and hypervigilance), as well as 3 additional subdomains (depersonalization, guilt, and sexual problems). A 104-item bank was constructed. Eighty-nine of these items were retained to further develop and validate a computerized test for PTSD (P-CAT). Although the DSM-5 was not completed at that time, the authors indicated that they included items related to domains that they expected to be included. Similarly, because DSM-5 measures were not yet developed, validation measures (eg, civilian version of the PTSD Checklist) 25 were based on DSM-IV criteria. Moreover, to "minimize burden and distress for participants," 24(p118) the SCID PTSD module 26 rather than the Clinician-Administered PTSD Scale (CAPS) 27 was administered. Work by Weathers et al 28 suggests that the CAPS is the most valid measure of PTSD relative to other clinical interviews or self-report measures. According to Eisen et al, 24 although concurrent validity was supported by high correlations, sensitivity and specificity were variable and the P-CAT was found to be less reliable among those with "low levels of PTSD." 24(p1120)
Although there are similarities between the CAT-PTSD and the P-CAT in terms of the underlying method, there are important differences as well. First, unlike the CAT-PTSD, which varies in length and has fixed precision of measurement, the P-CAT is fixed in length and allows the precision of measurement to vary. This difference has implications for longitudinal assessments in which constant precision of measurement is important and is assumed in most statistical models for the analysis of longitudinal data. 29 Second, the P-CAT item bank was limited to 89 items, whereas our item bank has 211 items. As such, these new methods provide better coverage of the entire PTSD continuum and have more exchangeable items at any point on that continuum. Third, we have developed both a CAT for the measurement of severity and a CAD for diagnostic screening.
Diagnostic screening based on a CAD generally outperforms thresholding a continuous CAT-based measure, using fewer items. 12 The limitation of CAD is that it does not provide a quantitative determination, a gap that is filled by the CAT-PTSD. In combination, however, CAT and CAD can be used for both screening and measurement.
Based on DSM-5 criteria, this study aimed to develop and test the psychometric properties of the CAD-PTSD (diagnostic screener) and the CAT-PTSD (dimensional severity measure) against the standard criterion measure (CAPS-5), 14 as well as the PCL-5. 15

Measure Development
We developed the CAD-PTSD and CAT-PTSD scales using the general method introduced by Gibbons and colleagues. 16 First, a large item bank containing 211 PTSD symptom items was developed to create both the CAD-PTSD and CAT-PTSD measures, using separate analyses.
The CAT-PTSD measure was developed by first calibrating the item bank using a multidimensional IRT model (the bifactor model 30 ) and then simulating CAT from the complete item response patterns (211 items) to select optimal CAT tuning parameters from 1200 different simulations. Next, the CAT-PTSD scale was validated against an extant PTSD scale, the PCL-5 (convergent validity) and the CAPS-5 (diagnostic discriminant validity). For CAD, we used an extremely randomized trees algorithm 31 to develop a classifier for the CAPS-5 PTSD diagnosis based on adaptive administration of no more than 6 items from the bank. 12 Classification accuracy was assessed using data not used to calibrate the model.
Most applications of IRT are based on unidimensional models that assume that all of the association between the items is explained by a single primary latent dimension or factor (eg, mathematical ability). However, mental health constructs are inherently multidimensional; for example, in the area of depression, items may be sampled from the mood, cognition, behavior, and somatic subdomains, which produce residual associations between items within the subdomains that are not accounted for by the primary dimension. If we attempt to fit such data to a traditional unidimensional IRT model, we will typically have to discard most candidate items to achieve a reasonable fit of the model to the data. Bock and Aitkin 32 developed the first multidimensional IRT model, where each item can load on each subdomain that the test is designed to measure. This model is a form of exploratory item factor analysis and can accommodate the complexity of mental health constructs such as PTSD. In some cases, however, the multidimensionality is produced by the sampling of items from unique subdomains (eg, negative alterations in mood and/or cognition, avoidance, and intrusion). In such cases, the bifactor model, originally developed by Gibbons and Hedeker, 30 is applicable.
Once the entire bank (ie, 211 PTSD items) is calibrated, we have estimates of each item's associated severity and we can adaptively match the severity of the items to the severity of the person. We do not know the severity of the person in advance of testing, but we learn it as we adaptively administer items. Beginning with an item in the middle of the severity distribution, we administer the item, obtain a categorical response, estimate the person's severity level and the uncertainty in that estimate, and select the next maximally informative item. 16 This process continues until the uncertainty falls below a predefined threshold, in our case, 5 points on a 100-point scale. The CAT has several tuning parameters 16 that are selected by simulation against the total bank score.
The tuning parameters include the level of uncertainty at which we stop the adaptive test, a second stopping rule based on available information remaining in the item bank at the current level of severity, and an additional random component that selects the maximally informative item or the second maximally informative item to increase variety in the items administered. We select the next item based on item information. Item information describes the information contained in a given item for a specific severity estimate. Our goal is to administer the item with maximum item information at each step in the adaptive process.
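The adaptive loop described above can be illustrated with a minimal sketch. It is not the bifactor graded-response model used for the CAT-PTSD: for brevity it assumes a unidimensional 2-parameter logistic model with binary responses, a standard normal prior, and a grid-based posterior mean. The function names, the stopping value `se_stop`, and the item parameters are hypothetical, and the second stopping rule and the random selection component are omitted.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item at severity theta (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def select_item(theta, bank, administered):
    """Index of the not-yet-administered item with maximum information at theta."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))

GRID = [g / 10.0 for g in range(-40, 41)]  # severity grid from -4 to 4

def posterior_mean_sd(responses, bank):
    """Posterior mean and SD of severity under a standard normal prior."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)
        for idx, y in responses:
            p = p_endorse(theta, *bank[idx])
            w *= p if y == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_stop=0.4, max_items=20):
    """Administer items until the posterior SD drops to se_stop or items run out."""
    theta, sd = 0.0, 1.0          # start in the middle of the severity distribution
    responses, administered = [], set()
    while sd > se_stop and len(administered) < min(max_items, len(bank)):
        idx = select_item(theta, bank, administered)
        administered.add(idx)
        responses.append((idx, answer(idx)))   # answer() returns 0 or 1
        theta, sd = posterior_mean_sd(responses, bank)
    return theta, sd, [i for i, _ in responses]

# Hypothetical bank: (discrimination a, severity b) pairs, b from -2.5 to 2.5.
bank = [(1.5, b / 2.0) for b in range(-5, 6)]
theta, sd, items = run_cat(bank, answer=lambda idx: 1)
```

Under this simplified model, an item is most informative when its severity parameter is close to the current severity estimate, which is why the items administered track the respondent's provisional score.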
Unlike a CAT, which is criterion-free, a CAD uses the diagnostic information (ie, external criterion) to derive a classifier based on a subset of the symptoms in the item bank that maximize the association between the items and the diagnosis. A CAD is used for diagnostic screening, whereas a CAT is used for symptom severity measurement. The CAD approach is described by Gibbons et al. 12
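To make the idea of adaptive diagnostic screening concrete, the sketch below walks a small ensemble of decision trees, asking an item only when a tree first needs it and averaging the leaf probabilities. It illustrates adaptive administration only, not classifier training: the trees, item names, cut points, and leaf probabilities are hand-written and hypothetical, whereas a real CAD would learn them from item responses and a criterion diagnosis such as the CAPS-5.

```python
def ask_tree(node, get_response, asked):
    """Walk one tree; each internal node administers a single item."""
    while "leaf" not in node:
        item = node["item"]
        if item not in asked:
            asked[item] = get_response(item)   # administer only when first needed
        node = node["high"] if asked[item] >= node["cut"] else node["low"]
    return node["leaf"]

def screen(trees, get_response, threshold=0.5):
    """Average leaf probabilities over trees and return (positive, p, items asked)."""
    asked = {}
    probs = [ask_tree(tree, get_response, asked) for tree in trees]
    p = sum(probs) / len(probs)
    return p >= threshold, p, list(asked)

# Hypothetical trees; responses are coded 0 (lowest) to 4 (highest).
TREES = [
    {"item": "nightmares", "cut": 3,
     "low": {"leaf": 0.05},
     "high": {"item": "hypervigilance", "cut": 3,
              "low": {"leaf": 0.40}, "high": {"leaf": 0.90}}},
    {"item": "loss_of_interest", "cut": 2,
     "low": {"item": "nightmares", "cut": 3,
             "low": {"leaf": 0.10}, "high": {"leaf": 0.50}},
     "high": {"leaf": 0.80}},
]

high = screen(TREES, lambda item: 4)   # respondent endorsing every item strongly
low = screen(TREES, lambda item: 0)    # respondent endorsing nothing
```

In this toy example, the respondent who endorses nothing is asked fewer items than the respondent who endorses everything, because the trees reach a leaf sooner.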

Measures
We developed an item bank containing 211 PTSD items drawn from 16 existing self-report and clinician-administered PTSD scales (eTable in the Supplement) and newly created items. Existing items were reworded to make them appropriate for adaptive administration, self-report, and user-selectable time frames. Items were drawn from 5 subdomains: exposure (5 items), negative alterations in mood/cognition (58 items), alterations in arousal/reactivity (79 items), avoidance (18 items), and intrusion (51 items). Items were rated on 5-point (not at all, a little bit, moderately, quite a bit, and very much) or 4-point (never, rarely, sometimes, and often) Likert scales.
The trauma/PTSD L Module of the SCID 13 was used to assess criterion A events and the presence of symptoms. If a criterion A event and at least 1 current symptom were endorsed, the CAPS-5 was administered. 14 The CAPS-5 is the standard for assessing PTSD diagnosis. 28 Non-PTSD modules of the SCID 13 were administered to obtain information regarding current mental health conditions. The PCL-5 15 was used to determine self-reported PTSD symptom severity.

Statistical Analysis
The bifactor IRT models were fitted with the POLYBIF program. Improvement in fit of the bifactor model over a unidimensional alternative was determined using a likelihood ratio χ² statistic. The extra-trees classification algorithm was fitted using the Scikit-learn Python library. Logistic regression was used to estimate diagnostic discrimination capacity for the CAT-PTSD and area under the curve (AUC) for the receiver operating characteristic curve with 10-fold cross-validation using Stata, version 16 (StataCorp LLC). The Pearson r correlation coefficient test was used to assess the association between the CAT-PTSD score and the PCL-5 score. Using 2-sided testing, findings were considered significant at P < .05.
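As a rough illustration of the accuracy and association statistics named above, the following sketch computes an AUC, sensitivity and specificity at a cutoff, and a Pearson r in plain Python. It omits the logistic regression and 10-fold cross-validation steps and does not reproduce the Stata analysis; the scores, labels, and cutoff are invented for the example.

```python
import math

def auc(scores, labels):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(scores, labels, cutoff):
    """Sensitivity and specificity when scores at or above the cutoff screen positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented data: severity scores on a 0-100 scale and criterion diagnoses (1 = PTSD).
scores = [10, 20, 35, 40, 60, 80]
labels = [0, 0, 1, 0, 1, 1]
area = auc(scores, labels)
sensitivity, specificity = sens_spec(scores, labels, cutoff=35)
```

Lowering the cutoff raises sensitivity at the cost of specificity, which is the trade-off involved when severity thresholds are chosen against a criterion diagnosis.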

Participants
In
To aid in patient triage, severity thresholds were selected based on sensitivity and specificity for the CAPS-5 diagnosis of PTSD. Scores on the CAT-PTSD can range from 0 to 100 and map on to PTSD severity categories. Categories of none, mild, moderate, and severe were selected.
In Table 3, example CAT-PTSD interviews for patients with low, moderate, and high PTSD severity are presented. The testing sessions resulted in classifications of no evidence of PTSD (requires 12 items), possible PTSD (requires 9 items), and PTSD definite or highly likely (requires 11 items). In Table 4, an example CAD-PTSD interview is presented.
Future directions include the need for additional field testing, which would also allow for evaluation of the acceptability and feasibility of implementing these tools in clinical settings, including via telehealth, which has been increasingly implemented as a result of the COVID-19 pandemic. Use of telehealth assessment will in part be facilitated by designing a graphical user interface 45 in a cloud computing environment for routine test administration on internet-capable devices, such as smartphones, tablets, notebooks, and computers, and providing an application programming interface that can be interfaced with the electronic health record. To accommodate literacy issues, audio for the self-report questions can be enabled. Because the generation and testing of subdomain scores is beyond the scope of this study, future research in this area is warranted.

JAMA Network Open | Psychiatry

Limitations
This study has limitations. The CAD-PTSD and CAT-PTSD do not allow for evaluation and monitoring of specific symptoms because the corresponding items may not always be adaptively administered. However, items from the 5 subdomains are available from most interviews and can be used to assess specific subdomains of PTSD (eg, avoidance). In addition, this study was conducted exclusively in English.
Independent replication of our findings in other patient populations and in other languages (eg, Spanish) is needed. 46

Item: How much were you bothered by repeated, disturbing dreams of a stressful experience from the past?
Response: Very much
Item: How much did feelings of being "super alert," on guard, or constantly on the lookout for danger occur or become worse after a stressful event or experience in the past?
Response: Very much
Item: Have you markedly lost interest in free-time activities that used to be important to you?
Response: Often
Item: How much did having a very negative emotional state occur or become worse after having a stressful event or experience?
Response: Very much
Item: Someone touched me in a sexual way against my will
Response: Often
Diagnosis: positive
Probability of having PTSD: P = .81
Abbreviations: CAD, computerized adaptive diagnostic; PTSD, posttraumatic stress disorder.

Role of the Funder/Sponsor: The funding organizations had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication. The VA Office of Mental Health and Suicide Prevention did not influence the decision to submit the manuscript for publication.

Disclaimer: The views, opinions, and/or findings contained in this article are those of the authors and should not be construed as an official Department of Veterans Affairs position, policy, or decision unless so designated by other documentation.
Additional Information: The POLYBIF program used is freely available at http://www.healthstats.org.