Research Letter
July 2013

Computerized Adaptive Test–Depression Inventory Not Ready for Prime Time—Reply

Author Affiliations
  • 1Center for Health Statistics, University of Chicago, Chicago, Illinois
  • 2Department of Psychology, University of Minnesota, Minneapolis, Minnesota
  • 3Western Psychiatric Institute, University of Pittsburgh, Pittsburgh, Pennsylvania
JAMA Psychiatry. 2013;70(7):763-765. doi:10.1001/jamapsychiatry.2013.1322

In Reply The Carroll letter “Computerized Adaptive Test–Depression Inventory Not Ready for Prime Time” criticizes our recent article1 published in the November 2012 issue of this journal. Carroll suggests that:

Clinicians do not need another scale to screen for depression using 7 to 22 items. Existing scales do that well with 10 to 12 items and, unlike CAT-DI [Computerized Adaptive Test–Depression Inventory], provide a symptom crosswalk to DSM-IV criteria. … No analyses showed that CAT-DI performance matches existing scales.

This statement is not correct. Indeed, we reported convergent validity with the Patient Health Questionnaire 9, Hamilton Rating Scale for Depression, and Center for Epidemiologic Studies Depression Scale with correlations in the r = 0.80 range. In terms of a “crosswalk to DSM-IV criteria,” the CAT-DI demonstrated sensitivity of 0.92 and specificity of 0.88 with a Structured Clinical Interview for DSM-IV–based diagnosis of major depressive disorder, showing that it provides a very strong linkage to DSM-IV criteria.

In terms of the need for such a scale, what the CAT-DI provides that none of the existing scales do is a standard error of measurement that can be used to assess the uncertainty in the severity score obtained. As described in our article, the adaptive nature of the CAT-DI provides consistent precision of measurement across respondents by administering different items to respondents of varying depressive severity. Furthermore, the degree of precision can be set in advance depending on the application. Measurements of depressive severity in a randomized clinical trial may require greater precision than measurements used for screening in primary care or for psychiatric epidemiology. The CAT-DI permits the degree of precision to be selected in advance of testing depending on the requirements of the specific application. The traditional scales cited by Carroll provide the same items to all respondents and therefore allow uncertainty to vary across respondents. Figure 2 in our article clearly shows that the CAT-DI provides a linear scale of measurement with homogeneous variance across patient diagnostic strata (no depression, minor depressive disorder, and major depressive disorder), whereas the other scales show lack of discrimination between no depression and minor depression, skewed distributions, and greater overlap across diagnostic groups. It is hard to understand how Carroll concludes that these scales do as well as the CAT-DI and that the psychometric rigor the CAT-DI adds to the literature is of no practical value. It is even more difficult to understand how Carroll concludes that “no analyses showed that CAT-DI performance matches existing scales.”
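The mechanism described above can be sketched in a few lines. This is an illustrative toy, not the CAT-DI engine: the item bank, its parameters, and the two-parameter logistic (2PL) model are hypothetical stand-ins. Items are chosen by maximum Fisher information at the current severity estimate, and testing stops as soon as the posterior standard error falls below a preset threshold, which is how precision can be fixed in advance for a given application.

```python
# Toy adaptive test under a hypothetical 2PL item response model.
# Items are selected by maximum Fisher information; testing stops
# once the posterior SE of the severity estimate <= target_se.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item bank: discriminations a, difficulties b.
n_items = 200
a = rng.uniform(1.0, 2.5, n_items)
b = rng.uniform(-3.0, 3.0, n_items)

def prob(theta, aj, bj):
    """2PL probability of endorsing an item at severity theta."""
    return 1.0 / (1.0 + np.exp(-aj * (theta - bj)))

def run_cat(true_theta, target_se=0.3, max_items=50):
    """Administer items adaptively until SE(theta_hat) <= target_se."""
    grid = np.linspace(-4.0, 4.0, 161)
    post = np.exp(-0.5 * grid**2)        # N(0,1) population prior
    post /= post.sum()
    administered = []
    theta_hat, se = 0.0, 1.0
    for _ in range(max_items):
        # Pick the unused item with maximum information at theta_hat.
        p = prob(theta_hat, a, b)
        info = a**2 * p * (1.0 - p)
        info[administered] = -np.inf
        j = int(np.argmax(info))
        administered.append(j)
        # Simulate the respondent's answer at the true severity.
        resp = rng.random() < prob(true_theta, a[j], b[j])
        # Bayesian (EAP) update of the severity posterior on a grid.
        like = prob(grid, a[j], b[j])
        post *= like if resp else (1.0 - like)
        post /= post.sum()
        theta_hat = float((grid * post).sum())
        se = float(np.sqrt(((grid - theta_hat)**2 * post).sum()))
        if se <= target_se:
            break
    return theta_hat, se, len(administered)
```

Because the stopping rule is on the standard error rather than the item count, two respondents at different severities may answer different numbers of different items yet end with the same measurement precision.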

Carroll suggests that:

CAT-DI has other serious deficiencies. A guideline in multivariate analyses is that 10 times more subjects than items are needed for satisfactory solutions.

This statement is unclear and highlights the limitations of Carroll’s understanding of the statistical foundation of the CAT-DI. Multivariate analysis covers a wide range of statistical methods including multivariate analysis of variance, multivariate regression, factor analysis for measurement data, item response theory, and many other possibilities. The specific requirement for the number of respondents relative to “items” depends on many factors including the specific multivariate model, the types of hypotheses being tested (assuming any are being tested at all), effect sizes, and the method of estimation (eg, least squares vs maximum likelihood). In the context of item response theory, we certainly need more respondents than items, but a statement that the ratio must be 10:1 has no basis whatsoever in statistical theory. Since we are using item response theory to calibrate the model, the question is whether the solution is stable and the estimation procedure has converged. Carroll’s statement is apparently based on a quotation from Nunnally2 that predates the method of estimation used in our article (ie, marginal maximum likelihood estimation) and is therefore at best questionable in terms of the degree to which it applies and, more important, to what it applies. Furthermore, convergence was obtained both in the current study and in our previous study using a similar number of participants and number of items. If the more complex bifactor model were not appropriately estimated, it would be unlikely to provide such an overwhelming improvement in fit over the simpler unidimensional model. Finally, Carroll ignores the description of the analysis in which we clearly state that we used a balanced incomplete block design to select a subset of approximately 250 items per participant. Even if we were testing hypotheses regarding the item parameters (for example, to assess differential item functioning), the rule of 10 times the number of subjects to items could not possibly hold true for all hypotheses, item parameters, item response theory models, and effect sizes.
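The incomplete-block idea can be illustrated with a toy rotation scheme. This is a simple stand-in for a true balanced incomplete block design, with a hypothetical number of participants: each participant answers only a block of about 250 of the 389 items, yet rotating the blocks leaves every item answered by a large subsample, so the whole bank remains calibratable.

```python
# Toy illustration of an incomplete block design for item calibration:
# each participant sees only a block of the full bank, but rotating
# blocks gives every item ample coverage. Participant count is made up.
import numpy as np

n_items, block_size, n_participants = 389, 250, 1000
rng = np.random.default_rng(1)

# Assign each participant a contiguous "window" of items with a
# random start point (a crude stand-in for a true BIBD).
starts = rng.integers(0, n_items, n_participants)
counts = np.zeros(n_items, dtype=int)
for s in starts:
    idx = (s + np.arange(block_size)) % n_items
    counts[idx] += 1

# Every item is still answered by roughly block_size / n_items of the
# sample (about 643 of 1000 here), so its parameters remain estimable
# even though no participant answered the full bank.
```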

Carroll suggests that “The sample is mostly of low socioeconomic status (Table 2 in the Gibbons et al article) and perhaps marginally literate.” Table 2 of our article1 indicates that 95% of the sample had a high school degree or beyond. Seventy-four percent of the sample had some college. How exactly does Carroll come to the conclusion that they are “marginally literate”? Furthermore, the sample is quite representative of patients with depression in that it is a mixture of patients being seen at an urban tertiary referral center (Western Psychiatric Institute and Clinic at the University of Pittsburgh) and a local Pittsburgh, Pennsylvania, community mental health center.

Carroll suggests that the “choice of threshold SE ≤ 0.3 around CAT-DI scores is not justified, but this standard error appears too large relative to the scores.” Carroll’s lack of familiarity with modern psychometric theory and computerized adaptive testing is made glaringly apparent in this statement. The threshold of 0.3 SE is quite standard in computerized adaptive testing because it implies reliability in excess of 0.9. This follows from the underlying true score distribution being N(0,1) and the definition of reliability being:

r = σₜ² / (σₜ² + σₑ²) = 1 / (1 + 0.3²) ≈ 0.92

It is unclear why Carroll considers this tradition in computerized adaptive testing to be “too large relative to the scores.” What is the magnitude of the standard errors for the traditional scales that he seems to prefer? Of course, the standard error for an individual score for a traditional test such as the Hamilton Rating Scale for Depression is unknown. Again, it is clear that Carroll lacks the technical expertise to support these criticisms and that his criticism of the CAT-DI is based on something other than science.
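The arithmetic behind the threshold can be checked directly. With the latent trait scaled so that σₜ² = 1, a stopping standard error of SE implies an error variance of SE², and the reliability formula above reduces to a one-liner:

```python
# Reliability implied by a CAT stopping threshold on the standard
# error, under a latent trait with unit variance: r = 1 / (1 + SE^2).
def implied_reliability(se):
    return 1.0 / (1.0 + se * se)

r = implied_reliability(0.3)   # the CAT-DI threshold -> about 0.917
```

So any stricter threshold only raises the implied reliability; SE ≤ 0.3 is simply the conventional point at which reliability exceeds 0.9.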

Carroll suggests that “The goal of commercial development seems premature; patients risk being ‘assayed’ against a non–gold standard.” We have not proposed the CAT-DI as a gold standard but rather have demonstrated that the test tracks traditional depression measurement scales and the diagnosis of major depressive disorder, yet takes approximately 2 minutes to administer without the need for a clinical interview. Nowhere in the article do we claim that the CAT-DI should replace the clinician. Rather, we suggest that the CAT-DI is useful as a screening tool and as a method by which the effectiveness of treatment can be monitored.

Carroll goes on to suggest that we agree with his criticism that “It is not ready for clinical use, as Gibbons et al acknowledged, or for research.” This statement is completely inaccurate. We indicated that the software for web-based distribution would not be available until the end of 2012, not that the test is not ready for clinical use.

Carroll suggests that:

CAT-DI does not deliver clinically useful symptom profiles: exemplar case 2 (Table 3 in the Gibbons et al article) was not assessed for sleep, appetite, concentration, or psychomotor disturbances. Thus, after administering CAT-DI, clinicians would still need to administer a standardized scale to verify DSM-IV diagnostic criteria.

This comment appears to miss the point of the CAT-DI. For different individuals, different constellations of items are required to assess their severity. Not all individuals need to be asked all questions from all domains to estimate their depressive severity level. To verify DSM-IV diagnostic criteria, a DSM-IV clinical interview would be required regardless of what a particular rating scale indicates. The item response theory model underlying the CAT-DI allows different patients to be assessed using different items while still providing valid estimates of the underlying latent variable of interest: depressive severity. This is clearly evident from the fact that the adaptive test results correlate at r = 0.95 with scores from the entire 389-item bank, which does contain items from all depressive subdomains. Carroll also suggests that this is a weakness for longitudinal measurement because the same questions are not repeatedly asked. In fact, this is a distinct advantage of the CAT-DI because different questions can be asked on different occasions, thereby minimizing response set bias. Furthermore, as we articulated in the article, even greater savings for longitudinal assessments can be obtained by starting the next computerized adaptive test session from the severity estimate of the previous assessment. No such advantage can be realized with traditional psychiatric measurement instruments, and the repeated administration of the same items over the course of a study can lead to biased responses.
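The warm-start savings can be illustrated with a back-of-envelope calculation. Under a Gaussian approximation, precisions (inverse variances) add, so the number of items needed to reach a target standard error depends on how informative the starting prior is. The per-item information and the prior standard errors below are hypothetical numbers chosen only for illustration:

```python
# Back-of-envelope sketch (Gaussian approximation, hypothetical
# numbers): a session that starts from the previous severity estimate
# begins with a tighter prior and so needs fewer items to reach the
# same target standard error.
import math

TARGET_SE = 0.3
needed_precision = 1.0 / TARGET_SE**2   # ~11.1 to reach SE <= 0.3
info_per_item = 1.0                     # assumed average Fisher info

def items_needed(prior_se):
    """Items required to close the precision gap from a given prior."""
    prior_precision = 1.0 / prior_se**2
    deficit = max(0.0, needed_precision - prior_precision)
    return math.ceil(deficit / info_per_item)

cold = items_needed(1.0)   # fresh start from the N(0,1) population prior
warm = items_needed(0.5)   # prior centered on last session's estimate
```

Under these assumed numbers the warm start needs several fewer items per session, a saving that compounds across the repeated assessments of a longitudinal study.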

In summary, it is very clear that Carroll is not a fan of multidimensional item response theory and computerized adaptive testing as applied to the process of psychiatric measurement. It is, however, completely unclear that his lack of enthusiasm is based on any scientifically rigorous foundation. Indeed, his knowledge of these methods seems lacking.

Finally, Carroll is quick to point out the acknowledged potential conflicts of others as if they have led to bias in reporting of scientific information. In this case, it is Carroll who has the overwhelming conflict of interest. As developer, owner, and marketer of the Carroll Depression Scale–Revised, a traditional fixed-length test, it is not surprising that the paradigm shift described in our article would be of serious concern to him.

Article Information

Corresponding Author: Robert D. Gibbons, PhD, University of Chicago, 5841 S Maryland Ave, MC 2007 Office W260, Chicago, IL 60637.

Conflict of Interest Disclosures: The CAT-DI will ultimately be made available for routine administration and its development as a commercial product is under consideration.

Funding/Support: This work was supported by National Institute of Mental Health grant R01-MH66302.

References

1. Gibbons RD, Weiss DJ, Pilkonis PA, et al. Development of a computerized adaptive test for depression. Arch Gen Psychiatry. 2012;69(11):1104-1112.
2. Nunnally JC. Psychometric Theory. 2nd ed. New York, NY: McGraw-Hill; 1978.