MINI indicates Mini International Neuropsychiatric Interview; PHQ, Patient Health Questionnaire.
The figure is for the 44 studies (participants = 10 627; No. with major depression = 1361) that used a semistructured reference standard and had both PHQ-2 and PHQ-9 item scores available. Among the 48 PHQ-2 studies that used a semistructured reference standard, 4 studies did not have PHQ-9 item scores available, and thus could not be included in the comparison of screening strategies. The PHQ-2 line has 7 calculated points (inflections), representing possible scores of 0 (right) to 6 (left). The PHQ-9 alone and PHQ-2 scores of 2 or greater followed by PHQ-9 lines have 28 calculated points (inflections), representing possible scores of 0 (right) to 27 (left). The area under the curve was 0.88 (95% CI, 0.87-0.89) for PHQ-2 alone, 0.92 (95% CI, 0.91-0.93) for PHQ-9 alone, and 0.90 (95% CI, 0.89-0.91) for PHQ-2 scores of 2 or greater followed by PHQ-9.
eMethods 1. Search Strategies
eMethods 2. Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) Coding Manual for Primary Studies Included in the Present Study
eFigure 1. Receiver operating characteristic (ROC) Plots Comparing Sensitivity and Specificity Estimates for Patient Health Questionnaire-2 (PHQ-2) Cutoffs 1-6 Among Semi-structured Diagnostic Interviews, Fully Structured Diagnostic Interviews, and the Mini International Neuropsychiatric Interview (MINI)
eFigure 2. Forest Plots of Sensitivity and Specificity Estimates for Cutoff 2 and 3 of the PHQ-2 for Each Reference Standard Category, Including Among Participants Verified to Not Currently be Diagnosed or Receiving Treatment for a Mental Health Problem Compared to All Participants as Well as Among Participant Subgroups Based on Age, Sex, Human Development Index and Care Setting
eFigure 3. Nomograms of Positive and Negative Predictive Value for Assumed Major Depression Prevalence of 5-25%, Based on Accuracy Estimates Among Studies With a Semi-structured Reference Standard and PHQ-9 Scores Available
eTable 1. Characteristics of Included Primary Studies as Well as Eligible Primary Studies Not Included in the Present Study
eTable 2. Numbers of Participants and Cases of Major Depression by Diagnostic Interview
eTable 3. Estimates of Heterogeneity at PHQ-2 Cutoff Score of 2 and 3
eTable 4. Comparison of PHQ-2 Sensitivity and Specificity Estimates Among Participants Verified to Not Currently be Diagnosed or Receiving Treatment for a Mental Health Problem Compared to All Participants as Well as Among Participant Subgroups Based on Age, Sex, Human Development Index, Care Setting, and Risk of Bias Factors, for Each Reference Standard Category
eTable 5. Sensitivity and Specificity Estimates for the PHQ-2 Alone, the PHQ-9 Alone, and for PHQ-2 ≥ 2 Followed by PHQ-9 Among 44 Studies (N Participants = 10,627; N Major Depression = 1,361) That Used a Semi-structured Reference Standard and Had Both PHQ-2 and PHQ-9 Item Scores Available
eTable 6. Comparison of Sensitivity and Specificity for PHQ-2 ≥ 2 in Combination With PHQ-9 ≥ 5 to 15 Versus Sensitivity and Specificity for PHQ-9 ≥ 5 to 15, Among Studies That Used a Semi-structured Diagnostic Interview as the Reference Standard
eTable 7. QUADAS-2 Ratings for Each Primary Study Included in the Present Study
Levis B, Sun Y, He C, et al. Accuracy of the PHQ-2 Alone and in Combination With the PHQ-9 for Screening to Detect Major Depression: Systematic Review and Meta-analysis. JAMA. 2020;323(22):2290–2300. doi:10.1001/jama.2020.6504
What is the accuracy of the Patient Health Questionnaire (PHQ)–2 alone and in combination with the PHQ-9 for screening for depression?
In an individual participant data meta-analysis that included 10 627 participants from 44 studies with semistructured diagnostic interviews, the combination of PHQ-2 (with cutoff ≥2) followed by PHQ-9 (with cutoff ≥10) had a sensitivity of 0.82, specificity of 0.87, and area under the receiver operating characteristic curve of 0.90.
PHQ-2 followed by PHQ-9 may provide acceptable accuracy for screening for depression.
The Patient Health Questionnaire depression module (PHQ-9) is a 9-item self-administered instrument used for detecting depression and assessing severity of depression. The Patient Health Questionnaire–2 (PHQ-2) consists of the first 2 items of the PHQ-9 (which assess the frequency of depressed mood and anhedonia) and can be used as a first step to identify patients for evaluation with the full PHQ-9.
To estimate PHQ-2 accuracy alone and combined with the PHQ-9 for detecting major depression.
MEDLINE, MEDLINE In-Process & Other Non-Indexed Citations, PsycINFO, and Web of Science (January 2000-May 2018).
Eligible data sets compared PHQ-2 scores with major depression diagnoses from a validated diagnostic interview.
Data Extraction and Synthesis
Individual participant data were synthesized with bivariate random-effects meta-analysis to estimate pooled sensitivity and specificity of the PHQ-2 alone among studies using semistructured, fully structured, or Mini International Neuropsychiatric Interview (MINI) diagnostic interviews separately and in combination with the PHQ-9 vs the PHQ-9 alone for studies that used semistructured interviews. The PHQ-2 score ranges from 0 to 6, and the PHQ-9 score ranges from 0 to 27.
Individual participant data were obtained from 100 of 136 eligible studies (44 318 participants; 4572 with major depression [10%]; mean age, 49 years; 59% female). Among studies that used semistructured interviews, PHQ-2 sensitivity and specificity (95% CI) were 0.91 (0.88-0.94) and 0.67 (0.64-0.71) for cutoff scores of 2 or greater and 0.72 (0.67-0.77) and 0.85 (0.83-0.87) for cutoff scores of 3 or greater. Sensitivity was significantly greater for semistructured vs fully structured interviews. Specificity was not significantly different across the types of interviews. The area under the receiver operating characteristic curve was 0.88 (0.86-0.89) for semistructured interviews, 0.82 (0.81-0.84) for fully structured interviews, and 0.87 (0.85-0.88) for the MINI. There were no significant subgroup differences. For semistructured interviews, sensitivity for PHQ-2 scores of 2 or greater followed by PHQ-9 scores of 10 or greater (0.82 [0.76-0.86]) was not significantly different from that for PHQ-9 scores of 10 or greater alone (0.86 [0.80-0.90]); specificity for the combination was significantly but minimally higher (0.87 [0.84-0.89] vs 0.85 [0.82-0.87]). The area under the curve was 0.90 (0.89-0.91). The combination was estimated to reduce the number of participants needing to complete the full PHQ-9 by 57% (56%-58%).
Conclusions and Relevance
In an individual participant data meta-analysis of studies that compared PHQ scores with major depression diagnoses, the combination of PHQ-2 (with cutoff ≥2) followed by PHQ-9 (with cutoff ≥10) had similar sensitivity but higher specificity compared with PHQ-9 cutoff scores of 10 or greater alone. Further research is needed to understand the clinical and research value of this combined approach to screening.
In depression screening, questionnaires are used to identify patients with scores above a cutoff threshold for evaluation to determine whether depression is present.1 One strategy is to administer a brief screening tool followed by a longer tool for positive screens.2,3 The Patient Health Questionnaire–2 (PHQ-2),4 which consists of the first 2 items (depressed mood and anhedonia) of the Patient Health Questionnaire–9 (PHQ-9),5 has been recommended as a prescreen prior to administering remaining PHQ-9 items (Table 1).2,4,6,7
A 2016 aggregate-data meta-analysis8 on PHQ-2 accuracy included 21 published studies of the PHQ-2; however, it did not include PHQ-2 data from an additional 37 studies of the PHQ-9.9,10 Except for clinical setting, subgroup results were not reported in primary studies and not evaluated; all primary studies were synthesized regardless of the diagnostic interview used, despite differences in their likelihood of classifying major depression11-13; and PHQ-2 accuracy was not evaluated in combination with the PHQ-9, as typically used in practice. Two primary studies14,15 have evaluated the PHQ-2 and PHQ-9 combination and produced inconsistent results; one examined score cutoffs for PHQ-2 of 2 or greater and for PHQ-9 of 10 or greater in older community-dwelling adults,14 and the other examined score cutoffs for PHQ-2 of 2 or greater and for PHQ-9 of 6 or greater in patients with acute coronary syndrome.15
The objectives of this meta-analysis of individual participant data were to evaluate PHQ-2 screening accuracy in adults (1) among studies that used different types of reference standards separately; (2) among participants verified as not diagnosed or in treatment vs all participants and by subgroups based on age, sex, country Human Development Index, and recruitment setting; and (3) alone and in combination with the PHQ-9 vs the PHQ-9 alone.
We published a protocol16 and registered in PROSPERO (CRD42014010673). Results were reported per PRISMA-DTA17 and PRISMA-IPD.18 Previous publications reported the accuracy of the PHQ-8 and the PHQ-9.19,20 Individual prediction models described in the protocol will be developed in future studies. Analysis of the PHQ-2 and PHQ-9 combination was not prespecified. This study involved analysis of previously collected deidentified data, and included studies were required to have obtained ethics approval and informed consent; thus, the research ethics committee of the Jewish General Hospital determined that ethics approval was not required.
Studies were sought with data sets that (1) included PHQ-2 scores or item data to calculate PHQ-2 scores; (2) included current major depressive disorder or major depressive episode classification based on Diagnostic and Statistical Manual of Mental Disorders (DSM)21-23 or International Classification of Diseases (ICD)24 criteria and a validated diagnostic interview; (3) administered the PHQ and diagnostic interview within a 2-week period because diagnostic criteria include only symptoms from the last 2 weeks; (4) included participants 18 years and older not recruited from school or university settings; and (5) did not recruit participants only from psychiatric settings or with depression symptoms because screening is done to identify people not suspected of having depression.25 In data sets where only some participants were eligible, we included only those participants. There were no language restrictions.
The database search was designed by a medical librarian and peer-reviewed26 and included MEDLINE, MEDLINE In-Process & Other Non-Indexed Citations via Ovid, PsycINFO, and Web of Science (January 1, 2000-May 9, 2018) (eMethods 1 in the Supplement). We searched from 2000 because the PHQ-9 was published in 2001.5 We reviewed review articles and queried contributing authors about nonpublished studies or studies not identified by the search. We uploaded results into RefWorks (RefWorks-COS; Bethesda, Maryland), removed duplicates, then uploaded references into DistillerSR (Evidence Partners; Ottawa, Ontario, Canada).
Titles and abstracts were independently reviewed by varying pairs of 2 investigators. If 1 identified a study as potentially eligible, the full text was reviewed by pairs of 2 investigators independently. Any differences were resolved by consensus, with a third investigator consulted if necessary.
We conducted a literature search on April 6, 2020, to seek eligible published results that could be included. No studies published since the original search provided results for PHQ-2 and PHQ-9 combined.
We emailed corresponding authors of studies with eligible data sets at least 3 times, as necessary, to invite them to contribute data sets. If there was no response, we emailed coauthors and attempted contact by telephone.
Country, recruitment setting (nonmedical, primary care, inpatient, outpatient specialty), and diagnostic interview were extracted from published reports by 2 investigators independently, with disagreements resolved by consensus. Countries were categorized as having very high, high, or low-medium development based on the United Nations 2019 Human Development Index.27 Individual participant records included sex, age, major depression status, current mental health diagnosis or treatment, and PHQ-2 and PHQ-9 total and item scores. PHQ-9 items reflect the 9 DSM symptoms of major depression; PHQ-2 items reflect depressed mood and anhedonia. We prioritized major depressive episode over major depressive disorder, if both were provided, because screening attempts to detect episodes, and we prioritized DSM over ICD. For 4 studies with multiple recruitment settings, setting was coded by participant. When primary studies provided sampling weights, we used those weights. If weighting should have been done but was not, we used inverse selection probability weights. If all study participants with scores above a threshold but only a random subset of 50% below the threshold received a diagnostic interview, for instance, those above the threshold received a weight of 1 and those below received a weight of 2.
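The weighting example above can be sketched in a few lines of Python. This is an illustration only, not the study's code: the function name, its parameters, and the choice to treat scores at or above the threshold as fully sampled are assumptions made for the example.

```python
def inverse_probability_weights(scores, threshold, sampling_fraction_below):
    """Return one weight per participant: 1 / probability of being interviewed.

    Assumption for this sketch: everyone scoring at or above `threshold`
    was interviewed (probability 1), and a random fraction
    `sampling_fraction_below` of those scoring below it was interviewed.
    """
    weights = []
    for score in scores:
        probability = 1.0 if score >= threshold else sampling_fraction_below
        weights.append(1.0 / probability)
    return weights


# Hypothetical screening scores; a 50% subsample below the threshold
# gives those participants a weight of 2, as in the example in the text.
example_weights = inverse_probability_weights(
    [0, 3, 5], threshold=3, sampling_fraction_below=0.5
)
```

Each interviewed participant below the threshold then stands in for 2 participants when sensitivity and specificity are computed.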
For each included data set, we attempted to replicate published participant characteristics and accuracy results. We worked with primary study investigators to resolve any discrepancies.
Risk of bias was assessed with the Quality Assessment of Diagnostic Accuracy Studies–2 tool (QUADAS-2; eMethods 2 in the Supplement).28 This was done by 2 investigators independently with discrepancies resolved by consensus, involving a third investigator, if necessary.
The PHQ-2 score ranges from 0 to 6, and the PHQ-9 score ranges from 0 to 27. We estimated sensitivity and specificity for all possible PHQ-2 cutoffs (scores 1-6) by reference standard type separately: semistructured diagnostic interviews; fully structured diagnostic interviews, excluding the Mini International Neuropsychiatric Interview (MINI)29,30; and the MINI. We did this because, controlling for depressive symptom scores, the Composite International Diagnostic Interview (CIDI),31 the most commonly used fully structured interview, may classify more participants with low-level symptoms as depressed, but fewer participants with higher-level symptoms, than semistructured interviews.11-13 The MINI may classify more participants as depressed.11-13 This is consistent with interview designs. Semistructured interviews are intended for administration by experienced diagnosticians, require clinical judgment, and allow question rephrasing and probes. Fully structured interviews are designed for lay interviewer administration and are fully scripted with no deviation allowed. They are intended to achieve standardization but may sacrifice accuracy.32-35 The MINI was designed for rapid administration and to be overinclusive.29,30
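For a single data set, sensitivity and specificity at each PHQ-2 cutoff can be computed as in the following sketch. It is a simplified, unweighted illustration with made-up data; the pooled estimates in this study come from bivariate random-effects models, which this example does not implement. The function and variable names are hypothetical.

```python
def sensitivity_specificity(scores, is_case, cutoff):
    """Sensitivity and specificity when 'score >= cutoff' is a positive screen."""
    true_pos = sum(1 for s, c in zip(scores, is_case) if c and s >= cutoff)
    false_neg = sum(1 for s, c in zip(scores, is_case) if c and s < cutoff)
    true_neg = sum(1 for s, c in zip(scores, is_case) if not c and s < cutoff)
    false_pos = sum(1 for s, c in zip(scores, is_case) if not c and s >= cutoff)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Made-up PHQ-2 scores (0 to 6) and reference standard classifications.
toy_scores = [0, 1, 2, 3, 4, 6, 2, 5]
toy_is_case = [False, False, False, False, True, True, True, False]

# One (sensitivity, specificity) pair for each possible cutoff, 1 through 6.
accuracy_by_cutoff = {
    cutoff: sensitivity_specificity(toy_scores, toy_is_case, cutoff)
    for cutoff in range(1, 7)
}
```

Raising the cutoff trades sensitivity for specificity, which is the pattern seen when comparing cutoffs of 2 or greater and 3 or greater.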
Within each reference standard category, we conducted subgroup analyses. We estimated sensitivity and specificity among participants who could be verified as not currently diagnosed or receiving mental health treatment vs all participants. This is because some primary studies included people already diagnosed or receiving treatment, but those participants would not be screened in practice. We estimated sensitivity and specificity by age (<60, ≥60 years), sex, country Human Development Index, and recruitment setting.
Among studies that used a semistructured interview, we evaluated accuracy of the PHQ-2 and PHQ-9 combination based on commonly used cutoffs.8,20 We compared sensitivity and specificity for PHQ-2 scores of 2 or greater and 3 or greater alone and combined with PHQ-9 scores of 10 or greater vs PHQ-9 scores of 10 or greater alone. In each scenario, we calculated the number of participants who scored above the PHQ-2 threshold and, in practice, would need to complete the full PHQ-9. For these analyses, we excluded studies and participants without PHQ-9 scores. In additional analyses, we compared sensitivity and specificity for PHQ-2 scores of 2 or greater in combination with PHQ-9 cutoff scores of 5 to 15 vs PHQ-9 alone at cutoff scores of 5 to 15.
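The combined strategy can be expressed as a simple decision rule. The sketch below assumes each participant's responses are stored as a list of 9 item scores (0 to 3 each) with the 2 PHQ-2 items first; the function names and default cutoffs are illustrative choices, not the study's code.

```python
def needs_full_phq9(phq9_items, phq2_cutoff=2):
    """True if the PHQ-2 score (first 2 items) reaches the PHQ-2 cutoff."""
    return sum(phq9_items[:2]) >= phq2_cutoff


def screens_positive_combined(phq9_items, phq2_cutoff=2, phq9_cutoff=10):
    """Positive only if the PHQ-2 is positive AND the full PHQ-9 is positive."""
    if not needs_full_phq9(phq9_items, phq2_cutoff):
        return False  # screening stops after the first 2 items
    return sum(phq9_items) >= phq9_cutoff


# Hypothetical respondent: PHQ-2 score of 1, PHQ-9 total of 22.
# PHQ-9 alone would be positive; the combined strategy is negative.
high_total_low_phq2 = [0, 1, 3, 3, 3, 3, 3, 3, 3]
combined_result = screens_positive_combined(high_total_low_phq2)
```

Because a positive combined screen requires both conditions, the combination can only lose sensitivity and gain specificity relative to the PHQ-9 alone at the same PHQ-9 cutoff.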
In all meta-analyses, for all cutoff scores separately, we fit bivariate random-effects models using Gauss-Hermite quadrature.36 This 2-stage approach simultaneously models sensitivity and specificity, accounting for the correlation between them and within-study precision estimates. Within each reference standard category, we constructed empirical receiver operating characteristic plots and calculated area under the curve (AUC). To compare results between subgroups and for the PHQ-2 alone and the PHQ-2 and PHQ-9 combination vs the PHQ-9 alone, we estimated sensitivity and specificity differences and constructed confidence intervals for differences via the cluster bootstrap approach,37,38 resampling at study and participant levels. We ran 1000 bootstrap iterations for each comparison, omitting iterations where difference estimates were not produced. We considered differences statistically significant if their confidence intervals did not include 0.
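A minimal sketch of a 2-level cluster bootstrap for a difference in sensitivity follows. It makes several simplifying assumptions that are not taken from the study: sensitivity is pooled by simple counting rather than with a bivariate random-effects model, the interval is a percentile interval, and the data layout (a list of studies, each a list of participant records with hypothetical keys `phq2` and `case`) is invented for illustration.

```python
import random


def cluster_bootstrap_sample(studies, rng):
    """Resample studies with replacement, then participants within each study."""
    resampled = []
    for _ in range(len(studies)):
        study = rng.choice(studies)
        resampled.append([rng.choice(study) for _ in range(len(study))])
    return resampled


def pooled_sensitivity(studies, classify):
    """Simplified pooled sensitivity: positives among all cases, by counting."""
    cases = [p for study in studies for p in study if p["case"]]
    if not cases:
        return None  # no estimate can be produced for this sample
    return sum(1 for p in cases if classify(p)) / len(cases)


def bootstrap_difference_ci(studies, classify_a, classify_b, n_iter=1000, seed=0):
    """Percentile 95% CI for sensitivity(A) - sensitivity(B)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_iter):
        sample = cluster_bootstrap_sample(studies, rng)
        sens_a = pooled_sensitivity(sample, classify_a)
        sens_b = pooled_sensitivity(sample, classify_b)
        if sens_a is None or sens_b is None:
            continue  # omit iterations where no difference estimate is produced
        diffs.append(sens_a - sens_b)
    if not diffs:
        raise ValueError("no bootstrap iteration produced an estimate")
    diffs.sort()
    lower = diffs[int(0.025 * len(diffs))]
    upper = diffs[min(len(diffs) - 1, int(0.975 * len(diffs)))]
    return lower, upper
```

Under this approach, a difference would be called statistically significant when the returned interval excludes 0.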
To evaluate heterogeneity, for each included study, we produced sensitivity and specificity forest plots by reference standard category and for all studies in each subgroup within each category. We quantified heterogeneity by reporting τ², the estimated variances of the random effects for sensitivity and specificity, and estimating R, the ratio of the estimated standard deviation of pooled sensitivity or specificity from the random-effects model to the estimated standard deviation from the corresponding fixed-effects model.39
We generated hypothetical nomograms to illustrate possible positive and negative predictive values of PHQ-2 cutoff scores of 2 or greater and 3 or greater alone and in combination with PHQ-9 scores of 10 or greater for assumed major depression prevalence of 5% to 25%. These were based on summary sensitivity and specificity estimates from the analysis of studies that used semistructured interviews and had PHQ-9 scores available.
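The predictive values behind such nomograms follow from Bayes' theorem. The sketch below uses the standard formulas; the function name is hypothetical, and the sensitivity and specificity values plugged in are the reported estimates for PHQ-2 scores of 2 or greater followed by PHQ-9 scores of 10 or greater (0.82 and 0.87).

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumed prevalence values spanning the 5% to 25% range used for the nomograms.
values_by_prevalence = {
    prevalence: predictive_values(0.82, 0.87, prevalence)
    for prevalence in (0.05, 0.10, 0.15, 0.20, 0.25)
}
```

With sensitivity and specificity held fixed, positive predictive value rises and negative predictive value falls as the assumed prevalence increases.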
In sensitivity analyses, within each reference standard category, we evaluated whether there were accuracy differences by subgroups based on QUADAS-2 items. We did this for all items with at least 100 major depression cases and noncases rated as low vs unclear or high risk of bias.
For all analyses, we excluded studies with no major depression cases or noncases, because this did not allow application of the bivariate random-effects model, and participants missing data for a covariate of interest. There was a maximum of 74 participants excluded from any analysis. For clinical setting, we excluded 1 MINI study (130 participants) that recruited inpatients and outpatients but did not have participant-level setting data.
We did not conduct sensitivity analyses that combined accuracy results with published results from studies that did not contribute data. This is because, among 36 eligible studies that did not contribute data, only 2 studies with a semistructured reference standard40,41 (908 participants, 65 cases), 1 study with a fully structured reference standard42 (201 participants, 42 cases), and 4 studies using the MINI43-46 (878 participants, 220 cases) published accuracy results eligible for any analyses. The other studies with eligible data sets did not publish eligible accuracy results (eTable 1b in the Supplement).
All analyses were run in R (version 3.4.1; RStudio version 1.0.143) using the glmer function within the lme4 package.47 For cutoff scores of 1 or greater for fully structured and 5 or greater for MINI reference standards, the default optimizer failed to converge, and bobyqa was used. In each analysis, pooled sensitivity and specificity and corresponding 2-sided 95% CIs were estimated.
The database search identified 9674 unique citations, of which 9198 were excluded after title and abstract review and 289 after full-text review, leaving 187 eligible articles with 131 unique data sets. Of these, 100 (76%) contributed data sets with PHQ-9 scores, PHQ-2 scores, or both. Authors of included studies contributed data from 5 additional unpublished studies, for a total of 105 data sets. Five data sets with PHQ-9 total scores did not have item data necessary to calculate PHQ-2 scores and were excluded. Thus, 100 data sets (44 318 participants; 4572 cases [10%]; mean age, 49 years; 59% female) were included (Figure 1). eTable 1 in the Supplement shows study characteristics of included studies and eligible studies that did not provide data. Not counting the 5 unpublished studies, of 54 633 participants in 131 eligible published studies, we included 43 787 participants (80%) from 95 published studies (73%).
Of the 100 included data sets, 48 were from studies that used semistructured interviews, 20 from studies that used fully structured interviews (MINI excluded), and 32 from studies that used the MINI. The Structured Clinical Interview for the DSM (SCID)48 (45 studies, 9713 participants) and CIDI (17 studies, 15 899 participants) were the most commonly used semistructured and fully structured interviews (Table 2; eTable 2 in the Supplement).
Among studies with a semistructured interview, sensitivity and specificity for PHQ-2 scores of 2 or greater were 0.91 (95% CI, 0.88-0.94) and 0.67 (95% CI, 0.64-0.71); for PHQ-2 scores of 3 or greater, sensitivity and specificity were 0.72 (95% CI, 0.67-0.77) and 0.85 (95% CI, 0.83-0.87), respectively. Across cutoffs, sensitivity with semistructured interviews was 0.04 (95% CI, 0.01-0.08) to 0.20 (95% CI, 0.10-0.28) higher than with fully structured interviews (significantly higher for cutoffs 1-6) and 0.02 (95% CI, 0.00-0.04) to 0.05 (95% CI, −0.04 to 0.13) higher than with the MINI (not significantly different at any cutoff); specificity was not significantly different across reference standard types (Table 3; eFigure 1 in the Supplement). The AUC was 0.88 (95% CI, 0.86-0.89) for semistructured interviews, 0.82 (95% CI, 0.81-0.84) for fully structured diagnostic interviews, and 0.87 (95% CI, 0.85-0.88) for the MINI.
There was moderate heterogeneity. For cutoffs 2 to 3, the τ² values ranged from 0.47 to 1.29 for sensitivity and 0.27 to 0.78 for specificity, while R values ranged from 2.22 to 3.50 for sensitivity and 3.47 to 9.30 for specificity. Forest plots are shown in eFigure 2 and τ² and R values in eTable 3 in the Supplement.
Sensitivity and specificity estimates were not significantly different for participants verified as not currently diagnosed or receiving mental health treatment compared with all participants across reference standard categories. Among other subgroup comparisons, there were no statistically significant or substantive differences that replicated across cutoffs and reference standard categories (eTable 4; forest plots: eFigure 2; τ² and R values: eTable 3 in the Supplement).
Based on 44 studies that used a semistructured reference standard and provided both PHQ-2 and PHQ-9 scores, compared with PHQ-9 scores of 10 or greater alone, all strategies resulted in substantially reduced sensitivity or specificity, except PHQ-2 scores of 2 or greater in combination with PHQ-9 scores of 10 or greater. For this combination, sensitivity was 0.82 (95% CI, 0.76-0.86) vs 0.86 (95% CI, 0.80-0.90) (not statistically significant) and specificity was slightly higher (0.87 [95% CI, 0.84-0.89] vs 0.85 [95% CI, 0.82-0.87]) (statistically significant; Table 4; eTable 5 in the Supplement; Figure 2). The AUC was 0.90 (95% CI, 0.89-0.91). Nomograms of positive and negative predictive values are shown in eFigure 3 in the Supplement. Using PHQ-2 scores of 2 or greater in combination with other PHQ-9 cutoffs (5-9, 11-15) resulted in lower combined sensitivity and specificity compared with PHQ-2 scores of 2 or greater with PHQ-9 scores of 10 or greater (eTable 6 in the Supplement).
With PHQ-2 scores of 2 or greater then PHQ-9 scores of 10 or greater, 43% (95% CI, 42%-44%) of participants had positive PHQ-2 screens and would have needed to complete the full PHQ-9 in practice; 23% (95% CI, 22%-24%) of all participants would have had a positive PHQ-9 screen and needed further mental health assessment compared with 25% (95% CI, 24%-26%) for PHQ-9 scores of 10 or greater alone and 43% (95% CI, 42%-44%) for PHQ-2 scores of 2 or greater alone.
eTable 7 in the Supplement shows QUADAS-2 ratings for individual signaling items and risk of bias domains for included primary studies. Among 400 total domain ratings (4 per included study), 131 (33%) were coded as having low risk of bias, 253 (63%) as having an unclear risk, 11 (3%) as having a high risk, and 5 (1%) as varying across participants within a study. Three of 48 studies (6%) that used a semistructured interview, 6 of 20 studies (30%) with a fully structured interview, and 9 of 32 studies (28%) with a MINI reference standard had low risk of bias across all 4 domains.
PHQ-2 accuracy comparisons across QUADAS-2 items within reference standard categories are shown in eTable 4 in the Supplement. No statistically significant differences were found that replicated across cutoffs for any reference standard category.
In this individual participant data meta-analysis of 44 studies that used semistructured diagnostic interviews to classify depression, sensitivity using the combination of PHQ-2 (cutoff ≥2) and PHQ-9 (cutoff ≥10) was not significantly different from that using the full PHQ-9 (cutoff ≥10) for all participants. Specificity for the combination was significantly, though minimally, higher. The combination approach was estimated to reduce the number of participants needing to do the full PHQ-9 by 57% (95% CI, 56%-58%). Compared with the PHQ-9 alone, the PHQ-2 alone resulted in statistically significantly lower sensitivity or specificity, depending on the cutoff score.
Consistent with previous findings with the PHQ-9,20 PHQ-2 sensitivity was highest compared with semistructured interviews, which most closely replicate clinical interviews by trained professionals, and lower compared with fully structured interviews and the MINI, although differences compared with the MINI were small and not statistically significant. Specificity estimates were not significantly different across reference standards. There were no significant accuracy differences between subgroups that replicated across reference standard categories, although some subgroups had limited numbers of participants and cases.
The finding that PHQ-2 sensitivity was greater when compared with semistructured rather than fully structured interviews may have occurred because fully structured interviews are designed for reliability at the cost of validity.32-35 Previous studies found that among participants with low-level depressive symptoms, fully structured interviews may classify more participants as having major depression than semistructured interviews but fewer among participants with high-level symptoms.11-13 In the present meta-analysis, most participants did not have major depression. Thus, misclassification of major depression among participants with subthreshold depressive symptoms based on fully structured interviews might explain the lower sensitivity compared with semistructured interviews.
Among studies with semistructured interviews, PHQ-2 sensitivity and specificity were generally similar to estimates reported in a previous aggregate-data meta-analysis that combined reference standards without adjustment.8 Using individual participant data from 48 studies with semistructured interviews in the present study, sensitivity and specificity were, respectively, 0.91 and 0.67 for cutoff scores of 2 or greater and 0.72 and 0.85 for cutoff scores of 3 or greater compared with 0.91 and 0.70 for cutoff scores of 2 or greater (17 studies) and 0.76 and 0.87 for cutoff scores of 3 or greater (19 studies) in the previous meta-analysis. This differed from a meta-analysis of PHQ-9 individual participant data,20 in which, among studies that used a semistructured interview, sensitivity at the standard cutoff score of 10 or greater was substantially greater than reported in a previous aggregate-data meta-analysis that combined reference standards.9,20
No previous meta-analysis and only 2 primary studies14,15 have evaluated the PHQ-2 in combination with the PHQ-9. The 2 primary studies, however, reported results using different cutoff combinations and generated estimates of sensitivity and specificity that differed among older community-dwelling adults (N = 378; sensitivity = 0.81, specificity = 0.89) and patients with coronary artery disease (N = 1024; sensitivity = 0.75, specificity = 0.84). Using individual participant data from 44 primary studies with semistructured interviews in the present study and standard cutoffs, which maximized combined sensitivity and specificity, sensitivity (0.82) for PHQ-2 scores of 2 or greater followed by PHQ-9 scores of 10 or greater was not significantly different from that for PHQ-9 scores of 10 or greater alone, and specificity (0.87) was significantly better, though minimally. Assuming that screening procedures allow for quick calculation of PHQ-2 scores before presenting remaining PHQ-9 items (eg, electronic administration), the combination could improve efficiency.
Routine screening for depression in primary care has been recommended in the United States.6 National guidelines from Canada and the United Kingdom, however, recommended against screening due to the lack of direct trial evidence of benefit and concerns about harms and consumption of health care resources.49-52 Well-conducted trials that compare screening vs no screening are needed to determine whether screening improves mental health outcomes. Using the PHQ-2 in combination with the PHQ-9 may be a resource-efficient approach. Many individuals who screen positive, however, will not meet major depression diagnostic criteria and will need to be evaluated by a clinician.
Strengths of the study included the large sample size, inclusion of results from all cutoffs from all studies (rather than just those published), assessment of PHQ-2 accuracy separately across reference standards and by participant subgroups, and evaluation of the PHQ-2 and PHQ-9 combination, which had not been previously done in meta-analyses.
This study has several limitations. First, primary data from 36 of 131 published eligible data sets (27%) were not included.
Second, there was moderate heterogeneity across studies, although it improved in most cases when subgroups were considered. Subgroup analyses based on medical comorbidities, as specified in the study protocol, and on country and language could not be conducted. This is because data on the presence of nonpsychiatric medical diagnoses were not available for 40% of participants, with higher percentages missing for specific diagnoses, and because many countries and languages were represented in few primary studies.
Third, many included studies did not explicitly exclude participants who may have already been diagnosed or receiving care for depression, although there were no statistically significant differences between analyses of participants verified to not currently be diagnosed or receiving treatment and analyses of all participants, including those without this information.
Fourth, studies in the meta-analysis of individual participant data were categorized based on the interview administered, but it is possible that interviews may not have always been used in the way intended. Among 48 studies that used semistructured interviews, 3 used interviewers who did not meet typical standards, and 11 were rated unclear. It is possible that use of nonqualified interviewers may have reduced differences in accuracy estimates across reference standard categories.
Fifth, few studies were rated as having a low risk of bias across all QUADAS-2 domains; thus, sensitivity analyses using only studies with all low ratings were not conducted.
In an individual participant data meta-analysis of studies that compared PHQ scores with major depression diagnoses, the combination of PHQ-2 (with cutoff ≥2) followed by PHQ-9 (with cutoff ≥10) had similar sensitivity but higher specificity compared with PHQ-9 cutoff scores of 10 or greater alone. Further research is needed to understand the clinical and research value of this combined approach to screening.
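The combined rule summarized above can be sketched in a few lines of code. This is an illustration only, not the analysis code used in the study: the function and constant names are hypothetical, and the sketch assumes that the PHQ-2 score is the sum of the first 2 PHQ-9 items, each scored 0 to 3. It shows why the combined approach cannot flag more participants than the PHQ-9 alone: every combined-positive result is also a PHQ-9-positive result.

```python
from typing import Sequence

# Cutoffs taken from the text: PHQ-2 >= 2 followed by PHQ-9 >= 10.
PHQ2_CUTOFF = 2
PHQ9_CUTOFF = 10


def _validate(items: Sequence[int]) -> None:
    """Check for nine item scores, each between 0 and 3 (assumed scoring)."""
    if len(items) != 9 or any(not 0 <= s <= 3 for s in items):
        raise ValueError("expected nine PHQ-9 item scores, each 0 to 3")


def phq9_alone_positive(items: Sequence[int]) -> bool:
    """Positive screen using the PHQ-9 total score alone."""
    _validate(items)
    return sum(items) >= PHQ9_CUTOFF


def phq2_then_phq9_positive(items: Sequence[int]) -> bool:
    """Positive screen only if the PHQ-2 and then the PHQ-9 cutoffs are met."""
    _validate(items)
    phq2 = items[0] + items[1]  # assumption: PHQ-2 = first two PHQ-9 items
    if phq2 < PHQ2_CUTOFF:
        return False  # screening stops; the full PHQ-9 is not scored
    return sum(items) >= PHQ9_CUTOFF


# Hypothetical participant: PHQ-9 total of 10 but PHQ-2 score of 1.
example = [0, 1, 2, 2, 2, 1, 1, 1, 0]
print(phq9_alone_positive(example), phq2_then_phq9_positive(example))
```

For this hypothetical participant, the PHQ-9 alone is positive, whereas the combined rule is negative. Discordant cases of this kind are the only way the 2 strategies can differ.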
Corresponding Author: Brett D. Thombs, PhD, Jewish General Hospital; 4333 Cote Ste Catherine Rd; Montreal, Quebec, Canada, H3T 1E4 (email@example.com).
Accepted for Publication: April 10, 2020.
Author Contributions: Drs Benedetti and Thombs had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Drs Benedetti and Thombs contributed equally as co–senior authors.
Concept and design: Levis, Benedetti, Thombs.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: Levis, Sun, Benedetti, Thombs.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Levis, Sun, He, Wu, Negeri, Fischer, Benedetti, Thombs.
Obtained funding: Benedetti, Thombs.
Administrative, technical, or material support: Sun, Thombs.
Supervision: Benedetti, Thombs.
Conflict of Interest Disclosures: None reported.
Funding/Support: This study was funded by the Canadian Institutes of Health Research (CIHR; grants KRS-134297, PCG-155468, and PJT-162206). Dr Levis was supported by a CIHR Frederick Banting and Charles Best Canada Graduate Scholarship doctoral award and a Fonds de recherche du Québec–Santé (FRQS) Postdoctoral Training Fellowship. Dr Wu was supported by a FRQS Postdoctoral Training Fellowship. Mr Bhandari was supported by a studentship from the Research Institute of the McGill University Health Centre. Ms Neupane was supported by a G.R. Caverhill Fellowship from the Faculty of Medicine, McGill University. Drs Benedetti and Thombs were supported by FRQS researcher salary awards.
Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Group Information: The DEPRESSD PHQ Collaboration members and contributions include the following:
Data analysis: Liying Chen, McGill University, Montréal, Québec, Canada; and Alexander W. Levis, McGill University, Montréal, Québec, Canada.
Data extraction, coding, and synthesis: Kira E. Riehm, Lady Davis Institute for Medical Research, Montréal, Québec, Canada; Nazanin Saadat, Lady Davis Institute for Medical Research, Montréal, Québec, Canada; Marleine Azar, McGill University, Montréal, Québec, Canada; and Danielle B. Rice, McGill University, Montréal, Québec, Canada.
Design and conduct of database searches: Jill Boruff, McGill University, Montréal, Québec, Canada; and Lorie A. Kloda, Concordia University, Montréal, Québec, Canada.
DEPRESSD Steering Committee, including conception and oversight of collaboration: Pim Cuijpers, Vrije Universiteit, Amsterdam, the Netherlands; Simon Gilbody, University of York, Heslington, York, UK; John P. A. Ioannidis, Stanford University, Stanford, California; Dean McMillan, University of York, Heslington, York, UK; Scott B. Patten, University of Calgary, Calgary, Alberta, Canada; Ian Shrier, McGill University, Montréal, Québec, Canada; and Roy C. Ziegelstein, Johns Hopkins University School of Medicine, Baltimore, Maryland.
Knowledge user consultant: Ainsley Moore, McMaster University, Hamilton, Ontario, Canada.
Contributed included data sets: Dickens H. Akena, Makerere University College of Health Sciences, Kampala, Uganda; Dagmar Amtmann, University of Washington, Seattle; Bruce Arroll, University of Auckland, Auckland, New Zealand; Liat Ayalon, Bar Ilan University, Ramat Gan, Israel; Hamid R. Baradaran, Iran University of Medical Sciences, Tehran, Iran; Anna Beraldi, Lehrkrankenhaus der Technischen Universität München, Munich, Germany; Charles N. Bernstein, University of Manitoba, Winnipeg, Manitoba, Canada; Arvin Bhana, University of KwaZulu-Natal, Durban, South Africa; Charles H. Bombardier, University of Washington, Seattle; Ryna Imma Buji, Hospital Mesra Bukit Padang, Sabah, Malaysia; Peter Butterworth, The University of Melbourne, Melbourne, Victoria, Australia; Gregory Carter, University of Newcastle, New South Wales, Australia; Marcos H. Chagas, University of São Paulo, Ribeirão Preto, Brazil; Juliana C. N. Chan, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China; Lai Fong Chan, National University of Malaysia, Kuala Lumpur, Malaysia; Dixon Chibanda, University of Zimbabwe, Harare, Zimbabwe; Rushina Cholera, University of North Carolina at Chapel Hill School of Medicine; Kerrie Clover, University of Newcastle, New South Wales, Australia; Aaron Conway, University of Toronto, Toronto, Ontario, Canada; Yeates Conwell, University of Rochester Medical Center, Rochester, New York; Federico M. Daray, University of Buenos Aires, Buenos Aires, Argentina; Janneke M. de Man-van Ginkel, University Medical Center Utrecht, Utrecht, the Netherlands; Jaime Delgadillo, University of Sheffield, Sheffield, UK; Crisanto Diez-Quevedo, Hospital Germans Trias i Pujol, Badalona, Spain; Jesse R. Fann, University of Washington, Seattle; Sally Field, University of Cape Town, Cape Town, South Africa; Jane R. W. Fisher, Monash University, Melbourne, Victoria, Australia; Daniel Fung, Duke-NUS Medical School, Singapore; Emily C. Garman, University of Cape Town, Cape Town, South Africa; Bizu Gelaye, Harvard T.H. Chan School of Public Health, Boston, Massachusetts; Leila Gholizadeh, University of Technology Sydney, Sydney, New South Wales, Australia; Lorna J. Gibson, London School of Hygiene and Tropical Medicine, London, UK; Felicity Goodyear-Smith, University of Auckland, Auckland, New Zealand; Eric P. Green, Duke Global Health Institute, Durham, North Carolina; Catherine G. Greeno, University of Pittsburgh, Pittsburgh, Pennsylvania; Brian J. Hall, University of Macau, Macau Special Administrative Region, China; Petra Hampel, University of Flensburg, Flensburg, Germany; Liisa Hantsoo, The Johns Hopkins University School of Medicine, Baltimore, Maryland; Emily E. Haroz, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland; Martin Harter, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; Ulrich Hegerl, German Depression Foundation, Leipzig, Germany; Leanne Hides, University of Queensland, Brisbane, Queensland, Australia; Stevan E. Hobfoll, STAR-Stress, Anxiety & Resilience Consultants, Chicago, Illinois; Simone Honikman, University of Cape Town, Cape Town, South Africa; Marie Hudson, McGill University, Montréal, Québec, Canada; Thomas Hyphantis, University of Ioannina, Ioannina, Greece; Masatoshi Inagaki, Shimane University, Shimane, Japan; Khalida Ismail, King’s College London Weston Education Centre, London, UK; Hong Jin Jeon, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, South Korea; Nathalie Jetté, Icahn School of Medicine at Mount Sinai, New York, New York; Mohammad E. Khamseh, Iran University of Medical Sciences, Tehran, Iran; Kim M. Kiely, University of New South Wales, Sydney, Australia; Sebastian Kohler, Maastricht University, Maastricht, the Netherlands; Brandon A. Kohrt, The George Washington University, Washington, DC; Yunxin Kwan, Tan Tock Seng Hospital, Singapore; Femke Lamers, Amsterdam UMC, Amsterdam, the Netherlands; María Asunción Lara, National Institute of Psychiatry Ramon de la Fuente Muñiz, Mexico City, Mexico; Holly F. Levin-Aspenson, University of Notre Dame, Notre Dame, Indiana; Valéria T. S. Lino, National School of Public Health Sergio Arouca, Rio de Janeiro, Brazil; Shen-Ing Liu, Mackay Memorial Hospital, Taipei, Taiwan; Manote Lotrakul, Mahidol University, Bangkok, Thailand; Sonia R. Loureiro, University of São Paulo, Ribeirão Preto, Brazil; Bernd Löwe, University Medical Center Hamburg-Eppendorf, Hamburg, Germany; Nagendra P. Luitel, Transcultural Psychosocial Organization Nepal, Kathmandu, Nepal; Crick Lund, University of Cape Town, Cape Town, South Africa; Ruth Ann Marrie, University of Manitoba, Winnipeg, Manitoba, Canada; Laura Marsh, Houston and Michael E. DeBakey Veterans Affairs Medical Center, Houston, Texas; Brian P. Marx, Boston University School of Medicine, Boston, Massachusetts; Anthony McGuire, St. Joseph’s College, Standish, Maine; Sherina Mohd Sidik, Universiti Putra Malaysia, Serdang, Selangor, Malaysia; Tiago N. Munhoz, Federal University of Pelotas, Pelotas, Brazil; Kumiko Muramatsu, Graduate School of Niigata Seiryo University, Niigata, Japan; Juliet E. M. Nakku, Butabika National Referral Teaching Hospital, Kampala, Uganda; Laura Navarrete, National Institute of Psychiatry Ramon de la Fuente Muñiz, Mexico City, Mexico; Flávia L. Osório, University of São Paulo, Ribeirão Preto, Brazil; Vikram Patel, Harvard Medical School, Boston, Massachusetts; Brian W. Pence, The University of North Carolina at Chapel Hill; Philippe Persoons, Katholieke Universiteit Leuven, Leuven, Belgium; Inge Petersen, University of KwaZulu-Natal, South Africa; Angelo Picardi, Italian National Institute of Health, Rome, Italy; Stephanie L. Pugh, NRG Oncology Statistics and Data Management Center, Philadelphia, Pennsylvania; Terence J. Quinn, University of Glasgow, Glasgow, Scotland; Elmars Rancans, Riga Stradins University, Riga, Latvia; Sujit D. Rathod, London School of Hygiene and Tropical Medicine, London, UK; Katrin Reuter, Group Practice for Psychotherapy and Psycho-oncology, Freiburg, Germany; Svenja Roch, University of Flensburg, Flensburg, Germany; Alasdair G. Rooney, University of Edinburgh, Edinburgh, Scotland, UK; Heather J. Rowe, Monash University, Melbourne, Victoria, Australia; Iná S. Santos, Federal University of Pelotas, Pelotas, Brazil; Miranda T. Schram, Maastricht University Medical Center, Maastricht, the Netherlands; Juwita Shaaban, Universiti Sains Malaysia, Kelantan, Malaysia; Eileen H. Shinn, University of Texas M. D. Anderson Cancer Center, Houston; Abbey Sidebottom, Allina Health, Minneapolis, Minnesota; Adam Simning, University of Rochester Medical Center, Rochester, New York; Lena Spangenberg, University of Leipzig, Leipzig, Germany; Lesley Stafford, Royal Women’s Hospital, Parkville, Australia; Sharon C. Sung, Duke-NUS Medical School, Singapore; Keiko Suzuki, Asahikawa University Hospital, Asahikawa, Hokkaido, Japan; Richard H. Swartz, University of Toronto, Toronto, Ontario, Canada; Pei Lin Lynnette Tan, Tan Tock Seng Hospital, Singapore; Martin Taylor-Rowan, University of Glasgow, Glasgow, Scotland; Thach D. Tran, Monash University, Melbourne, Victoria, Australia; Alyna Turner, University of Newcastle, New South Wales, Newcastle, Australia; Christina M. van der Feltz-Cornelis, University of York, York, UK; Thandi van Heyningen, University of Cape Town, Cape Town, South Africa; Henk C. van Weert, Amsterdam University Medical Centers, Amsterdam, the Netherlands; Lynne I. Wagner, Wake Forest School of Medicine, Winston-Salem, North Carolina; Jian Li Wang, University of Ottawa Institute of Mental Health Research, Ottawa, Ontario, Canada; Jennifer White, Monash University, Melbourne, Victoria, Australia; Kirsty Winkley, King’s College London, London, UK; Karen Wynter, Deakin University, Melbourne, Victoria, Australia; Mitsuhiko Yamada, National Center of Neurology and Psychiatry, Tokyo, Japan; Qing Zhi Zeng, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China; and Yuying Zhang, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China.
Data Sharing Statement: Requests to access data should be made to the corresponding author.