Figure. Relative Diagnostic Odds Ratios and 95% Confidence Intervals (CIs) of the 9 Study Characteristics Examined With a Multivariate Regression Analysis [figure not reproduced]
Table 1. Diagnostic Problems, Tests, Number of Studies, and Search Period of the Meta-Analyses [table not reproduced]
Table 2. Results of the Scoring of Study Quality (N = 218) [table not reproduced]
Original Contribution
September 15, 1999

Empirical Evidence of Design-Related Bias in Studies of Diagnostic Tests

Author Affiliations: Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, Amsterdam, the Netherlands.

JAMA. 1999;282(11):1061-1066. doi:10.1001/jama.282.11.1061
Abstract

Context The literature contains a large number of potential biases in the evaluation of diagnostic tests. Strict application of appropriate methodological criteria would invalidate the clinical application of most study results.

Objective To empirically determine the quantitative effect of study design shortcomings on estimates of diagnostic accuracy.

Design and Setting Observational study of the methodological features of 184 original studies evaluating 218 diagnostic tests. Meta-analyses on diagnostic tests were identified through a systematic search of the literature using MEDLINE, EMBASE, and DARE databases and the Cochrane Library (1996-1997). Associations between study characteristics and estimates of diagnostic accuracy were evaluated with a regression model.

Main Outcome Measures Relative diagnostic odds ratio (RDOR), comparing the diagnostic odds ratios of studies of a given test that lacked a particular methodological feature with those of studies of the same test without the corresponding design shortcoming.

Results Fifteen (6.8%) of 218 evaluations met all 8 criteria; 64 (30%) met 6 or more. Studies evaluating tests in a diseased population and a separate control group overestimated the diagnostic performance compared with studies that used a clinical population (RDOR, 3.0; 95% confidence interval [CI], 2.0-4.5). Studies in which different reference tests were used for positive and negative results of the test under study overestimated the diagnostic performance compared with studies using a single reference test for all patients (RDOR, 2.2; 95% CI, 1.5-3.3). Diagnostic performance was also overestimated when the reference test was interpreted with knowledge of the test result (RDOR, 1.3; 95% CI, 1.0-1.9), when no criteria for the test were described (RDOR, 1.7; 95% CI, 1.1-2.5), and when no description of the population under study was provided (RDOR, 1.4; 95% CI, 1.1-1.7).

Conclusion These data provide empirical evidence that diagnostic studies with methodological shortcomings may overestimate the accuracy of a diagnostic test, particularly those including nonrepresentative patients or applying different reference standards.

During recent decades, the number of available diagnostic tests has been rapidly increasing. As for all new medical technologies, new diagnostic tests should be thoroughly evaluated before their introduction into daily practice. The number of test evaluations in the literature is increasing, but the methodological quality of these studies is, on average, poor. A survey of the diagnostic literature (1990-1993) showed that only 18% of the studies satisfied 5 of the 7 methodological standards examined.1 Several guidelines, consisting of lists of criteria for the assessment of study quality, have been written to help physicians with the critical appraisal of the diagnostic literature.2-4 These criteria enable readers to check whether studies meet methodological standards for study design, data collection, and reporting of results.

As few diagnostic studies meet all of the methodological criteria, physicians and reviewers are faced with a difficult choice. Strict application of the methodological criteria would imply that only a small minority of the available data can be used in clinical practice. Alternatively, inclusion of a wider range of imperfect studies would require weighting of the evidence according to the relative importance of the criteria that such studies failed to satisfy. One article has reported such weights for methodological criteria. Unfortunately, these weights were established through a consensus procedure in a general internal medicine division at an academic medical center rather than from empirical data.5 A data-driven approach was previously used by Schulz et al6 to evaluate the influence of study design features on estimates of treatment effects in randomized controlled trials.

The purpose of our study was to assess empirically the impact of shortcomings in design, data collection, and reporting on estimates of diagnostic accuracy. We compared estimates of diagnostic accuracy for a given test reported in studies of lower quality with estimates for the same test from studies without such shortcomings. We hypothesized that estimates of diagnostic accuracy would be exaggerated in studies that failed to meet methodological standards.

Methods
Data Sources and Data Extraction

An electronic search of the literature was performed to identify meta-analyses summarizing the accuracy of diagnostic tests. We focused on meta-analyses because they enabled us to identify a large number of studies on a single diagnostic problem. We concentrated on recent meta-analyses, as we expected them to include both older studies using suboptimal designs and more recent studies applying an up-to-date approach that meets current methodological standards. To be included, a meta-analysis had to be based on a systematic search of the literature, had to include at least 5 studies, and had to report sensitivities and specificities of the included studies. The latter criterion was introduced to ensure that sensitivity and specificity were available for each reviewed study and to allow for easy replication of our work.

The MEDLINE and EMBASE databases were searched (January 1996 to December 1997) using combinations of the words meta-analysis; diagnostic imaging; diagnostic tests, routine; sensitivity and specificity; and review, publication type. In addition, the Cochrane Library and the DARE database of the NHS Centre for Reviews and Dissemination were examined for relevant abstracts.

We retrieved 26 articles that included 5 or more studies. Fifteen articles had to be excluded: 7 were not based on a systematic literature search and 8 reported no list of sensitivities and specificities. A list of excluded articles is available from the authors.

For the 11 remaining articles, all original papers included in the analyses were retrieved (Table 1). The characteristics of these studies were extracted on a standard form by 1 of the authors (J.G.L.). The set of characteristics on the standard form was based on a synthesis of different lists of criteria for study quality.2,4,5 All studies were independently scored a second time by a second reviewer (B.W.M., P.M.M.B.). Disagreements were resolved by consensus; if necessary, the judgment of a third reviewer was decisive.

Assessment of Study Quality

The optimal design for assessing the accuracy of a diagnostic test is considered to be a prospective blind comparison of the test and the reference test in a consecutive series of patients from a relevant clinical population.2,7 A relevant clinical population is a group of patients covering the spectrum of disease that is likely to be encountered in the current or future use of the test. There are several threats to the validity of a diagnostic study. Diagnostic accuracy can be overestimated if the test is evaluated in a group of patients already known to have the disease and a separate group of normal patients, rather than in a relevant clinical population.8 This will be referred to as a case-control study.

Selection bias can be present when not all patients presenting with the relevant condition are included in order of entry (consecutively) into the study and the selection is not random. If it was not clear from the text that a consecutive series or a random subset of patients was included, the corresponding study was scored as nonconsecutive.

Verification bias arises if the decision to perform the reference test is based on the result of the test under examination. In many diagnostic studies with an invasive reference test, most of the positive test results and only a small part of the negative test results are verified. Alternatively, negative test results are verified by a different, often less thorough, standard, for example follow-up. We will refer to these 2 forms of verification bias as partial verification bias and differential reference standard bias, respectively. In cases in which more than 10% of the study group was not subjected to the reference test, the study was scored as applying partial verification; in cases in which different reference tests were used, the study was scored as applying a differential reference standard. All other cases were scored as complete verification.
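
As a compact illustration of this scoring rule, the classification could be expressed as follows. This is a hypothetical sketch in Python, not the authors' actual extraction form; the function and variable names are ours.

def verification_category(n_tested, n_verified, used_second_reference):
    # n_tested: patients who received the test under study
    # n_verified: patients verified with the primary reference test
    # used_second_reference: True if a different standard (eg, follow-up)
    #     was used to verify some results, typically the test-negative ones
    if used_second_reference:
        return "differential reference standard"
    if n_verified < 0.9 * n_tested:  # more than 10% not verified
        return "partial verification"
    return "complete verification"

# Example: 200 patients tested, 170 verified with the reference test
print(verification_category(200, 170, False))  # prints "partial verification"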

Interpreting the reference test with knowledge of the results of the test under study can lead to an overestimation of a test's accuracy, especially if the reference test is open to subjective interpretation. If the sequence of testing is reversed, it is important that the results of the test under study are interpreted without knowledge of the results of the reference test. If it was not clear from the text that the interpretation of both tests was done while investigators were blinded, the study was scored as not blinded.

In addition to characteristics of the study design, we also looked at methods of data collection and reporting. The data collection was categorized as either prospective or retrospective. In case of doubt, the method of data collection was scored as unknown. The reference test, the test under study, and the study population should be described in sufficient detail to allow for replication, validation, and generalization of the study.2 Descriptions of the tests were scored as sufficient if clear definitions of positive and negative test results were mentioned in the text. Description of the study population was scored as sufficient if 2 of the following characteristics were described: age of participants, female-to-male ratio, and distribution of symptoms.

Statistical Analysis
The results of an individual study on diagnostic accuracy can be summarized in a 2×2 table. From this table, frequently used measures such as sensitivity, specificity, and predictive values can easily be calculated.9,10 Another measure of the diagnostic accuracy of a test is the diagnostic odds ratio (DOR), the odds of a positive test result in diseased persons relative to the odds of a positive result in nondiseased persons.11,12 The DOR is a single statistic of the results in a 2×2 table, incorporating sensitivity as well as specificity. Expressed in terms of sensitivity and specificity, the formula is: DOR = [sensitivity/(1 − sensitivity)] / [(1 − specificity)/specificity].
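
As a minimal numerical illustration (with hypothetical counts chosen by us), the DOR can be computed either directly from the 2×2 table or from sensitivity and specificity; both routes give the same value:

# Hypothetical 2×2 table: true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 90, 20, 10, 80

sensitivity = tp / (tp + fn)                 # 0.90
specificity = tn / (tn + fp)                 # 0.80
dor_from_counts = (tp * tn) / (fp * fn)      # (90*80)/(20*10) = 36
dor_from_rates = (sensitivity / (1 - sensitivity)) / ((1 - specificity) / specificity)
print(dor_from_counts, round(dor_from_rates, 2))   # 36.0 36.0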

The effect of study characteristics was examined with a regression model adapted from the summary receiver operating characteristic curve model developed for meta-analyses of diagnostic tests.13-15 The basic model takes the logarithm of the diagnostic odds ratio computed for each single study as the dependent variable and contains 2 explanatory parameters for each meta-analysis, 1 for the intercept and 1 for the slope of the curve. The intercept can be interpreted as the common DOR of the corresponding test, and the slope parameter expresses variation of the DOR across individual studies due to threshold differences.

We added covariates to this model to examine whether, on average, studies that failed to meet the methodological criteria yielded different DORs. The resulting parameter estimates of the covariates can be interpreted after antilogarithm transformation as relative DORs (RDORs). They indicate the diagnostic performance of a test in studies failing to satisfy the methodological criterion, relative to its performance in studies with the corresponding feature. If the RDOR is larger than 1, studies not satisfying the criterion yield larger estimates of the DOR than studies with this corresponding feature.

In summary, the dependent variable of the model was the logarithm of the DOR. Explanatory variables were 2 parameters for each meta-analysis (the common DOR and the threshold parameter) and 9 covariates to examine the effect of the different study characteristics, 1 for each feature. All study characteristics were evaluated simultaneously in a multivariate model.
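
In symbols (our notation, following the summary receiver operating characteristic model of references 13-15), the model for study i in meta-analysis j can be written as

ln DOR_ij = a_j + b_j × S_ij + γ_1 Z_ij1 + ... + γ_9 Z_ij9,   with RDOR_k = exp(γ_k),

where ln DOR_ij = logit(sensitivity) − logit(1 − specificity), S_ij = logit(sensitivity) + logit(1 − specificity) is the threshold term, and Z_ijk indicates whether study i failed to satisfy criterion k.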

A weighted linear regression analysis was used, with weights proportional to the reciprocal of the variance of the log DOR. This weighted linear regression assumes fixed effects. In case of zero entries, the DOR is not defined. This problem was solved by adding 0.5 to all cells of the 2×2 table for all studies in a meta-analysis.14-16 The model was fitted using maximum likelihood estimation, and programmed using statistical software (S-plus 4.5, Mathsoft Inc, Cambridge, Mass).
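
The following is a minimal sketch of such a weighted fit on hypothetical data. It uses Python with statsmodels rather than the S-Plus code of the original analysis; the counts, the single quality covariate, and the column layout are illustrative assumptions, not the study's data. The variance of the log DOR is taken as the usual Woolf approximation (the sum of the reciprocals of the 4 cell counts).

import numpy as np
import statsmodels.api as sm

# Hypothetical studies: (TP, FP, FN, TN, meta-analysis id, case-control flag)
studies = [
    (45,  5,  5, 45, 0, 1),
    (40, 10, 10, 40, 0, 0),
    (38, 12, 12, 38, 0, 0),
    (80, 15, 20, 85, 1, 1),
    (70, 25, 30, 75, 1, 0),
    (65, 30, 35, 70, 1, 0),
]

rows, y, w = [], [], []
for tp, fp, fn, tn, meta, cc in studies:
    tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))       # add 0.5 to every cell
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    d = np.log(sens / (1 - sens)) - np.log((1 - spec) / spec)  # log DOR (dependent variable)
    s = np.log(sens / (1 - sens)) + np.log((1 - spec) / spec)  # threshold term
    # Design matrix: intercept and slope per meta-analysis, then the quality covariate
    rows.append([meta == 0, meta == 1, s * (meta == 0), s * (meta == 1), cc])
    y.append(d)
    w.append(1 / (1 / tp + 1 / fp + 1 / fn + 1 / tn))          # reciprocal Woolf variance of log DOR

fit = sm.WLS(np.asarray(y), np.asarray(rows, dtype=float), weights=np.asarray(w)).fit()
print("RDOR for case-control design:", np.exp(fit.params[-1]))

The antilogarithm of the covariate coefficient is the RDOR; a value above 1 indicates that, on average, studies with the corresponding shortcoming yield larger DORs.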

Results

The subjects of the 11 included articles and the reviewed tests are summarized in Table 1. Two articles17,18 reviewed 3 tests, 3 articles19-21 reviewed 2 tests, and 6 articles22-27 reviewed 1 test. This resulted in a total of 18 separate meta-analyses for this analysis. These 18 meta-analyses summarized the results of 193 published studies. Nine studies could not be used in the final analysis because only abstracts were available for 4 and 2×2 table calculations were not possible for 5. Of the 184 studies remaining, some evaluated multiple diagnostic tests. A total of 218 diagnostic test evaluations were available for analysis.

The overall results of the quality assessment of the included studies are listed in Table 2. Most studies used a clinical cohort and described the cutoff value of the test under evaluation (98% and 89%, respectively). Only 15 (6.8%) of the 218 studies satisfied all 8 criteria used. Sixty-four (30%) of the 218 studies satisfied 6 or more criteria.

The results from the regression analysis are presented in the Figure. Studies using a case-control design tended to overestimate the DOR 3-fold compared with studies with a clinical cohort (RDOR, 3.0). Studies using different reference tests for positive and negative test results had an RDOR of 2.2, showing approximately a 2-fold overestimation of the DOR compared with studies that used 1 reference test. Studies verifying only part of the population had on average the same DOR as studies that subjected all patients to the reference test (RDOR, 1.0). There were no studies that verified only part of the population and also used different reference standards. Interpretation of the reference test with knowledge of the outcomes of the test under study resulted in an RDOR of 1.3, an overestimation of the DOR by approximately 30% compared with studies with adequate blinding. Selective inclusion of patients into the study did not change the estimate of diagnostic accuracy significantly.

Retrospective data collection was not associated with an overestimation or underestimation of diagnostic accuracy in comparison with prospective or unknown data collection. In a univariate analysis (data not shown), the RDOR of studies with unknown data collection was nearest to that of prospective studies; we therefore collapsed these 2 categories. The DORs in articles without a sufficient description of the test under study or of the study population were, respectively, about 70% and 40% higher than estimates in articles reporting sufficient details. Studies reporting no details of the cutoff of the reference test had DORs that were approximately 30% smaller than those of studies reporting such details.

Comment

This study describes the quantitative effects of characteristics of study design on estimates of diagnostic accuracy. By collecting data from studies in published diagnostic meta-analyses, we were able to examine the effect of study characteristics on diagnostic accuracy. Our analysis shows that studies of lower methodological quality, particularly those including nonrepresentative patients or applying different reference standards, tend to overestimate the diagnostic performance of a test.

The largest effect on the estimation of diagnostic accuracy was generated by studies using separate groups of cases and controls, a design also associated with spectrum bias.28 Often, mild cases that are difficult to diagnose are omitted from case-control studies, causing an overestimation of sensitivity as well as specificity. Another large effect was seen in studies that used different reference tests for the verification of positive and negative test results. The effect of this differential verification depends on the quality of the different reference tests used. Using a "gold" reference test for the positive test results and a poor reference test for the negative results can lead to an overestimation of both sensitivity and specificity of a test.29 For example, some studies evaluating the diagnostic performance of C-reactive protein (CRP) for the diagnosis of acute appendicitis used surgery and pathology as a reference test for patients with a high CRP. Patients with a low CRP were not operated on, and clinical follow-up determined whether they were classified as having acute appendicitis. As low-grade infections with low CRPs can resolve spontaneously, this verification strategy fails to identify all false-negative test results, and the diagnostic performance of CRP will therefore be overestimated. If the poor reference standard fails to identify true-negative test results, the use of different reference standards can lead to an underestimation of the diagnostic performance of a test.

The terms verification bias or workup bias are sometimes used when not all patients are subjected to the reference test.29,30 We prefer the term partial verification bias to differentiate this situation from the situation in which different reference standards are used for verification. In theory, verifying more positive test results than negative test results will lead to an overestimation of sensitivity and an underestimation of specificity, resulting in either an increase or a decrease of the DOR.29 In the analysis reported here, partial verification resulted in DORs comparable with those from studies with complete verification. The absence of an association with the estimation of diagnostic accuracy could be caused by the definition we used. Partial verification will only lead to bias if systematically more abnormal than normal (or more normal than abnormal) test results are subjected to the reference standard. In many studies it was not clear why some patients were not subjected to the reference standard and whether this was related to the test under study. Therefore, we scored studies as applying partial verification when more than 10% of the patients were not verified, including studies in which the lack of verification appeared to be random. This could weaken the possible effect of partial verification in our analysis. Other studies have also shown no effect of partial verification on the overall diagnostic accuracy of a test; in these studies, only a shift of threshold values along the receiver operating characteristic curve was observed.31,32

The average effect of inappropriate blinding was small. The studies included in this analysis used many different reference standards, ranging from diagnostic imaging to histology. When the reference standard is objective, no effect of blinding is to be expected. In clinical situations with a subjective reference standard, however, the effect of not blinding could be larger.

Given the distinction made between case-control and cohort studies, we did not observe an influence of nonconsecutive sampling of patients. Retrospective data collection did not generate results different from those of studies with prospective data collection (when corrected for all other methodological flaws).

When looking at the criteria for the methods of reporting, we found that the absence of a sufficient description of the test or of the study population was associated with an overestimation of diagnostic accuracy. As these criteria are not directly related to the study design, it is unclear how they lead to an overestimation of diagnostic accuracy. Somehow they seem to be predictors of methodological flaws in studies. In contrast with these findings, studies with an insufficient description of the reference standard generated less optimistic results compared with studies with an adequate description. An explanation could be that these studies had a large variation in the interpretation of the reference standard. Large interobserver variation is associated with poor diagnostic accuracy, leading to an underclassification of diseased persons.33 However, the same argument could be used for lack of description of the test under study. We often had difficulty deciding whether the reference test was described in enough detail. How much detail is needed if the reference standard is histology? The definition was largely dependent on the clinical situation under study. The scores of 20 studies changed after the consensus reading in comparison with the first reading.

Four meta-analyses have examined the quantitative effect of study characteristics on diagnostic performance. These analyses were published before 1996 and, hence, were not included in our sample. All were limited to a single test.34-37 Some of these studies found that partial verification and absence of blinding affected the estimates of diagnostic accuracy,33,34,36 while others found no effect of these characteristics.35 In agreement with the first 3 meta-analyses, we found an overestimation of diagnostic accuracy in studies without appropriate blinding; in agreement with the latter, we found no effect of partial verification. One of these meta-analyses also looked at the reporting of the test and the reference test in combination with other criteria and also found that insufficient description was associated with overestimation of the diagnostic accuracy.37

The decision to limit our analysis to data from recent meta-analyses could have affected our results. Extending our sample to older meta-analyses would most likely change the relative frequency of the study characteristics, but not necessarily the relative size of their effects on diagnostic accuracy.

Publication bias also has to be taken into account since only diagnostic studies published in scientific journals were included in the analysis.38 One can speculate that studies have a higher likelihood of being published when they are either of good quality or when they show encouraging results. Such a selective publication policy could lead to an inflation of the associations we found. It is difficult to examine the effect of publication bias since there is no registration of unpublished diagnostic studies. For future research in this field and for reviewers of diagnostic tests, such a central registration of diagnostic research protocols would be useful.

When reading the results of a single diagnostic study, it is difficult to weigh the methodological flaws against the available evidence. How large is the possible overestimation and will it have clinical consequences? In a study with different reference standards, without blinding and lacking a description of the test, the DOR would on average be overestimated by 5-fold based on the results of our analysis. This is equal to reporting a sensitivity and specificity of about 84% when in fact both should be 70%. Differences will be smaller if sensitivity and specificity are higher and if only a few minor criteria are not fulfilled.
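
This back-calculation can be verified with a short computation (our own check of the arithmetic, assuming sensitivity equals specificity in both scenarios):

import math

def dor(p):
    # DOR when sensitivity = specificity = p
    return (p / (1 - p)) ** 2

baseline = dor(0.70)        # about 5.4
inflated = 5 * baseline     # a 5-fold overestimation, about 27
odds = math.sqrt(inflated)  # p / (1 - p) for the apparent sensitivity and specificity
print(odds / (1 + odds))    # about 0.84, ie, roughly 84%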

Our results stress the importance of adequate methodology and the need for complete and reliable reporting of research. Assessment of quality is only feasible in the light of complete clarity on the methodology. Authors should therefore describe explicitly their methods of patient selection, methods of disease verification, and criteria for interpretation of the test and the reference test.

This study shows that shortcomings in design, data collection, and reporting affect estimates of diagnostic accuracy. Investigators should be aware of this when designing their studies and readers should be aware of this when interpreting the results. Our results can be of help in determining the merits of the available evidence when appraising literature. Greater editorial vigilance could help make researchers aware of current methodological standards and thereby decrease the potential for bias in future diagnostic studies.

References
1. Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research: getting better but still not good. JAMA. 1995;274:645-651.
2. Jaeschke R, Guyatt G, Sackett DL. Users' guides to the medical literature, III: how to use an article about a diagnostic test, A: are the results of the study valid? JAMA. 1994;271:389-391.
3. Jaeschke R, Guyatt GH, Sackett DL. Users' guides to the medical literature, III: how to use an article about a diagnostic test, B: what are the results and will they help me in caring for my patients? JAMA. 1994;271:703-707.
4. Greenhalgh T. How to read a paper: papers that report diagnostic or screening tests. BMJ. 1997;315:540-543.
5. Mulrow CD, Linn WD, Gaul MK, Pugh JA. Assessing quality of a diagnostic test evaluation. J Gen Intern Med. 1989;4:288-295.
6. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408-412.
7. Feinstein AR. Diagnostic and spectral markers. In: Clinical Epidemiology: The Architecture of Clinical Research. Philadelphia, Pa: WB Saunders Co; 1985:597-631.
8. van der Schouw YT, Verbeek AL, Ruijs SH. Guidelines for the assessment of new diagnostic tests. Invest Radiol. 1995;30:334-340.
9. Griner PF, Mayewski RJ, Mushlin AI, Greenland P. Selection and interpretation of diagnostic tests and procedures: principles and applications. Ann Intern Med. 1981;94:557-592.
10. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. The selection of diagnostic tests. In: Sackett D, ed. Clinical Epidemiology. Boston, Mass: Little Brown & Co; 1991:47-57.
11. Kraemer HC. Evaluating Medical Tests: Objective and Quantitative Guidelines. Newbury Park, Calif: SAGE Publications Inc; 1992:103-113.
12. Stoffers HE, Kester AD, Kaiser V, Rinkens PE, Kitslaar PJ, Knottnerus JA. The diagnostic value of the measurement of the ankle-brachial systolic pressure index in primary health care. J Clin Epidemiol. 1996;49:1401-1405.
13. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med. 1993;12:1293-1316.
14. Littenberg B, Moses LE. Estimating diagnostic accuracy from multiple conflicting reports: a new meta-analytic method. Med Decis Making. 1993;13:313-321.
15. Irwig L, Macaskill P, Glasziou P, Fahey M. Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol. 1995;48:119-130.
16. Haldane JBS. The estimation and significance of the logarithm of a ratio of frequencies. Ann Hum Genet. 1955;20:309-314.
17. Scheidler J, Hricak H, Yu KK, Subak L, Segal MR. Radiological evaluation of lymph node metastases in patients with cervical cancer: a meta-analysis. JAMA. 1997;278:1096-1101.
18. Siegman-Igra Y, Anglim AM, Shapiro DE, Adal KA, Strain BA, Farr BM. Diagnosis of vascular catheter-related bloodstream infection: a meta-analysis. J Clin Microbiol. 1997;35:928-936.
19. Smith ER, Petersen J, Okorodudu AO, Bissell MG. Does the addition of unconjugated estriol in maternal serum screening improve the detection of trisomy 21? a meta-analysis. Clin Lab Manage Rev. 1996;10:176-181.
20. Becker DM, Philbrick JT, Bachhuber TL, Humphries JE. D-dimer testing and acute venous thromboembolism: a shortcut to accurate diagnosis? Arch Intern Med. 1996;156:939-946.
21. De Vries SO, Hunink MGM, Polak JF. Summary receiver operating characteristic curves as a technique for meta-analysis of the diagnostic performance of duplex ultrasonography in peripheral arterial disease. Acad Radiol. 1996;3:361-369.
22. Bonis PA, Ioannidis JP, Cappelleri JC, Kaplan MM, Lau J. Correlation of biochemical response to interferon alfa with histological improvement in hepatitis C: a meta-analysis of diagnostic test characteristics. Hepatology. 1997;26:1035-1044.
23. Hallan S, Asberg A. The accuracy of C-reactive protein in diagnosing acute appendicitis: a meta-analysis. Scand J Clin Lab Invest. 1997;57:373-380.
24. Huicho L, Campos M, Rivera J, Guerrant RL. Fecal screening tests in the approach to acute infectious diarrhea: a scientific overview. Pediatr Infect Dis J. 1996;15:486-494.
25. Reed WW, Byrd GS, Gates Jr RH, Howard RS, Weaver MJ. Sputum Gram's stain in community-acquired pneumococcal pneumonia: a meta-analysis. West J Med. 1996;165:197-204.
26. Mol BWJ, Dijkman AB, Wertheim P, Lijmer JG, Van der Veen F, Bossuyt PMM. The accuracy of serum chlamydial antibodies in the diagnosis of tubal pathology: a meta-analysis. Fertil Steril. 1997;67:1031-1037.
27. Mirvis SE, Shanmuganathan K, Miller BH, White CS, Turney SZ. Traumatic aortic injury: diagnosis with contrast-enhanced thoracic CT—five-year experience at a major trauma center. Radiology. 1996;200:413-422.
28. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med. 1978;299:926-930.
29. Panzer RJ, Suchman AL, Griner PF. Workup bias in prediction research. Med Decis Making. 1987;7:115-119.
30. Begg CB, Greenes RA. Assessment of diagnostic tests when disease verification is subject to selection bias. Biometrics. 1983;39:207-215.
31. Hunink MGM, Richardson D, Doubiley PM, Begg CB. Testing for fetal pulmonary maturity: an ROC analysis involving covariates, verification bias and combination testing. Med Decis Making. 1990;10:201-211.
32. Lijmer JG, Hunink MGM, van den Dungen JJ, Loonstra J, Smit AJ. ROC analysis of noninvasive tests for peripheral arterial disease. Ultrasound Med Biol. 1996;22:391-398.
33. Quinn MF. Relation of observer agreement to accuracy according to a two-receiver signal detection model of diagnosis. Med Decis Making. 1989;9:196-206.
34. Fahey MT, Irwig L, Macaskill P. Meta-analysis of Pap test accuracy. Am J Epidemiol. 1995;141:680-689.
35. Detrano R, Janosi A, Lyons KP, Marcondes G, Abbassi N, Froelicher VF. Factors affecting sensitivity and specificity of a diagnostic test: the exercise thallium scintigram. Am J Med. 1988;84:699-710.
36. Detrano R, Gianrossi R, Froelicher V. The diagnostic accuracy of the exercise electrocardiogram: a meta-analysis of 22 years of research. Prog Cardiovasc Dis. 1989;32:173-206.
37. Wells PS, Lensing AW, Davidson BL, Prins MH, Hirsh J. Accuracy of ultrasound for the diagnosis of deep venous thrombosis in asymptomatic patients after orthopedic surgery: a meta-analysis. Ann Intern Med. 1995;122:47-53.
38. Begg CB, Berlin JA. Publication bias and dissemination of clinical research. J Natl Cancer Inst. 1989;81:107-115.