Balk EM, Bonis PAL, Moskowitz H, Schmid CH, Ioannidis JPA, Wang C, Lau J. Correlation of Quality Measures With Estimates of Treatment Effect in Meta-analyses of Randomized Controlled Trials. JAMA. 2002;287(22):2973-2982. doi:10.1001/jama.287.22.2973
Author Affiliations: Evidence-based Practice Center, Division of Clinical Care Research, Tufts University School of Medicine, New England Medical Center, Boston, Mass (Drs Balk, Bonis, Moskowitz, Schmid, Wang, and Lau); and the Biomedical Research Institute, Foundation for Research and Technology Hellas, Clinical Trials and Evidence-Based Medicine Unit, Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece (Dr Ioannidis). Dr Moskowitz is now with the Division of General Pediatrics, Mount Sinai Hospital, Mount Sinai School of Medicine, New York, NY.
Context Specific features of trial quality may be associated with exaggeration
or shrinking of the observed treatment effect in randomized studies. Therefore,
assessment of trial quality is often used in meta-analysis. However, the degree
to which specific quality measures are associated with treatment effects has
not been well established across a broad range of clinical areas.
Objective To determine if quality measures are associated with treatment effect
size in randomized controlled trials (RCTs).
Design Quality measures from published quality assessment scales were evaluated
in RCTs included in meta-analyses from 4 medical areas (cardiovascular disease,
infectious disease, pediatrics, and surgery). Included meta-analyses incorporated
at least 6 RCTs, examined dichotomous outcomes, and demonstrated significant
between-study heterogeneity in the odds ratio (OR) scale.
Main Outcome Measures Relative ORs comparing overall treatment effect (summary OR) of high-
vs low-quality studies, as determined by each quality measure, with relative
ORs less than 1 indicating larger treatment effect in high-quality studies.
Results Twenty-four quality measures were analyzed for 276 RCTs from 26 meta-analyses.
Relative ORs of high- vs low-quality studies for these quality measures ranged
from 0.83 to 1.26; none was statistically significantly associated with treatment
effect. The proportion of studies fulfilling specific quality measures varied
widely in the 4 medical areas. In analyses limited to specific medical areas,
placebo control, multicenter studies, study country, caregiver blinding, and
statistical methods were significantly associated with treatment effect on
7 occasions. These relative ORs ranged from 0.40 to 1.74. However, the directions
of these associations were not consistent.
Conclusions Individual quality measures are not reliably associated with the strength
of treatment effect across studies and medical areas. Although use of specific
quality measures may be appropriate in specific well-defined areas in which
there is pertinent evidence, findings of associations with treatment effect
cannot be generalized to all clinical areas or meta-analyses.
Several studies have suggested that specific measures of trial quality,
such as concealment of random allocation, blinding of patients and outcome
assessors, and handling of dropouts, may significantly influence observed
treatment effects in single studies,1,2
specific clinical areas,3,4 and
meta-analyses from a mixture of clinical areas.5,6
Proposed quality measures have been incorporated into a growing number of
scales that attempt to quantify overall trial quality.7
These findings have led to recommendations that investigators conducting meta-analyses
should take into account the quality measures and scales when drawing conclusions.8-12
This approach can have a major impact on inferences drawn. In one study,
Jüni et al3 found a wide range of estimates
for the effectiveness of low-molecular-weight heparin for treatment of deep
vein thrombosis by using different quality scales to divide "high-quality"
from "low-quality" studies in a single meta-analysis. The summary odds ratio
(OR), or the OR calculated by quantitatively combining individual ORs from
similar studies, varied depending on which studies were determined to be of
high quality and were thus included in meta-analysis. In a controversial recent
meta-analysis, Gøtzsche and Olsen13
found that screening mammography did not reduce breast cancer deaths in 2
studies with "adequate randomization," while a highly significant effect was
found among the 5 studies in which randomization was "not adequate." However,
the analysis was criticized for its definition of inadequate randomization
and failure to consider other explanations, including other quality measures.14,15
Furthermore, the quality measures found to be associated with treatment
effect vary among investigators. Schulz et al4
reported that poorly concealed allocation or lack of double blinding resulted
in a significant overestimation of treatment effect by 41% and 17%, respectively,
in 250 studies of perinatal medicine. Moher et al5
reported a similar bias for allocation concealment but no significant bias
for double blinding. Others found generally larger bias for double blinding
but no significant bias for allocation concealment.2,3,6
The uncertain association of different quality measures with treatment
effect and the absence of a gold-standard quality assessment instrument has
resulted in a proliferation of quality scales used in meta-analyses. Jüni
et al3 identified 37 meta-analyses that used
26 different instruments to assess trial quality. The number of specific quality
measures in these scales ranged from 3 to 34, and the weights assigned to
3 common measures (randomization, blinding, and dropouts) ranged from 0% to
Adding to the uncertainty, quality is not consistently defined across
specialties, nor have specific quality measures been shown to correlate with
treatment effects in different clinical areas. A more detailed understanding
of the relationship between specific features of study quality and estimates
of treatment effect is needed. This study was designed to measure the degree
to which study quality, as determined by a wide range of previously described
measures of study design and conduct, is associated with combined estimates
of treatment effect from a variety of meta-analyses that included randomized
controlled trials (RCTs) from several medical and surgical areas.
We selected meta-analyses from 4 medical areas (cardiovascular disease,
infectious disease, pediatrics, and surgery), and extracted data on specific
quality measures and outcomes from the RCTs that had been included in the
meta-analyses. For each quality measure, we then calculated a relative OR
for treatment effect, defined as the ratio of the strength of the treatment
effect in studies in which the quality measure was present to the strength
of the effect in studies in which it was absent.
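As an illustration only, the relative OR described above can be sketched in a few lines of Python. The function names and the two summary ORs are hypothetical and are not taken from the study; in the study itself the summary ORs came from a hierarchical model rather than from a direct ratio.

```python
import math

def odds_ratio(events_t, n_t, events_c, n_c):
    """Odds ratio of the outcome event, treatment vs control, from one 2x2 table."""
    odds_t = events_t / (n_t - events_t)
    odds_c = events_c / (n_c - events_c)
    return odds_t / odds_c

def relative_or(or_high, or_low):
    """Summary OR of high-quality studies divided by that of low-quality studies.

    Computed on the log scale, where it is a simple difference of log ORs.
    """
    return math.exp(math.log(or_high) - math.log(or_low))

# Hypothetical summary ORs (assumed values, not data from the article).
or_high = 0.80  # studies in which the quality measure was present
or_low = 0.64   # studies in which the quality measure was absent
ror = relative_or(or_high, or_low)
print(round(ror, 2))
```

Because the outcome events are failures such as death, an OR below 1 favors treatment. Under that convention, a relative OR above 1 in this sketch means the high-quality studies showed the smaller apparent benefit.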
We identified specific quality measures previously demonstrated or hypothesized
to be associated with estimates of treatment effect by reviewing published
studies of quality measures and quality assessment scales.3-5,7,16-33
These studies were compiled from a MEDLINE search for quality and randomized controlled trials and from
reference lists of methodological articles. We used the definitions for each
quality measure as described by authors. For quality measures not clearly
described, we reached consensus on definitions. We aimed to establish definitions
of study quality that could be applied most consistently across a variety
of study types. Thus, we formalized a process that all researchers grading
the quality of studies would have to perform. Definitions of quality measures
are listed in Table 1.
Analyses were performed only on quality measures for which we could
reach consensus on the definition and could dichotomize. Studies that did
not report on a specific quality measure were assumed to be of low quality
for that measure.
We selected meta-analyses in 4 areas (cardiovascular disease, infectious
disease, pediatrics, and surgery) because they represent a variety of medical
areas. We selected cardiovascular meta-analyses from among those used in a
previous analysis by our group.34 Meta-analyses
for other areas were found by searching the MEDLINE database (1966-2000) and
the Cochrane Database of Systematic Reviews (2000, issue 4).
Included meta-analyses incorporated at least 6 RCTs, examined dichotomous
outcomes, and demonstrated significant between-study heterogeneity in the
OR scale (P<.10 for the χ² statistic
or a nonzero between-study variance, τ², by the DerSimonian
and Laird random-effects model).35,36
We required statistical heterogeneity of treatment effect across trials within
each meta-analysis because meta-analyses with homogeneous treatment effects
across trials are unlikely to find that estimates of treatment effects are
associated with quality measures (or other factors). We excluded abstracts,
letters, unavailable articles, and those for which detailed outcomes data
were not provided. Meta-analyses were selected without a priori knowledge
of the quality of the studies used. All meta-analyses that met inclusion criteria
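The heterogeneity criterion described above (P<.10 for the χ² statistic or a nonzero DerSimonian and Laird between-study variance) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code. The function names, the 0.5 continuity correction for zero cells, and the example trial counts are assumptions.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= h / k
            total += term
        return math.exp(-h) * total
    total = math.erfc(math.sqrt(h))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-h) * h ** (k - 0.5) / math.gamma(k + 0.5)
    return total

def log_or_and_variance(events_t, n_t, events_c, n_c):
    """Log odds ratio and its approximate variance for one trial."""
    a, b = events_t, n_t - events_t
    c, d = events_c, n_c - events_c
    if 0 in (a, b, c, d):  # assumed continuity correction for zero cells
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

def heterogeneity(trials):
    """Return (Q, df, P value, DerSimonian-Laird tau^2); needs at least 2 trials."""
    effects = [log_or_and_variance(*t) for t in trials]
    y = [e for e, _ in effects]
    w = [1.0 / v for _, v in effects]  # inverse-variance weights
    sw = sum(w)
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    df = len(trials) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)
    return q, df, chi2_sf(q, df), tau2

def meets_inclusion(trials, min_trials=6, alpha=0.10):
    """At least 6 trials and either P < .10 or a nonzero tau^2."""
    if len(trials) < min_trials:
        return False
    _, _, p, tau2 = heterogeneity(trials)
    return p < alpha or tau2 > 0

# Hypothetical trials: (events_t, n_t, events_c, n_c)
homogeneous = [(10, 100, 20, 100)] * 6
heterogeneous = [(5, 100, 30, 100)] * 3 + [(30, 100, 5, 100)] * 3
print(meets_inclusion(homogeneous), meets_inclusion(heterogeneous))
```

The DerSimonian and Laird estimate is nonzero exactly when Q exceeds its degrees of freedom, so the second condition is the more permissive of the two.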
For cardiovascular studies, the outcome used was mortality. For studies
in the other clinical areas, the outcome used varied across meta-analyses.
Within meta-analyses, only outcomes with heterogeneous treatment effects were
considered. If multiple outcomes were available for analysis, those examined
by the largest number of studies or that were most clearly defined were used.
Failure of treatment or control (eg, death) was considered a positive outcome
in all studies.
We developed the quality assessment form and extracted data in a 4-stage
process. First, 4 clinicians (E.M.B., P.A.L.B., H.M., and C.W.) trained in
clinical epidemiology and study design coded data from the same pilot set
of 8 studies and discussed discrepancies. Second, the quality assessment form
was revised and was again tested by having each investigator extract data
from a different pilot set of 8 studies. Further refinements and clarifications
were performed in the data extraction definitions of specific quality measures.
Third, the 4 investigators independently extracted data from the remaining
English-language RCTs. Data from each trial were extracted by 2 investigators.
The studies were divided so that each investigator would be paired with each
of the 3 other data extractors for approximately one third of the studies.
This helped ensure uniform application of definitions and scaling of the quality
items. When necessary, data were extracted from referenced articles that described
a study's methods. Fourth, discrepancies were reviewed to achieve consensus
between each pair of data extractors. A third investigator arbitrated disagreements.
Data from 13 Spanish-, German-, French-, and Italian-language articles were
extracted by single investigators in consultation with other investigators.
Studies in other languages were excluded.
Quality measures were dichotomized to capture high quality vs low quality.
We estimated the effect of quality measures by calculating relative ORs of
treatment effect for each measure. The relative OR compares the OR of high-quality
studies to that of low-quality studies for each quality measure. Relative
ORs greater than 1 indicate that high-quality studies had larger ORs than
low-quality studies.
To estimate the relative OR, we used a Bayesian hierarchical model with
random effects.37 This multilevel structure
accounted for the nesting of trials within meta-analyses as well as the variability
across meta-analyses. For each trial, we assumed that the outcomes followed
binomial distributions independently in the treatment and control groups.
The log odds of the probability of an outcome in each control group was assumed
to be normally distributed, centered around an average log odds for the meta-analysis.
The log OR of an outcome, defined as the difference in log odds between the
treatment and control groups, was assumed to be normally distributed, with
mean $\alpha_j + \beta_j x_{ij}$, where
$x_{ij}$ is the quality measure in the $i$th
study of the $j$th meta-analysis. For a dichotomous
quality measure, $\beta_j$ represented the relative log OR between
the 2 levels of the measure. The exponential of $\beta_j$ is the
relative OR. Both the mean log odds in the control group and the regression
slope and intercept for the log OR differed across meta-analyses.
These regression slopes and intercepts were assumed to be random effects
drawn from a population of such slopes and intercepts. We used 2 different
population models. One model assumed a single common mean intercept and slope
for the population, around which the $\alpha_j$ and $\beta_j$
varied according to a normal distribution with common
variances $\tau_\alpha^2$ and $\tau_\beta^2$, respectively.
The other model assumed different $\alpha_j$ and $\beta_j$
by medical area so that there were 4 separate population intercepts and slopes
corresponding to the cardiovascular disease, infectious disease, pediatric,
and surgical areas. Noninformative prior distributions were chosen for all
parameters to simulate the random-effects model.
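To make the multilevel structure concrete, the sketch below simulates data forward from a model of this shape: trial-level binomial outcomes, a normally distributed control-group log odds within each meta-analysis, and a log OR with mean equal to the intercept plus the slope times the quality indicator. It is a generative illustration only and does not reproduce the Bayesian fitting. Every numeric default, the residual standard deviation `sigma`, and all names are assumptions chosen for the example.

```python
import math
import random

def expit(z):
    """Inverse logit: convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def simulate(n_meta=5, trials_per_meta=8, n_per_arm=200,
             mean_alpha=-0.3, mean_beta=0.0,
             tau_alpha=0.2, tau_beta=0.1, sigma=0.15, seed=0):
    """Simulate trials nested within meta-analyses (single-population model)."""
    rng = random.Random(seed)
    rows, betas = [], []
    for j in range(n_meta):
        # Random intercept and slope for meta-analysis j.
        alpha_j = rng.gauss(mean_alpha, tau_alpha)
        beta_j = rng.gauss(mean_beta, tau_beta)
        betas.append(beta_j)
        # Average control-group log odds for this meta-analysis (assumed values).
        mu_j = rng.gauss(-1.5, 0.5)
        for i in range(trials_per_meta):
            x_ij = rng.randint(0, 1)  # 1 = quality measure present
            control_logit = rng.gauss(mu_j, 0.3)
            log_or = rng.gauss(alpha_j + beta_j * x_ij, sigma)
            p_c = expit(control_logit)
            p_t = expit(control_logit + log_or)
            # Independent binomial outcomes in each arm.
            events_c = sum(rng.random() < p_c for _ in range(n_per_arm))
            events_t = sum(rng.random() < p_t for _ in range(n_per_arm))
            rows.append({"meta": j, "quality": x_ij,
                         "events_t": events_t, "n_t": n_per_arm,
                         "events_c": events_c, "n_c": n_per_arm})
    return rows, betas

rows, betas = simulate()
# exp(beta_j) is the relative OR within meta-analysis j.
relative_ors = [math.exp(b) for b in betas]
```

Fitting such a model would recover posterior distributions for the population slope. The simulation only shows how the slope for each meta-analysis enters the log OR.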
Models were fit using a Markov chain Monte Carlo algorithm with WinBUGS
software version 1.3 (D. J. Spiegelhalter, A. Thomas, and N. G. Best, Medical
Research Council Biostatistics Unit, Cambridge, England), with appropriate
convergence of the Markov chains.
Assessment of the associations between quality measures and treatment
effect was limited to quality measures that were present in 10% to 90% of
the trials. These cutoffs were chosen to ensure sufficient heterogeneity in
the quality measures for meaningful comparisons. Analysis of the percentage
of dropouts was limited to studies that reported whether there were dropouts.
Analyses of whether dropouts were explicitly recorded and whether the reasons
for dropouts were recorded were limited to meta-analyses that included 6 or
more studies that provided information on dropouts.
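The prevalence cutoff described above amounts to a simple filter, sketched here. The measure names and flags are hypothetical, and treating the 10% and 90% bounds as inclusive is an assumption.

```python
def analyzable_measures(measure_flags, low=0.10, high=0.90):
    """Keep measures present in 10% to 90% of trials (bounds assumed inclusive).

    measure_flags maps a measure name to a list of 0/1 flags, one per trial.
    """
    kept = {}
    for name, flags in measure_flags.items():
        prevalence = sum(flags) / len(flags)
        if low <= prevalence <= high:
            kept[name] = prevalence
    return kept

# Hypothetical flags for 10 trials.
example = {
    "allocation_concealment": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],  # present in 30%
    "randomized": [1] * 10,                                     # present in all
    "rare_measure": [0] * 10,                                   # present in none
}
kept = analyzable_measures(example)
```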
Twenty-six meta-analyses were included in the analysis (Table 2). These included 8 cardiovascular disease,38-45
6 infectious disease,46-51
5 pediatric,52-56
and 7 surgical meta-analyses.57-63
We extracted data from 276 RCTs, which represented 85% of the trials from
the meta-analyses (a list of the trials is available from the author). The
remaining trials were generally reported in abstracts, letters, or unavailable
The final data extraction form included 28 quality measures (Table 1 and Table 3). These included questions on study definition and design,
study location, randomization, blinding, statistical analysis, reporting,
subject withdrawals, and conclusions.
Overall, interrater agreement of quality measures was high. Prior to
reconciliation of discrepancies, a median of 86% of responses agreed for each
quality measure. Outcome assessor blinding, inclusion of a statistician, accounting
for confounders, and randomization site had the poorest agreement, ranging
from 69% to 78%. Study country and outcome appropriateness had the highest
agreement at 97% and 96%, respectively. Determining whether the study was
performed as an intention-to-treat analysis proved to be the most difficult
question to clearly define. After data extraction was complete, all studies
were reviewed in conference to determine the type of analysis using the definition
of the intention-to-treat principle by Lachin.64
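Agreement figures of the kind reported above (percentage of matching responses per quality measure, summarized by the median) could be computed along these lines. The ratings are invented for illustration, and simple percent agreement is assumed rather than a chance-corrected statistic.

```python
from statistics import median

def percent_agreement(rater_a, rater_b):
    """Percentage of studies on which two extractors gave the same response."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must code the same studies")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)

# Hypothetical paired responses for 4 studies per measure.
ratings = {
    "study_country": (["US", "US", "UK", "US"], ["US", "US", "UK", "US"]),
    "outcome_assessor_blinding": (["yes", "no", "no", "yes"],
                                  ["yes", "yes", "no", "yes"]),
    "intention_to_treat": (["yes", "no", "yes", "no"],
                           ["no", "no", "yes", "yes"]),
}
per_measure = {m: percent_agreement(a, b) for m, (a, b) in ratings.items()}
median_agreement = median(per_measure.values())
```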
Quality measures were present in different proportions of studies within
each of the 4 clinical domains. Many of the differences were due to the inherent
differences of studies within the 4 clinical areas. For example, patient and
caregiver blinding and placebo control were rare among surgical trials but
were common among cardiovascular disease studies. Four quality measures could
not be reliably analyzed because either too few or almost all studies included
the quality criteria (Table 4).
When all clinical domains were combined, point estimates for relative
ORs of high-quality vs low-quality studies for the quality measures ranged
from 0.83 to 1.26 (Table 4). However,
none of the 24 tested quality measures was found to be significantly associated
with treatment effect. Based on 95% confidence intervals, there were trends
toward association of study quality and treatment effect for use of valid
statistical methods and reporting of power calculations.
When the 4 clinical areas were considered separately, 5 quality measures
had significant associations with treatment effect in 7 cases (Table 4). However, no consistent patterns emerged. Multicenter studies
appeared to be associated with either an increase or a decrease in treatment
effect in pediatric and surgical studies, respectively.
Figure 1 and Figure 2 display 2 sets of complementary graphs for 4 quality measures
chosen because they are commonly thought to be associated with treatment effect
or because of inconsistent findings in different medical areas (ie, multicenter
study). In Figure 1, the scatter
of the unadjusted treatment effects of studies scoring as high quality
is roughly the same as that of studies scoring as low quality. Even in the few cases in which
apparently large differences in the mean treatment effects of high- and low-quality
studies occur (eg, multicenter studies and allocation concealment in pediatric
studies), the range of treatment effects across studies was generally similar.
Figure 2 displays the statistical
analysis, with treatment effects adjusted for clinical area and meta-analysis.
The graphs directly compare the adjusted log OR of combined treatment effect
estimates of high- and low-quality studies of each meta-analysis. Again, no
quality measure consistently differentiated studies by treatment effect across
medical areas, which would be observed in clustering of points to one side
of the diagonal line of identity. Except for occasional outliers, the treatment
effects of high- and low-quality studies were similar within each meta-analysis,
regardless of the quality measure used.
Analyses were also performed using fixed-effects and random-effects
linear regression models, controlling for meta-analysis and medical area.
Results were similar.
Previous studies have described associations between specific quality
measures and treatment effects.2,5,20
In contrast, our analysis did not reveal any consistent associations between
quality measure and the magnitude of the treatment effect in 4 clinical areas.
In particular, double blinding and allocation concealment, 2 quality measures
that are frequently used in meta-analyses, were not associated with treatment
effect.
Our sample included studies from heterogeneous meta-analyses in 4 medical
areas. We might have found some of the quality measures to be statistically
significant if we had analyzed a broader range of clinical areas. In particular,
it is possible that various quality measures that trended toward significance
could have been significant if they had been applied to a different set of
clinical areas or if we had included an even larger number of RCTs. However,
the small magnitude of the relative ORs (0.83-1.26, with most ranging from
0.93-1.08) and their lack of consistency suggest that quality effects are
not as large as earlier reports have found. Furthermore, the observation that
only 7 (7%) of 102 associations tested were statistically significant at the P<.05 level suggests that our positive findings may
have been due to chance alone.
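The chance argument above can be checked with a short binomial calculation. The sketch assumes that the 102 tests are independent, which they are not strictly (the same trials contribute to many quality measures), so the result is only a rough guide.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_tests, n_significant, alpha = 102, 7, 0.05
expected_by_chance = n_tests * alpha
tail_prob = binom_tail(n_tests, n_significant, alpha)
print(f"expected by chance: {expected_by_chance:.1f}")
print(f"P(at least {n_significant} significant): {tail_prob:.2f}")
```

With no true associations, about 5 significant results would be expected among 102 tests at the .05 level, and under the independence assumption the probability of 7 or more is roughly 1 in 4. This is in line with the interpretation that the observed associations may reflect chance.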
The variation in the direction of the treatment effects significantly
associated with quality measures further calls into question whether any of
these associations could provide a general rule for evaluating the quality
of RCTs across clinical areas. For example, multicenter studies were associated
with a stronger treatment effect in cardiovascular and pediatric trials but
a weaker treatment effect in infectious disease and surgical trials. Relative
ORs were less than 1 for 10 quality measures and greater than 1 for 13 measures.
Other studies that have examined this issue have generally focused on
individual meta-analyses or on single clinical categories of meta-analyses.
Furthermore, the majority based their conclusions on a relatively small number
of RCTs. An exception is the study by Moher et al,5
which also included multiple meta-analyses from various clinical categories.
An association was found between treatment effect and both Jadad score16 and adequacy of allocation concealment. Although
the associations were statistically significant, the differences were small.65 Our findings do not discount the possibility that
certain quality measures may be associated with treatment effect in specific
clinical disciplines and for specific questions of interest. However, our
analysis does call into question whether previous findings of quality-related
modification of the treatment effect can be generalized across different medical
disciplines or even across meta-analyses within a discipline.8
Another factor that may have contributed to the differences in our conclusions
compared with previous studies may be the definitions used for the quality
measures. There are innumerable ways to define study quality and specific
quality measures. Thus, interpretation of the meaning of certain quality measures
may have differed from previous reports. Although we met frequently to define
and redefine quality measures to ensure consistency and clarity and analyzed
only those measures that could be clearly defined, our definitions probably
differ slightly from those of other authors. Furthermore, the influence of
quality on treatment effect is frequently difficult to assess because the
details of a trial's methods may not be fully reported. For a variety of reasons,
almost all articles are incomplete in their reporting of various study aspects.
It is frequently difficult to distinguish between methodologically poor studies
and omissions in reporting the methods used.8
Hopefully, the publication of the original and revised CONSORT statements
will lead to more complete reporting.28,66
We found that the proportion of studies rated as high quality using
the different quality measures varied considerably across the different medical
areas, an observation consistent with previous studies.1-5,29
A possible contributing factor is that certain quality measures may be easier
to apply within particular types of studies. For example, we found that the
assessment of whether the caregiver was blinded was generally more straightforward
in studies of surgical interventions than in some of the other types
Many factors can explain imprecision of treatment effects and heterogeneity
of study findings found in meta-analyses. In addition to study quality, other
factors include heterogeneity of study populations, treatments, outcomes,
and study design67; biases due to study design,
meta-analysis inclusion criteria, publication bias68
and evolving treatment effects69; and random
error.35 Thus, quality is only one component
of heterogeneity and has an uncertain role in explaining any treatment effect
When evaluating the validity of studies, readers should continue to
assess the quality of the study methods and reporting. This information is
useful for understanding potential shortcomings and biases and for judging
the generalizability of the results. However, it should not be assumed that
any given quality measure will necessarily explain the treatment effect found.
It is reasonable for researchers performing meta-analysis to continue using
quality measures to examine heterogeneity among studies; however, the use
of a given list of quality measures for all meta-analyses is probably not
appropriate. Furthermore, one should consider that any quality measure that
is found to partly explain heterogeneity in a given meta-analysis may do so
purely by chance. Quality-related differences in the treatment effect should
be treated as hypothesis-generating observations.
Our analysis also documents that the appraisal of quality in RCTs and
meta-analyses is not straightforward. Unless definitions of quality measures
are robustly constructed and validated, interrater agreement may often be
unacceptably low. Subtle clarifications may be essential. We used a stringent
approach to define quality measures, with 2 successive pilot phases, to ensure
that quality measures were explicitly defined and clarified. Studies using
less-rigorous methods would probably find even more variability in determination
of study quality than we found.
Our study indicates that it would be inappropriate to quantitatively
adjust the treatment effect of a given study or meta-analysis by using the
average effects of specific quality measures discerned from prior meta-analyses.5,65 Assessment of quality may be useful
in better understanding qualitative aspects of RCTs and meta-analyses on a
case-by-case basis, but its translation to overarching, quantitative adjusting
factors is precarious and should be avoided.