Context Better peer review is needed, but no proven methods of improving review quality exist. Our objective was to determine whether written feedback to reviewers improves the quality of their subsequent reviews.
Methods Eligible reviewers were randomized to an intervention or to a control condition (receipt of only the other reviewers' unscored reviews and the editor's decision letter). Study
1 (September 1998–September 2000) included reviewers with a median quality
score of 3 or lower; study 2 (April 2000–January 2002), reviewers with a
median score of 4 or lower. Study 1 was designed with a power of 0.80 to detect
a difference in score of 1; study 2, with a power of 0.80 to detect a difference
of 0.5. All participants were reviewers at a single peer-reviewed journal (Annals of Emergency Medicine). The main outcome measure was the editor's routine quality rating (scale of 1-5) of each review, assigned with editors blinded to study enrollment.
Results For study 1, 51 reviewers were eligible and randomized, and 35 had sufficient
data (182 reviews) for analysis. The mean individual reviewer rating change
was 0.16 (95% confidence interval [CI], −0.26 to 0.58) for control and −0.13
(−0.49 to 0.23) for intervention. For study 2, 127 reviewers were eligible
and randomized, and 95 had sufficient data (324 reviews). Controls had a mean
individual rating change of 0.12 (95% CI, −0.20 to 0.26) and intervention
reviewers, 0.06 (−0.19 to 0.31).
Conclusions In study 1, minimal feedback from editors on review quality had no effect
on subsequent performance of poor-quality reviewers, and the trend was toward
a negative effect. In study 2, feedback to average reviewers was more extensive
and supportive but produced no improvement in reviewer performance. Simple
written feedback to reviewers seems to be an ineffective educational tool.
Although prepublication peer review of scientific manuscripts by journals
is a crucial part of the scientific process, few journals assess reviewer
ability in advance of appointment, and few monitor reviewer performance. Some
of the inconsistency of peer review may be due to variability in reviewer
skill. Little is known about the training of peer reviewers, and education
that improved their performance would benefit reviewers, authors, and editors.
We conducted 2 randomized trials to determine whether simple written
feedback provided by editors to peer reviewers improves the quality of subsequent
reviews.
All reviews at Annals of Emergency Medicine
are routinely rated by editors for quality on a scale of 1 to 5,1
similar to that validated by van Rooyen et al.2
The correlations between van Rooyen's global scale and the individual subscales were so high (0.83 to 0.98) that we used only the global scale (S. van Rooyen, written
communication, December 2001). Reviewers were selected from the entire pool
according to their performance in the previous 2 years. The unit of analysis
was the individual reviewer, and analysis was by intention to treat. All eligible reviewers were randomized (StatView, version 5.02; SAS Institute, Cary, NC) to either an intervention, which varied between the 2 trials, or standard
procedure (reviewers were blinded to author identity and the editor's rating
of their review, and authors were blinded to reviewer identity). After review,
the reviewer received a copy of the editor's letter to the author and copies
of the other reviews of the same manuscript, but no information was provided
about the editor's opinion of any reviews. All editors were blinded to study
purpose and reviewer participation. The studies were approved by the Committee
on Human Research of the University of California, San Francisco.
The first study targeted reviewers who reviewed an average of 3 or fewer articles annually and had a median quality rating of 3 or lower. Three was defined as the minimum acceptable rating and was the median for 30% of all reviewers; 47% received a 4, and only 15%, a 5 (defined as exceptional) (BOX). Thus, these were low-volume, low-quality reviewers. The
intervention included the standard measures, a brief summary of the specific
content goals for a quality review (BOX), and the editor's numerical rating
of their review. The study had a power of 0.80 to detect a difference in score
of 1 (with an SD of 1 about this difference) by using 34 subjects and a 2-tailed α
of .05.
Box. Components of a Quality Review (Scoring Elements)*
The reviewer identified and commented on major strengths and weaknesses
of study design and methods.
The reviewer commented accurately and productively on the quality of
the author's interpretation of the data, including acknowledgment of the data's
limitations.
The reviewer commented on major strengths and weaknesses of the manuscript
as a written communication, independent of the design, methods, results, and
interpretation of the study.
The reviewer provided the author with useful suggestions for improvement
of the manuscript.
The reviewer's comments to the author were constructive and professional.
The review provided the editor with the proper context and perspective
to make a decision about acceptance or revision of the manuscript.
*Scoring: 1 indicates unacceptable effort and content; 2, unacceptable
effort or content; 3, acceptable; 4, commendable; 5, exceptional, hard to
improve (10%-20% of reviews, maximum).
The second study targeted reviewers with a similar review volume and
a median quality score of 4 or less; thus, these were low-volume, average-quality
reviewers. The intervention included all the items in the first study and
the editor's ratings of the other reviews of the manuscript, as well as a
copy of an excellent review of another manuscript. The study was calculated
to have a power of 0.80 to detect a difference in score of 0.5 (with an SD
of 0.5 about this difference) by using 34 subjects and a 2-tailed α
of .05. In both sample-size calculations, the effect size (difference in score
divided by the SD of the difference) was held constant at 1.0.
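The arithmetic behind both calculations can be checked with a standard power routine. The sketch below (Python with statsmodels) is illustrative only and assumes an independent-samples t test with equal group sizes, which the studies do not explicitly name; with a standardized effect size of 1.0, a 2-tailed α of .05, and a power of 0.80, it returns roughly 17 reviewers per group, or 34 in total, matching the planned sample size.

    # Illustrative power calculation; the independent-samples t test is an
    # assumption for this sketch, not a method stated in the article.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for label, diff, sd in [("Study 1", 1.0, 1.0), ("Study 2", 0.5, 0.5)]:
        effect_size = diff / sd  # standardized effect size, 1.0 in both studies
        n_per_group = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                           power=0.80, alternative="two-sided")
        print(f"{label}: about {n_per_group:.1f} reviewers per group "
              f"({2 * round(n_per_group)} total)")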
In both studies, reviewers were eligible for analysis only if they completed at least 2 rated reviews during the study. Each reviewer's first review was excluded from analysis because the opportunity for a change in performance did not truly begin until after receipt of the first rated review.
The first study was conducted from September 1998 to September 2000.
Fifty-one reviewers were eligible for entry, but 16 (10 control subjects and 6 intervention subjects) completed too few rated reviews during the study period, leaving 35 reviewers (182 reviews) with sufficient data for analysis.
Results are summarized in Table 1.
Table 1. Low-Quality Reviewers (Study 1)
The second study was conducted from April 2000 to January 2002. One
hundred twenty-seven reviewers were eligible; 32 (15 control subjects and
17 intervention subjects) did not complete sufficient reviews, leaving 95
with sufficient data for analysis (324 reviews). Results are summarized in Table 2.
Table 2. Average Reviewers (Study 2)
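For clarity about the summary statistics reported in the tables, the sketch below shows one way the per-arm figures could be computed once each reviewer's rating change is known: the unit of analysis is the individual reviewer, and the arm's mean change is accompanied by a 95% CI across reviewers. The example data and the choice of a t-based interval are assumptions for illustration, not the study's data or analysis code.

    # Illustrative summary of per-reviewer rating changes; the values and the
    # t-based interval are assumptions, not study data or the authors' method.
    import numpy as np
    from scipy import stats

    def mean_change_ci(changes, confidence=0.95):
        """Mean rating change across reviewers with a t-based confidence interval."""
        changes = np.asarray(changes, dtype=float)
        mean = changes.mean()
        sem = stats.sem(changes)  # standard error of the mean
        low, high = stats.t.interval(confidence, len(changes) - 1, loc=mean, scale=sem)
        return mean, (low, high)

    control_changes = [0.5, -0.2, 0.0, 0.3, -0.1]  # hypothetical per-reviewer changes
    print(mean_change_ci(control_changes))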
Our first study showed no benefit of limited feedback to below-average
reviewers. In fact, the feedback may actually have had a negative effect,
lowering the quality scores of subsequent reviews. In the second study, we
developed more detailed feedback in the hope it would be more effective, but
it was not.
Both studies had limitations. The first group of reviewers may have performed too poorly to be able to improve; this led us to select average-quality reviewers in the second study, but the results did not change. We do not know
how thoroughly reviewers read the materials we sent or if they used them at
all. Even our more detailed feedback materials did not pinpoint specific flaws
or provide point-by-point guidance on how to improve a review.
Reviewers who need improvement may not be able to extrapolate from general
instructions. They may need directed instruction to implement changes or may
discount the feedback to preserve a positive view of their own performance.
This phenomenon has been reported among students in college-level composition3 but has not been studied in postgraduate professionals.
If so, improving performance would require individualized, labor-intensive feedback, whereas our goal was to identify a simple, efficient method. Furthermore,
the closest relevant studies (of college freshmen learning English as a second
language) suggest there is no educational benefit to even specific, concrete,
detailed feedback compared with a fairly superficial kind.4
In our first study, enrollment was lower than planned because 16 eligible reviewers did not complete enough assigned reviews during the study to meet entry criteria; in fact, they did no further reviewing after that time, which suggests this pool of reviewers was not only of poor quality but also unmotivated.
We doubt that receiving feedback contributed to their decision to stop reviewing altogether, since more control than intervention subjects
did so. However, even with the reduced enrollment, the power of both studies
to detect a difference in performance exceeds 80%.
There are additional possible explanations for the failure of this feedback
tool. Our reviewers may be atypical of most journal peer reviewers, but we
doubt it. They held a range of academic ranks at virtually the full spectrum of US medical schools. It may also be that good reviewers come to a journal already possessing the needed skills in critical appraisal and review writing and do not dramatically improve over time.
Our results run counter to traditional educational practice, but the relevant literature is consistent with our findings. We have found
that the traditional half-day workshop format improves self-ratings but not
actual performance on reviews, even with a structured, evidence-based approach.5,6 Studies of journal-club formats and
critical appraisal courses for residents show similar results, and a meta-analysis
of such methods concluded there was no evidence supporting their efficacy.7 Traditional educational methods are time-tested but
not scientifically proven.
Few journals assess training or skills in critical review before appointing
reviewers; many journals do not objectively assess performance of existing
reviewers. Therefore, further investigation is needed to identify ways of
improving the peer-review process. Key objectives include determining the
minimal training needed for reviewers before appointment, the characteristics
of candidates who can improve, the best tools for changing performance (if
any), and the type of training needed to change performance. Perhaps the charity
model of voluntary peer review asks too much of participants, and only financial
or other reward will lead to better performance.
Because there are no proven tools for improving performance, journals
should develop, study, and formally adopt mechanisms for more effective initial
screening of potential reviewers.
1. Callaham M, Baxt W, Waeckerle J, Wears R. The reliability of editors' subjective quality ratings of manuscript peer reviews. JAMA. 1998;280:229-231.
2. van Rooyen S, Black N, Godlee F. Development of the review quality instrument (RQI) for assessing peer reviews of manuscripts. J Clin Epidemiol. 1999;52:625-629.
3. MacDonald RB. Developmental students' processing of teacher feedback in composition instruction. Rev Res Dev Educ. 1991;8:143.
4. Shortreed I. Salience of feedback on error and its effect on EFL writing quality. TESOL Q. 1986;20:83-93.
5. Callaham ML, Wears RL, Waeckerle JF. Effect of attendance at a training session on peer reviewer quality and performance. Ann Emerg Med. 1998;32(suppl 3, pt 1):318-322.
6. Callaham ML, Schriger DL. Effect of structured workshop training on subsequent performance of journal peer reviewers. Ann Emerg Med. In press.
7. Norman GR, Shannon SI. Effectiveness of instruction in critical appraisal (evidence-based medicine) skills: a critical appraisal. CMAJ. 1998;158:177-181.