Effect of Written Feedback by Editors on Quality of Reviews: Two Randomized Trials | JAMA | JAMA Network
Peer Review
June 5, 2002

Author Affiliations: Department of Emergency Medicine, University of California, San Francisco (Dr Callaham); Department of Emergency Medicine, University of Minnesota, Minneapolis (Dr Knopp); Department of Emergency Medicine, Albert Einstein College of Medicine, Bronx, NY (Dr Gallagher).

JAMA. 2002;287(21):2781-2783. doi:10.1001/jama.287.21.2781

Context Better peer review is needed, but proven methods to improve quality are unknown. Our objective was to determine whether written feedback to reviewers improves subsequent reviews.

Methods Eligible reviewers were randomized to intervention or control (receiving other reviewers' unscored reviews and the editor's decision letter). Study 1 (September 1998–September 2000) included reviewers with a median quality score of 3 or lower; study 2 (April 2000–January 2002), reviewers with median score of 4 or lower. Study 1 was designed with a power of 0.80 to detect a difference in score of 1; study 2, with a power of 0.80 to detect a difference of 0.5. All reviewers reviewed for a single peer-reviewed journal (Annals of Emergency Medicine). The main outcome measure was the editor's routine quality rating (1-5) of all reviews (blinded to study enrollment).

Results For study 1, 51 reviewers were eligible and randomized and 35 had sufficient data (182 reviews) for analysis. The mean individual reviewer rating change was 0.16 (95% confidence interval [CI], −0.26 to 0.58) for control and −0.13 (−0.49 to 0.23) for intervention. For study 2, 127 reviewers were eligible and randomized, and 95 had sufficient data (324 reviews). Controls had a mean individual rating change of 0.12 (95% CI, −0.20 to 0.26) and intervention reviewers, 0.06 (−0.19 to 0.31).

Conclusions In study 1, minimal feedback from editors on review quality had no effect on subsequent performance of poor-quality reviewers, and the trend was toward a negative effect. In study 2, feedback to average reviewers was more extensive and supportive but produced no improvement in reviewer performance. Simple written feedback to reviewers seems to be an ineffective educational tool.

Although prepublication peer review of scientific manuscripts by journals is a crucial part of the scientific process, few journals assess reviewer ability in advance of appointment, and few monitor reviewer performance. Some of the inconsistency of peer review may be due to variability in reviewer skill. Little is known about the training of peer reviewers, and education that improved their performance would benefit reviewers, authors, and editors.

We conducted 2 randomized trials to determine whether simple written feedback provided by editors to peer reviewers improves the quality of subsequent reviews.


Methods

All reviews at Annals of Emergency Medicine are routinely rated by editors for quality on a scale of 1 to 5 [1], similar to that validated by van Rooyen et al. [2] The correlations between van Rooyen's global scale and the individual subscales were so high (0.83 to 0.98) that we used only the former (S. van Rooyen, written communication, December 2001). Reviewers were selected from the entire pool according to their performance in the previous 2 years. Analysis was by intention to treat, with the individual reviewer as the unit of analysis. All eligible reviewers were randomized (Statview 5.0, version 5.02; SAS Institute, Cary, NC) to either an intervention, which varied between the 2 trials, or standard procedure (reviewers were blinded to author identity and the editor's rating of their review, and authors were blinded to reviewer identity). After review, the reviewer received a copy of the editor's letter to the author and copies of the other reviews of the same manuscript, but no information was provided about the editor's opinion of any reviews. All editors were blinded to study purpose and reviewer participation. The studies were approved by the Committee on Human Research of the University of California, San Francisco.

The first study targeted reviewers who reviewed an average volume of 3 articles or fewer annually and had a median quality rating of 3 or lower. Three was defined as the minimum acceptable rating and was the median for 30% of all reviewers; 47% received a 4, and only 15%, a 5 (defined as unusually exceptional) (Box). Thus, these were low-volume, low-quality reviewers. The intervention included the standard measures, a brief summary of the specific content goals for a quality review (Box), and the editor's numerical rating of their review. The study had a power of 0.80 to detect a difference in score of 1 (with an SD of 1 about this difference) by using 34 subjects and a 2-tailed α of .05.

Box. Components of a Quality Review (Scoring Elements)*

The reviewer identified and commented on major strengths and weaknesses of study design and methods.

The reviewer commented accurately and productively on the quality of the author's interpretation of the data, including acknowledgment of the data's limitations.

The reviewer commented on major strengths and weaknesses of the manuscript as a written communication, independent of the design, methods, results, and interpretation of the study.

The reviewer provided the author with useful suggestions for improvement of the manuscript.

The reviewer's comments to the author were constructive and professional.

The review provided the editor with the proper context and perspective to make a decision about acceptance or revision of the manuscript.

*Scoring: 1 indicates unacceptable effort and content; 2, unacceptable effort or content; 3, acceptable; 4, commendable; 5, exceptional, hard to improve (10%-20% of reviews, maximum).

The second study targeted reviewers with a similar review volume and a median quality score of 4 or less; thus, these were low-volume, average-quality reviewers. The intervention included all the items in the first study and the editor's ratings of the other reviews of the manuscript, as well as a copy of an excellent review of another manuscript. The study was calculated to have a power of 0.80 to detect a difference in score of 0.5 (with an SD of 0.5 about this difference) by using 34 subjects and a 2-tailed α of .05. In both sample-size calculations, the effect size (difference in score divided by the SD of the difference) was held constant at 1.0.
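The sample-size figures above can be checked with a short calculation. The text does not say which formula or software was used, so the following is only a sketch under stated assumptions: a 2-sided comparison of two independent group means, equal group sizes, the usual normal-approximation formula, and a common small-sample correction term. The function name `n_per_group` and its parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(diff, sd, alpha=0.05, power=0.80, small_sample_correction=True):
    """Approximate subjects per group for comparing two independent means.

    Assumption (not stated in the article): normal-approximation formula
    n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2, where d = diff / sd,
    optionally plus z_{1-alpha/2}^2 / 4 to approximate the t-test.
    """
    d = diff / sd  # standardized effect size
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 2-tailed critical value
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 / d ** 2
    if small_sample_correction:
        n += z_alpha ** 2 / 4
    return ceil(n)


# Study 1: difference of 1 with SD 1; study 2: difference of 0.5 with SD 0.5.
# Both give an effect size of 1.0, so the required sample size is the same.
study1_total = 2 * n_per_group(diff=1.0, sd=1.0)
study2_total = 2 * n_per_group(diff=0.5, sd=0.5)
print(study1_total, study2_total)  # 34 34
```

Under these assumptions the calculation gives 17 reviewers per group, or 34 in total, for both studies, which agrees with the number reported in the text. Without the correction term the same formula gives 16 per group.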

In both studies, reviewers were eligible for analysis only if they completed at least 2 rated reviews during the study. The first was excluded from analysis because the opportunity for change in performance did not truly begin until after receipt of the first rated review.
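The eligibility rule and the per-reviewer outcome can be illustrated with a minimal sketch. The article does not give the exact formula for a reviewer's rating change, so this example assumes it is the mean of the rated reviews after the first one minus a baseline rating from before the study. The confidence interval uses a normal approximation, which is also an assumption. All names and ratings below are hypothetical and are not study data.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def reviewer_change(baseline, study_ratings):
    """Return one reviewer's rating change, or None if ineligible for analysis.

    Assumption: change = mean of ratings after the first rated review
    minus the pre-study baseline rating.
    """
    if len(study_ratings) < 2:
        return None  # fewer than 2 rated reviews during the study
    later_ratings = study_ratings[1:]  # the first rated review is excluded
    return mean(later_ratings) - baseline


def mean_with_ci(changes, level=0.95):
    """Mean of per-reviewer changes with a normal-approximation CI."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    m = mean(changes)
    half_width = z * stdev(changes) / sqrt(len(changes))
    return m, m - half_width, m + half_width


# Hypothetical reviewers: (baseline rating, ratings of reviews during the study)
reviewers = {
    "A": (3, [3, 4, 4]),
    "B": (3, [2]),        # only 1 rated review, so excluded from analysis
    "C": (2, [3, 2, 2]),
    "D": (3, [3, 3]),
}

changes = []
for baseline, ratings in reviewers.values():
    change = reviewer_change(baseline, ratings)
    if change is not None:
        changes.append(change)

group_mean, ci_low, ci_high = mean_with_ci(changes)
print(len(changes), round(group_mean, 2), round(ci_low, 2), round(ci_high, 2))
```

With groups as small as those analyzed here, a t-based interval would be somewhat wider than this normal approximation.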


Results

The first study was conducted from September 1998 to September 2000. Fifty-one reviewers were eligible for entry, but 16 (10 control subjects and 6 intervention subjects) completed too few rated reviews during the study period, leaving 35 reviewers with sufficient data for analysis (182 reviews). Results are summarized in Table 1.

Table 1. Low-Quality Reviewers (Study 1)

The second study was conducted from April 2000 to January 2002. One hundred twenty-seven reviewers were eligible; 32 (15 control subjects and 17 intervention subjects) did not complete sufficient reviews, leaving 95 with sufficient data for analysis (324 reviews). Results are summarized in Table 2.

Table 2. Average Reviewers (Study 2)


Comment

Our first study showed no benefit of limited feedback to below-average reviewers. In fact, the feedback may have had a negative effect, lowering the quality scores of subsequent reviews. In the second study, we developed more detailed feedback in the hope it would be more effective, but it was not.

Both studies had limitations. The first group of reviewers may have had too poor a performance level to improve, which led us to select average reviewers in the second study but with no change in results. We do not know how thoroughly reviewers read the materials we sent or if they used them at all. Even our more detailed feedback materials did not pinpoint specific flaws or provide point-by-point guidance on how to improve a review.

Reviewers who need improvement may not be able to extrapolate from general instructions. They may need directed instruction to implement changes or may discount the feedback to preserve a positive view of their own performance. This phenomenon has been reported among students in college-level composition [3] but has not been studied in postgraduate professionals. If this is so, improving performance would require individualized, labor-intensive feedback, whereas our goal was to identify a simple, efficient method. Furthermore, the closest relevant studies (of college freshmen learning English as a second language) suggest there is no educational benefit to even specific, concrete, detailed feedback compared with a fairly superficial kind. [4]

In our first study, enrollment was lower than planned because 16 eligible reviewers did not complete enough assigned reviews during the study to meet entry criteria. These reviewers in fact did no further reviewing after that time, which suggests this pool was not only poor in quality but also unmotivated. We doubt that receiving feedback contributed to their decision to stop reviewing altogether, since more control than intervention subjects did so. However, even with the reduced enrollment, the power of both studies to detect a difference in performance exceeds 80%.

There are additional possible explanations for the failure of this feedback tool. Our reviewers may be atypical of most journal peer reviewers, but we doubt it. They covered a range of academic rank appointments at virtually the full spectrum of US medical schools. It may also be that good reviewers come to a journal with the needed skills in critical appraisal and writing a review and do not dramatically improve their skills over time.

Our results run counter to traditional, time-tested educational practice, but they are consistent with the relevant literature. We have found that the traditional half-day workshop format improves self-ratings but not actual performance on reviews, even with a structured, evidence-based approach. [5,6] Studies of journal-club formats and critical appraisal courses for residents show similar results, and a meta-analysis of such methods concluded there was no evidence supporting their efficacy. [7] Traditional educational methods are time-tested but not scientifically proven.

Few journals assess training or skills in critical review before appointing reviewers; many journals do not objectively assess performance of existing reviewers. Therefore, further investigation is needed to identify ways of improving the peer-review process. Key objectives include determining the minimal training needed for reviewers before appointment, the characteristics of candidates who can improve, the best tools for changing performance (if any), and the type of training needed to change performance. Perhaps the charity model of voluntary peer review asks too much of participants, and only financial or other reward will lead to better performance.

Because there are no proven tools for improving performance, journals should develop, study, and formally adopt mechanisms for more effective initial screening of potential reviewers.

1. Callaham M, Baxt W, Waeckerle J, Wears R. The reliability of editors' subjective quality ratings of manuscript peer reviews. JAMA. 1998;280:229-231.
2. van Rooyen S, Black N, Godlee F. Development of the review quality instrument (RQI) for assessing peer reviews of manuscripts. J Clin Epidemiol. 1999;52:625-629.
3. MacDonald RB. Developmental students' processing of teacher feedback in composition instruction. Rev Res Dev Educ. 1991;8:143.
4. Shortreed I. Salience of feedback on error and its effect on EFL writing quality. TESOL Q. 1986;20:83-93.
5. Callaham ML, Wears RL, Waeckerle JF. Effect of attendance at a training session on peer reviewer quality and performance. Ann Emerg Med. 1998;32(suppl 3, pt 1):318-322.
6. Callaham ML, Schriger DL. Effect of structured workshop training on subsequent performance of journal peer reviewers. Ann Emerg Med. In press.
7. Norman GR, Shannon SI. Effectiveness of instruction in critical appraisal (evidence-based medicine) skills: a critical appraisal. CMAJ. 1998;158:177-181.