Peer Review Congress
July 15, 1998

Reliability of Editors' Subjective Quality Ratings of Peer Reviews of Manuscripts

Author Affiliations

From the Division of Emergency Medicine, Department of Medicine, University of California, San Francisco (Dr Callaham); Department of Emergency Medicine, University of Pennsylvania, Philadelphia (Dr Baxt); Department of Emergency Medicine, University of Missouri at Kansas City (Dr Waeckerle); and Department of Emergency Medicine, University of Florida at Jacksonville (Dr Wears).

JAMA. 1998;280(3):229-231. doi:10.1001/jama.280.3.229

Context.— Quality of reviewers is crucial to journal quality, but there are usually too many for editors to know them all personally. A reliable method of rating them (for education and monitoring) is needed.

Objective.— To determine whether editors' quality ratings of peer reviewers are reliable and how they compare with other performance measures.

Design.— A 3.5-year prospective observational study.

Setting.— Peer-reviewed journal.

Participants.— All editors and peer reviewers who reviewed at least 3 manuscripts.

Main Outcome Measures.— Reviewer quality ratings, individual reviewer rate of recommendation for acceptance, congruence between reviewer recommendation and editorial decision (decision congruence), and accuracy in reporting flaws in a masked test manuscript.

Interventions.— Editors rated the quality of each review on a subjective 1 to 5 scale.

Results.— A total of 4161 reviews of 973 manuscripts by 395 reviewers were studied. The within-reviewer intraclass correlation was 0.44 (P<.001), indicating that 20% of the variance seen in the review ratings was attributable to the reviewer. Intraclass correlations for editor and manuscript were only 0.24 and 0.12, respectively. Reviewer average quality ratings correlated poorly with the rate of recommendation for acceptance (R=−0.34) and congruence with editorial decision (R=0.26). Among 124 reviewers of the fictitious manuscript, the mean quality rating for each reviewer was modestly correlated with the number of flaws they reported (R=0.53). Highly rated reviewers reported twice as many flaws as poorly rated reviewers.

Conclusions.— Subjective editor ratings of individual reviewers were moderately reliable and correlated with reviewer ability to report manuscript flaws. Individual reviewer rate of recommendation for acceptance and decision congruence might be thought to be markers of a discriminating (ie, high-quality) reviewer, but these variables were poorly correlated with editors' ratings of review quality or the reviewer's ability to detect flaws in a fictitious manuscript. Therefore, they cannot be substituted for actual quality ratings by editors.

Little is known about assessing the quality of peer review. Most journals do not have standardized methods of selecting reviewers, nor do they screen or train them. We conducted a study to determine if a subjective rating of review quality is a reliable measure; whether it can be easily replaced by more objective reviewer performance statistics, such as individual rate of recommendation for acceptance; and whether ratings are correlated with reviewer accuracy in detecting errors in manuscripts.


When making decisions about manuscripts, editors at Annals of Emergency Medicine routinely rate the quality of each peer review on a subjective ordinal scale developed by senior editors. The scale ranges from 1 (poor) to 5 (excellent) and is not specifically defined other than as "review quality." No special training in use of the scale is provided, although all editors receive a detailed job description plus the reviewer orientation described below.

All manuscript peer reviews that had been rated by any editors between January 1, 1994, and June 1, 1997, were eligible for study. When recruited by the editorial board, reviewers receive a 4800-word, written orientation on journal expectations for reviews and, with each subsequent review, fill out a form rating 12 specific components of study design, originality, interest for readers, and manuscript quality, plus free text comments for editors and authors. Reviewers are routinely asked to recommend whether each manuscript should be accepted.

Routine performance statistics are calculated by the journal for each reviewer. They include the rate of recommending acceptance and rejection and decision congruence, defined as whether the reviewer recommendation for each manuscript matched the editor's final decision regarding manuscript acceptance.
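As an illustration of the performance statistics just defined, the following is a minimal sketch of how they might be tallied per reviewer. The record layout, the labels ("accept", "revise", "reject"), and the function name are hypothetical. Treating congruence as an exact match between the reviewer's recommendation and the editor's decision is an assumption; the journal's actual matching rule is not spelled out in the text.

```python
from collections import defaultdict


def reviewer_statistics(reviews):
    """Tally per-reviewer acceptance rate, rejection rate, and decision congruence.

    `reviews` is an iterable of (reviewer_id, recommendation, editor_decision)
    tuples. This layout and the label strings are hypothetical.
    """
    counts = defaultdict(lambda: {"n": 0, "accept": 0, "reject": 0, "congruent": 0})
    for reviewer_id, recommendation, editor_decision in reviews:
        c = counts[reviewer_id]
        c["n"] += 1
        c["accept"] += recommendation == "accept"
        c["reject"] += recommendation == "reject"
        # Assumption: congruence means the two labels are identical.
        c["congruent"] += recommendation == editor_decision
    return {
        reviewer_id: {
            "acceptance_rate": c["accept"] / c["n"],
            "rejection_rate": c["reject"] / c["n"],
            "decision_congruence": c["congruent"] / c["n"],
        }
        for reviewer_id, c in counts.items()
    }
```

Each rate is a simple proportion of that reviewer's reviews, so a reviewer with few reviews will have unstable values.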

The quality ratings were examined for interreviewer agreement using intraclass correlation coefficients (ICCs) because reviews received only 1 editorial rating each, and the ICC takes into account the varying number of scores per reviewer while still weighting each "data point" (paper, editor, and the like) only once.
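The text does not state which form of the ICC was computed. As a sketch only, the one-way random-effects estimator from analysis of variance handles a varying number of ratings per reviewer through an effective group size (called `k0` here). The function name and the input layout are assumptions for illustration.

```python
def one_way_icc(groups):
    """One-way random-effects ICC for groups of unequal size.

    `groups` maps a grouping key (for example a reviewer ID) to the list of
    ratings in that group. At least two groups and at least one group with
    more than one rating are required.
    """
    sizes = [len(ratings) for ratings in groups.values()]
    g = len(sizes)
    n_total = sum(sizes)
    grand_mean = sum(sum(ratings) for ratings in groups.values()) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for ratings in groups.values():
        group_mean = sum(ratings) / len(ratings)
        ss_between += len(ratings) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in ratings)

    ms_between = ss_between / (g - 1)
    ms_within = ss_within / (n_total - g)

    # Effective group size for unbalanced data; equals the common group
    # size when every group has the same number of ratings.
    k0 = (n_total - sum(n * n for n in sizes) / n_total) / (g - 1)

    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)
```

Grouping the same ratings by editor or by manuscript instead of by reviewer would give the corresponding editor and manuscript coefficients.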

As part of a separate study, in fall 1994 all active reviewers were each sent a fictitious manuscript to review; it contained 23 deliberate flaws, and reviewers were masked to its true purpose. The method is described in detail elsewhere.1,2 The performance of reviewers who reviewed the manuscript (and had at least 3 rated reviews) in reporting manuscript flaws was compared with the ratings above.


During the study, 4644 reviews (2922 [63%] of them rated) were performed by 756 reviewers on 1551 manuscripts. To focus on regular (not occasional guest) reviewers, reviewers with 2 or fewer reviews were excluded, leaving 4161 reviews by 395 reviewers. A total of 2686 (65%) of these were rated by 36 editors, involving 973 manuscripts. The 395 peer reviewers who were studied each reviewed an average of 10.5 manuscripts (95% confidence interval [CI], 9.8-11.2), 6.8 of which were rated (95% CI, 6.3-7.3), receiving an average score of 3.7 (95% CI, 3.58-3.73). They recommended acceptance in 26% of their reviews, rejection in 34%, and revision in 40%. Their recommendations were congruent with the editors' decisions in 50% of these reviews.

Thirty-five percent of reviews were not rated by editors, presumably because of workload, and were therefore excluded. These did not differ in rate of recommendation for acceptance, decision congruence rate, or other identifiable characteristics from rated reviews.

With higher review ratings, the editor's decision was more likely to match the reviewer's recommendation. Reviews rated 1 (poor) demonstrated a 21% congruence between editor and reviewer; those rated a 2, 27% congruence; 3, 41% congruence; 4, 55% congruence; and 5 (excellent), 62% congruence. The average rating of reviews with no decision congruence was 3.51 (SE, 0.03) vs 3.99 (SE, 0.03) for those with congruence (P<.001). Review ratings were not associated with the specific decision made by the editor, however. Accepted papers had lower review scores (3.48; SE, 0.13) than papers sent back for revision (3.74; SE, 0.03) or rejected (3.78; SE, 0.03) (P=.02). The weighted κ between reviewers and editors for decision congruence was 0.034.
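The weighting scheme behind the reported weighted κ is not given. The sketch below assumes linear disagreement weights over three ordered categories; the function name, the category labels, and the weights are all illustrative assumptions rather than the study's method.

```python
def weighted_kappa(pairs, categories):
    """Linearly weighted Cohen's kappa for two raters over ordered categories.

    `pairs` is a list of (reviewer_label, editor_label) tuples and
    `categories` lists the labels in order, e.g. ["reject", "revise", "accept"].
    """
    k = len(categories)
    index = {label: i for i, label in enumerate(categories)}
    n = len(pairs)

    # Joint proportions of (reviewer category, editor category).
    observed = [[0.0] * k for _ in range(k)]
    for reviewer_label, editor_label in pairs:
        observed[index[reviewer_label]][index[editor_label]] += 1.0 / n

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Assumed linear weight: 0 on the diagonal, 1 for the extremes.
            weight = abs(i - j) / (k - 1)
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row_totals[i] * col_totals[j]

    return 1.0 - observed_disagreement / expected_disagreement
```

A value near 0 means agreement is about what the marginal rates alone would produce, and a value of 1 means perfect agreement.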

The ICC was 0.44 for reviewers, 0.24 for editors, and 0.12 for manuscripts, meaning that about 20% of the variance seen in the review ratings was attributable to the reviewer, 6% to the editor, and 1% to the manuscript.
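The percentages quoted here are consistent with squaring each ICC, in the way a squared correlation coefficient is read as a proportion of variance. The text does not show this step, so the following is an assumed reconstruction of the arithmetic:

```latex
\begin{aligned}
0.44^2 &= 0.1936 \approx 20\% \quad \text{(reviewer)} \\
0.24^2 &= 0.0576 \approx 6\%  \quad \text{(editor)} \\
0.12^2 &= 0.0144 \approx 1\%  \quad \text{(manuscript)}
\end{aligned}
```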

The reviewer's average quality rating correlated poorly with routine performance statistics. The Spearman rank correlation coefficient was −0.34 for the rate of recommendation for acceptance, 0.12 for the rejection rate, and 0.26 for the decision congruence rate.
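For readers unfamiliar with the statistic, a Spearman rank correlation is a Pearson correlation computed on ranks, with tied values sharing the mean of their ranks. The following is a minimal standard-library sketch; the function names are hypothetical, and it is not the software used in the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(rank_x)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    spread_x = sum((a - mean_x) ** 2 for a in rank_x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in rank_y) ** 0.5
    return covariance / (spread_x * spread_y)
```

Here `x` could hold each reviewer's mean quality rating and `y` the same reviewers' acceptance, rejection, or congruence rates.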

Seventy-eight percent of all reviewers available to review in fall 1994 (a typical proportion) evaluated the fictitious manuscript. Sufficient ratings were available for 124 reviewers, who also reviewed 2143 genuine manuscripts during the study period and were rated by 30 editors, receiving a mean rating of 3.65 (SD, 0.7). These reviewers had an average of 12 (SD, 5.3) years of experience since residency; 40% were assistant professors, 38% were associate professors, and 17% were full professors.

The reviewers detected a mean 3.4 (SD, 1.6) of the 10 major flaws (those that invalidated or undermined the methods of the study) and 3.1 (SD, 1.7) of the 13 minor flaws (those not affecting study validity) in the fictitious manuscript. The mean editorial rating for each reviewer was modestly correlated with the number of flaws they detected (R=0.45 for major errors, 0.45 for minor errors, and 0.53 for total errors). Reviewers with average ratings of 4 or more reported about twice as many total errors as those with average ratings of 3 or less. Individual reviewer rate of recommendation for acceptance, decision congruence, and various measures of experience were not associated with detection of errors, nor was academic rank (Table 1).

Table 1. Spearman Rank Correlation of Individual Reviewer Characteristics to Number of Major Errors Detected (table image not available)


Half of journal editors rely almost exclusively on reviewer recommendations when making acceptance decisions.3 Maintaining the quality of peer reviewers is therefore essential, but most journals have too many reviewers for editors to know their capabilities personally. Most journals do not train reviewers or assess their background in research methodology or critical literature review.4 Editors need a simple rating system to monitor review quality, but this rating should require minimal additional work and should add unique value.

Although the reliability of ratings of manuscripts by reviewers has been previously examined, there is only 1 prior study that examined the reliability of ratings of reviewers by editors.5 (An earlier study used a 9-axis scale of review quality, but did not test the scale's reliability.6) The study by Feurer et al5 examined 53 reviews of 23 manuscripts, rated by 3 editors. Editors used a scale of 7 axes with scores of varying complexity, guided by a detailed written explanation. The scale included timeliness standards, whether the reviewer wrote on the manuscript or provided supporting references, and other criteria not directly related to quality of scientific assessment. It demonstrated an excellent ICC of 0.84.

By contrast, we chose a scale that was simpler to use and provided a fair to good ICC of 0.44. It performed in a satisfactory fashion compared with scales used by a specially trained panel of experienced researchers in rating manuscripts7,8 and exceeded the performance of a multi-axis scale used by peer reviewers for the same purpose.9 It markedly exceeded the reliability of scoring systems used for awarding National Science Foundation grants.10 However, it correlated poorly with routine reviewer performance statistics such as rate of recommendation for acceptance and congruence rate. This is logical, since many factors other than just review and manuscript quality affect editorial decisions. It also means that these statistics cannot replace a subjective scale to assess review quality.

The rating scale also showed a modest correlation (R=0.53) with the ability of reviewers to detect deliberate flaws in a fictitious manuscript. The more flaws they were able to detect and report, the more likely they were to produce reviews judged excellent by editors. Reviewers with high average ratings reported twice as many errors as those with low ratings. Reviewers reported only a minority of such flaws, similar to a prior study.11 There is no "gold standard" for judging a quality peer review, but this at least suggests that our subjective quality rating may be related to accuracy and/or thoroughness.

Our scale had only 1 axis and the definition of quality was not specific; previous research has suggested that multiple, more specific scales increase the reliability of ratings.9 Despite this, our ICC was higher than those of previous multi-axis scales used to rate manuscripts,7,8 although our scale was not as reliable as the much more complex scale of Feurer et al,5 which included objective data.

We have found this simple scale useful as part of a comprehensive evaluation to identify poor reviewers who should be dropped and excellent reviewers who deserve acknowledgment. Its reliability could be improved by defining the qualities of a good review more specifically and by formally training editors in its use.

In conclusion, this simple 5-point scale for editors to rate peer review quality is acceptably reliable and correlates modestly with reviewers' reporting of flaws in a test manuscript. It provides a useful tool for monitoring reviewer performance and cannot be replaced by routine reviewer statistics such as individual rate of recommendation for acceptance and decision congruence.

1. Baxt WG, Waeckerle JF, Tintinalli JE, Knopp RK, Callaham ML. Evaluation of the peer reviewer: performance of reviewers on a factitious submission. Acad Emerg Med. 1996;3:504.
2. Baxt WG, Waeckerle J, Berlin JA, Callaham ML. Who reviews the reviewers? the feasibility of using a fictitious manuscript to evaluate peer reviewer performance accepted for publication. Ann Emerg Med. In press.
3. Wilkes MS, Kravitz RL. Policies, practices, and attitudes of North American medical journal editors. J Gen Intern Med. 1995;10:443-450.
4. Schulman K, Sulmasy DP, Roney D. Ethics, economics, and the publication policies of major medical journals. JAMA. 1994;272:154-156.
5. Feurer ID, Becker GJ, Picus D, Ramirez E, Darcy MD, Hicks ME. Evaluating peer reviews: pilot testing of a grading instrument. JAMA. 1994;272:98-100.
6. Evans AT, McNutt RA, Fletcher SW, Fletcher RH. The characteristics of peer reviewers who produce good-quality reviews. J Gen Intern Med. 1993;8:422-428.
7. Oxman AD, Guyatt GH. Validation of an index of the quality of review articles. J Clin Epidemiol. 1991;44:1271-1278.
8. Oxman AD, Guyatt GH, Singer J, et al. Agreement among reviewers of review articles. J Clin Epidemiol. 1991;44:91-98.
9. Strayhorn J Jr, McDermott JF Jr, Tanguay P. An intervention to improve the reliability of manuscript reviews for the Journal of the American Academy of Child and Adolescent Psychiatry. Am J Psychiatry. 1993;150:947-952.
10. Cicchetti D. The reliability of peer review for manuscript and grant submissions: a cross-disciplinary investigation. Behav Brain Sci. 1991;14:119-186.
11. Nylenna M, Riis P, Karlsson Y. Multiple blinded reviews of the same two manuscripts: effects of referee characteristics and publication language. JAMA. 1994;272:149-151.