From the Division of Emergency Medicine, Department of Medicine, University of California, San Francisco (Dr Callaham); Department of Emergency Medicine, University of Pennsylvania, Philadelphia (Dr Baxt); Department of Emergency Medicine, University of Missouri at Kansas City (Dr Waeckerle); and Department of Emergency Medicine, University of Florida at Jacksonville (Dr Wears).
Context.— Quality of reviewers is crucial to journal quality, but there are usually
too many for editors to know them all personally. A reliable method of rating
them (for education and monitoring) is needed.
Objective.— Whether editors' quality ratings of peer reviewers are reliable and
how they compare with other performance measures.
Design.— A 3.5-year prospective observational study.
Setting.— Peer-reviewed journal.
Participants.— All editors and peer reviewers who reviewed at least 3 manuscripts.
Main Outcome Measures.— Reviewer quality ratings, individual reviewer rate of recommendation
for acceptance, congruence between reviewer recommendation and editorial decision
(decision congruence), and accuracy in reporting flaws in a masked test manuscript.
Interventions.— Editors rated the quality of each review on a subjective 1 to 5 scale.
Results.— A total of 4161 reviews of 973 manuscripts by 395 reviewers were studied.
The within-reviewer intraclass correlation was 0.44 (P<.001),
indicating that 20% of the variance seen in the review ratings was attributable
to the reviewer. Intraclass correlations for editor and manuscript were only
0.24 and 0.12, respectively. Reviewer average quality ratings correlated poorly
with the rate of recommendation for acceptance (R=−0.34)
and congruence with editorial decision (R=0.26).
Among 124 reviewers of the fictitious manuscript, the mean quality rating
for each reviewer was modestly correlated with the number of flaws they reported
(R=0.53). Highly rated reviewers reported twice as
many flaws as poorly rated reviewers.
Conclusions.— Subjective editor ratings of individual reviewers were moderately reliable
and correlated with reviewer ability to report manuscript flaws. Individual
reviewer rate of recommendation for acceptance and decision congruence might
be thought to be markers of a discriminating (ie, high-quality) reviewer,
but these variables were poorly correlated with editors' ratings of review
quality or the reviewer's ability to detect flaws in a fictitious manuscript.
Therefore, they cannot be substituted for actual quality ratings by editors.
LITTLE IS KNOWN about assessing the quality of peer review. Most journals
do not have standardized methods of selecting reviewers, nor do they screen
or train them. We conducted a study to determine if a subjective rating of
review quality is a reliable measure; whether it can be easily replaced by
more objective reviewer performance statistics, such as individual rate of
recommendation for acceptance; and whether ratings are correlated with reviewer
accuracy in detecting errors in manuscripts.
When making decisions about manuscripts, editors at Annals of Emergency Medicine routinely rate the quality of each peer
review on a subjective ordinal scale developed by senior editors. The scale
ranges from 1 (poor) to 5 (excellent) and is not specifically defined other
than as "review quality." No special training in use of the scale is provided,
although all editors receive a detailed job description plus the reviewer
orientation described below.
All manuscript peer reviews that had been rated by any editor between
January 1, 1994, and June 1, 1997, were eligible for study. When recruited
by the editorial board, reviewers receive a 4800-word, written orientation
on journal expectations for reviews and, with each subsequent review, fill
out a form rating 12 specific components of study design, originality, interest
for readers, and manuscript quality, plus free text comments for editors and
authors. Reviewers are routinely asked to recommend whether each manuscript
should be accepted.
Routine performance statistics are calculated by the journal for each
reviewer. They include the rates of recommending acceptance and rejection, and decision congruence, defined as whether the reviewer's recommendation
for each manuscript matched the editor's final decision regarding the manuscript.
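The routine statistics described above can be sketched in a few lines. This is only an illustration, assuming each review is stored as a (reviewer recommendation, editor decision) pair; the function name `reviewer_stats`, the labels "accept", "revise", and "reject", and the sample records are hypothetical and do not describe the journal's actual tracking system.

```python
from collections import Counter


def reviewer_stats(reviews):
    """Return acceptance rate, rejection rate, and decision congruence rate.

    `reviews` is a list of (recommendation, editor_decision) pairs for one
    reviewer. The labels "accept", "revise", "reject" are assumed here.
    """
    if not reviews:
        raise ValueError("at least one review is required")
    n = len(reviews)
    # Count how often the reviewer gave each recommendation.
    recommendations = Counter(rec for rec, _ in reviews)
    # A review is congruent when recommendation and editor decision match.
    matches = sum(1 for rec, decision in reviews if rec == decision)
    return {
        "accept_rate": recommendations["accept"] / n,
        "reject_rate": recommendations["reject"] / n,
        "congruence_rate": matches / n,
    }


# Hypothetical reviewer with four reviews (illustrative data only).
example = [
    ("accept", "revise"),
    ("reject", "reject"),
    ("revise", "revise"),
    ("reject", "accept"),
]
stats = reviewer_stats(example)
print(stats)
```

In this made-up example the reviewer recommended acceptance once in four reviews and agreed with the editor twice.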
The quality ratings were examined for interreviewer agreement using
intraclass correlation coefficients (ICCs) because reviews received only 1
editorial rating each, and the ICC takes into account the varying number of
scores per reviewer while still weighting each "data point" (paper, editor,
and the like) only once.
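As a rough illustration of how an ICC can accommodate a varying number of ratings per reviewer, the sketch below computes a one-way random-effects ICC with the usual adjustment for unequal group sizes. The text does not state which ICC form was used, so this choice is an assumption; the function name `icc_oneway` and the sample ratings are hypothetical.

```python
def icc_oneway(groups):
    """One-way random-effects ICC for groups of unequal size.

    `groups` is a list of lists; each inner list holds the ratings
    received by one reviewer (or one editor, or one manuscript).
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    if k < 2 or min(sizes) < 1 or n_total <= k:
        raise ValueError("need at least 2 groups and more ratings than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares (one-way ANOVA).
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Adjusted average group size for unbalanced designs.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)

    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


# Hypothetical ratings on the 1 to 5 scale for three reviewers.
ratings_by_reviewer = [[4, 4, 5], [2, 3], [3, 3, 4, 3]]
icc_example = icc_oneway(ratings_by_reviewer)
print(round(icc_example, 2))
```

Grouping the same ratings by editor or by manuscript instead of by reviewer would give the corresponding coefficients for those factors.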
As part of a separate study, in fall 1994 all active reviewers were
sent a fictitious manuscript to review; it contained 23 deliberate flaws,
and its true purpose was masked. The method is described in
detail elsewhere.1,2 The performance
of reviewers who reviewed the manuscript (and had at least 3 rated reviews)
in reporting manuscript flaws was compared with the ratings above.
During the study, 4644 reviews (2922 [63%] of them rated) were performed
by 756 reviewers on 1551 manuscripts. To focus on regular (not occasional
guest) reviewers, reviewers with 2 or fewer reviews were excluded, leaving
4161 reviews by 395 reviewers. A total of 2686 (65%) of these were rated by
36 editors, involving 973 manuscripts. The 395 peer reviewers who were studied
each reviewed an average of 10.5 manuscripts (95% confidence interval [CI],
9.8-11.2), 6.8 of which were rated (95% CI, 6.3-7.3), receiving an average
score of 3.7 (95% CI, 3.58-3.73). They recommended acceptance in 26% of their
reviews, rejection in 34%, and revision in 40%. Their decisions were congruent
with editors in 50% of these reviews.
Thirty-five percent of reviews were not rated by editors, presumably
because of workload, and were therefore excluded. These did not differ in
rate of recommendation for acceptance, decision congruence rate, or other
identifiable characteristics from rated reviews.
With higher review ratings, the editor's decision was more likely to
match the reviewer's recommendation. Reviews rated 1 (poor) demonstrated a
21% congruence between editor and reviewer; those rated a 2, 27% congruence;
3, 41% congruence; 4, 55% congruence; and 5 (excellent), 62% congruence. The
average rating of reviews with no decision congruence was 3.51 (SE, 0.03)
vs 3.99 (SE, 0.03) for those with congruence (P<.001).
Review ratings were not associated with the specific decision made by the
editor, however. Accepted papers had lower review scores (3.48; SE, 0.13)
than papers sent back for revision (3.74; SE, 0.03) or rejected (3.78; SE,
0.03) (P=.02). The weighted κ between reviewers
and editors for decision congruence was 0.034.
The ICC was 0.44 for reviewers, 0.24 for editors, and 0.12 for manuscripts,
meaning that about 20% of the variance seen in the review ratings was attributable
to the reviewer, 6% to the editor, and 1% to the manuscript.
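The percentages quoted above are numerically consistent with squaring each coefficient. That reading is an inference from the published figures rather than a computation spelled out in the text; the short check below, with hypothetical variable names, makes the arithmetic explicit.

```python
# Reported intraclass correlations for each factor.
iccs = {"reviewer": 0.44, "editor": 0.24, "manuscript": 0.12}

# Square each coefficient and express it as a whole-number percentage.
variance_pct = {name: round(100 * r ** 2) for name, r in iccs.items()}

for name, pct in variance_pct.items():
    print(f"{name}: ICC {iccs[name]:.2f} -> about {pct}% of variance")
```

The reviewer value comes out at 19%, which the text rounds to "about 20%"; the editor and manuscript values round to 6% and 1%.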
The reviewer's average quality rating correlated poorly with routine
performance statistics. The Spearman rank correlation coefficient was −0.34
for the rate of recommendation for acceptance, 0.12 for the rejection rate,
and 0.26 for the decision congruence rate.
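For readers unfamiliar with the Spearman coefficient, it is the Pearson correlation computed on ranks rather than raw values. The sketch below is a minimal pure-Python version, assuming tied values receive their average rank; the helper names and the five-reviewer data set are hypothetical and are not drawn from the study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mean quality ratings and acceptance-recommendation rates.
mean_rating = [3.1, 3.8, 4.2, 2.9, 4.5]
accept_rate = [0.40, 0.25, 0.20, 0.35, 0.10]
rho = spearman(mean_rating, accept_rate)
print(round(rho, 2))
```

The sketch does not guard against a constant input, for which the coefficient is undefined.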
Seventy-eight percent of all reviewers available to review in fall 1994
(a typical proportion) evaluated the fictitious manuscript. Sufficient ratings
were available for 124 reviewers, who also reviewed 2143 genuine manuscripts
during the study period and were rated by 30 editors, receiving a mean rating
of 3.65 (SD, 0.7). These reviewers had an average of 12 (SD, 5.3) years of
experience since residency; 40% were assistant professors, 38% were associate
professors, and 17% were full professors.
The reviewers detected a mean 3.4 (SD, 1.6) of the 10 major flaws (those
that invalidated or undermined the methods of the study) and 3.1 (SD, 1.7)
of the 13 minor flaws (those not affecting study validity) in the fictitious
manuscript. The mean editorial rating for each reviewer was modestly correlated
with the number of flaws they detected (R=0.45 for
major errors, 0.45 for minor errors, and 0.53 for total errors). Reviewers
with average ratings of 4 or more reported about twice as many total errors
as those with average ratings of 3 or less. Individual reviewer rate of recommendation
for acceptance, decision congruence, and various measures of experience were
not associated with detection of errors, nor was academic rank (Table 1).
Half of journal editors rely almost exclusively on reviewer recommendations
when making acceptance decisions.3 Maintaining
the quality of peer reviewers is therefore essential, but most journals have
too many reviewers for editors to know their capabilities personally. Most
journals do not train reviewers or assess their background in research methodology
or critical literature review.4 Editors need
a simple rating system to monitor review quality, but this rating should require
minimal additional work and should add unique value.
Although the reliability of ratings of manuscripts by reviewers has
been previously examined, there is only 1 prior study that examined the reliability
of ratings of reviewers by editors.5 (An earlier
study used a 9-axis scale of review quality, but did not test the scale's
reliability.6) The study by Feurer et al5 examined 53 reviews of 23 manuscripts, rated by 3
editors. Editors used a scale of 7 axes with scores of varying complexity,
guided by a detailed written explanation. The scale included timeliness standards,
whether the reviewer wrote on the manuscript or provided supporting references,
and other criteria not directly related to quality of scientific assessment.
It demonstrated an excellent ICC of 0.84.
By contrast, we chose a scale that was simpler to use and provided a
fair to good ICC of 0.44. It performed in a satisfactory fashion compared
with scales used by a specially trained panel of experienced researchers in
rating manuscripts7,8 and exceeded
the performance of a multi-axis scale used by peer reviewers for the same
purpose.9 It markedly exceeded the reliability
of scoring systems used for awarding National Science Foundation grants.10 However, it correlated poorly with routine reviewer
performance statistics such as rate of recommendation for acceptance and congruence
rate. This is logical, since many factors other than just review and manuscript
quality affect editorial decisions. It also means that these statistics cannot
replace a subjective scale to assess review quality.
The rating scale also showed a modest correlation (R=0.53) with the ability of reviewers to detect deliberate flaws in
a fictitious manuscript. The more flaws they were able to detect and report,
the more likely they were to produce reviews judged excellent by editors.
Reviewers with high average ratings reported twice as many errors as those
with low ratings. Reviewers reported only a minority of such flaws, similar
to a prior study.11 There is no "gold standard"
for judging a quality peer review, but this at least suggests that our subjective
quality rating may be related to accuracy and/or thoroughness.
Our scale had only 1 axis and the definition of quality was not specific;
previous research has suggested that multiple, more specific scales increase
the reliability of ratings.9 Despite this,
our ICC was higher than those of previous multi-axis scales used to rate manuscripts,7,8 although lower than that of the much
more complex scale of Feurer et al,5 which
included objective data.
We have found this simple scale useful as part of a comprehensive evaluation
to identify poor reviewers who should be dropped and excellent reviewers who
deserve acknowledgment. Its reliability could be improved by defining the
qualities of a good review more specifically and by formally training editors
in its use.
In conclusion, this simple 5-point scale for editors to rate peer review
quality is acceptably reliable and correlates modestly with reviewers' reporting
of flaws in a test manuscript. It provides a useful tool for monitoring reviewer
performance and cannot be replaced by routine reviewer statistics such as
individual rate of recommendation for acceptance and decision congruence.
Callaham ML, Baxt WG, Waeckerle JF, Wears RL. Reliability of Editors' Subjective Quality Ratings of Peer Reviews
of Manuscripts. JAMA. 1998;280(3):229–231. doi:10.1001/jama.280.3.229