Yeates P, O’Neill P, Mann K, Eva KW. Effect of Exposure to Good vs Poor Medical Trainee Performance on Attending Physician Ratings of Subsequent Performances. JAMA. 2012;308(21):2226-2232. doi:10.1001/jama.2012.36515
Author Affiliations: Schools of Translational Medicine (Dr Yeates) and Medicine (Drs O’Neill and Mann), University of Manchester (Drs Yeates, O’Neill, and Mann), University Hospital of South Manchester (Drs Yeates and O’Neill), Manchester, United Kingdom; Division of Medical Education, Dalhousie University, Halifax, Nova Scotia, Canada (Dr Mann); and Department of Medicine, Centre for Health Education Scholarship, University of British Columbia, Vancouver, Canada (Dr Eva).
Context Competency-based models of education require assessments to be based on individuals' capacity to perform, yet the nature of human judgment may fundamentally limit the extent to which such assessment is accurately possible.
Objective To determine whether recent observations of the Mini Clinical Evaluation Exercise (Mini-CEX) performance of postgraduate year 1 physicians influence raters' scores of subsequent performances, consistent with either anchoring bias (scores biased similar to previous experience) or contrast bias (scores biased away from previous experience).
Design, Setting, and Participants Internet-based randomized, blinded experiment using videos of Mini-CEX assessments of postgraduate year 1 trainees interviewing new internal medicine patients. Participants were 41 attending physicians from England and Wales experienced with the Mini-CEX, with 20 watching and scoring 3 good trainee performances and 21 watching and scoring 3 poor performances. All then watched and scored the same 3 borderline video performances. The study was completed between July and November 2011.
Main Outcome Measures The primary outcome was scores assigned to the borderline videos, using a 6-point Likert scale (anchors included: 1, well below expectations; 3, borderline; 6, well above expectations). Associations were tested in a multivariable analysis that included participants' sex, years of practice, and the stringency index (within-group z score of initial 3 ratings).
Results The mean rating score assigned by physicians who viewed borderline video performances following exposure to good performances was 2.7 (95% CI, 2.4-3.0) vs 3.4 (95% CI, 3.1-3.7) following exposure to poor performances (difference of 0.67 [95% CI, 0.28-1.07]; P = .001). Borderline videos were categorized as consistent with failing scores in 33 of 60 assessments (55%) in those exposed to good performances and in 15 of 63 assessments (24%) in those exposed to poor performances (P < .001). They were categorized as consistent with passing scores in 5 of 60 assessments (8.3%) in those exposed to good performances compared with 25 of 63 assessments (39.5%) in those exposed to poor performances (P < .001). Sex and years of attending practice were not associated with scores. The priming condition (good vs poor performances) and the stringency index jointly accounted for 45% of the observed variation in raters' scores for the borderline videos (P < .001).
Conclusion In an experimental setting, attending physicians exposed to videos of good medical trainee performances rated subsequent borderline performances lower than those who had been exposed to poor performances, consistent with a contrast bias.
The usefulness of performance assessments within medical education is limited by high interrater score variability,1 which neither rater training2 nor changes in scale format3 have successfully ameliorated. Several factors may explain raters' score variability,4 including a tendency of raters to make assessments by comparing against other recently viewed learners, rather than by using an absolute standard of competence.5 This has the potential to result in biased judgments.
Competency-based models of medical education require that judgments be made against a fixed or absolute level of ability.6 However, evidence from psychology and behavioral economics suggests that judgments tend to be relational. Two opposite effects are possible. The first is anchoring bias, in which recent experiences remain activated in observers' minds, causing them to pay undue attention to similar features in subsequent experiences.7 As a consequence, they may offer judgments biased toward recently viewed anchor experiences, so that a recent experience of good performances may tend to increase subsequent scores compared with recently viewed poor performances. The second is contrast (or relativity) bias, in which judgments are influenced by the perceived relative rank of items in the immediate context8; thus, recent experience of good performances may tend to decrease subsequent scores compared with recent experience of poor performances.
Anchoring and contrast effects have been observed in many contexts, including perception of physical objects, interpersonal judgments, and various occupational roles,9 including clinical medicine.10 Both contrast and anchoring effects have been observed in performance assessment. Although mediating conditions have been investigated,11 it is not clear which influence may predominate in medical education. Although studies have demonstrated sequence effects in medical education (ie, score increases over successive performances),12 these studies cannot disentangle changes in trainee performance from bias in examiner judgments.
We therefore investigated whether raters' recent exposure to good vs poor performances influenced their scores of subsequent borderline performances, as well as the magnitude of any such bias relative to individual raters' tendencies to score stringently or leniently,13 their sex,14 and their practice experience.14
The study population consisted of consultant physicians from England and Wales (comparable with US specialist attending physicians) working in specialties associated with internal medicine. We included emergency medicine physicians who trained via the examinations system of the Royal College of Physicians because they are frequently involved in the supervision of trainees providing care for patients with emergency presentations of internal medicine problems; this group is highly comparable in their assessments with internal medicine physicians.15
Additionally, participants were required to have worked as consultants in the United Kingdom for at least 2 years, to estimate that they assessed trainees using the Mini Clinical Evaluation Exercise (Mini-CEX) at least 5 times per year, and to indicate that they felt comfortable assessing trainees on internal medicine case material. Mini-CEX assessments are direct observation assessments of trainees' clinical skills, in which a clinical performance is assessed and the trainee is given scores and feedback. Physicians were excluded if they had participated in previous studies by our group.
Recruitment was aimed at all UK consultant physicians. A standard e-mail invitation was sent by the national UK Foundation Programme to regional foundation directors, and then forwarded to foundation tutors in individual hospitals, and subsequently to individual consultants. In anticipation that e-mails may not have reached all areas, and to ensure geographical representativeness, follow-up direct e-mailing was used to increase recruitment in areas from which few participants had volunteered.
Interested and eligible individuals were invited to e-mail the research team, and were subsequently provided with further information, the study's web address, and a password. It is not possible to determine the proportion of UK consultant physicians who were eligible or who received e-mails. Researchers e-mailed 662 physicians directly, although the number of e-mails that were not received or went unread is unknown. Characteristics of the participants appear in Table 1.
Ethical approval was obtained from the Yorkshire and the Humber regional ethics committees. Consent was obtained online before beginning the study. Participants were not paid for their involvement in the study.
The study used an Internet-administered experimental design, and randomized participants to 1 of 2 experimental groups. In the intervention phase, one group was primed by viewing 3 videos of good performances (ie, those that were competent for the stage of training, with some features of excellence) by postgraduate year 1 (PGY1) medical trainees. The other group was primed by viewing 3 videos of poor performances (ie, those that were below the standard of competence in multiple ways) by PGY1 medical trainees at the same level. In the comparison phase, both groups viewed 3 identical videos of borderline performances (ie, those that were marginally below the expected standard) in the same order.
To avoid confounding due to assessor × case variations, the same 3 cases were used for both groups, repeating between the intervention and comparison phases, but showing different performance levels. Within groups, all performances were viewed in the same order, and both groups saw the borderline performances in the same order. Participants were instructed to imagine that the trainees in the videos had requested a Mini-CEX assessment and that they were being asked to judge and score the performances accordingly.
Mini-CEX assessments are required for all PGY1 trainees nationally, so all supervising consultants should be familiar with them. Consultants are locally trained in conduct of the Mini-CEX. To preserve the ecological validity of our findings, participants did not undergo additional training. Participants were blinded to the study's premise; they were informed the study would investigate “an aspect of the way assessors make decisions.” Participants scored performances consecutively using the UK Foundation Programme standard Mini-CEX scoring format (described below). Demographic data were collected after all cases had been viewed and rated. The study was completed between July and November 2011.
The study videos featured scripted performances of PGY1 trainees interviewing simulated patients; the performances did not represent the actual skills of the featured trainees. Videos were based on 3 clinical cases: pleuritic chest pain in a 54-year-old woman (case 1); unexplained loss of consciousness in a 34-year-old man (case 2); suspected upper gastrointestinal bleeding in a 44-year-old man (case 3).
Three different PGY1 trainees (A, B, and C) were featured in the videos, with each working up each case only once. Owing to video content, the good performance–primed group saw all 3 trainees (A, B, and C) in the priming videos; whereas the poor performance–primed group saw trainee C twice and trainee A once during the priming phase. Borderline videos (seen by both groups) featured trainee B for case 1 (pleuritic chest pain) and case 3 (upper gastrointestinal bleeding) and trainee A for case 2 (transient loss of consciousness). Consequently, the PGY1 trainee featured in the borderline video for case 2 was known to both priming groups, whereas the PGY1 trainee in the borderline video for cases 1 and 3 was known only to the good performance–primed group (Table 2). To enable participants to view multiple videos without placing undue strain on concentration or memory, previously used videos were shortened to approximately 4 minutes each. Each focused on the history of the presenting complaint and sections that involved explaining likely investigations or provisional diagnoses.
Scripts and videos were developed and validated for a previous study5 and demonstrated distinct levels of performance. For this study, we assessed whether they still ranked appropriately following editing by having a 6-member expert panel rank order the videotaped performances within each case. Cases 1 and 2 showed complete agreement with the intended order; for case 3, 5 of 6 experts ordered videos as intended, and 1 expert reversed the order of the borderline and poor performances.
Scores were collected using the standard format of the UK Foundation Programme.16 This required assigning scores to 7 different domains: history taking, physical examination, communication skills, critical judgment, professionalism, organization/efficiency, and overall clinical care. Scores are given on a 6-point Likert scale with the options of 1, well below expectations for PGY1 (foundation) trainee completion; 2, below expectations; 3, borderline; 4, meets expectations for PGY1 trainee completion; 5, above expectations; and 6, well above expectations. There was also an option of unable to comment.
Based on the distribution of trainees' scores in prior research,17 we considered a difference in scores between groups of 0.5 on the assessments' 6-point scale to be the minimum meaningful difference. A power calculation based on pilot data indicated that a difference in scores of 0.5 between groups could be detected at 80% power with approximately 30 participants. On this basis, we set an a priori recruitment target of 40 participants.
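The sample-size reasoning above can be sketched with a standard normal-approximation formula for a two-sample comparison. The paper reports only the target difference (0.5 scale points) and the resulting figure of approximately 30 participants; the pilot SD is not reported, so the value of sigma used below is an assumption chosen purely for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sample comparison (normal
    approximation): n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # approx 1.96
    z_beta = NormalDist().inv_cdf(power)           # approx 0.84
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# delta = 0.5 is the paper's minimum meaningful difference; sigma = 0.49 is
# an assumed pilot SD (not reported in the paper), chosen for illustration.
n = n_per_group(delta=0.5, sigma=0.49)
total = 2 * n  # total participants required under these assumptions
```

Under these assumed inputs the calculation lands in the region of 30 participants, consistent with the recruitment target described above.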
Participants' scores were averaged across the 7 domains to give a single score for each video. Missing data (including unable to comment responses) were excluded from the denominator in the average, so that resulting scores were the average of the available scores. The primary question was addressed through a mixed-design analysis of variance with the dependent variable being the scores participants assigned to each of the 3 borderline performances (case being the repeated measure) and a between-participant factor of the experimental group.
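The per-video scoring rule described above (average of the 7 domain scores, with missing and unable-to-comment responses excluded from the denominator) can be sketched as follows; the domain scores shown are hypothetical.

```python
def video_score(domain_scores):
    """Mean of the 7 Mini-CEX domain scores for one video. None marks a
    missing or 'unable to comment' response and is excluded from the
    denominator, per the averaging rule described in the text."""
    present = [s for s in domain_scores if s is not None]
    return sum(present) / len(present)

# Hypothetical ratings for one video; 'unable to comment' on physical
# examination, so the score is the mean of the 6 available ratings.
score = video_score([3, None, 4, 3, 4, 3, 3])
```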
Although Likert items produce ordinal data, the combination of multiple items produces interval data18 that can be validly analyzed by parametric means.19,20 Effect sizes were calculated using the Cohen d statistic (small, 0.1 to <0.5; medium, 0.5 to <0.8; large, ≥0.8). The frequencies with which participants categorized performances as consistent with failure (defined as score <3) or passing (defined as score ≥4) were compared between the 2 priming groups using χ2 tests.
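The χ2 comparison of pass/fail categorization frequencies can be reproduced from the counts reported in the Results (33/60 failing categorizations in the good-primed group vs 15/63 in the poor-primed group). This is a stdlib-only sketch of the standard Pearson χ2 test for a 2×2 table, using the identity that for 1 df the χ2 tail probability equals erfc(√(x/2)); the paper does not specify its software routine or whether a continuity correction was applied, so none is used here.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the 2x2
    table [[a, b], [c, d]]; p value via P(X > x) = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, math.erfc(math.sqrt(stat / 2))

# Failing categorizations: 33 of 60 (good-primed) vs 15 of 63 (poor-primed).
stat, p = chi2_2x2(33, 60 - 33, 15, 63 - 15)
```

On these counts the test yields p < .001, matching the significance level reported in the paper.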
The clinical case × group interaction (groups primed with good vs poor performance) was examined to determine if group differences were uniform across cases. Pairwise comparisons were made using t tests with Bonferroni correction applied for 3 comparisons; this approach adjusted the P values so that the criterion for significance remained at a 2-sided P value of .05.21 Analysis of covariance was used to repeat the analysis with duration of consultancy as a covariate to check for any confounding due to this variable. The frequency of expert assessors (defined as >7 years of experience22) was determined in each group.
We determined how consistently raters scored stringently (ie, scores lower than participants' mean) or leniently (ie, scores higher than participants' mean) by calculating an intraclass correlation (class 2, accuracy) within each group. This analysis used participants' scores across all 6 observed videos, and treated raters as the facet of differentiation (ie, the numerator in the equation). The resulting coefficient describes the consistency with which raters could be differentiated based on the scores they assigned. We calculated a within-group z score (stringency index) for each participant based on their mean ratings assigned to the first 3 videos during the intervention phase. This gave a measure of how comparatively high or low each participant scored the intervention phase performances relative to their group.
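The stringency index described above (a within-group z score of each rater's mean rating across the 3 priming-phase videos) can be sketched directly; the ratings below are hypothetical, and the sample-SD denominator is an assumption, since the paper does not state which SD was used.

```python
from statistics import mean, stdev

def stringency_index(mean_ratings):
    """Within-group z score of each rater's mean priming-phase rating:
    (rater mean - group mean) / group SD. Sample SD is assumed here;
    the paper does not specify the denominator."""
    m, s = mean(mean_ratings), stdev(mean_ratings)
    return [(r - m) / s for r in mean_ratings]

# Hypothetical mean priming-phase ratings for 5 raters in one group;
# positive z = comparatively lenient, negative z = comparatively stringent.
z = stringency_index([4.2, 3.8, 4.6, 3.5, 4.0])
```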
Multiple linear regression was used to assess potential moderating factors. The dependent variable was the mean of the scores assigned to the 3 borderline videos by each participant. Independent variables were duration of consultancy (continuous), sex (categorical), the stringency index (continuous), and priming condition (good vs poor; categorical). All 4 independent variables underwent forced entry into the regression.
Statistically significant variables were subsequently entered hierarchically to determine if they offered incremental explanatory power. Because the concept of stringency has been previously described, it was entered first and then the novel variable of priming condition was entered last. Interactions between significant variables were then examined using univariate analysis of variance. All statistical analyses were conducted using SPSS software version 15 (SPSS Inc).
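The hierarchical step described above — entering stringency first, then priming condition, and reading off the incremental variance explained — can be illustrated with ordinary least squares on synthetic data. Everything below is hypothetical (the raw study data are not available); the sketch only shows the mechanics of computing ΔR² between nested models, using stdlib-only normal equations.

```python
def ols_r2(y, X):
    """R^2 from OLS with intercept, solving the normal equations
    (X'X)beta = X'y by Gaussian elimination with partial pivoting."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))  # pivot row
        A[i], A[p], b[i], b[p] = A[p], A[i], b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [arj - f * aij for arj, aij in zip(A[r], A[i])]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    yhat = [sum(bi * ri for bi, ri in zip(beta, r)) for r in rows]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: stringency index z, priming group (0 = poor, 1 = good),
# and mean borderline score y (good-primed raters scoring lower, as observed).
z = [-1.2, -0.5, 0.1, 0.8, 1.3, -1.0, -0.3, 0.2, 0.7, 1.1]
g = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [2.9, 3.1, 3.3, 3.5, 3.7, 2.2, 2.4, 2.6, 2.8, 3.0]

r2_stringency = ols_r2(y, [[zi] for zi in z])            # step 1: stringency only
r2_full = ols_r2(y, [[zi, gi] for zi, gi in zip(z, g)])  # step 2: add priming group
delta_r2 = r2_full - r2_stringency                       # incremental variance explained
```

The quantity delta_r2 is the analogue of the 24% of additional variance that the experimental group explained in the study's model.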
There are approximately 8500 consultant physicians in the United Kingdom. Eighty individuals agreed to take part by the study's close; 45 were randomized by the time recruitment exceeded the a priori target of 40 participants, and 41 completed the study (Figure). Of study participants, 32% were women compared with 31% of secondary care consultants in the United Kingdom (Table 1).23
Participants were drawn from 12 of the 15 English and Welsh postgraduate training deaneries (geographical training regions),24 and included 14 of 18 specialties. Consequently, the sample was broadly representative of the target population. Missing scores accounted for 11% of possible responses in the good performance–primed group and 8% in the poor performance–primed group. Participants in the good performance–primed group had a higher mean duration of consultancy than those in the poor performance–primed group (13 years [95% CI, 10-16 years] vs 8 years [95% CI, 5-11 years]; P = .03).
The good performance–primed group contained more cardiologists than the poor performance–primed group (4 vs 0; P = .03). The main analysis was rerun with cardiologists excluded, and there was no alteration in the findings. Other differences between groups were not significant.
The mean scores assigned to each of the 3 borderline videos were higher for participants who were primed with poor performances compared with those who were primed with good performances (3.4 [95% CI, 3.1 to 3.7] vs 2.7 [95% CI, 2.4 to 3.0], respectively; difference of 0.67 [95% CI, 0.28 to 1.07]; P = .001; Table 2). The Cohen d statistic of 0.63 indicated a moderate effect size due to experimental manipulation. The case × group interaction also was statistically significant (P = .01), indicating that the magnitude of the between-group differences varied across cases (case 1, 0.3 [95% CI, −0.22 to 0.84]; case 2, 1.0 [95% CI, 0.59 to 1.42]; case 3, 0.7 [95% CI, 0.24 to 1.17]; Table 3).
Borderline performances were categorized as consistent with failing scores in 33 of 60 assessments (55%) in those exposed to good performances and in 15 of 63 assessments (24%) in those exposed to poor performances (P < .001). They were categorized as consistent with passing scores in 5 of 60 assessments (8.3%) in those exposed to good performances compared with 25 of 63 assessments (39.5%) in those exposed to poor performances (P < .001).
Post hoc pairwise comparisons showed that group differences were significant for cases 2 and 3, but not for case 1 (Table 3). The nonsignificant effect seen in case 1 was in the same direction as the other 2 cases. In the analyses of covariance, there was no significant interaction with duration of consultancy (P = .38), and adjusting for this variable altered the main finding only minimally. There was no significant interaction between experimental condition and rater expertise as defined by 7 years or more of consultant experience (P = .65).
Participants showed a moderately consistent tendency to be either stringent or lenient compared with the other raters in their group. The intraclass correlation was 0.51 for participants primed with good performances and 0.58 for participants primed with poor performances.
Neither participants' sex nor their duration of consultancy showed a statistically significant relationship with the mean scores they gave to the borderline performances. The priming condition (good vs poor performances) and the stringency index showed significant relationships, jointly accounting for 45% of the observed variation in raters' scores for the borderline performances (P < .001) (Table 3). In the hierarchical regression model, raters' stringency index explained 18% of the observed variance, whereas adding experimental group explained a further 24% of the observed score variance. There was no significant interaction between stringency index and study group by univariate analysis of variance (P = .46). Consequently, the influence of recent experience was not different for stringent or lenient raters.
In this study, recently viewed medical trainee performance videos influenced raters' scoring of subsequent performances. The findings were consistent with a contrast effect in that viewing good performances resulted in lower borderline performance scores relative to viewing poor performances. These findings support the notion that recent experience biases raters' performance assessments, and suggest that such biases are not due to anchoring.
The size of this effect in clinical terms is important. In a study by Mitchell et al,17 the mean Mini-CEX score for UK PGY1 and 2 trainees was 3.91 (SD, 0.38). Had the mean effect we observed (0.67 scale points) occurred in that group of trainees, it would have accounted for a change of 1.76 SDs. Case 2 would have been ranked near the bottom of the cohort on the basis of scores given by participants primed with good performances, but above the middle of the cohort on the basis of scores given by participants primed with poor performances.
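The magnitude quoted above follows directly from the figures cited: the observed between-group difference divided by the SD of Mini-CEX scores reported by Mitchell et al.

```latex
\frac{0.67\ \text{scale points}}{0.38\ \text{scale points per SD}} \approx 1.76\ \text{SDs}
```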
Participants in this study were primed with performances that were either consistently good or poor. It is unclear whether an effect would occur with less consistent performance across candidates (ie, mixed performances). Perhaps a single good or poor performance could bias scoring on the subsequent candidate, or conversely, it may be that raters assimilate all recent performances and compare against their average level.
In sequential examination formats, such as objective structured clinical examinations, candidates may follow good or poor colleagues sequentially through a series of clinical stations.25 If a contrast effect were to occur after a single extreme candidate (good or poor), then the following candidate might receive biased scores that say as much about the ability of the preceding candidate as about that of the current examinee.
Furthermore, because trainees often work in consistent pairings or small groups, consultants may consistently compare a given trainee against the same individuals when conducting workplace-based assessments. They may consistently contrast a trainee against either a very good or very poor peer, potentially biasing the judgments of a trainee and consequently the educational feedback provided, thereby having important implications for individual trainees' development.
Raters who have more than 7 years of preceptor experience have more complex assessment-related knowledge structures than less experienced raters.22 Thus it is surprising, given that participants had on average been consultants for approximately 10 years, that viewing just 3 performances was enough to induce the observed effect and that the effect was unrelated to duration of consultancy. This suggests that despite considerable experience, raters may still not possess well-developed fixed criteria against which to judge observed performance.26 In other contexts, such as clinical reasoning, experts have been shown to be just as susceptible to heuristics as novices,27 which is consistent with this interpretation.
The study findings need to be considered in the context of its limitations. The videos depicted patient interviews within internal medicine and we studied only the judgments of consultant physicians from England and Wales. Consequently, findings may not generalize beyond this setting, although we have no reason to believe populations would differ with respect to the outcomes we measured. Despite demonstrating comparability with the general population, we cannot exclude some respondent bias in that participants who chose to take part in our study may have been more interested in education or assessment than their nonparticipating colleagues. However, because participants were randomized to intervention study groups, this would not have influenced the study's internal validity. It is possible that individuals who are less enthusiastic about education may possess a less developed understanding of the assessment process and thus might have shown an even greater effect.
The study demonstrated the influence of recent experience on borderline performances; whether this influence occurs when evaluating better or worse levels of performance (ie, beyond the borderline cases studied herein) will require further investigation. The observed contrast effect was restricted to cases 2 and 3; no significant difference was seen for case 1, although the direction of the observed effect was consistent. The reason for this is unclear, although context specificity is a well-established phenomenon in medical education.28
The observed effect occurred whether or not groups of participants had previous exposure to the trainees in the borderline videos, which supports the robustness of the effect and offers an area for further research aimed at exploring any mediating influence of the individual being observed.
The apparent contrast bias accounted for 24% of the observed score variance in addition to raters' tendency to be consistently stringent or lenient. Raters were only moderately consistent in their stringency or leniency despite common intuition that some examiners are harder graders than others. Neither duration of consultancy nor evaluator sex had any relationship to score variation. Thus, further study into the sources of raters' variability is required.
We recommend that our study be repeated in other contexts, particularly with reference to whether this effect occurs across the spectrum of performance quality, for other examination formats, and for other groups of learners and raters. The mediating role of case specificity and mixed performance (recent performances of differing quality) also should be investigated.
With the movement toward competency-based models of education, assessment has largely shifted to a system that relies on judgments of performance compared with a fixed standard at which competence is achieved (criterion referencing).6 Although this makes conceptual sense (with its inherent ability to reassure both the profession and the public that an acceptable standard has been reached), the findings in this study, which are consistent with contrast bias, suggest that raters may not be capable of reliably judging in this way.
Corresponding Author: Peter Yeates, MBBS, MClinEd, University Hospital of South Manchester, Southmoor Road, Manchester, M23 9LT, United Kingdom (email@example.com).
Author Contributions: Dr Yeates had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Yeates, O’Neill, Mann, Eva.
Acquisition of data: Yeates.
Analysis and interpretation of data: Yeates, O’Neill, Mann, Eva.
Drafting of the manuscript: Yeates.
Critical revision of the manuscript for important intellectual content: Yeates, O’Neill, Mann, Eva.
Statistical analysis: Yeates.
Administrative, technical, or material support: Yeates.
Study supervision: O’Neill, Mann, Eva.
Conflict of Interest Disclosures: The authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Dr Yeates reported receiving reimbursement for travel costs from Medtronic Inc to attend a meeting to discuss future collaboration on an unrelated research project. Dr Mann reported receiving honoraria for her role as chair of the objectives committee for the Medical Council of Canada; receiving grant income from the Social Sciences and Humanities Research Council of Canada (coinvestigator) for a project on technology in a new curriculum and from the Society for Academic CME for a project on a study of facilitated reflection; and receiving reimbursement for travel costs (without additional honoraria) for invited guest speaker engagements at Durham University, UK, in July 2012, Utrecht Medical Centre in March 2012, and at University College London, UK, in June 2012. No other authors reported conflicts of interest.
Funding/Support: Dr Yeates received a traveling fellowship award from the Association for the Study of Medical Education that was used in support of this study.
Role of the Sponsor: The Association for the Study of Medical Education had no role in the design and conduct of the study; in the collection, analysis, and interpretation of the data; or in the preparation, review, or approval of the manuscript.
Additional Contributions: We thank Julie Morris, MSc (University of Manchester, University Hospital of South Manchester), for providing statistical advice to the project. Ms Morris did not receive payment for her work. We also thank our expert review panel for their assistance in video validation; the physicians and simulated patients featured within the videos; the UK Foundation Programme and its associated regional foundation schools for their assistance with recruitment; and the participants who took part in the study. The physicians featured in the videos received a small gratuity for their time, and the simulated patients received professional rates of pay.