Figure 1.  Internal Medicine Milestone Example Showing the First Professionalism Subcompetency Milestone (PROF1) Including Critical Deficiencies From the ACGME/ABIM “Internal Medicine Milestone Project”

Adapted with permission from the Accreditation Council for Graduate Medical Education and the American Board of Internal Medicine (ACGME/ABIM), copyright 2012.

Figure 2.  Mean Resident Annual Evaluation Summaries and Milestone Ratings by Program Year

a The 2 scales represent different ordinal scales, and the new milestone scale uses more criterion-based (vs normative) ratings. A full unit change in the resident annual evaluation summary score is equivalent to a 0.5 unit change on the milestone score.

b The regression-adjusted change in mean milestone ratings across program years differs significantly from the change in mean resident annual evaluation summary (RAES) ratings (P < .001). The regression-adjusted change was estimated using a stacked linear regression with resident fixed effects and a cluster adjustment to account for each resident having 2 ratings (RAES and milestones) as well as being nested in programs.

Figure 3.  Internal Medicine Certification Examination Score Distribution by Medical Knowledge Ratings and Milestone Ratings Among Postgraduate-Year 3 Residents Who Attempted the 2014 Certification Examination (n = 6260)

The boxes indicate the first through third quartiles; the lines in the boxes, the median; and the whiskers, end points. Values outside the whiskers, shown as circles, are outliers. The horizontal dashed line indicates the minimum passing examination score.

Table 1.  Demographic Characteristics of 21 284 Internal Medicine Residents in the 2013-2014 Academic Year
Table 2.  Spearman Correlation Between the Mean Resident Annual Evaluation Summary Score and Milestone Ratings by Program Year
Table 3.  Comparison of Resident Annual Evaluation Summary Ratings for Professionalism With the Resident’s Lowest Rating Across Professionalism Milestones, Postgraduate Year 1 to 3 Residents
Original Investigation
December 6, 2016

Correlations Between Ratings on the Resident Annual Evaluation Summary and the Internal Medicine Milestones and Association With ABIM Certification Examination Scores Among US Internal Medicine Residents, 2013-2014

Author Affiliations
  • 1University of California at San Francisco
  • 2American Board of Internal Medicine, Philadelphia, Pennsylvania
  • 3Hess Consulting, Lévis, Québec, Canada
  • 4Accreditation Council for Graduate Medical Education, Chicago, Illinois
  • 5Commonwealth Medical College, Scranton, Pennsylvania
JAMA. 2016;316(21):2253-2262. doi:10.1001/jama.2016.17357
Key Points

Question  What information do milestone-based ratings add to the former nondevelopmental rating system for internal medicine residents regarding their competence, medical knowledge, and professionalism?

Findings  In a cross-sectional study involving 21 284 internal medicine residents, milestone-based assessment was correlated with nondevelopmental ratings, but with a greater difference across the training years; it was also correlated with American Board of Internal Medicine certification examination scores.

Meaning  These findings provide some preliminary evidence to support the validity of milestone-based assessment.

Abstract

Importance  US internal medicine residency programs are now required to rate residents using milestones. Evidence of validity of milestone ratings is needed.

Objective  To compare ratings of internal medicine residents using the pre-2015 resident annual evaluation summary (RAES), a nondevelopmental rating scale, with developmental milestone ratings.

Design, Setting, and Participants  Cross-sectional study of US internal medicine residency programs in the 2013-2014 academic year, including 21 284 internal medicine residents (7048 postgraduate-year 1 [PGY-1], 7233 PGY-2, and 7003 PGY-3).

Exposures  Program director ratings on the RAES and milestone ratings.

Main Outcomes and Measures  Correlations of RAES and milestone ratings by training year; correlations of medical knowledge ratings with American Board of Internal Medicine (ABIM) certification examination scores; rating of unprofessional behavior using the 2 systems.

Results  Corresponding RAES ratings and milestone ratings showed progressively higher correlations across training years, ranging among competencies from 0.31 (95% CI, 0.29 to 0.33) to 0.35 (95% CI, 0.33 to 0.37) for PGY-1 residents to 0.43 (95% CI, 0.41 to 0.45) to 0.52 (95% CI, 0.50 to 0.54) for PGY-3 residents (all P values <.05). Linear regression showed ratings differed more between PGY-1 and PGY-3 years using milestone ratings than the RAES (all P values <.001). Of the 6260 residents who attempted the certification examination, the 618 who failed had lower ratings using both systems for medical knowledge than did those who passed (RAES difference, −0.9; 95% CI, −1.0 to −0.8; P < .001; milestone medical knowledge 1 difference, −0.3; 95% CI, −0.3 to −0.3; P < .001; and medical knowledge 2 difference, −0.2; 95% CI, −0.3 to −0.2; P < .001). Of the 26 PGY-3 residents with milestone ratings indicating deficiencies on either of the 2 medical knowledge subcompetencies, 12 failed the certification examination. Correlation of RAES ratings for professionalism with residents’ lowest professionalism milestone ratings was 0.44 (95% CI, 0.43 to 0.45; P < .001).

Conclusions and Relevance  Among US internal medicine residents in the 2013-2014 academic year, milestone-based ratings correlated with RAES ratings but with a greater difference across training years. Both rating systems for medical knowledge correlated with ABIM certification examination scores. Milestone ratings may better detect problems with professionalism. These preliminary findings may inform establishment of the validity of milestone-based assessment.

Introduction

Residency training programs must assess residents’ performance to ensure progress toward competence.1-4 Before 2014, internal medicine programs rated residents using the resident annual evaluation summary (RAES), with 9 items each rated from unsatisfactory to superior. Subsequently, internal medicine milestones were introduced, and residency programs were required to submit milestone ratings to the Accreditation Council for Graduate Medical Education (ACGME) and American Board of Internal Medicine (ABIM). The change occurred because milestone ratings may better guide feedback to residents, inform programs about their curricula, and demonstrate that residencies use meaningful data to determine competence and continuously improve.5 The fundamental difference with milestones is the characterization of an expected developmental trajectory through training, whereas the prior rating form typically generated high ratings throughout training. In a national cross-sectional study of US internal medicine residents from postgraduate year (PGY) 1 to PGY-3, milestone-based ratings differed by residents’ training year.6 Pediatric milestone ratings within a single training program also stratified residents by training year, whereas nonmilestone-based ratings did not.7

It is critically important to assess the validity of any new assessment. Validity evidence is needed to determine the degree to which milestone ratings measure trainees’ provision of high-quality patient care.8,9 Validity evidence includes the response processes of raters (who for milestone ratings are program directors working with clinical competency committees) and the relationship of ratings to other performance measures.

This study examined evidence of validity for internal medicine milestone ratings, with 3 hypotheses. First, milestone ratings compared with nondevelopmental RAES ratings would display a greater range of performance across training levels. Second, RAES and milestone ratings of knowledge would both correlate with internal medicine certification examination scores. Third, residents with lower milestone ratings for some professionalism subcompetencies would be rated lower on the RAES, and a greater number of residents would be rated low in professionalism using milestones rather than the RAES.

Methods

This was a cross-sectional study of all internal medicine residents in US training programs accredited by the ACGME in the 2013-2014 academic year. The study used the Messick unitary concept of validity, in which all validity evidence supports construct validity.10,11 Residency directors submitted 2 types of performance information about their residents at the end of the 2013-2014 academic year, described below. This unique situation arose because the ACGME had begun requiring milestone reporting while the ABIM still required data submission using the RAES. Residents of combined specialty programs such as medicine-pediatrics were included; PGY-1 residents in preliminary positions were excluded. The Institutional Review Board at the University of California, San Francisco, approved the study as exempt.

Demographic data included residents’ age; sex; place of birth and medical school location (United States or international); osteopathic vs allopathic training; type of training program (university based, community based university affiliated, community based, military based); Department of Health and Human Services (HHS) geographic region12; and county type (urban or rural).13 Residents’ data from the ACGME and ABIM were matched using first and last name, last 4 digits of the Social Security number, and date of birth.
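
As an illustration of this matching step, the sketch below joins the two data sources on the four identifiers named above. It is a minimal pandas sketch, not the authors’ SAS/Stata code, and all column names are hypothetical (the article specifies only the matching keys).

```python
import pandas as pd

# Hypothetical column names; the article specifies only the four matching
# keys: first name, last name, last 4 SSN digits, and date of birth.
KEYS = ["first_name", "last_name", "ssn_last4", "date_of_birth"]

def match_residents(acgme: pd.DataFrame, abim: pd.DataFrame) -> pd.DataFrame:
    """Join ACGME and ABIM resident records on the four identifiers."""
    for df in (acgme, abim):
        # Normalize names so case and stray whitespace do not block a match.
        for col in ("first_name", "last_name"):
            df[col] = df[col].str.strip().str.lower()
    # validate="one_to_one" raises if either source has duplicate keys.
    return acgme.merge(abim, on=KEYS, how="inner", validate="one_to_one")
```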

The RAES ratings were derived from a form that contained 9 items (patient care, patient care-medical interviewing, patient care-procedural skills, medical knowledge, practice-based learning and improvement, interpersonal and communication skills, professionalism, systems-based practice, and overall clinical competence) rated on a 1 to 9 scale (1-3, unsatisfactory; 4-6, satisfactory; 7-9, superior) without behavioral anchors (eTable 1 in the Supplement). Residents could be rated on the full scale regardless of years of training (eg, “superior” is not restricted to the final training year). A single item on moral and ethical behavior was rated as satisfactory or unsatisfactory.

Milestone ratings were measured in each of the 6 ACGME/American Board of Medical Specialties (ABMS) competencies and 22 subcompetencies (eTable 2 in the Supplement).14 The competencies (with number of subcompetencies) were patient care (5), medical knowledge (2), systems-based practice (4), practice-based learning and improvement (4), professionalism (4), and interpersonal and communication skills (3). Subcompetency milestone ratings (also referred to as milestone ratings) were on a 5-level rating scale with narrative descriptions (milestones) for each item (ratings are at or between levels, for 9 possible ratings: 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5). Ratings of 1 or 1.5 were interpreted as critical deficiencies. Ratings of 4 to 5 were interpreted as readiness for unsupervised practice.

Residents who successfully completed ACGME accredited internal medicine training in the 2013-2014 academic year were eligible to take the ABIM certification examination in 2014, except for PGY-3 residents with unsatisfactory ratings in any RAES item. Total scores from residents’ first attempt on this examination were available as a measure of medical knowledge.

Statistical Analyses

Descriptive statistics were calculated. The number of residents with complete matched data from both the RAES and milestone ratings was quantified. Residents with at least 1 subcompetency rated as not assessable were also quantified; these residents were included in the analysis, but when a particular subcompetency was rated as not assessable, none of the milestone ratings in that competency were included.

Analyses used the Spearman correlation to compare mean RAES ratings with milestone ratings by program year. Specifically, each item on the RAES was correlated with the average milestone rating of the subcompetencies in the corresponding competency (eTable 3 in the Supplement). For example, the patient care rating on the RAES was correlated with the average of all 5 patient care subcompetencies in the milestones.
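
A minimal sketch of this correlation step follows. The authors report using SAS and Stata; this Python/SciPy version is illustrative only, and the column names (pgy, raes_patient_care, ms_pc1 through ms_pc5) are hypothetical.

```python
import pandas as pd
from scipy.stats import spearmanr

def raes_milestone_correlation(df: pd.DataFrame, raes_item: str,
                               subcompetency_cols: list[str]) -> pd.Series:
    """Spearman correlation of one RAES item with the mean of its
    corresponding milestone subcompetencies, within each program year."""
    # Average the subcompetency milestone ratings for each resident.
    df = df.assign(ms_mean=df[subcompetency_cols].mean(axis=1))
    # spearmanr returns (rho, p); keep the correlation coefficient.
    return df.groupby("pgy").apply(
        lambda g: spearmanr(g[raes_item], g["ms_mean"])[0]
    )

# Example: the RAES patient care item vs the average of the 5 patient care
# milestone subcompetencies (hypothetical column names).
# rho_by_year = raes_milestone_correlation(
#     df, "raes_patient_care", [f"ms_pc{i}" for i in range(1, 6)])
```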

To examine the variation in performance ratings by year of training using the 2 different rating procedures, the difference in each resident’s RAES and milestone rating for PGY-1, PGY-2, and PGY-3 residents was compared across programs to determine whether the milestone ratings had a significantly different slope across program years from the RAES ratings. To accomplish this, a stacked linear regression with resident fixed effects and a cluster adjustment accounted for each resident having 2 ratings (RAES and milestones) as well as residents being nested in programs.15,16 In the stacked regression, each resident is represented by 2 observations (the RAES and milestone ratings, stacked on top of one another), which are set as the dependent measure. The amount (slope) of change in milestone ratings across PGYs was compared with the amount (slope) of change in RAES ratings. The resident fixed effects controlled for characteristics of the residents, such as sex, age, race/ethnicity, program, and medical school. To determine the number of programs that demonstrated a statistically significant increase in milestone ratings across program years compared with the RAES, analyses interacted the program indicators with PGY and the milestone indicator. The false-discovery rate method adjusted the program-level estimated P values for multiple hypothesis testing.17 A false-discovery rate–adjusted P value <.05, using a 2-sided criterion, was used to determine statistical significance. To assess whether differences in slope between the milestone and RAES ratings across PGYs were related to program characteristics, including the type of training program, HHS geographic region, and urban-rural county type, an interaction between the program characteristic, PGY, and the milestone indicator was included in the regression.
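
The stacked model can be written as a single regression in which the coefficient on the PGY × milestone-indicator interaction captures the differential slope. The sketch below, in Python/statsmodels rather than the authors’ SAS/Stata, shows the idea under hypothetical column names; with more than 21 000 residents, absorbing the fixed effects (eg, via a within transformation or linearmodels PanelOLS) would be more practical than explicit dummies.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

def fit_stacked_model(stacked: pd.DataFrame):
    """stacked has 2 rows per resident: the RAES rating and the milestone
    rating, flagged by is_milestone (0/1). Because each resident appears
    in a single program year, the resident fixed effects absorb the PGY
    main effect; the pgy:is_milestone coefficient estimates how much
    faster milestone ratings rise across PGYs than RAES ratings."""
    model = smf.ols(
        "rating ~ is_milestone + pgy:is_milestone + C(resident_id)",
        data=stacked)
    # Cluster-robust standard errors by program account for residents
    # being nested within programs.
    return model.fit(cov_type="cluster",
                     cov_kwds={"groups": stacked["program_id"]})

def fdr_adjust(pvalues):
    """Benjamini-Hochberg false-discovery-rate adjustment of the
    per-program interaction P values (reference 17)."""
    return multipletests(pvalues, alpha=0.05, method="fdr_bh")[1]
```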

For PGY-3 residents who attempted the 2014 certification examination, mean examination scores by RAES and milestone ratings were calculated for the 2 medical knowledge milestones: (1) clinical knowledge and (2) knowledge of diagnostic tests and procedures. This analysis provided convergent evidence of validity for the milestone ratings as an assessment of medical knowledge. Ratings on the RAES have been shown to correlate with performance on the certifying examination, peer ratings of postlicensure performance, and professional behavior in practice.18-20

A descriptive analysis compared the professionalism ratings under the 2 rating procedures. Residents with low scores in professionalism were identified as those rated as unsatisfactory (<4) in professionalism on the RAES form,20 or those rated below 2.5 (and, separately, below 2) on any of the 4 professionalism subcompetencies of the milestones. The milestone cutoff of 2.5 captures concerning professional behavior within the narrative milestone language, such as behaviors that are demonstrated inconsistently (professionalism subcompetency milestone [PROF] 1) or that need reminders, assistance, and oversight (PROF 2-4). The milestone cutoff of 2.0 captures behaviors defined as critical deficiencies (Figure 1).
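
Expressed as code, these definitions amount to threshold rules over each resident’s lowest professionalism rating. A minimal Python sketch, with hypothetical column names:

```python
import pandas as pd

PROF_COLS = ["prof1", "prof2", "prof3", "prof4"]  # hypothetical names

def flag_low_professionalism(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the 2 rating systems' low-professionalism definitions."""
    lowest = df[PROF_COLS].min(axis=1)  # resident's lowest milestone rating
    return df.assign(
        raes_unsatisfactory=df["raes_professionalism"] < 4,  # RAES 1-3
        milestone_concerning=lowest < 2.5,  # inconsistent / needs oversight
        milestone_critical=lowest < 2.0,    # borderline critical or critical
    )
```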

All analyses used SAS version 9.4 (SAS Institute Inc) and Stata version 14 (StataCorp).

Results

Of the 21 774 residents with ACGME data, 21 284 (97.7%) were included in the study: 7048 PGY-1, 7233 PGY-2, and 7003 PGY-3. Three hundred forty residents could not be matched to a corresponding ABIM resident record, and 150 matched residents did not have RAES data. About 70% of these excluded residents were PGY-1 or PGY-2 residents who may not yet have had ABIM accounts. As previously reported, 2814 residents (1569 PGY-1, 888 PGY-2, and 357 PGY-3) had at least 1 milestone subcompetency rated as not assessable.6 Demographic characteristics are shown in Table 1 and eTable 4 in the Supplement.

RAES and Milestone Ratings

Milestone ratings for subcompetencies within a corresponding competency had very high internal consistency reliability, with Cronbach α values ranging from 0.86 to 0.93. Correlations of RAES ratings and milestone ratings showed that overall Spearman correlations ranged from 0.20 (95% CI, 0.18-0.23) to 0.52 (95% CI, 0.50-0.54), as shown in Table 2. Correlations were slightly higher for the same competency using the 2 rating systems than across competencies. Correlations of the same competency by year of training were progressively higher, from 0.31 (95% CI, 0.29-0.33) to 0.35 (95% CI, 0.33-0.37) for PGY-1 residents to 0.43 (95% CI, 0.41-0.45) to 0.52 (95% CI, 0.50-0.54) for PGY-3 residents.

Linear regression showed that the slope across program years was significantly steeper for the milestone ratings than for the RAES ratings for all measures (Figure 2; all P values <.001). There was a 0.63 (95% CI, 0.57-0.69) to 0.77 (95% CI, 0.71-0.83) greater unit increase in milestone ratings (1 unit = 0.5 points) across PGYs than in RAES ratings (1 unit = 1 point) (all P values <.001; eTable 5 in the Supplement). All differences in slopes were significant, adjusting for residents nested within programs. The percentage of residency programs that demonstrated this pattern of a significantly steeper slope for milestone ratings than RAES ratings was 53.4% for all 6 competencies combined and 68.3% to 81.7% for individual competencies (eTable 5 in the Supplement). This pattern was similar across HHS regions, but the steeper slopes from PGY-1 to PGY-3 using milestone ratings compared with RAES ratings were less prominent (ie, the RAES and milestone rating slopes were more similar) among programs from more rural counties (P values <.001) and for community-based residencies than other residency types (P values <.04). For example, the differential change in patient care ratings across PGYs between milestone and RAES ratings (ie, the difference in slopes between the 2 rating types) was significantly smaller (−0.8 units per PGY; 95% CI, −1.06 to −0.52; P < .001) for residents trained in rural micropolitan counties (an urban core population of at least 10 000 but less than 50 000) vs large central metropolitan counties. In contrast, the differential change in patient care ratings across PGYs between milestone and RAES ratings was significantly larger among residents training at university-affiliated vs unaffiliated community programs (0.3 units per PGY; 95% CI, 0.1 to 0.5; P = .01).

Medical Knowledge

Comparison of internal medicine certification examination scores with medical knowledge ratings using the RAES and milestones for the 6260 PGY-3 residents who attempted the 2014 certification examination is shown in Figure 3. The box-and-whisker plots show that for both rating systems, on average, higher medical knowledge ratings were correlated with higher examination scores (examination score by RAES, Spearman r = 0.40; 95% CI, 0.37 to 0.42; milestone medical knowledge 1, r = 0.37; 95% CI, 0.35 to 0.39; milestone medical knowledge 2, r = 0.30; 95% CI, 0.28 to 0.32). The 618 residents who failed the certification examination on the first attempt (despite being eligible to take it based on their RAES ratings) had lower ratings for medical knowledge than those who passed (mean RAES, 6.0 vs 7.0; difference, −0.9; 95% CI, −1.0 to −0.8; P < .001; mean milestone medical knowledge 1, 3.8 vs 4.1; difference, −0.3; 95% CI, −0.3 to −0.3; P < .001; mean milestone medical knowledge 2, 3.9 vs 4.1; difference, −0.2; 95% CI, −0.3 to −0.2; P < .001). Residents with lower ratings for medical knowledge were also less likely to attempt the 2014 certification examination (eTable 6 in the Supplement), with 15% of residents with a satisfactory (4-6) RAES medical knowledge rating not taking the examination vs 7% of residents with a superior (7-9) RAES rating.

Of the 1299 residents with a milestone rating of less than 4 on either of the medical knowledge subcompetencies, 274 (21%) did not attempt the certification examination compared with 456 (8%) of the 5699 with both subcompetencies rated 4 or greater. The 1025 residents with a milestone rating of less than 4 on either of the medical knowledge subcompetencies who attempted the examination included 219 (21%) who failed and 806 (79%) who passed. Of the 26 PGY-3 residents with milestone ratings below 2.5 on either of the 2 medical knowledge subcompetencies, 12 failed the internal medicine certification examination, 7 passed, and 7 did not attempt the examination.

Professionalism

Comparison of ratings on the RAES for professionalism with residents’ lowest rating across professionalism milestones is shown in Table 3 and eTable 7 in the Supplement (Spearman correlation, 0.44; 95% CI, 0.43-0.45; P < .001). Of the 1190 residents (5.6%; 95% CI, 5.3%-5.9%) in all training years with a professionalism milestone rating below 2.5, indicating concerning professional behavior, 1161 (97.6%; 95% CI, 96.7%-98.4%) were rated as satisfactory (n = 809) or superior (n = 352) in professionalism on the RAES. Of the 205 residents (1.0%; 95% CI, 0.8%-1.1%) in all training years with a professionalism milestone rating below 2, indicating a borderline critical or critical deficiency, 183 (89.3%; 95% CI, 85.0%-93.5%) were rated as satisfactory (n = 145) or superior (n = 38) in professionalism on the RAES. The majority of residents with a professionalism rating below 2.5 were in PGY-1 (73.2%; 95% CI, 70.7%-75.7%; eTable 7 in the Supplement). Only 47 of the 1190 residents rated below 2.5 (3.9%; 95% CI, 2.8%-5.1%) were PGY-3 residents, although 44 of these received at least a satisfactory RAES rating for professionalism. Of the 13 PGY-3 residents who received a professionalism milestone rating less than 2.0 (1.1%; 95% CI, 0.5%-1.7% of the 1190), 11 received at least a satisfactory RAES rating for professionalism.

Discussion

In data from the population of US internal medicine residents, milestone ratings demonstrated a greater range of scores across training levels than did ratings using the RAES form. This finding supports the validity of milestone ratings, which describe progressive expectations for behaviors in defined subcompetencies with the goal of better distinguishing residents’ development of competence. Using both evaluation methods, on average, residents who failed or did not take the certification examination received lower ratings for medical knowledge than those who passed. Milestone ratings seemed to add information relevant to the detection and characterization of unprofessional behavior, based on the finding that a greater number of residents were rated as having potentially concerning professional behavior using milestones than using the RAES.

The finding that milestone ratings were higher in more advanced years of training provides evidence of validity that milestones are serving their intended purpose. A core tenet of competency-based medical education is that learners must achieve defined observable outcomes along the continuum of training before advancing to the next level of training.21 This finding also suggests an opportunity to align supervision more closely with learners’ needs to optimize education and patient care.22,23 The finding that the pattern in milestone ratings across training years differed for rural and community-based programs warrants further study, including subsequent years of data, to determine whether it persists. Over time, as data from all programs become available for the same residents over the 3 years of training, it will be possible to examine whether milestone ratings demonstrate progressive development of competence for individual residents.

Correlations of RAES and milestone ratings were small to moderate. Correlations were slightly higher for the same competency using the 2 rating systems than across competencies; however, the reasons the correlations were not higher are likely multiple. The RAES system was intended for summative purposes, whereas milestone ratings have initially been used more formatively.6 Items on rating forms may not align with behaviors or qualities that supervisors observe or with how supervisors think.7-9 Ratings of trainee performance tend to be inflated, clustered at the high end of the scale, and based on a limited number of trainee characteristics.10-12 To what degree milestone ratings will ameliorate these problems remains to be determined.

Medical knowledge ratings by program directors using both rating systems correlated with internal medicine certification examination scores. However, milestone ratings identified residents who failed the certification examination but were not detected using the RAES, as well as residents who chose not to take the examination. This finding suggests that milestone ratings detect residents lacking core medical knowledge. The additional finding that low medical knowledge milestone ratings identified residents choosing not to take the certification examination may reflect that residents use this and other information to optimize their preparation for the examination. This provides evidence of validity for the milestones in relation to other variables.10,11 The milestone rating system entails not only use of the milestone language but also the group judgment of clinical competency committee members to identify at-risk residents. This finding offers an opportunity for program directors to intervene by promoting conference attendance and self-directed learning.24,25 These findings may also reflect the fact that there is no milestone rating cutoff that prevents residents from sitting for the examination, as there is with the RAES. Prior studies have shown that internal medicine in-training examination scores predict certification examination scores.26-28 However, in one study, in-training examination scores explained less than half the variance in certification examination scores,27 and milestone ratings may contribute additional information about residents at risk.29 Competency-based remediation is emerging as a framework to support individual residents in achieving specialty-specific milestones.30

Results showed more range in professionalism scores using milestones than with the RAES. Nearly all residents with low professionalism milestone ratings had at least satisfactory RAES ratings. Although it is possible that these low milestone ratings represent overdetection, it is plausible that they represent true concerns as articulated in the milestone language. Because the milestones contain more items about specific aspects of professional behavior, they offer a program director more ways to signal a concern or recognize particular strengths than the single RAES item does. The complexity of the professionalism construct is well recognized, and it is difficult to reduce to one or a few behaviors.31 However, the finding of greater detection of concerning professional behavior with milestone ratings has potentially important implications because program directors’ professionalism ratings have been correlated with state medical board actions.20,32 An important area for further exploration is whether low milestone ratings of professionalism are corroborated by other evidence of professionalism concerns within training programs, to determine whether these milestone ratings are educationally meaningful.

This study has several limitations. First, data were derived from a single specialty and may not generalize to other specialties. Second, these data describe the first year of milestone implementation; therefore, it is not possible in this cross-sectional study to comment on residents’ learning over time or development in the competencies. Furthermore, program directors together with clinical competency committees may assign milestone ratings differently as they gain experience. Future studies could investigate whether the high correlations among some subcompetency milestones within a competency persist or whether discrimination among these subcompetencies increases with greater experience. Third, the constructs of resident performance are not described in exactly the same way in the 2 rating systems even though some of the labels are similar. Fourth, milestones were not created to serve as a summative evaluation of performance, whereas the RAES was developed specifically for this purpose. If milestone ratings were used summatively, it is possible that ratings would be inflated and that ratings using the 2 systems would correlate more closely. Fifth, the reasons the small number of eligible residents did not take the internal medicine certification examination are not known.

Conclusions

Among US internal medicine residents in the 2013-2014 academic year, milestone-based ratings were correlated with RAES ratings but with a greater difference across training years. Both rating systems for medical knowledge were correlated with ABIM certification examination scores. Milestone ratings may better detect problems with professionalism. These findings may inform establishment of the validity of milestone-based assessment.

Article Information

Corresponding Author: Karen E. Hauer, MD, PhD, Department of Medicine, University of California at San Francisco, 533 Parnassus Ave, U80, Box 0710, San Francisco, CA 94143 (karen.hauer@ucsf.edu).

Author Contributions: Drs Hauer and McDonald had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Hauer, Lipner, Holmboe, Hood, Iobst, McDonald.

Acquisition, analysis, or interpretation of data: Hauer, Vandergrift, Hess, Lipner, Holmboe, Hamstra, McDonald.

Drafting of the manuscript: Hauer, Vandergrift.

Critical revision of the manuscript for important intellectual content: Hauer, Vandergrift, Hess, Lipner, Holmboe, Iobst, McDonald.

Statistical analysis: Vandergrift, Hess, Lipner.

Administrative, technical, or material support: Lipner, Hood, McDonald.

No additional contributions: Hauer, Holmboe, Iobst.

Conflict of Interest Disclosures: All authors have completed and submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Dr Hauer receives consulting fees from the ABIM. Dr Vandergrift is employed by the ABIM. Dr Hess was employed by the ABIM and is currently a consultant for the ABIM. Dr Lipner is the Senior Vice President for Assessment and Research at the ABIM. Dr Holmboe receives royalties from Mosby-Elsevier for a textbook on assessment and is employed by the ACGME. Ms Hood is employed by the ABIM. Dr Hamstra is employed by the ACGME. Dr McDonald is the Senior Vice President for Academic and Medical Affairs at the ABIM. No other disclosures were reported.

Funding/Support: The study received no external funding.

Additional Contributions: The author team was led by a university-based educator as lead investigator (Dr Hauer) and multiple employees of the ABIM; in addition, the author team included 2 leaders of the ACGME milestone project. Two authors were also recent internal medicine residency program directors (Drs Iobst and McDonald). To limit potential bias or censorship, the author team included these diverse perspectives, and all authors had access to full study data.

References
1. Swing SR. The ACGME outcome project: retrospective and prospective. Med Teach. 2007;29(7):648-654.
2. Beeson MS, Holmboe ES, Korte RC, et al. Initial validity analysis of the emergency medicine milestones. Acad Emerg Med. 2015;22(7):838-844.
3. Carraccio CL, Englander R. From Flexner to competencies: reflections on a decade and the journey ahead. Acad Med. 2013;88(8):1067-1073.
4. Green ML, Aagaard EM, Caverzagie KJ, et al. Charting the road to competence: developmental milestones for internal medicine residency training. J Grad Med Educ. 2009;1(1):5-20.
5. Holmboe ES, Yamazaki K, Edgar L, et al. Reflections on the first 2 years of milestone implementation. J Grad Med Educ. 2015;7(3):506-511.
6. Hauer KE, Clauser J, Lipner RS, et al. The internal medicine reporting milestones: cross-sectional description of initial implementation in US residency programs. Ann Intern Med. 2016;165(5):356-362.
7. Bartlett KW, Whicker SA, Bookman J, et al. Milestone-based assessments are superior to Likert-type assessments in illustrating trainee progression. J Grad Med Educ. 2015;7(1):75-80.
8. American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 2014.
9. Accreditation Council for Graduate Medical Education; American Board of Internal Medicine. The Internal Medicine Milestone Project: a joint initiative of the Accreditation Council for Graduate Medical Education and the American Board of Internal Medicine. https://acgme.org/acgmeweb/Portals/0/PDFs/Milestones/InternalMedicineMilestones.pdf. Accessed October 14, 2016.
10. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: theory and application. Am J Med. 2006;119(2):166.e7-166.e16.
11. Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas. 1995;14(4):5-8.
12. ASH Regional Offices. Office of the Assistant Secretary for Health website. http://www.hhs.gov/ash/about-ash/regional-offices. Last reviewed October 19, 2016. Accessed May 21, 2016.
13. Ingram DD, Franco SJ. 2013 NCHS urban–rural classification scheme for counties. Vital Health Stat 2. 2014;(166):1-73.
14. Iobst W, Aagaard E, Bazari H, et al. Internal medicine milestones. J Grad Med Educ. 2013;5(1)(suppl 1):14-23.
15. Gunasekara FI, Richardson K, Carter K, Blakely T. Fixed effects analysis of repeated measures data. Int J Epidemiol. 2014;43(1):264-269.
16. Cameron AC, Miller DL. A practitioner’s guide to cluster-robust inference. J Hum Resour. 2015;50(2):317-372.
17. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995;57(1):289-300.
18. Shea JA, Norcini JJ, Kimball HR. Relationships of ratings of clinical competence and ABIM scores to certification status. Acad Med. 1993;68(10)(suppl):S22-S24.
19. Lipner RS, Blank LL, Leas BF, Fortna GS. The value of patient and peer ratings in recertification. Acad Med. 2002;77(10)(suppl):S64-S66.
20. Papadakis MA, Arnold GK, Blank LL, Holmboe ES, Lipner RS. Performance during internal medicine residency training and subsequent disciplinary action by state licensing boards. Ann Intern Med. 2008;148(11):869-876.
21. Carraccio C, Englander R, Van Melle E, et al; International Competency-Based Medical Education Collaborators. Advancing competency-based medical education: a charter for clinician-educators. Acad Med. 2016;91(5):645-649.
22. Carraccio C, Englander R, Holmboe ES, Kogan JR. Driving care quality: aligning trainee assessment and supervision through practical application of entrustable professional activities, competencies, and milestones. Acad Med. 2016;91(2):199-203.
23. Li ST, Tancredi DJ, Schwartz A, et al. Competent for unsupervised practice: use of pediatric residency training milestones to assess readiness [published online July 26, 2016]. Acad Med. doi:10.1097/ACM.0000000000001322
24. McDonald FS, Zeger SL, Kolars JC. Factors associated with medical knowledge acquisition during internal medicine residency. J Gen Intern Med. 2007;22(7):962-968.
25. McDonald FS, Zeger SL, Kolars JC. Associations of conference attendance with internal medicine in-training examination scores. Mayo Clin Proc. 2008;83(4):449-453.
26. Grossman RS, Fincher RM, Layne RD, Seelig CB, Berkowitz LR, Levine MA. Validity of the in-training examination for predicting American Board of Internal Medicine certifying examination scores. J Gen Intern Med. 1992;7(1):63-67.
27. Kay C, Jackson JL, Frank M. The relationship between internal medicine residency graduate performance on the ABIM certifying examination, yearly in-service training examinations, and the USMLE Step 1 examination. Acad Med. 2015;90(1):100-104.
28. Sisson SD, Bertram A, Yeh H-C. Concurrent validity between a shared curriculum, the internal medicine in-training examination, and the American Board of Internal Medicine certifying examination. J Grad Med Educ. 2015;7(1):42-47.
29. Nasca TJ, Philibert I, Brigham T, Flynn TC. The next GME accreditation system: rationale and benefits. N Engl J Med. 2012;366(11):1051-1056.
30. Regan L, Hexom B, Nazario S, Chinai SA, Visconti A, Sullivan C. Remediation methods for milestones related to interpersonal and communication skills and professionalism. J Grad Med Educ. 2016;8(1):18-23.
31. Wynia MK, Papadakis MA, Sullivan WM, Hafferty FW. More than a list of values and desired behaviors: a foundational understanding of medical professionalism. Acad Med. 2014;89(5):712-714.
32. Lipner RS, Young A, Chaudhry HJ, Duhigg LM, Papadakis MA. Specialty certification status, performance ratings, and disciplinary actions of internal medicine residents. Acad Med. 2016;91(3):376-381.