Scores are shown at the postgraduate year 2, 4, and 5 levels. Error bars show 25th and 75th percentiles. Symbols shown outside the error bars are outliers.
eFigure. Example Item From a Case Used During the Study Period
Shin JJ, Cunningham MJ, Emerick KG, Gray ST. Measuring Nontechnical Aspects of Surgical Clinician Development in an Otolaryngology Residency Training Program. JAMA Otolaryngol Head Neck Surg. 2016;142(5):423–428. doi:10.1001/jamaoto.2015.3642
Importance
Surgical competency requires sound clinical judgment, a systematic diagnostic approach, and integration of a wide variety of nontechnical skills. This more complex aspect of clinician development has traditionally been difficult to measure through standard assessment methods.
Objective
This study was conducted to use the Clinical Practice Instrument (CPI) to measure nontechnical diagnostic and management skills during otolaryngology residency training; to determine whether there is demonstrable change in these skills between residents who are in postgraduate years (PGYs) 2, 4, and 5; and to evaluate whether results vary according to subspecialty topic or method of administration.
Design, Setting, and Participants
Prospective study using the CPI, an instrument with previously established internal consistency, reproducibility, interrater reliability, discriminant validity, and responsiveness to change, in an otolaryngology residency training program. The CPI was used to evaluate progression in residents’ ability to evaluate, diagnose, and manage case-based clinical scenarios. A total of 248 evaluations were performed in 45 otolaryngology resident trainees at regular intervals. Analysis of variance with nesting and postestimation pairwise comparisons were used to evaluate total and domain scores according to training level, subspecialty topic, and method of administration.
Interventions
Longitudinal residency educational initiative.
Main Outcomes and Measures
Assessment with the CPI during PGYs 2, 4, and 5 of residency.
Results
Among the 45 otolaryngology residents (248 CPI administrations), each resident completed a mean (SD) of 5 (3) administrations (range, 1-14) during training. Total scores were significantly different among PGY levels of training, with lower scores seen in the PGY-2 level (44 ) compared with the PGY-4 (64 ) or PGY-5 level (69 ) (P < .001). Domain scores related to information gathering and organizational skills were acquired earlier in training, while knowledge base and clinical judgment improved later in residency. Trainees scored higher in general otolaryngology (mean [SD], 72 ) than in subspecialties (range, 55 , P = .003, to 56 , P < .001). Neither administering the examination with an electronic scoring system, rather than a paper-based scoring system, nor the calendar year of administration affected these results.
Conclusions and Relevance
Standardized interval evaluation with the CPI demonstrates improvement in qualitative diagnostic and management capabilities as PGY levels advance.
Preparing residents for independent surgical practice requires the development of a wide range of skills during postgraduate training. Evaluation of these skills is an area of developing interest for residency programs. Technical skills are of obvious importance in surgical disciplines and are often measured through well-defined methods, such as objective structured assessment of technical skills, surgical simulators, or cadaveric dissections.1-8 Acquisition of the relevant medical knowledge base is also critical; this is typically measured through written assessment, such as the annual otolaryngology training examination. These measures of educational progress serve as objective metrics to provide individualized feedback to residents and identify potential knowledge gaps that can be addressed with additional attention.
Obtaining surgical skill and medical knowledge is important in residency training, but achieving competency in a specialty goes far beyond the acquisition of operative dexterity or memorized facts. Surgical competency also requires sound clinical judgment, a systematic diagnostic approach, and integration of a wide variety of nontechnical skills. This more complex aspect of clinician development has traditionally been difficult to measure through standard assessment methods.
Fortunately, sophisticated methods exist that measure multifaceted qualitative characteristics. These methods have been applied to the assessment of health-related quality of life and involve validated instruments with proven interrater reliability, internal consistency, discriminant validity, and responsiveness to change. Such instruments have been critical to outcomes research because they provide quantitative tools for measurement of change after interventions.9-11
The assessment of individual resident progress along a continuum from novice to attending equivalent is the ideal foundation of the competency-based educational model espoused by the Accreditation Council for Graduate Medical Education.12-15 To address this need, we have used instrument science to develop a validated tool to assess the acquisition of the nontechnical aspects of clinical practice ability. The Clinical Practice Instrument (CPI) was created to measure 6 key diagnostic and management skills during residency training in our program.16 The CPI has demonstrated all of the aforementioned properties of a validated instrument.
Following its successful validation, we incorporated the CPI into the routine educational assessment of the residents in our program, evaluating a much larger trainee pool and incorporating a broader faculty base. As part of this process, we transitioned from a paper-based to an electronic format. With these additions, we sought to determine whether the CPI demonstrates change in residents’ clinical practice ability between postgraduate years (PGYs) 2, 4, and 5 of training and to evaluate whether results vary according to subspecialty topic or method of administration.
This prospective study involved trainees and faculty of the Harvard Otolaryngology Residency Program. The study was overseen and approved by the institutional review boards (IRBs) of the Massachusetts Eye and Ear Infirmary and Boston Children’s Hospital, with updates to the sponsoring human subject committees at 1- to 2-year intervals. Because this was part of the standard educational curriculum at the Harvard program, the process was exempted from written informed consent by the governing IRB. Participants were not compensated.
The CPI is a validated assessment tool with previously documented interrater reliability, internal consistency, discriminant validity, criterion validity, and responsiveness to change.16 Briefly, the CPI is a 21-item standardized assessment that can be adapted to a wide range of patient case presentations (see the eFigure in the Supplement). The instrument includes both Likert and dichotomous response options. The instrument is scored during the administration of a structured oral board trainee examination. This format was selected to provide both a practical and meaningful learning experience for our residents.
The administration of the CPI begins with a 1-sentence description of a patient. In response, the trainee elicits the relevant information through requested medical history, physical examination, and diagnostic studies, and then selects a diagnosis and develops a treatment plan with an explained rationale. The recognition and management of a related complication is typically included. Examinations are conducted face to face. During PGY 2, 1 examiner presents and scores the case simultaneously. During PGYs 4 and 5, 2 examiners are present during the administration; 1 provides the case description and interfaces with the resident, while the other scores the responses.
Presented cases span the range of otolaryngological disorders, including general otolaryngology, pediatric otolaryngology, otology-neurotology, facial plastics, and head and neck oncologic surgery. The cases are selected according to a previously described method.16 Briefly, each case has a presenting medical history, physical examination, diagnostic testing, and the expectation to make a definitive diagnosis and selection among treatment options. While no formal randomization procedures are used, cases are selected from the same pool of available options regardless of level of training. Each scheduled administration is planned, so there are no differences in the cases presented among the different years of training.
The CPI was administered as part of the educational curriculum between 2008 and 2014. During the initial phase of this study, the CPI was administered with a paper-based scoring system. More recently, we have transitioned to an electronic system, which collects and stores the data in real time. The electronic version of the CPI was programmed by one of our study investigators (J.J.S.) into a password-protected platform.
The trainee’s performance on the CPI is calculated in a multifactorial fashion. First, there is a final total summary score, which is a measure of the total raw score divided by the total possible score for that specific case. Six distinct domain scores are also generated: information gathering, knowledge base, systematic organization, integration of aggregate patient findings, clinical judgment, and understanding the individual elements required to execute a treatment plan. These domain scores are assessed to provide more targeted feedback. In addition to these composite scores, there is a single overall global score in which faculty provide a gestalt evaluation of the resident’s performance.
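The summary-score calculation described above can be illustrated with a minimal Python sketch. The function name, the item-level data layout, and the example point values are assumptions made for illustration and are not taken from the CPI itself; only the ratio of total raw score to total possible score comes from the text, and expressing that ratio on a 0 to 100 scale is inferred from the reported score ranges.

```python
def total_summary_score(item_scores, item_maxima):
    """Return the raw total as a percentage of the total possible score.

    item_scores and item_maxima are hypothetical parallel lists: the points
    awarded for each scored item and the maximum available for that item
    in the specific case being administered.
    """
    if len(item_scores) != len(item_maxima):
        raise ValueError("each item needs both a score and a maximum")
    possible = sum(item_maxima)
    if possible <= 0:
        raise ValueError("total possible score must be positive")
    return 100.0 * sum(item_scores) / possible


# Illustrative values only (not study data): three Likert items scored out
# of 4 and two dichotomous items scored out of 1.
example_scores = [3, 2, 4, 1, 0]
example_maxima = [4, 4, 4, 1, 1]
example_total = total_summary_score(example_scores, example_maxima)
```

Dividing by the case-specific maximum is what allows cases with different numbers of applicable items to be compared on a common scale.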
Because foreknowledge of the assessment items may alter measured outcomes and the integrity of the evaluation process, testing is arranged such that no residents have access to the cases or scoring mechanism beforehand. This process ensures that no one resident or group of residents has an advantage over others. During the course of these serial assessments, residents do not receive the same case more than once, and, to date, new cases have been created for each academic year. This is purposeful, so as to not confound the measurement of overall clinical practice ability with simple recall of a specific question and answer set. The electronic system is accordingly hosted on the Partners Discovery Informatics Platform; this platform is maintained within hospital-level secure data centers with sufficient encryption to meet Health Insurance Portability and Accountability Act standards.
Summary statistics were calculated for the final total summary score and 6 domain scores, including mean (SD), median, 25th and 75th percentiles, minimum, and maximum. Histograms were plotted according to result frequency to assess the data spread. Based on these results, it was determined that the data could be evaluated by methods for continuous variables with an approximately normal distribution. Analysis of variance (ANOVA) with nesting was used to test the null hypotheses that there were no differences in final total summary scores or domain scores according to training level, subspecialty topic, and method of administration. If ANOVA results demonstrated a statistically significant difference among the compared groups, then postestimation pairwise comparisons were performed with application of a Bonferroni correction to account for multiple comparisons. When assessing the timing of domain score improvement, comparisons were made between the observed scores and the expected PGY-specific benchmarks as defined by the observed mean overall scores, again with adjustment for multiple comparisons. Box plots were selected for graphical data representation to (1) demonstrate results according to percentile (25th, 50th, and 75th), given the frequent usage of percentile-related results reported in educational scoring, and (2) demonstrate the presence of any outlier data. The Pearson correlation coefficient was used to evaluate correlation between continuous variables. Correlation coefficients were interpreted such that 0 to 0.25 suggested little or no relationship, 0.25 to 0.50 suggested a fair relationship, 0.50 to 0.75 suggested a moderate relationship, and coefficients greater than 0.75 suggested a good relationship. Statistical analyses were performed using Stata statistical software (version 13; StataCorp LP).
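Two of the rules in this paragraph, the interpretation bands for correlation coefficients and the Bonferroni correction, can be sketched in a few lines of Python. This is an illustrative sketch and not the Stata code used in the study. The function names are hypothetical, the use of the coefficient's absolute value is an assumption, and because the stated bands share their endpoints, assigning 0.25 and 0.50 to the higher band is also an assumption (the text fixes only that "good" requires a coefficient greater than 0.75).

```python
def interpret_correlation(r):
    """Map a Pearson coefficient to the descriptive bands given in the text."""
    magnitude = abs(r)  # assumption: sign is ignored when grading strength
    if magnitude > 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if magnitude > 0.75:
        return "good"
    if magnitude >= 0.50:  # assumption: shared endpoint goes to higher band
        return "moderate"
    if magnitude >= 0.25:  # assumption: shared endpoint goes to higher band
        return "fair"
    return "little or no relationship"


def bonferroni_adjust(p_values):
    """Standard Bonferroni adjustment: multiply each P value by the number
    of comparisons in the family and cap the result at 1."""
    n_comparisons = len(p_values)
    return [min(1.0, p * n_comparisons) for p in p_values]


# Illustrative P values only (not study data) for three pairwise comparisons.
adjusted = bonferroni_adjust([0.01, 0.5, 0.2])
```

Under these bands, a coefficient of 0.81 would be labeled "good", while a coefficient of exactly 0.75 would fall in the "moderate" band.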
There were 248 CPI administrations among 45 residents. Each resident completed a mean (SD) of 5 (3) administrations (range, 1-14 administrations) during their training. Among those residents matriculating in 2011 or later, there were 12 scheduled administrations per resident, beginning in PGY 2 and ending in PGY 5.
Final total summary scores (Table) were significantly different according to postgraduate year of training, with lower mean (SD) scores seen in the PGY-2 level (44 ) compared with the PGY-4 (64 ) or PGY-5 (69 ) level (P < .001). The post-ANOVA pairwise comparison with adjustment for multiple comparisons confirmed a significant difference when comparing PGY-2 with PGY-4 levels and when comparing PGY-2 with PGY-5 levels (Bonferroni correction; P < .003). There was no statistically significant difference between PGY-4 and PGY-5 overall scores (P = .22).
Domain scores were also significantly different among postgraduate levels of training, with lower scores seen in the PGY-2 level compared with the PGY-4 or PGY-5 level (P < .001). The post-ANOVA pairwise comparison with adjustment for multiple comparisons confirmed a significant difference when comparing PGY-2 with PGY-4 levels and when comparing PGY-2 with PGY-5 levels (Bonferroni correction; P < .001). There was likewise no statistically significant difference between PGY-4 and PGY-5 domain scores (information gathering, P = .26; knowledge base, P = .25; organization, P = .39; integration, P = .19; judgment, P = .20; execution, P = .054). Overall, domain scores related to information gathering and organizational skills were acquired earlier in training, while knowledge base and clinical judgment improved later in residency (P < .001) (Figure 1).
Global single item scores similarly improved between PGY 2 and PGY 4 or PGY 5 (ANOVA; P < .001), although the magnitude of the difference was smaller (PGY 2, mean [SD], 1.4 [0.8]; PGY 4, 2.4 [0.8]; PGY 5, 2.7 [0.9]). The single item global score showed good correlation with the final total summary scores (correlation coefficient, 0.81 [95% CI, 0.76-0.85]), although the scores were not identical.
The CPI cases involved 5 subspecialty categories (Figure 2). Residents scored higher in general otolaryngology than in the other 4 subspecialties based on ANOVA results (P < .003). The post-ANOVA pairwise comparison with adjustment for multiple comparisons demonstrated a significant difference in particular between general otolaryngology (mean [SD], 72 ) and pediatric otolaryngology (56 ) (P < .001), and between general otolaryngology and head and neck surgery (55 ) (P < .003).
When the instrument was administered and completed using an electronic scoring system, rather than a paper-based scoring system, there was no impact on results (mean [SD] for paper-based, 63 ; for electronic, 58 ) (P = .88). In addition, the calendar year of administration was not an effect modifier; the impact of postgraduate level persisted, regardless of the calendar year of administration (mean [SD] range, 58-66 [15-22]) (P < .001).
Results for the residents evaluated after 2009 were also analyzed separately so as to not overlap any data collected during the instrument validation process. Within this subset (80% of the accumulated data), results were essentially identical to those of the complete set, showing significantly improved final total summary scores (P < .001), domain scores (P < .001), and global item scores (P < .001) in PGY 4 or 5, compared with PGY 2. Residents scored better in general topics than in pediatrics (P < .001) and head and neck surgery (P = .001). There was similarly no impact of electronic administration (P = .88), and the impact of postgraduate training level persisted, regardless of calendar year of administration (P < .001).
This prospective study uses the CPI to evaluate the nontechnical aspects of surgical residency training in otolaryngology, specifically, the ability to work through a clinical case in a comprehensive and methodical fashion. The results demonstrate significant improvement between PGY 2 and the senior years of training in both summary and domain scores, suggesting progressive development of such skills.
The CPI provides a quantitative analysis of the nontechnical aspects of a developing surgeon. Whereas objective structured assessment of technical skills assesses surgical abilities and annual in-service examinations quantify knowledge base, we have previously lacked metrics that assess whether a resident is able to make sound, safe, independent decisions for patient care while still in training. Some trainees may be very “book smart” but have difficulty with clinical judgment. Others may have superb dissection skills in the operating room but do not recognize which patients require rapid evaluation and intervention. The CPI evaluates the intangible aspects of clinical practice ability as demonstrable in a structured oral examination format.
The CPI administration is meant to simulate an oral board examination. From the residents’ viewpoint, the interface is similar; they are presented with a clinical scenario and expected to elicit and interpret necessary diagnostic information and provide a cogent plan for management. They also do not know which cases will be presented or what the specific scoring metrics will be. Similar to a formal board examination, measures are taken to ensure examination question security surrounding each assessment. In the medical field at large, structured trainee oral board examinations have been shown to potentially have an impact on actual examination results while garnering resident approval.17-20 Although used specifically for otolaryngology in this study, the CPI may potentially be used in any surgical specialty.
Consistent use of the CPI also provides an opportunity to learn and apply a structured approach to specific aspects of clinical evaluation. For example, residents are instructed to create a differential diagnosis in the format of disease categories to prevent prematurely selecting the incorrect diagnosis. Residents are also taught to apply a methodical technique for evaluating imaging and relaying their interpretation. This organized approach to case evaluation is introduced in PGY 2, then reinforced and reevaluated throughout residency. Engraining this approach throughout the course of residency training is critical not only to facilitate certification examination performance but to optimize lifelong practice.
This educational method also provides an opportunity to witness a trainee’s evolving clinical practice abilities deconstructed into specific components. The CPI evaluates 20 specific items, ranging from taking a history to a systematic approach for considering treatment options. These items, along with domain scores, provide a more specific means of feedback. Rather than simply reporting that a trainee performed poorly on a particular case, precise areas that need improvement can be identified. For example, if a resident scores poorly on systematic organization, then he or she can be advised to apply more structured approaches to their discussion of the case. If a trainee scores poorly in knowledge base, then he or she can be advised to spend more time in study of the related specialty topics.
The aggregate results may also provide insight into the program curriculum itself. For example, after 10 consecutive CPI administrations to senior residents in 1 year, it became clear that in-depth polysomnography interpretation was a knowledge gap for all residents. After this issue was identified, the didactic educational curriculum was revised to incorporate more material on sleep study interpretation, and the pediatric clinical rotation was modified so that residents reviewed preoperative polysomnography study results for patients undergoing tonsillectomy on the day of surgery with the attending surgeon.
The finding that residents scored higher in general otolaryngology cases than they did in the other 4 subspecialties deserves comment. This result may reflect greater exposure to general otolaryngology topics during training or more limited experience with certain subspecialties. In addition, because some assessments are performed at the outset of PGY 2, this may reflect earlier exposure to general otolaryngology. This finding is also interesting given that residents in this training program do not complete a dedicated general otolaryngology rotation. It suggests that, as training programs evolve, subspecialty-based rotations may still achieve the goal of providing sound general otolaryngology education and training.
The change in summary and domain scores seen between PGY 2 and PGYs 4 and 5 occurred regardless of the calendar year of administration. This result suggests stability within the program with consistent evidence of improvement after 4 years of residency training. In addition, it suggests stability within the CPI measurement process itself because there was no problematic learning curve or implementation barrier that resulted in change over time.
Although these data were collected in a prospective fashion using a previously validated instrument, there are inherent limitations to our study. First, the results were derived from a single training program in which the faculty members are aware of the training level of each resident. Therefore, the scores are not blinded and are subject to potential expectation bias on the part of the examiners. The overall data spread, however, argues against such bias occurring consistently. Notably, individual PGY-2 residents have occasionally scored higher than PGY-4 or PGY-5 residents, and conversely, examiners have not refrained from rating PGY-5 responses below the mean response for a PGY-2 resident.
Our protocol involves 1 faculty member performing the PGY-2 assessment and 2 faculty members performing the PGY-4 and PGY-5 assessments. The latter dual faculty presence was planned to prevent interruptions during the senior resident administrations because this group generally conveys their thoughts in a more rapid and comprehensive fashion. Having 2 faculty members ensures sufficient resources to simultaneously present and score the case, allowing residents to progress through the case at a typical cadence. At the PGY-2 level, 1 faculty member has been sufficient to keep pace with the trainee. This difference in faculty presence could theoretically have an impact on the data; we do not suspect that this significantly affects our results, but further study would be needed to gain definitive insight. We considered having 1 faculty member administer the instrument at the senior levels, which would have the advantage of halving the staff needed to participate. It has, however, the possible disadvantage of altering the flow of the case to maintain real-time scoring. It could also prompt the faculty member to score the instrument following rather than during the interaction, thus introducing a potential for recall bias. Another alternative is to have 2 faculty members administer the instrument in PGY 2. This approach would have the advantage of maintaining consistency with PGY-4 and PGY-5 assessments. Based on our experience to date, however, a second examiner is not needed at the PGY-2 administration because residents at that level communicate at a more limited rate.
Our sample size and range of individuals are preordained by the size of our program rather than by an a priori plan to accumulate a certain sample size to achieve 90% power, as in most prospective studies. Consistent with our results, post hoc calculations suggest there is 100% power to identify a significant difference between the mean summary scores at the PGY-2 and PGY-4 levels, while there is 61% power to detect a significant difference between the reported PGY-4 and PGY-5 scores (means [SDs] as reported; Bonferroni adjustment for multiple comparisons). Although no statistically significant difference is observed with respect to PGY-4 data compared with PGY-5 data, further study is needed to establish whether these 2 groups in fact score in a similar or different fashion, particularly given the limited power in that comparison.
With enough aggregate data over multiple years and levels of training, and ultimately over multiple training programs, it should ideally be possible to establish minimum target goals according to PGY level of training. For example, with the expected regression to the mean, if cumulative results suggest PGY-4 residents typically achieve a mean (SD) summary or domain score of 60 (13), then cause for concern and intervention might be defined as performance more than 2 SDs below the mean, that is, a score below 34 (60 − 2 × 13). Similarly, criteria for outlier data might be established according to percentiles. As data collection expands, we hope to advance beyond measuring group results at each level of training to tracking individualized progress over time.
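The hypothetical cutoff in this paragraph can be written out as a short Python sketch. The function names and the default of 2 SDs as a parameter are illustrative assumptions; the mean of 60, the SD of 13, and the resulting threshold of 34 are the example values given in the text, not established benchmarks.

```python
def concern_threshold(mean, sd, n_sd=2.0):
    """Score that lies n_sd standard deviations below the group mean."""
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    return mean - n_sd * sd


def flags_concern(score, mean, sd, n_sd=2.0):
    """True when a score falls more than n_sd SDs below the group mean."""
    return score < concern_threshold(mean, sd, n_sd)


# Example values from the text: PGY-4 mean (SD) of 60 (13).
pgy4_threshold = concern_threshold(60, 13)
```

With these inputs the threshold evaluates to 34, so a score of 30 would be flagged for attention while a score of exactly 34 would not.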
Administration of the CPI has been formally incorporated into the Harvard otolaryngology residency curriculum, and beginning in 2011, cases have been presented to residents at 3 standardized points in training. These intervals have been selected to measure resident progression in clinical decision-making over time. While this yearly coordinated assessment schedule involves commitment, time, programming, and analytics, with the current and anticipated benefits to our residents, it has been a worthy investment. The CPI can be adapted for use in any otolaryngology training program and is potentially applicable to any surgical specialty.
Corresponding Author: Jennifer J. Shin, MD, SM, Department of Otolaryngology–Head and Neck Surgery, Harvard Medical School, 45 Francis St, Boston, MA 02115 (firstname.lastname@example.org).
Submitted for Publication: October 10, 2015; final revision received October 22, 2015; accepted November 18, 2015.
Published Online: February 25, 2016. doi:10.1001/jamaoto.2015.3642.
Author Contributions: Dr Shin had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Shin, Cunningham, Gray.
Acquisition, analysis, or interpretation of data: Shin, Emerick, Gray.
Drafting of the manuscript: Shin.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Shin.
Administrative, technical, or material support: Gray.
Study supervision: Cunningham, Emerick, Gray.
Conflict of Interest Disclosures: Dr Shin receives textbook royalties from Evidence-Based Otolaryngology. Drs Shin and Cunningham receive royalties from Otolaryngology Prep and Practice. Dr Shin is a recipient of a Harvard Medical School Shore Foundation–Center for Faculty Development Grant and Creating Healthcare Excellence through Education and Research award. No other disclosures are reported.
Additional Contributions: We thank and acknowledge our faculty who have administered the CPI 3 times or more within the past 5 years: Donald Keamy Jr, MD, MPH; Derrick Lin, MD; and Steven Rauch, MD, of the Department of Otolaryngology–Head and Neck Surgery, Harvard Medical School. Dr Shin thanks Thomas Y. Lin, BA, for support during the preparation of this manuscript. These individuals were not compensated for their assistance.