eTable 1. Description of Postoperative Complications Included in the Analysis
eTable 2. Clinical Process of Care Measures Included in Assessment of Hospital Surgical Quality
eTable 3. Cross-Validation Analysis
eTable 4. Independent Hierarchical Generalized Linear Models (HGLMs) for Each of the Four Adverse Events for the Overall Study Population
eTable 5. Sample Characteristics by Residency Program Tertile for the 10-Year Cohort
eTable 6. Mean Adjusted Adverse Event Rates by Residency Program Tertile for the 10-Year Cohort
eTable 7. Mean Adjusted Adverse Event Rates by Residency Program Tertile for the 5-Year Cohort
eAppendix. Statistical Appendix
Bansal N, Simmons KD, Epstein AJ, Morris JB, Kelz RR. Using Patient Outcomes to Evaluate General Surgery Residency Program Performance. JAMA Surg. 2016;151(2):111-119. doi:10.1001/jamasurg.2015.3637
Importance
To evaluate and financially reward general surgery residency programs based on performance, performance must first be defined and measurable.
Objective
To assess general surgery residency program performance using the objective clinical outcomes of patients operated on by program graduates.
Design, Setting, and Participants
A retrospective cohort study was conducted of discharge records from 349 New York and Florida hospitals between January 1, 2008, and December 31, 2011. The records comprised 230 769 patients undergoing 1 of 24 general surgical procedures performed by 454 surgeons from 73 general surgery residency programs. Analysis was conducted from June 4, 2014, to June 16, 2015.
Main Outcomes and Measures
In-hospital death; development of 1 or more postoperative complications before discharge; prolonged length of stay, defined as length of stay greater than the 75th percentile when compared with patients undergoing the same procedure type at the same hospital; and failure to rescue, defined as in-hospital death after the development of 1 or more postoperative complications.
Results
Patients operated on by surgeons trained in residency programs that were ranked in the top tertile were significantly less likely to experience an adverse event than were patients operated on by surgeons trained in residency programs that were ranked in the bottom tertile. Adjusted adverse event rates for patients operated on by surgeons trained in programs that were ranked in the top tertile and those who were operated on by surgeons trained in programs that were ranked in the bottom tertile were, respectively, 0.483% vs 0.476% for death, 9.68% vs 10.79% for complications, 16.76% vs 17.60% for prolonged length of stay, and 2.68% vs 2.98% for failure to rescue (all P < .001). The differences remained significant in procedure-specific subset analyses. The rankings were significantly correlated among some but not all outcome measures. The magnitude of the effect of the residency program on the outcomes achieved by the graduates decreased with increasing years of practice. Within the analyses of surgeons within 20, 10, and 5 years of practice, the relative difference in adjusted adverse event rates across the individual models between the top and bottom tertiles ranged from 1.5% to 12.3% (20 years), 9.1% to 33.8% (10 years), and 8.0% to 44.4% (5 years).
Conclusions and Relevance
Objective data were successfully used to rank the clinical outcomes achieved by graduates of general surgery residency programs. Program rankings differed by the outcome measured. The magnitude of differences across programs was small. Careful consideration must be used when identifying potential targets for payment-for-performance initiatives in graduate medical education.
The 2014 Institute of Medicine report calls for restructuring of Medicare funding for Graduate Medical Education (GME) to incorporate payment-for-performance methods.1,2 The Institute of Medicine argues that US taxpayers should no longer unconditionally fund physician training but rather fund training that is best able to meet the nation’s health care needs. This call for payment for performance in GME raises the question, “How do we define and measure residency program performance?” To our knowledge, there is no consensus regarding how to evaluate GME. Programs use fellowship match rates, board pass rates, or subjective evaluations of observed encounters as proxy measures of training quality.3,4 However, these measures do not directly capture program performance in the core objective of GME—to train a future generation of physicians to deliver high-quality patient care. Furthermore, experience in measuring hospital performance has shown that process measures are not necessarily correlated with outcomes measures.5,6 The same may be true in measuring GME performance. Funding GME based on performance demands the creation of a system that reliably evaluates residency programs using objective clinical outcomes.
Prior work demonstrated that obstetrics and gynecology residency programs could be ranked by the complication rates of their graduates’ patients.7,8 However, this approach has not yet been applied to fields where there are less clearly defined indications for intervention and more variability in the types of procedures performed. This study expands this work into general surgery, a primary care specialty with a more diverse range of procedures and outcomes. General surgery was selected because there are approximately 2.65 million inpatient admissions for general surgical procedures annually in the United States,9 general surgical procedure outcomes have been widely examined using discharge claims,10-12 and general surgery training is the foundation for many other surgical specialties. Four outcomes were used to examine the care provided by general surgery residency program graduates and to compare performance across programs.
Patients undergoing 1 of 24 general surgical operations in New York and Florida hospitals between January 1, 2008, and December 31, 2011,13,14 were identified for study inclusion using International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) procedure codes.15 Operations were chosen to capture the breadth of inpatient procedures performed by general surgeons (Table 1).16 New York and Florida were selected for the study because of the ability to link patient claims to information on surgeons and hospitals. Physician identifiers were used to obtain current and historical data from the American Medical Association Physician Masterfile.17 Data on hospital-level quality measures were obtained from the 2014 Hospital Compare database.18 To avoid misclassifying a complex operation as a separately listed component procedure, patients undergoing multiple qualifying procedures were classified by the most comprehensive procedure coded in the discharge claim for each admission as determined by 3 of us (N.B., J.B.M., and R.R.K.) (Table 1). For example, a patient who underwent both a pancreatectomy and a cholecystectomy during the same admission was classified under pancreatectomy.
A total of 952 183 admissions included a qualifying general surgical operation. Patients were excluded if the physician identifier in the state data set could not be linked to a record in the American Medical Association Physician Masterfile (n = 153), the physician did not identify general surgery as his or her primary or secondary specialty (n = 273 426), the recorded residency was at an institution without a general surgery residency program (n = 39 745), the physician was trained outside the United States (n = 195 741), the physician did not have an MD degree (n = 8593), or the residency completion date was after the date of the qualifying operation (n = 1078). To minimize the effects of practice habits developed after training, observations were excluded if the physician was more than 20 years out of residency at the time of the qualifying discharge (n = 132 775). Finally, patients of surgeons whose residency program could not be identified (n = 63) or whose surgeons trained at residency programs for which fewer than 5 alumni could be identified (n = 69 840) were excluded from the analysis. The final sample included 230 769 patients operated on by 454 general surgeons from 73 general surgery residency programs. The residency programs were located in 24 states, the District of Columbia, and Puerto Rico and represented 28.7% of the 254 currently accredited general surgery residency programs in the United States. The analysis was repeated excluding physicians more than 10 years out of residency and more than 5 years out of residency to examine the program effect at time points closer to the training period. For the analysis of surgeons within 10 years of training, there were 78 575 patients operated on by 319 general surgeons from 36 general surgery residency programs. For the analysis of surgeons within 5 years of training, there were 26 576 patients operated on by 121 general surgeons from 16 general surgery residency programs.
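The exclusion cascade described above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the reported exclusion counts are mutually exclusive and were applied one after another, and the dictionary labels are shorthand introduced here, not variable names from the study.

```python
# Counts are taken from the text; the sequential, non-overlapping
# application of the exclusions is an assumption of this sketch.
initial_admissions = 952_183
exclusions = {
    "no Masterfile link": 153,
    "general surgery not primary/secondary specialty": 273_426,
    "institution without general surgery residency": 39_745,
    "trained outside the United States": 195_741,
    "no MD degree": 8_593,
    "residency completed after the operation": 1_078,
    "more than 20 years out of residency": 132_775,
    "residency program not identified": 63,
    "program with fewer than 5 alumni": 69_840,
}

remaining = initial_admissions
for reason, n in exclusions.items():
    remaining -= n
    print(f"{reason:<50} -{n:>7,}  remaining {remaining:>7,}")

final_sample = remaining
# Share of the 254 accredited programs represented by the 73 included programs.
program_share_pct = round(73 / 254 * 100, 1)
print(final_sample, program_share_pct)
```

Under that assumption, the running total ends at the reported final sample size, and the program share matches the percentage given in the text.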
Analysis was conducted from June 4, 2014, to June 16, 2015. The study was exempted from review by the University of Pennsylvania Institutional Review Board.
The adverse events examined were death, development of 1 or more complications, prolonged length of stay (PLOS), and failure to rescue (FTR). Death was defined as death during the same hospital stay. Complications were identified by ICD-9-CM diagnosis codes (eTable 1 in the Supplement)19,20 for individual complications and collapsed into a binary variable representing the occurrence of any postoperative complication. To distinguish between complications and comorbidities, diagnosis codes were not considered if they were designated as present on admission. Prolonged length of stay was defined within each hospital as a binary variable indicating procedure-specific length of stay greater than the 75th percentile. Prolonged length of stay is a well-described measure used to reflect inefficiencies in care and to capture complications that prolong care.21,22 Failure to rescue was coded as a binary variable indicating in-hospital death following any complication.23,24 In defining FTR, death was included as a complication with the assumption that patients who died without a documented complication experienced an undocumented complication. Failure to rescue was defined only for the 11 701 patients (5.1% of cohort) who were admitted electively and died or developed complications following surgery performed on hospital day 0 to reflect the context in which FTR was initially developed.
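The PLOS and FTR definitions above can be expressed as a short sketch. This is not the authors' code: the record fields, the function names `flag_plos` and `ftr_outcome`, the toy lengths of stay, and the choice of percentile interpolation (`method="inclusive"`) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import quantiles

def flag_plos(records):
    """Flag stays longer than the 75th percentile for the same procedure at the same hospital."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["hospital"], r["procedure"])].append(r["los"])
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    p75 = {key: quantiles(los, n=4, method="inclusive")[2] for key, los in groups.items()}
    return [r["los"] > p75[(r["hospital"], r["procedure"])] for r in records]

def ftr_outcome(died, complication, elective, surgery_on_day0):
    """Return None when FTR is undefined for the patient, otherwise True (died) or False."""
    if not (elective and surgery_on_day0):
        return None  # FTR was restricted to elective admissions with surgery on hospital day 0
    if not (died or complication):
        return None  # no complication and no death: patient is outside the FTR denominator
    return bool(died)  # death without a coded complication still counts as a failure to rescue

# Hypothetical lengths of stay (days) for one procedure at two hospitals.
stays = [("A", 2), ("A", 1), ("A", 9), ("A", 3), ("A", 2),
         ("B", 4), ("B", 5), ("B", 6), ("B", 20)]
records = [{"hospital": h, "procedure": "appendectomy", "los": d} for h, d in stays]
plos_flags = flag_plos(records)
print(plos_flags)
```

Because the threshold is computed within each hospital, a 9-day stay is flagged at hypothetical hospital A while a 6-day stay at hospital B is not.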
Owing to the nested nature of the data, with multiple patients associated with each surgeon and multiple surgeons associated with each residency program, hierarchical generalized linear models (HGLMs) with a logit link function were used to assess the independent association between residency program and adverse events (eAppendix in the Supplement). A separate model was estimated for each of the 4 adverse events. Candidate covariates were chosen based on a review of the literature and clinical judgment and were selected for inclusion in each model using Pearson χ2 tests with a threshold of P < .10. Patient characteristics included age, sex, race, principal payer (Medicare, Medicaid, private insurance, self-pay, and other), Elixhauser index,25-27 operation type, admission via the emergency department, surgery on the day of admission, operation year, and state. Surgeon characteristics included age, sex, decade of training completion, operative volume in tertiles, and identification of a subspecialty in addition to general surgery (defined as a binary variable). Surgeon subspecialty was included in the analysis to adjust for the effects of advanced training beyond residency. Given the study time frame, many surgeons entered residency before the duty hour requirement was reformed and before the accelerated rate of fellowship enrollment. Therefore, we used surgeon subspecialty as a proxy for fellowship training or focused practice. Hospital characteristics included were bed size, ownership, and setting. Hospital surgical quality was examined using data from the Hospital Value-Based Purchasing Program16 to account for the assumptions that better hospitals attract surgeons trained at better residency programs and that the variance in hospital quality in the form of better preoperative or postoperative care may account for the observed variance in clinical outcomes.
Hospital surgical quality was defined as the mean performance score in the surgery-specific clinical process of care measures (eTable 2 in the Supplement). For each model, discrimination was assessed using the C statistic, and the proportion of variation explained was measured using Efron’s pseudo R2.
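For readers unfamiliar with the two fit measures named above, the following sketch shows how each is conventionally computed from observed binary outcomes and model-predicted probabilities. The functions and toy data are hypothetical and stand in for HGLM output; they are not the study's predictions.

```python
def c_statistic(y, p):
    """Share of (event, non-event) pairs in which the event has the higher predicted risk.

    Tied predictions count as half a concordant pair.
    """
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for e in events:
        for n in nonevents:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(events) * len(nonevents))

def efron_r2(y, p):
    """Efron's pseudo R2: 1 - sum((y - p)^2) / sum((y - mean(y))^2)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical outcomes (1 = adverse event) and predicted probabilities.
y = [0, 0, 0, 1, 0, 1, 1, 0]
p = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.9, 0.1]
c = c_statistic(y, p)
r2 = efron_r2(y, p)
print(c, r2)
```

The C statistic depends only on how well the predictions order patients, whereas Efron's pseudo R2 also penalizes predictions that are far from the observed 0 or 1 outcome, which is one reason a model can discriminate well while explaining a modest share of variation.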
Using the analytical framework implemented in obstetrics7,8 and further described in the eAppendix in the Supplement, a risk-standardized adverse event rate (RSAER) for each residency program was calculated for each of the 4 adverse events. The RSAER reflects the program-specific HGLM-predicted adverse event rate divided by the HGLM-predicted adverse event rate of the average residency program. The residency programs were then ranked and grouped into tertiles based on their RSAERs for each adverse event. The 4 sets of program rankings were compared on a pairwise basis with the Spearman rank correlation using the Sidak correction for multiple comparisons.28
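The ranking steps in this paragraph can be illustrated with a minimal sketch. It assumes the RSAER is the simple ratio stated above, uses made-up RSAER values for six hypothetical programs (A to F), and applies the Šidák correction as an adjusted significance threshold for the 6 pairwise comparisons among 4 outcomes. All function names are introduced here for illustration.

```python
from math import comb

def rsaer(program_predicted_rate, average_program_rate):
    """Ratio of the program-specific predicted rate to the average program's predicted rate."""
    return program_predicted_rate / average_program_rate

def average_ranks(values):
    """Rank values from 1 (lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def tertiles(values):
    """Assign tertile 1 to the lowest third of values and tertile 3 to the highest."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for position, idx in enumerate(order):
        out[idx] = position * 3 // len(values) + 1
    return out

def sidak_alpha(alpha, n_comparisons):
    """Per-comparison threshold that keeps the family-wise error rate at alpha."""
    return 1 - (1 - alpha) ** (1 / n_comparisons)

# Hypothetical RSAERs for programs A-F on two outcomes.
death_rsaer = [0.90, 0.95, 1.00, 1.02, 1.08, 1.15]
ftr_rsaer = [0.92, 0.88, 1.01, 1.05, 1.03, 1.20]

rho = spearman_rho(death_rsaer, ftr_rsaer)
death_tertiles = tertiles(death_rsaer)
ftr_tertiles = tertiles(ftr_rsaer)
adjusted_alpha = sidak_alpha(0.05, comb(4, 2))  # 4 outcomes give 6 pairwise comparisons
print(rho, death_tertiles, ftr_tertiles, adjusted_alpha)
```

A lower RSAER indicates fewer adverse events than the average program, so tertile 1 in this sketch corresponds to the top-ranked group.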
Using the results of fitting each HGLM, the adjusted adverse event rate (AAER) for each residency program was estimated as the HGLM prediction for the average patient treated by the average surgeon if the average surgeon had attended that specific residency program (eAppendix in the Supplement). Unlike the RSAER, the AAER differs between programs only in the inclusion of the predicted program effects; the characteristics of each program’s graduates and those graduates’ patients do not affect the AAER. The mean AAER was calculated for each tertile. The difference between the top and bottom tertiles was calculated to reflect the absolute risk reduction associated with operations performed by a surgeon from a program ranked in the top tertile compared with operations performed by a surgeon from a program ranked in the bottom tertile. The relative risk reduction was also calculated. To control for differences in case selection by alumni, AAERs were calculated in subset analyses of specific procedures linked to specific indications: emergency appendectomy for appendicitis and elective pancreatectomy for neoplasm. These subset analyses were limited to procedures performed on the day of admission to reduce the heterogeneity of the patient cohorts. In addition, a cross-validation analysis was performed in which half the patients were used to compute RSAERs and rank programs and the other half were used to compute AAERs (eTable 3 in the Supplement).
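As a worked illustration of the absolute and relative risk reductions, the sketch below applies the conventional formulas to the top- and bottom-tertile adjusted rates reported in the abstract for the overall cohort. Dividing by the bottom-tertile rate is an assumption of this sketch; the source does not state which denominator was used.

```python
# Adjusted adverse event rates (%) as printed in the abstract: (top tertile, bottom tertile).
aaer = {
    "death": (0.483, 0.476),
    "complications": (9.68, 10.79),
    "prolonged length of stay": (16.76, 17.60),
    "failure to rescue": (2.68, 2.98),
}

def risk_reductions(top, bottom):
    """Return (absolute risk reduction in percentage points, relative risk reduction in %)."""
    arr = bottom - top
    rrr = arr / bottom * 100  # assumed denominator: bottom-tertile rate
    return arr, rrr

results = {}
for outcome, (top, bottom) in aaer.items():
    arr, rrr = risk_reductions(top, bottom)
    results[outcome] = (round(arr, 3), round(rrr, 1))
    print(f"{outcome:<26} ARR {arr:+.3f} points, RRR {rrr:+.1f}%")
```

With the figures as printed, the relative differences for complications, prolonged length of stay, and failure to rescue are roughly 10%, 5%, and 10%, while the death rates differ by about 1.5% in relative terms, with the sign of that difference following from the order in which the two values are printed.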
All analyses were performed using Stata/MP, version 13.1, statistical software (StataCorp) and SAS, version 9.4, software (SAS Institute Inc).
Descriptive statistics are shown in Table 2. Characteristics were clinically similar across included and excluded cohorts. In the study population, the observed rates of adverse events were 1.8% for death, 15.0% for complications, 20.9% for PLOS, and 6.8% for FTR. Complete models for each adverse event are shown in eTable 4 in the Supplement. The model C statistics ranged from 0.74 (FTR) to 0.90 (death). The proportion of variation explained by the models ranged from 8.9% (FTR) to 22.2% (complications). Observed adverse event rates, RSAERs, and selected patient and surgeon characteristics by residency program tertile are shown in Table 3. Adjusted adverse event rates for each program tertile are shown in Table 4. Adjusted adverse event rates for programs ranked in the top tertile were significantly lower than those for programs ranked in the bottom tertile for all procedures as well as for subset populations.
Among the cohort of surgeons within 10 years of graduation from residency, the program effect was notably larger as evidenced by the larger absolute differences between the top and bottom tertiles across all outcomes and models. The relative difference in AAERs between the top and bottom tertiles ranged from 9.1% in the complication model to 33.8% in the FTR model. The RSAERs and AAERs were similar in magnitude to those computed from the full 20-year cohort (eTable 5 and eTable 6 in the Supplement). Among the cohort of surgeons within 5 years of graduation from residency, the program effect was even larger, with the relative difference between the top and bottom tertiles ranging from 8.0% in the PLOS model to 44.4% in the mortality model (eTable 7 in the Supplement).
The tertile rankings of the individual programs were consistent between death and FTR and between complications and PLOS. When comparing death and FTR, 52.1% of the 73 programs remained within the same tertile, 38.4% moved by 1 tertile, and 9.6% moved by 2 tertiles. Similarly, when judged by complications compared with PLOS, 50.7% of the programs remained within the same tertile, 38.4% moved by 1 tertile, and 11.0% moved by 2 tertiles. Rankings were not consistent between FTR and complications or PLOS. Table 5 shows the pairwise Spearman rank correlations comparing rankings for the individual adverse events.
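The tertile-movement comparison can be sketched as follows. The first part tabulates how far each program moves between two outcome rankings, using made-up tertile assignments for six hypothetical programs; the second part converts the percentages reported above back into program counts, assuming each percentage is a rounded share of the 73 programs.

```python
from collections import Counter

def tertile_moves(tertiles_a, tertiles_b):
    """Share of programs whose tertile differs by 0, 1, or 2 between two rankings."""
    moves = Counter(abs(tertiles_a[prog] - tertiles_b[prog]) for prog in tertiles_a)
    n = len(tertiles_a)
    return {shift: moves.get(shift, 0) / n for shift in (0, 1, 2)}

# Hypothetical tertile assignments (1 = top tertile) for programs A-F.
death = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3, "F": 3}
ftr = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 1, "F": 3}
shares = tertile_moves(death, ftr)
print(shares)

# Reported percentages (same tertile, moved by 1, moved by 2) converted to counts of 73 programs.
n_programs = 73
death_vs_ftr = [round(pct / 100 * n_programs) for pct in (52.1, 38.4, 9.6)]
complications_vs_plos = [round(pct / 100 * n_programs) for pct in (50.7, 38.4, 11.0)]
print(death_vs_ftr, complications_vs_plos)
```

Each set of reported percentages sums to slightly more than 100% because of rounding, but the implied counts add up to all 73 programs.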
The call to restructure GME funding aligns with a broader movement across the health care industry toward models of payment for performance.29-32 However, to our knowledge, a national standard for measuring GME performance does not exist. Attempts have been made to rank residency programs based on perception by experts in the field,33 but public perception of program prestige is not a reliable indicator of quality of clinical training.34,35 Given that the ultimate goal of general surgery residency is to prepare surgeons to achieve optimal patient outcomes after graduation, an intuitive measure of performance would be the clinical outcomes of patients of program graduates. Information on program performance in achieving this mission is important to the health care system, residency programs, surgical trainees, and patients.
This study demonstrates that general surgery residency programs can be ranked by the outcomes achieved by their graduates but that the selected measures affect the rank ordering of the programs. Patients whose procedures were performed by surgeons trained in the top and bottom tertiles of general surgery residency programs experienced different rates of adverse events. The differences across the program tertiles were relatively small among the cohort of surgeons with up to 20 years of practice. However, differences tended to be greater among surgeons with less than 10 years of experience and most pronounced among the cohort of surgeons with less than 5 years of experience. This finding suggests that the effects of training on outcomes are greatest at the onset of independent practice.
This article serves as a proof of concept that patient outcomes can be used to rank general surgery residency programs. Similar ranking systems have been attempted previously only in obstetrics and gynecology,7 where programs were ranked by graduates’ rates of patient complications during delivery. That study examined 2 procedures (vaginal and cesarean delivery) with a single indication and discrete outcomes (laceration, hemorrhage, and infection). Our study shows that such a method can be applied to a primary care specialty—general surgery—with a much broader range of procedures. Program rankings were consistent across cohorts of surgeons within 5, 10, or 20 years of practice, suggesting that the analytic strategy can produce stable estimates of the programs’ performances and that the effect of the programs on their graduates’ outcomes is strongest in the early years of independent practice. However, the study was unable to define a single metric for use in program assessment owing to the lack of consistency across all the outcome measures examined.
There are several limitations to this study. First, successful surgical outcomes are determined not just by technical excellence but also by good clinical judgment in determining candidacy for surgery. Selecting the right surgical procedure for the right patient at the right time is a clinical skill taught in residency. By comparing outcomes for the average patient operated on by the average surgeon at each residency program, the clinical judgment required to selectively operate on patients most likely to benefit from surgical rather than medical treatment options is penalized rather than rewarded. Once researchers define a method to assess appropriateness of the surgical intervention, it will be important to include it in the model. Despite this limitation, this study demonstrates that “better” residency programs can be defined based on what matters most—how graduates’ patients fare clinically after surgery.
Second, this study did not include the baseline caliber of the entering trainees. It is possible that the more highly ranked residency programs select more talented trainees with a greater aptitude for excellence in surgery, and the program itself had a minimal effect. In this case, the ranking system would remain an important metric for patients and hospitals when selecting surgeons but would lose its utility in guiding improvements in the training process.
Third, we were not able to directly measure fellowship status. Self-reported specialization was used as a proxy but may reflect a spectrum of additional training and/or narrowing of practice patterns without a formal fellowship. This limitation should not significantly affect the results of the study, as many surgeons with additional fellowship training continue to perform procedures outside their area of specialization, and skills learned during residency form the foundation for any additional training or experience gained during fellowship. In addition, the first cohort of surgeons trained in the modern era only began to enter practice in 2008. Given the study time frame, many procedures were performed by surgeons who completed most or all of their residency training before the implementation of the new duty hour standards and the accelerated rate of fellowship enrollment. Thus, the effect of fellowship training is likely to be less important in this study than it will be in the future.
Fourth, the study is limited to information contained in administrative data across 2 states. Therefore, the results are subject to the same limitations common to all studies performed using inpatient claims data. Moreover, we were only able to examine program rankings for 28.7% of general surgery residency programs, and the desire for trainees to practice in certain areas of the country may have influenced the results.
Finally, the program graduates were grouped together over a 20-year period, and the analysis does not account for potential changes in a given program over time. While subset analyses suggest that a focused analysis of surgeons who graduated more recently would give similar rankings, these analyses were limited by the low numbers of programs included. Future studies designed to control for some of these limitations will help to develop a system that appropriately incentivizes general surgery residency programs to train surgeons to achieve optimal patient outcomes that meet population needs.
The study has several strengths. Results include outcomes across a broad array of surgical procedures performed by general surgeons following the residency period. The study considers the role of advanced training by adjusting for surgeon specialty and examines 4 medical and surgical outcomes that can be influenced by the quality of the care provided to the patients. Results are also adjusted for the major patient, surgeon, and hospital characteristics known to influence outcomes.
This study demonstrates the feasibility of ranking general surgery residency programs using the outcomes of patients treated by the programs’ graduates. The ranking system was able to successfully classify programs based on outcomes achieved by surgeons with variable amounts of clinical experience beyond the training period. However, as the rankings differed by the individual measures tested, careful consideration will need to be put into the choice of metrics used in any residency program assessment system.
Accepted for Publication: July 4, 2015.
Corresponding Author: Rachel R. Kelz, MD, MSCE, Department of Surgery, Center for Surgery and Health Economics, University of Pennsylvania Health System, 3400 Spruce St, 4 Silverstein Bldg, Philadelphia, PA 19104 (firstname.lastname@example.org).
Published Online: October 28, 2015. doi:10.1001/jamasurg.2015.3637.
Author Contributions: Dr Simmons had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Drs Bansal and Simmons contributed equally to this work.
Study concept and design: Bansal, Simmons, Epstein, Kelz.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: Bansal, Simmons, Kelz.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Simmons.
Administrative, technical, or material support: Epstein, Morris, Kelz.
Study supervision: Kelz.
Conflict of Interest Disclosures: None reported.