A, Mortality; the mortality rate for laparoscopic gastric bypass was zero for all hospital caseloads. B, Severe morbidity. C, Any morbidity.
Krell RW, Hozain A, Kao LS, Dimick JB. Reliability of Risk-Adjusted Outcomes for Profiling Hospital Surgical Quality. JAMA Surg. 2014;149(5):467-474. doi:10.1001/jamasurg.2013.4249
Importance
Quality improvement platforms commonly use risk-adjusted morbidity and mortality to profile hospital performance. However, given small hospital caseloads and low event rates for some procedures, it is unclear whether these outcomes reliably reflect hospital performance.
Objective
To determine the reliability of risk-adjusted morbidity and mortality for hospital performance profiling using clinical registry data.
Design, Setting, and Participants
A retrospective cohort study was conducted using data from the American College of Surgeons National Surgical Quality Improvement Program, 2009. Participants included all patients (N = 55 466) who underwent colon resection, pancreatic resection, laparoscopic gastric bypass, ventral hernia repair, abdominal aortic aneurysm repair, and lower extremity bypass.
Main Outcomes and Measures
Outcomes included risk-adjusted overall morbidity, severe morbidity, and mortality. We assessed reliability (0-1 scale: 0, completely unreliable; and 1, perfectly reliable) for all 3 outcomes. We also quantified the number of hospitals meeting minimum acceptable reliability thresholds (>0.70, good reliability; and >0.50, fair reliability) for each outcome.
Results
For overall morbidity, the most common outcome studied, the mean reliability depended on sample size (ie, how high the hospital caseload was) and the event rate (ie, how frequently the outcome occurred). For example, mean reliability for overall morbidity was low for abdominal aortic aneurysm repair (reliability, 0.29; sample size, 25 cases per year; and event rate, 18.3%). In contrast, mean reliability for overall morbidity was higher for colon resection (reliability, 0.61; sample size, 114 cases per year; and event rate, 26.8%). Colon resection (37.7% of hospitals), pancreatic resection (7.1% of hospitals), and laparoscopic gastric bypass (11.5% of hospitals) were the only procedures for which any hospitals met a reliability threshold of 0.70 for overall morbidity. Because severe morbidity and mortality are less frequent outcomes, their mean reliability was lower, and even fewer hospitals met the thresholds for minimum reliability.
Conclusions and Relevance
Most commonly reported outcome measures have low reliability for differentiating hospital performance. This is especially important for clinical registries that sample rather than collect 100% of cases, which can limit hospital case accrual. Eliminating sampling to achieve the highest possible caseloads, adjusting for reliability, and using advanced modeling strategies (eg, hierarchical modeling) are necessary for clinical registries to increase their benchmarking reliability.
Clinical registries have had a prominent role in increasing transparency and accountability for the outcomes of surgical care. Many, if not all, of the preeminent surgical clinical registries use risk-adjusted outcomes feedback to benchmark performance and guide surgical quality improvement efforts.1-4 With the increased prevalence of linking postoperative outcomes to reimbursements and quality improvement efforts, it is important that outcome measures be highly reliable to avoid misclassifying hospitals.1,5
However, a systematic evaluation of the statistical reliability of commonly used outcome metrics in surgery is lacking.6-8 Because of financial or personnel limitations, not all surgical registries capture 100% of cases from their participating hospitals.9 As a consequence, the yearly maximum number of cases reported by many hospitals in those programs can be limited. The combination of low caseload and low outcome rates reduces the ability of many outcomes to distinguish true quality differences among providers, which results in low reliability, analogous to power limitations in clinical trials.7 Several studies10,11 have called into question the reliability of certain complications for measuring quality in specific clinical populations. A better understanding of the reliability of commonly reported risk-adjusted outcomes and measures to counteract low reliability will help to improve the accuracy of surgical outcome reporting.
In this context, we conducted an evaluation of the statistical reliability of 3 commonly used outcomes (mortality, severe morbidity, and overall morbidity) for profiling hospital performance across multiple procedures. We used logistic regression modeling techniques, a common risk-adjustment method, to calculate risk-adjusted mortality and morbidity rates following 6 different procedures. We then examined the reliability of those measures by investigating the effect of hospital caseload (ie, reported cases) on outcome reliability and then by assessing the number of hospitals that met 2 commonly accepted minimum reliability standards. We hypothesized that limited caseloads and rare event rates would result in low reliability for most commonly reported outcomes, even in clinically rich surgical registries.
We analyzed data from the 2009 American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) clinical registry. Details of data collection and validation in ACS-NSQIP have been provided elsewhere.12 In brief, the registry includes more than 135 variables encompassing patient and operative characteristics, 21 postoperative complications, reoperation, and 30-day mortality. Using relevant Current Procedural Terminology codes, we identified patients undergoing colon resection, pancreatic resection, laparoscopic gastric bypass, open ventral hernia repair, abdominal aortic aneurysm (AAA) repair, or lower extremity bypass procedures.
Our primary outcomes of interest were risk-adjusted overall morbidity, severe morbidity, and mortality. Postoperative complications recorded by ACS-NSQIP include surgical (wound dehiscence, bleeding, graft failure, or superficial, deep, or organ-space surgical site infection), medical (cardiac arrest, myocardial infarction, deep venous thrombosis, pulmonary embolism, urinary tract infection, renal insufficiency, or acute renal failure), pulmonary (pneumonia, prolonged intubation, or unplanned intubation), nervous (coma, stroke, or peripheral nerve injury), and systemic (sepsis or septic shock) complications. In addition, ACS-NSQIP records reoperation and 30-day postoperative mortality rates. For the present study, we defined 30-day morbidity as any of the 21 possible complications. To define severe morbidity, we excluded superficial surgical site infection, deep venous thrombosis, urinary tract infection, peripheral nerve injury, or progressive renal insufficiency.
We entered patient demographics, comorbid conditions, and operative characteristics when applicable into a forward stepwise logistic regression model with each outcome (mortality, severe morbidity, and morbidity) as a dependent variable. Those variables with coefficient P < .05 from the stepwise regression model were then used in a logistic regression model for each outcome to generate a patient’s probability of experiencing that particular outcome. We repeated the process across procedure types. To generate hospital risk-adjusted outcome rates, patient probabilities were then summed for each hospital to give its expected number of events and compared with the hospital’s observed events to generate hospital-level observed-to-expected ratios. Multiplying each hospital’s observed-to-expected ratio by the mean outcome rate yielded its risk-adjusted rate.
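The observed-to-expected calculation described above can be sketched as follows. This is a minimal illustration, not the authors' Stata code: the function name `risk_adjusted_rates`, the record layout, and the toy data are assumptions, and the predicted probabilities are taken as given rather than fitted by a logistic model.

```python
from collections import defaultdict


def risk_adjusted_rates(records, overall_rate):
    """Return {hospital_id: risk-adjusted rate}.

    records: iterable of (hospital_id, predicted_probability, observed_event),
    where observed_event is 0 or 1 (hypothetical layout for illustration).
    """
    observed = defaultdict(float)  # observed events per hospital
    expected = defaultdict(float)  # summed predicted probabilities per hospital
    for hospital_id, predicted_probability, observed_event in records:
        observed[hospital_id] += observed_event
        expected[hospital_id] += predicted_probability

    rates = {}
    for hospital_id in observed:
        oe_ratio = observed[hospital_id] / expected[hospital_id]
        # Risk-adjusted rate = O/E ratio multiplied by the mean outcome rate.
        rates[hospital_id] = oe_ratio * overall_rate
    return rates


# Toy data: two hypothetical hospitals with four patients each.
records = [
    ("A", 0.10, 0), ("A", 0.30, 1), ("A", 0.20, 0), ("A", 0.40, 1),
    ("B", 0.25, 0), ("B", 0.25, 0), ("B", 0.25, 1), ("B", 0.25, 0),
]
overall_rate = sum(event for _, _, event in records) / len(records)
rates = risk_adjusted_rates(records, overall_rate)
```

In this toy example, hospital A has twice as many events as its case mix predicts, so its risk-adjusted rate is twice the overall rate, while hospital B matches its expected count and lands at the overall rate.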
Reliability is a quantification of the proportion of provider performance variation explained by true quality differences (ie, statistical signal) and is measured on a scale of 0 (all differences attributable to measurement error) to 1 (all differences attributable to quality differences). A requisite for calculating reliability is the calculation of statistical “noise” for a particular outcome. Reliability is then defined as the ratio of signal to (signal + noise).13 To determine the reliability of each outcome measure, we used hierarchical logistic regression modeling. We defined signal as the variance of hospital random effect intercepts in the logistic model after full adjustment for patient risk factors.10 We quantified a hospital’s statistical noise by estimating that hospital’s measurement error variance in the logistic regression model.6,14 The reliability of each hospital’s risk-adjusted outcome rate was then calculated as signal/(signal + noise).
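The signal/(signal + noise) ratio can be illustrated with a short sketch. The study estimated both quantities from a hierarchical logistic model. The version below instead assumes, for illustration only, that noise is the binomial sampling variance of an observed proportion, p(1 − p)/n, and that signal is a between-hospital variance on the same probability scale. The function names and all numeric inputs are hypothetical.

```python
def reliability(signal_variance, noise_variance):
    """Reliability = signal / (signal + noise), on a 0-1 scale."""
    return signal_variance / (signal_variance + noise_variance)


def binomial_noise_variance(event_rate, caseload):
    """Assumed noise model: sampling variance of a proportion, p(1 - p)/n.

    This is a simplification, not the measurement error variance
    estimated from the hierarchical model in the study.
    """
    return event_rate * (1.0 - event_rate) / caseload


# Hypothetical inputs: between-hospital variance of 0.001
# (standard deviation of about 3 percentage points) and a 20% event rate.
signal = 0.001
low_caseload = reliability(signal, binomial_noise_variance(0.20, 25))
high_caseload = reliability(signal, binomial_noise_variance(0.20, 100))
```

Under these assumptions the same between-hospital signal yields much lower reliability at 25 cases than at 100 cases, because the noise term shrinks as caseload grows.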
To assess the influence of caseload on reliability, we created hospital caseload (ie, cases reported by each hospital) terciles for each procedure. We then calculated the mean reliability of each outcome measure across caseload terciles. In further analysis, we quantified the number of hospitals with greater than 0.70 or 0.50 reliability for each outcome by procedure. A reliability of 0.70 is considered adequate for differentiating provider performance.6,15 Finally, we used the hospital-level random intercept variance in the hierarchical model as well as the total measurement error for each procedure group to calculate the number of cases needed to achieve 0.70 and 0.50 reliability for each outcome across procedures.
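The caseload needed to reach a reliability threshold follows from inverting the reliability ratio. The sketch below assumes that a hospital's noise equals a per-case error variance divided by its caseload n, so that R = s/(s + v/n) and therefore n = [R/(1 − R)] × (v/s). The function name `cases_needed` and the example variances are hypothetical and are not the values behind Table 4.

```python
import math


def cases_needed(target_reliability, signal_variance, per_case_noise_variance):
    """Smallest caseload n with signal / (signal + per_case_noise / n) >= target.

    Assumes noise scales as per_case_noise_variance / n (illustrative model).
    """
    if not 0.0 < target_reliability < 1.0:
        raise ValueError("target_reliability must be strictly between 0 and 1")
    odds = target_reliability / (1.0 - target_reliability)
    return math.ceil(odds * per_case_noise_variance / signal_variance)


# Hypothetical inputs: signal variance 0.0012 and a 20% event rate,
# giving a per-case binomial variance of 0.20 * 0.80 = 0.16.
signal = 0.0012
per_case_noise = 0.20 * (1.0 - 0.20)
n_fair = cases_needed(0.50, signal, per_case_noise)
n_good = cases_needed(0.70, signal, per_case_noise)
```

Because the required caseload scales with R/(1 − R), moving from a 0.50 to a 0.70 threshold multiplies the required caseload by roughly 2.3 under this model, and a rarer outcome or smaller between-hospital variance raises it further.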
We performed all statistical analyses using Stata, release 12 (StataCorp). The study protocol was reviewed and determined as “not regulated” by the University of Michigan Institutional Review Board.
There were 55 466 patients in 199 hospitals who underwent colon resection, pancreatic resection, laparoscopic gastric bypass, open ventral hernia repair, AAA repair, or lower extremity bypass procedures. Descriptive characteristics of the patients, unadjusted and adjusted outcome rates, and hospital caseload (ie, their collected cases) are presented in Table 1. Overall morbidity was the most frequent outcome across all procedures and ranged from 5.5% (laparoscopic gastric bypass) to 31.0% (pancreatic resection). Severe morbidity varied widely by procedure, ranging from 2.8% (laparoscopic gastric bypass) to 24.6% (pancreatic resection). Mortality was the least frequent outcome, ranging from 0.2% (laparoscopic gastric bypass) to 5.4% (AAA). Hospital caseload varied widely across procedures as well (Table 1). Colon resection was the most commonly captured procedure performed, with hospitals averaging 114 cases per year, and pancreatic resection was the least commonly captured procedure, with hospitals performing a mean of 17 cases per year.
Mean reliability for each outcome across procedure types and hospital volume is presented in Table 2 and graphically in the Figure. Mean reliability for overall morbidity, the most frequent outcome, ranged from 0.17 (lower extremity bypass) to 0.61 (colon resection). Mean reliability for severe morbidity ranged from 0.13 (laparoscopic gastric bypass) to 0.49 (colon resection). Mean reliability for mortality ranged from 0 (laparoscopic gastric bypass) to 0.39 (colon resection).
Reliability for each outcome depended on how frequently the event occurred, with more common outcomes having higher reliability (Figure). Mean reliability for infrequent events such as mortality was lower than that for more frequent events such as overall morbidity. For example, reliability for mortality following pancreatic resection (mean risk-adjusted mortality rate, 4.9%) was 0.06 and ranged from 0.01 in low-accrual hospitals to 0.13 in high-accrual hospitals. In contrast, reliability for overall morbidity following pancreatic resection (mean risk-adjusted overall morbidity rate, 31.0%) was 0.33 and ranged from 0.11 in low-accrual hospitals to 0.60 in high-accrual hospitals (Table 2). An exception to the trends we observed was with lower extremity bypass, in which reliability for severe morbidity was higher than reliability for overall morbidity across hospital caseloads (Table 2).
Reliability was generally higher for more commonly captured procedures (Table 2). For example, mean reliability for overall morbidity was higher for common procedures such as colon resection (mean caseload, 114/y; mean reliability, 0.61) than for less commonly captured procedures such as AAA repair (mean caseload, 25/y; mean reliability, 0.29). This relationship persisted when comparing only the highest-volume hospitals. Mean reliability for morbidity in high-volume hospitals for colon resections was 0.75, and mean reliability for high-volume hospitals for AAA repair was 0.47 (Table 2).
Moreover, reliability for all outcomes increased in a stepwise fashion as hospital caseload increased for all procedures (Figure). For example, reliability for overall morbidity following AAA repair (mean reliability, 0.29) ranged from 0.12 in low-caseload hospitals to 0.47 in high-caseload hospitals (Table 2). Pancreatic resection and laparoscopic gastric bypass showed the largest variation in outcome reliability across hospital caseloads. For example, mean reliability for severe morbidity following pancreatic resection ranged from 0.08 in low-caseload hospitals to 0.52 in high-caseload hospitals, and mean reliability for overall morbidity following laparoscopic gastric bypass ranged from 0.19 in low-caseload hospitals to 0.68 in high-caseload hospitals (Figure). An exception to this general trend was reliability for mortality following laparoscopic gastric bypass. All hospitals had reliability of zero for mortality regardless of caseload (Figure).
Table 3 reports the proportion of hospitals that met 2 common reliability benchmarks for each outcome. For overall morbidity, the most frequent outcome, colon resection (37.7% of hospitals), pancreatic resection (7.1%), and laparoscopic gastric bypass (11.5%) were the only procedures for which any hospitals met a reliability threshold of 0.70, which is considered good.6 When assessing a reliability threshold of 0.50, which is considered fair, few hospitals met the reliability benchmark for most procedures (Table 3). An exception was colon resection, in which 80.4% of hospitals met a 0.50 reliability threshold for overall morbidity. For lower event rate outcomes (ie, severe morbidity and mortality), fewer hospitals met reliability thresholds. Colon resection (2.5% of hospitals) and pancreatic resection (3.0% of hospitals) were the only procedures for which hospitals met a 0.70 reliability threshold for severe morbidity. Colon resection (1.5% of hospitals) was the only procedure for which hospitals met a 0.70 reliability threshold for mortality (Table 3). No hospitals met a reliability threshold of 0.70 for any outcome following ventral hernia repair, AAA repair, or lower extremity bypass.
Table 4 lists the calculated number of cases required to achieve reliability benchmarks for each outcome across procedures. In general, as outcomes became less frequent, hospitals would have to provide larger caseloads to achieve 0.50 or 0.70 reliability. For example, to meet 0.50 reliability for mortality, a hospital would have to perform 147 colon resections, 237 pancreatic resections, 520 ventral hernia repairs, 1342 AAA repairs, or 151 lower extremity bypass procedures (Table 4). With more frequent outcomes (overall morbidity), hospitals would require smaller caseloads to meet reliability thresholds.
As quality measurement platforms are increasingly used for public reporting and value-based purchasing, it has never been more important to have reliable performance measures.5,16,17 Reliability is the most widely used indicator to assess an outcome’s capability to detect differences in quality if they exist.13 This is analogous to a power calculation used to avoid type II errors (failure to detect a real difference between groups) in clinical trials. Similar to the need for sufficient sample size and large enough treatment effect to have adequate power in a clinical trial, hospital outcomes measurements require both large enough caseloads and frequent enough adverse event rates to reliably capture quality differences.6 We have demonstrated that commonly used outcome measures have low reliability for hospital profiling for a diverse range of procedures. Hospital caseload was a strong driver for outcome reliability, with higher-caseload hospitals showing the most reliable outcomes. However, with infrequent outcomes, the number of submitted cases needed for adequate outcome reliability was much larger than most hospitals were able to provide. Our findings underscore the importance of carefully considering reliability when designing outcomes feedback programs for providers.
There have been few studies6,7,10,15 assessing outcome measure reliability using claims or clinical registry data. Most have shown that many hospitals lack the caseloads to reliably detect differences in performance for certain outcomes in specific clinical populations. Dimick et al7 demonstrated that few hospitals met caseload requirements to detect meaningful differences from performance benchmarks following cardiovascular, pancreatic, esophageal, or neurosurgical procedures. In a study similar to ours, Kao et al10 used ACS-NSQIP data to evaluate the reliability of surgical site infection as a quality indicator following colon resection and found that only half of the hospitals examined had adequate caseloads to meet reliability benchmarks. The present study goes further and provides a comprehensive evaluation of the reliability of 3 commonly used outcomes across a collection of general and vascular procedures and highlights the reliability problems that can occur with low caseloads and infrequent outcomes.
Outcomes with low reliability can mask both poor and outstanding performance relative to benchmarks. Hospitals with poor outcomes might assume they have no quality problems when they do (analogous to a type II error). Likewise, outcomes with low reliability may cause average (or well-performing) hospitals to be spuriously labeled as poor performers (analogous to a type I error: detecting a difference between groups when none exists). Without a formal assessment of outcome reliability, it is unclear whether a hospital’s apparent performance reflects its quality or simply an inadequate caseload. When reporting outcomes, most quality reporting programs use P values and/or CIs to assign significance to a hospital’s performance relative to benchmarks. However, these significance measures are often relegated to a footnote or dismissed. When hospitals act to investigate and amend a spuriously high outcome rate, they may direct resources to where they do not have a problem; this is known as tampering in the quality improvement lexicon.18,19 Given the cost of maintaining and implementing quality improvement programs, hospitals have a vested interest in using highly reliable outcome measures to minimize misclassification and unnecessary spending.
There are 3 main strategies to improve the reliability of outcome measures. One approach is to increase the caseload by sampling 100% of certain procedures.20,21 An alternative approach gaining momentum is the use of reliability adjustment. This technique has been discussed extensively elsewhere22 and is gaining traction in several statewide and national outcomes reporting programs. In brief, reliability adjustment uses empirical Bayes techniques to shrink a provider’s risk-adjusted outcome rate toward the overall mean rate, according to the provider’s caseload.23 Reliability adjustment has been demonstrated11,24 to more accurately predict future hospital performance for both general surgical and vascular procedures. A third option to increase reliability is by using composite quality indicators that combine quality signal from other measures and procedures within a hospital, such as outcomes from multiple related procedures, length of stay, and reoperation rate.23,25,26 Composite measures have been shown25 to more accurately predict future hospital performance compared with a single risk-adjusted outcome measure. Although these strategies are far from universal, they are gaining traction in some registries. For example, ACS-NSQIP has been among the leaders in implementing best practices to increase the reliability of outcome measures. Specifically, ACS-NSQIP now offers 100% sampling for certain procedures, uses hierarchical modeling and reliability adjustment for reporting outcomes, and has investigated using composite measures for certain procedures for use in quality profiling.26
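The shrinkage step in reliability adjustment can be sketched in a few lines. This uses the standard empirical Bayes weighted average, in which the hospital's own rate is weighted by its reliability and the overall mean by the remainder. The function name and the example rates are hypothetical, and registries that apply reliability adjustment typically do so within a hierarchical model rather than with this closed form.

```python
def reliability_adjusted_rate(hospital_rate, overall_mean_rate, reliability):
    """Shrink a hospital's risk-adjusted rate toward the overall mean.

    adjusted = reliability * hospital_rate + (1 - reliability) * overall_mean
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return reliability * hospital_rate + (1.0 - reliability) * overall_mean_rate


# Hypothetical hospital with a 40% risk-adjusted rate against a 20% mean.
low_reliability_estimate = reliability_adjusted_rate(0.40, 0.20, 0.29)
high_reliability_estimate = reliability_adjusted_rate(0.40, 0.20, 0.75)
```

With a reliability of 0.29 the hospital's estimate is pulled most of the way back to the mean, whereas with a reliability of 0.75 it stays close to the hospital's own rate, which is how low-caseload outliers are prevented from dominating a ranking.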
There are several important limitations to the present study. Our results may not be generalizable to clinical registries that already capture nearly 100% of their patients.22 However, even with 100% case capture, some hospitals that participate in clinical registries may not have the caseload for reliable benchmarking, especially if considering rare outcomes (eg, mortality) or uncommon procedures (eg, pancreatectomy). This underscores the importance of using other methods for increasing reliability (eg, composite measures and reliability adjustment) as well. Another limitation of this study is that ACS-NSQIP may not be generalizable to all US hospitals because it oversamples larger teaching hospitals.
Currently, outcomes reported by many clinical registries may have low reliability for profiling hospital performance for most commonly performed general and vascular surgery procedures. Implementing procedure-targeted data collection and accounting for statistical reliability when reporting outcomes will better inform hospitals of where they stand relative to their peers. More broadly, providers and payers should consider strategies to improve reliability when using clinical registry data for performance profiling, such as 100% sampling of high-risk conditions, reliability adjustment for outcomes reporting, and use of composite measures. Such measures should give more insight into quality differences between providers and better target high leverage areas for quality improvement.
Accepted for Publication: July 15, 2013.
Corresponding Author: Robert W. Krell, MD, Department of Surgery, University of Michigan, 2800 Plymouth Rd, Bldg 16, Office 016-100N-13, Ann Arbor, MI 48109 (firstname.lastname@example.org).
Published Online: March 12, 2014. doi:10.1001/jamasurg.2013.4249.
Author Contributions: Dr Dimick had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Krell, Dimick.
Acquisition, analysis, or interpretation of data: Krell, Hozain, Dimick.
Analysis and interpretation of data: All authors.
Drafting of the manuscript: Krell, Hozain, Dimick.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Krell, Hozain, Dimick.
Obtained funding: Dimick.
Administrative, technical, or material support: Dimick.
Study supervision: Dimick.
Conflict of Interest Disclosures: Dr Dimick has a financial interest in ArborMetrix, Inc, which had no role in the analysis herein. No other disclosures were reported.
Funding/Support: Dr Krell is supported by grant 5T32CA009672-22 from the National Institutes of Health.
Role of the Sponsor: The National Institutes of Health had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Disclaimer: The ACS-NSQIP and the hospitals participating in the ACS-NSQIP are the source of the original data and cannot verify or be held responsible for the statistical validity of the data analysis or the conclusions derived by the authors.