Dimick JB, Welch HG, Birkmeyer JD. Surgical Mortality as an Indicator of Hospital QualityThe Problem With Small Sample Size. JAMA. 2004;292(7):847–851. doi:10.1001/jama.292.7.847
Author Affiliations: VA Outcomes Group, Department of Veterans Affairs Medical Center, White River Junction, Vt (Drs Dimick and Welch); Center for the Evaluative Clinical Sciences, Dartmouth Medical School, Hanover, NH (Drs Dimick and Welch); and Michigan Surgical Collaborative for Outcomes Research and Evaluation (M-SCORE), Department of Surgery, University of Michigan Medical Center, Ann Arbor (Drs Dimick and Birkmeyer).
Context Surgical mortality rates are increasingly used to measure hospital quality.
It is not clear, however, how many hospitals have sufficient caseloads to
reliably identify quality problems.
Objective To determine whether the 7 operations for which mortality has been advocated
as a quality indicator by the Agency for Healthcare Research and Quality (coronary
artery bypass graft [CABG] surgery, repair of abdominal aortic aneurysm, pancreatic
resection, esophageal resection, pediatric heart surgery, craniotomy, hip
replacement) are performed frequently enough to reliably identify hospitals
with increased mortality rates.
Design and Setting The US national average mortality rates and hospital caseloads of the
7 operations were determined using the 2000 Nationwide Inpatient Sample (NIS),
and sample size calculations were performed to determine the minimum caseload
necessary to reliably detect increased mortality rates in poorly performing
hospitals. A 3-year hospital caseload was used for the baseline analysis,
and poor performance was defined as a mortality rate double the national average.
Main Outcome Measure Proportion of hospitals in the United States that performed more than
the minimum caseload for each operation.
Results The national average mortality rates for the 7 procedures examined ranged
from 0.3% for hip replacement to 10.7% for craniotomy. Minimum hospital caseloads
necessary to detect a doubling of the mortality rate were 64 cases for craniotomy,
77 for esophageal resection, 86 for pancreatic resection, 138 for pediatric
heart surgery, 195 for repair of abdominal aortic aneurysm, 219 for CABG surgery,
and 2668 for hip replacement. For only 1 operation did the majority of hospitals
exceed the minimum caseload, with 90% of hospitals performing CABG surgery
having a caseload of 219 or higher. For the remaining operations, only a small
proportion of hospitals met the minimum caseload: craniotomy (33%), pediatric
heart surgery (25%), repair of abdominal aortic aneurysm (8%), pancreatic
resection (2%), esophageal resection (1%), and hip replacement (<1%).
Conclusion Except for CABG surgery, the operations for which surgical mortality
has been advocated as a quality indicator are not performed frequently enough
to judge hospital quality.
Patients and policy makers increasingly use rates of surgical mortality
to assess hospital performance. New York and Pennsylvania have long-standing
systems for tracking and publicly reporting risk-adjusted mortality rates
after cardiac surgery1,2; California
and New Jersey have more recently adopted this approach.3,4 The
Leapfrog Group, a large coalition of employers and purchasers, has made surgical
mortality rates one of the criteria for "evidence-based referral" for cardiac
procedures.5 As part of its broader efforts
to develop a core set of quality indicators, the Agency for Healthcare Research
and Quality (AHRQ) has recently endorsed the use of surgical mortality rates
for 7 surgical procedures including repair of abdominal aortic aneurysm, esophageal
resection, and hip replacement.6
However, there are 2 reasons to question whether rates of surgical mortality
can reliably detect quality problems. First, the targeted operations are infrequently
performed at individual hospitals. Second, the mortality rates for many of
these procedures are often relatively low. Small samples and low event rates
combine to limit the statistical power of a comparison between an individual
hospital and a population-based benchmark. The practical implication of limited
power is that patients and policymakers may not identify a hospital with quality
problems. Although the general problem of failing to detect important differences—type
II error—is well recognized in the context of clinical trials, it is
often overlooked in quality measurement.7
This study was designed to explore this problem for the 7 surgical procedures
suggested for mortality measurement by the AHRQ. Using data from the Nationwide
Inpatient Sample (NIS), we determined the national average mortality rate
for each procedure and the number of cases performed in each hospital. We
then estimated the minimum sample size needed to identify a poorly performing
hospital as significantly different from the national average mortality rate.
Finally, we determined the proportion of US hospitals that exceed this minimum
caseload—ie, those hospitals for which mortality would reliably reflect
The data are from the 2000 NIS maintained by the AHRQ as part of the
Healthcare Cost and Utilization Project.8 The
NIS is a database of all discharges from a nationally representative sample
of 994 hospitals (randomly selected within strata for region, number of hospital
beds, teaching status, urban vs rural location, and hospital ownership) containing
data for approximately 20% of all acute care hospitalizations in the United
States. Hospital weights were used to generate estimates of mortality rates
and caseload distributions that represent all hospitals in the United States.
Because the NIS includes different hospitals each year, we were not able to
directly determine hospital caseloads over several years. We therefore assumed
that hospital caseloads are constant over time and estimated 3- and 5-year
caseloads using the 2000 NIS data.
We examined the 7 surgical procedures for which mortality has been advocated
as a performance measure by the AHRQ Inpatient Quality Indicators6: coronary artery bypass graft (CABG) surgery, repair
of abdominal aortic aneurysm, pancreatic resection, esophageal resection,
pediatric heart surgery, craniotomy, and hip replacement. For 6 of the 7 operations,
discharges were identified in the NIS database using the appropriate combination
of International Classification of Diseases, Ninth Revision,
Clinical Modification (ICD-9-CM)9 procedure
and diagnostic codes suggested by the AHRQ. For craniotomy, the AHRQ selection
criteria were based on a diagnosis related group code 01, which includes less-extensive
procedures (eg, sinus surgery, shunt placement). We therefore restricted the
analysis to discharges with an ICD-9-CM procedure
code for craniotomy.
The analysis had 4 steps. First, we used the NIS data to determine our
benchmark, ie, the national average mortality rate for each procedure. Mortality
was defined as any death during the index hospital stay.
Second, we performed a sample size calculation for each procedure to
determine the minimum caseload necessary to reliably detect increased mortality
in a poorly performing hospital. For the baseline analysis, we defined poorly
performing hospitals as having a mortality rate of twice our benchmark (ie,
the effect size for the sample size calculation was the difference between
the national average mortality rate and twice the national average mortality
Sample size calculations were based on 1-sample, 1-sided tests (α
= .05) with a power of 80%. Although unusual, 1-sample, 1-sided tests are
appropriate for the task: 1-sample tests because we are interested in detecting
whether an individual hospital is significantly different than a population
benchmark, and 1-sided tests because we are only interested in determining
whether the hospital mortality is higher than the benchmark. The effect of
both assumptions is to reduce the minimum caseload necessary for each procedure;
thus, the assumptions are conservative given our question. Sample size calculations
were performed using STATA version 8.0 (STATA Corp, College Station, Tex).
Third, we determined the proportion of hospitals that met or exceeded
the minimum caseload in the NIS. The numerator of each proportion was the
number of hospitals that met or exceeded the minimum caseload over a 3-year
period. The denominator included all hospitals performing at least 1 procedure.
Finally, because hospital performance may be measured over different
time periods, we conducted sensitivity analyses varying the period of observation
from a low of 1 year to a high of 5 years. We also varied the definition of
poor performance and repeated the analysis using a more subtle increase of
1.5 times the national average mortality rate (our benchmark).
Table 1 shows the national
average mortality rates for the 7 procedures examined, which ranged from 0.3%
for hip replacement to 10.7% for craniotomy. Table 1 also shows that annual caseloads of individual hospitals
varied widely. The median caseloads ranged from 4 cases per hospital for pediatric
heart surgery to 491 for CABG surgery. The variability is also evident within
operations. For example, the caseloads for repair of abdominal aortic aneurysm
ranged from 1 to 199 per hospital.
The minimum hospital caseload necessary to detect a mortality rate of
twice the national benchmark was inversely related to the operative mortality
rate of each procedure. The minimum caseloads varied from 64 cases for craniotomy
to 2668 cases for hip replacement. The minimum caseloads for other operations
were as follows: esophageal resection (77), pancreatic resection (86), pediatric
heart surgery (138), repair of abdominal aortic aneurysm (195), and CABG surgery
For only 1 operation did the majority of hospitals exceed the minimum
caseload: 90% of hospitals performing CABG surgery had a caseload of 219 or
higher. For the remaining operations, only a small proportion of hospitals
met the minimum caseload: craniotomy (33%), pediatric heart surgery (25%),
repair of abdominal aortic aneurysm (8%), pancreatectomy (2%), esophagectomy
(1%), and hip replacement (<1%). Figure
1 presents a detailed view of the data, showing the distribution
of actual hospital caseloads relative to the minimum caseload needed to detect
a doubling of the operative mortality rate.
The impact of changing the number of years of data and altering the
definition of poor performance is shown in Table 2. Even after increasing the sample size by using 5-year hospital
caseloads, CABG surgery was still the only operation with more than half of
US hospitals having more than the minimum caseload. Changing the definition
of poor performance to 1.5 times the benchmark (ie, a 50% increase above the
national average mortality rate) dramatically decreased the number of hospitals
having more than the minimum caseload. Under these conditions, only 54% of
US hospitals performing CABG surgery met the minimum caseload in a 3-year
Although the problem of small sample size has received considerable
attention in the context of clinical trials, it has received less consideration
in quality measurement. The ability to detect a difference between the mortality
rate at an individual hospital and a benchmark rate is dependent on both the
baseline mortality rate and the number of cases performed. For mortality to
be a useful measure of quality, the procedure must both have a relatively
high mortality rate and be performed frequently. Our findings suggest that
CABG surgery fulfills these criteria. For the other 6 operations in our analysis,
however, less than half of hospitals in the United States perform enough cases
to detect a doubling of the mortality rate. For these procedures, other approaches
to measuring quality will be required.
The problem of continuing to use rates of surgical mortality as an indicator
of hospital quality is perhaps most pronounced for hospitals with truly poor
performance. These hospitals are falsely reassured that their performance
is "average" and therefore have less incentive to improve. Payers are falsely
reassured that they are buying a good product and miss an opportunity to steer
patients away from poorly performing hospitals through selective contracting
or other mechanisms. Patients are falsely reassured that they are choosing
a safe hospital.
Regarding the generalizability of our findings, the sample used in our
analysis included only 20% of hospitals in the United States. However, the
hospitals were chosen as a stratified random sample that is specifically intended
to represent all US hospitals.8 Furthermore,
our analysis was limited to 7 surgical procedures, which were selected because
of their inclusion in the AHRQ Inpatient Quality Indicators.6 These
operations represent a wide range of mortality rates and hospital caseloads.
Although other operations could have been assessed, it is difficult to identify
any that are both common enough and sufficiently high-risk for mortality to
be a useful quality measure.
Regarding the assumptions used in our calculations, in our baseline
analysis we sought to detect a doubling of the mortality rate. Few would argue
that this increase is not clinically significant. In fact, both patients and
physicians would likely be interested in detecting more subtle differences
in performance. Based on our sensitivity analysis, it is clear that the usefulness
of mortality rates markedly declines when attempting to detect a mortality
rate of 1.5 times the benchmark. Furthermore, we used 3-year hospital caseloads
in our baseline analysis. Increasing the period of observation to 5 years
in our sensitivity analysis had little effect on our findings. Mortality rates
based on longer periods of observation are less relevant to current performance,
since surgical staff and practices may change over time.
We admittedly did not explicitly consider risk adjustment in our analysis.
Indeed, over the past 2 decades, the question of whether mortality rates are
useful measures of quality has focused largely on issues of risk adjustment.
However, while we acknowledge that adjusting for differences between hospitals
is of crucial importance, we believe that concerns of adequate sample size
must be addressed first. Without sufficient sample size, even perfect risk
adjustment does not matter.
Given the limited usefulness of procedure-specific mortality rates,
it is worth considering additional approaches to judging surgical quality.
The first alternative approach would be to increase the number of observations
by combining operations to produce an aggregate mortality rate. Perhaps the
most visible example of this approach is the National Surgical Quality Improvement
Program used in Veterans Affairs hospitals.10 This
approach, however, lacks information on the quality of care for individual
procedures. For instance, a hospital's high mortality with carotid endarterectomy
may be masked by the very low mortality of more common operations such as
laparoscopic cholecystectomies, hernia repairs, and appendectomies. When determined
from heterogeneous groups of procedures, an increased mortality rate does
not provide clear guidance about where to focus efforts at quality improvement.
A second approach would be to focus on other outcomes. To be useful,
these other outcomes must occur more frequently than mortality. Postoperative
complications often meet this requirement, but the clinical severity of complications
is extremely broad (eg, from superficial wound infection to ventilator-associated
pneumonia), and there is currently no standard approach to their classification
for quality measurement. Measures that are based on patient-oriented outcomes
such as quality of life, time until the return to work, or patient satisfaction
offer several advantages. Unlike mortality, all patients undergoing a given
operation experience these outcomes—making it feasible to detect clinically
important differences with smaller sample sizes. The usefulness of these measures,
however, depends on the availability of detailed clinical data. Such data
are not widely available and are often not practical on a large scale given
the expense of data collection.
The third approach would be to focus on indirect measures of quality,
including processes of care and procedural volume. Focusing on processes of
care has been standard for quality measurement for medical conditions (eg, β-blockers
for patients after myocardial infarction). Although this approach could be
applied to surgery in principle, the number of specific processes of care
that would be useful for this purpose is limited. One quality indicator that
has received significant attention recently is hospital or surgeon volume;
both have been shown to be associated with lower mortality rates for many
complex operations.11 Information on hospital
volume is easy and inexpensive to obtain. However, volume is often a poor
predictor of performance for individual hospitals or surgeons.
No single quality measure will be appropriate for all operations. Mortality
may work well for CABG surgery but will be too imprecise for use with other
procedures, such as pancreatic resection. Policy makers should consider sample
size in selecting the best quality measure for specific procedures, particularly
when data are used for public reporting. Otherwise, they run the risk of mislabeling
hospitals and misinforming patients.