Comparison of outcomes for incisional (A) and organ/space (B) surgical site infections: aggregate cohort unadjusted, aggregate cohort risk-adjusted, propensity-matched cohort unadjusted, and meta-analysis outcomes. Error bars represent 95% confidence intervals. Meta-analysis data from Sauerland et al.13
Hemmila MR, Birkmeyer NJ, Arbabi S, Osborne NH, Wahl WL, Dimick JB. Introduction to Propensity Scores: A Case Study on the Comparative Effectiveness of Laparoscopic vs Open Appendectomy. Arch Surg. 2010;145(10):939-945. doi:10.1001/archsurg.2010.193
Objective
To demonstrate the use of propensity scores to evaluate the comparative effectiveness of laparoscopic and open appendectomy.
Design
Retrospective cohort study.
Setting
Academic and private hospitals.
Patients
All patients undergoing open or laparoscopic appendectomy (n = 21 475) in the Public Use File of the American College of Surgeons National Surgical Quality Improvement Program were included in the study. We first evaluated the surgical approach (laparoscopic vs open) using multivariate logistic regression. We next generated propensity scores and compared outcomes for open and laparoscopic appendectomy in a 1:1 matched cohort. Covariates in the model for propensity scores included comorbidities, age, sex, race, and evidence of perforation.
Main Outcome Measures
Patient morbidity and mortality, rate of return to operating room, and hospital length of stay.
Results
Twenty-eight percent of patients underwent open appendectomy, and 72% had a laparoscopic approach; 33% (open) vs 14% (laparoscopic) had evidence of a ruptured appendix. In the propensity-matched cohort, there was no difference in mortality (0.3% vs 0.2%), reoperation (1.8% vs 1.5%), or incidence of major complications (5.9% vs 5.4%) between groups. Patients undergoing laparoscopic appendectomy experienced fewer wound infections (odds ratio [OR], 0.4; 95% confidence interval [CI], 0.3-0.5) and fewer episodes of sepsis (0.8; 0.6-1.0) but had a greater risk of intra-abdominal abscess (1.7; 1.3-2.2). An analysis using multivariate adjustment resulted in similar findings.
Conclusions
After accounting for patient severity, open and laparoscopic appendectomy had similar clinical outcomes. In this case study, propensity score methods and multivariate adjustment yielded nearly identical results.
Enthusiasm for “comparative effectiveness” research is at an all-time high. Comparative effectiveness research is aimed at providing information on the relative strengths and weaknesses of various medical treatments. Randomized clinical trials are widely heralded as the best method of evaluating the efficacy of medical treatments. The randomization process ensures that the 2 treatment groups are balanced for all potential patient characteristics and makes certain that inferences about the effectiveness of the treatment are not threatened by confounding variables.
However, randomized trials evaluate only the efficacy of the treatment in a narrow context and do not provide information on the effectiveness of the interventions when they are applied to a wider population. To evaluate interventions in the real world, we often must rely on observational studies. The Achilles heel of observational studies is potential confounding by differences in patient characteristics. Unlike randomized trials, where the chance assignment of treatment balances patient characteristics, there exists significant potential for selection bias in observational studies.
Studies that use propensity scores are appearing with increasing frequency in the surgical literature.1,2 Propensity scores are a statistical technique for dealing with selection bias in observational studies.3,4 Selection bias arises when certain types of patients are more or less likely to receive treatment owing to possible confounding by indication. For example, when selecting a surgical approach for appendicitis, physicians who suspect perforation may be more likely to perform an open (vs a laparoscopic) appendectomy. Thus, comparison of outcomes in the laparoscopic vs open group would be confounded by this difference in selection; that is, we would expect worse outcomes in the open group, even if there was no true difference between the approaches.
Given the growing use of this technique in the literature, it is important for surgeons to be familiar with propensity score analysis. With propensity scores, patient and provider characteristics are used to calculate the probability that a patient will receive the intervention.5,6 These scores are then added to multivariate models to risk adjust the analyses. Alternatively, propensity scores can be used to create matched patient cohorts.6-8 Both approaches aim to adjust for or balance patient characteristics, thus minimizing confounding due to selection bias.
In the present study, we use the comparison of laparoscopic vs open appendectomy as a case study to introduce propensity scores. Study of the surgical treatment of this disease is ideal for the use of propensity scores because the choice of technique is often confounded by observable factors, such as severity of illness and the presence of appendiceal perforation. Thus, an unadjusted comparison of the 2 techniques will yield inaccurate results. Indeed, the laparoscopic approach has had much more favorable results in recent large observational studies9,10 than in randomized clinical trials. To perform this case study, we used data from the Public Use File of the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP). We present an evaluation of laparoscopic vs open appendectomy for acute appendicitis.
This study was performed using 2005 to 2007 data from the Public Use File of the ACS-NSQIP. The study cohort consisted of patients with the postoperative diagnosis of acute appendicitis who received either an open or a laparoscopic procedure to resect the appendix. Specifically, patients were included who had a Current Procedural Terminology (CPT) code of 44950 (appendectomy), 44960 (appendectomy for ruptured appendix with abscess or generalized peritonitis), 44970 (laparoscopy, surgical, appendectomy), or 44979 (unlisted laparoscopy procedure, appendix) recorded as the principal operative procedure and a postoperative International Classification of Diseases, Ninth Revision, Clinical Modification, code of 540 (acute appendicitis), 540.0 (acute appendicitis with generalized peritonitis), 540.1 (acute appendicitis with peritoneal abscess), 540.9 (acute appendicitis without mention of peritonitis), 541 (appendicitis, unqualified), or 542 (other appendicitis). Patients who underwent exploratory laparotomy for acute appendicitis were excluded. In this “aggregate cohort,” 2 groups were formed. Patients in the open appendectomy group underwent an operative procedure with a CPT code of 44950 or 44960. The laparoscopic group consisted of patients who had CPT code 44970 or 44979 recorded as their principal operative procedure. Evidence of appendiceal perforation or rupture was defined by a CPT code of 44960 or an International Classification of Diseases, Ninth Revision, Clinical Modification, code of 540.0 to 540.1.
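The cohort definition above can be expressed as a small classification rule. The following is a minimal sketch, not the authors' code: the function name and the idea of passing the codes as plain strings are assumptions made for illustration, and the laparoscopic appendectomy code is taken to be CPT 44970 (laparoscopy, surgical, appendectomy).

```python
# Code sets taken from the inclusion criteria described in the text.
OPEN_CPT = {"44950", "44960"}
LAPAROSCOPIC_CPT = {"44970", "44979"}  # 44970 = laparoscopy, surgical, appendectomy
APPENDICITIS_ICD9 = {"540", "540.0", "540.1", "540.9", "541", "542"}
PERFORATION_ICD9 = {"540.0", "540.1"}


def classify_case(cpt, icd9):
    """Return (group, perforated) for one case, or None if the case is excluded.

    `cpt` is the principal operative procedure code and `icd9` is the
    postoperative diagnosis code, both as strings (hypothetical layout).
    """
    if icd9 not in APPENDICITIS_ICD9:
        return None  # not a postoperative diagnosis of appendicitis
    if cpt in OPEN_CPT:
        group = "open"
    elif cpt in LAPAROSCOPIC_CPT:
        group = "laparoscopic"
    else:
        return None  # any other procedure, eg, exploratory laparotomy
    # Perforation is flagged by either the procedure code or the diagnosis code.
    perforated = cpt == "44960" or icd9 in PERFORATION_ICD9
    return group, perforated
```

For example, `classify_case("44979", "540.1")` places the case in the laparoscopic group with evidence of perforation, whereas a case with any other principal procedure code is dropped from the aggregate cohort.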
Data were compared using univariate and multivariate statistical measures. Continuous variables were analyzed using an unpaired 2-tailed t test for data with a normal distribution. Continuous data exhibiting a skewed distribution, such as length of stay (LOS), were analyzed using the Wilcoxon rank sum test. Discrete variables were compared using a χ2 analysis. Multivariate analysis was performed using multiple linear regression or logistic regression and adjustment for significant covariates to generate risk-adjusted outcomes. All covariates with a P < .20 based on univariate analysis were entered into the forward stepwise regression model. The significance-level criterion for entry into the regression model was 0.1 and for removal was 0.2. Before multivariate analysis, continuous right-skewed data were natural log transformed, the regression analysis was then conducted, and the coefficient from the regression model was exponentiated to determine the proportionate increase in LOS associated with the selected treatment group. A previously created complications classification system divided complications into 2 groups, minor and major.11,12 A list of the complications classification system is in Table 1. Major complications were those considered significant enough to result in increases to the LOS or a need for substantial additional treatment interventions. All statistical analysis was performed using a software program (STATA SE 9.2; StataCorp LP, College Station, Texas). Results are presented as mean (SD) unless otherwise noted. Statistical significance was defined as P < .05.
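The log transformation step for LOS can be illustrated with a deliberately simplified sketch. It assumes a model with only one binary covariate (treatment group), in which case the ordinary least squares coefficient equals the difference in mean log LOS between groups. The LOS values are hypothetical and are not study data; the models in this study also included other covariates.

```python
import math


def proportionate_change(los_reference, los_comparison):
    """Exponentiated coefficient from regressing log(LOS) on a group indicator.

    With a single binary covariate, the least squares slope is the difference
    in mean log(LOS), so exp(slope) is the ratio of geometric mean LOS
    (comparison group relative to reference group).
    """
    mean_log_ref = sum(math.log(x) for x in los_reference) / len(los_reference)
    mean_log_cmp = sum(math.log(x) for x in los_comparison) / len(los_comparison)
    beta = mean_log_cmp - mean_log_ref
    return math.exp(beta)


# Hypothetical right-skewed LOS values in days (illustration only).
laparoscopic_los = [1.0, 1.0, 2.0, 4.0]
open_los = [2.0, 2.0, 4.0, 8.0]

ratio = proportionate_change(laparoscopic_los, open_los)
print(round(ratio, 2))  # 2.0, ie, LOS is twice as long in the comparison group
```

A ratio of 1.0 would indicate no proportionate difference in LOS between the groups; values above 1.0 indicate a longer stay in the comparison group.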
Propensity scores were generated for surgical technique using nonparsimonious logistic regression and adjusting for important known baseline covariates, including evidence of perforation or rupture. All the covariates were entered into a logistic regression analysis, and a maximum-likelihood probit model was fitted based on these covariates as predictors of surgical technique. The probit coefficients for these predictors of surgical technique were used to calculate a propensity score of 0 to 1 for each patient. Based on the calculated propensity scores, 2 evenly matched groups were formed regarding surgical technique using a matching algorithm with the common caliper set at 0.005. The caliper is the maximum difference in propensity score that is acceptable for a match. Matching on propensity scores, as opposed to stratification or regression adjustment, was chosen because it is the closest approximation to a randomized clinical trial and provides the greatest balance between treated and untreated cases.7 This data set is referred to as the “propensity-matched cohort.” The matched cohort was evaluated for balance between the 2 surgical technique groups regarding each of the potential confounding factors. Differences in outcomes such as mortality, hospital LOS, and complications were explored using univariate tests. Additional risk adjustment was performed using multivariate analysis to adjust for covariates that remained unbalanced between the 2 groups after propensity score matching.
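The caliper matching step can be sketched as follows. The text does not specify the matching algorithm beyond the 1:1 ratio and the caliper of 0.005, so this sketch assumes greedy nearest-neighbor matching without replacement; the function name, patient identifiers, and scores are hypothetical.

```python
def caliper_match(open_scores, lap_scores, caliper=0.005):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Each argument maps a patient id to a propensity score. Returns a list of
    (open_id, laparoscopic_id) pairs whose scores differ by at most `caliper`.
    Open cases with no laparoscopic case inside the caliper stay unmatched.
    """
    available = dict(lap_scores)
    pairs = []
    for open_id, score in sorted(open_scores.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining laparoscopic case by absolute score difference.
        lap_id = min(available, key=lambda k: abs(available[k] - score))
        if abs(available[lap_id] - score) <= caliper:
            pairs.append((open_id, lap_id))
            del available[lap_id]  # each case is used at most once
    return pairs


# Hypothetical propensity scores (illustration only).
open_scores = {"o1": 0.300, "o2": 0.620, "o3": 0.900}
lap_scores = {"l1": 0.303, "l2": 0.616, "l3": 0.700, "l4": 0.310}

pairs = caliper_match(open_scores, lap_scores)
print(pairs)  # [('o1', 'l1'), ('o2', 'l2')]; 'o3' has no match within 0.005
```

A narrower caliper gives closer matches but leaves more cases unmatched, which is the trade-off behind reporting the number of open cases that could not be paired.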
The ACS-NSQIP and the hospitals participating in the ACS-NSQIP are the source of the data used herein; they have not verified and are not responsible for the statistical validity of the data analysis or the conclusions derived by the authors. Approval for this study was obtained from the University of Michigan Health System institutional review board.
A total of 21 475 patients underwent appendectomy during the study, with 28% of these patients undergoing open appendectomy and 72% a laparoscopic approach. The characteristics of patients in the laparoscopic and open groups showed many important differences (Table 2). Notably, there were some key differences between the laparoscopic and open groups that could act as confounding variables. For example, patients in the open appendectomy group were more likely than those in the laparoscopic group to have evidence of a ruptured appendix (33% vs 14%). In summary, of the 41 ACS-NSQIP preoperative risk factors, 28 were found to have differences present between the 2 groups on univariate analysis (not all data shown).
In unadjusted analysis, there were large differences in outcomes between the laparoscopic and open groups (Table 3). All complications for which a difference was found favored the laparoscopic technique. In the aggregate cohort, the rate of mortality, LOS, return to the operating room, incisional surgical site infection (SSI), sepsis with or without septic shock, and minor and major complications were significantly higher on univariate analysis for patients who underwent open surgery (Table 4).
After adjustment for confounding variables using multivariate logistic regression, the differences in outcomes changed substantially (Table 4). Operative mortality and return to the operating room were no longer different between the laparoscopic and open approaches. Patients in the laparoscopic group had a reduction in their hospital LOS and rates of incisional SSI, sepsis, and major and minor complications compared with the open operation group. However, patients who underwent laparoscopic appendectomy experienced a significantly higher rate of organ/space SSI or intra-abdominal abscess (odds ratio [OR], 1.9; 95% confidence interval [CI], 1.5-2.3). The rate of organ/space SSI was substantially higher in the multivariate-adjusted analysis compared with the univariate result for the aggregate cohort (OR, 1.9 vs 1.0) (Figure, B).
In the propensity-matched cohort, there were 5666 patients in each of the laparoscopic and open groups (Table 2). Most of the differences in patient characteristics were no longer present. There was still a detectable difference in the rate of appendiceal perforation; however, in practical terms, these rates were nearly identical in the open and laparoscopic groups (29.0% vs 31.0%) compared with the large difference (33.0% vs 14.0%) in the aggregate cohort. The number of preoperative risk factors that differed between the groups declined from 28 to just 3 of 41.
In the propensity-matched cohort, there were fewer differences in univariate unadjusted outcomes between the 2 groups (Table 3). There were no differences in the rate of mortality, return to the operating room, or major complications. Patients who underwent open appendectomy had higher rates of incisional SSI, wound disruption, and sepsis. However, the incidence of organ/space SSI was much higher in the laparoscopic approach compared with the open approach in the propensity-matched cohort (OR, 1.7; 95% CI, 1.3-2.2). The Figure illustrates selected outcomes between the aggregate cohort without and with adjusted analysis compared with the propensity-matched cohort. The Figure also includes meta-analysis data from a 2004 Cochrane review of randomized trials for appendectomy.13 The meta-analysis results for incisional and organ/space SSI were similar to the present results for the aggregate cohort with multivariate adjustment and the propensity-matched cohort analysis.
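For readers less familiar with how an unadjusted OR and its 95% CI arise from a matched cohort, the following sketch computes both from a 2 x 2 table. The text does not state how the intervals were calculated, so the log-based (Woolf) interval is an assumption, and the counts are hypothetical rather than study data.

```python
import math


def odds_ratio_ci(events_a, nonevents_a, events_b, nonevents_b, z=1.96):
    """Unadjusted odds ratio (group A vs group B) with a Woolf-type 95% CI."""
    odds_ratio = (events_a * nonevents_b) / (nonevents_a * events_b)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(
        1 / events_a + 1 / nonevents_a + 1 / events_b + 1 / nonevents_b
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 30 of 100 events in group A vs 20 of 100 in group B.
odds_ratio, lower, upper = odds_ratio_ci(30, 70, 20, 80)
print(round(odds_ratio, 2))  # 1.71
```

An interval that excludes 1.0 corresponds to a statistically significant difference at the .05 level; in this hypothetical table the interval includes 1.0 because the counts are small.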
To further explore the differences between these 2 surgical approaches, we performed a risk-adjusted analysis of the propensity cohort. Variables with differences on univariate testing were entered into the multivariate model. Risk adjustment did not change any of the previously positive results found on univariate analysis (data not shown). Patients who underwent an open operation had an increase in their median hospital LOS by a half day attributable to the surgical technique.
Using appendicitis as a case study, we present an introduction to the use of propensity scores to compare 2 surgical treatments. We chose to evaluate the operative treatment for appendicitis for several reasons. First, this is a clinical scenario familiar to most surgeons, which makes it an ideal case study to introduce a new statistical technique, such as propensity scores. Second, the 2 approaches to operative treatment (laparoscopic vs open) are often applied to different groups of patients. For example, the open approach is more often applied in patients with evidence of appendiceal perforation or septicemia. Because these factors are also associated with outcomes, they can potentially act as confounding variables in the comparison of the 2 approaches. An unadjusted comparison of laparoscopic vs open appendectomy would, therefore, yield misleading results. Indeed, in the present analysis, we found differences in most outcome variables in the unadjusted comparisons.
There are several approaches to dealing with potentially confounding variables. Multivariate regression is the technique most often used to adjust for the presence of confounding variables. In this study, there were dramatic changes in the results when we applied multivariate adjustment. Many of the differences in outcomes between laparoscopic and open appendectomy were absent after this adjustment. The results of the multivariate adjustment are similar to the results of a meta-analysis of randomized trials.13 Specifically, open appendectomy had more wound infections, but laparoscopy had more intra-abdominal abscesses. For those using observational studies to evaluate the comparative effectiveness of treatments, these results are encouraging.
We next evaluated the use of propensity scores to adjust for these confounding variables. To use this technique, a propensity score is first assigned to each patient. This score is the likelihood that the patient receives the treatment based on all observed characteristics. For this study, the score was calculated by performing a logistic regression model in which laparoscopy was the dependent variable and all other patient characteristics were included as independent variables. The predicted probability of laparoscopy—the propensity score—can then be estimated for each patient from this model.14 The propensity score can then be used as an additional covariate in the multivariate adjustment or can be used to create matched pairs of patients. For this study, we created a matched cohort, which is meant to simulate a randomized trial. These cohorts are well matched on all observed variables. However, there is one key difference between propensity scores and a randomized trial. In randomized trials, patients are also matched on unobserved variables. If any important unmeasured confounding variables are not captured in the data (ie, they are unobserved), the propensity score will yield a biased estimate of the treatment effect. In the present study, the propensity score analysis yielded results similar to those of a meta-analysis13 of randomized clinical trials, implying that the important confounding variables are present in this data source, which is not surprising given the amount of detailed data collected in the ACS-NSQIP data set.
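The estimation step described above, in which the propensity score is the predicted probability of laparoscopy from a logistic regression model, can be sketched with a toy example. This is an illustration under stated assumptions, not the model used in the study: it uses a single covariate (evidence of perforation), fits the model by simple gradient ascent rather than with a statistical package, and the patient data are hypothetical.

```python
import math


def fit_logistic(X, y, learning_rate=0.5, n_iter=5000):
    """Fit logistic regression by gradient ascent on the mean log-likelihood.

    X is a list of covariate lists, y a list of 0/1 treatment indicators.
    Returns the weights, with the intercept stored first.
    """
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        gradient = [0.0] * len(weights)
        for covariates, treated in zip(X, y):
            error = treated - propensity(weights, covariates)
            gradient[0] += error
            for j, value in enumerate(covariates):
                gradient[j + 1] += error * value
        weights = [w + learning_rate * g / len(X) for w, g in zip(weights, gradient)]
    return weights


def propensity(weights, covariates):
    """Predicted probability of receiving the treatment (laparoscopy)."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], covariates))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical data: covariate is perforation (1 = yes); outcome is laparoscopy.
# Without perforation, 6 of 8 had laparoscopy; with perforation, 1 of 4 did.
X = [[0]] * 8 + [[1]] * 4
y = [1, 1, 1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]

weights = fit_logistic(X, y)
print(round(propensity(weights, [0]), 2))  # 0.75
print(round(propensity(weights, [1]), 2))  # 0.25
```

With a single binary covariate, the fitted scores simply reproduce the observed proportion treated in each stratum. With many covariates, the same model condenses all of them into one score per patient, which is what makes matching feasible.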
Matching cases on propensity scores allows for balancing of measured variables between treated and untreated patients and reduction of bias. Greater balance is typically achieved after matching directly on the propensity score rather than stratifying on quintiles of the propensity score.7 Different methods exist for choosing which covariates to include in a propensity score model: inclusion of only true confounders, inclusion of all variables associated with the outcome, inclusion of all measured variables, and inclusion of only variables associated with treatment selection. Inclusion of only true confounders can result in up to 24% more matched pairs compared with models that include all variables or potentially weak confounders.7 The all-variables model resulted in the nonmatching of 364 of 6030 open appendectomy cases (6.0%) with an equivalent laparoscopic case. Therefore, we did not pursue a more parsimonious propensity score model.
One key finding of this study was that the propensity score analysis was nearly identical to the analysis using multivariate adjustment. Those who advocate the use of propensity scores argue that they are superior to multivariate adjustment at addressing confounding. However, there is little evidence that propensity scores are actually better in this regard. Similar to the present study, many analyses show little difference between multivariate adjustment and propensity score adjustment. For example, Stukel and colleagues8 compared different approaches for dealing with confounding in observational studies (multivariate adjustment and propensity scores). They evaluated the impact of cardiac catheterization on long-term acute myocardial infarction mortality and found that multivariate adjustment and propensity scores yielded similar findings, with a moderate reduction in mortality with cardiac catheterization.
However, there is one scenario in which propensity scores are particularly useful: when the treatment is common but the outcome of interest is rare.15,16 When studying rare outcomes, it is sometimes not possible to use multivariate adjustment. To construct a “stable” regression model, there must be at least 10 events in the study population for each independent variable (ie, potential confounder). For example, in the present study, unplanned intubation occurs in 0.3% of the study population (67 total events). In the unadjusted analysis, it seems that open appendectomy is associated with significantly higher rates of unplanned intubation (0.6% vs 0.2%, P < .001). Because there are so few events and so many potential covariates to include in the multivariate adjustment, it is impossible to create a stable logistic regression model for this outcome.
In contrast, propensity scores can still be used to perform a risk-adjusted comparison for this situation. The logistic regression model used to create propensity scores included the surgical approach (laparoscopy vs open) as the outcome, which provides more than 6000 events. The propensity score can then be used to create matched cohorts, and rates of unplanned intubation can then be compared. In contradistinction to the unadjusted results, the rates of unplanned intubation were identical in the propensity-matched cohorts (0.4% vs 0.4%, P = .50), highlighting the importance of this approach for dealing with confounding with rare outcome variables.
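The arithmetic behind this argument can be made explicit with the rule of thumb of at least 10 events per independent variable. The function name below is hypothetical; the event counts (67 unplanned intubations, more than 6000 open procedures) and the 41 preoperative risk factors are taken from the text.

```python
def max_covariates(n_events, events_per_variable=10):
    """Largest number of covariates supported under the events-per-variable rule."""
    return n_events // events_per_variable


N_RISK_FACTORS = 41  # ACS-NSQIP preoperative risk factors considered

# Outcome model for unplanned intubation: 67 events.
print(max_covariates(67))  # 6, far fewer than the 41 candidate risk factors

# Propensity model for surgical approach: more than 6000 events.
print(max_covariates(6000))  # 600, enough to include all 41 risk factors
```

The rare outcome supports only a handful of covariates, whereas the common treatment supports a nonparsimonious model, which is why the adjustment is moved into the propensity score in this situation.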
This study provides an introduction to propensity score analysis using the operative treatment of appendicitis as a case study. We found that a multivariate adjustment provided the same results as did the analysis from a propensity-matched cohort. Specifically, we found that the laparoscopic approach has a higher rate of intra-abdominal abscess and that the open approach has a higher rate of incisional SSI. These results are consistent with those of a published meta-analysis13 of randomized clinical trials. This finding suggests that observational studies could be a valid study design for comparative effectiveness research, especially when randomized clinical trials are not feasible or when the goal is to understand the real-world impact of a treatment. Surgeons should use this case study to further understand the use of propensity scores for risk adjustment or cohort matching as increasing focus is placed on observational studies in the context of comparative effectiveness research.
Correspondence: Mark R. Hemmila, MD, Department of Surgery, University of Michigan Medical School, 1B407 University Hospital, 1500 E Medical Center Dr, SPC 5033, Ann Arbor, MI 48109-5033 (email@example.com).
Accepted for Publication: June 14, 2010.
Author Contributions: Study concept and design: Hemmila, Arbabi, and Dimick. Acquisition of data: Hemmila and Dimick. Analysis and interpretation of data: Hemmila, Birkmeyer, Osborne, Wahl, and Dimick. Drafting of the manuscript: Hemmila, Birkmeyer, Osborne, and Dimick. Critical revision of the manuscript for important intellectual content: Hemmila, Birkmeyer, Arbabi, Osborne, Wahl, and Dimick. Statistical analysis: Hemmila, Birkmeyer, and Dimick. Study supervision: Hemmila.
Financial Disclosure: None reported.
Funding/Support: This study was supported by grant K08-GM078610 from the National Institutes of Health with joint support from the American College of Surgeons and the American Association for the Surgery of Trauma (Dr Hemmila).
Previous Presentation: This paper was presented at the American College of Surgeons 95th Clinical Congress; October 12, 2009; Chicago, Illinois; and is published after peer review and revision.