Evaluation of Clinical and Economic Outcomes Following Implementation of a Medicare Pay-for-Performance Program for Surgical Procedures

Key Points
Question: What is the association between the Centers for Medicare & Medicaid Services Hospital-Acquired Conditions Present on Admission pay-for-performance program and surgical care quality and costs?
Findings: In this cross-sectional study, the Hospital-Acquired Conditions Present on Admission program was associated with a decreased incidence of surgical site infection (0.3 percentage points) in the targeted procedures and with reductions in length of stay (0.5 days) and hospital costs (8.1%). Deep vein thrombosis and in-hospital mortality did not improve.
Meaning: The findings of this study suggest that the pay-for-performance program was associated with improvement on several dimensions of surgical care, including small reductions in surgical site infection and length of stay and moderate reductions in hospital costs.


Selection of control procedures
Following the literature, 1 we selected nontargeted surgical procedures for a control group based on the following criteria: 1) the procedures have similar complication rates on average; 2) the procedures are not subject to spillover effects from the intervention procedures; 3) trends in outcomes prior to the policy implementation are parallel between the intervention and control procedures, which is a key requirement of our statistical approach; and 4) procedural volume and infection rates are sufficiently large to ensure statistical robustness and to account for potential unobserved motivation of providers and hospitals to improve quality of care. To this end, we selected laparoscopic appendectomy and laparoscopic cholecystectomy as control procedures.
We believe that laparoscopic appendectomy and laparoscopic cholecystectomy make the most suitable control group for the following reasons: 1) Laparoscopic appendectomy and laparoscopic cholecystectomy have complication rates similar to those of the intervention procedures (Criterion 1). 2) Laparoscopic appendectomy and laparoscopic cholecystectomy belong to a different surgical specialty (i.e., general surgery) than the intervention procedures and are thus unlikely to be affected by care delivered for the intervention procedures, which makes them less subject to spillovers (Criterion 2). 3) Our analysis shows parallel trends (Criterion 3) for the majority of outcomes, with the exception of length of stay. 4) These procedures are high in volume and yield a sufficiently large sample of complications (Criterion 4).
One might question whether it is reasonable to compare infection rates from clean procedures with those from clean-contaminated procedures. The difference-in-differences method eliminates unobserved time-invariant biases by comparing changes over time between two groups (details of the difference-in-differences method are in Supplement eStudy Methods page 6). Assuming that bacterial load remains consistent in each group over time, the effects of bacterial load are eliminated by subtracting the between-group difference during the pre-policy period from the between-group difference during the post-policy period. Nevertheless, to address concerns related to wound class incomparability between groups, we ran sensitivity analyses using a different control procedure (carotid endarterectomy) and synthetic control methods.
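The cancellation argument above can be illustrated with a small numeric sketch. All numbers below are hypothetical and chosen only for illustration; they are not estimates from this study. The function name did_2x2 is likewise an illustrative choice.

```python
def did_2x2(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Difference-in-differences computed from four group means."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)


# Hypothetical infection rates in percent (illustrative values only).
baseline_gap = 1.5     # time-invariant excess risk in the control group (e.g., wound class)
secular_change = -0.2  # change over time shared by both groups
policy_effect = -0.3   # change that occurs only in the intervention group

treat_pre = 1.0
ctrl_pre = treat_pre + baseline_gap
treat_post = treat_pre + secular_change + policy_effect
ctrl_post = ctrl_pre + secular_change

# The time-invariant gap and the shared secular change both cancel,
# leaving only the policy effect.
estimate = did_2x2(treat_pre, treat_post, ctrl_pre, ctrl_post)
print(round(estimate, 6))
```

Changing baseline_gap to any other constant leaves the estimate unchanged, which is the sense in which a stable difference in bacterial load between wound classes does not bias the comparison.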
Control procedure selection criteria checklist
1. In this cross-sectional study, we used a difference-in-differences strategy, which allowed us to compare changes in outcomes between intervention and control groups whose characteristics might not have been equivalent. We used laparoscopic appendectomy and laparoscopic cholecystectomy as control procedures. Control procedures have different bacterial compositions than intervention procedures do. Thus, the surgical site infection rate is expected to be worse. In fact, infection rates are higher in the control procedures (Criterion 1). If this were an experimental study with a random assignment of the policy, these two groups might not be the best comparisons for each other due to the procedures' unequal characteristics.
5. We limited the control procedure pool to the top 100 high-volume procedures with sufficiently large complication rates to avoid unstable estimates due to small sample size and to account for unobserved motivational changes (Criterion 4).
6. We consulted existing literature in the selection and evaluation of candidate control procedures. We evaluated four groups of procedures as candidate control procedures: total shoulder replacement, non-Medicare patients who underwent the same procedures, carotid endarterectomy, and laparoscopic appendectomy and laparoscopic cholecystectomy (excluding open appendectomy and open cholecystectomy).
(1) Total shoulder replacement: This was a control procedure used in Kwong et al. (2017). 2
(2) Non-Medicare patients who underwent the same procedures: This was a control group used in Kwong et al. (2017) 2 and Gidwani and Bhattacharya (2015). 3
(3) Carotid endarterectomy: This is another high-volume procedure that is in the clean wound class.
(4) Laparoscopic appendectomy and cholecystectomy, excluding open appendectomy and cholecystectomy: These are control procedures that have similar complication rates (eTable 2).
Group (1): Total Shoulder Replacement in Kwong et al. (2017)
We considered total shoulder replacement as a control group procedure because of its similarity in terms of service line, its shared wound class and similar levels of microbiome at the surgical site, its similar complication rates (Criterion 1), its high volume (Criterion 4), and its use as a control group in a study by Kwong et al. (2017). 2 We hypothesized that procedures in similar parts of the body could be affected by spillover effects from the policy (Criterion 2), and we tested whether the control group was associated with spillover effects in line with the study by Ryan et al. 4 After careful evaluation, we opted not to use this group because it is subject to spillover (Criterion 2) and violated the parallel trend assumption (Criterion 3). For example, one of the targeted procedures is arthrodesis of the shoulder, and a surgeon who performs arthrodesis of the shoulder (the intervention procedure) is also likely to perform total shoulder replacement (the control procedure). Aspects of care such as the surgeon's motivation and the care plan might therefore influence clinical care in the procedures unexposed to the policy. We statistically tested the parallel trend assumption and found that it was violated; we also found that this group was associated with spillover effects. We concluded that total shoulder replacement is not appropriate for use as a control group procedure.
Group (2): Non-Medicare Patients Who Underwent the Same Procedures in Kwong et al. (2017) 2 and Gidwani and Bhattacharya (2015) 3
We have concerns about using non-Medicare patients as a control group because a patient's care plan is not payer specific and because other payers implemented similar programs following the CMS. Congruent with another study, 5 our test also showed evidence of spillover from the intervention group to non-Medicare patients who underwent the same procedures. We opted not to use this group due to these spillover concerns (Criterion 2).

Group (3): Carotid Endarterectomy
We considered this group because it is in the same wound class, it is high in volume (Criterion 4), and it involves a procedure from a different service line. Thus, it is less subject to spillover (Criterion 2). We conducted the same main analyses using this group as a control and obtained consistent results. We opted not to replace the control group in the manuscript with this group because it violates more statistical assumptions than our current control group: the parallel trend assumption is not met (Criterion 3), although the group was not associated with spillover effects.
Group (4): Laparoscopic Appendectomy and Cholecystectomy (Excluding Open Appendectomy and Cholecystectomy)
We selected this group as control procedures because these procedures (1) have complication rates similar to those of the intervention procedures (Criterion 1), (2) belong to a different surgical specialty (i.e., general surgery) than the intervention procedures and are thus unlikely to be affected by care delivered for the intervention procedures, which makes them less subject to spillovers (Criterion 2), and (3) are high in volume and yield a sufficiently large sample of complications (Criterion 4). We excluded open appendectomy and cholecystectomy procedures in response to the comments and concerns about comparing procedures with different complication rates (Criterion 1).
7. To account for differences in procedure effects in the outcomes, we included the procedure indicator variables (procedure fixed effects) in the model estimation.
8. As we reported, we were limited in our ability to assess the robustness of our estimates because trends in the two groups were inconsistent before the policy implementation (Criterion 3). To address this issue, we augmented the standard difference-in-differences methodology using four-group propensity score matching. This minimized the differences between the intervention and control groups (the standardized difference in means was < 0.25 in all patient demographic variables and hospital characteristics). To assess the robustness of our results, we constructed synthetic control procedures matched on the levels of outcomes and covariates, which yielded an excellent pre-policy fit in the outcomes. However, we did not present this as part of the main analysis because the synthetic control method uses hospital-level difference-in-differences estimation, whereas our main analysis was at the patient level. Patient-level specification has an advantage in that it controls for individual- and hospital-level heterogeneity across the intervention and control procedures. 4 The table below

Difference-in-differences models
Difference-in-differences is an econometric method for overcoming issues with selection on unobservables. Difference-in-differences has been widely used in the evaluation of health care policies. 4 Policy programs rarely select individuals at random. Instead, such programs purposefully select the target procedures. 6 Thus, the target procedures may have different characteristics compared with the non-target procedures. The basic idea of difference-in-differences is that outcomes are measured for the intervention procedures (the target procedures) and the control procedures (the procedures not exposed to the intervention) before and after the intervention. The difference between the two groups is calculated in the pre-intervention period and in the post-intervention period, and the change in that difference is defined as the difference-in-differences. 4 Difference-in-differences removes the biases (1) from permanent differences between the intervention and control procedures due to omitted variables (i.e., unobservables) and (2) from differences that could have resulted from trends, 7 provided that the trends in outcomes for the intervention and control procedures would have been similar over the two time periods in the absence of the intervention. 7 Whether this assumption is violated can be checked by testing the significance of the interaction term between the linear time trend and the intervention procedures during the pre-P4P policy period. 4 Applying difference-in-differences in the context of this study, we specified the difference-in-differences model at the patient level as follows:

\[
Y_{ijkt} = \beta_0 + \beta_1\,\mathrm{P4P}_k + \beta_2\,\mathrm{Post}_t + \beta_3\,(\mathrm{P4P}_k \times \mathrm{Post}_t) + \gamma P_{ijkt} + \delta Z_{jt} + H_j + I_k + \mathrm{Year}_t + \varepsilon_{ijkt}
\]

where Y_{ijkt} is the outcome for patient i in hospital j receiving procedure k at time t. P4P is equal to 1 if the procedure is targeted by the CMS P4P, Post is equal to 1 for periods after the third quarter of 2008, P is a vector of patient characteristics, Z is a vector of hospital characteristics, and H, I, and Year are vectors of hospital, procedure, and year fixed effects, respectively.
The coefficient β_3 on the interaction between P4P and Post is the difference-in-differences estimator. If β_3 has a negative sign, this would indicate that the CMS P4P has improved surgical outcomes (decreased incidence of complications, mortality, length of stay, and hospital costs).
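To make the role of the interaction coefficient concrete, the following minimal sketch fits only the two indicators and their interaction by ordinary least squares on made-up data. It is an illustration under stated assumptions, not the study's specification: patient and hospital covariates and all fixed effects are omitted, the outcome values are hypothetical lengths of stay in days, and the helper ols is a hand-written solver introduced here only to keep the example dependency-free.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical records: (p4p, post, length of stay in days).
records = [
    (0, 0, 4.0), (0, 0, 5.0),  # control, pre-policy
    (0, 1, 4.5), (0, 1, 5.5),  # control, post-policy
    (1, 0, 6.0), (1, 0, 7.0),  # intervention, pre-policy
    (1, 1, 5.5), (1, 1, 6.5),  # intervention, post-policy
]

# Design matrix columns: intercept, P4P, Post, P4P x Post.
X = [[1.0, p4p, post, p4p * post] for p4p, post, _ in records]
y = [outcome for _, _, outcome in records]

beta = ols(X, y)
did_estimate = beta[3]  # coefficient on the P4P x Post interaction
print([round(b, 6) for b in beta])
```

In this saturated two-by-two setting the interaction coefficient equals the difference of the pre-to-post changes in the four group means, here (6.0 - 6.5) - (5.0 - 4.5) = -1.0 days, so a negative value corresponds to improvement in the intervention procedures relative to the controls.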

Difference-in-difference-in-differences models
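To examine whether the policy had different effects in certain subgroups, we specified our difference-in-difference-in-differences model by entering a triple interaction term (Treat * Post * Complication) and estimated the model using the following equation:

\[
\begin{aligned}
Y_{ijkt} = {} & \beta_0 + \beta_1\,\mathrm{Treat}_k + \beta_2\,\mathrm{Post}_t + \beta_3\,(\mathrm{Treat}_k \times \mathrm{Post}_t) + \beta_4\,\mathrm{Complication}_{ijkt} \\
& + \beta_5\,(\mathrm{Treat}_k \times \mathrm{Complication}_{ijkt}) + \beta_6\,(\mathrm{Post}_t \times \mathrm{Complication}_{ijkt}) \\
& + \beta_7\,(\mathrm{Treat}_k \times \mathrm{Post}_t \times \mathrm{Complication}_{ijkt}) + \gamma P_{ijkt} + \delta Z_{jt} + H_j + I_k + \mathrm{Year}_t + \varepsilon_{ijkt}
\end{aligned}
\]

Here Treat is the indicator for procedures targeted by the CMS P4P, Complication is equal to 1 for patients with a surgical complication, and the remaining terms are defined as in the difference-in-differences model. In this equation, coefficient β_7 is the difference-in-difference-in-differences estimator for patients with surgical complications. The following table shows how the difference-in-difference-in-differences estimator can be calculated.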
Complication Before P4P After P4P Difference
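The calculation that the table header above refers to can be written out from group means. The notation is an assumption introduced here for illustration: \bar{Y} denotes a mean outcome, the superscript T or C denotes intervention or control procedures, and the subscripts denote the period and whether the patient had a surgical complication (1) or not (0).

```latex
\begin{aligned}
\Delta^{T}_{1} &= \bar{Y}^{T}_{\text{post},1} - \bar{Y}^{T}_{\text{pre},1}, &
\Delta^{C}_{1} &= \bar{Y}^{C}_{\text{post},1} - \bar{Y}^{C}_{\text{pre},1}, \\
\Delta^{T}_{0} &= \bar{Y}^{T}_{\text{post},0} - \bar{Y}^{T}_{\text{pre},0}, &
\Delta^{C}_{0} &= \bar{Y}^{C}_{\text{post},0} - \bar{Y}^{C}_{\text{pre},0}, \\[4pt]
\text{DDD} &= \left(\Delta^{T}_{1} - \Delta^{C}_{1}\right) - \left(\Delta^{T}_{0} - \Delta^{C}_{0}\right).
\end{aligned}
```

Each bracketed term is a difference-in-differences within one complication stratum. In a model containing only the indicators and their interactions, the final quantity corresponds to the triple-interaction coefficient β 7; with covariates and fixed effects the correspondence holds only after adjustment.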

Using propensity score in difference-in-differences
We used propensity score weighting, following a study that suggested a specific matching method for the difference-in-differences design, 8 to address the issue that the parallel trend assumption was violated for all outcomes except deep vein thrombosis. We applied propensity score weights to each of the following four groups (pre-policy intervention procedures, post-policy intervention procedures, pre-policy control procedures, and post-policy control procedures) and then performed a difference-in-differences analysis. The following steps describe how we integrated propensity scores into a difference-in-differences model.
First, we defined four groups:
• Group 1 - Intervention surgical procedures in the pre-policy period
• Group 2 - Intervention surgical procedures in the post-policy period
• Group 3 - Control surgical procedures in the pre-policy period
• Group 4 - Control surgical procedures in the post-policy period
Second, we estimated the propensity scores by modeling group membership as a function of patient and hospital characteristics using a multinomial logistic regression. As a result, each observation in our sample has four propensity scores: the probabilities of being in Groups 1, 2, 3, and 4.
Third, we created the weight for each individual using the following formula:

\[
w_i = \frac{\text{propensity score of being in Group 1}}{\text{propensity score of being in the group in which the individual actually was}}
\]

By doing so, observations that were actually in Group 1 have weights equal to 1, and other observations have weights that represent their similarity to Group 1.
Fourth, we applied the weights and estimated a difference-in-differences model.
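The third step above can be sketched in a few lines. The propensity scores below are hypothetical values standing in for the output of the multinomial logistic regression, and the function name group1_weight is an illustrative choice rather than anything defined in the study.

```python
def group1_weight(scores, actual_group):
    """Weight = P(Group 1) / P(group the observation actually belongs to).

    scores maps each group number (1-4) to the estimated probability
    that the observation belongs to that group.
    """
    return scores[1] / scores[actual_group]


# Hypothetical observations: (actual group, four estimated propensity scores).
observations = [
    (1, {1: 0.40, 2: 0.30, 3: 0.20, 4: 0.10}),
    (3, {1: 0.20, 2: 0.30, 3: 0.40, 4: 0.10}),
    (4, {1: 0.30, 2: 0.20, 3: 0.20, 4: 0.30}),
]

weights = [group1_weight(scores, group) for group, scores in observations]
print(weights)
```

An observation that is actually in Group 1 always receives a weight of 1, while an observation in another group is weighted up or down according to how likely its characteristics would have been in Group 1. These weights would then be supplied to the weighted difference-in-differences regression in the fourth step.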
We chose this matching method because 1) applying matching in difference-in-differences models raises a particular concern (there are two elements to consider in a difference-in-differences model: the intervention status and time), 2) the method is appropriate for cross-sectional data, and 3) the method leaves fewer covariates with a standardized difference in means greater than 0.25 (which represents a substantial difference). 8,9 We tested the balance of covariates using the standardized mean difference, which is defined as the difference in means divided by the standard deviation. 8 As shown in eTable 2, there were substantial differences in covariates before applying the propensity score weights, especially between the intervention procedures and the control procedures. However, the propensity score weighting reduced the standardized difference in means to less than 0.25 in all the patient demographic variables and hospital characteristics (Table 1). To examine the sensitivity of our results from the propensity score weighted difference-in-differences analyses, we also performed difference-in-differences analyses with a matched sample using one-to-one matching without replacement, calipers of 0.02 (calculated as 0.25 × the standard deviation of the propensity score), 10 and enforcing common support. The results were consistent, with a slightly larger effect (the results are not presented).
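The balance check described above can be sketched as follows. The ages and weights are hypothetical, and one detail is an assumption made here because the text does not specify it: the denominator is taken to be the sample standard deviation of the reference group (Group 1).

```python
from statistics import mean, stdev


def weighted_mean(values, weights):
    """Mean of values with observation-level weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def standardized_difference(ref_values, cmp_values, cmp_weights=None):
    """Difference in means divided by the reference group's standard deviation."""
    if cmp_weights is None:
        cmp_weights = [1.0] * len(cmp_values)
    gap = mean(ref_values) - weighted_mean(cmp_values, cmp_weights)
    return gap / stdev(ref_values)


# Hypothetical patient ages.
group1_age = [70, 72, 74, 76, 78]      # reference group (Group 1)
group3_age = [60, 66, 72, 78]          # comparison group before weighting
group3_weights = [0.2, 0.5, 1.4, 1.9]  # hypothetical propensity score weights

smd_before = standardized_difference(group1_age, group3_age)
smd_after = standardized_difference(group1_age, group3_age, group3_weights)
print(round(smd_before, 3), round(smd_after, 3))
```

With these made-up values the unweighted difference is far above the 0.25 threshold and the weighted difference falls below it, which is the pattern the balance tables are meant to document for each covariate.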

Sensitivity analyses
We performed a series of sensitivity analyses to assess the robustness of results across model specifications. First, we used logistic regression to model surgical complications and mortality (eTable 5). Second, we used a one-part generalized linear model (GLM) with gamma distribution and log link to test the robustness of results from the hospital cost models. A Box-Cox test and modified Park tests were performed to confirm the appropriate link function and distribution family, respectively, for the one-part GLM. 11,12 Third, we investigated the association between the HAC-POA policy and the incidence of SSI and DVT using a synthetic control method and found consistent effects of the policy (the details of synthetic control methods are provided in the next section; eFigures 1-4). Fourth, we performed placebo difference-in-differences models by repeating the main analyses with a binary placebo P4P indicator that treated the HAC-POA policy as if it had been implemented a year earlier (eTable 6). We also conducted another placebo test by aggregating two years (2006 and 2007) and using those two years together as a placebo P4P indicator (the results are not presented). If the placebo P4P policy variables were associated with improvement in surgical care outcomes during the placebo years, it would indicate that our results might be due to secular changes in the outcomes. Fifth, we estimated models adjusted for procedure-specific time trends to allow for differential time trends between the intervention and control procedures during the pre-policy period (eTable 7). 4 A difference-in-differences model that includes intervention and control procedure-specific time trends allows the pre-existing trends to differ in the two groups, and it can be useful for checking the robustness of the results. 13,14 Sixth, we conducted analyses using a different control procedure, carotid endarterectomy, to address concerns related to wound class incomparability between groups (eTable 8). We selected carotid endarterectomy because it is in the same wound class, it is high in volume, and it involves a procedure from a different service line. Finally, we assessed potential cost shifting or a decrease in quality in non-complication procedures (eTable 9).
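The placebo indicator used in the fourth sensitivity analysis can be sketched as a shifted cutoff. This is a minimal illustration under stated assumptions: periods are represented as hypothetical (year, quarter) pairs, the actual cutoff follows the model definition of Post (equal to 1 after the third quarter of 2008), and the sketch does not address which observations enter the placebo sample.

```python
POLICY_CUTOFF = (2008, 3)  # Post = 1 after the third quarter of 2008
PLACEBO_CUTOFF = (POLICY_CUTOFF[0] - 1, POLICY_CUTOFF[1])  # one year earlier


def post_indicator(year, quarter, cutoff=POLICY_CUTOFF):
    """Return 1 for periods after the cutoff (year, quarter), else 0."""
    return 1 if (year, quarter) > cutoff else 0


# Hypothetical discharge periods.
periods = [(2007, 3), (2007, 4), (2008, 3), (2008, 4)]

actual_post = [post_indicator(y, q) for y, q in periods]
placebo_post = [post_indicator(y, q, PLACEBO_CUTOFF) for y, q in periods]
print(actual_post)   # [0, 0, 0, 1]
print(placebo_post)  # [0, 1, 1, 1]
```

Re-estimating the difference-in-differences model with the placebo indicator in place of Post should yield no association if the main results are not driven by secular changes in the outcomes.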
Estimates for hospital costs, measured for an inpatient stay with cost-to-charge ratios, are changes in percentage.
d Adjusted for inflation to 2017 dollar values using the personal consumption expenditures health-by-function index.
b We performed placebo difference-in-differences models by repeating the main analyses with a binary placebo P4P indicator that treated the HAC-POA policy as if it had been implemented a year earlier.
c The difference in the intervention procedures is the difference in the average marginal effect of the outcome between the pre- and post-policy implementation period for patients in the intervention procedures (i.e., cardiac [implantable electronic devices], orthopedic [spine, neck, shoulder, elbow], and obesity-related bariatric procedures for SSI; total knee and total hip replacement for DVT; and all of the stated procedures for length of stay, mortality, and hospital costs).
d The difference in the control procedures is the difference in the average marginal effect of the outcome between the pre- and post-policy implementation period among patients in the control procedures (i.e., laparoscopic cholecystectomy and laparoscopic appendectomy).

a Surgical complications are defined as either SSI or DVT.
b Intervention procedures include patients who underwent the HAC-POA policy's targeted procedures (cardiac [implantable electronic devices], orthopedic [spine, neck, shoulder, elbow, total knee, and total hip replacement], and obesity-related bariatric procedures).
c Control procedures include patients who underwent laparoscopic appendectomy and laparoscopic cholecystectomy.
d The difference-in-difference-in-differences estimate is the differential effect of the policy between the intervention and the control procedures before and after the policy implementation (difference-in-differences) across patients with surgical complications vs. no complications. Estimates for length of stay are changes in days. Estimates for mortality are predicted probability changes in percentage points. Estimates for hospital costs are changes in percentage.
e The difference in the intervention procedures without surgical complications is the difference in the average marginal effect of the outcome in the intervention procedures between pre- and post-policy implementation among patients without surgical complications (SSI or DVT).
f The difference in the intervention procedures with surgical complications is the difference in the average marginal effect of the outcome in the intervention procedures between pre- and post-policy implementation among patients with surgical complications.
g The difference in the control procedures without surgical complications is the difference in the average marginal effect of the outcome in the control procedures between pre- and post-policy implementation among patients without surgical complications.
h The difference in the control procedures with surgical complications is the difference in the average marginal effect of the outcome in the control procedures between pre- and post-policy implementation among patients with surgical complications.