Figure 1. Comparison of observed vs predicted mortality rates by risk level, obtained by applying the Catalan Study,9 Parsonnet et al,10 and Higgins et al11 models, respectively, to our multicenter population. The model citations refer to the results of applying these models to the current study's population, not to the results reported in the original publications. Solid lines represent actual mortality rate; dotted lines, perfect fit for each model.
Figure 2. Ranking of centers by standardized mortality ratio, obtained by applying the Catalan Study,9 Parsonnet et al,10 and Higgins et al11 models, respectively, to our multicenter population. The model citations refer to the results of applying these models to the current study's population, not to the results reported in the original publications. The horizontal bars help the reader see whether a center's observed mortality rate is higher or lower than expected and, depending on whether the 95% confidence interval of the standardized mortality ratio excludes or includes the value 1, whether the difference is or is not statistically significant.
Pons JMV, Espinas JA, Borras JM, Moreno V, Martin I, Granados A. Cardiac Surgical Mortality: Comparison Among Different Additive Risk-Scoring Models in a Multicenter Sample. Arch Surg. 1998;133(10):1053-1057. doi:10.1001/archsurg.133.10.1053
To compare the performance of several risk-scoring models to predict surgical mortality following open heart surgery.
A prospective observational study.
Seven tertiary cardiac centers (3 private and 4 public and teaching hospitals) in Catalonia (Spain).
A consecutive sample of 1287 patients submitted to open heart surgery during a 6½-month period (February 14, 1994, to August 31, 1994).
Main Outcome Measure
Model discrimination capability was assessed with the c-statistic. A χ2 test to compare observed and predicted mortality rates was used as a measure of model calibration. Performance of centers was evaluated through the standardized mortality ratio and using the center as an indicator variable in a logistic regression model. The agreement among models for individual predictions was tested using weighted κ statistics.
Models developed in other health care contexts showed, as expected, lower c-statistics and inadequate calibration. There were no statistically significant differences among hospitals after adjusting for patients' baseline risk factors with any of the models. The models also agreed in the standardized ranking of centers. Weighted κ statistics indicated poor agreement among models for individual patient risk prediction.
Models can be a useful tool to compare providers' performance and to give a more in-depth look at the process of care when appropriately customized to the context. Severity-adjusted models can also play a role in supporting the surgeon's informed but subjective assessment, but it is inappropriate to use them for individual predictions.
SPECIFIC HOSPITAL mortality rates have received increasing attention as a measure of health care outcome. However, crude hospital mortality rate is an inaccurate indicator since it does not consider the severity of illness. When comparisons are made, it is essential to adjust mortality rates according to the presence of factors that might determine the risk of an adverse outcome.
Recent data on hospital cardiac surgical mortality have generated both controversy and confusion among stakeholders in health care systems: consumers, purchasers, and health care providers (hospitals and surgeons).1 On the one hand, proponents of releasing this kind of information argue that despite potential inaccuracies, hospitals with very high mortality rates are likely to provide poor quality of care, and that increased consumer knowledge will lead to a greater demand for all hospitals to ensure quality of care.2 On the other hand, there are arguments against public disclosure of provider-specific mortality rates. Criticism is directed at the quality of data used, the inaccuracy of models, and the misunderstanding of this information by the media. The use of additional indicators that can also be risk-adjusted, such as perioperative complications, improvements in functional capacity, quality of life, and patient satisfaction, as well as cost-benefit analysis, has also been suggested.3
Several severity measurement tools to assess surgical risk are now available. They differ in their classification approach, conceptual foundation, risk factors included, outcome definition, potential reliability, resistance to manipulation, and availability of documentation.4 The characteristics of the population analyzed and the way a system was developed may affect its applicability to other health care settings.5 To our knowledge, there are few published studies that analyze the performance of several predictive models for patients who have undergone coronary artery bypass graft (CABG) surgery. Two of these assess the validity of 4 severity-adjusted models in an independent surgical database coming from a single center.6,7 Another study compared the performance of different CABG providers by use of severity models developed for hospitalized patients.8
The aim of our study is to compare how different additive risk-scoring models work in a multicenter sample of patients subjected to open heart surgery. Models were compared with regard to their calibration and discrimination capability, the assessment of differences among providers in surgical mortality, and individual patient prediction.
The population in this study came from the Catalan Study (CS) on Open Heart Surgery, and detailed methods have been described previously.9 All consecutive open heart procedures carried out in adult patients in 7 centers in Barcelona, Spain, each identified by a number, were included during a 6½-month period (February 14, 1994, to August 31, 1994). Data were registered on a specifically designed sheet. Overall, 1287 open heart procedures were collected after excluding heart transplantations (22 cases performed in only 2 hospitals).
Three risk stratification models to predict surgical mortality were selected for our analysis. The CS model came from the study referred to above.9 The 2 other models were selected because they encompassed a range of extracorporeal cardiac procedures wider than CABG alone, were additive, and were not based on administrative data sets. The Parsonnet et al10 method is addressed to acquired adult heart disease; it stratified patients into 5 risk categories. Two factors ("catastrophic state" and "rare circumstances") in the first version were valued subjectively. In the use of this risk model in our population, we gave a fixed value to these subjective items depending on the presence of any catastrophic state or rare circumstance and on the preoperative subjective risk assessment made by the surgeon. This subjective assessment, based on clinical judgment and data available before surgery, used 5 categories of risk. The other model selected was the Higgins et al11 clinical severity score addressed to patients who had undergone CABG surgery and those who had accompanying procedures. For this study, we recoded the 9 severity categories used by Higgins et al into an ordinal scale with 5 risk levels based on similarity in observed mortality rates as shown in Figure 2 of the study by Higgins et al.
Predicted mortality rates were calculated for each model according to the original criteria. For the Parsonnet et al model, the predicted rate for each risk level was calculated averaging the individual scores within each category. For the Higgins et al model, predicted mortality rates were estimated from the figures of the original publication because the exact numbers were not reported. For the CS model, predicted rates were calculated in the validation subsample applying the observed mortality rates of the training subsample. A χ2 test to compare observed and predicted mortality rates was used as a measure of the calibration of the model. For each model, a c-statistic, which equals the area under a receiver operating characteristic curve, was used as a measure of discrimination (a c value of 0.5 suggests no ability to discriminate, and a value of 1.0 indicates perfect discrimination).12
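The two measures described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the c-statistic is computed as the proportion of concordant death/survivor pairs (ties counted as one half), and the calibration statistic is assumed to be a Pearson χ2 summed over risk levels, since the exact form of the test is not stated here.

```python
from itertools import product


def c_statistic(scores, died):
    """Share of (death, survivor) pairs in which the death has the higher score.

    Ties count as 0.5, which makes the result equal to the area under the
    ROC curve. `scores` are risk scores; `died` holds 1 for death, 0 otherwise.
    """
    deaths = [s for s, d in zip(scores, died) if d]
    survivors = [s for s, d in zip(scores, died) if not d]
    if not deaths or not survivors:
        raise ValueError("need at least one death and one survivor")
    concordant = 0.0
    for s_dead, s_alive in product(deaths, survivors):
        if s_dead > s_alive:
            concordant += 1.0
        elif s_dead == s_alive:
            concordant += 0.5
    return concordant / (len(deaths) * len(survivors))


def calibration_chi2(observed_deaths, n_patients, predicted_rates):
    """Pearson chi-square across risk levels (assumed form of the test).

    For each level, expected deaths are n * predicted rate; both the death
    and the survivor cells contribute (O - E)^2 / E.
    """
    chi2 = 0.0
    for obs, n, rate in zip(observed_deaths, n_patients, predicted_rates):
        exp_dead = n * rate
        exp_alive = n * (1.0 - rate)
        chi2 += (obs - exp_dead) ** 2 / exp_dead
        chi2 += ((n - obs) - exp_alive) ** 2 / exp_alive
    return chi2
```

A c value near 0.5 from `c_statistic` would correspond to no discrimination, and a large `calibration_chi2` value would indicate that observed and predicted rates diverge across risk levels.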
Differences in centers' performances were assessed through the standardized mortality ratio (SMR), which is the ratio of the observed and the expected mortality rates. Correlations of centers' SMR order were assessed by the Spearman rank correlation coefficient. To test for centers' homogeneity in surgical mortality, a logistic regression analysis for each model was used. Finally, the agreement among models for individual predictions was tested using a weighted κ statistic in the cross-classification tables generated with each pair of models.
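As a rough illustration of these two statistics, the sketch below computes an SMR with an approximate 95% confidence interval and a weighted κ for a square cross-classification table. Both choices are assumptions made for illustration only: the interval uses a simple log-scale approximation (standard error of ln SMR taken as 1/√observed deaths), and the κ uses linear disagreement weights. The methods described here do not specify either choice, and the function names are hypothetical.

```python
import math


def smr_with_ci(observed_deaths, expected_deaths, z=1.96):
    """Standardized mortality ratio with an approximate confidence interval.

    Assumes observed_deaths > 0 and uses a log-scale normal approximation;
    exact Poisson limits would be preferable when deaths are few.
    """
    smr = observed_deaths / expected_deaths
    se_log = 1.0 / math.sqrt(observed_deaths)
    return smr, smr * math.exp(-z * se_log), smr * math.exp(z * se_log)


def weighted_kappa(table):
    """Weighted kappa for a k x k table of counts (rows: model A risk level,
    columns: model B risk level), with linear disagreement weights."""
    k = len(table)
    total = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed_dis = 0.0
    expected_dis = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at the corners
            observed_dis += weight * table[i][j] / total
            expected_dis += weight * row_tot[i] * col_tot[j] / total ** 2
    return 1.0 - observed_dis / expected_dis
```

An interval from `smr_with_ci` that includes 1 would be read as no statistically significant difference between observed and expected mortality for that center.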
The population characteristics of the Higgins et al and the CS models are given in Table 1. Unfortunately, data on the Parsonnet et al model were not available from the original publication. The most striking inequalities were related to reoperation rates, the prevalence of chronic obstructive pulmonary disease while taking medication, and kidney disease. Other more clearly defined factors (demographic, surgical, and creatinine level) did not differ substantially.
Table 2 gives the sample population where the model was applied, the score values for each category, patients' distribution by risk level, and the observed mortality rate for each of the 5 risk categories. All models combined the highest scores in the worst risk category. Except for the Parsonnet et al model, the first 2 risk levels composed more than 60% of patients.
In Table 3 we present the c-statistic, predicted and observed mortality rates in the population selected for the different models, and the χ2 test for calibration. The highest c-statistic corresponded to the model specifically designed for this population (CS model). Statistically significant differences between observed and predicted mortality rates were seen in the 2 external models. The Parsonnet et al model underestimated the low risk level and overestimated the poor, high, and extremely high risk levels. The Higgins et al model uniformly underestimated the risk through all categories (Figure 1).
There were no statistically significant differences in mortality among centers when adjusting for any risk model, as given in Table 3. Accordingly, all SMR 95% confidence intervals included the value 1, as shown in Figure 2. There was almost perfect agreement among models in the order assigned to centers according to the SMR. The Spearman rank correlation coefficient showed statistically significant values (P≤.02) for all the pairwise comparisons (rs between the CS and the Higgins et al models=0.89; rs between the Higgins et al and the Parsonnet et al models=1.00; and rs between the CS and the Parsonnet et al models=0.88).
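For readers unfamiliar with the rank correlation used here, a minimal Python sketch follows. It assumes no tied SMR values among the centers, so the classic formula 1 − 6Σd²/(n(n² − 1)) applies; the function names are hypothetical and any input values are illustrative, not study data.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_no_ties(x, y):
    """Spearman rank correlation between two equally long lists without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have the same length of at least 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

With 7 centers, two models that order the centers identically give a coefficient of 1, and swapping a single adjacent pair lowers it only slightly, which is why values close to 1 indicate near-identical rankings.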
The weighted κ statistic between the Parsonnet et al and the CS models was 0.29 (n=1287); between the Higgins et al and the CS models, 0.50 (n=715); and between the Parsonnet et al and the Higgins et al models, 0.40 (n=715).
This study underlines the applicability of risk stratification models when assessing cardiac surgical mortality. Risk models can be used to compare different providers and to offer a more objective and adjunctive assessment of patients' risks, but it is inappropriate to use them to make individual predictions, or to base clinical decisions only on this assessment.
Risk models applied to our surgical database used different methods to select risk factors; they included different numbers of variables, assigned different weights to the risk factors selected, and produced different classifications of patients by levels of surgical risk. To avoid any subjective assessment of risk factors, most of the models try to include factors that can be objectively measured. Nevertheless, some of the variables that contributed the most in the first version of the Parsonnet et al model were valued subjectively, although in a more recent version the subjective input has been eliminated.13 Except for the CS model, the models were developed and validated in a single institution, although these external models have been applied by other institutions in other settings.6,7
The characteristics of the population where the models are developed should be a primary criterion for selecting a model to be applied in other institutions and in other settings.14 Most developed models on cardiac surgery deal with CABG surgery, the most common open heart procedure in developed countries. However, international registers have shown heterogeneity in the type of open heart procedures among countries.15 Therefore, as has been suggested, small differences in population selection may lead to different combinations of variables being selected for any predictive model.14 When applying the different models to the same multicenter population, we found that any model can be used to test for heterogeneity among centers. The models' ranking of providers according to SMR showed that there was a good concordance in the order assigned.
To compare centers, predictive accuracy of an external model can be restored by an analytic adjustment for the differences in mortality prevalence in the 2 different populations.14,16 However, if the analyses of adjusted surgical outcomes have to be interpreted as an indicator that one should look more deeply into a specific center situation, models designed specifically for the study population are needed.
This study also points out the limitations of any predictive model when applied to predict individual risk, as has been the case with severity systems used in patients in intensive care settings. Using these predictive models as an adjunct to informed but subjective opinions made by surgeons is a reasonable and prudent choice, but using them to dictate individual patient decisions does not seem appropriate.17
There are some limitations to our study. One is sample size; the number of patients operated on by a single center in this study falls well below the hundreds of patients needed to detect meaningful differences in mortality rates. All predictive models also share another type of limitation. They cannot adjust for all patient characteristics that may have, at least in some cases, an important impact on surgical mortality. Neither can they consider other patient nonclinical factors or inequalities in technical and therapeutic resources available in cardiac services nor the administrative or managerial differences in practices. Also, specifically for predictive models in surgical mortality, they cannot consider technical skills of surgeons that can be related to the learning curve and continual practice. Some risk assessment models have presented surgical adjusted mortality as a measure of surgeons' technical skills, but this approach is still open to debate, and its potential ramifications are being scrutinized.18,19 Finally, models cannot assess another important issue in any procedure: its appropriateness.
Our study showed that predictive models can be a useful tool to standardize and to assess the performance of different providers. Although external models for open heart surgical risk are not as reliable as the model specifically designed for our study population, there is an agreement among them in the SMR relative value among centers. Risk models can play a role as an adjunct to the informed, although subjective, risk assessment made by the surgeons, but it is inappropriate to use them to dictate individual patient decisions. A specific approach might help to assess factors associated with the observed and expected differential rates. This analysis can provide insight into the process of care and, therefore, improve quality of care.
We are indebted to the cardiac surgeons at Centre Quirúrgic Sant Jordi, Clínica Quirón, Hospital de Barcelona, Hospital Clínic i Provincial, Hospital General de la Vall d'Hebron, Barcelona, Spain; Hospital Prínceps d'Espanya de Bellvitge, L'Hospitalet, Spain; and Hospital de la Santa Creu i Sant Pau, Barcelona, for their support and cooperation; to the following surgeons as a representative of participating centers: Alejandro Aris, MD, Eduard Castells, MD, Lluïsa Camera, MD, Josep M. Caralps, MD, Carles Fontanillas, MD, Francisco Murillo, MD, Jaume Mulet, MD, Marcos Murtra, MD, Jose Luis Pomar, MD, Félix Rovira, MD, Josep Oriol Sole, MD; to Maria Cardona, MD, as research assistant for the study; to A. Ginel, MD, and J. Montiel, MD, for their data collection assistance; to Cari Almazan, MD, Albert J. Jovell, MD, and Laura Sampietro-Colom, MD, for their helpful comments; and to David Lavine for his assistance in manuscript preparation.
Reprints: Joan M. V. Pons, MD, Catalan Agency for Health Technology Assessment, Travessera de les Corts 131-159, Pavelló Ave Maria, 08028 Barcelona, Spain (e-mail: email@example.com).