Patient recruitment and losses. Exclusion criteria included a main diagnosis other than hip or knee osteoarthritis (OA); malignant, severe organic, or psychiatric diseases; and failure to undergo the surgical intervention for any reason (death, intervention at another hospital, or refusal to undergo the intervention) within 1 year after inclusion in the study. Each percentage is estimated based on the previous frequency.
Quintana JM, Escobar A, Arostegui I, Bilbao A, Azkarate J, Goenaga JI, Arenaza JC. Health-Related Quality of Life and Appropriateness of Knee or Hip Joint Replacement. Arch Intern Med. 2006;166(2):220-226. doi:10.1001/archinte.166.2.220
Copyright 2006 American Medical Association. All Rights Reserved. Applicable FARS/DFARS Restrictions Apply to Government Use.
We studied the association between explicit appropriateness criteria for total hip joint replacement (THR) and total knee replacement (TKR) with changes in health-related quality of life of patients undergoing these procedures.
Prospective observational study of 1576 consecutive patients with diagnoses of osteoarthritis on waiting lists to undergo THR or TKR. Explicit appropriateness criteria using the RAND appropriateness method were applied. Patients completed 2 questionnaires that measured health-related quality of life, the Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36) and the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), before the procedure and 6 months afterward.
Patients who were considered appropriate candidates for these procedures had greater improvements than those who were considered inappropriate candidates in all 3 WOMAC domains (pain, functional limitation, and stiffness; THR: 43.0, 40.6, and 40.4 vs 14.7, 19.1, and 15.9; TKR: 34.9, 32.5, and 30.2 vs 23.2, 18.9, and 17.1; P<.001 for all comparisons). Patients who underwent THR and were judged to be appropriate candidates had greater improvements in the physical function, role–physical, bodily pain, and social function domains of the SF-36 than those judged to be inappropriate candidates (34.4, 35.1, 33.1, and 26.6 vs 19.6, 9.2, 5.7, and 7.0; P = .04, P = .03, P < .001, and P < .001, respectively). Appropriate candidates for TKR demonstrated greater improvement in the social function domain of the SF-36 after the procedure than those deemed inappropriate candidates (19.9 vs 7.9; P = .004) but not in the other domains of functional status.
These results suggest a direct relationship between explicit appropriateness criteria and better health-related quality-of-life outcomes after THR and TKR surgery. Our results support the use of these criteria for clinical guidelines or evaluation purposes.
As life expectancies increase, the rates of hip joint and knee replacement are expected to increase.1-3 Although these procedures are expensive,4-6 they also are among the most effective in terms of patient benefits. Substantial variations in the indications for a variety of surgical procedures, including hip joint and knee replacement, have been reported during the last 20 years.7,8
The RAND appropriateness method9 combines expert opinion with available scientific evidence to create explicit appropriateness criteria. Following this model, our group assembled 2 panels of experts: one to develop appropriateness criteria for total hip joint replacement (THR)10 and the other for total knee replacement (TKR).11
In an effort to validate the explicit appropriateness criteria, which share similar variables, we conducted a prospective observational study to examine the relationship between appropriateness evaluation and outcomes measured by 2 validated health-related quality of life (HRQoL) instruments: the generic Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36)12 and the specific Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC).13 We hypothesized that if the appropriateness criteria did indeed offer good clinical guidance, patients considered appropriate candidates would have greater HRQoL improvements in all relevant domains of these instruments.
First, we performed an extensive literature review to summarize the existing knowledge on the effectiveness and risks of THR and TKR for treating patients with osteoarthritis. Second, from this review, comprehensive and detailed lists of mutually exclusive and clinically specific scenarios (indications) were developed in which THR or TKR might be performed. This list contained 216 scenarios for THR and 624 for TKR. For THR, these scenarios included the following variables: age, bone quality measured by x-ray examination according to the classification of Singh et al,14 surgical risk (based on the American Society of Anesthesiologists criteria15), previous nonsurgical procedures performed, pain, and functional limitations assessment (based on the American College of Rheumatology classification16 and need for a mobility aid). For TKR, the scenarios included the following variables: age, previous surgical interventions, anatomical location, symptoms and functional impairment, joint mobility and stability, and radiology of the lesion (based on the Ahlbäck classification17).
An appropriate procedure is one in which “the expected health benefit exceeds the expected negative consequences by a sufficiently wide margin that the procedure is worth doing, exclusive of cost.”9(p54) Ratings were based on a 9-point scale. The use of each procedure for a specific scenario was considered appropriate if the panel's median rating was between 7 and 9 without disagreement, inappropriate if the value was between 1 and 3 without disagreement, and uncertain if the median rating was between 4 and 6 or if panel members disagreed. Disagreement was defined as a minimum of one third of the panelists rating an indication from 1 to 3 and a minimum of one third rating it from 7 to 9.
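The classification rule described above can be written out compactly. The following Python sketch is illustrative only and is not the software used by the panels. The function name `classify_scenario` is hypothetical. It assumes integer ratings from 1 to 9, and it treats any median that falls outside the 1 to 3 and 7 to 9 bands (for example, 6.5 on a panel with an even number of members) as uncertain, a case the definition above does not address.

```python
from statistics import median


def classify_scenario(ratings):
    """Classify one scenario from a panel's ratings on the 9-point scale.

    Hypothetical helper: the handling of medians that fall between
    bands (such as 3.5 or 6.5) is an assumption of this sketch.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 9 for r in ratings):
        raise ValueError("ratings must lie between 1 and 9")

    n = len(ratings)
    low = sum(1 for r in ratings if r <= 3)    # panelists rating 1 to 3
    high = sum(1 for r in ratings if r >= 7)   # panelists rating 7 to 9

    # Disagreement: at least one third of panelists in each extreme band.
    disagreement = 3 * low >= n and 3 * high >= n
    if disagreement:
        return "uncertain"

    m = median(ratings)
    if m >= 7:
        return "appropriate"
    if m <= 3:
        return "inappropriate"
    return "uncertain"


# Example with made-up ratings from a hypothetical 9-member panel.
print(classify_scenario([8, 8, 9, 7, 7, 8, 9, 6, 7]))
```

Checking disagreement before the median mirrors the wording of the rule, in which a scenario with panel disagreement is uncertain regardless of its median.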
Third, we formed 2 independent national panels. The panelists were provided with the literature review and the list of indications and asked to rate each one for the appropriateness of performing each procedure. The ratings were confidential and took place in 2 rounds, using a modified Delphi process. The results of both panels were reported previously.10,11
The prospective observational study took place in 5 large and 2 medium-sized public teaching hospitals with similar human and technical resources located in the Basque Country, serving a total population of 2 million inhabitants. These medical institutions belong to the network of public hospitals of the Basque Health Care Service–Osakidetza, a local government that is part of the Spanish National Health Service, which provides free unrestricted care to nearly 100% of the population. Physicians in each hospital were blinded to the study goals. The hospitals' ethics review boards approved both projects.
Consecutive patients with osteoarthritis scheduled to undergo THR or TKR in any of the 7 hospitals were eligible for the study. Between March 1999 and March 2000, 1495 patients were placed on waiting lists to undergo THR and 1369 to undergo TKR. Patients with severe comorbidities, such as cancer, terminal disease, or psychiatric conditions, and those whose main diagnosis was not hip or knee osteoarthritis or who failed to undergo the surgical intervention 1 year after inclusion in the study were excluded from analysis. All patients were assessed before the procedure and 6 months afterward. The Figure shows the recruitment process.
All patients on the waiting list for THR or TKR were sent a letter that described the study and asked for their voluntary participation. This mailing included the SF-36 and WOMAC questionnaires and sociodemographic information. A reminder letter was sent to patients who had not replied after 15 days. We sent the questionnaires again and contacted by telephone those who still had not replied after another 15 days. Six months after the intervention, patients were sent another letter with the questionnaires and additional questions on the clinical aspects of their disease and satisfaction with the intervention. The satisfaction question was dichotomized as being satisfied or not. At this time, patients answered a transitional question about their joint improvement after the intervention. The possible responses included “a great deal better,” “somewhat better,” “equal,” “somewhat worse,” or “a great deal worse.” Those who had not replied were followed up as described previously.
The SF-36 questionnaire12 covers 8 domains and 2 summary scales, physical and mental. The scores for the SF-36 scales range from 0 to 100, with a higher score indicating better health status. The SF-36 has been translated into Spanish and validated in Spanish populations.18
The WOMAC13 covers 3 dimensions: pain, stiffness, and physical function. We used the categorical version with 5 response levels for each item. The data were standardized to a range of values from 0 to 100, with 0 representing the best health status and 100 the worst possible status. The original and Spanish questionnaire versions are reliable, valid, and sensitive to the changes in the health status of patients with hip or knee osteoarthritis.19,20
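As an illustration of the standardization described above, the following Python sketch rescales a domain's summed item responses to the 0 to 100 range. It is not the authors' scoring code. It assumes that each item is coded from 0 (best) to 4 (worst) and that the pain, stiffness, and physical function domains contain 5, 2, and 17 items, respectively; the function name `standardize_womac` is hypothetical, and the handling of missing items is not addressed.

```python
# Assumed item counts per domain and assumed 0-4 coding of the
# 5 response levels; both are assumptions of this sketch.
WOMAC_ITEM_COUNTS = {"pain": 5, "stiffness": 2, "function": 17}
MAX_ITEM_SCORE = 4


def standardize_womac(domain, item_scores):
    """Return a 0-100 domain score (0 = best status, 100 = worst)."""
    expected = WOMAC_ITEM_COUNTS[domain]
    if len(item_scores) != expected:
        raise ValueError(f"{domain} requires {expected} item responses")
    if any(s < 0 or s > MAX_ITEM_SCORE for s in item_scores):
        raise ValueError("item responses must lie between 0 and 4")
    # Divide the raw sum by the maximum possible sum, then scale to 100.
    return 100.0 * sum(item_scores) / (MAX_ITEM_SCORE * expected)


# Example with made-up responses to the pain items.
print(standardize_womac("pain", [1, 2, 3, 2, 2]))
```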
We retrieved data from the hospital and physician medical records that included variables before the intervention, at admission, and at discharge. Besides those variables that belonged to the appropriateness algorithm, other variables collected included sociodemographic data, all comorbidities included on the Charlson Comorbidity Index,21 local and general complications perioperatively and postoperatively, reintervention, death, and length of hospital stay. Six months after discharge, all medical records were reviewed to determine if the patient had been readmitted, had any complication resulting from the intervention, or had died.
Three physicians blinded to the specific study goals extracted the data from the patients' medical records and recorded them on a prepared form. The physicians were trained to retrieve the medical record data in a particular way and were tested for agreement. Members of the research team also reviewed those variables in a sample of 121 records to test the accuracy of the data retrieved by the reviewers. This review resulted in a κ value for dichotomous variables (diagnosis, surgical risk, death, readmissions, and reintervention) above 0.99 and an intraclass correlation coefficient for length of hospital stay of 0.99.
The unit of study was the patient. In cases in which 2 interventions were performed in the same patient, we selected the first one. Longitudinal data analysis was performed in all available cases. A general linear model was fitted with HRQoL as the dependent variable over time for the 8 domains, the 2 summary measures of the SF-36, and the 3 WOMAC domains. Appropriateness was the independent variable, adjusted by sex. Multilevel analysis with mixed models was performed to test differences among hospitals in HRQoL improvement for the 3 appropriateness categories.
For all WOMAC and SF-36 domains, we estimated, by procedure, the standard error of measurement (SEM) by the following formula:
$$\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - R}$$
where SD is the standard deviation of the sample at baseline and R is the reliability coefficient.22 We used the Cronbach α as a reliability measure.23 From the standard error of measurement, we derived the minimal detectable change (MDC), which is the result of the following multiplication:
$$\mathrm{MDC} = z \times \sqrt{2} \times \mathrm{SEM}$$
We established a 95% confidence interval, which corresponds to a z score of 1.96. The MDC represents the smallest change in score that likely reflects true change rather than measurement error alone.24 The MDC proportion is the proportion of patients with changes in scores that exceeded the MDC. The minimal clinically important difference (MCID) has been defined as the smallest difference between the scores in a questionnaire that the patient perceives to be beneficial. It is an anchor-based method that fixes a threshold that demarcates trivial from small but important differences. The MCID was calculated for those patients who, at the 6-month visit, answered that their joint was “somewhat better than before the intervention” to a transitional question.25 The MCID proportion reflects the proportion of the sample with change scores exceeding the MCID.
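The quantities defined above can be illustrated with a short Python sketch. It is not the SAS code used in the study. It assumes the conventional expressions SEM = SD × √(1 − R) and MDC = z × √2 × SEM, assumes that change scores are expressed as improvements (positive values mean better status), and assumes that the MCID is taken as the mean improvement among patients who answered “somewhat better”; all function names and example numbers are hypothetical.

```python
import math
from statistics import mean


def standard_error_of_measurement(sd_baseline, reliability):
    """SEM from the baseline SD and a reliability coefficient (e.g. Cronbach alpha)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_baseline * math.sqrt(1.0 - reliability)


def minimal_detectable_change(sem, z=1.96):
    """MDC at the confidence level implied by z (1.96 for 95%)."""
    return z * math.sqrt(2.0) * sem


def mcid_from_anchor(improvements, transition_answers, anchor="somewhat better"):
    """Mean improvement among patients giving the anchor answer (assumed definition)."""
    anchored = [c for c, a in zip(improvements, transition_answers) if a == anchor]
    if not anchored:
        raise ValueError("no patient gave the anchor answer")
    return mean(anchored)


def proportion_exceeding(improvements, threshold):
    """Share of patients whose improvement is strictly greater than the threshold."""
    return sum(1 for c in improvements if c > threshold) / len(improvements)


# Made-up example: baseline SD of 20 points and reliability of 0.91.
sem = standard_error_of_measurement(20.0, 0.91)
mdc = minimal_detectable_change(sem)
improvements = [30.0, 10.0, 20.0, 5.0]
answers = ["a great deal better", "somewhat better", "somewhat better", "equal"]
print(round(mdc, 1), proportion_exceeding(improvements, mdc))
print(mcid_from_anchor(improvements, answers))
```

Because a higher WOMAC score indicates worse status and a higher SF-36 score indicates better status, change scores would need to be signed consistently before being passed to these helpers.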
All effects were considered statistically significant at P<.05, unless otherwise noted. All statistical analyses were performed using SAS for Windows statistical software, version 8.2 (SAS Institute Inc, Cary, NC).
A total of 1576 patients were included in this study, with 784 undergoing THR and 792 undergoing TKR (Figure). No statistically significant differences occurred between responders and nonresponders at 6 months regarding sociodemographic variables, main clinical characteristics (including pain and functional limitation), or appropriateness evaluation.
The mean age of the patients undergoing THR was 69.1 years, and 48.3% were women. For patients undergoing TKR, the mean age was 71.9 years, and 73% were women. The inappropriateness rate was higher for TKR than THR (12.4% vs 5.2%).
Among patients who underwent THR, those whose procedures were deemed appropriate had significantly greater improvements than those whose procedures were deemed inappropriate in the physical function, role–physical, bodily pain, and social function domains and the physical component summary scale of the SF-36, as well as in the 3 domains of the WOMAC. Differences in improvement were also observed between those judged as uncertain and those judged inappropriate in the same domains, except for the social function domain of the SF-36 (Table 1). Among patients who underwent TKR, significant differences occurred by category of appropriateness for the social function domain of the SF-36, as well as for the 3 domains of the WOMAC (Table 2).
Multilevel analysis showed no differences among hospitals in HRQoL improvement by appropriateness categories for either surgical intervention. Therefore, we did not include the hospital as an interaction variable in the models.
The MDC values ranged from 12 to 29 for the WOMAC domains and from 19 to 42 for the SF-36 domains. For the whole sample, the proportion of patients who surpassed the MDC was higher than 50% for the SF-36 physical function and the WOMAC pain and functional limitation in both procedures (Table 3). The MCID values varied by procedure and questionnaire. More than half of patients undergoing THR surpassed the MCID in all SF-36 domains; approximately 75% did so on all the WOMAC domains. The percentages were lower for patients undergoing TKR.
A significantly higher proportion of patients undergoing THR judged as appropriate candidates surpassed the MDC or MCID than those judged as inappropriate candidates on all the WOMAC domains and the physical function, role–physical, bodily pain, social function, and mental health domains of the SF-36 (Table 4). Among patients undergoing TKR, a similar association was observed with all the WOMAC domains and the vitality and social function domains of the SF-36 (Table 4).
No differences were observed among the 3 appropriateness categories for any of the clinical indicators analyzed up to 6 months after discharge, except for the fatality rate among patients undergoing THR. Patients reported higher satisfaction with the intervention in the appropriate group than in the inappropriate group for both procedures. Among THR patients, those who underwent appropriate procedures reported better perception of their general health status compared with the previous year, greater relief of symptoms, and greater recovery than those who underwent inappropriate procedures (Table 5).
This prospective observational study of more than 1500 patients who underwent THR or TKR supports the validity of the criteria developed by the RAND appropriateness method for 2 procedures that share many characteristics, such as the role of symptoms in the clinical decision process or the outcome measures. In general, we found that patients who were deemed appropriate candidates by the criteria were more likely to have had better outcomes and greater improvements in HRQoL following THR or TKR than patients deemed inappropriate candidates.
Determining the appropriateness of a surgical intervention is important. However, it is difficult to develop evidence-based criteria because, in most cases, high-quality evidence from clinical trials is not available, either because clinical trials are rarely performed for surgical procedures or because they do not cover an important range of indications. For this reason, the RAND method was developed to create explicit appropriateness criteria. A major criticism of this method is the absence of studies that demonstrate the validity of such criteria.26,27
Our main hypothesis was that patients classified by our explicit appropriateness criteria for THR and TKR10,11 as having undergone an appropriate procedure would have larger improvements in HRQoL than patients classified as having undergone inappropriate procedures. Our results support this hypothesis. Both the SF-36 and the WOMAC demonstrated such larger improvements, but the evidence was stronger with the WOMAC. Therefore, our results support the validity of the explicit criteria because the appropriate group had much greater benefit than the inappropriate group, with similarly low risks.
Our results identified several issues that deserve further comment. First, we found that the uncertain group had similar improvements to those in the appropriate group, suggesting that in some cases uncertain indications might also be appropriate.
Second, an important unresolved debate currently exists about how to determine the smallest difference between the scores in an HRQoL questionnaire that the patient perceives to be beneficial.28 We tried to establish individual measures of improvement by estimating the MDC and the MCID. In our study, more patients in the appropriate group than in the inappropriate group had relevant gains on those 2 parameters on the 3 WOMAC domains and, among patients undergoing THR, on those domains of the SF-36 most related to the physical component of HRQoL. However, these scores varied considerably. As some authors have pointed out, establishing a definite MCID seems to be an almost impossible task.29,30
Third, differences between the appropriate and inappropriate groups appeared in fewer HRQoL domains among patients who underwent TKR than among those who underwent THR. The knee joint is more complex than the hip joint, and among patients with osteoarthritis, the benefits experienced after TKR are usually smaller than those experienced after THR.31 However, even patients who underwent inappropriate TKR had substantial deterioration of their HRQoL before the intervention, measured by both HRQoL tools, especially when compared with the inappropriate cases in the THR sample, who had better HRQoL scores.
Fourth, in this study, differences between the appropriateness categories were observed in all domains of the WOMAC in both procedures but in only some of the domains of the SF-36. This finding is not unusual because, as many authors have reported, specific HRQoL questionnaires such as the WOMAC generally have greater responsiveness than generic ones such as the SF-36.19,32
Fifth, the negative consequences of the intervention (clinical complications, reinterventions, or deaths) were minor and similar among the 3 appropriateness categories. Among those undergoing THR, 2 patients in the inappropriate group died, but the deaths were unrelated to the inappropriateness of the intervention. Nevertheless, other outcomes, such as satisfaction with the intervention, differed among the appropriateness categories. Although only a small number of patients underwent an inappropriate procedure, they were slightly less satisfied with the intervention and its result, which constitutes a negative consequence from the patient's perspective.
Another recent study33 evaluated the association between appropriateness and HRQoL among patients receiving a hip prosthesis. In that study, the surgeons were asked to complete a clinical indications form based on a set of proprietary guidelines. The information gathered was then reviewed to determine if the case profile matched the guidelines. Those investigators used the same HRQoL tools as we did. Limitations of that study were that the authors used privately published criteria, which are not publicly available, to determine the appropriateness of an indication and did not attempt to test the validity of the criteria. Appropriateness was judged by a binary response: appropriate or not. The sample size of that study was smaller than ours (n = 488), and the response rate was low (44%).
Our study has several strengths. We administered 2 widely used and validated HRQoL questionnaires that have been recommended by different authors for studying patients with hip or knee osteoarthritis.34,35 Furthermore, we collected the information prospectively in a large sample of patients, thus minimizing selection and information bias that may influence results of studies such as these. However, as a main limitation, we had a 25% rate of missing data at follow-up. Also, the fact that the appropriate group as a whole achieved a better HRQoL does not mean that every individual patient judged to be an appropriate candidate achieved a better HRQoL.
In conclusion, the results of this study support the predictive validity of our explicit appropriateness criteria by showing a greater benefit in HRQoL among patients considered to be appropriate candidates for these procedures compared with those classified as inappropriate candidates. These results support the use of our criteria for clinical guidelines or to determine the degree of appropriateness and variations in the use of THR or TKR. Although variations in the indication of THR or TKR are potentially attributable to variations in clinical decision making, other issues, such as lack of human or technical resources in a specific center, could limit the generalizability of our findings in other settings. Finally, as suggested by some authors,36 this method may be useful when comparing levels of appropriateness among populations but not to direct care for individual patients. When used as a utilization review tool, interventions considered inappropriate should undergo an individualized review before being definitively classified as inappropriate.37
Correspondence: José M. Quintana, MD, PhD, Unidad de Investigación, Hospital de Galdakao, Barrio Labeaga s/n, 48960 Galdakao, Vizcaya, Spain (firstname.lastname@example.org).
Accepted for Publication: August 28, 2005.
Financial Disclosure: None.
Funding/Support: This study was supported in part by grants from the Fondo de Investigación Sanitaria (98/001-01 to 03) and the thematic networks Red IRYSS of the Instituto de Salud Carlos III (G03/220) (Madrid, Spain). Ms Bilbao received a grant from the Department of Health of the Basque Government (Vitoria-Gasteiz, Spain).
Acknowledgment: We thank the following physicians for their contribution to this study: Jose M. Ordoñez, MD, Jose M. Vilarrubias, PhD, Jordi Ballester, PhD, Carlos Barrios, PhD, Mikel Sánchez, MD, Francisco Villar, PhD, Luis Gutierrez, PhD, Francisco Buendía, MD, Andrés Peña, MD, Félix Araluze, PhD, Joaquim Cabot, MD, Javier Vaquero, MD, Alfredo Queipo de Llano, MD, Manuel Gala, MD, Victor Alvarez, MD, Jose R. Caso, MD, Antonio Murcia, PhD, Alejandro Lizaur, MD, Jose R. Vesga, MD, Aníbal Ruiz, MD, Manuel Figueroa, MD (TKR panel); Angel Alfageme, PhD, Jose M. Aranburu, PhD, Jesús Azkoaga, MD, Pedro Armendariz, PhD, Enrique Cáceres, PhD, Arsenio Diego, MD, Begoña Goicoetxea, PhD, Iñigo Guisasola, MD, Manuel Martínez-Grande, PhD, Enrique Queipo de Llano, PhD, Ramón Tobio, PhD, Jose Villar, MD (THR panel). We also thank Arantza Higelmo, MD, Iratxe Lafuente, MSc, Alfonso Rodriguez, MD, and Ignacio Vidaurreta, MD, for their contribution to the development of the panel of experts, data retrieval, and data entry and to the Research Committee of the Galdakao Hospital. We are grateful for the support of the staff members of the different services, research, and quality units, as well as the medical records sections of the participating hospitals.