Figure. Mean physical function (A) and social-emotional function (B) scores (range, 0-100) by time from surgery for patients with oral and oropharyngeal cancer. Each of the first 3 time groups reflects a different set of patients, overlapping but with patients represented only once in each group: preoperatively, the closest to 3 months after operation but before 9.5 months, and the closest to 12 months but after 9.5 months. The last group comprises all other quality of life data after 9.5 months from surgery, representing a mixed bag of questionnaire data with multiple responses from patients ranging up to 15 years after surgery but excluding patients in the third time group. IQR indicates interquartile range.
Rogers SN, Lowe D, Yueh B, Weymuller Jr EA. The Physical Function and Social-Emotional Function Subscales of the University of Washington Quality of Life Questionnaire. Arch Otolaryngol Head Neck Surg. 2010;136(4):352-357. doi:10.1001/archoto.2010.32
To perform a factor analysis using the University of Washington Quality of Life Questionnaire version 4 (UW-QOLv4) to establish subscales; to report their normative values and variations for patients by age, sex, extent of disease, and time from treatment; and to estimate clinical effect sizes and potential for use in comparative treatment studies.
Regional Maxillofacial Unit, University Hospital Aintree, Liverpool, England.
Patients with primary oral and oropharyngeal cancer treated by surgery with or without adjuvant radiotherapy since 1992. A database accumulating since 1995 contains more than 2600 UW-QOLs completed by these patients. A data set of 372 patients without cancer attending 10 general dental practices provided normative data.
Main Outcome Measures
Results
Factor analysis indicated a 2-factor solution: (1) physical function, involving chewing, swallowing, speech, taste, saliva, and appearance, and (2) social-emotional function, involving anxiety, mood, pain, activity, recreation, and shoulder function. The best scores were for those with less advanced oral cancer tumors not requiring free-flap surgery or adjuvant radiotherapy. Older patients reported better scores, but associations were weak, and no sex differences were found. Significant differences were seen for T category, free-flap surgery, and adjuvant radiotherapy (P < .001) and for site (P < .01). Preoperative scores were close to normative values. Patients recover from social-emotional deficits by 1 year after surgery but continue with significant deficits in physical function. Comparative studies using these UW-QOL subscales as outcome measures should recruit at least 80 patients per treatment arm to detect moderately sized treatment effects.
With the UW-QOLv4, it is appropriate to analyze and report outcomes using the 2 subscales of physical and social-emotional function.
The value of reporting patient-derived outcomes using disease-specific questionnaires has been appreciated for many years.1 Validated head and neck–specific cancer health-related quality of life (QOL) questionnaires have emerged.2-5 One of the first published was the University of Washington Quality of Life Questionnaire (UW-QOL) in 1993.6 It is a commonly used measure.4,7,8 This, in part, is due to its brevity and simplicity of scoring, both of which make it an easy measure to use in a busy clinical setting.
There have been 4 versions of the UW-QOL.6,9-11 In version 1 (reference 6), there were 9 domains (pain, appearance, activity, recreation, swallowing, chewing, speech, shoulder, and employment). Because there was no global QOL question, the aggregate of all 9 items (a simple average of domains) was used to report a “composite” score.12 This composite 9 score was later found to correlate well with global QOL questions in other validated measures, such as that of the European Organization for Research and Treatment of Cancer.13 From version 2 onward,9 global QOL and health-related QOL questions were included. Two new domains each were added to version 3 (reference 10; taste and saliva) and version 4 (v4; reference 11; anxiety and mood), with the omission of employment from version 3 onward. Therefore, the current version (v4) of the UW-QOL has 12 domains.
The composite 12 (the average of the 12 domain scores) has been used by some investigators when describing health-related QOL outcomes, although its psychometric properties have not been reported. Factor analysis is a useful way to help understand how items in a questionnaire relate to each other. It can be used to determine whether these data fit to a single construct (and, hence, a single composite-derived score) or whether multiple constructs are suggested. The derivation of multiple subscales, if appropriate, should improve sensitivity and responsiveness because more items of a similar construct are brought together. The UW-QOL has face, content, and construct validity.14,15 Although factor analysis has been reported for other head and neck cancer–specific questionnaires, to our knowledge, it has never been reported for the UW-QOLv4.
The issue of interpreting clinically significant changes in patient-reported outcomes is important, especially when designing randomized trials. Such variables have been published for the Functional Assessment of Cancer Therapy–Head and Neck instrument.16 The UW-QOL domains and global scales have, at most, 6 discrete options and a skewed response, and these are difficult to handle in this context. Any composite or subscale score will have a wider numerical range and greater potential for being able to assess clinical effect in treatment evaluation studies and for calculating sample sizes.
The objectives of this study were 3-fold. The first was to use factor analysis to establish either a single composite score or several subscale scores. The second was to report normative values and variations in such values by age, sex, extent of disease, and time from treatment. The third was to estimate clinical effect sizes and to report potential as a clinical outcome in comparative treatment studies. This study is a natural part of the ongoing validation process of the UW-QOL.
Since 1992, all patients diagnosed as having head and neck cancer in the Regional Maxillofacial Unit at University Hospital Aintree, Liverpool, England, have been entered into a computerized head and neck database. We have used this database to identify patients with oral and oropharyngeal cancer who have been treated primarily by undergoing surgery with or without adjuvant radiotherapy.
Since 1995, we have regularly surveyed these patients using the UW-QOL, and the pool of responses through the years has been used to run factor analyses. Analyses focused primarily on 517 patients with UW-QOL data at least 9.5 months after primary surgery and the closest available to 12 months. The guidance of Staquet et al17 was followed to run factor analyses using maximum likelihood estimation and Promax oblique rotation. Regarding sample size, these authors concluded that with discrete and highly skewed scales for items, a minimum of several hundred patients is desirable.
We revisited data sets from our series of annual surveys since 2000 to compare within-patient temporal changes in scores from one year to the next. We also revisited a data set of 372 patients without cancer attending 10 general dental practices18 to compute normative values.
Resulting UW-QOLv4 subscale scores were compared between patient subgroups using the Mann-Whitney test. Associations of subscale scores with patient age and other UW-QOL global measures were measured using the Spearman correlation coefficient (r). Kappa coefficient of agreement statistics (κ) measured within-patient agreement of UW-QOL subscale score quartiles in responses 1 year apart. Kappa values greater than 0.60 represent “good” agreement, and those above 0.80 “very good” agreement.19 An extension of this approach called weighted κ was used to give credit for partial agreement by assigning weights to off-diagonal cells in the 4 × 4 agreement table. Quadratic weights were used, which base disagreement weights on the square of the amount of discrepancy, a weighting method mathematically equivalent to the intraclass correlation coefficient. The McNemar-Bowker test statistic was used to indicate systematic bias.
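The quadratic weighting described above can be sketched in a few lines. This is a minimal illustration of the textbook weighted κ formula, not the statistical software used in the study; the function name and the example counts are hypothetical.

```python
def quadratic_weighted_kappa(table):
    """Quadratic-weighted kappa for a square k x k agreement table of counts.

    table[i][j] = number of patients in quartile i at the first survey
    and quartile j at the next. Assumes the expected weighted disagreement
    is non-zero (i.e. responses are not all in a single category).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0
    expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight grows with the square of the discrepancy.
            w = ((i - j) / (k - 1)) ** 2
            observed += w * table[i][j] / n
            expected += w * row_tot[i] * col_tot[j] / (n * n)
    return 1.0 - observed / expected


# Hypothetical 4 x 4 quartile agreement table (illustrative counts only).
example = [
    [40, 8, 2, 0],
    [9, 35, 6, 1],
    [1, 7, 38, 5],
    [0, 2, 6, 42],
]
print(round(quadratic_weighted_kappa(example), 2))
```

Cells on the diagonal carry zero weight, so a table with all counts on the diagonal gives κ = 1, and a table whose observed disagreement equals the chance-expected disagreement gives κ = 0.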
The 1992 to 2006 cohort comprised 838 patients having primary surgery for oral and oropharyngeal squamous cell carcinoma. Data were factor analyzed for 517 patients with UW-QOL data at least 9.5 months after primary surgery and closest available to 12 months (median, 15 months; interquartile range [IQR], 12-25 months). From the total cohort of 838 patients, 734 were alive at 9.5 months; thus, the 517 patients with data beyond 9.5 months represent 70.4% (517 of 734) of those alive at 9.5 months. Mean (SD) age at operation of the 517 patients was 61 (12) years, with men composing 62.7% of the sample (n = 324). Most patients (86.1%, n = 445) had oral cavity tumors; tumors were also located in the oropharynx (13.2%, n = 68) and maxilla (0.8%, n = 4). One-third (145 of 445) of the oral tumors were clinical category T3 and T4, 65.4% (n = 291) underwent free-flap surgery, and 32.4% (n = 144) received adjuvant radiotherapy.
Inspection of the scree plot indicated a 2-factor solution with eigenvalues above 1.0 and a very clear elbow in the plot, with the first 2 factors lying above a straight line joining factors 3 through 12. Each factor had 6 domains, and correlations among all domain scores were positive. We labeled these factors “physical” function and “social-emotional” function. Physical function involves the chewing, swallowing, speech, taste, saliva, and appearance domains; social-emotional function involves the anxiety, mood, pain, activity, recreation, and shoulder function domains. Factor 1 and 2 loadings for appearance were similar but were higher for factor 1, indicating that appearance straddles physical and social-emotional function, leaning slightly toward physical function.
A similar factor structure emerged from analysis of all other QOL data after 9.5 months from surgery (median, 52 months; IQR, 33-81 months). These data, which excluded the 517 patients first analyzed, were a mixed bag of 1361 questionnaires with multiple responses from patients and ranged up to 15 years after surgery. Analysis of QOL data for 346 patients after surgery but before 9.5 months, taking the closest available data after 3 months (median, 5 months; IQR, 3-6 months), again suggested 2 factors. The first factor was composed of swallowing, chewing, speech, taste, and saliva and the second of mood, anxiety, pain, and recreation, with appearance, shoulder function, and activity straddling the 2.
Physical function and social-emotional function scores are computed as a simple average of their respective component domain scores, with the requirement that at least 4 domain scores are available. The 2 subscale scores can be regarded as numerical for the purpose of presentation, skewed slightly toward lower (worse) scores. A boxplot is an appropriate graphic to use for showing differences between patient subgroups, as given in Table 1. Results may be summarized as median (IQR) or as mean (SD).
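The scoring rule can be sketched as follows. This is a minimal illustration under stated assumptions: the domain keys, the use of None for a missing response, and the example patient are all hypothetical, and the code is not an official UW-QOL scoring program.

```python
from statistics import mean
from typing import Mapping, Optional, Sequence

# Domain groupings from the 2-factor solution; the key names are illustrative.
PHYSICAL = ("chewing", "swallowing", "speech", "taste", "saliva", "appearance")
SOCIAL_EMOTIONAL = ("anxiety", "mood", "pain", "activity", "recreation", "shoulder")
MIN_DOMAINS = 4  # at least 4 of the 6 domain scores must be available


def subscale_score(responses: Mapping[str, Optional[float]],
                   domains: Sequence[str]) -> Optional[float]:
    """Simple average of the available domain scores (each 0-100).

    Returns None when fewer than MIN_DOMAINS domain scores are present.
    """
    available = [responses[d] for d in domains if responses.get(d) is not None]
    if len(available) < MIN_DOMAINS:
        return None
    return mean(available)


# Hypothetical patient with 3 social-emotional domains missing.
patient = {
    "chewing": 50, "swallowing": 70, "speech": 70,
    "taste": 100, "saliva": 30, "appearance": 75,
    "anxiety": 70, "mood": 75, "pain": None,
    "activity": 50, "recreation": None, "shoulder": None,
}
physical = subscale_score(patient, PHYSICAL)                  # average of 6 domains
social_emotional = subscale_score(patient, SOCIAL_EMOTIONAL)  # None: only 3 available
```

Returning None rather than a partial average keeps an under-answered questionnaire from contributing a score built on too few domains.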
Age and sex reference data for the UW-QOLv4 were collected from 372 patients attending 10 general dental practices.18 No obvious differences were noted in physical function and social-emotional function scores by age and sex. Overall median (IQR) normative scores were 100 (95-100) for physical function and 90 (74-100) for social-emotional function. Mean (SD) scores were 95 (10) for physical function and 83 (19) for social-emotional function.
Overall median (IQR) patient scores at approximately 1 to 2 years were 73 (59-90) for physical function and 80 (61-92) for social-emotional function. Mean (SD) scores were 71 (21) for physical function and 74 (20) for social-emotional function. In relation to normative data, the physical deficit was much more pronounced 1 to 2 years after surgery than was the social-emotional deficit.
The best set of physical and social-emotional scores was seen for patients with less advanced oral cancer tumors not requiring free-flap surgery or adjuvant radiotherapy (Table 1). This group had mean scores of 87 for physical function and 84 for social-emotional function, the latter being similar to normative levels. Physical function scores were lower for patients undergoing free-flap surgery with adjuvant radiotherapy irrespective of tumor category (P < .001 for T1-T2 and for T3-T4), but social-emotional function scores were more similarly distributed (P = .58 for T1 and T2 and P = .06 for T3 and T4).
Older patients reported better physical function and social-emotional function scores, but the associations were weak (Spearman r = 0.09 and r = 0.11, respectively; P = .03 for both). No significant differences were noted for sex. Significant differences were seen in both subscale scores for T category (T1 and T2 vs T3 and T4, P < .001), site (oral vs oropharyngeal, P < .01), free-flap surgery (yes vs no, P < .001), and adjuvant radiotherapy (yes vs no, P < .001). Reference data for patients 1 to 2 years after primary surgery are given in Table 1 according to different patient characteristics.
We previously reported12-14 deficits in UW-QOLv4 domain scores after treatment, with some degree of recovery seen to approximately 12 months but little change in mean scores thereafter. Similar trends can be seen for the physical and social-emotional function scores, more notably the deficits in physical functioning (Table 2). Although the preoperative data are limited, the subscale mean scores are close to the normative values. Table 2 provides trends across time and suggests that patients just about recover from any social-emotional deficits by 1 year after surgery, reaching levels not that dissimilar to the normative levels. In contrast, patients continue to have significant residual deficits in physical function into the long term. These observations, however, could mask the more subtle changes that can be observed from the full profile of results (Figure). For physical function, the direction of change from the time of surgery is the same for each component domain, and the subscale score reflects this (Figure, A). For social-emotional function, the immediate deficits in shoulder function, activity, and recreation are counterbalanced by improvements in pain and anxiety levels and, to a lesser extent, in mood (Figure, B). Beyond approximately 6 months, the direction of change in all component domains is the same, and the subscale score reflects this.
Data were available for 851 pairs of UW-QOL data in which patients responded in one annual survey and then in the next, with both responses being at least 12 months after treatment. Results indicate a reasonable level of within-patient agreement of the subscale scores despite there being an interval of 1 year between test and retest. The median difference (IQR) between one annual survey and the next was 0 (−5 to 5) for both subscale scores. The (quadratic) weighted κ coefficient of agreement statistic for physical function score quartiles (categories based on the first of paired years) was κ = 0.83, indicating a very good level of agreement. The intraclass correlation coefficient between the numerical patient physical function scores was 0.86. Agreement was less for social-emotional function score quartiles, with weighted κ = 0.77. The intraclass correlation coefficient between numerical patient social-emotional function scores was 0.81. The McNemar-Bowker tests (testing for evidence of asymmetry in the 4 × 4 quartiles agreement table) did not indicate systematic bias.
In the 517 patients with UW-QOL data at least 9.5 months after primary surgery and closest available to 12 months, both global QOL measures (health-related and overall QOL) were more strongly correlated with social-emotional function (Spearman r = 0.70 and r = 0.70, respectively) than with physical function (r = 0.47 and r = 0.42, respectively). This stronger association with social-emotional scores was also seen in analyses of all other QOL data late after surgery (median, 52 months) and early after surgery (median, 5 months). For the 517 patients, the Spearman correlation between the 2 global measures of QOL was r = 0.82, whereas the correlation between the 2 subscales was r = 0.63.
To use a subscale score as an outcome measure requires a definition of what is meant by a meaningful clinical change across time. One approach is effect size20 obtained by dividing the mean change in a group of patients by the standard deviation in the prechange data. A small effect size represents approximately 0.20 of a standard deviation, a moderate effect size approximately 0.50, and a large effect size approximately 0.80. Results at 1 to 2 years gave subscale SDs of 20 (Table 2). Thus, for patients recruited at this time, a mean change across time of 4 U in the subscale score would be a small change, 10 U a moderate change, and 16 U a large change. Results before treatment gave subscale SDs of 15. Therefore, for patients recruited before treatment, a mean change across time of 3 U would be a small change, 7.5 U a moderate change, and 12 U a large change.
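The conversion between effect size and score units is simple arithmetic and can be checked with a short sketch. The function names are hypothetical; the thresholds (0.20, 0.50, 0.80) and the SDs (20 and 15) are those given in the text.

```python
# Conventional effect-size thresholds, in units of the prechange SD.
THRESHOLDS = {"small": 0.20, "moderate": 0.50, "large": 0.80}


def effect_size(mean_change: float, prechange_sd: float) -> float:
    """Mean change in a group divided by the SD of the prechange data."""
    return mean_change / prechange_sd


def change_thresholds(prechange_sd: float) -> dict:
    """Mean change (in subscale units) matching each effect-size threshold."""
    return {label: d * prechange_sd for label, d in THRESHOLDS.items()}


at_1_to_2_years = change_thresholds(20)   # small 4 U, moderate 10 U, large 16 U
before_treatment = change_thresholds(15)  # small 3 U, moderate 7.5 U, large 12 U
```

The same mean change therefore counts as a larger effect in a pretreatment sample (SD 15) than in a sample recruited at 1 to 2 years (SD 20).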
Standard statistical methods for computing sample size for a comparative study (with an α of .05 and study power of 80%) indicate that approximately 400 subjects per group are needed in analysis to detect small treatment effects, 64 per group to detect moderate effects, and 26 per group to detect large effects.20 Adjustments in numbers recruited are needed to allow for loss to follow-up, including death, so as to attain these numbers for analysis.
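The order of magnitude of these figures can be reproduced with the usual normal-approximation formula for comparing 2 means, n = 2(z₁₋α/₂ + z₁₋β)² / d² per group, where d is the standardized effect size. This is a rough sketch, not the method behind the quoted values: the approximation gives numbers at or slightly below them (393, 63, and 25 per group), because tabulated values are based on the t distribution. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for a 2-sided, 2-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    rounded up to the next whole patient.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


for label, d in (("small", 0.20), ("moderate", 0.50), ("large", 0.80)):
    print(label, n_per_group(d))
```

Because n scales with 1/d², halving the detectable effect roughly quadruples the required sample, which is why small effects need several hundred patients per group.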
These data suggest that 2 subscales are more appropriate than a single composite 12 score. Some researchers are accustomed to using the composite score, so some summary statistics are given herein. Because the 2 subscale scores are each an average of 6 domain scores, the single composite 12 score is, in effect, the mean of the 2 subscale scores. We commented on the trend for patients to have a more significant physical deficit after treatment and persisting into the longer term, whereas the social-emotional scores recover to just about baseline values. The composite 12 score would fall between these 2 trends, showing a deficit and some recovery but not to baseline values. Median (IQR) normative composite scores were 94 (85-98), with mean (SD) of 89 (13). In the 517 patients with UW-QOL data at least 9.5 months after primary surgery and closest available to 12 months, the Spearman correlation of the composite score with the physical function subscale was r = 0.90, with the social-emotional function subscale was r = 0.89, with global health-related QOL was r = 0.62, and with global overall QOL was r = 0.59. In these patients, the median (IQR) composite score was 76 (60-88), and the mean (SD) value was 73 (19).
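The relationship between the composite 12 score and the 2 subscales can be written out explicitly. The notation is introduced here for illustration only: d_1 to d_6 denote the physical function domain scores, d_7 to d_12 the social-emotional function domain scores, and P, S, and C_12 the physical, social-emotional, and composite scores. The identity is exact only when all 12 domain scores are available.

```latex
P = \frac{1}{6}\sum_{i=1}^{6} d_i, \qquad
S = \frac{1}{6}\sum_{i=7}^{12} d_i, \qquad
C_{12} = \frac{1}{12}\sum_{i=1}^{12} d_i
       = \frac{1}{2}\left(\frac{1}{6}\sum_{i=1}^{6} d_i + \frac{1}{6}\sum_{i=7}^{12} d_i\right)
       = \frac{P + S}{2}
```

The reported means agree with this: the normative subscale means of 95 and 83 average to 89, and the patient subscale means of 71 and 74 average to 72.5, in line with the reported composite mean of 73.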
This article presented data on the UW-QOLv4 for factor analyses, presentation and scoring of the resulting subscales, their comparison with normative data, patient characteristics and variations in subscale scores, temporal variation, association with other UW-QOL global measures, interpretability, clinical effect, and sample size. This article introduced 2 new UW-QOL subscales, which we called physical function and social-emotional function to reflect their component items. Each subscale score is computed as the simple average of its 6 components and is scaled from 0 to 100. The ultimate test for these scales is whether they work well in practice to discriminate between groups and to respond to changes in clinical condition across time. The results so far suggest that they do.
The data set was of sufficient size to perform both factor analyses and to make comparisons by patient characteristics, across time from treatment, and for normative reference data. It is recognized that there are limited pretreatment data (one-third of patients only), and the analysis was limited to primary surgery cases in one particular group of patients with head and neck cancer. However, in exploring the psychometric properties of the UW-QOLv4, the findings from this cohort should be applicable to other patient groups, and these limitations do not reduce the bearing of the findings.
Disease-specific instruments have several advantages. They reduce patient questionnaire burden and increase acceptability by including only relevant dimensions. This may increase responsiveness. Whether dimensions should be combined in a summary score depends on whether such a score is useful and interpretable and whether there are any interactions between treatment and dimensions (ie, cancelling out of effects).21 Because the UW-QOLv4 domains and global scales are categorical, with only a few discrete scoring options and a skewed response, the new physical function and social-emotional function scales, with their many possible scores, should increase responsiveness and precision. The 2 subscale scores have face and content validity. It was remarkable that the shoulder domain hardly loaded to the physical function factor. The fact that the shoulder domain is in the social-emotional function subscale likely reflects the wording of the domain, which refers to work and hobbies. Rather than physical shoulder function, the question seems to pertain more to the social consequences of shoulder problems. The 2 subscales of the UW-QOLv4 have a construct similar to that of other questionnaires, such as the Medical Outcomes Study 36-Item Short-Form Health Survey, which has physical and mental component scores. The 2 constructs are different, and it is recognized that patients can go back to baseline emotional functioning despite physical limitations.
The 2 subscales each have 6 domains, and this gives better discrimination than any single domain. It is accepted that we can never be sure that using equal weights for scaling the subscale scores (ie, the simple average) gives the best estimate of physical function or social-emotional function. However, there will be further validation as colleagues report outcomes using the UW-QOLv4 as to how well equal-weight scaling works in practice.
The present data support the stability of the subscale scores from approximately 1 year after treatment and onward, their validity, and their responsiveness (sensitivity to clinically significant changes across time). The κ and intraclass correlation values between repeated applications 12 months apart, in the absence of clinical treatment or changes in disease status, are remarkably high, suggesting high test-retest reliability when conditions are stable beyond 1 year. The median difference in score of 0, the tight IQR of within ±5 U for differences, and the lack of systematic error in repeated applications also support the measure. Aspects of validity are explored by contrasting normative with patient data, within patients across time, and between subgroups of patients with differing clinical characteristics.
One of the most important areas for further development is in making quantitative change scores for QOL more clinically meaningful. Ringash et al22 defined minimal important difference (MID) as the smallest difference that reflects a clinically important change in score and noted that having some knowledge of MID is important for determining sample size when planning trials with QOL end points. Their broad conclusion was that most published MID estimates fall in the range of 5% to 10% of the instrument range, and the present results are consistent with it. In the present study, for patients recruited before treatment, a mean change of 7.5 U on the 0 to 100 scale constituted a moderate effect (50% of SD), whereas for patients recruited approximately 1 year after treatment, a mean change of 10 U constituted a moderate effect.
However imprecise these percentages might seem, sample size calculations are important because they indicate the right order of magnitude of size of the study to be recruited. In clinical research, most advances are in small steps, and many effects, if real, are likely to be small to medium and are unlikely to be large. This indicates that 160 (80 per group) should be regarded as the minimum requirement for recruitment to a 2-armed randomized controlled trial to be able to detect moderately sized differences and allowing for 20% patient attrition.
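The recruitment figure follows from inflating the number needed in analysis by the expected attrition. As a worked example, using the 64 patients per group needed to detect a moderate effect and the 20% attrition stated above (the symbols n_analysis, n_recruit, and a are introduced here for illustration):

```latex
n_{\text{recruit}} = \frac{n_{\text{analysis}}}{1 - a}
                   = \frac{64}{1 - 0.20}
                   = 80 \text{ per group}, \qquad
2 \times 80 = 160 \text{ patients in total}
```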
In conclusion, the UW-QOLv4 is brief and simple to complete. It imposes minimal patient burden. Despite its brevity, the questionnaire does have psychometric validity. The identification of 2 subscales, physical function and social-emotional function, potentially increases its responsiveness and precision. The subscales also allow a realistic estimate of sample size for clinical trials. They are to be preferred to the single aggregate composite 12 score. Questionnaire analyses and reporting should include both the physical function and social-emotional function subscales.
Correspondence: Simon N. Rogers, FDS, RCS, FRCS, MD, Regional Maxillofacial Unit, Aintree University Hospital’s National Health Service Foundation Trust, Aintree, Liverpool L9 7AL, England (email@example.com).
Submitted for Publication: January 11, 2009; final revision received May 4, 2009; accepted May 14, 2009.
Author Contributions: All authors had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: Rogers, Lowe, and Yueh. Acquisition of data: Rogers. Analysis and interpretation of data: Rogers, Lowe, Yueh, and Weymuller. Drafting of the manuscript: Rogers, Lowe, and Yueh. Critical revision of the manuscript for important intellectual content: Rogers, Yueh, and Weymuller. Statistical analysis: Lowe and Yueh. Obtained funding: Rogers. Study supervision: Rogers.
Financial Disclosure: None reported.