Lip and lip line posttreatment data from the clinical trial sample are from the day 14 assessment.
eTable 1. Content of the FACE-Q Scales and Checklists
eTable 2. RMT Statistical Indicators of Fit
Klassen AF, Cano SJ, Schwitzer JA, et al. Development and Psychometric Validation of the FACE-Q Skin, Lips, and Facial Rhytids Appearance Scales and Adverse Effects Checklists for Cosmetic Procedures. JAMA Dermatol. 2016;152(4):443–451. doi:10.1001/jamadermatol.2016.0018
Patient-reported outcomes data are needed to determine the efficacy of cosmetic procedures.
To describe the development and psychometric evaluation of 8 appearance scales and 2 adverse effect checklists for use in minimally invasive cosmetic procedures.
Design, Setting, and Participants
We performed a psychometric study to select the most clinically sensitive items for inclusion in item-reduced scales and to examine reliability and validity with patients. Recruitment of the sample for this study took place from June 6, 2010, through July 28, 2014. Data analysis was performed from December 11, 2014, to December 22, 2015. Pretreatment and posttreatment patients 18 years and older who were consulting for any type of facial aesthetic treatment were studied. Patients were from plastic surgery and dermatology outpatient clinics in the United States and Canada (field-test sample) and a clinical trial of a minimally invasive lip treatment in the United Kingdom and France (clinical trial sample).
Main Outcomes and Measures
The FACE-Q scales that measure appearance of the skin, lips, and facial rhytids (ie, overall, forehead, glabella, lateral periorbital area, lips, and marionette lines), with scores ranging from 0 (lowest) to 100 (highest), and the FACE-Q adverse effects checklists for problems after skin and lip treatment.
Of 783 patients recruited, 503 field-test patients (response rate, 90%) and 280 clinical trial participants were studied. The mean (SD) age of the patients was 47.4 (14.0) years in the field-test sample and 47.7 (12.3) years in the clinical trial sample. Most of the patients were female (429 [85.3%] in the field-test sample and 274 [97.9%] in the clinical trial sample). Rasch Measurement Theory analyses led to the refinement of 8 appearance scales with 66 total items. All FACE-Q scale items had ordered thresholds and acceptable item fit. Reliability, measured with the Person Separation Index (range, 0.88-0.95) and Cronbach α (range, 0.93-0.98), was high. Lower scores for appearance scales that measured the skin (r = −0.48, P < .001), lips (r = −0.21, P = .001), and lip rhytids (r = −0.32, P < .001) correlated with the reporting of more skin- and lip-related adverse effects. Higher scores for the 8 appearance scales correlated (range, 0.70-0.28; P < .001) with higher scores on the core 10-item FACE-Q satisfaction with facial appearance scale. In the pretreatment group, older age was significantly correlated with lower scores on 5 of the 6 rhytids scales (exception was forehead rhytids) (range, −0.28 to −0.65; P = .03 to <.001). Pretreatment patients reported significantly lower scores on 7 of the 8 appearance scales compared with posttreatment patients (exception was skin) (P < .001 to .005 on independent sample t tests).
Conclusions and Relevance
The FACE-Q appearance scales and adverse effects checklists can be used in clinical practice, research, and quality improvement to incorporate cosmetic patients’ perspective in outcome assessments.
In 2014, a total of 13.9 million minimally invasive cosmetic procedures were performed in the United States, representing an increase of 3% from the year before.1 To include the patient voice in the assessment of treatment outcomes in the cosmetics industry, patient-reported outcome (PRO) instruments are needed.2 A review3 of PRO instruments in 96 736 registered clinical trials between 2007 and 2013 found that 27% used 1 or more, with 17% as a primary or secondary end point. The choice of which PRO instrument to use in a study is a crucial decision. If the wrong instrument is used, it may appear that a new aesthetic product or intervention has little to no benefit.
Engaging patients in the identification of issues that matter to them and using their stories to develop PRO instruments can help to ensure content validity.4-6 Unfortunately, few such instruments are available for cosmetic treatments. A literature review7 to identify PRO instruments for cosmetic procedures found 9, of which 3 met international recommendations for how such tools should be developed and validated (ie, BREAST-Q,8,9 FACE-Q,10 and Skindex11). The review concluded that research dedicated to the evaluation of PRO instruments in cosmetic surgery is urgently required.
The FACE-Q10,12-16 is a PRO instrument that includes more than 40 scales and checklists designed to measure appearance, adverse effects, health-related quality of life, and experience of health care. These domains form the basis of the FACE-Q conceptual framework. Each domain contains multiple scales and checklists. Because of the large number of scales, validation results are being published as a series of articles, each of which describes clinically relevant groupings. The aim of this article is to describe the set of the FACE-Q scales and checklists that can be used to evaluate minimally invasive cosmetic procedures. Specifically, we describe our psychometric findings for 8 appearance scales designed to evaluate skin, lips, and facial rhytids (overall, forehead, glabella, lateral periorbital area, lips, and marionette lines). We also describe 2 checklists designed to measure adverse effects for skin and lip treatment.
Question: Do the FACE-Q scales provide a means to measure appearance of the skin, lips, and facial rhytids (ie, overall, forehead, glabella, lateral periorbital area, lips, and marionette lines)?
Findings: In this study of 783 participants, psychometric analysis supported the reliability and validity of the FACE-Q scales. Adverse effects after specific cosmetic treatments were also identified.
Meaning: The FACE-Q can be used to involve patients in the assessment of treatment outcomes in the cosmetics industry.
Before study commencement, research ethics approval was obtained at The New School in New York City, New York, and University of British Columbia in Vancouver, British Columbia, Canada. Completion of the FACE-Q questionnaire implied consent.
The FACE-Q was developed by following the US Food and Drug Administration guidance to industry2,17 and other guidance documents.18-20 We describe our methods elsewhere.10,13-16 Briefly, a systematic review,21 qualitative interviews with 50 facial aesthetics patients, and input from 26 experts were used to develop the FACE-Q conceptual framework and scales and checklists. The content of each scale was then refined through cognitive interviews with 35 patients. We developed 4 response options in keeping with best practice.22 Instructions ask respondents to answer in relation to the past week.
The scales for skin and lips measure satisfaction with appearance. The 6 scales that measure appearance of rhytids (overall, forehead, glabella, lateral periorbital area, lips, and marionette lines) and the adverse effects checklists (skin and lips) evaluate how bothered someone is by these concepts. eTable 1 in the Supplement lists the content and response options for the scales and checklists.
For validation purposes, we included 3 additional FACE-Q scales: 10-item satisfaction with facial appearance scale, 10-item psychological function scale, and 8-item social function scale. These scales previously demonstrated reliability, validity, and the ability to detect change.8,15 Participants were also asked questions so the sample could be characterized by age, sex, and ethnicity.
To be included in the study, patients had to be 18 years or older with a pretreatment or posttreatment status for 1 or more of any type of surgical or nonsurgical facial aesthetic treatment. For minimally invasive treatments, returning patients who were asked to participate and who had received botulinum toxin treatment more than 4 months earlier or soft-tissue fillers more than 9 months earlier were considered pretreatment participants in our study sample. Participants were recruited from 4 dermatology and 11 plastic surgery offices in the United States and Canada from June 6, 2010, through July 28, 2014. Data analysis was performed from December 11, 2014, to December 22, 2015. For 11 clinics, staff provided a questionnaire booklet to complete in the waiting room at check-in. The remaining clinics invited patients to participate via a postal survey that included a personalized letter from the relevant health care professional alongside a questionnaire booklet with up to 3 mailed reminders. Potential participants were provided a $5 coffee card in appreciation of their participation. Completion of the FACE-Q questionnaire implied consent.
An international, randomized, 2-arm, active-controlled study23 recruited patients 18 years and older for a volume enhancement lip treatment (clinical trial sample). Participants were recruited from 12 sites in the United Kingdom and France. The treatment injection volume was based on clinical experience and lip treatment goals. Vermilion body and border were the primary treatment sites; additional perioral sites could also be treated. This study was approved by the National Research Ethics Service. All participants provided written informed consent. The data were deidentified. More details about the study sample and methods are published elsewhere.23
The scales that measured lips and satisfaction with facial appearance were administered on days 0, 30, and 90. The scales that measured lip rhytids and psychological and social function were administered on days 0, 14, 30, and 90. The adverse effects checklist for lips was administered on days 14 and 30. These scales were translated into French by MAPI Research Trust, following their linguistic validation method, which includes 2 separate forward translations by 2 qualified translators, a reconciliation process, and 1 backward translation by a qualified translator.24
For the adverse effects checklists, the proportion of responses for each response option was computed. For the appearance scales, Rasch Measurement Theory (RMT)25,26 analysis was conducted within RUMM2030 statistical software.27 Rasch Measurement Theory examines the difference between observed and predicted item responses to determine whether data from a sample fit the Rasch model.28 The results from a range of statistical and graphical tests were examined, with the evidence considered together to make a decision about each scale’s overall quality.28-30 We performed the following:
Threshold for item response options: We examined the ordering of thresholds, which are the points of crossover between adjacent response categories (eg, between somewhat satisfied and very satisfied) to determine whether successive integer scores increased for the construct measured.
Item fit statistics: For each scale, we examined 3 indicators of fit to determine whether the scale’s items worked together to map out a clinically important construct: (1) log residuals (item-person interaction), (2) χ2 values (item-trait interaction), and (3) item characteristic curves. Fit residuals should fall between −2.5 and +2.5. The χ2 value for each item should be nonsignificant after Bonferroni adjustment.
Dependency: Residual correlations among items in a scale can artificially inflate reliability. We examined residual correlations among items, which should be below 0.30.26
Stability: Differential item functioning (DIF) measures the degree to which item performance remains stable across subgroups. A χ2 value significant after Bonferroni adjustment can indicate an item with potential DIF. We examined DIF by age, sex, and country.
Targeting: Targeting can be examined by inspecting the spread of person (range of the construct reported by the sample) and item (range of the construct measured by the items) locations. Items in a scale should be evenly spread across a reasonable range that matches the range of the construct experienced by the sample.
Person separation index (PSI): We examined reliability using the PSI, a statistic that is comparable to the Cronbach α.31 The PSI measures error associated with the measurement of people in a sample. Higher values indicate greater reliability.
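The threshold check described above can be illustrated with a minimal Python sketch. It assumes the standard polytomous Rasch parameterization, in which the probability of responding in category x depends on the cumulative sum of (person location minus threshold location); the source does not state the exact parameterization used in RUMM2030, and the function names and threshold values below are hypothetical.

```python
import math

def category_probabilities(theta, thresholds):
    """Probability of each response category 0..m for one item.

    theta: person location in logits.
    thresholds: threshold locations tau_1..tau_m in logits (assumed values).
    """
    # Cumulative sums of (theta - tau_k); category 0 has an empty sum of 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(exponents)
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def thresholds_are_ordered(thresholds):
    """True when each threshold lies strictly above the previous one."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))

# Hypothetical thresholds for an item with 4 response options (3 thresholds).
example_thresholds = [-1.5, 0.0, 1.5]
print(category_probabilities(0.0, example_thresholds))
print(thresholds_are_ordered(example_thresholds))
```

At a person location equal to a threshold, the two adjacent categories are equally likely, which is the point of crossover referred to in the threshold description.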
We also computed a Cronbach α for each scale, which provides a measure of how closely related a set of items are as a group.31 Rasch logit scores for each participant were transformed into scores from 0 (worst) to 100 (best). The scoring algorithm is available from the authors. Pearson correlations (to examine associations among scores) and 2-tailed independent sample t tests (to test for differences between means) were used to test the following hypotheses:
Higher scores on the appearance scales would correlate with higher scores for satisfaction with facial appearance, psychological function, and social function.
Lower scores on the skin scale would correlate with more adverse effects for skin. Similarly, lower scores on the lips and lip rhytids scales would correlate with more adverse effects for lips on the day 14 assessment.
Before treatment, older participants would report lower scores on the 6 rhytids scales compared with younger participants.
Pretreatment participants would report lower scores on all 8 scales compared with posttreatment patients.
P < .05 was considered statistically significant.
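As an illustration of the classical statistics named above, the following Python sketch computes a Cronbach α, a linear rescaling of logits to a 0 to 100 range, and a Pearson correlation. The rescaling shown is a generic linear map and is an assumption; the FACE-Q scoring algorithm itself is available only from the authors and is not reproduced here. All function names and data are hypothetical.

```python
import math
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding one score per respondent."""
    k = len(item_scores)
    item_variances = [statistics.variance(item) for item in item_scores]
    # Total score per respondent is the sum across items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def rescale_0_100(logit, logit_min, logit_max):
    """Generic linear map from a logit range to 0 (worst) through 100 (best)."""
    return 100 * (logit - logit_min) / (logit_max - logit_min)

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (spread_x * spread_y)

# Hypothetical responses: 3 items answered by 5 respondents on a 1 to 4 scale.
items = [[1, 2, 3, 4, 4], [1, 2, 3, 3, 4], [2, 2, 3, 4, 4]]
print(cronbach_alpha(items))
print(rescale_0_100(0.0, -5.0, 5.0))
```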
A total of 503 of 558 patients invited to participate completed a FACE-Q booklet that contained 1 or more of the scales described in this study (response rate, 90%). In addition, 280 individuals participated in the lip enhancement clinical trial, for a total of 783 participants. Table 1 gives the sample characteristics. When we compared the field-test sample with the clinical trial sample, mean age did not differ (P = .77 on 2-tailed independent sample t test), but sex did (P < .001 on the χ2 test). Specifically, the clinical trial sample had fewer than expected men (9.1% vs 2.1%).
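The sample arithmetic can be checked with a few lines of Python. The counts are those reported in this article (the female counts come from the abstract); the variable names are illustrative only.

```python
field_test_completed = 503
field_test_invited = 558
clinical_trial_participants = 280

# Response rate in the field-test sample.
response_rate = field_test_completed / field_test_invited

# Combined sample size across the two samples.
total_participants = field_test_completed + clinical_trial_participants

# Proportion of female patients in each sample (counts from the abstract).
female_field_test = 429 / field_test_completed
female_clinical_trial = 274 / clinical_trial_participants

print(f"Response rate: {response_rate:.0%}")
print(f"Total participants: {total_participants}")
print(f"Female, field test: {female_field_test:.1%}")
print(f"Female, clinical trial: {female_clinical_trial:.1%}")
```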
The checklist that measured adverse effects of the skin was completed by 74 participants a mean (SD) of 2.4 (3.6) months after skin treatment (range, immediate to 12 months). The top 3 items endorsed included redness, uneven skin tone, and skin sensitivity (Table 2). On day 14 in the lip sample, the most common adverse effects were lips that did not feel smooth, look symmetric, or look normal.
The RMT analysis supported the reliability and validity of the appearance scales. All 66 items had ordered thresholds, providing evidence that each scale’s response options worked as a continuum that increased for the construct measured. Fit residuals were within the −2.5 to +2.5 recommended range for 50 of the 66 items (eTable 2 in the Supplement), and all 66 items were nonsignificant in terms of the adjusted χ2 P values, providing evidence that the items fit the expectations of the Rasch model for each scale. The 16 items with fit outside the recommended range were retained because of their clinical importance. Residual correlations were above 0.30 (range, 0.35-0.59) for 6 pairs of items within 5 scales. Subtests performed on the pairs of items revealed marginal effect on scale reliability (0 to 0.01 difference in PSI value). For the scale that measured satisfaction with lips, DIF was detected for age and/or country on 5 items. When these items were split on the variable with DIF and the new person locations for the scale were correlated with the original person locations, the DIF had a negligible effect (Pearson correlations were 0.99).
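The item-level criteria applied here (fit residuals within −2.5 to +2.5, χ2 P values nonsignificant after Bonferroni adjustment, and residual correlations below 0.30) can be sketched as simple screening rules. This is an illustrative Python sketch, not the RUMM2030 procedure: it assumes the Bonferroni adjustment divides the significance level by the number of items in the scale, treats the residual range as inclusive, and uses hypothetical item names and values.

```python
def check_item_fit(fit_residuals, chi2_p_values, alpha=0.05, low=-2.5, high=2.5):
    """Screen each item against the fit residual range and adjusted chi-square P value."""
    # Assumed Bonferroni adjustment: alpha divided by the number of items.
    bonferroni_alpha = alpha / len(fit_residuals)
    report = {}
    for item, residual in fit_residuals.items():
        report[item] = {
            "residual_in_range": low <= residual <= high,
            "chi2_nonsignificant": chi2_p_values[item] >= bonferroni_alpha,
        }
    return report

def dependent_pairs(residual_correlations, cutoff=0.30):
    """Return item pairs whose residual correlation is at or above the cutoff."""
    return [pair for pair, r in residual_correlations.items() if r >= cutoff]

# Hypothetical values for a 3-item scale.
residuals = {"item_1": 0.4, "item_2": 3.1, "item_3": -1.2}
p_values = {"item_1": 0.40, "item_2": 0.03, "item_3": 0.01}
correlations = {("item_1", "item_2"): 0.35, ("item_1", "item_3"): 0.12}

print(check_item_fit(residuals, p_values))
print(dependent_pairs(correlations))
```

A flagged item is a prompt for review rather than automatic removal; in this study, items with fit residuals outside the range were retained for their clinical importance.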
Figure 1 shows the person-item threshold distribution for the scale that measured facial rhytids overall as an example of targeting. The x-axis represents the construct (facial rhytids appearance), with higher scores (less bothered) increasing to the right. The y-axis represents the frequency of person measure locations (top histogram) and item locations (bottom histogram). The sample was divided into 4 groups based on their answer (not at all, a little, moderately, or extremely) to a stand-alone item that asked how much participants were bothered by, “How the lines on your face look overall?” and into pretreatment and posttreatment groups. These examples provide evidence that most of the sample lay inside the range in which the scale provided measurement.
The P values for fit to the Rasch model were not significant for 7 of the 8 scales, which indicates that the data satisfied the requirements of the Rasch model. The P value for the scale that measured lip rhytids was significant (P = .02). The 8 scales evidenced high reliability. The PSI and Cronbach α values were as follows: skin, 0.93 and 0.93; lips, 0.95 and 0.97; rhytids overall, 0.93 and 0.95; forehead rhytids, 0.88 and 0.95; glabella rhytids, 0.91 and 0.96; lateral periorbital area rhytids, 0.92 and 0.96; lip rhytids, 0.93 and 0.97; and marionette lines, 0.92 and 0.98, respectively.
Pearson correlations between the 8 scales and satisfaction with facial appearance scores were significant (P < .001) and ranged from 0.70 (skin) to 0.28 (glabella rhytids). Correlations between the 8 scales and psychological function were significant (P = .03 to <.001) for 7 of the 8 scales (exception was glabella rhytids) and ranged from 0.51 (lateral periorbital area rhytids) to 0.32 (rhytids overall). Correlations between the 8 scales and social function were significant for 3 scales, including lateral periorbital area rhytids (r = 0.40, P < .002), lips (r = 0.35, P < .001), and lip rhytids (r = 0.28, P < .001).
More skin-related adverse effects correlated with lower scores on the skin scale (r = −0.48, P < .001). More lip-related adverse effects correlated with lower scores on the lip (r = −0.21, P = .001) and lip rhytids (r = −0.32, P < .001) scales.
In the pretreatment group, correlations between older age and lower scores for the rhytids scales were significant for 5 of the 6 scales (exception was forehead rhytids): rhytids overall (r = −0.41, P < .001), glabella rhytids (r = −0.28, P = .03), lateral periorbital area rhytids (r = −0.35, P = .001), lip rhytids (r = −0.52, P < .001), and marionette lines (r = −0.65, P < .001). In the posttreatment group, age was not significantly correlated with scores from 5 of the 6 rhytids scales (exception was lip rhytids: r = −0.32, P < .001).
Figure 2 shows the mean scores for the 8 appearance scales for pretreatment and posttreatment data. Pretreatment patients reported significantly lower scores on 7 of the 8 scales (exception was the skin scale) compared with posttreatment patients (P < .001 to .005 on 2-tailed independent sample t tests).
Increasing acceptance of facial cosmetic treatments has led to an industry that continues to expand. Research is urgently needed to ensure that new treatments are safe and effective. The FACE-Q is a rigorously developed PRO instrument that can be used by academics and other health care professionals to collect evidence-based outcome data from facial aesthetics patients.
To date, the FACE-Q is the only PRO instrument that includes scales that measure facial appearance. Some FACE-Q appearance scales ask about satisfaction with appearance, and other scales, for negative concepts such as facial rhytids, ask about being bothered by appearance. Other PRO instruments used in facial aesthetics research measure appearance-related psychosocial distress rather than appearance per se. For example, the rigorously developed 61-item Skindex14 measures negative affect, self-esteem, anxiety, physical discomfort, physical limitations, self-consciousness, and intimacy. A PRO instrument that measures psychosocial issues would not be the best choice for measuring change in appearance.
The psychometric analyses in this article provided evidence of the reliability and validity of the FACE-Q scales. In addition, and fundamentally, our use of RMT methods to develop the FACE-Q has certain advantages. The RMT methods differ from traditional psychometric methods (based on classic test theory) because their focus is on the association between a person’s measurement and the probability of responding to an item, rather than the association between a person’s measurement and the observed scale total score.28 Advantages of using RMT to develop PRO instruments include the following: (1) RMT provides measurements of people that are independent of the sampling distribution of the items used and locates items in a scale independent of the sampling distribution of the people in whom they are developed, (2) RMT improves the potential to diagnose item-level psychometric issues, and (3) RMT allows for a more accurate picture of individual person measurements.28 These assets, together with the extensive qualitative work performed to create the FACE-Q, are what set the FACE-Q apart from other PRO instruments in the same clinical area.
This study has previously described limitations.10,13-16 First, the sample was heterogeneous (eg, varied by age, sex, and timing of assessment), which limits the outcome findings we can report. Second, our sample and that of the clinical trial had many more women than men, which reflects the makeup of patients with cosmetic issues. Third, there could have been bias introduced at the clinic level by office staff who recruited their patients for us. Fourth, few field-test participants completed the FACE-Q before and after treatment. Responsiveness research is needed to document the benefits of treatment for specific facial treatments.
Evidence-based information about patient outcomes for facial aesthetic treatments is needed. The FACE-Q provides the research community and physicians with a PRO instrument they can use to include patients in the assessment of outcomes.
Accepted for Publication: December 23, 2015.
Corresponding Author: Anne F. Klassen, DPhil, McMaster University, 1280 Main St W, Hamilton, ON L8S 4K1, Canada (email@example.com).
Published Online: March 2, 2016. doi:10.1001/jamadermatol.2016.0018.
Author Contributions: Drs Pusic and Klassen had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Klassen, Cano, Pusic.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: Klassen, Cano, Pusic.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Klassen, Cano, Schwitzer.
Obtained funding: Klassen, Cano, Pusic.
Administrative, technical, or material support: Klassen, Schwitzer, Baker, A. Carruthers, J. Carruthers, Chapas.
Study supervision: Schwitzer, Baker, Pusic.
Conflict of Interest Disclosures: The FACE-Q is owned by Memorial Sloan-Kettering Cancer Center. Drs Cano, Klassen, and Pusic reported being codevelopers of the FACE-Q and, as such, receive a share of any license revenues as royalties based on Memorial Sloan-Kettering Cancer Center’s inventor sharing policy. Dr Cano reported being a cofounder of Modus Outcomes, an outcomes research and consulting firm that provides services to pharmaceutical, medical device, and biotechnology companies. Drs A. Carruthers and J. Carruthers reported being consultants and investigators for Allergan, Merz, Kythera, and Alphaeon. No other disclosures were reported.
Funding/Support: This study was supported by a grant from the Plastic Surgery Foundation (Dr Pusic).
Role of the Funder/Sponsor: The funding source had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and the decision to submit the manuscript for publication.
Previous Presentation: Preliminary results of this study were reported at the American Society of Plastic Surgeons Annual Meeting; October 12, 2014; Chicago, Illinois.
Additional Contributions: The following physicians recruited their patients into the FACE-Q field-test sample: D. Berson, MD, James C. Grotting, MD, J. M. Kenkel, MD, F. Nahai, MD, Rod J. Rohrich, MD, A. Rossi, MD, Jonathan M. Sykes, MD, Nancy Van Laeken, MD, L. Young, MD, and J. Rivers, MD. Diane Murphy, MPH, at Allergan Medical provided the FACE-Q data from the clinical trial.