Bull C, Teede H, Watson D, Callander EJ. Selecting and Implementing Patient-Reported Outcome and Experience Measures to Assess Health System Performance. JAMA Health Forum. 2022;3(4):e220326. doi:10.1001/jamahealthforum.2022.0326
Psychometrically robust patient-reported outcome measures (PROMs) and patient-reported experience measures (PREMs) are critical to evaluating quality and performance across health services and systems. However, the adoption and implementation of PROMs and PREMs remain a challenge in many countries. The aim of this guide is to support instrument selection and implementation to measure health system performance.
The guide is split into 3 step-by-step sections. Step 1: Knowing What to Measure discusses what PROMs and PREMs capture and how they differ from related instruments. Step 2: Choosing the Right Instrument describes the critical psychometric properties of validity, reliability, and responsiveness, and provides resources to support instrument selection and evaluation. Step 3: Mitigating Potential PROM and PREM Implementation Barriers outlines key barriers and supports for instrument implementation at system, service, and individual levels.
Conclusions and Relevance
This guide aims to provide practical resources for the identification of psychometrically robust PROMs and PREMs, as well as support for their implementation to drive improvements across health systems globally.
The systematic and routine use of rigorously developed patient-reported outcome measures (PROMs) and patient-reported experience measures (PREMs) is heralded as critical to evaluating quality and performance across health services and systems.1-5 Use of PROMs and PREMs captures a multitude of health, well-being, and experiential dimensions,6,7 supporting the provision of person-centered, value-based health care.8-12 However, the adoption and implementation of PROMs and PREMs remain a challenge in many countries. Furthermore, while there have been substantial advancements in the development and psychometric evaluation of PROMs and PREMs in recent decades,13,14 using patient data to drive quality improvement and system changes is still variable and ambiguous.5
The aims of this guide are 2-fold: first, to provide practical support for the identification of appropriate and psychometrically sound PROMs and PREMs, and second, to support PROM and PREM implementation efforts by highlighting potential barriers and how they may be mitigated. This guide follows a step-by-step approach (Figure).
Patient-reported outcome measures are standardized, validated instruments that are completed by patients to measure their health and well-being.7,15 They can capture a wide range of outcomes, including physical functioning, social functioning, psychological well-being, symptom severity, disability, and impairment.6,7,16 Use of PROMs is common to assess changes in outcomes after the implementation of a new intervention (eg, therapy, service, policy), allowing inferences to be made about effectiveness and safety.6,17 They are also used to monitor health conditions of individuals and cohorts over time, as well as the effects of treatments longitudinally.4 They can be used in isolation or linked with other sources of information about individuals and cohorts.
Patient-reported outcome measures can be either generic or condition specific. Generic PROMs are designed to apply to a broad range of patients, conditions, and treatments, capturing aspects of health and well-being that are applicable to everyone. Examples include the 12-item Short Form Health Survey (SF-12), 36-item Short Form Health Survey (SF-36), and the PROMIS (Patient-Reported Outcome Measurement Information System) adult global health measure. Alternatively, condition-specific PROMs apply to specific groups of patients relative to their health condition or the procedure they are undergoing (eg, Oxford Knee Score, Oxford Hip Score).
Despite being referred to and used synonymously in literature and research,18-21 PROMs differ from quality of life (QoL) and utility measures. Quality of life and utility measures are preference-based measures primarily used in the context of cost-utility analysis (a form of economic evaluation).22 They comprise 2 components: (1) the measure itself, and (2) a valuation system.22 While QoL and utility measures may capture similar health concepts to those captured by PROMs (eg, physical functioning), they are operationalized (put to use) differently; PROMs are typically scored on an item-by-item or domain (subscale) basis, based on the scale structure determined through psychometric analyses, whereas an individual’s QoL is scored as one of a finite number of health states.22 The QoL valuation system then enables a number between −1 and +1 to be assigned to the individual’s health state, subsequently representing the value of that health state (where 1 indicates full health, 0 indicates death, and <0 indicates health states worse than death).22 Examples of QoL and utility measures include the EQ-5D (EuroQol 5 dimensions), the SF-6D (Short Form 6 dimensions), and the HUI (Health Utilities Index).
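The 2-component structure described above (a measure that places a respondent in one of a finite number of health states, plus a valuation system that assigns a number to that state) can be sketched in a few lines of Python. This is a minimal illustration only: the 3-dimension instrument, the additive scoring rule, and every decrement value are assumptions made for the example and do not correspond to the EQ-5D, SF-6D, HUI, or any published value set.

```python
from itertools import product

# HYPOTHETICAL instrument: 3 dimensions, each reported at level 1 (no problems),
# 2 (some problems), or 3 (severe problems). Not a real instrument.
DIMENSIONS = ("mobility", "pain", "mood")

# HYPOTHETICAL valuation system: utility lost at each level of each dimension.
# These numbers are invented for illustration and are not a published value set.
DECREMENTS = {
    "mobility": {1: 0.00, 2: 0.10, 3: 0.40},
    "pain": {1: 0.00, 2: 0.15, 3: 0.45},
    "mood": {1: 0.00, 2: 0.10, 3: 0.35},
}


def all_health_states():
    """Enumerate the finite set of health states the measure can describe."""
    return list(product((1, 2, 3), repeat=len(DIMENSIONS)))


def utility(state):
    """Value a health state as 1 minus the summed decrements (assumed additive rule)."""
    loss = sum(DECREMENTS[dim][level] for dim, level in zip(DIMENSIONS, state))
    return 1.0 - loss


full_health = utility((1, 1, 1))   # no problems on any dimension
worst_state = utility((3, 3, 3))   # severe problems on every dimension
```

Under these assumed decrements, the best state is valued at 1 (full health) and the worst state falls below 0, which the valuation scale interprets as worse than death. A PROM, by contrast, would report item or subscale scores without collapsing them into a single preference-based value.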
Patient-reported experience measures are instruments that capture a patient’s experience of receiving care,7,23 specifically the patient’s perception of what happened during their care encounter and how it happened.13 Several researchers describe patient experiences broadly in terms of relational and functional aspects of care.24-26 Relational aspects can include the provision of emotional and psychological support, being treated with respect and dignity, being involved in decisions about care and treatment, receiving support for family and carer involvement, the provision of clear and comprehensive information, and transparent and honest communication.24 Functional aspects may include effective and timely treatment, expert management of physical symptoms, attention to physical support needs (eg, wheelchairs and blankets), attention to environmental needs (eg, clean, safe, and comfortable care environments), and the coordination and continuity of care.24
Patient-reported experience measures are primarily used to describe one-off patient experiences across a range of health care contexts and to demonstrate trends in experience scores within health services and systems over time. Notably, PREMs differ from patient satisfaction measures, despite being referred to interchangeably throughout the literature. Where PREMs ask patients to provide a report of their care experience (objective), patient satisfaction measures ask patients to evaluate their care experience (subjective).23,27,28 Evaluating involves the concepts of duty and culpability.27 That is, when evaluating a service, patient responses are influenced by their expectations of what a service should and should not do for them (duty) and whether the service is to blame when things that should not happen do happen, or when things that should happen do not happen (culpability).27 Thus, responses to satisfaction questionnaires are more likely to reflect patient expectations, attitudes of appreciation, and social acceptability27,29,30 as opposed to an objective report of what they experienced during their care encounter. There are also differences in the response categories used in PREMs and satisfaction measures. Where PREMs typically employ frequency-based response scales (eg, never, sometimes, often, always),31-33 patient satisfaction measures use agreement-based response scales (eg, strongly disagree, disagree, neither agree nor disagree, agree, strongly agree).34-36 Agreement-based scales are criticized for being biased by acquiescence (tendency to agree with an item irrespective of what is being asked37) and straightlining (tendency to give identical or near identical responses to consecutive questions38). For these reasons, patient experience data are argued to be more actionable for evaluating health care quality.
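As a rough illustration of how straightlining might be screened for in returned questionnaires, the sketch below flags respondents who gave exactly the same answer to every item. The function names and the example responses are hypothetical, and the check covers only identical (not near-identical) response patterns; a flagged respondent may still have answered genuinely, so the flag is a prompt for review rather than proof of biased responding.

```python
def is_straightlined(answers):
    """True when a respondent gave the same answer to every item."""
    return len(answers) > 1 and len(set(answers)) == 1


def straightlining_rate(respondents):
    """Share of respondents whose answers are identical across all items."""
    flagged = sum(1 for answers in respondents if is_straightlined(answers))
    return flagged / len(respondents)


# Hypothetical answers to 5 agreement-scale items
# (1 = strongly disagree ... 5 = strongly agree)
survey = [
    [4, 4, 4, 4, 4],
    [5, 3, 4, 2, 4],
    [2, 2, 2, 2, 2],
    [4, 5, 4, 3, 5],
]
rate = straightlining_rate(survey)
```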
Thus, a critical component of implementing PROMs and PREMs is understanding whether users want to capture patient outcomes of care (PROMs), experiences of care (PREMs), QoL (QoL measures), or satisfaction with care (patient satisfaction measures). The decision of what to measure should be based on the goal or aim of the study or quality improvement project. Additionally, the selection of an instrument(s) should ideally be driven by key stakeholders, such as patients and health care professionals, relative to their priorities for optimizing person-centered, value-based health care.
Psychometrics is a branch of psychology that seeks to measure any number of behavioral and social phenomena using rigorous statistical methods.39 Validity, reliability, and responsiveness are key psychometric properties that provide an assessment of instrument quality, thereby supporting the quality of the resultant information collected.
Validity is defined as the extent to which an instrument measures what it was designed to measure.40 The American Psychological Association recognizes 3 overarching types of validity.41
Criterion validity is the extent to which an instrument correlates with a “gold standard.”42 There are 2 types of criterion validity: concurrent criterion validity (ie, the extent of correlation between an instrument and gold standard comparator administered at the same time) and predictive criterion validity (ie, the extent of correlation between a current instrument and gold standard comparator administered in the future). Criterion validity is criticized, however, for the general lack of gold standard instruments to compare against. The Consensus-Based Standards for the Selection of Health Measurement Instruments (COSMIN) group notes that an exception to this general lack of a gold standard is when a short version of an instrument is compared with a longer or existing version43 (eg, comparing the SF-12 to the SF-36, where the SF-36 is considered the gold standard).44 Thus, it can be difficult for instruments to demonstrate criterion validity.
Construct validity is the extent to which an instrument behaves as expected relative to the construct being measured.45 For example, it is expected that a patient’s overall experience of emergency department care is going to be positively correlated with their outcomes of care (ie, better outcomes correlate with better experiences) but negatively correlated with the duration of their wait time (ie, a shorter wait correlates with a better experience). Thus, a PREM with good construct validity would demonstrate this type of relationship because it aligns with the hypotheses of how these variables correlate.
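The emergency department example can be expressed as a simple hypothesis check: compute the correlation between PREM scores and each comparator, then confirm that the sign matches the expected direction. The sketch below uses a hand-written Pearson correlation and invented data for 6 patients; the variable names, scales, and values are assumptions for illustration, and a real evaluation would use a larger sample and prespecified hypotheses about the expected strength of each correlation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical data for 6 emergency department patients
experience = [2, 3, 5, 4, 1, 5]              # PREM score, higher = better experience
outcome = [40, 55, 80, 70, 30, 85]           # outcome score, higher = better outcome
wait_minutes = [180, 120, 30, 60, 240, 20]   # wait time before being seen

r_outcome = pearson_r(experience, outcome)
r_wait = pearson_r(experience, wait_minutes)

# The construct validity hypotheses from the text: positive with outcomes,
# negative with wait time.
hypotheses_supported = r_outcome > 0 and r_wait < 0
```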
Content validity is the extent to which the content of an instrument adequately reflects the theoretical construct(s) being measured.43 This is considered to be the most important property of an instrument because it reflects whether items are relevant, comprehensive, and comprehensible to the target population. Given that the target population of PROMs and PREMs is patients (irrespective of the health care context), the involvement of patients in the development of PROM and PREM items is essential. Content and face validity are sometimes referred to interchangeably in the literature.46 However, face validity is an informal assessment of whether items look (on face value) to capture the intended theoretical construct(s)43,47 and should not be used in isolation or as a surrogate for content validity. Thus, it is critical that PROMs and PREMs demonstrate sound levels of validity because this ensures that they are accurately measuring the construct(s) under investigation.
Reliability refers to the extent to which an instrument performs in consistent and predictable ways.48 There are 3 main types of reliability.
Test-retest reliability is the extent that participant responses remain stable over consecutive data collections in relation to the same event (construct stability).40,49 Undertaking this type of repeated measurement is likely to produce different results unrelated to the instrument’s level of reliability (eg, random error, subject inconsistency in scoring).42 As such, a shorter time frame between instrument administrations is likely to be more suitable for producing a robust test-retest reliability score. It is well documented that patient experiences deteriorate over time because of an individual’s changing expectations, exposure to similar stimuli (eg, other health care experiences), and misremembering the events that occurred.50-54 Indeed, some researchers purposefully refrain from assessing PREM test-retest reliability for this reason,55 and it is one of the least commonly assessed PREM psychometric properties.13 Thus, neither PROMs nor PREMs should be discounted if they do not demonstrate sound test-retest reliability.
Internal consistency reliability is the degree of interrelatedness among items of the same scale (unidimensional instrument) or subscale (multidimensional instrument).43,46 Cronbach α is most frequently used to indicate an instrument’s internal consistency.13 Additionally, factor analysis (commonly used to demonstrate construct validity) is another means of supporting internal consistency assessment because items within a PROM or PREM subscale (assuming the instrument is multidimensional) should correlate strongly as they are measuring the same latent (unobservable) trait.46,48
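For readers who want to see how Cronbach α is obtained, the standard formula is α = (k / (k − 1)) × (1 − Σ item variances / variance of total scores), where k is the number of items in the scale or subscale. The sketch below implements that formula with the Python standard library; the function name and the 5-respondent data set are hypothetical and serve only to show the calculation.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("Cronbach alpha needs at least 2 items")
    # Sample variance of each item across respondents
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    # Sample variance of each respondent's total score
    total_var = variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses from 5 patients to a 3-item subscale
scores = [
    [4, 4, 3],
    [2, 2, 2],
    [3, 4, 3],
    [1, 2, 1],
    [4, 3, 4],
]
alpha = cronbach_alpha(scores)
```

With so few respondents the estimate is unstable; the example is meant to show the arithmetic, not a defensible reliability assessment.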
Measurement error represents the extent of systematic and random error in respondent scores that is not attributable to true changes in the construct being measured.46 Measurement error may also be described as noise (variance in scores that is owing to randomness), where we are more interested in the signal (variance in true scores).48,56 Adequate sample sizes in the development and psychometric evaluation of PROMs and PREMs are critical for reducing measurement error because a greater number of responses reduces randomness in the results.56
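The link between sample size and random error can be illustrated with the standard error of a mean score, which equals the standard deviation divided by the square root of the number of responses. The sketch below is a minimal illustration with an assumed standard deviation of 10 points; it addresses random error only, because a larger sample does not remove systematic error.

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of a mean score: sd divided by the square root of n."""
    return sd / sqrt(n)


# Assumed score standard deviation of 10 points (hypothetical)
se_100 = standard_error(10, 100)   # 100 responses
se_400 = standard_error(10, 400)   # 400 responses
```

Quadrupling the number of responses halves the standard error in this example, which is why gains from additional responses diminish as samples grow.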
Responsiveness is considered an aspect of construct validity and refers to the ability of an instrument to detect change over time where change has occurred.57 This is commonly reported in terms of clinically important change(s) to the construct being measured.46 However, PROMs and PREMs do not necessarily need to demonstrate responsiveness if used in a discriminatory capacity (eg, to detect differences in experiences between different health care professionals) but should evidence responsiveness if used in an evaluative capacity (eg, evaluating the effect of a new intervention).46,58 Notably, clinical importance is different from statistical significance. Where statistical significance (P < .05) indicates that results at least as extreme as those observed would be unlikely if chance alone were at work, clinical significance relates to the magnitude of a treatment effect and whether the effect is meaningful.59 Results that are clinically important are those most likely to change current practice. While there are several articles providing guidance on responsiveness and clinically important differences for PROMs and QoL or utility measures,60,61 the same cannot be said for PREMs. This is an area of PREM psychometric evaluation that warrants greater investigation; determining a clinically important difference in patient experience scores may present an alternative means to supporting value-based purchasing (VBP) programs, where incentives are based on meaningful changes in scores, not just improvements on previous scores.
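The distinction between statistical significance and clinical importance can be made concrete with a small sketch. It assumes a set of paired change scores, a large enough sample for a normal approximation to the test statistic, and a clinically important difference threshold supplied by the user; the function name, the threshold of 5 points, and the data are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def assess_change(changes, clinically_important_difference, alpha=0.05):
    """Compare a mean change score against both a P value and a clinical threshold.

    Uses a normal approximation (z test), which assumes a reasonably large sample
    and change scores that are not all identical.
    """
    n = len(changes)
    mean_change = mean(changes)
    standard_error = stdev(changes) / sqrt(n)
    z = mean_change / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "mean_change": mean_change,
        "p_value": p_value,
        "statistically_significant": p_value < alpha,
        "clinically_important": abs(mean_change) >= clinically_important_difference,
    }


# Hypothetical change scores for 200 patients: a small but very consistent improvement
changes = [1, 2, 1, 2] * 50
result = assess_change(changes, clinically_important_difference=5)
```

In this invented data set the mean change of 1.5 points is statistically significant because it is so consistent across a large sample, yet it falls short of the assumed 5-point threshold and so would not count as clinically important.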
Floor and ceiling effects occur when a substantial proportion (15%-20%) of participant scores for an item cluster at the bottom end (floor) or top end (ceiling) of a response scale.62 The presence of floor and ceiling effects makes it difficult to distinguish differences between respondents at respective ends of the response scale, thereby affecting the sensitivity of the instrument. If floor and ceiling effects occur for a number of items within a PROM or PREM, this may indicate that the instrument lacks variability, suggestive of inadequate content validity.63
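A floor or ceiling check for a single item amounts to counting the share of responses at each end of the response scale and comparing it with a threshold. The sketch below uses 15%, the lower bound of the range quoted above, as the default; the function name and the sample responses are hypothetical.

```python
def floor_ceiling(scores, scale_min, scale_max, threshold=0.15):
    """Report the share of responses at each end of the scale for one item."""
    n = len(scores)
    floor_share = sum(1 for s in scores if s == scale_min) / n
    ceiling_share = sum(1 for s in scores if s == scale_max) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share >= threshold,
        "ceiling_effect": ceiling_share >= threshold,
    }


# Hypothetical responses to one item scored 1 (never) to 4 (always)
item_scores = [4, 4, 4, 3, 2, 4, 4, 1, 4, 3]
check = floor_ceiling(item_scores, scale_min=1, scale_max=4)
```

Here 6 of 10 responses sit at the top of the scale, so the item would be flagged for a ceiling effect but not a floor effect.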
Cross-cultural adaptation refers to the process whereby a measure developed in one language or culture is adapted to be equivalently meaningful and applicable in another language or culture.64 The adaptation process is supported by cross-cultural validation whereby the performance of translated items is assessed to determine whether they are an accurate reflection of the items in the original instrument.64 Best practice guidance by Guillemin et al65 for cross-cultural adaptation of instruments suggests 5 sequential steps be followed when translating instruments, including:
1. Forward translation of items into new language by qualified translators
2. Back translation of items into existing language to determine consistency of content
3. Convene a stakeholder committee to compare the original and new instrument versions
4. Undertake pretesting with both the original and new instrument to check for content equivalence
5. Consider weighting items depending on the cultural context (ie, there may be items that are more important in certain contexts than the original)
An adapted measure is considered to demonstrate cross-cultural validity when respondents from different groups (eg, English speaking and Spanish speaking) respond similarly to items of respective instruments as evidenced through regression analyses, confirmatory factor analysis, or non–differential item functioning.66
There are several resources that can support the selection of psychometrically robust PROMs and PREMs. Table 1 (references 8, 13, 14, 67-74) presents a comprehensive list of repositories that include PROMs and PREMs for a variety of health care contexts. Many of these also provide information on the extent of psychometric evaluation that instruments have undergone.
Another important resource to support robust PROM and PREM selection is COSMIN, an initiative that seeks to improve both the selection and development of PROMs and other health outcome measurement instruments.75 Given the general lack of PREM-specific guidance, COSMIN is also used to support the development and psychometric evaluation of PREMs.13,76,77 This also highlights the impetus for PREM-specific guidance. The COSMIN guidance can support users to assess the psychometric quality of an instrument where no formal assessment has been documented or to support the development and psychometric evaluation of a new instrument. Despite numerous useful resources to support the identification and selection of psychometrically sound PROMs and PREMs, uniform adoption and implementation is challenged by a range of barriers.
The implementation of PROMs and PREMs to measure health system performance is challenged by factors within and across health systems. The Organisation for Economic Co-operation and Development notes that “if each country continues to do its own thing on patient-reported performance, opportunities to identify excellence, support poor performers and drive improvements across the board will be missed.”78 Among the key recommendations for international improvement in health system performance measurement, the Organisation for Economic Co-operation and Development suggests that where valid PROMs and PREMs do not yet exist for priority diseases, sectors, or services, new measures should be developed.78 They also highlight the importance of formally assessing and piloting PROMs and PREMs in different languages and settings to ensure that they are rigorously developed and tested for the purposes of international comparison.78 However, given that implementation challenges are rife at the microlevel (eg, at the service level), it may be more appropriate to consider the barriers and supports for PROM and PREM implementation within the context of a single system.
Table 2 (references 15, 79-89) summarizes a range of barriers and supports for PROM and PREM implementation at the system, service, and individual levels. There are also challenges associated with some instruments themselves, such as their burdensome length,82 lack of sensitivity (eg, poor ability to discriminate results between different services and patient groups), and susceptibility to floor and ceiling effects.79 However, these are well documented elsewhere in the literature and will not be addressed here.
It is important to note that changes occurring at any one level are likely to have a ripple effect on other levels. This ripple effect is critical for shifting system cultures and priorities toward understanding the importance of PROMs and PREMs for value-based health care. However, there is also an important balance to be struck between producing results that are tailored to individual services and results that support initiatives at the system level.
A key barrier at the system level is cultural resistance toward substantial advancements in person-centered, value-based health care. System-level priorities that align with person-centered and value-based health care necessitate a focus on PROMs and PREMs because the principles underpinning these frameworks champion consumer outcomes, values, preferences, and perspectives.90,91 Thus, embedding these priorities in national health care policy, funding, safety and quality standards, accreditation expectations, and service agreements in meaningful ways will have flow-on effects for establishing system-level PROM and PREM implementation programs and benchmarking.
Commonly cited barriers for PROM and PREM implementation at the service level include a lack of leadership79,83,85 and information technology infrastructure to support PROM and PREM implementation.79,82,84 Additionally, implementation is costly,79-81,84,85 further compounding other recognized service-level barriers. Strong and visible support for PROM and PREM implementation at executive and senior levels, as well as the integration of implementation in service-level strategic priorities, is critical for fostering cultures engaged in PROM and PREM implementation.79,80,82,84,85 Integrating PROM and PREM data collection and feedback mechanisms within services’ existing information technology infrastructure will also support implementation, as well as reduce costs and the burden on health care professionals.82,84,86,87
Implementation can be hugely burdensome on health care professionals in terms of the volume of data they are asked to collect from patients,80,83 as well as the amount of time they are required to invest in the analysis and interpretation of PROM and PREM feedback.79,82-85,88 Skepticism arises when health care professionals question the reliability and validity of PROM and PREM data and the value and purpose of instrument implementation.79,82-85 There are also patient-related barriers, including burden,82,84 and the view that PROM and PREM data are only useful for health care professionals (not themselves).82,88 Thus, potential supports include outsourcing PROM and PREM implementation, data analysis, and interpretation to third parties to minimize the burden on health care professionals. This presently occurs in the US and England92,93 but needs to be considered relative to a service’s desire to promote local ownership of PROM and PREM implementation. Additionally, the interpretation and presentation of PROM and PREM feedback to health care professionals is viewed as more useful when provided alongside other important metrics (eg, service-related data such as wait times, patient importance scores, qualitative case examples) and adjusted for confounding patient variables.15,79,85,89
Two examples of successful system-level PROM and PREM programs include (1) the English National Health Service (NHS) PROMs program and (2) the US Consumer Assessment of Healthcare Providers and Systems (CAHPS) program. The NHS PROMs program was established in 2009 and saw PROMs administered preoperatively and postoperatively to all patients undergoing hip and knee replacement across NHS-funded services.93 The purpose of the program was to gauge the extent to which patients recovered or deteriorated in the 6 months after surgery.93 The PROM data are summarized and publicly reported on NHS Digital annually,94 supporting service benchmarking and the National Tariff Payment System,95 a value-based performance scheme that offers bonus payments to health care professionals who provide quality care in line with risk-adjusted national average improvements in patient health status.96,97
The US CAHPS program is arguably the largest PREM program internationally. Since launching in 1995, close to 20 CAHPS measures have been implemented across Medicare and Medicaid services in the US.92 These PREMs have been increasingly used to support health plan accreditation, consumer choice of health care professionals (through public reporting websites such as Care Compare98 and Plan Finder99), services and health plans, and VBP programs. Arguably the best-known VBP program is the Hospital VBP (HVBP) Program that uses PREM data from the Hospital CAHPS. This program adjusts payments to hospitals under the Inpatient Prospective Payment System based on the quality of care delivered to patients.12 The HVBP Program is funded by a 2% reduction in participating hospitals’ Medicare severity diagnosis-related group payments for the fiscal year. This pool of funding is then redistributed to hospitals relative to their Total Performance Score, which is based on 4 equally weighted quality domains: (1) clinical outcomes, (2) person and community engagement (informed by Hospital CAHPS), (3) safety, and (4) efficiency and cost reduction.12 The HVBP Program made roughly $1.9 billion in incentive payments during the 2020 financial year.100
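The 2 pieces of arithmetic described for the HVBP Program (a 2% reduction that funds the incentive pool, and a Total Performance Score built from 4 equally weighted domains) can be sketched as follows. The function names, the 0 to 100 domain scale, and the hospital figures are assumptions for illustration; the sketch deliberately does not model how the pool is redistributed, because that rule is not described here.

```python
# The 4 equally weighted quality domains named in the text
DOMAINS = (
    "clinical_outcomes",
    "person_and_community_engagement",   # informed by Hospital CAHPS
    "safety",
    "efficiency_and_cost_reduction",
)


def total_performance_score(domain_scores):
    """Equal weighting: each of the 4 domains contributes 25% of the total."""
    return sum(domain_scores[d] for d in DOMAINS) / len(DOMAINS)


def withheld_amount(base_payments, withhold_rate=0.02):
    """Amount contributed to the incentive pool under a 2% payment reduction."""
    return base_payments * withhold_rate


# Hypothetical hospital: domain scores on an assumed 0-100 scale
hospital_scores = {
    "clinical_outcomes": 80,
    "person_and_community_engagement": 60,
    "safety": 70,
    "efficiency_and_cost_reduction": 50,
}
tps = total_performance_score(hospital_scores)
contribution = withheld_amount(1_000_000)   # hypothetical base payments in dollars
```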
Thus, while the implementation of PROMs and PREMs at the system level can be challenging, it is also clear that these types of programs can be successful. Learnings from the English and US programs should be duly considered as we look toward the future of PROMs and PREMs in value-based health care.
Despite important progress in the development and psychometric evaluation of PROMs and PREMs over recent years, there remain challenges to their implementation. Moreover, learnings from countries that have successfully integrated PROMs and PREMs into health system performance measurement are yet to be effectively transferred to countries that are further behind. While it is clear that changes at the system, service, and individual levels need to occur to support PROM and PREM implementation, these efforts will be critical to driving person-centered, value-based health care globally.
Accepted for Publication: February 2, 2022.
Published: April 1, 2022. doi:10.1001/jamahealthforum.2022.0326
Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2022 Bull C et al. JAMA Health Forum.
Corresponding Author: Claudia Bull, BNutr (Hons), School of Public Health and Preventive Medicine, Monash University, 553 St Kilda Rd, Melbourne, VIC, 3141, Australia (firstname.lastname@example.org).
Author Contributions: Ms Bull had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Bull, Callander.
Acquisition, analysis, or interpretation of data: Bull, Teede, Watson.
Drafting of the manuscript: Bull, Callander.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Bull.
Administrative, technical, or material support: Bull, Callander.
Supervision: Watson, Callander.
Other—Expertise and guidance in the field: Teede.
Conflict of Interest Disclosures: Profs Teede and Callander reported grants from the Australian National Health and Medical Research Council during the conduct of the study. No other disclosures were reported.