Concordance of Hospital Ranks and Category Ratings Using the Current Technical Specification of US Hospital Star Ratings and Reasonable Alternative Specifications

This cross-sectional study aims to identify the changes in hospital ratings and rankings associated with alternative methodological choices in the calculation of the 2021 Centers for Medicare & Medicaid Services Hospital Compare star ratings.


eMethods 1. Brief summary of the baseline (April 2021) calculation of the CMS Hospital Compare Star Ratings, as implemented for comparisons
The technical specification of the CMS composite indicator is described in detail in their published technical methodology. This supplement summarises the baseline (April 2021) technical specification of the CMS Star Ratings.
CMS select 49 quality measures, which they group into five domains of quality (eMethods 1 Table 1). These five domains range in size from seven quality measures in the Mortality domain to 15 in the Timely and effective care domain. Each individual measure is mapped from its native scale (e.g. percentages, rates, or time-to-event) onto a common scale by Z-scoring using a normal approximation. Standardised scores (i.e. Z-score values) greater than 3 are rounded down to 3, and any less than -3 are rounded up to -3, a process known as Winsorization. Winsorization limits the influence any extreme score may have on the overall performance of a hospital. A score of +3 is the best possible and -3 the worst possible on every performance measure.
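As an illustration, the sketch below implements this standardisation step in Python. The complication rates are invented, and it is assumed (hypothetically) that measures where lower raw values are better have already been sign-flipped so that higher standardised scores are better:

```python
import numpy as np

def winsorized_z(values, cap=3.0):
    """Map a measure onto a common scale: z-score, then Winsorize at +/-cap."""
    values = np.asarray(values, dtype=float)
    z = (values - np.nanmean(values)) / np.nanstd(values)
    return np.clip(z, -cap, cap)  # scores beyond +/-3 are pulled back to +/-3

# Hypothetical complication rates for 12 hospitals; the sign is flipped so
# that higher standardised scores are better, as lower raw rates are better.
rates = np.array([0.02] * 6 + [0.03] * 5 + [0.50])
print(winsorized_z(-rates))  # the 0.50 outlier is capped at -3
```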
To estimate scores for each of the five domains of quality, CMS take the arithmetic average of the individual measure scores reported by the hospital. A domain score is only calculated where the hospital reports three or more measures in that domain.
The overall summary score is calculated by taking a weighted average of the domain scores. Outcome domains are given a higher weight than the process domain (eMethods 1 Table 1). Hospitals receive a summary score only if they report at least three of the five domains of quality.
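The sketch below illustrates these two aggregation steps for a single hypothetical hospital. The measure scores are invented, and the weights are assumed (for illustration) to be re-normalised over the domains the hospital actually scores:

```python
import numpy as np

# Hypothetical standardised measure scores for one hospital, by domain;
# np.nan marks measures the hospital did not report
measures = {
    "mortality":          np.array([0.4, -0.2, 1.1, np.nan]),
    "safety of care":     np.array([0.8, 0.1, np.nan, np.nan]),
    "readmission":        np.array([-0.5, 0.2, 0.9, 1.4]),
    "patient experience": np.array([0.3, 0.6, -0.1, 0.2]),
    "timely & effective": np.array([1.0, np.nan, np.nan, np.nan]),
}
weights = {"mortality": 0.22, "safety of care": 0.22, "readmission": 0.22,
           "patient experience": 0.22, "timely & effective": 0.12}

def domain_score(m, min_measures=3):
    """Arithmetic mean of reported measures; NaN if fewer than 3 reported."""
    reported = m[~np.isnan(m)]
    return reported.mean() if reported.size >= min_measures else np.nan

scores = {d: domain_score(m) for d, m in measures.items()}
scored = {d: s for d, s in scores.items() if not np.isnan(s)}
if len(scored) >= 3:  # a summary score needs at least three scored domains
    total_w = sum(weights[d] for d in scored)  # re-normalise over scored domains
    summary = sum(weights[d] * scores[d] for d in scored) / total_w
    print(round(summary, 3))  # -> 0.394
```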
Finally, CMS assign the overall star rating by applying k-means clustering to the overall summary scores, with k = 5. This splits hospitals into five groups such that each hospital's score is closer to the mean of its own group than to the mean of any other group. The group with the highest scores is classed as the 'five star' group, the group with the next highest scores is the 'four star' group, and so on.
To account for possible associations where hospitals with more missing data tend to receive higher scores, CMS assign hospitals to peer groups defined by the number of domains of quality for which they have scores. As hospitals must report at least three of the five domains to receive an overall summary score, this gives three peer groups: those that report all five domains; those that report four domains; and those that report three domains. The k-means clustering used to assign the star rating is carried out separately for each of these peer groups. Thus the score required to receive a five star rating, for example, may not be the same for all hospitals.
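A minimal sketch of this clustering step, using scikit-learn's KMeans on simulated summary scores; the peer-group sizes and scores are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_stars(summary_scores, k=5, seed=0):
    """Cluster one-dimensional summary scores into k groups and map the
    groups, ordered by their mean score, onto star ratings 1..k."""
    scores = np.asarray(summary_scores, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scores)
    order = np.argsort(km.cluster_centers_.ravel())  # lowest-mean group first
    stars_for_label = np.empty(k, dtype=int)
    stars_for_label[order] = np.arange(1, k + 1)     # 1 star ... 5 stars
    return stars_for_label[km.labels_]

# Clustering is run separately within each peer group (hospitals reporting
# 3, 4 or 5 domains); the scores here are simulated for illustration
rng = np.random.default_rng(0)
stars_by_group = {g: assign_stars(rng.normal(size=200)) for g in (3, 4, 5)}
print({g: np.bincount(s)[1:] for g, s in stars_by_group.items()})
```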
CMS prefer k-means clustering to other approaches because it provides a naturally defensible categorisation. With other approaches, such as splitting hospitals into five equal-sized groups, there may be little distinction between performance categories. For example, if hospital performance followed a perfect normal distribution with mean 0 and standard deviation 1, then splitting into equal fifths leads to the middle performance groups covering only a narrow band of hospital scores (eMethods 1 Figure 1), while the top and bottom categories cover a very wide range of performances. Using k-means mitigates this problem while still guaranteeing five distinct performance categories.

eMethods 2. Technical details of the alternative approach to standardising individual measures

In terms of the incentives provided to rated organisations, if performance is measured on a 1-100 scale it is desirable for a change in score from 2 to 1 to be of the same importance as a change in score from 62 to 61. Otherwise there is an incentive for organisations to focus improvement efforts in certain areas, and these areas may not be those that are most important. Z-scoring does not achieve this, and we defined alternative standardisation rules based on absolute performance on each measure that aimed to reflect the importance of particular levels of performance. For the 46 proportion and rate-based measures, one could deem differences between higher and lower values to be equally meaningful whether the values lie near zero or approach 100% (or, for rate-based measures, 100 events per 100 units of person-time).
For example, a one-percentage-point reduction in post-operative mortality equates to one fewer death per 100 operations, whether it is a reduction from 2% to 1% or from 62% to 61%. But with Z-scoring, a one-percentage-point change will appear far more important for a performance measure with a standard deviation of one percentage point than for one with a standard deviation of three percentage points. Hence, for these measures, Z-scoring distorts comparisons between different performance measures.
As a plausible alternative approach that avoids this distortion, we standardised all event (e.g. mortality) or rate (e.g. healthcare-acquired infection) measures according to the proportion of people having an event, so that differences between levels of the standardised score represented deaths avoided or safety events that did not happen. For example, a mortality rate of 0.18 would equate to a standardised score of 100*(1-0.18) = 82, while a mortality rate of 0.73 would equate to a score of 100*(1-0.73) = 27.
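A one-line sketch of this transformation, reproducing the worked examples above:

```python
import numpy as np

def standardise_proportion(p):
    """Map a proportion (or a rate per 100 person-time units, expressed as a
    fraction) onto 0-100, where each point is one fewer event per 100."""
    return 100.0 * (1.0 - np.asarray(p, dtype=float))

print(standardise_proportion([0.18, 0.73]))  # -> [82. 27.]
```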
For the three of the 49 individual measures in the CMS Star Ratings that represent time intervals, such as time in the emergency department from arrival to departure, a difference of one hour might be much more consequential between, say, two and three hours than between 15 and 16 hours. Z-scoring cannot reflect the fact that the impact of an additional hour of expected waiting time depends on how long the expected waiting time already is. Hence, for these measures, Z-scoring risks distorting comparisons between different levels of performance on the same performance measure.
As a plausible alternative that reflected the variable importance of an additional period of waiting, we mapped time interval measures to the 0-100 scale using an appropriate logistic transformation: differences between middle-of-the-range performances were treated as more important than differences between excellent performances, or between poor performances.
The choice of transformation depended on the measure, but was motivated by the idea that differences between groups of hospitals with either excellent or very poor levels of performance were not important, so they should be small on the standardised scale.
For example, for the time-to-event measure 'ED - time arrival to departure', our alternative approach to standardisation was motivated by the idea that hospitals with median intervals of two hours or less had excellent performance (independently of whether one hospital's median was 60 minutes and another's 90 minutes), and therefore all deserved a top score. Conversely, hospitals with median intervals of ten hours or more had very poor performance deserving of a minimal score (again, independently of differences between the hospitals within the 10+ hours group).
With these considerations in mind, we used the following function to standardise the time-to-event measure 'ED - time arrival to departure'. Let s be the standardised score for this measure and t the time from arrival in ED to departure in minutes. Then the standardised score is calculated as:

s = 100 / (1 + exp((t - c) / b)),

where the constants c and b set the location and steepness of the curve. This equation looks complex, but in practice it simply describes a smooth curve that falls slowly up to around four hours, then drops off rapidly until it levels off at around ten hours (eMethods 2 Figure 1A). There were 3208 hospitals with a known score for this performance measure, with mean performance 151 minutes (interquartile range 122 to 174 minutes). Mean standardised performance was 98.6 (interquartile range 98.6 to 99.97).
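The sketch below implements a logistic standardisation of this form. The constants (a midpoint of 420 minutes and a scale of 60 minutes) are assumed values chosen only to reproduce the described shape, near 100 below about four hours and near 0 above about ten hours; they are not the constants used in the study:

```python
import numpy as np

def standardise_ed_time(t_minutes, midpoint=420.0, scale=60.0):
    """Logistic mapping of ED arrival-to-departure time (minutes) onto 0-100.
    midpoint (7 h) and scale (1 h) are illustrative assumptions, chosen to
    reproduce the described curve; they are not the study's constants."""
    t = np.asarray(t_minutes, dtype=float)
    return 100.0 / (1.0 + np.exp((t - midpoint) / scale))

for hours in (2, 4, 7, 10):
    print(hours, round(float(standardise_ed_time(60 * hours)), 1))
# -> roughly 99.3 at 2 h, 95.3 at 4 h, 50.0 at 7 h, 4.7 at 10 h
```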
Standardised scores for the other two time-to-event measures were calculated as follows.
ED - time admit decision to departure (eMethods 2 Figure 1B). There were 3248 hospitals with a known score for this performance measure, with mean performance 110 minutes (interquartile range 64 to 135 minutes). Mean standardised performance was 76 (interquartile range 66 to 98).
OP - time to specialist care (eMethods 2 Figure 1C). There were 415 hospitals with a known score for this performance measure, with mean performance 64 minutes (interquartile range 42 to 69 minutes). Mean standardised performance was 88 (interquartile range 92 to 99).

eMethods 3. Technical details of the alternative approach to grouping measures into domains
The baseline CMS approach assigns individual measures to five domains: four outcome domains (mortality, safety of care, readmission, and patient experience) and one process domain (timely and effective care); see also eMethods 1 Table 1. These align with the domains used in the CMS Hospital Value-Based Purchasing program and other national quality initiatives.
As a plausible alternative approach, we used the hospital-level data on the individual measures to generate the quality domains, which were then combined into the overall composite score. We applied exploratory factor analysis to identify empirical latent factors that explained most of the variance among the individual measures.
Missing data presented a challenge to this alternative approach. Standard approaches to exploratory factor analysis require complete data on all measures for all hospitals, but every hospital in the October 2020 dataset we used had some missing performance measure information (eMethods 1 Table 1). That is, for every hospital there was at least one, and usually several, performance measures for which it had no performance information at all. The expectation-maximisation algorithm was used to estimate the covariance matrix between all measures,1 giving correct results if measures were missing at random.2 The number of factors to retain was decided after inspecting scree plots and considering eigenvalues. While no strict criteria were applied, enough factors were retained to reach the 'elbow' of the scree plot, interpreting the elbow generously so that most factors with eigenvalues above 1 were retained.
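As an illustration of the retention step, the sketch below computes the eigenvalues of the correlation matrix implied by a covariance matrix (such as one estimated by expectation-maximisation) and applies the eigenvalue-above-1 rule as a starting point; the toy data with two latent factors are invented:

```python
import numpy as np

def eigenvalues_for_scree(cov):
    """Eigenvalues (descending) of the correlation matrix implied by a
    covariance matrix, e.g. one estimated by expectation-maximisation."""
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

# Toy data: six measures driven by two latent factors (invented)
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 2))
X = np.column_stack(
    [latent[:, 0] + 0.5 * rng.normal(size=500) for _ in range(3)]
    + [latent[:, 1] + 0.5 * rng.normal(size=500) for _ in range(3)]
)
eig = eigenvalues_for_scree(np.cov(X, rowvar=False))
print(np.round(eig, 2))      # scree values: look for the 'elbow'
print(int((eig > 1).sum()))  # Kaiser rule: count eigenvalues above 1
```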
The promax rotation was applied to estimate how strongly each measure was associated with the underlying factors.3 This process generated empirically coherent domains to which individual measures could be assigned. Where an individual measure had a similar degree of association with more than one empirically derived domain, we assigned the measure to the domain to which it appeared more conceptually relevant.
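A sketch of the rotation-and-assignment step, using the third-party factor_analyzer package. The toy data are invented and, for simplicity, the factor analysis is fitted to complete data rather than to an EM-estimated covariance matrix as in the study:

```python
import numpy as np
from factor_analyzer import FactorAnalyzer  # third-party package

# Toy data: six measures driven by two latent factors (invented)
rng = np.random.default_rng(2)
latent = rng.normal(size=(500, 2))
X = np.column_stack(
    [latent[:, 0] + 0.5 * rng.normal(size=500) for _ in range(3)]
    + [latent[:, 1] + 0.5 * rng.normal(size=500) for _ in range(3)]
)

fa = FactorAnalyzer(n_factors=2, rotation="promax")
fa.fit(X)

# Assign each measure to the factor on which it loads most strongly;
# near-ties would instead be resolved by conceptual relevance, as in the text
domain = np.abs(fa.loadings_).argmax(axis=1)
print(np.round(fa.loadings_, 2))
print(domain)  # e.g. [0 0 0 1 1 1]: two empirical domains of three measures
```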
This factor analytic approach produced six empirical domains (eMethods 3 Table 1), rather than the five domains used in the April 2021 CMS Star Ratings. Some of these domains closely matched the baseline domains, with the patient experience domain being unchanged. Measures in other baseline domains were spread across multiple empirical domains (eMethods 3 Figure 1). Details of the individual measures and the domains they were assigned to under the current approach and the factor analytic approach are in eMethods 3 Table 1.

eMethods 4. Technical details of the alternative approach to weighting the domains

In the simulations, a weight for each domain was drawn from the distributions shown in eMethods 4 Figure 1. Drawing from these distributions meant that 95% of weights for current outcome domains (mortality, readmission, safety of care, and patient experience) were between 0.05 and 0.81, while 95% of weights for current process domains (timeliness of care, efficient use of medical imaging, and effectiveness of care) were between 0.01 and 0.61. Across many simulations the drawn weights on average matched those used in the existing Hospital Compare Star Ratings.
Weight choice was independent for each domain. While in the current CMS approach all outcome domains receive a weight of 0.22, the simulation did not restrict all outcome (or all process) domains to the same weight. In the current CMS approach, the weights across all domains sum to 1. In the simulation, the drawn weights were therefore rescaled so that they also summed to 1. For example, if by some unlikely fluke the drawn weight for each of the seven domains was 0.1, so that the weights summed to 0.7, each weight would be upscaled by dividing by 0.7, giving each domain a weight of around 0.14 when the domains were combined. This kept the composite score on a consistent scale across Monte Carlo simulations.
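A minimal sketch of one such Monte Carlo draw. The Beta distributions are placeholders standing in for the distributions in eMethods 4 Figure 1, chosen only so that outcome domains average close to 0.22 before rescaling:

```python
import numpy as np

rng = np.random.default_rng(3)
OUTCOME = ["mortality", "safety of care", "readmission", "patient experience"]
PROCESS = ["timeliness of care", "efficient use of medical imaging",
           "effectiveness of care"]

def draw_weights():
    """One Monte Carlo draw: an independent weight for each of the seven
    domains, rescaled so the weights sum to 1. The Beta distributions are
    placeholders for the distributions in eMethods 4 Figure 1."""
    w = {d: rng.beta(1.5, 5.0) for d in OUTCOME}        # mean ~0.23 pre-rescale
    w.update({d: rng.beta(1.2, 8.0) for d in PROCESS})  # mean ~0.13 pre-rescale
    total = sum(w.values())
    return {d: v / total for d, v in w.items()}         # rescale to sum to 1

weights = draw_weights()
print({d: round(v, 3) for d, v in weights.items()})
print(round(sum(weights.values()), 3))  # -> 1.0
```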