Correlation of Neonatal Intensive Care Unit Performance Across Multiple Measures of Quality of Care

Results: The NICUs varied substantially in their clinical performance across measures of quality. Of 28 unit-level correlations, 6 were significant (P < .05). Correlations between pairs of measures of quality of care were strong (> .5) for 1 pair, moderate (range, .3 to .5) for 8 pairs, weak (range, .1 to .3) for 5 pairs, and negligible (< .1) for 14 pairs. Exploratory factor analysis revealed 4 underlying factors of quality in this sample. Pneumothorax, mortality in the NICU, and antenatal corticosteroid use loaded on factor 1; growth velocity and health care–associated infection loaded on factor 2; chronic lung disease loaded on factor 3; and discharge on any human breast milk loaded on factor 4.

The quality of care delivered by providers is increasingly scrutinized in an attempt to increase efficiency and improve patient care. 1,2 In other areas of medicine, performance measurement and financial incentives are common. [3][4][5] In the neonatal intensive care unit (NICU) setting, multistakeholder health care organizations (such as the National Quality Forum 6 ) and payers of health care are promoting performance assessments of perinatal care providers.
Two facets of performance measurement have received little attention. First, is it fair to draw conclusions regarding institutional performance based on a single or limited set of measures of quality of care? Conclusions based on a small or limited set assume that measured aspects of quality reflect unmeasured aspects of care. However, a study 7 of hospital quality assessments based on hospitalwide mortality rates alone found substantial discrepancies in performance based on the methods used to calculate mortality rates. This calls into question whether it is valid to draw conclusions about quality of care based on hospitalwide mortality rates. In the NICU setting, good performance on one measure of quality (eg, the proportion of infants with chronic lung disease) is assumed to indicate good performance on related measures of quality (eg, duration of mechanical ventilation) and on unrelated measures (eg, rates of health care-associated infection).

The use of a limited set of measures of quality of care for comparative performance measurement would be supported if NICU performance were strongly correlated across multiple measures of quality of care. However, in other areas of health care, studies [8][9][10][11][12] have found weak or no correlation across measures of quality of care. If intrainstitutional correlations among measures of quality of care are weak and performance is inconsistent, then inferences about quality from 1 or a few measures of quality are likely uninformative and potentially misleading. 13 Instead, quality should be assessed by combining multiple measures of quality into 1 or more composite indicators of quality. 14

Second, should quality improvement efforts be directed toward individual measures of quality or toward building more tightly connected systems of care so that performance can be based on several measures of quality simultaneously? Traditional approaches to quality improvement have typically addressed individual measures sequentially. 15,16 In many instances, this has promoted better, safer care, but often gains have been temporary. A growing body of literature suggests that sustained and widespread improvements in quality require changes to the system in which care occurs. For example, improvements in unit safety culture, which varies widely across NICUs, 17 have been linked to lasting improvements. 18,19 The system supporting care delivery is interconnected with quality of care provided. Therefore, correlations between measures of quality might be interpreted to reflect the degree of care systems integration. Weak correlations might suggest a low degree of systems integration, in which care processes are largely functionally independent. 9 Such a finding might signal the need for interventions, such as improvements in safety culture 20 or composite measurement of quality, 14,21 that could more broadly affect performance.
Neonatal intensive care presents a natural laboratory to test whether comparative performance measurement should be approached via limited or expanded sets of measures of quality of care. Specifically, high-quality clinical data are being collected by the California Perinatal Quality Care Collaborative (CPQCC) and other quality-of-care consortia. Our group has been working with CPQCC data to develop a composite indicator of neonatal intensive care quality provided to very low-birth-weight infants, the Baby-MONITOR. 21,22 This study uses CPQCC data and 8 measures of quality that have been selected for inclusion in the Baby-MONITOR to examine the consistency of NICU performance rankings. We hypothesized that correlations of NICU rankings across measures of quality would be at least moderate. The specific objectives of this study were to examine whether high performance on one measure of quality is associated with high performance on others and to develop a data-driven explanatory model for overall NICU performance measurement.

OVERVIEW
The CPQCC 15 is a multistakeholder group of public and private obstetric and neonatal providers, health care purchasers, public health professionals, and private sector health industry specialists, committed to improving care and outcomes for the state's pregnant mothers and newborns. The collaborative includes more than 130 member hospitals, of which 24 are designated as regional centers. This roster accounts for most of the preterm infants requiring critical care in California.

PATIENT SELECTION
In total, 5445 very low-birth-weight infants cared for at 22 of 24 California level III regional centers between January 1, 2004, and December 31, 2007, met inclusion criteria for the study. Of these centers, 15 are designated as level IIID on the basis of open heart surgery performance, and the remainder are designated as level IIIC. 23 We used multiyear analysis because of the few very low-birth-weight infants cared for in some institutions. Detailed descriptions of measure selection, definition, and exclusion criteria have been published elsewhere 21 and are summarized in Table 1. Additional technical details are provided in the eAppendix (http://www.jamapeds.com).

Dependent Variables
We chose 8 quality-of-care measures that had been selected by an expert panel in a modified Delphi experiment for inclusion in the Baby-MONITOR and that have subsequently been confirmed by a sample of clinical neonatologists. Measure definitions were derived from standard CPQCC and Vermont Oxford Network algorithms. 21,24 Measures included the following: (1) antenatal corticosteroid use, (2) hypothermia (<36°C) during the first hour of life, (3) nonsurgically induced pneumothorax, (4) health care-associated bacterial or fungal infection, (5) survival to discharge or to 36 weeks' gestational age with chronic lung disease (need for oxygen therapy or mechanical ventilation at 36 weeks' gestational age), (6) discharge on any human breast milk, (7) mortality in the NICU during the birth hospitalization, and (8) growth velocity. Growth velocity was determined according to a logarithmic function. 25 We aligned all variables so that a higher value represents a better outcome. Statistical modeling (described herein) for this analysis required transformation of continuous variables into categorical ones. Therefore, we empirically dichotomized growth velocity into high- and low-growth groups based on the median velocity of 12.4 g/kg/d derived from the 95% central sample. The denominators for the variables differ slightly. For example, infants who died in the NICU or who survived but remained in the NICU for more than 6 months are not included in the denominator for the breast milk variable.

Independent Variables
We applied CPQCC standard operational definitions for all independent variables. Patients were grouped into gestational age at birth strata of 25 0/7 to 27 6/7, 28 0/7 to 29 6/7, and 30 0/7 or more weeks based on similar patient numbers between groups. Apgar score was categorized as 3 or less, 4 to 6, or greater than 6.

STATISTICAL ANALYSIS
Basic descriptive analyses examined the variation in unadjusted measures across sites. Hospital-level data included each level III NICU as the unit of analysis. To adjust for confounding due to differences in case mix, we developed risk adjustment models for each measure. For each one, we selected a set of candidate variables based on reported associations in the literature or clinical relevance, and we tested for associations with the outcome of interest in univariate analyses using the Fisher exact test for categorical variables and, based on the underlying variable distribution, the t test or the 2-sample Wilcoxon signed rank test. Variables associated at a significance level of P ≤ .25 were entered into a logistic regression model, and variables associated at a significance level of P > .05 were successively removed from the model after checking the log-likelihood ratio test for contribution to model fit. 26

To rank NICU performance on each measure of quality of care, we used a method that was developed by Draper and Gittoes 27 for use in the United Kingdom educational system and that is relevant and valid in any profiling setting with dichotomous outcomes. For each NICU and for each measure of quality of care, a z score was computed as the observed rate minus the expected rate, divided by its estimated standard error. The NICU's expected value was computed as a weighted mean of the rate (eg, the survival rate) in the overall database for all levels of the risk adjustment variables.
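The z score described above can be sketched in a few lines of Python. This is an illustration only, not the Draper and Gittoes procedure as implemented for the study: the function names, the stratum labels, and the example rates are hypothetical, and the standard error uses a simple binomial approximation that ignores the NICU's own contribution to the overall rates.

```python
from math import sqrt

def expected_rate(unit_strata, overall_rates):
    """Weighted mean of the overall stratum rates, weighted by the unit's case mix.

    unit_strata: one risk-stratum label per infant in the NICU (hypothetical labels).
    overall_rates: stratum label -> outcome rate in the overall database.
    """
    return sum(overall_rates[s] for s in unit_strata) / len(unit_strata)

def z_score(unit_strata, unit_outcomes, overall_rates):
    """(observed rate - expected rate) / standard error.

    ASSUMPTION: the standard error treats each infant as an independent
    Bernoulli trial with the overall stratum rate; the published method
    uses its own variance estimate.
    """
    n = len(unit_outcomes)
    observed = sum(unit_outcomes) / n
    expected = expected_rate(unit_strata, overall_rates)
    variance = sum(overall_rates[s] * (1 - overall_rates[s]) for s in unit_strata) / n**2
    return (observed - expected) / sqrt(variance)

# Hypothetical survival rates by gestational age stratum in the overall database.
overall = {"25-27": 0.80, "28-29": 0.90, "30+": 0.95}

# Hypothetical NICU: 10 infants at 25-27 weeks (9 survive), 10 at 30+ weeks (all survive).
strata = ["25-27"] * 10 + ["30+"] * 10
outcomes = [1] * 9 + [0] + [1] * 10

print(round(expected_rate(strata, overall), 3))      # expected survival given case mix
print(round(z_score(strata, outcomes, overall), 2))  # positive means better than expected
```

Because all measures were aligned so that a higher value is better, a positive z score under this sketch corresponds to better-than-expected performance.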

Objective 1: Consistency of High Performance
We used 2 approaches to examine the degree to which superior performance on one key measure of quality (survival) was associated with superior performance on the other measures. 9 First, we ranked NICU performance on each measure according to its z score and calculated correlations of the z scores using the Pearson correlation coefficient. Correlations were rated as weak, moderate, or strong according to conventional thresholds. 28 Second, we compared the distribution of being in the top 4 ranks across measures to a binomial distribution using a χ² test. A test result that is statistically nonsignificant indicates that the hypothesis of independence cannot be rejected.
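The first approach can be illustrated with a short sketch that correlates z scores for every pair of measures and labels each coefficient. The cut points (.1, .3, and .5) are those reported in the Results summary; applying them to the absolute value of the coefficient, the handling of values exactly on a boundary, and all function names are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def strength(r):
    """Label a coefficient; ASSUMPTION: thresholds apply to |r|."""
    size = abs(r)
    if size > 0.5:
        return "strong"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "weak"
    return "negligible"

def pairwise_correlations(z_by_measure):
    """Correlate NICU-level z scores for every pair of measures.

    z_by_measure: measure name -> list of z scores, one per NICU, same NICU order.
    With 8 measures this yields 8 * 7 / 2 = 28 pairs.
    """
    return {
        (a, b): pearson(z_by_measure[a], z_by_measure[b])
        for a, b in combinations(sorted(z_by_measure), 2)
    }
```

Calling `pairwise_correlations` on a dictionary holding the 8 measures returns the 28 unit-level correlations, each of which can then be passed to `strength`.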

Objective 2: Development of a Model of Overall NICU Performance
We performed an exploratory factor analysis to determine whether underlying factors were driving the correlations. Factor loadings in excess of 0.5 were used to classify variables into factors. For all analyses, P < .05 was considered statistically significant. Detailed information on model building, the method by Draper and Gittoes, 27 and factor analysis is given in the eAppendix.
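The classification rule (loadings in excess of 0.5) can be sketched as follows. The loadings in the example are invented placeholders, not values from the study, and comparing the absolute value of the loading with the threshold is an assumption.

```python
def assign_factors(loadings, threshold=0.5):
    """Map each variable to the factors on which it loads above the threshold.

    loadings: variable name -> list of loadings, one per factor (factor 1 first).
    ASSUMPTION: the rule is applied to the absolute value of each loading.
    """
    return {
        variable: [i + 1 for i, value in enumerate(row) if abs(value) > threshold]
        for variable, row in loadings.items()
    }

# Hypothetical loadings for 3 variables on 2 factors (illustrative values only).
example = {
    "measure_a": [0.72, 0.10],
    "measure_b": [0.05, -0.64],
    "measure_c": [0.31, 0.22],
}

for variable, factors in assign_factors(example).items():
    print(variable, factors if factors else "no factor")
```

A variable with an empty list, such as `measure_c` here, does not load on any factor under this rule.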

HUMAN STUDY COMPLIANCE
The CPQCC data are collected for quality improvement and meet the criteria for deidentified data. The data set is then further deidentified with respect to hospital for use as a research data set. The study was approved by the CPQCC and by the Baylor College of Medicine institutional review board.

Table 2 gives characteristics of the study sample. The means for the measures of quality of care are adjusted for illness severity at birth. Table 3 lists z scores of performance on each variable (the standardized observed minus expected rate), with the NICUs labeled A through V in descending order of survival. A z score of zero indicates that the observed results on the measures of quality of care equal the expected (ie, risk-adjusted) results. A positive number indicates that performance is better than expected. We found substantial variation within measures of quality of care between NICUs, except for pneumothorax. A separate analysis using random-effects models showed significant NICU-level variation for all outcomes except pneumothorax (data are available from the author on request).

Consistency of High Performance Across Measures of Quality
We found little consistency of high performance between NICUs. The number of times that NICUs were among the top 4 ranks (a high performer) for the 8 measures of quality of care ranged from 0 (never among the top 4 ranks) to 4 (being in the top 4 ranks for 4 of 8 measures). Figure 1 shows the observed and expected distribution under an assumption that high performance on different measures occurs at random (according to a binomial distribution in which the probability of success on each trial is 4/24 = 0.17 and the 8 trials are independent). The observed distribution was not statistically different from the random binomial distribution (P > .9). Nevertheless, the sum of ranks (Figure 2) across measures of quality of care suggests that hospitals performing well on survival tend to do well on other measures of quality.
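The expected distribution under random high performance can be reproduced with a brief sketch. The success probability (4/24) and the 8 trials are taken from the text as stated; the use of 22 NICUs as the multiplier and the function names are assumptions for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def expected_counts(n_units, n_measures=8, p_top=4 / 24):
    """Expected number of NICUs with k top-4 finishes, for k = 0..n_measures.

    ASSUMPTION: n_units NICUs, each measure treated as an independent trial
    with the success probability quoted in the text.
    """
    return [n_units * binomial_pmf(k, n_measures, p_top) for k in range(n_measures + 1)]

for k, count in enumerate(expected_counts(22)):
    print(k, round(count, 2))
```

The observed counts of NICUs with 0, 1, 2, 3, or 4 top-4 finishes would then be compared with these expected counts in the χ² test.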

OBJECTIVE 2: DEVELOPMENT OF A MODEL OF OVERALL NICU PERFORMANCE
Exploratory factor analysis revealed 4 underlying factors of quality in this sample (Table 5). Pneumothorax, mortality in the NICU, and antenatal corticosteroid use loaded on factor 1; growth velocity and health care-associated infection loaded on factor 2; chronic lung disease loaded on factor 3; and discharge on any human breast milk loaded on factor 4. Hypothermia during the first hour of life did not load on any factor. These factors might be clinically interpreted as follows: factor 1 may reflect the quality of perinatal care because the consequences of good perinatal care are low rates of pneumothorax and high survival; factor 2 may reflect the quality of supporting healthy development, which would be endangered by poor growth velocity and health care-associated infection; factor 3 may represent the quality of respiratory care because good care results in low rates of chronic lung disease; and factor 4 may reflect maternal involvement, which is key to achieving high rates of discharge on any human breast milk.

COMMENT
In this article, we examined NICU performance on 8 measures of quality of care. Except for the variable measuring pneumothorax, we found significant variation in clinical processes and outcomes between NICUs within and across each measure of quality. Correlations between most measures of quality were modest, and performance on one measure of quality had little predictive accuracy regarding performance on another. The only exception was high growth velocity and the absence of health care-associated infection, which were reasonably correlated. An exploratory factor analysis revealed 4 underlying factors of quality in this sample.
Our results have important implications for the comparative performance measurement endeavor. Given the modest correlations among measures of quality of care and the inconsistency among relative performances, one should not infer overall NICU quality based on a single measure or a few measures of quality. Our findings call into question the assumption that this measurement approach, which underlies current benchmarking efforts in health care and is based on few measures of quality and a handful of diseases (eg, diabetes mellitus, heart disease, and hypertension), will lead to widespread improvements in quality. Quality improvement efforts may need to focus on multidimensional improvement and build more tightly connected systems of care so that performance can be raised on several measures of quality simultaneously. We believe that exploratory factor analysis yielded results that have a meaningful clinical interpretation and may help inform a multidimensional conceptual model for measuring and understanding NICU quality. These findings could be the focus of future improvement efforts based on underlying aspects of quality that have a causal effect on the outcomes and might need to be considered together in improving overall performance. Our group is testing whether the Baby-MONITOR, if designed according to this model, better predicts other quality-related constructs, such as safety culture. If repeated elsewhere, this could lead to a more parsimonious set of measures of quality of care to assess overall NICU quality and offer a welcome relief to those who have to collect them. However, the limitation of such a data-driven approach may be that it could exclude the wisdom of the clinical community.
Our findings are open to different interpretations. Measures of quality may be functionally independent from each other. However, based on the clinical literature, we would a priori have expected stronger correlations. Many of the measures of quality of care (such as mortality in the NICU and antenatal corticosteroid use) have demonstrated strong causal links in randomized controlled trials. 29 We speculate that providers who excelled in one area of quality would similarly excel in others; furthermore, NICUs that reliably followed processes to avoid health care-associated infections would achieve better growth velocity and lower mortality.
We interpret our results to spotlight a low degree of systems integration within the NICU setting. Neonatal intensive care may not exist as a tightly integrated and standardized care delivery system. The NICUs seem to have the ability to excel in some areas of care but not in others.
One way to promote systems-based care may be to meaningfully measure overall quality of care by combining individual measures of quality into a composite indicator of quality. 21 One study 30 showed that, while adherence to individual process measures of surgical infection prevention did not predict postoperative infection, a composite of prevention measures did; similarly, the present study found a quality signal based on the sum of ranks across NICUs, which would have been difficult to detect based on individual measures of quality. Our group is working to develop a composite indicator of NICU quality, the Baby-MONITOR, based on an explicit and rigorous framework. 14 Until such composite indicators have been developed and tested in a rigorous manner to ensure internal validity and external validity, it seems that conclusions about overall quality of care based on measurement of restricted measure sets should be viewed with skepticism.
This study must be evaluated within the context of its design. Our investigation relies on data submitted to the CPQCC by the NICUs and not by independent medical records abstractors. This may raise concern regarding the validity of the data. However, little incentive exists for NICUs to systematically submit inaccurate data because this would diminish the usefulness of data feedback from the CPQCC, a service that NICUs pay for. In addition, data validity is strengthened by the CPQCC's use of standardized data abstraction protocols and operation manuals, as well as by automated data quality management tools to identify potentially inaccurate data entries.

Table 5 note: Promax-rotated factor loadings for the 4-factor solution. Promax is a nonorthogonal rotation method used in factor analysis to enhance the interpretability of the factor-loading structure. Boldfaced terms uniquely loaded on a single factor. Abbreviation: NICU, neonatal intensive care unit.
An alternative explanation for our findings of modest correlation of NICU performance across different measures of quality could be that quality of care among our small sample of California regional NICUs was similar. It may be that specific state-level policies foster care processes and cultures that are alike, making it harder to find diverging performance. On the other hand, investigations have found large differences in performance across other networks. 31 Nevertheless, the specific attributes of the present study may hamper generalizability to other states and types of NICUs.
We developed individual risk adjustment models to control for confounding due to clinical risk at birth. These models have not been validated in other samples; therefore, it is possible that the models introduced bias into our results, although the direction of this bias is not easily ascertained. Similarly, residual confounding introduced by unobserved variables (such as academic affiliation or staffing ratios) may have influenced our results.
In conclusion, modest correlations of NICU performance on multiple measures of quality were observed.
Benchmarking of NICU quality based on isolated indicators of quality may not reflect or improve overall quality of care. Multidimensional measurement of performance via composite indicators might promote multidimensional improvement using systems-based interventions.