Author Affiliations: Department of Pediatrics (Dr Profit) and Section of Health Services Research, Department of Medicine (Drs Profit, Pietz, Hysong, and Petersen and Mr Kowalkowski), Baylor College of Medicine, Section of Neonatology, Department of Pediatrics, Texas Children's Hospital (Dr Profit), and Houston Veterans Affairs Health Services Research and Development Center of Excellence, Health Policy and Quality Program, Michael E. DeBakey Veterans Affairs Medical Center (Drs Profit, Pietz, Hysong, and Petersen and Mr Kowalkowski), Houston; Department of Neonatology, Beth Israel Deaconess Medical Center, and Division of Newborn Medicine, Harvard Medical School, Boston, Massachusetts (Dr Zupancic); and Perinatal Epidemiology and Health Outcomes Research Unit, Division of Neonatology, Stanford University School of Medicine and Lucile Packard Children's Hospital, and California Perinatal Quality Care Collaborative, Palo Alto (Dr Gould), and Department of Applied Mathematics and Statistics, Baskin School of Engineering, University of California, Santa Cruz (Dr Draper).
Objectives To examine whether high performance on one measure of quality is associated with high performance on others and to develop a data-driven explanatory model of neonatal intensive care unit (NICU) performance.
Design We conducted a cross-sectional data analysis of a statewide perinatal care database. Risk-adjusted NICU ranks were computed for each of 8 measures of quality selected based on expert input. Correlations across measures were tested using the Pearson correlation coefficient. Exploratory factor analysis was used to determine whether underlying factors were driving the correlations.
Setting Twenty-two regional NICUs in California.
Patients In total, 5445 very low-birth-weight infants cared for between January 1, 2004, and December 31, 2007.
Main Outcome Measures Pneumothorax, growth velocity, health care–associated infection, antenatal corticosteroid use, hypothermia during the first hour of life, chronic lung disease, mortality in the NICU, and discharge on any human breast milk.
Results The NICUs varied substantially in their clinical performance across measures of quality. Of 28 unit-level correlations, 6 were significant (P < .05). Correlations between pairs of measures of quality of care were strong (ρ ≥ .5) for 1 pair, moderate (range, ρ ≥ .3 to ρ < .5) for 8 pairs, weak (range, ρ ≥ .1 to ρ < .3) for 5 pairs, and negligible (ρ < .1) for 14 pairs. Exploratory factor analysis revealed 4 underlying factors of quality in this sample. Pneumothorax, mortality in the NICU, and antenatal corticosteroid use loaded on factor 1; growth velocity and health care–associated infection loaded on factor 2; chronic lung disease loaded on factor 3; and discharge on any human breast milk loaded on factor 4.
Conclusion In this sample, the ability of individual measures of quality to explain overall quality of neonatal intensive care was modest.
The quality of care delivered by health care providers is increasingly scrutinized in an attempt to increase efficiency and improve the quality of patient care.1,2 In other areas of medicine, performance measurement and financial incentives are common.3-5 In the neonatal intensive care unit (NICU) setting, multistakeholder health care organizations (such as the National Quality Forum6) and payers of health care are promoting performance assessments of perinatal care providers.
Two facets of performance measurement have received little attention. First, is it fair to draw conclusions regarding institutional performance based on a single or limited set of measures of quality of care? Conclusions based on a small or limited set assume that measured aspects of quality reflect unmeasured aspects of care. However, a study7 of hospital quality assessments based on hospitalwide mortality rates alone found substantial discrepancies in performance based on the methods used to calculate mortality rates. This calls into question whether it is valid to draw conclusions about quality of care based on hospitalwide mortality rates. In the NICU setting, good performance on one measure of quality (eg, the proportion of infants with chronic lung disease) is assumed to indicate good performance on related measures of quality (eg, duration of mechanical ventilation) and on unrelated measures (eg, rates of health care–associated infection).
The use of a limited set of measures of quality of care for comparative performance measurement would be supported if NICU performance was strongly correlated across multiple measures of quality of care. However, in other areas of health care, studies8-12 have found weak or no correlation across measures of quality of care. If intrainstitutional correlations among measures of quality of care are weak and performance is inconsistent, then inferences about quality from 1 or a few measures of quality are likely uninformative and potentially misleading.13 Instead, quality should be assessed by combining multiple measures of quality into 1 or more composite indicators of quality.14
Second, should quality improvement efforts be directed toward individual measures of quality or toward building more tightly connected systems of care so that performance can be based on several measures of quality simultaneously? Traditional approaches to quality improvement have typically addressed individual measures sequentially.15,16 In many instances, this has promoted better, safer care, but often gains have been temporary. A growing body of literature suggests that sustained and widespread improvements in quality require changes to the system in which care occurs. For example, improvements in unit safety culture, which varies widely across NICUs,17 have been linked to lasting improvements.18,19 The system supporting care delivery is interconnected with quality of care provided. Therefore, correlations between measures of quality might be interpreted to reflect the degree of care systems integration. Weak correlations might suggest a low degree of systems integration, in which care processes are largely functionally independent.9 Such a finding might signal the need for interventions, such as improvements in safety culture20 or composite measurement of quality,14,21 that could more broadly affect performance.
Neonatal intensive care presents a natural laboratory to test whether comparative performance measurement should be approached via limited or expanded sets of measures of quality of care. Specifically, high-quality clinical data are being collected by the California Perinatal Quality Care Collaborative (CPQCC) and other quality-of-care consortia. Our group has been working with CPQCC data to develop a composite indicator of neonatal intensive care quality provided to very low-birth-weight infants, the Baby-MONITOR.21,22 This study uses CPQCC data and 8 measures of quality that have been selected for inclusion in the Baby-MONITOR to examine the consistency of NICU performance rankings. We hypothesized that correlations of NICU rankings across measures of quality would be at least moderate. The specific objectives of this study were to examine whether high performance on one measure of quality is associated with high performance on others and to develop a data-driven explanatory model for overall NICU performance measurement.
The CPQCC15 is a multistakeholder group of public and private obstetric and neonatal providers, health care purchasers, public health professionals, and private sector health industry specialists, committed to improving care and outcomes for the state's pregnant mothers and newborns. The collaborative includes more than 130 member hospitals, of which 24 are designated as regional centers. This roster accounts for most of the preterm infants requiring critical care in California.
In total, 5445 very low-birth-weight infants cared for at 22 of 24 California level III regional centers between January 1, 2004, and December 31, 2007, met inclusion criteria for the study. Of these centers, 15 are designated as level IIID on the basis of open heart surgery performance, and the remainder are designated as level IIIC.23 We used multiyear analysis because of the few very low-birth-weight infants cared for in some institutions. Detailed descriptions of measure selection, definition, and exclusion criteria have been published elsewhere21 and are summarized in Table 1. Additional technical details are provided in the eAppendix.
We chose 8 quality-of-care measures that had been selected by an expert panel in a modified Delphi experiment for inclusion in the Baby-MONITOR and that have subsequently been confirmed by a sample of clinical neonatologists. Measure definitions were derived from standard CPQCC and Vermont Oxford Network algorithms.21,24 Measures included the following: (1) antenatal corticosteroid use, (2) hypothermia (<36°C) during the first hour of life, (3) nonsurgically induced pneumothorax, (4) health care–associated bacterial or fungal infection, (5) survival to discharge or to 36 weeks' gestational age with chronic lung disease (need for oxygen therapy or mechanical ventilation at 36 weeks' gestational age), (6) discharge on any human breast milk, (7) mortality in the NICU during the birth hospitalization, and (8) growth velocity. Growth velocity was determined according to a logarithmic function.25 We aligned all variables so that a higher value represents a better outcome. Statistical modeling (described herein) for this analysis required transformation of continuous variables into categorical ones. Therefore, we empirically dichotomized growth velocity into high- and low-growth groups based on the median velocity of 12.4 g/kg/d derived from the 95% central sample. The denominators for the variables differ slightly. For example, infants who died in the NICU or who survived but remained in the NICU for more than 6 months are not included in the denominator for the breast milk variable.
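The dichotomization step for growth velocity can be sketched as follows. This is a minimal illustration, not the study's code: the exponential formula (1000 × ln of the weight ratio, divided by days) is an assumption drawn from commonly published exponential growth models, the "at or above the cutoff counts as high growth" rule is an assumption, and all function names are hypothetical.

```python
from math import log
from statistics import median

def growth_velocity(start_weight_g, end_weight_g, days):
    """Growth velocity in g/kg/d under an exponential model (assumed formula)."""
    return 1000.0 * log(end_weight_g / start_weight_g) / days

def central_median(values, central_fraction=0.95):
    """Median after trimming equal tails that lie outside the central fraction."""
    ordered = sorted(values)
    trim = int(len(ordered) * (1.0 - central_fraction) / 2.0)
    kept = ordered[trim:len(ordered) - trim] if trim else ordered
    return median(kept)

def dichotomize(values, cutoff):
    """Return 1 for high growth (at or above the cutoff, by assumption), else 0."""
    return [1 if v >= cutoff else 0 for v in values]
```

With the cutoff reported above, `dichotomize(velocities, 12.4)` would label each infant as belonging to the high- or low-growth group.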
We applied CPQCC standard operational definitions for all independent variables. Patients were grouped into gestational age at birth strata of 25 0/7 to 27 6/7, 28 0/7 to 29 6/7, and 30 0/7 or more weeks based on similar patient numbers between groups. Apgar score was categorized as 3 or less, 4 to 6, or greater than 6.
Basic descriptive analyses examined the variation in unadjusted measures across sites. Hospital-level data included each level III NICU as the unit of analysis. To adjust for confounding due to differences in case mix, we developed risk adjustment models for each measure. For each one, we selected a set of candidate variables based on reported associations in the literature or clinical relevance, and we tested for associations with the outcome of interest in univariate analyses using the Fisher exact test for categorical variables and, for continuous variables, the t test or the 2-sample Wilcoxon rank sum test, depending on the underlying variable distribution. Variables associated at a significance level of P ≤ .25 were entered into a logistic regression model, and variables associated at a significance level of P > .05 were successively removed from the model after checking the log-likelihood ratio test for contribution to model fit.26
To rank NICU performance on each measure of quality of care, we used a method that was developed by Draper and Gittoes27 for use in the United Kingdom educational system and which is relevant and valid in any profiling setting with dichotomous outcomes. For each NICU and for each measure of quality of care, a z score was computed as the observed rate minus the expected rate, divided by its estimated standard error. The NICU's expected value was computed as a weighted mean of the rate (eg, the survival rate) in the overall database for all levels of the risk adjustment variables.
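The z score described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the standard error here is a plain binomial-type approximation summed over risk strata, which is not necessarily the estimator used by Draper and Gittoes (the actual details are in the eAppendix), and the function names and stratum labels are hypothetical.

```python
from math import sqrt

def expected_rate(stratum_counts, overall_rates):
    """Weighted mean of database-wide rates, weighted by the unit's own case mix."""
    n = sum(stratum_counts.values())
    return sum(stratum_counts[s] * overall_rates[s] for s in stratum_counts) / n

def z_score(observed_events, stratum_counts, overall_rates):
    """(observed rate - expected rate) / estimated standard error."""
    n = sum(stratum_counts.values())
    observed = observed_events / n
    expected = expected_rate(stratum_counts, overall_rates)
    # Assumption: binomial variance within each stratum, not the published estimator.
    variance = sum(
        stratum_counts[s] * overall_rates[s] * (1.0 - overall_rates[s])
        for s in stratum_counts
    ) / n ** 2
    return (observed - expected) / sqrt(variance)
```

A z score of zero means the unit performed exactly as its case mix would predict; a positive value means better than expected once all measures are aligned so that higher is better.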
We used 2 approaches to examine the degree to which superior performance on one key measure of quality (survival) was associated with superior performance on the other measures.9 First, we ranked NICU performance on each measure according to its z score and calculated correlations of the z scores using the Pearson correlation coefficient. Correlations were rated as weak, moderate, or strong according to conventional thresholds.28 Second, we compared the distribution of being in the top 4 ranks across measures to a binomial distribution using a χ² test. A test result that is statistically nonsignificant indicates that the hypothesis of independence cannot be rejected.
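The first approach (ranking units by z score and correlating z scores between two measures) might look like the sketch below. The helper names are hypothetical, and breaking ties by input order is an assumption; the text does not say how ties were handled.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks_descending(z_scores):
    """Rank 1 goes to the highest z score; ties are broken by input order."""
    order = sorted(range(len(z_scores)), key=lambda i: -z_scores[i])
    ranks = [0] * len(z_scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks
```

Applied to every pair of the 8 measures, `pearson` would fill the unit-level correlation matrix, with one z score per NICU in each input sequence.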
We performed an exploratory factor analysis to determine whether underlying factors were driving the correlations. Factor loadings in excess of 0.5 were used to classify variables into factors. For all analyses, P < .05 was considered statistically significant. Detailed information on model building, the method by Draper and Gittoes,27 and factor analysis is given in the eAppendix.
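The classification rule for factor loadings can be illustrated as follows. The loadings in the example are made up for illustration and are not the values reported in Table 5; comparing the absolute value of each loading with the threshold is also an assumption, as is the function name.

```python
def assign_factors(loadings, threshold=0.5):
    """Map each measure to the factors on which its |loading| exceeds the threshold."""
    assignment = {}
    for measure, row in loadings.items():
        assignment[measure] = [
            j + 1 for j, value in enumerate(row) if abs(value) > threshold
        ]
    return assignment

# Made-up loadings for three hypothetical measures on three factors.
example = {
    "measure_a": [0.81, 0.10, -0.05],
    "measure_b": [0.12, 0.67, 0.20],
    "measure_c": [0.30, 0.25, 0.40],
}
result = assign_factors(example)
```

A measure that maps to an empty list, like `measure_c` here, does not load on any factor under this rule.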
The CPQCC data are collected for quality improvement and meet the criteria for deidentified data. The data set is then further deidentified with respect to hospital for use as a research data set. The study was approved by the CPQCC and by the Baylor College of Medicine institutional review board.
Table 2 gives characteristics of the study sample. The means for the measures of quality of care are adjusted for illness severity at birth.
Table 3 lists z scores of performance on each variable (the standardized observed minus expected rate), with the NICUs labeled A through V in descending order of survival. A z score of zero indicates that the observed results on the measures of quality of care equal the expected (ie, risk-adjusted) results. A positive number indicates that performance is better than expected. We found substantial variation within measures of quality of care between NICUs, except for pneumothorax. A separate analysis using random-effects models showed significant NICU-level variation for all outcomes except pneumothorax (data are available from the author on request).
Table 4 gives the NICU-level correlation matrix among measures of quality of care. Of 28 unit-level correlations, 6 were significant (P < .05). Correlations between pairs of measures of quality of care were strong (ρ ≥ .5) for 1 pair, moderate (range, ρ ≥ .3 to ρ < .5) for 8 pairs, weak (range, ρ ≥ .1 to ρ < .3) for 5 pairs, and negligible (ρ < .1) for 14 pairs.
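The strength categories and the count of 28 pairs can be checked with a short sketch. Applying the thresholds to the absolute value of the correlation is an assumption (the text states the cutoffs without saying how negative correlations were treated), and the function names are hypothetical.

```python
from itertools import combinations

def strength(rho):
    """Label a correlation by the thresholds given in the text (applied to |rho|)."""
    magnitude = abs(rho)
    if magnitude >= 0.5:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    if magnitude >= 0.1:
        return "weak"
    return "negligible"

def count_pairs(n_measures):
    """Number of distinct unordered pairs among n_measures measures."""
    return len(list(combinations(range(n_measures), 2)))
```

Eight measures give 28 distinct pairs, which matches the 1 + 8 + 5 + 14 pairs reported across the four categories.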
We found little consistency of high performance between NICUs. The number of times that NICUs were among the top 4 ranks (a high performer) for the 8 measures of quality of care ranged from 0 (never among the top 4 ranks) to 4 (being in the top 4 ranks for 4 of 8 measures). Figure 1 shows the observed and expected distribution under an assumption that high performance on different measures occurs at random (according to a binomial distribution in which the probability of success on each trial is 4/24 = 0.17 and the 8 trials are independent). The observed distribution was not statistically different from the random binomial distribution (P > .9). Nevertheless, the sum of ranks (Figure 2) across measures of quality of care suggests that hospitals performing well on survival tend to do well on other measures of quality.
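The expected distribution under random performance can be reproduced from the binomial probability mass function. The sketch below uses the per-trial probability and number of trials stated in the text as defaults; the number of units (22) comes from the study sample, and the function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def expected_top_rank_counts(n_units=22, n_measures=8, p=4 / 24):
    """Expected number of units in the top ranks on exactly k measures, k = 0..n_measures."""
    return [n_units * binomial_pmf(k, n_measures, p) for k in range(n_measures + 1)]
```

Comparing these expected counts with the observed counts per k is the basis for the χ² comparison; the test itself is not implemented here.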
Figure 1. Observed and expected probability of a neonatal intensive care unit (NICU) being in the top 4 ranks for each measure was not different from random variation (random binomial distribution). This indicates that NICUs are not consistent with regard to performance across 8 measures of quality. If NICUs that performed in the top 4 on one measure of quality also were more likely to perform well on other measures, we would see a U-shaped distribution.
Figure 2. Correlation between survival rank and the sum of neonatal intensive care unit (NICU) ranks across 7 measures of quality of care shows a trend indicating that NICUs performing well on survival tend to perform well on the other measures. This trend was not apparent in correlations between pairs of measures of quality of care and suggests that composite measurement may provide a better global picture of quality of care delivery.
Exploratory factor analysis revealed 4 underlying factors of quality in this sample (Table 5). Pneumothorax, mortality in the NICU, and antenatal corticosteroid use loaded on factor 1; growth velocity and health care–associated infection loaded on factor 2; chronic lung disease loaded on factor 3; and discharge on any human breast milk loaded on factor 4. Hypothermia during the first hour of life did not load on any factor. These factors might be clinically interpreted as follows: factor 1 may reflect the quality of perinatal care because the consequences of good perinatal care are low rates of pneumothorax and high survival; factor 2 may reflect the quality of supporting healthy development, which would be endangered by poor growth velocity and health care–associated infection; factor 3 may represent the quality of respiratory care because good care results in low rates of chronic lung disease; and factor 4 may reflect maternal involvement, which is key to achieving high rates of discharge on any human breast milk.
In this article, we examined NICU performance on 8 measures of quality of care. Except for the variable measuring pneumothorax, we found significant variation in clinical processes and outcomes between NICUs within and across each measure of quality. Correlations between most measures of quality were modest, and performance on one measure of quality had little predictive accuracy regarding performance on another. The only exception was high growth velocity and the absence of health care–associated infection, which were reasonably correlated. An exploratory factor analysis revealed 4 underlying factors of quality in this sample.
Our results have important implications for the comparative performance measurement endeavor. Given the modest correlations among measures of quality of care and the inconsistency among relative performances, one should not infer overall NICU quality based on a single measure or a few measures of quality. Current benchmarking efforts in health care are based on few measures of quality and a handful of diseases (eg, diabetes mellitus, heart disease, and hypertension). Our findings call into question the assumption that this measurement approach will lead to widespread improvements in quality.
Quality improvement efforts may need to focus on multidimensional improvement and build more tightly connected systems of care so that performance can be raised on several measures of quality simultaneously. We believe that exploratory factor analysis yielded results that have a meaningful clinical interpretation and may help inform a multidimensional conceptual model for measuring and understanding NICU quality. These findings could be the focus of future improvement efforts based on underlying aspects of quality that have a causal effect on the outcomes and might need to be considered together in improving overall performance. Our group is testing whether the Baby-MONITOR, if designed according to this model, better predicts other quality-related constructs, such as safety culture. If repeated elsewhere, this could lead to a more parsimonious set of measures of quality of care to assess overall NICU quality and offer a welcome relief to those who have to collect them. However, the limitation of such a data-driven approach may be that it could exclude the wisdom of the clinical community.
Our findings are open to different interpretations. Measures of quality may be functionally independent from each other. However, based on the clinical literature, we would a priori have expected stronger correlations. Many of the measures of quality of care (such as mortality in the NICU and antenatal corticosteroid use) have demonstrated strong causal links in randomized controlled trials.29 We had speculated that providers who excelled in one area of quality would similarly excel in others and, furthermore, that NICUs that reliably followed processes to avoid health care–associated infections would achieve better growth velocity and lower mortality.
We interpret our results to spotlight a low degree of systems integration within the NICU setting. Neonatal intensive care may not exist as a tightly integrated and standardized care delivery system. The NICUs seem to have the ability to excel in some areas of care but not in others.
One way to promote systems-based care may be to meaningfully measure overall quality of care by combining individual measures of quality into a composite indicator of quality.21 One study30 showed that, while adherence to individual process measures of surgical infection prevention did not predict postoperative infection, a composite of prevention measures did. Similarly, the present study found a quality signal based on the sum of ranks across NICUs, which would have been difficult to detect based on individual measures of quality. Our group is working to develop a composite indicator of NICU quality, the Baby-MONITOR, based on an explicit and rigorous framework.14 Until such composite indicators have been developed and rigorously tested to ensure internal and external validity, conclusions about overall quality of care based on restricted measure sets should be viewed with skepticism.
This study must be evaluated within the context of its design. Our investigation relies on data submitted to the CPQCC by the NICUs and not by independent medical records abstractors. This may raise concern regarding the validity of the data. However, little incentive exists for NICUs to systematically submit inaccurate data because this would diminish the usefulness of data feedback from the CPQCC, a service that NICUs pay for. In addition, data validity is strengthened by the CPQCC's use of standardized data abstraction protocols and operation manuals, as well as by automated data quality management tools to identify potentially inaccurate data entries.
An alternative explanation for our findings of modest correlation of NICU performance across different measures of quality could be that quality of care among our small sample of California regional NICUs was similar. It may be that specific state-level policies foster care processes and cultures that are alike, making it harder to find diverging performance. On the other hand, investigations have found large differences in performance across other networks.31 Nevertheless, the specific attributes of the present study may hamper generalizability to other states and types of NICUs.
We developed individual risk adjustment models to control for confounding due to clinical risk at birth. These models have not been validated in other samples; therefore, it is possible that the models introduced bias into our results, although the direction of this bias is not easily ascertained. Similarly, residual confounding introduced by unobserved variables (such as academic affiliation or staffing ratios) may have influenced our results.
In conclusion, modest correlations of NICU performance on multiple measures of quality were observed. Benchmarking of NICU quality based on isolated indicators of quality may not reflect or improve overall quality of care. Multidimensional measurement of performance via composite indicators might promote multidimensional improvement using systems-based interventions.
Correspondence: Jochen Profit, MD, MPH, Houston Veterans Affairs Health Services Research and Development Center of Excellence, Health Policy and Quality Program, Michael E. DeBakey Veterans Affairs Medical Center (152), 2002 Holcombe Blvd, Houston, TX 77030 (firstname.lastname@example.org).
Accepted for Publication: May 31, 2012.
Published Online: November 12, 2012. doi:10.1001/jamapediatrics.2013.418
Author Contributions: Drs Profit and Pietz had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: Profit, Zupancic, Gould, Pietz, Draper, and Petersen. Acquisition of data: Profit, Zupancic, Gould, Kowalkowski, Hysong, and Petersen. Analysis and interpretation of data: Profit, Pietz, Kowalkowski, and Draper. Drafting of the manuscript: Profit, Zupancic, Gould, and Petersen. Critical revision of the manuscript for important intellectual content: Profit, Zupancic, Gould, Pietz, Kowalkowski, Draper, Hysong, and Petersen. Obtained funding: Profit, Hysong, and Petersen. Administrative, technical, and material support: Profit, Zupancic, Gould, and Petersen.
Conflict of Interest Disclosures: None reported.
Funding/Support: This study is supported in part by grants 1 K23 HD056298 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (Dr Profit), CD2-07-0181 from the Department of Veterans Affairs Health Services Research and Development Program (Dr Hysong), Veterans Affairs Health Services Research and Development Center of Excellence grant HFP90-20 (Drs Hysong and Petersen), and American Heart Association Established Investigator Award 0540043N (Dr Petersen).
Additional Contributions: We thank Aloka L. Patel, MD, and Rush University Medical Center (Chicago, Illinois) for granting Dr Profit a nonexclusive license to use Rush University Medical Center's exponential infant growth model for noncommercial research purposes.
Profit J, Zupancic JAF, Gould JB, Pietz K, Kowalkowski MA, Draper D, Hysong SJ, Petersen LA. Correlation of Neonatal Intensive Care Unit Performance Across Multiple Measures of Quality of Care. JAMA Pediatr. 2013;167(1):47-54. doi:10.1001/jamapediatrics.2013.418