To assess the ability of the third edition of the Bayley Scales of Infant and Toddler Development (Bayley-III) to detect developmental delay in 2-year-old children who were extremely preterm and those carried to term.
Prospective cohort study.
The state of Victoria, Australia.
Subjects were consecutive surviving children who were born either at less than 28 weeks' gestational age (extremely preterm) or with less than 1000 g birth weight (extremely low-birth-weight; n = 221) in the state of Victoria, Australia, in 2005 and randomly selected controls who were both carried to term and of normal birth weight (n = 220).
Main Outcome Measure
At 2 years of age, corrected for prematurity, children were assessed with the new Bayley-III scale by psychologists blinded to knowledge of group.
Follow-up rates of both cohorts were high (>92%). Mean values for all composite and subtest scores for the extremely preterm/extremely low-birth-weight group were significantly below those of the control group (P < .001), with the magnitude of all group differences being in excess of two-thirds SD. Mean values for the extremely preterm/extremely low-birth-weight group approached the normative mean, but in contrast, the mean values for the control group were higher than expected, with composite scores being between 0.55 and 1.23 SD above the normative mean. Proportions of children with developmental delay were grossly underestimated using the reference values, but were within the expected range when computed relative to the mean (standard deviation) for the controls.
The Bayley-III scale seriously underestimates developmental delay in 2-year-old Australian children.
Standardized developmental assessments are important in the early detection of developmental delay in children, determining eligibility for early intervention programs and the evaluation of perinatal, neonatal, and infant treatments.1 For high-risk infants such as those born very (<32 weeks) or extremely (<28 weeks) preterm, close monitoring using developmental screeners or standardized developmental assessments should be standard practice.
While there is no criterion standard for determining developmental delay,1,2 the Bayley Scales of Infant Development (BSID)3 and its revisions4,5 are the most widely reported measures. The second edition of the Bayley scales (BSID-II), in particular, has been used in many studies to determine rates of developmental delay in very preterm children6-8 and perinatal factors associated with poor outcome,9-15 and as an outcome measure in perinatal randomized controlled trials.16-21 The BSID-II has also been applied in studies involving other high-risk conditions such as severe combined immunodeficiency,22 human immunodeficiency virus,23 prenatal cocaine exposure,24 cerebral palsy,25 neurotoxin exposure,26,27 gastroschisis,28 and Prader-Willi syndrome.29
The primary scales from the BSID and BSID-II are the Mental Developmental Index (MDI) and the Psychomotor Developmental Index (PDI). In brief, the MDI evaluates early cognitive and language development, while the PDI evaluates early fine and gross motor development. The broad natures of both the MDI and PDI are the main limitations of the BSID and BSID-II.1 For example, low MDI scores may reflect a specific delay in communication skills, cognitive abilities, or both. The third edition of the Bayley scales (Bayley-III) attempts to address this limitation by refining the measure to include separate composite scores for cognitive, language, and motor domains. In addition, scale scores can be calculated to assess receptive communication, expressive communication, and fine and gross motor development. Parent-report questionnaires are incorporated into the Bayley-III to assess social-emotional and adaptive behavior. Thus, the structure of the new Bayley-III has the potential to provide more clinically useful information relating to early development, improving our capacity to discriminate specific developmental problems and helping to target early intervention programs to more specific areas of weakness. From a research perspective, the Bayley-III may improve understanding of early development in high-risk populations and may be a more sensitive outcome measure for clinical trials.
To date, few published studies have used the Bayley-III, and the original enthusiasm for this measure may have waned, with many clinicians suggesting that it overestimates development and, as such, underestimates delay. This article will examine this issue by contrasting developmental scores and rates of delay in a large regional cohort of 2-year-olds born in 2005 who were either extremely preterm/extremely low-birth-weight or carried to term.
The extremely preterm/extremely low-birth-weight (EP/ELBW) group comprised all children born at fewer than 28 completed weeks of gestation or with birth weights of less than 1000 g born in the state of Victoria in 2005 who survived to 2 years of age. Gestational age was determined by the best obstetric estimate, based on fetal ultrasound, before 20 weeks in most cases. The control group participants were born at 37 weeks' gestation or later and weighed more than 2499 g. They were randomly selected from each maternity unit associated with the 3 level-III perinatal centers in the state, stratified to balance with extremely preterm survivors for sex, mother's health insurance status, and the language spoken primarily in her country of birth (English or other).
Development was assessed in survivors at 2 years of age, corrected for prematurity, using the Bayley-III scale. Blinded psychologists administered the Cognitive, Language, and Motor scales but not the Social-Emotional or Adaptive Behavior scales. The Cognitive scale assesses abilities such as sensorimotor development, exploration and manipulation, object relatedness, concept formation, memory, and simple problem solving. The Language scale consists of Receptive Communication (verbal comprehension, vocabulary) and Expressive Communication (babbling, gesturing, and utterances) subtests, while the Motor scale consists of Fine Motor (grasping, perceptual-motor integration, motor planning, and speed) and Gross Motor (sitting, standing, locomotion, and balance) subtests.
The Composite scores for the Cognitive, Language, and Motor scales are age-standardized with a mean (SD) score of 100 (15). The Receptive Communication, Expressive Communication, Fine Motor, and Gross Motor subtest scores are age-standardized with a mean (SD) score of 10 (3). Percentile ranks, developmental age equivalents, and growth scores are not reported. The standardization sample for the Bayley-III comprised 1700 children divided across 17 age bands from 1 to 42 months, with 100 children in each age band. The sample was reported to be representative of the 2000 US Bureau of the Census population survey data in terms of parent education, ethnicity, and geographic region. The original standardization sample included only typically developing children carried to term but later, children with cognitive, physical, and behavioral issues were added to constitute approximately 10% of the total sample.
Children were also assessed by blinded pediatricians for neurosensory impairments including cerebral palsy (CP), blindness (visual acuity <20/200 in the better eye), and deafness (hearing loss requiring amplification, or worse). The criteria for the diagnosis of CP included abnormal tone and delays in motor control and function.
Developmental delay was calculated according to (1) the Bayley-III norms, and (2) the control group mean (standard deviation). Mild cognitive/language/motor delay comprised a score on the relevant composite scale from −2 SD to less than −1 SD; moderate delay, from −3 SD to less than −2 SD; and severe delay, a score of less than −3 SD. Children who were unable to complete psychological testing because of severe developmental delay were assigned a score of −4 SD.
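The classification rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `classify_delay` and the control-group values are hypothetical, and only the cutoffs (−1, −2, and −3 SD), the normative mean (SD) of 100 (15), and the −4 SD score for untestable children come from the text.

```python
def classify_delay(score, ref_mean=100.0, ref_sd=15.0):
    """Return 'none', 'mild', 'moderate', or 'severe' for a composite score.

    ref_mean and ref_sd default to the Bayley-III norms; pass the control
    group's mean and SD to classify relative to controls instead.
    """
    z = (score - ref_mean) / ref_sd  # distance from the reference mean in SD units
    if z < -3:
        return "severe"    # less than -3 SD
    if z < -2:
        return "moderate"  # from -3 SD to less than -2 SD
    if z < -1:
        return "mild"      # from -2 SD to less than -1 SD
    return "none"


# Score assigned to children too delayed to complete testing: -4 SD on the norms.
UNTESTABLE_SCORE = 100.0 - 4 * 15.0

# Hypothetical control-group values, for illustration only (not study data).
HYPOTHETICAL_CONTROL_MEAN = 110.0
HYPOTHETICAL_CONTROL_SD = 15.0

score = 85.0
by_norms = classify_delay(score)
by_controls = classify_delay(score, HYPOTHETICAL_CONTROL_MEAN, HYPOTHETICAL_CONTROL_SD)
```

Under these assumed control values, a composite score of 85 sits exactly at −1 SD on the norms and is therefore not classified as delayed, whereas the same score falls in the mild range when judged against the higher control mean.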
Data were analyzed using SPSS for Windows version 17.0 (SPSS Inc, Chicago, Illinois). Means were contrasted by mean difference and 95% confidence intervals, and by linear regression analysis to adjust for confounding variables (family structure and maternal education). Analyses were also performed excluding children with neurological impairments (CP, blindness, or deafness). Rates of impairment between groups were compared by χ2 analysis, or by Fisher exact test when cell sizes were small. P values < .05 were considered statistically significant.
The Research and Ethics Committees at the Royal Women's Hospital, Mercy Hospital for Women, and Monash Medical Centre, Melbourne, Australia, approved this follow-up study. Written informed consent was obtained from parents of controls carried to term. Follow-up was considered routine clinical care for the very preterm infants.
The EP/ELBW group comprised 221 survivors at 2 years' corrected age, of whom 211 participated in the developmental assessment (95% retention rate). The control group comprised 220 survivors aged 2 years, of whom 202 participated in this study (92% retention rate). The perinatal and demographic characteristics of the 2 groups are displayed in Table 1. The EP/ELBW children were less likely to be from intact families at 2 years of age (χ2 = 12.2; P < .001), and their mothers were less likely to have completed secondary school (χ2 = 15.5; P < .001). The rate of CP was elevated in the EP/ELBW group (9.0% [19 of 211] vs 0% [0 of 202]; P < .001, Fisher exact test), but the rate of deafness (1.9% [4 of 211] vs 0.5% [1 of 202]; P = .37, Fisher exact test) was low and did not differ between groups. No children were blind in either group. The groups did not differ regarding the corrected age at assessment (EP/ELBW mean [SD], 24.4 [2.5] months; control mean [SD], 24.2 [1.7] months; mean difference, 0.1; 95% confidence interval, −0.3 to 0.6).
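The Fisher exact comparisons reported above can be checked from the published counts. The sketch below is a plain-Python reimplementation for illustration only; the authors used SPSS, and the function name `fisher_exact_two_sided` is an assumption. It uses the common two-sided definition that sums the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability that k of the col1 cases fall in row 1.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    k_min = max(0, col1 - row2)
    k_max = min(col1, row1)
    p_obs = prob(a)
    # Sum over all tables that are no more likely than the observed table.
    # The small relative tolerance guards against floating-point ties.
    total = sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Counts taken from the text: affected vs unaffected in each group.
p_cp = fisher_exact_two_sided(19, 211 - 19, 0, 202)   # cerebral palsy
p_deaf = fisher_exact_two_sided(4, 211 - 4, 1, 202 - 1)  # deafness
```

With these counts, `p_cp` is far below .001 and `p_deaf` rounds to .37, in agreement with the values reported in the text.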
Table 2 lists the descriptive statistics and mean group differences for the Bayley-III composite and subtest scores for the EP/ELBW and control groups. The means for all composite and subtest scores for the EP/ELBW group were significantly lower than those of the control group (P < .001), with the magnitude of all group differences being in excess of two-thirds SD. However, it is important to note that the means for the EP/ELBW group approached the normative mean and were within the reference (“average”) range. In contrast, the means for the control group were higher than expected, with the composite scores being between 0.55 and 1.23 SD above the normative mean. Analyses were repeated, adjusting for family structure and maternal education. While maternal education was a significant predictor of these developmental outcomes, the mean group differences remained substantial and statistically significant (P < .001). The magnitude of the mean group differences declined marginally when children with CP and/or deafness were excluded, although no statistical conclusions were altered (P < .001), and all group differences remained in excess of 0.5 SD.
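To make the SD-based statements concrete, the following sketch converts them to composite-score points using only the normative mean (SD) of 100 (15) given in the Methods. The helper names are illustrative assumptions, and no values from Table 2 are used.

```python
NORM_MEAN, NORM_SD = 100.0, 15.0  # Bayley-III composite norms


def sd_units_to_points(sd_units, sd=NORM_SD):
    """Convert a difference expressed in SD units to composite-score points."""
    return sd_units * sd


def composite_at(sd_units_above_mean, mean=NORM_MEAN, sd=NORM_SD):
    """Composite score lying a given number of SD units above the normative mean."""
    return mean + sd_units_above_mean * sd


# A group difference of two-thirds SD corresponds to about 10 composite points.
two_thirds_sd_points = sd_units_to_points(2 / 3)

# Control means 0.55 to 1.23 SD above the norm correspond to roughly 108 to 118.
control_low = composite_at(0.55)
control_high = composite_at(1.23)

print(round(two_thirds_sd_points, 1), round(control_low, 1), round(control_high, 1))
```

In other words, the control group's composite means sat roughly 8 to 18 points above the normative mean of 100.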
The main purpose of the Bayley-III is to detect developmental delay. The rates of mild, moderate, and severe delay, determined according to reference values and relative to controls, are presented in Table 3. Using normative criteria, the proportions of children in the EP/ELBW group with cognitive, language, and motor delay were only 13%, 21%, and 16%, respectively. The rates for the control group were well below those expected for normally distributed data: 13.6%, 2.1%, and 0.1%, for mild, moderate, and severe developmental delay, respectively. Furthermore, the rate of children in the total cohort with moderate to severe delay was minimal.
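The proportions expected under a normal distribution follow directly from the standard normal cumulative distribution function and the delay cutoffs defined in the Methods. The sketch below is illustrative; the variable and function names are assumptions, and it relies on `statistics.NormalDist` from the Python standard library (Python 3.8 or later).

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1


def expected_band(lower_z, upper_z):
    """Expected proportion of a normal population with lower_z <= z < upper_z."""
    return std_normal.cdf(upper_z) - std_normal.cdf(lower_z)


expected_mild = expected_band(-2, -1)      # from -2 SD to less than -1 SD
expected_moderate = expected_band(-3, -2)  # from -3 SD to less than -2 SD
expected_severe = std_normal.cdf(-3)       # less than -3 SD
expected_any = std_normal.cdf(-1)          # any delay: less than -1 SD

for label, value in [
    ("mild", expected_mild),
    ("moderate", expected_moderate),
    ("severe", expected_severe),
    ("any", expected_any),
]:
    print(f"{label}: {100 * value:.2f}%")
```

Taken together, roughly 16% of a normally distributed reference population would be expected to score below −1 SD, which is the benchmark against which the control group's rates can be judged.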
When delay was calculated on the basis of the control distribution, the rates rose considerably and were more in line with expectations (Table 3). Using this approach, one-third of the EP/ELBW group exhibited cognitive delay, and even higher proportions had language and motor delay. The proportion of children in the EP/ELBW group with moderate to severe delay was consistent with clinical impressions and previous studies. For the control group, the rate of delay varied from 12% for cognitive development to 17% for motor development.
The Bayley-III is currently the most commonly applied measurement tool for assessing early development both in clinical practice and research settings but, to date, limited evidence exists supporting its construct and predictive validity. Our study used the Bayley-III to assess the developmental profile of a geographic cohort of EP/ELBW 2-year-olds and a randomly selected control group. Our findings were contrary to expectations in that the rate of developmental delay for the EP/ELBW group was well below that reported previously6,8,9,30 and the rate of delay in the control group was negligible.
Possible explanations for these findings include (1) the Bayley-III's overestimation of developmental outcomes in 2-year-olds and, as such, the underestimation of developmental delay; (2) substantial improvement in developmental outcomes for EP/ELBW children and the recruitment of a high-achieving control group; and (3) systematic error in administration and/or scoring. We are confident that our findings are not owing to systematic error in administration/scoring, as our psychologists are experienced in conducting developmental assessments with the Bayley scales and all completed the accredited training program for the Bayley-III. The first 2 possible explanations, however, have important implications. We propose that the first explanation is the more likely and that standardized scores of the Bayley-III for 2-year-old children underestimate developmental delay and need to be interpreted with great caution. This premise is supported by the finding that the means for the control group for all Bayley-III scales were substantially above the standardized mean, whereas a previous control sample recruited by our group 8 years earlier, assessed using the BSID-II, had a mean (SD) MDI of 99 (15.4), indistinguishable from the expected mean value of 100. We used the same procedures to recruit the 2 control groups, and there have been no substantial demographic changes during such a short period to suggest that the control groups might be systematically different between eras. Thus, it is highly unlikely that these findings are owing to a high-achieving control group.
Furthermore, we doubt that the higher-than-expected standard scores of the EP/ELBW cohort reflect improved outcome, as the rate of delay judged according to the control group mirrors previous research. Most previous studies examining developmental outcomes in very preterm cohorts have used the BSID-II. In extremely preterm children, the rates of developmental delay determined using BSID-II reference values in cohorts born in the 1990s are high. For example, in cohorts of children born earlier than 25 weeks' gestation, Hintz et al8 reported rates of moderate to severe cognitive delay ranging from 40% to 47% at 18 to 22 months' corrected age, while moderate to severe motor delay ranged from 31% to 32%. The EPICure study assessed a geographic cohort of children with gestational ages of fewer than 26 weeks at 30 months, corrected, and based on MDI/PDI reference values, classified 64% of their cohort as delayed (34%, mild; 11%, moderate; 19%, severe).30 Hack et al9 have also reported high rates of mild to severe cognitive (68%) and motor (71%) delay in a hospital-based ELBW cohort born from 1992 to 1995. In cohorts from the state of Victoria with gestational ages of less than 28 weeks, the rates of mild, moderate, and severe developmental delay defined by the MDI relative to the mean (standard deviation) for randomly selected controls were 23%, 11%, and 7%, respectively, for those born in 1991 to 1992, and 22%, 9%, and 15%, respectively, for those born in 1997.31 As expected, lower rates of delay are reported in cohorts that include more mature infants. 
In a recently described New Zealand cohort of children with gestational ages of fewer than 33 weeks born from 2001 to 2002, one-third exhibited cognitive delay and 30% exhibited motor delay.6 Given previous developmental studies of EP/ELBW children, we had expected rates of overall delay in the 40% to 45% range, consistent with what we observed when delay was based on our control group, but much higher than what we observed when delay was based on Bayley-III reference values.
The structural differences between the Bayley-III and BSID-II mean that the scale scores from the 2 tests are not comparable, and direct comparisons with earlier studies that have used the BSID-II are problematic. Theoretically, rates of delay or impairment should increase rather than decrease with the introduction of new standardized measures such as the Bayley-III, owing to the gradual upward creep of developmental/intelligence quotient scores over time, often referred to as the Flynn effect.2,32 For example, we observed an increased sensitivity in detecting developmental delay when the BSID-II replaced the original BSID.7
One limitation of the current study is that our observations are restricted to reference values for 2-year-old children. Further research is needed to assess the appropriateness of reference values in other age bands; we therefore stress that our results should not be extrapolated to other ages prior to receiving the results of such studies. We recognize that there are cultural and other differences between Australia and the United States, where the Bayley-III was standardized; however, this was not an issue for previous Australian cohorts using the BSID-II, which was also standardized in the United States.
In conclusion, the Bayley-III seriously overestimated the developmental progress of 2-year-old Australian children. Given the extent of the overestimation that we observed, we have similar reservations regarding the Bayley-III's sensitivity to detect developmentally delayed children in other countries, including the United States, Canada, and England, but further research is clearly needed to confirm our suspicions. Also, the appropriateness of the measure and its reference values for children in other age bands needs to be studied. Our findings have important implications for clinical services, follow-up programs, and clinical trials that rely on the Bayley-III for the assessment of developmental delay, and we recommend caution in the interpretation of Bayley-III scores for high-risk children in the absence of appropriate control groups.
Corresponding Author: Peter Anderson, PhD, Victorian Infant Brain Studies, Royal Children's Hospital, Flemington Rd, Parkville, Victoria, Australia 3052.
Accepted for Publication: September 23, 2009.
Author Contributions: All authors had access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: Anderson, De Luca, Hutchinson, Roberts, and Doyle. Acquisition of data: Anderson, Hutchinson, and Roberts. Analysis and interpretation of data: Anderson, De Luca, Hutchinson, and Doyle. Drafting of the manuscript: Anderson and Hutchinson. Critical revision of the manuscript for important intellectual content: Anderson, De Luca, Hutchinson, Roberts, and Doyle. Statistical analysis: Anderson and Doyle. Obtained funding: Anderson. Administrative, technical, and material support: Anderson, De Luca, Hutchinson, and Roberts. Study supervision: Anderson.
Victorian Infant Collaborative Group: Catherine Callanan, RN, Noni Davis, FRACP, Julieanne Duff, FRACP, Elaine Kelly, MA, Marion McDonald, RN, Michael Stewart, FRACP, Linh Ung, BSc, Royal Women's Hospital; Elaine Kelly, MA, Gillian Opie, FRACP, Andrew Watkins, FRACP, Amanda Williamson, MA, Heather Woods, RN, Mercy Hospital for Women; Elizabeth Carse, FRACP, Margaret P. Charlton, MEd(Psych), PhD, Marie Hayes, RN, Monash Medical Center; Rod Hunt, PhD, FRACP, Michael Stewart, FRACP, Royal Children's Hospital, Melbourne, Australia.
Financial Disclosure: None reported.
Funding/Support: This study was supported in part by a project grant 454413 from the National Health and Medical Research Council, Australia.
Anderson PJ, De Luca CR, Hutchinson E, Roberts G, Doyle LW, the Victorian Infant Collaborative Group. Underestimation of Developmental Delay by the New Bayley-III Scale. Arch Pediatr Adolesc Med. 2010;164(4):352-356. doi:10.1001/archpediatrics.2010.20