Research Letter
February 2015

Accuracy of Bayley Scores as Outcome Measures in Trials of Neonatal Therapies

Author Affiliations
  • 1Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
  • 2Department of Pediatrics and Child Health, University of Manitoba, Winnipeg, Manitoba, Canada
  • 3Division of Neonatology, Children’s Hospital of Philadelphia and University of Pennsylvania, Philadelphia
JAMA Pediatr. 2015;169(2):188-189. doi:10.1001/jamapediatrics.2014.2965

Long-term follow-up of extremely preterm infants is essential because the proportion of infants who survive is increasing while neurodevelopmental impairment rates remain high.1 Psychometric tests, such as the Bayley Scales of Infant and Toddler Development, Third Edition (Bayley-III),2 are commonly used to measure early childhood outcomes in obstetric and neonatal randomized trials.3,4 However, the composite scores on psychometric tests can be inaccurate because of administrative and technical errors. To determine the magnitude of this problem, we performed a post hoc review of all errors detected by central source document verification of Bayley-III assessments during the 18-month follow-up of the Canadian Oxygen Trial (COT).5


Extremely preterm COT participants in 25 centers in 6 countries were randomly assigned to 2 oxygen saturation target ranges. The primary outcome at a corrected age of 18 to 21 months was death or survival with neurodevelopmental disability. Cognitive or language composite scores of less than 85 on the Bayley-III scales were included in the primary outcome.5 The research ethics boards of all clinical centers approved the protocol, and written informed consent was obtained from a parent or guardian of every study infant.
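As a rough illustration of the score threshold described above, the following Python sketch checks only the Bayley-III component of the primary outcome. The function name, the constant name, and the example scores are hypothetical and do not come from the trial; death and any other components of the composite outcome are outside its scope.

```python
# Cutoff taken from the text: composite scores of less than 85 counted
# toward the primary outcome. The constant name is our own.
BAYLEY_CUTOFF = 85


def bayley_component_met(cognitive, language):
    """Return True if either composite score falls below the cutoff.

    Covers only the Bayley-III part of the primary outcome; this is a
    simplification for illustration, not the trial's outcome algorithm.
    """
    return cognitive < BAYLEY_CUTOFF or language < BAYLEY_CUTOFF


# Made-up scores: a 1-point difference in one composite flips the result.
print(bayley_component_met(85, 90))
print(bayley_component_met(84, 90))
```

Because the rule is a strict threshold, a discrepancy of a single point near the cutoff is enough to change how a child is classified.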

Original Review Process During COT

Experienced examiners administered the Bayley-III test and submitted copies of the source documents to the coordinating center. Between October 15, 2008, and August 15, 2012, trained staff (J.D. and L.C.) reviewed the source documents for accuracy, completeness, and agreement with the electronic database entries. Discrepancies were identified and corrected through a formal data clarification process with input from Bayley-III examiners at the respective study sites. A small number of queries (n = 36) could not be resolved directly with the clinical centers and were referred to an adjudication committee consisting of a developmental pediatrician (D.M.) and a neurodevelopmental consultant (K.P.).

Post Hoc Analysis

In this study, 1 assessor (J.D.) reexamined all Bayley-III source documents and classified the errors identified during the original review process into 5 categories: calculation of the child’s corrected age, documenting or applying scoring rules, raw score addition, look-up of scaled scores and composite scores in normative tables, and electronic data entry. A 15% random sample was independently classified by a second assessor (L.C.) to ensure consistent classification of error types. Only 2 disagreements were found, and both were resolved through consensus. The error categories are hierarchical in nature: an early mistake (for example, in the calculation of the corrected age) may affect all subsequent steps. We counted such errors only once, assigned them to the category in which they first occurred, and summarized the frequency of independent errors in each of the 5 categories.
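The counting rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the short category labels, the representation of each error as the set of steps it affected, and the example data are all hypothetical.

```python
# The 5 categories in workflow order, as listed in the text.
# The short labels are hypothetical names chosen for this sketch.
CATEGORIES = [
    "corrected_age",
    "scoring_rules",
    "raw_score_addition",
    "table_lookup",
    "data_entry",
]


def earliest_category(affected_steps):
    """Return the first category, in workflow order, touched by one error."""
    return min(affected_steps, key=CATEGORIES.index)


def count_independent_errors(assessments):
    """Count each error once, in the category where it first occurred.

    `assessments` is a list with one entry per source document; each entry
    is a list of errors, and each error is the set of steps it affected.
    """
    counts = {category: 0 for category in CATEGORIES}
    for errors in assessments:
        for affected_steps in errors:
            counts[earliest_category(affected_steps)] += 1
    return counts


# Made-up example: 3 documents, 3 independent errors in total.
example = [
    [],  # no errors
    [{"corrected_age", "table_lookup", "data_entry"}],  # 1 propagated error
    [{"raw_score_addition", "table_lookup"}, {"data_entry"}],  # 2 errors
]
print(count_independent_errors(example))
```

Modeling each error as the set of steps it touched keeps a propagated mistake from being counted again at every later step, while still allowing a single document to contribute more than 1 independent error.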


During COT follow-up, the source documents for 936 of 954 (98.1%) Bayley-III assessments were submitted to the coordinating center. Eighteen children could not be evaluated because of severe developmental delay or autism. Of 936 source documents, 576 (61.5%) contained no errors. The remaining 360 (38.5%) contained at least 1 error, and the total number of independent errors was 387 (Table). None of the 25 clinical centers were completely error free. The best and worst center-specific error rates were 8 of 61 (13.1%) and 14 of 16 (87.5%), respectively. Had they not been detected during the original review process, 41 of 387 (10.6%) incorrectly reported composite scores would have changed the determination of the composite primary outcome in COT.
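The proportions reported above can be rechecked from the counts alone. The short Python sketch below uses only figures stated in the text; the helper name `pct` and the dictionary labels are our own.

```python
def pct(numerator, denominator):
    """Percentage rounded to 1 decimal place, as reported in the text."""
    return round(100 * numerator / denominator, 1)


# (numerator, denominator) pairs copied from the results paragraph.
counts = {
    "source documents submitted": (936, 954),
    "documents with no errors": (576, 936),
    "documents with at least 1 error": (360, 936),
    "best center-specific error rate": (8, 61),
    "worst center-specific error rate": (14, 16),
    "errors that would have changed the primary outcome": (41, 387),
}

for label, (numerator, denominator) in counts.items():
    print(f"{label}: {numerator}/{denominator} = {pct(numerator, denominator)}%")

# Internal consistency: error-free and error-containing documents
# should add up to all submitted documents.
assert 576 + 360 == 936
```

Note that the 387 independent errors exceed the 360 documents with at least 1 error, so some documents contained more than 1 independent error.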

Type and Frequency of Errors in Bayley-III Assessments

Experienced Bayley-III examiners made numerous administrative, scoring, and reporting errors in this international multicenter trial. Errors were detected and corrected in real time by comprehensive and rigorous central source document verification. The use of computer-assisted scoring software may reduce calculation and table look-up errors but cannot correct the effects of administrative or clerical errors that are made before data are entered into the program.


This study underscores the importance of diligent administration and recording of psychometric assessments. We recommend central source document verification of all psychometric tests that contribute to the primary outcome in large multicenter trials of perinatal and neonatal therapies.

Article Information

Corresponding Author: Lorrie Costantini, BA, Department of Clinical Epidemiology and Biostatistics, McMaster University, Neonatal Trials Group, DBCVSRI Hamilton General Hospital Campus, 237 Barton St East, Hamilton, ON L8L 2X2, Canada (costan@mcmaster.ca).

Published Online: December 29, 2014. doi:10.1001/jamapediatrics.2014.2965.

Author Contributions: Mss Costantini and D’Ilario had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Study concept and design: Costantini, Schmidt.

Acquisition, analysis, or interpretation of data: Costantini, D’Ilario, Moddemann, Penner.

Drafting of the manuscript: Costantini.

Critical revision of the manuscript for important intellectual content: All authors.

Statistical analysis: Costantini.

Administrative, technical, or material support: Costantini, D’Ilario, Moddemann, Penner.

Study supervision: Schmidt.

Conflict of Interest Disclosures: None reported.

Funding/Support: The Canadian Oxygen Trial was supported by grant MCT-79217 from the Canadian Institutes for Health Research (Drs Schmidt and Moddemann).

Role of the Funder/Sponsor: The funding source had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

1. Vohr BR. Neurodevelopmental outcomes of extremely preterm infants. Clin Perinatol. 2014;41(1):241-255.
2. Bayley N. Bayley Scales of Infant and Toddler Development. 3rd ed. San Antonio, TX: Psychological Corp; 2006.
3. Benjamin DK Jr, Hudak ML, Duara S, et al; Fluconazole Prophylaxis Study Team. Effect of fluconazole prophylaxis on candidiasis and mortality in premature infants: a randomized clinical trial. JAMA. 2014;311(17):1742-1749.
4. Wapner RJ, Sorokin Y, Mele L, et al; National Institute of Child Health and Human Development Maternal-Fetal Medicine Units Network. Long-term outcomes after repeat doses of antenatal corticosteroids. N Engl J Med. 2007;357(12):1190-1198.
5. Schmidt B, Whyte RK, Asztalos EV, et al; Canadian Oxygen Trial (COT) Group. Effects of targeting higher vs lower arterial oxygen saturations on death or disability in extremely preterm infants: a randomized clinical trial. JAMA. 2013;309(20):2111-2120.