ADSV indicates Analysis of Dysphonia in Speech and Voice (KayPENTAX); and CPP F0, cepstral peak fundamental frequency.
eTable 1. Averaged ADSV Data for Male Participants.
eTable 2. Averaged ADSV Data for Female Participants.
Infusino SA, Diercks GR, Rogers DJ, Garcia J, Ojha S, Maurer R, Bunting G, Hartnick CJ. Establishment of a Normative Cepstral Pediatric Acoustic Database. JAMA Otolaryngol Head Neck Surg. 2015;141(4):358-363. doi:10.1001/jamaoto.2014.3545
Copyright 2015 American Medical Association. All Rights Reserved. Applicable FARS/DFARS Restrictions Apply to Government Use.
Importance
Few studies have used objective measures to evaluate the development of the normal pediatric voice. Cepstral analysis of continuous speech samples is a reliable method for gathering acoustic data; however, it has not been used to examine the changes that occur with voice development.
Objective
To establish and characterize acoustic patterns of the normal pediatric voice using cepstral analysis of voice samples from a normal pediatric voice database.
Design, Setting, and Participants
Cross-sectional study of 218 children aged 4 to 17 years, for whom English was the primary language spoken at home, conducted at a pediatric otolaryngology practice and pediatric practice in a tertiary hospital (April 2012–May 2014).
Interventions and Exposures
Sustained vowel utterances and continuous speech samples (4 Consensus Auditory-Perceptual Evaluation of Voice [CAPE-V] and 2 sentences from the rainbow passage) were recorded and analyzed from children with normal voices.
Main Outcomes and Measures
Normal values were collected for the acoustic measures studied (ie, fundamental frequency, cepstral peak fundamental frequency, cepstral peak prominence [CPP], low-to-high spectral ratio [L/H ratio], and cepstral-spectral index of dysphonia in recorded phrases) and compiled into a normative acoustic database.
Results
Significant changes in fundamental frequency were observed with a distinct shift in slope at ages 11 and 14 years in boys for sustained vowel (ages 4-11 years, −6.83 Hz/y [P < .001]; 11-14 years, −27.62 Hz/y [P < .001]; and 14-17 years, −5.68 Hz/y [P = .001]), all voiced (ages 4-11 years, −4.19 Hz/y [P = .002]; 11-14 years, −29.42 Hz/y [P < .001]; and 14-17 years, −4.63 Hz/y [P < .001]), glottal attack (ages 4-11 years, −4.51 Hz/y; 11-14 years, −27.23 Hz/y; and 14-17 years, −1.70 Hz/y [P < .001 for all]), and rainbow (ages <14 years, −20.68 Hz/y [P < .001]; and 14-17 years, −4.50 Hz/y [P = .001]) recordings. A decreasing linear trend in fundamental frequency among all recordings (vowel, all voiced, easy onset, glottal attack, plosives, and rainbow) was found in girls (−2.56 Hz/y [P < .001], −3.48 Hz/y [P < .001], −2.82 Hz/y [P < .001], −3.49 Hz/y [P < .001], −2.30 Hz/y [P < .001], and −2.98 Hz/y [P = .01], respectively). A linear increase in CPP was seen with age in boys, with significant changes seen in recordings for vowel (0.10 dB/y [P = .05]), all voiced (0.2 dB/y [P < .001]), easy onset (0.13 dB/y [P < .001]), glottal attack (0.12 dB/y [P < .001]), plosives (0.15 dB/y [P < .001]), and rainbow (0.17 dB/y [P = .006]). A significant linear increase in CPP for girls was only seen in all voiced (0.13 dB/y [P < .001]). L/H ratio showed a linear increase with age among all speech samples (vowel, all voiced, easy onset, glottal attack, plosives, and rainbow) in boys (1.14 dB/y [P < .001], 0.92 dB/y [P < .001], 1.19 dB/y [P < .001], 0.79 dB/y [P < .001], 0.69 dB/y [P < .001], and 0.54 dB/y [P = .002], respectively) and girls (0.96 dB/y, 0.60 dB/y, 0.75 dB/y, 0.37 dB/y, 0.44 dB/y, and 0.58 dB/y, respectively [P ≤ .001 for all]).
Conclusions and Relevance
This represents the first pediatric voice database using frequency-based acoustic measures. Our goal was to characterize the changes that occur in both male and female voices as children age. These findings help illustrate how acoustic measurements change with development and may aid in our understanding of the developing voice, pathologic changes, and response to treatment.
Spoken language is our primary form of communication and facilitates our social development, education, and interpersonal and professional interactions. Childhood and adolescence represent essential periods of development for both vocal quality and speech. Pediatric voice disorders, which affect up to 10% of the population,1,2 can have a profound impact on a child’s social and intellectual development, highlighting the importance of better understanding the normal process of vocal development. However, objective data characterizing this progression are lacking. Without an understanding of how acoustic measurements of the normal pediatric voice develop and change over time, our ability to use these objective measurements in evaluating children with voice disorders and their response to treatments is limited.
Perceptual analysis performed by trained speech pathologists and clinicians has traditionally been used to study characteristics of the normal voice and to identify voice disorders. A relatively modern assessment created to judge voice quality and the severity of disorders is the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V).3 To perform a CAPE-V analysis, the patient reads the following 4 sentences:
• Easy onset, “How hard did he hit him?”
• All voiced, “We were away a year ago.”
• Glottal attack, “We eat eggs every Easter.”
• Plosives, “Peter will keep at the peak.”
Unlike analyses performed by a trained observer, acoustic measurements calculate vocal properties from algorithms that use period and frequency and can provide objective information about the qualities of the normal and dysphonic voice. Time-based algorithms performed on sustained vowel utterances provide information about cycle-to-cycle variations in frequency and amplitude, which help objectively characterize subjective perceptions of breathiness, roughness, and strain.4
In 2012, Maturo et al7 published the first large series of time-based acoustic measurements from normative voice recordings in children. Three hundred thirty-five boys and girls ranging in age from 4 to 18 years were recorded speaking sustained vowel utterances, and then fundamental frequency (F0), jitter, shimmer percentage, and noise-to-harmonic ratio were calculated. Their findings did not reveal age-based differences in acoustic properties such as jitter and shimmer but did show differences across age for noise-to-harmonic ratio. They also found certain transitional periods of development, marked by critical age points, where significant changes in F0 began.7 This finding gives credibility to the significance of a vocal fold with a developing layered structure rather than a simple cord where function changes in linear relation to cord length.8
One potential limitation of the work by Maturo et al7 is that acoustic measurements were calculated using 1 voice sample and that sustained vowel utterances obtained from children at 1 point in time might lack reliability. To address this issue of consistency, Hill et al9 obtained sustained vowel utterances from 50 boys and girls aged 4 to 17 years at 2 narrowly spaced points in time. Time-based acoustic measurements between sustained vowel utterances were found to show consistency only in F0. Poor reliability was seen with jitter, shimmer, and noise-to-harmonic ratio.9 This finding suggests that sustained vowel utterances should not be the only voice sample used to obtain acoustic measurements in children.
Cepstral analysis uses frequency-based algorithms to examine continuous speech samples. These algorithms are used to create a cepstrum, the broken-down components of a speech sample represented as a distribution of pitches and their corresponding intensities (measured in decibels). A great deal of information can be uncovered from a cepstrum, such as frequency and variability, much like time-based acoustic analyses. Unlike sustained vowel utterances, however, cepstral analyses of continuous speech samples may provide a more representative assessment of a patient’s voice.10-15 In 2013, Diercks et al16 demonstrated that standardized continuous voice recordings obtained from normal children at 2 points in time were consistent for measures of fundamental frequency and variability.
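As a rough illustration of the idea, the following Python sketch computes a real cepstrum from a single frame and reads off the cepstral peak. It is a generic textbook formulation, not the ADSV algorithm, whose internals are not described here. The function name `cepstral_peak`, the 60 to 500 Hz search range, the Hann window, and the straight-line baseline used for the prominence value are all assumptions made for the example, and the input is a synthetic 200-Hz harmonic signal rather than recorded speech.

```python
import numpy as np

def cepstral_peak(frame, sample_rate, f0_min=60.0, f0_max=500.0):
    """Return (f0_estimate_hz, peak_prominence_db) for one frame (illustrative only)."""
    windowed = frame * np.hanning(len(frame))
    # Log-magnitude spectrum; the small constant avoids log(0).
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    # Real cepstrum: inverse transform of the log-magnitude spectrum.
    cepstrum = np.fft.irfft(log_spectrum)
    cepstrum_db = 20.0 * np.log10(np.abs(cepstrum) + 1e-12)

    # Search only the quefrencies that correspond to the assumed F0 range.
    lo = int(sample_rate / f0_max)
    hi = int(sample_rate / f0_min)
    quefrency = np.arange(lo, hi + 1) / sample_rate  # seconds
    segment = cepstrum_db[lo:hi + 1]

    peak = int(np.argmax(segment))
    f0_estimate = 1.0 / quefrency[peak]

    # Prominence here: height of the peak above a straight line fitted
    # to the cepstrum over the search range (an assumed baseline).
    slope, intercept = np.polyfit(quefrency, segment, 1)
    prominence = segment[peak] - (slope * quefrency[peak] + intercept)
    return f0_estimate, prominence

# Synthetic test signal: harmonics of 200 Hz with amplitudes falling as 1/k.
sample_rate = 16000
t = np.arange(1024) / sample_rate
demo_frame = sum(np.sin(2 * np.pi * 200.0 * k * t) / k for k in range(1, 40))
f0_demo, cpp_demo = cepstral_peak(demo_frame, sample_rate)
print(f"estimated F0: {f0_demo:.1f} Hz, peak prominence: {cpp_demo:.1f} dB")
```

On real recordings, a value like this would typically be computed frame by frame and then summarized across the sample.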
Given the advantages of frequency-based acoustic measurements on standardized continuous speech samples, the goal of this research was to establish the first cepstral analysis database in children aged 4 to 17 years. With this database, we hope to provide a more accurate representation of the developing pediatric voice compared with prior studies. Without first characterizing the normal pediatric voice, acoustic measurements cannot be used to quantify degrees of dysphonia and treatment response in children with vocal dysfunction.
This study was carried out between April 2012 and May 2014 with the approval of the institutional review board of the Massachusetts Eye and Ear Infirmary. Patient participation in this study was dependent on the agreement of both the parent and child, with written informed consent obtained. Patients between the ages of 4 and 17 years were recruited in an outpatient pediatric otolaryngology clinic as well as a general pediatric clinic. Patients with a hearing loss greater than 20 dB, history of airway procedures or voice disorders, developmental delay, or cognitive delay were excluded from the study. All patients used English as their primary language spoken at home.
Children were seated in a quiet room with an adjustable microphone (Shure Beta 53; Shure Inc) placed to the right oral commissure. Recordings were collected with a Dell Optiplex 960 personal computer (Dell Inc) with an Intel Core 2 Duo CPU (3.1 GHz, 1.94 GB of RAM) running Microsoft Windows XP Professional Version 2002 (Microsoft Corp). The Multidimensional Voice Program (MDVP; model 5101) and Analysis of Dysphonia in Speech and Voice (ADSV; model 5109) for the Computerized Speech Laboratory (model 4500) (KayPENTAX) were used to analyze voice recordings. The MDVP was used to evaluate F0 for sustained vowel utterances. The ADSV was used to evaluate cepstral peak fundamental frequency (CPP F0), cepstral peak prominence (CPP), CPP standard deviation (SD), low-to-high spectral ratio (L/H ratio), L/H ratio SD, and cepstral-spectral index of dysphonia (CSID) in sustained vowel utterances, CAPE-V sentences, and the second and third sentences of the rainbow passage (“The rainbow is the division of white light into many beautiful colors. These take the shape of a long round arch with its path high above and its 2 ends apparently beyond the horizon”). The definitions of these values are given in Table 1.
When recording sustained vowel utterances, children were asked to maintain the vowel /a/ at a comfortable pitch and loudness level for 4 seconds. For CAPE-V and rainbow sentence recordings, children were asked to use their normal speaking voice. The rainbow sentences could not be reliably spoken by younger participants and were included only for children aged 10 to 17 years. Samples from 220 children (113 male and 107 female) were recorded.
We used descriptive statistics, obtaining mean (SD), median, and range, for both MDVP and ADSV data across age. Data were summarized as both sexes combined and stratified by sex. For each set of data, means with error bars (SDs) were initially plotted for visual inspection of slope for both sexes combined and stratified by sex. Locally weighted scatterplot smoothing (LOESS) was then applied to sex-stratified data to determine whether the values changed linearly over time or showed break points where the slope changed. If the values changed at a constant rate over time, a simple linear regression was fitted. If LOESS fits showed any break points of change in slope, then piecewise regression was performed.
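The piecewise regression step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' statistical code: it assumes the break points are already known (here 11 and 14 years, as would be read from a LOESS curve), it constrains the segments to join continuously at the break points (the source does not say whether this constraint was used), and it fits noise-free synthetic data. The helper name `fit_piecewise` and the 300-Hz starting value are hypothetical.

```python
import numpy as np

def fit_piecewise(age, value, breakpoints):
    """Least-squares fit of a continuous piecewise-linear curve.

    Returns the slope of each segment, in order of increasing age.
    """
    # Columns: intercept, age, and one hinge term max(0, age - bp) per break point.
    columns = [np.ones_like(age), age]
    for bp in breakpoints:
        columns.append(np.maximum(0.0, age - bp))
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, value, rcond=None)
    # The first slope is coef[1]; each hinge coefficient is the change in slope
    # at its break point, so a cumulative sum gives the per-segment slopes.
    return np.cumsum(coef[1:])

# Synthetic, noise-free data built from three assumed slopes (Hz per year).
ages = np.arange(4, 18, dtype=float)
true_slopes = np.array([-6.83, -27.62, -5.68])
synthetic_f0 = (
    300.0  # hypothetical value at age 4
    + true_slopes[0] * (ages - 4.0)
    + (true_slopes[1] - true_slopes[0]) * np.maximum(0.0, ages - 11.0)
    + (true_slopes[2] - true_slopes[1]) * np.maximum(0.0, ages - 14.0)
)

segment_slopes = fit_piecewise(ages, synthetic_f0, [11.0, 14.0])
print(np.round(segment_slopes, 2))
```

When no break point is supplied, the same function reduces to a simple linear regression with a single slope.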
A total of 220 children (113 male and 107 female) participated in this study (Table 2). All participants had normal voices that were either self-reported or confirmed by their parents and were screened for any disqualifying factors prior to recording.
Piecewise regressions showed distinct periods of change, marked by critical points, in male frequency measures across age, with both the MDVP and ADSV data (Figure). The rate of change of the MDVP and ADSV data for each distinct period, marked by critical points, is given in Table 3. In the male population, MDVP F0 and ADSV CPP F0 from sustained vowel, all voiced, and glottal attack all showed statistically significant critical points at ages 11 and 14 years. These transitional periods were not observed in frequency data from our female participants. Instead, a more linear relationship was seen with the mean frequency significantly decreasing across age (Table 4 and Figure).
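To make the size of these transitions concrete, the per-year slopes reported for sustained vowel F0 in boys can be converted into an approximate cumulative change over each period. The short Python sketch below does this arithmetic. It assumes that each slope applies uniformly across its whole interval and that the intervals are exactly 4 to 11, 11 to 14, and 14 to 17 years, so the result is an approximation rather than a reported finding.

```python
# Slopes (Hz per year) reported for sustained vowel F0 in boys.
segments = [
    ("4-11 y", 4, 11, -6.83),
    ("11-14 y", 11, 14, -27.62),
    ("14-17 y", 14, 17, -5.68),
]

# Change over each period = slope x length of the period in years.
changes = {label: slope * (end - start) for label, start, end, slope in segments}
total_change = sum(changes.values())

for label, change in changes.items():
    print(f"{label}: {change:.2f} Hz")
print(f"Total, ages 4 to 17 y: {total_change:.2f} Hz")
```

Under these assumptions, the 3 years between ages 11 and 14 account for roughly 83 Hz of an overall decrease of about 148 Hz, more than the preceding 7 years combined.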
For CPP in the male population, linear regression showed a significant positive slope for all voiced (approximate change per increasing year, 0.2 dB; P < .001), easy onset (0.13 dB; P < .001), glottal attack (0.12 dB; P < .001), plosives (0.15 dB; P < .001), and the rainbow sentence (0.17 dB; P < .01). Only all voiced had a linear regression that showed significant positive slope in CPP for girls (approximate change per increasing year, 0.13 dB; P < .001). Low-to-high spectral ratio showed a small, yet significant linear increase in all ADSV recordings against age for both sexes (Table 4).
The measured ADSV variables for each sex and age group are given in eTables 1 and 2 in the Supplement.
To our knowledge, this is the largest study evaluating healthy voice development in children using cepstral analysis. Prior databases have been collected using time-based acoustic measures; however, sustained vowel utterances do not provide reliable data regarding variability in acoustic signal. Continuous speech samples can be evaluated objectively through cepstral analysis and provide a more representative speech sample for analysis.
Fundamental frequency—the rate of vocal fold vibration—is a common measurement in voice analyses. In 2012, Maturo et al7 published the largest database examining normal pediatric voice development using time-based algorithms. This work demonstrated specific periods of transition in F0 during development in both male and female participants. In male participants, F0 began to shift at age 12 years and reached maturation at age 16 years. In female participants, F0 began to transition at age 11 years and reached maturation at age 14 years.7
While our results from both time-based and cepstral-based algorithms support the findings of Maturo et al7 of 2 transition periods for F0 in male participants, our transitional periods were marked by critical points primarily at 11 and 14 years. In addition, our fundamental frequency data for female participants showed a decreasing linear correlation with age with no critical points. This discrepancy is surprising because both studies were performed primarily in the same clinic at the Massachusetts Eye and Ear Infirmary with very similar sample populations. A possible explanation for these disparities could stem from the different researchers collecting the voice data or the larger sample size used in the study by Maturo et al.7 While further work should be done to solve this inconsistency, it is important to note that our results also include additional ADSV data. The transitional periods in F0 we found in male participants aged 11 to 14 years were seen with MDVP data as well as 4 of 6 ADSV CPP F0 measures (vowel, all voiced, glottal attack, and rainbow). In female participants, no transition periods in F0 were seen with the MDVP data or any of the 6 additional ADSV frequency measures we collected. It is possible that with our female population, critical points in F0 and CPP F0 were too subtle to detect and a larger sample size is needed.
In 2013, Diercks et al16 demonstrated that cepstral analysis of continuous speech recordings offers a more reliable method for measuring voice perturbation and variability. Frequency-based acoustic measurements of signal variability, including CPP, L/H ratio, L/H ratio SD, and CSID score (multivariate index measure), are all reliable calculations for measuring the pediatric voice.16 Cepstral peak prominence is the difference between F0 intensity and the average intensity of the voice sample.9 Previous literature has reported that a low CPP correlates strongly with perceived ratings of dysphonia and breathiness.12,14,18,19 Our data showed a small but significant increasing trend in CPP in male participants (Table 4). One possible explanation is that as age increases, phonation is better controlled during speech, which leads to a higher CPP. In female participants, a significant increasing trend was seen only with all voiced (Table 4). Larger sample sizes in future studies may better resolve potential trends in vowel, easy onset, glottal attack, plosives, and rainbow with age. L/H ratio is a measure of the distribution of low- to high-frequency energy in a voice sample, and studies have associated low values of this statistic with dysphonia.3,16 We found a significant linear increase in L/H ratio in both sexes for all phrases. As F0 decreases with age, the added lower-frequency energy could raise L/H values.
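The L/H ratio described above can be illustrated with a short Python sketch that compares spectral energy below and above a cutoff frequency. This is an assumed, simplified formulation and not the ADSV computation. The 4000-Hz cutoff, the Hann window, the single-frame analysis, and the function name `low_high_ratio_db` are choices made for the example, and the inputs are synthetic tones rather than speech.

```python
import numpy as np

def low_high_ratio_db(frame, sample_rate, cutoff_hz=4000.0):
    """Ratio (in dB) of spectral energy below the cutoff to energy at or above it."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    low_energy = power[freqs < cutoff_hz].sum()
    high_energy = power[freqs >= cutoff_hz].sum()
    return 10.0 * np.log10(low_energy / high_energy)

# Two synthetic signals: one dominated by low-frequency energy, one balanced.
sample_rate = 16000
t = np.arange(1024) / sample_rate
low_tone = np.sin(2 * np.pi * 200.0 * t)
high_tone = np.sin(2 * np.pi * 6000.0 * t)

ratio_low_dominant = low_high_ratio_db(low_tone + 0.1 * high_tone, sample_rate)
ratio_balanced = low_high_ratio_db(low_tone + high_tone, sample_rate)
print(f"low-dominant: {ratio_low_dominant:.1f} dB, balanced: {ratio_balanced:.1f} dB")
```

In this toy case, the signal with relatively more low-frequency energy yields the higher ratio, which is the direction of change the text associates with increasing age.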
We initially believed that we could compare existing normative CSID data to children’s values in our study to exclude any candidates with voice abnormalities. Diercks et al16 found that the CSID score generated by the ADSV software was a reliable measure when examined across 2 points in time. Publications examining CSID have shown that it can distinguish between normal and dysphonic voices and that the score is consistent with perceptual ratings from speech pathologists.20,21 However, the CSID scoring system for the ADSV software was developed based on adult data, and normative values have only been researched and published in adult populations. Our work suggests that the values calculated from mature voices are not representative of the pediatric population. While all of the children included in the study had normal voices, younger children tended to display higher CSID values that would otherwise be labeled as dysphonic in adults. Our pediatric CSID values for easy onset, all voiced, glottal attack, and plosives all decreased with age, eventually falling within the normative range seen in adult populations. For both sexes, CSID values for sustained vowel were still high by age 17 years. However, this is expected because sustained vowels have a higher variability than continuous speech samples.16 Values from the rainbow passage remained relatively constant, likely because these recordings were obtained only in children aged 10 years and older. To define expected CSID scores for children across age and sex, further work needs to be done to characterize the normal ranges and values for each age group.
One potential limitation of this study was that patients were recruited in a tertiary care pediatric otolaryngology practice and general pediatric practice and may not accurately represent the general pediatric population. We did, however, attempt to include only children with normal voices through appropriate participant screening. In addition, we did not screen for external factors, such as sleep deprivation, caffeine intake, and hydration, which can also affect accuracy of acoustic measurements. Some age groups in this study are not represented equally, and because the patient population is stratified by sex and age, each group has a relatively small sample size. Therefore, some trends in the data, such as clear transitional periods in female F0 development, may emerge given a larger sample size. While Diercks et al16 were able to show that continuous recordings of children aged 4 to 18 years were reliable, they did not look at each individual age group. It is possible that some of the age groups used in this study cannot reliably produce continuous speech.
A great deal of work has been done examining the acoustic properties of the human voice in both normal and dysphonic populations. While there is good literature on adult normative acoustic values, it is not applicable to pediatric populations.4 Several studies have used acoustic measures to examine dysphonia in children,11,22-26 yet there is currently no database that looks at these measures in normal developing voices. To better understand dysphonic voices in children and the developing voice in general, it is important to properly characterize the normal pediatric voice. Future work with cepstral analysis needs to be performed using larger sample sizes to more accurately define the acoustic properties of the pediatric voice.
Although we have shown both distinct and subtle changes of the developing voice using cepstral analysis, the underlying causes of these changes have yet to be identified. Histological work has shown that there is a clear change in vocal fold architecture from birth to adulthood.8 Other factors such as vocal fold length and hormonal influences are also linked to acoustic development.7,27 It is likely a combination of several of these characteristics that leads to the mature, adult voice.
To our knowledge, this is the first study to look at normative pediatric voice development using cepstral-based algorithms to analyze continuous speech. With our voice recordings, we have been able to map the development of the pediatric voice for both male and female populations. Our data corroborate previous findings that male voice development has a transitional period marked by 2 critical points in frequency change. While many studies have examined acoustic measurements of the abnormal voice based on time- and frequency-based algorithms, more work is needed to define the acoustic profiles of the normal pediatric voice. Voice maturation is a complex process that involves changes in vocal fold length and tissue architecture. Correlating data about the acoustic profile of normal children with changes in vocal fold structure during development may help us to understand more about how the voice matures. Moving forward, we hope these data can facilitate additional acoustic voice investigations and eventually aid in diagnosing voice disorders and measuring response to treatment.
Submitted for Publication: August 5, 2014; final revision received October 15, 2014; accepted December 2, 2014.
Corresponding Author: Scott A. Infusino, BA, Department of Otolaryngology, Massachusetts Eye and Ear Infirmary, 243 Charles St, Boston, MA 02114 (Sainfusino@gmail.com).
Published Online: January 22, 2015. doi:10.1001/jamaoto.2014.3545.
Author Contributions: Mr Infusino had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Infusino, Diercks, Bunting, Hartnick.
Acquisition, analysis, or interpretation of data: Infusino, Diercks, Rogers, Garcia, Ojha, Maurer, Bunting.
Drafting of the manuscript: Infusino, Garcia, Ojha.
Critical revision of the manuscript for important intellectual content: Infusino, Diercks, Rogers, Garcia, Maurer, Bunting, Hartnick.
Statistical analysis: Maurer.
Administrative, technical, or material support: Infusino, Rogers, Ojha, Bunting, Hartnick.
Study supervision: Diercks, Rogers, Bunting, Hartnick.
Conflict of Interest Disclosures: None reported.