Diercks GR, Ojha S, Infusino S, Maurer R, Hartnick CJ. Consistency of Voice Frequency and Perturbation Measures in Children Using Cepstral Analyses: A Movement Toward Increased Recording Stability. JAMA Otolaryngol Head Neck Surg. 2013;139(8):811-816. doi:10.1001/jamaoto.2013.3926
Copyright 2013 American Medical Association. All Rights Reserved. Applicable FARS/DFARS Restrictions Apply to Government Use.
Few studies have evaluated the pediatric voice objectively using acoustic measurements. Furthermore, consistency of these measurements across time, particularly for continuous speech, has not been evaluated.
(1) To evaluate normal pediatric voice frequency and perturbation using both time-based and frequency-based acoustic measurements, and (2) to determine if continuous speech samples facilitate increased recording stability.
Prospective, longitudinal study.
Pediatric otolaryngology practice within a tertiary hospital.
Forty-three children, ages 4 to 17 years.
Intervention or Exposure
Sustained vowel utterances and continuous speech samples, which included 4 Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) sentences and the first sentence of the “rainbow passage” (“A rainbow is a division of white light into many beautiful colors that takes the shape of a long round arch, with its path high above and its 2 ends apparently beyond the horizon”), were obtained at 2 time points.
Main Outcome and Measure
Intraclass correlation coefficients (ICCs) were calculated to assess reliability between speech samples.
Fundamental frequency of sustained vowel utterances had excellent reliability (ICC ≥ 0.94). Time-based analyses of perturbation in sustained vowel utterances demonstrated poor reliability (ICC < 0.40), while frequency-based analyses of perturbation for these utterances demonstrated good to excellent reliability (ICC > 0.40). Fundamental frequency of continuous speech samples had excellent reliability (ICC ≥ 0.94). Frequency-based analyses of continuous speech samples demonstrated excellent reliability (ICC > 0.75) for all but 1 variable, which demonstrated good reliability (cepstral-spectral index of dysphonia of the all voiced sample; ICC = 0.72).
Conclusions and Relevance
Sustained vowel utterance and continuous speech samples provide consistent measures of fundamental frequency. Frequency-based analysis of sustained vowel recordings improves the reliability of perturbation measures. Continuous speech recordings can be obtained in children and demonstrate good to excellent reliability across recordings. This suggests that frequency-based analysis of continuous speech may be more representative of a child’s voice and therefore may be of use in studying both the developmental changes of the pediatric voice and vocal changes pretreatment and posttreatment in children with voice disorders.
Voice and speech disorders are prevalent in the pediatric population, affecting up to 3% to 10% of children,1,2 and may adversely affect a child’s ability to communicate, resulting in psychological and emotional stress, and have an impact on social interactions and educational milestones.3,4 Therefore, identification of children at risk is paramount.
Diagnosis of voice disorders in children relies on both subjective and objective measures of voice. Children may be identified by self-reporting, completion of validated quality-of-life surveys, perceptual analysis by experienced clinicians, laryngoscopy, and through acoustic measures, which are computer-assisted analyses of vocal quality.5 Subjective analyses may present issues with interrater reliability and the ability to compare different populations, and laryngoscopy, particularly in the pediatric population, presents a challenge. Acoustic measures of voice, however, are objective and easily obtained and may help to quantify the severity of dysphonia and response to treatment.6
Acoustic measurements can be obtained using time-based algorithms and frequency-based algorithms to extract data about dominant frequency and signal perturbation. Time-based measures rely on accurate identification of periodic cycle boundaries and calculate cycle-to-cycle variations in frequency (jitter), amplitude (shimmer), and waveform shape. If there is considerable ambient noise, a highly perturbed signal, or irregular or aperiodic vocal cord motion, these time-based measurements may be inconsistent.7-9 In addition, time-based measures cannot be used to analyze continuous speech.7 Some authors suggest that continuous speech may be a more valid method of assessing vocal quality, as it may correlate better with perceptual ratings and show greater degrees of impairment than sustained vowel utterances in individuals with dysphonia.7
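The cycle-to-cycle idea behind these time-based measures can be illustrated with a minimal sketch. It assumes the cycle boundaries have already been found, so that each glottal cycle is represented by one period and one peak amplitude, and it uses a simple relative-average formula (mean absolute difference between consecutive cycles, as a percentage of the mean). The function names and sample values are hypothetical, and the exact algorithms used by MDVP may differ.

```python
def relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference as a percentage of the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


def jitter_percent(periods_s):
    """Cycle-to-cycle variation in period (frequency perturbation)."""
    return relative_perturbation(periods_s)


def shimmer_percent(peak_amplitudes):
    """Cycle-to-cycle variation in peak amplitude (amplitude perturbation)."""
    return relative_perturbation(peak_amplitudes)


# Hypothetical cycle data for a voice near 250 Hz (period about 4 ms).
periods = [0.0040, 0.0041, 0.0039, 0.0040]
amplitudes = [1.00, 0.95, 1.05, 1.00]
print(round(jitter_percent(periods), 2), round(shimmer_percent(amplitudes), 2))
```

Both functions depend entirely on the list of cycles they are given, which is why a misplaced cycle boundary in a noisy or aperiodic signal changes the result directly.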
An alternative method of acoustic analysis is frequency-based and relies on Fourier transformations of the acoustic signal to identify signal noise and aperiodicity without relying on periodic cycle boundaries. Some authors have suggested that these frequency-based measurements may be more accurately applied to analysis of the dysphonic voice.7-11 Unlike time-based analysis, frequency-based analysis can be performed on sustained vowel utterances as well as continuous speech.7
To identify an abnormal voice using acoustic measures, we first need to characterize the normal pediatric and adult voice and to address whether these measurements are consistent across recordings to determine if they truly are representative of a patient’s voice. While there is good literature on consistency of normative acoustic measurement data in adults, until recently, few data were available for the pediatric population.
Several studies have compiled databases of acoustic measurements in the pediatric population without voice or speech disorders using time-based algorithms.5,12 In 2002, Campisi et al12 recorded frequency and perturbation data for sustained vowel recordings of 50 children without voice or speech disorders, as well as 26 children with vocal cord nodules, and found increased measures of voice perturbation in the population with vocal fold disease. In 2012, Maturo et al5 created the largest database to date from an English-speaking normative pediatric population, including acoustic measurements from the sustained vowel recordings of 335 children. In both of these studies, single sustained vowel recordings were obtained at 1 point in time, raising the question of whether a single sustained vowel recording is representative of a child’s voice. In 2012, Hill et al13 investigated whether single sustained vowel recordings were consistent across 2 time points in the pediatric population. Single sustained vowel utterances, as well as an average of 3 sustained vowel utterances, were obtained from 50 children at 2 recording sessions, and the consistency of time-based acoustic measurements was analyzed. Fundamental frequency, F0, was reliable for both single and averaged recordings, consistent with previously reported data from adult populations.14-16 However, measures of perturbation were poorly reliable for single recordings at 2 time points. Consistency improved only when recordings were averaged across each session.13 These data suggest that a single, sustained vowel utterance is not representative of a child’s voice.
If single, sustained vowel recordings do not facilitate reliable measures of voice perturbation in children, we questioned if frequency-based acoustic analysis of continuous speech samples could be used in the pediatric population to provide further insight about the pediatric voice. To date, cepstral analysis of the normal pediatric voice has never been published, nor have data regarding the consistency of frequency-based acoustic measures for the same test performed at 2 points in time in either the adult or pediatric population. Our study was designed as a pilot study to determine if continuous speech recordings can be reliably obtained from the pediatric population and whether a continuous speech recording is consistent across time points.
Institutional review board approval was obtained from the Massachusetts Eye and Ear Infirmary, Boston. English-speaking children ages 4 to 17 years were recruited from the outpatient pediatric otolaryngology clinic from April through September 2012. Exclusion criteria included hearing loss (HL) as the reason for the visit, confirmed by audiogram with HL greater than 20 dB or deemed insufficient for adequate speech and language development by an audiologist; history of voice disorders or airway surgery; developmental or cognitive delay; and subjective difficulty with speech production or hearing. Participation in the study was contingent on the agreement of both the child and the parent or guardian, and written consent, and assent when appropriate, was obtained.
Children were seated in a quiet, soundproof room and a headset-mounted, adjustable microphone (Shure Beta 53, Shure Inc) was placed 3 cm from the right oral commissure. Recordings were performed with a Dell Optiplex 960 personal computer (Dell Inc) with an Intel Core Duo 2 CPU (3.1 GHz, 1.94 GB of RAM) (Microsoft Windows XP Professional Version 2002, Microsoft Corp). Multidimensional Voice Program (MDVP) (model 5101) and Analysis of Dysphonia in Speech and Voice (ADSV) (model 5109) software options for the Computerized Speech Laboratory (model 4500; KayPENTAX) were used to analyze voice recordings. We used MDVP software to evaluate F0, jitter, shimmer, and noise-to-harmonic ratio (NHR) of sustained vowel utterances. We used ADSV software to evaluate cepstral peak frequency (CP F0), cepstral peak prominence (CPP), CPP standard deviation (CPP SD), low-to-high ratio (L/H ratio), L/H ratio standard deviation (L/H SD) and cepstral-spectral index of dysphonia (CSID) of sustained vowel utterances, 4 Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) sentences, and the first sentence of the “rainbow passage” (“A rainbow is a division of white light into many beautiful colors that takes the shape of a long round arch, with its path high above and its 2 ends apparently beyond the horizon”) when children were able to read the passage. For sustained vowel recordings, patients were asked to sustain the vowel /a/ at a comfortable pitch and volume using a normal speaking voice for 4 seconds. The middle 3.5 seconds (87.5%) of voicing was captured for analysis. For patients unable to sustain the vowel for 4 seconds, the middle 87.5% of the recording was used for analysis. All recordings were 1 to 3.5 seconds. 
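The selection of the middle 87.5% of a sustained vowel can be sketched as a centered slice of the recording. This is only an illustration of the arithmetic: the function name is hypothetical, the 1000-Hz sample rate is chosen for readability rather than taken from the recording setup, and in the study the selection was made within the KayPENTAX software.

```python
def middle_fraction(samples, fraction=0.875):
    """Return the centered portion of a recording that spans `fraction` of its length."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = len(samples)
    keep = int(round(n * fraction))   # number of samples to retain
    start = (n - keep) // 2           # trim equally from both ends
    return samples[start:start + keep]


# A 4-second recording at a hypothetical 1000 samples per second.
sample_rate = 1000
recording = list(range(4 * sample_rate))
segment = middle_fraction(recording)
print(len(segment) / sample_rate)  # 3.5 seconds retained
```

The same call applies unchanged to a shorter utterance, which matches the rule that the middle 87.5% was used when a child could not sustain the vowel for the full 4 seconds.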
For continuous speech recordings, patients were asked to read the following CAPE-V sentences: all voiced, “How hard did he hit him?”; easy onset, “We were away a year ago”; glottal attack, “We eat eggs every Easter”; plosives, “Peter will keep at the peak.” For patients who were able, recordings of the first sentence of the rainbow passage were also obtained. Data were selected for each recording according to KayPENTAX’s instructions. Each child underwent 1 testing session before and 1 after their clinic visit. Time elapsed between sessions was recorded.
For sustained vowel recordings, the data produced by ADSV and MDVP were evaluated separately. The mean of all voice samples for each variable was compared between the 2 recording sessions. Intraclass correlation coefficients (ICCs) were calculated for the mean of F0, jitter, shimmer, and NHR for MDVP, and CP F0, CPP, CPP SD, L/H ratio, L/H ratio SD, and CSID for ADSV. For continuous speech recordings, for each CAPE-V sentence, the mean of all voice samples for each variable was compared between the 2 recording sessions. The ICCs were calculated for the mean of CP F0, CPP, CPP SD, L/H ratio, L/H ratio SD, and CSID. The ADSV software does not calculate a CSID score for the rainbow passage. The ICC values were interpreted to represent excellent reliability if they were more than 0.75, good reliability if they were at 0.40 to 0.75, and poor reliability if they were less than 0.40.17
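As an illustration of the reliability calculation, the sketch below computes an ICC from a table with one row per child and one column per recording session, then applies the interpretation thresholds stated above. The article does not specify which ICC form was used; this sketch assumes the two-way random-effects, absolute-agreement, single-measure form, often written ICC(2,1). The function names and the F0 values are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1) for a table of n subjects (rows) by k sessions (columns)."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-subject mean square
    ms_cols = ss_cols / (k - 1)               # between-session mean square
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def classify_icc(icc):
    """Interpretation used in this study: >0.75 excellent, 0.40-0.75 good, <0.40 poor."""
    if icc > 0.75:
        return "excellent"
    if icc >= 0.40:
        return "good"
    return "poor"


# Hypothetical mean F0 values (Hz) for 5 children at sessions 1 and 2.
f0_by_session = [[262.0, 258.0], [231.0, 236.0], [198.0, 201.0],
                 [305.0, 299.0], [244.0, 247.0]]
icc = icc_2_1(f0_by_session)
print(round(icc, 3), classify_icc(icc))
```

Applied to values reported in this study, `classify_icc(0.72)` returns "good" and `classify_icc(-0.09)` returns "poor".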
This study was constructed as a pilot study to determine if continuous speech recordings could be obtained from children ages 4 to 17 years, and to determine if recordings were consistent across time points. Because it was designed for proof of concept of this method of voice analysis, recordings from 50 children were initially obtained. However, 7 of these children were excluded when further investigation revealed history of an audiogram that demonstrated conductive hearing loss greater than 20 dB or a learning disability that had an impact on speech and language development.
The study included 43 children, ages 4 to 17 years, with a male:female ratio of 1.53 to 1.0. Their mean (SD) age was 9 (4) years (median age, 7 years). Three children dropped out of the study prior to their second recording session, yielding 40 children with data from 2 recording sessions. The mean (SD) time between recording sessions was 53.4 (22.9) minutes (median, 48.0 minutes). Two of the 40 patients with a second recording had incomplete data. One patient was missing the all voiced sentence from recording 1, glottal attack sentence from recording 2, and plosives sentence from both recording sessions. The other patient was missing the glottal attack sentence from recording 1. For these patients, sustained vowel and continuous speech recordings were included in analysis if data from both recording sessions were available. Eighteen patients completed 2 recordings of the first sentence of the rainbow passage, which were available for analysis.
For sustained vowel recordings, time-based analysis was performed using MDVP. Between recording sessions, the mean F0 of single, sustained vowel recordings was found to have excellent reliability (ICC = 0.98). Mean jitter, shimmer, and NHR demonstrated poor reliability between recording sessions (ICC = 0.36, 0.04, and −0.09, respectively) (Table 1).
For sustained vowel recordings, frequency-based analysis was performed using ADSV. Between recording sessions, the mean cepstral peak fundamental frequency (CP F0) was found to have excellent reliability (ICC = 0.94). In comparison with time-based measures of perturbation for sustained vowel utterances, which demonstrated poor reliability, frequency-based measures of perturbation demonstrated increased reliability. The mean CPP and mean L/H ratio SD demonstrated good reliability (ICC = 0.64 and 0.50, respectively), and the mean L/H ratio had excellent reliability (ICC = 0.83). The mean CSID score, which is an index of dysphonia, demonstrated good reliability between recordings (ICC = 0.45). The mean CPP SD demonstrated poor consistency across recordings (ICC = 0.13) (Table 2).
Frequency-based analyses were performed for continuous speech recordings. For each CAPE-V sentence, the mean CP F0 demonstrated excellent reliability between recording sessions (ICC = 0.94 for easy onset, ICC = 0.98 for all voiced, ICC = 0.97 for glottal attack, and ICC = 0.95 for plosives). Average measures of perturbation represented by the CPP and L/H ratio for all sentence types demonstrated excellent reliability between sessions (Table 3). For all sentence types, the mean L/H ratio SD demonstrated good reliability (Table 3). With the exception of the all voiced condition, in which the mean CPP SD demonstrated poor reliability (ICC = 0.23), mean CPP SD demonstrated good reliability for all other conditions tested (Table 3). The mean CSID scores demonstrated good reliability for the all voiced sentence (ICC = 0.72) and excellent reliability for the easy onset, glottal attack, and plosives variables (ICC = 0.81, 0.87, and 0.90, respectively). For the first sentence of the rainbow passage, the mean CP F0, CPP, CPP SD, L/H ratio, and L/H ratio SD all demonstrated excellent reliability (ICC = 0.98, 0.95, 0.83, 0.87, and 0.86, respectively).
To our knowledge, this is the first study to evaluate pediatric patients without voice or speech disorders with frequency-based acoustic measures. We demonstrate that fundamental frequency F0 can be measured reliably from frequency-based analysis of single sustained utterance and continuous speech recordings of CAPE-V sentences and the rainbow passage in children. These data are consistent with studies of the adult voice, which show that fundamental frequency is stable in repeated voice recordings of both sustained vowel and continuous speech.14-16 They also reproduce results from Hill et al,13 which demonstrated consistency of F0 across sustained vowel recordings in the pediatric population.
As with prior studies, we demonstrate that time-based acoustic measures of perturbation (jitter, shimmer, NHR) are not consistent between single sustained utterances obtained at 2 points in time. This phenomenon has been demonstrated in both adult and pediatric populations.13,14,16 In adults, this lack of recording reliability has been attributed to ambient noise, sex differences, hormonal change, time of day, mood, and caffeine intake.14,18-20
Frequency-based acoustic measurements have been studied in adult populations with dysphonia and those without voice or speech disorders. Analysis allows for identification of a cepstral peak (CP), which occurs at the dominant F0. A cepstral peak prominence (CPP) is then calculated, which measures the difference between the F0 energy and the average energy of the signal, which is derived from a linear regression model. In signals with a high degree of harmonic organization, the CP will clearly emerge from background noise, and the CPP will be larger.11 Multiple studies have suggested that a low CPP is associated with dysphonia and correlates with CAPE-V perceptual ratings of breathiness.8,10,21,22 A low-to-high ratio (L/H ratio) is also calculated, which represents the relative distribution of low to high frequency energy in the acoustic signal, and is a measure of spectral skew. Increased high-frequency spectral energy may be associated with dysphonia, and studies have demonstrated that a small L/H ratio SD is also present in speakers with dysphonia.22 KayPENTAX’s ADSV program also computes a multivariate index measure, the CSID, based on regression algorithms that incorporate the cepstral and spectral measures of CPP, L/H ratio, and L/H ratio SD. The CSID approximates a 100-point severity scale that can be compared with auditory-perceptual ratings and has been validated for the CAPE-V sentences in adults.23,24
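The relationship between the cepstral peak, the regression line, and the CPP can be made concrete with a minimal single-frame sketch. It follows one common formulation (log-magnitude spectrum, cepstrum of that spectrum expressed in dB, peak search over a plausible F0 range, and a least-squares line fitted across quefrency) and is not the ADSV implementation. The function name, the F0 search limits, the 1-ms regression start, and the synthetic test signal are all assumptions made for illustration, and a naive discrete Fourier transform is used only to keep the example free of external libraries.

```python
import cmath
import math


def cepstral_peak_prominence(frame, sample_rate, f0_min=60.0, f0_max=500.0):
    """Return (cpp_db, f0_estimate_hz) for one frame of audio samples."""
    n = len(frame)
    half = n // 2
    # Hann window to limit spectral leakage.
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * t / (n - 1)))
                for t, x in enumerate(frame)]
    twiddle = [cmath.exp(-2j * math.pi * m / n) for m in range(n)]

    # Log-magnitude spectrum in dB (naive DFT; symmetric for real input).
    log_spec = [0.0] * n
    for k in range(half + 1):
        x_k = sum(windowed[t] * twiddle[(k * t) % n] for t in range(n))
        log_spec[k] = 20.0 * math.log10(abs(x_k) + 1e-12)
        log_spec[(n - k) % n] = log_spec[k]

    # Cepstrum: inverse transform of the (real, even) log spectrum, in dB.
    cos_table = [w.real for w in twiddle]
    ceps_db = []
    for q in range(half + 1):
        c_q = sum(log_spec[k] * cos_table[(k * q) % n] for k in range(n)) / n
        ceps_db.append(20.0 * math.log10(abs(c_q) + 1e-12))

    # Cepstral peak within the quefrency range of plausible F0 values.
    q_low = max(1, int(sample_rate / f0_max))
    q_high = min(half, int(sample_rate / f0_min))
    peak_q = max(range(q_low, q_high + 1), key=lambda q: ceps_db[q])

    # Least-squares line through the cepstrum from about 1 ms upward.
    xs = list(range(max(1, int(0.001 * sample_rate)), half + 1))
    ys = [ceps_db[q] for q in xs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x

    cpp_db = ceps_db[peak_q] - (intercept + slope * peak_q)
    return cpp_db, sample_rate / peak_q


# Synthetic, strongly harmonic frame: 15 harmonics of 250 Hz sampled at 8 kHz.
fs = 8000
frame = [sum(math.sin(2.0 * math.pi * 250.0 * h * t / fs) / h for h in range(1, 16))
         for t in range(512)]
cpp_db, f0_hz = cepstral_peak_prominence(frame, fs)
print(round(cpp_db, 1), round(f0_hz, 1))
```

In this formulation the CPP is the height of the cepstral peak above the fitted line at the same quefrency, so a frame with weaker harmonic structure would be expected to yield a smaller value.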
Consistency data for frequency-based acoustic measurements are lacking. In adults, the absolute means of frequency-based measures have been shown to differ across vocal tasks.9 To our knowledge, however, the absolute means of frequency-based measures for the same task at differing time points have not been studied previously in either patients without voice or speech disorders or those with dysphonia, and ours is the first study to evaluate the consistency of frequency-based acoustic measures for the same task at 2 different time points. We show that for both sustained vowel utterances and continuous speech, measures of perturbation, CPP and L/H ratio, and perturbation variability, the L/H ratio SD, demonstrate good to excellent reliability. These data suggest that for a particular verbal task, frequency-based acoustic analysis of a single recording is representative. Future studies evaluating the reliability of frequency-based acoustic measures between tasks will need to be performed to determine if a single continuous speech recording is representative of a voice in all conditions.
Our data demonstrate that continuous speech samples can be easily obtained from the pediatric population and that single continuous speech recordings are representative for a particular task. Scores of reliability improved with frequency-based analysis of sustained vowel recordings in comparison with time-based analysis of sustained vowel recordings. Frequency-based analysis of continuous speech recordings demonstrated additional improvements in reliability, showing excellent consistency of perturbation measures across recordings, suggesting that continuous speech may be more representative of voice than sustained vowel utterances. This supports assertions by Awan and Roy7 and Awan et al8 that continuous speech is a more valid measure of vocal quality.
The CSID score, which is an acoustic measurement of dysphonia that correlates with perceptual ratings of voice, has not yet been studied in children. We demonstrate that the CSID has good to excellent consistency across time points for continuous speech tasks. As additional data are obtained regarding the CSID profile of children without voice or speech disorders, this score may be used to better characterize the degree of dysphonia and response to treatment in children with vocal cord disease.
Strengths of this study include a moderately large sample size and attempts to minimize extralaryngeal factors that might affect the vocal acoustics of children enrolled in the study. Our sample size of 43 patients is considerably larger than previous studies evaluating vocal consistency in adults, which evaluated 20 or fewer patients.14-16,25 In addition, our sample size is larger than many studies evaluating frequency-based acoustic measurements in adults without voice or speech disorders and those with dysphonia.8,9,11,22,24
This study was not powered to evaluate for differences between sexes or age. Therefore, data from all children were grouped together to calculate the means from each testing session. While this limits our ability to perform subgroup analysis, it did allow patients to function as their own controls. Our study design also helped to minimize variance in acoustic measurements secondary to time of day. Patients were tested at approximately the same time of day, with a mean of 53.4 minutes elapsed between testing sessions. While the degree of ambient noise was never measured, recordings were performed in a soundproof room to minimize the effects of ambient noise.
The study is limited by our study population and poor control of intrinsic factors that may affect the pediatric voice. Study patients were recruited from a tertiary care pediatric otolaryngology practice. While all attempts were made to limit extralaryngeal influences on voice, our patients do not necessarily represent a true normative pediatric population because of the setting from which they were recruited. We did not control the frequency or intensity of voice recordings. In addition, we did not control for caffeine intake, sleep deprivation, hydration status, or menstrual status of study patients, all of which could potentially influence voice perturbation and variability. However, the extent to which these factors may influence vocal consistency is unclear, and again each patient served as his or her own control.
The goal of this pilot study was to evaluate whether continuous speech recordings could be obtained in the pediatric population and if recordings would be consistent over time. The study was not designed to evaluate if acoustic measurements of continuous speech are stable across vocal tasks and was not powered to perform subgroup analyses to look at sex- or age-related differences in consistency or perturbation. A larger study is required to understand the frequency-based acoustic measurement characteristics of the pediatric population with voice or speech disorders and to identify changes that occur to these measurements in children with various types of dysphonia and in response to treatment.
In conclusion, this is the first study, to our knowledge, that addresses consistency of frequency-based acoustic measurements in children. It is also the first study evaluating the consistency of frequency-based measures of frequency and perturbation across 2 time points. Single sustained vowel utterances and continuous speech provide consistent measures of frequency. Frequency-based analysis of sustained vowel utterances improves consistency over time-based analysis, and frequency-based analysis of continuous speech demonstrates good to excellent consistency. Continuous speech recordings can be used to evaluate the vocal acoustics in children. Further studies are needed to understand frequency-based acoustic measurements in the developing pediatric population without voice or speech disorders as well as in the pediatric population with dysphonia.
Submitted for Publication: March 11, 2013; accepted May 29, 2013.
Corresponding Author: Gillian R. Diercks, MD, Department of Otolaryngology–Head and Neck Surgery, Massachusetts Eye and Ear Infirmary, Harvard Medical School, 243 Charles St, Boston, MA 02114 (Gillian_Diercks@meei.harvard.edu).
Author Contributions: All authors had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.
Study concept and design: Diercks.
Acquisition of data: Diercks, Ojha, Infusino.
Analysis and interpretation of data: Diercks, Ojha, Maurer, Hartnick.
Drafting of the manuscript: Diercks, Ojha, Hartnick.
Critical revision of the manuscript for important intellectual content: Infusino, Maurer, Hartnick.
Statistical analysis: Diercks, Maurer, Hartnick.
Administrative, technical, and material support: Ojha, Infusino, Hartnick.
Study supervision: Hartnick.
Conflict of Interest Disclosures: None reported.
Previous Presentation: These data were a podium presentation titled “Comparing Normative Pediatric Voice Acoustic Data Obtained from Conversational Speech: Moving Towards Increased Recording Stability,” at the Fall Voice Conference; October 6, 2012; New York, New York.