Distribution of 16 ratings of percentage gap size for 10 nasoendoscopic segments. Each point represents estimation of gap size from 1 rater. There are 16 symbols per segment (1 for each rater).
Sie KCY, Starr JR, Bloom DC, et al. Multicenter Interrater and Intrarater Reliability in the Endoscopic Evaluation of Velopharyngeal Insufficiency. Arch Otolaryngol Head Neck Surg. 2008;134(7):757–763. doi:10.1001/archotol.134.7.757
Objective
To explore interrater and intrarater reliability (Rinter and Rintra, respectively) of a standardized scale applied to nasoendoscopic assessment of velopharyngeal (VP) function, across multiple centers.
Design
Multicenter blinded Rinter and Rintra study.
Setting
Eight academic tertiary care centers.
Participants
Sixteen otolaryngologists from 8 centers.
Main Outcome Measures
Raters estimated lateral pharyngeal and palatal movement on nasoendoscopic tapes from 50 different patients. Raters were asked to (1) estimate gap size during phonation and (2) note the presence of the Passavant ridge, a midline palatal notch on the nasal surface of the soft palate, and aberrant pulsations. Primary outcome measures were Rinter and Rintra coefficients for estimated gap size, lateral wall, and palatal movement; κ coefficients for the Passavant ridge, a midline palatal notch on the nasal soft palate, and aberrant pulsations were also calculated.
Results
The Rinter coefficients were 0.63 for estimated gap size, 0.41 for lateral wall movement, and 0.43 for palate movement; corresponding Rintra coefficients were 0.86, 0.79, and 0.83, respectively. Interrater κ values for qualitative features were 0.10 for the Passavant ridge, 0.48 for a notch on the nasal surface of the soft palate, 0.56 for aberrant pulsations, and 0.39 for categorical estimation of gap size.
Conclusions
In these data, there was good Rintra and fair Rinter when using the Golding-Kushner scale for rating VP function based on nasoendoscopy. Estimates of VP gap size demonstrate higher reliability coefficients than total lateral wall, mean palate estimates, and categorical estimate of gap size. The reliability of rating qualitative characteristics (ie, the presence of the Passavant ridge, aberrant pulsations, and notch on the nasal surface of the soft palate) is variable.
Velopharyngeal insufficiency (VPI), also known as velopharyngeal (VP) incompetence, inadequacy, or dysfunction, denotes incomplete VP closure during speech production. Clinically, VPI manifests as hypernasal resonance, nasal air emissions, and associated stigmata, including facial grimacing and compensatory misarticulations. The VP complex comprises the levator veli palatini, tensor veli palatini, palatoglossus, palatopharyngeus, musculus uvulae, and the pharyngeal constrictor muscles. Dysfunction of the velopharynx can be caused by anatomic abnormalities, classic cleft palate, and neuromuscular dysfunction.
The primary tool for determining the presence and severity of VPI is the trained ear.1,2 Speech pathologists assess resonance, presence of nasal air emissions or turbulence, articulation, and intelligibility. This perceptual assessment of speech is subjective and generally does not predict the degree of VP closure.3-5 It is important to establish the relative amount of VP movement because any surgical intervention to improve VP closure carries with it the possibility of compromising the upper airway. Therefore, surgeons have advocated tailoring the surgery to each patient's needs.6-8 Specifically, patients with poor VP closure and a resultant large VP gap require more augmentation than patients with relatively good VP closure resulting in a small VP gap. In addition to overall VP closure, the relative contributions of lateral wall motion and palatal motion may also have an impact on treatment recommendations.9 Therefore, we were interested in assessing a standardized rating system as applied to nasoendoscopic assessment of VP function.
Nasoendoscopic evaluation of the VP during speech allows a bird's-eye view of the VP mechanism, generating a 2-dimensional view at the level of attempted VP closure. The endoscopist estimates the degree of lateral pharyngeal and palatal movement during speech. Other characteristics of the VP, such as the presence of a notch on the nasal side of the soft palate, presence of the Passavant ridge, and aberrant pulsations of the internal carotid arteries, can also be visualized when present. These findings may have an impact on patient treatment.
In an effort to standardize the reporting of endoscopic and radiographic measures of VP function, a multidisciplinary International Working Group proposed a rating scale in 1988.10 The Golding-Kushner scale involves separate ratings for the motion of each component of the VP sphincter, that is, right and left lateral pharyngeal and palatal motion, as seen during endoscopic and radiographic evaluations. This scale uses semiquantitative judgments of palatal, lateral pharyngeal, and posterior pharyngeal wall movement during speech relative to rest.
A standardized approach to quantifying, documenting, and recording VPI is critical for reliable communication about diagnosis, clinical care, and research in this field. Because the scale is subject to rater judgment, its usefulness for clinical and research purposes will depend on how variable it is when used by different raters and within raters over time. In the present study, we were interested in estimating intrarater and interrater reliability (Rintra and Rinter, respectively) of the Golding-Kushner scale when used by a large number of otolaryngologists who are active in the treatment of VPI across multiple centers in the United States.
Institutional review board exemption was obtained at Children's Hospital and Regional Medical Center (CHRMC), Seattle, Washington. A videotape of nasoendoscopic evaluations of 50 pediatric patients presenting for evaluation of VPI was compiled.11 The patients were randomly selected from videotapes of all patients with VPI evaluated in the VPI clinic at CHRMC. All nasoendoscopic examinations were performed by 1 of 2 otolaryngologists (J.A.P. and K.C.Y.S.) with a 2.4-mm flexible nasopharyngoscope. The scope was passed through the middle meatus for optimal viewing of the VP during speech, and the examinations were videorecorded using VHS format. The videotaped segments did not contain patient-identifying information.
We identified centers at which there were at least 2 otolaryngologists actively involved in treatment of VPI and invited those otolaryngologists to participate. All participants reported (1) the frequency of using the Golding-Kushner scale in rating VP function (never, rarely, occasionally, usually, always), (2) the number of years since completing fellowship training, (3) the number of years' experience evaluating VPI, (4) the number of VPI evaluations performed in the past year, and (5) the number of VPI surgical procedures performed in the past year.
In this prospective, blinded study, 16 otolaryngologists from 8 institutions each rated the 50 nasoendoscopy segments using the Golding-Kushner scale twice, separated by at least 1 month. Raters were blinded both from each other's ratings and also from their own previous ratings. The raters were allowed to view a patient segment of videotape as many times as desired but were not allowed to return to previously viewed segments. An Access (Microsoft Corp, Redmond, Washington) database requiring that all fields be filled was used to record the data. Items rated included right and left lateral pharyngeal wall movement, right and left palate movement and elevation, posterior pharyngeal wall movement, and estimate of gap size. Raters were asked to judge the presence or absence of the Passavant ridge, midline palatal notch, and aberrant pulsations. Right and left lateral pharyngeal wall movements were rated on a scale from 0 to 0.5, with 0 indicating absence of medial excursion of the wall and 0.5 indicating medial excursion to the midline. Right and left palatal elevation was rated from 0 to 1.0, with 0 indicating absence of palatal elevation relative to the resting position of the palate and 1.0 indicating palatal contact with the posterior wall. Posterior pharyngeal wall movement was rated from 0 to 1.0, with 0 indicating absence of posterior wall movement and 1.0 indicating anterior movement of the posterior pharyngeal wall to meet the palate. An overall clinical estimate of the residual VP gap size was estimated from 0% to 100%, with 0% indicating complete closure and 100% indicating no decrease in gap size with speech. In addition, raters described the VP gap size using a categorical scale (large, moderate, small, and none).
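The rating ranges described above can be sketched as a simple record type. This is an illustrative sketch only, not the Access database used in the study: the class name, field names, and validation helper are hypothetical, and only the numeric ranges and category labels come from the text.

```python
from dataclasses import dataclass

# Category labels taken from the text; the tuple name is hypothetical.
GAP_CATEGORIES = ("none", "small", "moderate", "large")


def _check_range(name, value, low, high):
    """Raise ValueError unless the field is filled and low <= value <= high."""
    if value is None:
        raise ValueError(f"{name} is required")
    if not (low <= value <= high):
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")


@dataclass(frozen=True)
class SegmentRating:
    """One rater's assessment of one segment (hypothetical field names)."""
    right_lateral: float       # 0 = no medial excursion, 0.5 = reaches midline
    left_lateral: float
    right_palate: float        # 0 = no elevation, 1.0 = contacts posterior wall
    left_palate: float
    posterior_wall: float      # 0 = no movement, 1.0 = meets the palate
    gap_percent: float         # 0 = complete closure, 100 = no decrease in gap
    gap_category: str          # one of GAP_CATEGORIES
    passavant_ridge: bool
    palatal_notch: bool
    aberrant_pulsations: bool

    def __post_init__(self):
        # Mirror the "all fields must be filled" rule with range checks.
        _check_range("right_lateral", self.right_lateral, 0.0, 0.5)
        _check_range("left_lateral", self.left_lateral, 0.0, 0.5)
        _check_range("right_palate", self.right_palate, 0.0, 1.0)
        _check_range("left_palate", self.left_palate, 0.0, 1.0)
        _check_range("posterior_wall", self.posterior_wall, 0.0, 1.0)
        _check_range("gap_percent", self.gap_percent, 0.0, 100.0)
        if self.gap_category not in GAP_CATEGORIES:
            raise ValueError(f"gap_category must be one of {GAP_CATEGORIES}")
        for name in ("passavant_ridge", "palatal_notch", "aberrant_pulsations"):
            if not isinstance(getattr(self, name), bool):
                raise ValueError(f"{name} must be True or False")
```

Rejecting incomplete or out-of-range entries at construction time plays the same role as the required-field constraint in the database: every stored rating is complete and lies on the scale.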
We estimated Rinter and Rintra coefficients (intraclass correlation coefficients) from analysis of variance models.12,13 For each measurement category, the correlation coefficient R ranges from 0 to 1.0 and indicates the proportion of variability attributable to true differences across segments. That is, higher reliability is represented by estimates closer to 1.0, whereas unreliable measures produce estimates closer to 0. For each R estimate we also calculated a 95% lower bound as a measure of its precision. When Rinter (Rintra) is close to 1.0, the reliability across (or within) raters is high, and interrater (intrarater) variability is relatively low. The dimensions for each segment were as follows: GAP indicated estimated percentage gap size; LAT, right lateral + left lateral wall movement; and PAL, (right palatal movement and left palatal movement)/2.
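As an illustration of how such a coefficient can be derived from an analysis of variance, the sketch below computes a one-way random-effects intraclass correlation. The text does not state which ANOVA model was used, so the one-way form is an assumption, and the function names are hypothetical. The two helper functions encode the LAT and PAL composites as defined above.

```python
def lat_score(right_lateral, left_lateral):
    """LAT: right lateral + left lateral wall movement (range 0 to 1.0)."""
    return right_lateral + left_lateral


def pal_score(right_palate, left_palate):
    """PAL: mean of right and left palatal movement (range 0 to 1.0)."""
    return (right_palate + left_palate) / 2


def icc_oneway(ratings):
    """One-way random-effects ICC for a single rating (an assumed model).

    ratings: list of rows, one row per segment, each row holding the same
    number k of ratings of that segment (k raters for Rinter, or the
    2 sessions of one rater for Rintra).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least 2 segments")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("each segment needs the same number (>= 2) of ratings")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-segment and within-segment mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    # The sample estimate can fall below 0 when raters disagree more
    # within segments than segments differ from each other.
    return (ms_between - ms_within) / denom
```

For example, `icc_oneway([[10, 10], [50, 50], [90, 90]])` returns 1.0 because the two ratings of every segment agree exactly, so all variability is attributable to differences across segments.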
The R values for Rintra were calculated separately for each individual rater and institution, and then Rintra and Rinter for estimated gap size were calculated separately for segments that were above and below the median duration of nasopharyngoscopic evaluation (80 seconds).
The Rinter and Rintra for assessing qualitative characteristics, including the presence of soft palate notch, aberrant pulsation, Passavant ridge, and categorical gap size, were evaluated by using κ coefficients. The κ coefficient is a chance-corrected measure of agreement between independent observers that typically takes values between 0 and 1.0, although negative values are possible when agreement is worse than expected by chance. Zero indicates that the observed agreement is no greater than expected by chance, and 1.0 indicates complete agreement between observers. It is used to describe Rinter and Rintra when data can be classified into nominal categories (eg, “yes” or “no”). In addition to estimating κ statistics with bootstrapped 95% confidence intervals, we also calculated, for each of the 50 segments, the number of raters giving positive ratings (of 16 raters). We used only the first of the 2 repeated ratings from each rater for each segment. We plotted the number of segments (of 50) with each possible number of positive ratings (of 16) for each of the 4 qualitative characteristics. All statistical analyses were performed using Stata statistical software (version 9.0; StataCorp, College Station, Texas).
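The calculations described above can be sketched for a yes/no characteristic as follows. The text does not specify which multirater κ or which bootstrap scheme was used, so this sketch assumes Fleiss' κ and a percentile bootstrap that resamples segments; all function names are hypothetical.

```python
import random
from collections import Counter


def positive_counts(ratings):
    """Number of raters giving a positive rating, per segment."""
    return [sum(1 for r in row if r) for row in ratings]


def tally_segments(ratings):
    """Map (number of positive ratings) -> (number of segments)."""
    return Counter(positive_counts(ratings))


def fleiss_kappa_binary(ratings):
    """Fleiss' kappa for a yes/no item (assumed multirater kappa).

    ratings: one row per segment, one bool per rater, same raters per row.
    """
    if not ratings:
        raise ValueError("no segments supplied")
    n_seg = len(ratings)
    m = len(ratings[0])
    if m < 2 or any(len(row) != m for row in ratings):
        raise ValueError("each segment needs the same number (>= 2) of ratings")
    pos = positive_counts(ratings)
    total_pos = sum(pos)
    if total_pos in (0, n_seg * m):
        raise ValueError("kappa is undefined when every rating is identical")

    # Observed agreement: share of rater pairs that agree, averaged over segments.
    p_obs = sum(
        (p * (p - 1) + (m - p) * (m - p - 1)) / (m * (m - 1)) for p in pos
    ) / n_seg
    # Chance agreement from the overall proportion of positive ratings.
    p_pos = total_pos / (n_seg * m)
    p_exp = p_pos ** 2 + (1 - p_pos) ** 2
    return (p_obs - p_exp) / (1 - p_exp)


def bootstrap_ci(ratings, n_boot=1000, seed=0):
    """Percentile 95% interval, resampling segments with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings) for _ in ratings]
        try:
            estimates.append(fleiss_kappa_binary(sample))
        except ValueError:
            continue  # skip resamples in which kappa is undefined
    if not estimates:
        raise ValueError("no valid bootstrap resamples")
    estimates.sort()
    lower = estimates[int(0.025 * (len(estimates) - 1))]
    upper = estimates[int(0.975 * (len(estimates) - 1))]
    return lower, upper
```

With 16 raters and 50 segments, `tally_segments` yields the counts behind a plot of the number of segments against the number of positive ratings; segments piling up at 0 and 16 indicate agreement, whereas segments near 8 indicate a split among raters.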
The 16 otolaryngologists varied widely in their experience evaluating and managing patients with VPI, in the number of VPI evaluations performed annually, and in how consistently they used the Golding-Kushner standardized scale to grade VP function (Table 1).
The 50 segments ranged in duration from 26 seconds to 4 minutes 36 seconds, with a median length of 80 seconds. They included evaluations from 20 girls and 30 boys (mean [SD] age, 9.0 [3.8] years; range, 3.5-17.0 years). This population included 26 patients with cleft palate with or without a cleft lip, who had previously undergone palatoplasty; 15 syndromic patients, including 11 with the diagnosis of velocardiofacial syndrome; 2 patients with VPI following adenotonsillectomy; and 7 patients with idiopathic VPI. The mean (SD) ratings from the 16 otolaryngologists of the VP gap, total lateral wall movement, and average palate movement were 23.1% (22.2%), 0.4 (0.3), and 0.7 (0.3), respectively (Table 2).
The distribution of estimated gap size ratings across all 16 raters for 10 segments is representative of our findings in all 50 rated segments and demonstrates the variability found in quantifying gap size (Figure). There were segments for which raters' estimates of gap size differed by as much as 80%. The Rinter and Rintra for the estimated gap size, total lateral wall movement, and total palatal movement ranged from 0.41 to 0.86 (Table 3). The Rintra was higher than Rinter for all 3 measurements. The Rintra for individual raters varied from 0.54 to 0.99 (Table 4). Neither Rintra nor Rinter differed by duration of the segment (whether above or below the median of 80 seconds); Rinter was no greater for raters by individual or by institutions than among all raters across institutions (data not shown). The Rintra also did not vary in any systematic way depending on the rater's historic use of the Golding-Kushner scale (Table 4) or level of experience, and neither did Rinter (data not shown).
Compared with reliability for assessing quantitative characteristics such as VP gap size as a percentage of possible closure, there was generally lower reliability for the assessment of characteristics measured qualitatively. The κ statistics measuring Rinter for assessing the Passavant ridge, a notch on the nasal side of the soft palate, aberrant pulsations, and gap size category varied from 0.10 to 0.56 (Table 5). The Rintra was also lower than for quantitative measurements. Across raters, the mean κ statistics ranged from 0.6 to 0.7, depending on the characteristic being assessed; the Rintra for individual raters ranged from 0 to 1.0 for some qualitative assessments (data not shown).
Evaluation of aberrant pulsations was fairly consistent, with 41 segments rated present by none of the raters (or conversely rated absent by all raters); 1 segment rated present by 15 of 16 raters; 2 segments rated present by 9 of 16 raters; and 7 segments rated present by 1 to 3 raters. The evaluation was consistent between repeated ratings 98% of the time (data not shown). There was far less consistency across raters for evaluation of a soft palate notch and the Passavant ridge. In rating the same videotape twice, the 2 ratings were in agreement 84% of the time for soft palate notch and 88% of the time for the Passavant ridge (data not shown).
In the absence of an objective, quantifiable measure of VP function, it is important to understand the reliability of clinical tools currently used in the treatment of patients with VPI. Description of VP function is relevant for clinical practice and outcomes assessment. Others have recognized the importance of creating a standardized reporting system.10 Use of this scale has been variably adopted by otolaryngologists who regularly treat patients with VPI.
In the interest of understanding the current literature on outcomes with VPI surgery, and to design multicenter studies in the future, we used a standardized reporting system to evaluate Rinter and Rintra among a large group of otolaryngologists who commonly treat this patient population. We found that the mean Rintra and Rinter of gap sizes were approximately 0.86 and 0.63, respectively. The interpretation of these values depends to some extent on the context in which the scale is applied because the “cost” of mismeasurement or disagreement likely differs in different contexts. For example, the use of gap size measurements for determining appropriate surgical treatments might require a higher reliability than if the measurements are made for purely descriptive purposes. Research involving multiple raters might be impractical with an Rinter of 0.63, which, although considered moderate in some contexts, would necessitate a larger sample size to ensure sufficient statistical power despite potentially large amounts of measurement error.12,14
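The effect of imperfect reliability on sample size can be illustrated with a rough calculation. The sketch below assumes a classical measurement-error model in which reliability is the ratio of true to observed variance of a continuous outcome, so that the required sample size scales by approximately 1/R; the function name and the starting sample size of 100 are hypothetical and are not taken from the study.

```python
import math


def inflated_sample_size(n_perfect, reliability):
    """Approximate sample size needed when the outcome has reliability R.

    Assumes observed variance = true variance / R, so the sample size
    required with a perfectly reliable measure is divided by R.
    """
    if not (0 < reliability <= 1):
        raise ValueError("reliability must be in (0, 1]")
    return math.ceil(n_perfect / reliability)


# Hypothetical study needing 100 subjects with a perfectly reliable measure.
n_single_rater = inflated_sample_size(100, 0.86)    # Rintra for gap size
n_multiple_raters = inflated_sample_size(100, 0.63)  # Rinter for gap size
```

Under these assumptions, a design relying on multiple raters (R = 0.63) would need roughly 159 subjects for every 100 required with a perfectly reliable measure, compared with about 117 for a single rater (R = 0.86).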
Other authors15-17 have studied the reliability of nasoendoscopic assessment of VP function in subjects with normal VP function or small numbers of subjects. D’Antonio et al18 conducted a large study of reliability of nasopharyngoscopy prior to recommendations of the International Working Group. Their study was designed to compare the reliability of a group of expert raters, which was their standard clinical practice, with the ratings of novice raters functioning individually. The mean κ statistic was 0.78 for the expert group and 0.51 to 0.50 for the novice group. They also compared the novice ratings with the expert groups, with mean κ scores of 0.3 to 0.4. Although they were able to conclude that experts working together in a group were more consistent than novice raters working individually, we studied a group of expert raters at different centers, all working independently.
In the present study, all of the participants were asked to use the Golding-Kushner scale to rate the endoscopic segments, regardless of whether they use the scale in their clinical practice. A copy of the article describing the rating scale10 was provided to each rater, but formal training and feedback were not provided. This seems to represent the most typical application of the Golding-Kushner scale because there is no formal or systematic training available regarding its use. All segments were obtained on patients with clinical manifestations of VPI.
This study has several weaknesses. First, flexible fiberoptic nasoendoscopic assessments are affected by optical distortion that is difficult to quantify. This distortion renders quantitative measurements impossible. Second, the audio recorded during the examination was included in the videotape, potentially introducing bias based on the degree of VPI. The audio component was included so that raters could correlate VP movement with the production of nonnasal speech sounds. We also felt that inclusion of the audio portion most closely mimicked practical application of the Golding-Kushner scale. Although VPI severity may have influenced the ratings, it would not necessarily have influenced the Rinter and Rintra of the ratings. Another potential source of bias was the format of the samples. The samples were provided to participants in the form of a VHS videotape, and therefore the segments were viewed in the same order during both ratings. We attempted to minimize any recall effects by ensuring that raters waited at least 1 week between making the assessments. With such a large number of segments, it is unlikely that participants would recall their previous rating of any particular segment.
The Golding-Kushner scale itself is a scale of relative movement and therefore is subjective in its application. This scale was chosen because it was created by a multidisciplinary working group. This is the most detailed scale described for assessment of instrumental tests of VP function. It is also appealing because it can be applied to both endoscopic and radiographic tests of the VP.
The strengths of this study include having had an experienced speech therapist present for all the endoscopic examinations to ensure recording of a speech sample tailored to demonstrate optimal function of the VP. With regard to the data set, we developed an Access database that required that all fields be completed by the rater, thereby ensuring a complete data set. Finally, we were able to recruit a large number of raters from geographically diverse centers in the United States.
With these considerations in mind, the Rintra and Rinter for VP gap size were 0.86 and 0.63, respectively. The experience of the rater in terms of number of years of experience and number of patients seen over the previous year did not seem to be associated with the degree of reliability of assessments. Furthermore, the length of the segment also was not associated with reliability. The relatively high Rintra of the scale indicates that use of the scale by a single endoscopist is acceptable. However, the Rinter was too low to be used for intercenter comparisons of patient populations at this time. This finding highlights the need to either refine the application of this rating system or to develop a more reliable one.
The variability in the categorical rating of qualitative characteristics is important to address because some authors19,20 have suggested using these features to determine candidacy for specific surgical procedures and to define surgical risk. There are inherent limitations to this part of the study owing to our use of κ statistics. For example, κ statistics are sensitive to the underlying prevalence of a characteristic and thus may not be comparable across studies. Furthermore, a weighted κ statistic might have been more appropriate for estimating the reliability of gap size category ratings because weighted κ statistics incorporate a penalty scale for disagreements that depends on how great the disagreement (eg, whether disagreements were in adjacent or nonadjacent categories). Thus, these particular κ statistics might have underestimated the true reliability of categorical gap size ratings. Nonetheless, the relatively low to moderate reliability observed for categorical characteristics, such as gap size categories and aberrant pulsation, implies a need for refinement of these diagnoses when designing future studies.
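The difference between unweighted and weighted κ statistics can be sketched for a pair of ratings of the ordered gap size categories. This is an illustration only: it covers 2 sets of ratings rather than 16 raters, linear weights are an assumed choice, the function name is hypothetical, and the example ratings are invented rather than study data.

```python
CATEGORIES = ("none", "small", "moderate", "large")


def cohen_kappa(first, second, categories=CATEGORIES, weighted=False):
    """Cohen's kappa for 2 sets of ratings of the same segments.

    With weighted=True, disagreements are penalized linearly by how many
    categories apart they are (an assumed weighting scheme).
    """
    if not first or len(first) != len(second):
        raise ValueError("need two equally long, non-empty rating lists")
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    n = len(first)

    # Joint proportions of (first rating, second rating).
    joint = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        joint[index[a]][index[b]] += 1 / n
    row = [sum(joint[i]) for i in range(k)]
    col = [sum(joint[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    observed = sum(penalty(i, j) * joint[i][j]
                   for i in range(k) for j in range(k))
    expected = sum(penalty(i, j) * row[i] * col[j]
                   for i in range(k) for j in range(k))
    if expected == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1 - observed / expected


# Invented example: one disagreement, between adjacent categories.
session_1 = ["none", "small", "moderate", "large"]
session_2 = ["none", "moderate", "moderate", "large"]
kappa_unweighted = cohen_kappa(session_1, session_2)
kappa_weighted = cohen_kappa(session_1, session_2, weighted=True)
```

In this invented example the single disagreement is between adjacent categories, so the weighted statistic (0.8) exceeds the unweighted one (about 0.67), which illustrates why unweighted κ statistics might understate the reliability of ordered category ratings.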
A previous study18 looking at the reliability of nasoendoscopic evaluation has shown that providing feedback during the evaluation or allowing group participation statistically improves reliability. We propose convening a group of otolaryngologists who routinely use this scale to develop a “teaching tape” to demonstrate application of the scale to endoscopic examinations. This could be used as a self-administered tutorial. We will then use a similar design to determine the impact of such training on Rinter and Rintra.
In conclusion, this study demonstrated that the Golding-Kushner scale is fairly reliable for describing the findings of a nasoendoscopic evaluation of VP function, making it a useful tool to be used within a center and especially a center in which there is only 1 endoscopist. However, the Rinter is currently too low to be used for comparing subjects across centers. It is critical to maximize reliability of a rating scale in the evaluation and reporting of VP function before embarking on multicenter studies.
Correspondence: Kathleen C. Y. Sie, MD, Division of Pediatric Otolaryngology, Childhood Communication Center, Children's Hospital and Regional Medical Center, PO Box 5371/6E-1, Seattle, WA 98105-0371 (firstname.lastname@example.org).
Submitted for Publication: June 14, 2007; final revision received November 14, 2007; accepted November 25, 2007.
Author Contributions: All of the authors had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Study concept and design: Sie. Acquisition of data: Sie, Bloom, de Serres, Drake, Elluru, Haddad, Hartnick, MacArthur, Milczuk, Muntz, Perkins, Senders, Smith, Willging, and Zdanski. Analysis and interpretation of data: Sie, Starr, Bloom, Cunningham, Hartnick, Muntz, Senders, and Tollefson. Drafting of the manuscript: Sie, Starr, Bloom, de Serres, Hartnick, and Perkins. Critical revision of the manuscript for important intellectual content: Sie, Starr, Cunningham, Drake, Elluru, Haddad, Hartnick, MacArthur, Milczuk, Muntz, Senders, Smith, Tollefson, Willging, and Zdanski. Statistical analysis: Starr and Hartnick. Administrative, technical, and material support: Sie, Drake, Haddad, MacArthur, and Willging. Study supervision: Sie and Starr.
Financial Disclosure: None reported.
Funding/Support: This study was supported in part by the Murakami Endowment for the Childhood Communication Center, Children's Hospital and Regional Medical Center, Seattle.
Previous Presentation: This study was presented at the Annual Meeting of the American Society of Pediatric Otolaryngology; May 28, 2005; Las Vegas, Nevada.
Disclaimer: The views expressed in this article are those of the author(s) and do not necessarily reflect the official policy or position of the Department of the Navy, Department of Defense, or the US government.