To determine if heart murmur intensity grading performance can be improved using the heart sounds as an internal reference.
Single-blind controlled trial of 100 medical students, residents, and pediatric attending physicians at a children’s hospital. Groups of 1 to 3 participants were alternately assigned to intervention and control groups, reported their method of grading heart murmur intensity, and then graded the intensity of a random sample of 20 recorded murmurs on a 6-point scale. Before rating another random sample of 20 murmurs, the intervention group was taught a system that uses the heart sounds as an internal reference. Primary outcomes were change in accuracy (percentage correct), interrater agreement (κ), and consistency (κ). Subgroup analyses were performed by training level and heart murmur grade.
Grading accuracy improved more in the intervention group than the control group (Δ improvement, 5%; 95% confidence interval [CI], −0.1%-10.0%). This was most pronounced among attending physicians (Δ improvement, 11%; 95% CI, 0.4%-22%) and students (Δ improvement, 12%; 95% CI, 3%-20%) and for grade 2 murmurs (Δ improvement, 20%; 95% CI, 10%-31%). Relatively greater improvements in consistency were observed after the intervention for attending physicians (Δ improvement, 0.17; 95% CI, 0.01-0.32) and grades 2 (Δ improvement, 0.22; 95% CI, 0.09-0.36) and 3 murmurs (Δ improvement, 0.16; 95% CI, 0.05-0.28).
A system that uses the heart sounds as an internal reference for grading heart murmur intensity quickly improves accuracy and consistency for some providers and specific murmurs.
In 1933, Freeman and Levine1 published a report introducing for the first time a numeric scale for grading heart murmur intensity. The proposed 6-point rating scale, eponymously named the Levine system, persists today as the gold standard for grading heart murmur intensity. As recognized by Freeman and Levine more than 70 years ago, accuracy, consistency, and interrater agreement in grading heart murmur intensity are essential for diagnostic purposes. Along with a number of other factors (timing, location, radiation, quality, response to maneuvers, and the presence of other cardiac and noncardiac signs and symptoms), accurate grading of murmur intensity is necessary to distinguish innocent from pathological murmurs, with louder murmurs (grade ≥3) thought more likely to represent hemodynamically significant cardiac defects.2-6 Consistency allows an individual health care provider to follow a patient’s murmur for change over time, and interrater agreement ensures that providers speak a common language in describing their patients’ murmurs.
The original Levine system for grading murmurs and its more recent permutations2,4 distinguish grades 1 through 3 murmurs based on their relative intensity and the ease with which they are detected (Table 1). This creates some potential problems for grading performance. First, because murmurs are described with respect to one another, there is no gold standard for each grade’s intensity. Second, if users develop their own internal gold standard for each grade’s intensity, then their judgments may be affected by ambient environmental noise and the thickness of the patient’s chest wall. Finally, a system that is based in part on the ease with which a listener detects the murmur may result in upgrading by inexperienced listeners and downgrading by more experienced ones. Despite the frequency with which practitioners grade heart murmur intensity and the theoretical limitations of the Levine system, its performance characteristics have never been studied.
In this study, we surveyed medical students, pediatric residents and attending physicians on the systems they currently use to grade heart murmur intensity. We then tested a novel system that uses the heart sounds as an internal reference for grading heart murmur intensity. Under this system, absence of a heart murmur is graded 0/6. Murmurs that are clearly softer than the heart sounds are graded 1/6. Murmurs that are approximately equal in intensity to the heart sounds are graded 2/6. Finally, murmurs that are clearly louder than the heart sounds are graded 3/6. We hypothesized that most providers would use some variation of the Levine system for grading heart murmur intensity and that, by providing an internal reference, the heart sounds–based murmur grading system would improve accuracy, consistency, and interrater agreement compared with the currently used systems.
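The grading rule described above can be sketched as a small function. This is only an illustration: the function name, the use of a numeric intensity ratio, and the `tolerance` band that stands in for "approximately equal" are assumptions, not part of the study, in which listeners judged relative loudness by ear.

```python
def grade_murmur(murmur_intensity, heart_sound_intensity, tolerance=0.25):
    """Return a heart sounds-based grade (0 to 3) for a murmur.

    tolerance is a hypothetical band around a ratio of 1.0 that is treated
    as "approximately equal"; the study does not define a numeric cutoff.
    """
    if heart_sound_intensity <= 0:
        raise ValueError("heart_sound_intensity must be positive")
    if murmur_intensity <= 0:
        return 0  # no murmur
    ratio = murmur_intensity / heart_sound_intensity
    if ratio < 1 - tolerance:
        return 1  # clearly softer than the heart sounds
    if ratio <= 1 + tolerance:
        return 2  # approximately equal to the heart sounds
    return 3      # clearly louder than the heart sounds


# Ratios chosen to mirror "softer", "equal", and "louder" murmurs.
grades = [grade_murmur(m, 1.0) for m in (0.0, 0.5, 1.0, 1.5)]
```

Because the grade depends only on the ratio of murmur to heart sound intensity, it does not change when both are attenuated by the same factor, which is the property the system relies on.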
The trial was conducted from July 1 through August 30, 2003, at a freestanding tertiary care academic children’s hospital in Philadelphia, Pa. Eligible participants included residents, fellows, and attending physicians in the divisions of General Pediatrics and Pediatric Emergency Medicine and medical students in clinical rotation in pediatrics. Individuals who were already familiar with the murmur grading system being tested were excluded. Chief residents and fellows were considered attending physicians.
This was a single-blind controlled trial in which eligible participants completed the study protocol immediately after agreeing to participate in the trial. A medical student (M.T.) recruited participants among the attending physicians, residents, and medical students caring for patients in the emergency department and the general pediatrics inpatient service. Groups of 1 to 3 participants were alternately allocated to intervention and control groups. Participants were blinded to their allocation and were asked not to discuss the details of the study with others in the hospital. All participants were asked first to describe in writing their system for grading heart murmurs on a 0- to 6-point scale. They were then asked to use their system to grade a random sample of 20 heart murmurs that varied in intensity from grades 0/6 to 3/6. The heart sound and murmur recordings were played from a CD-ROM of a laptop computer connected to a splitter that allowed up to 3 participants to simultaneously listen to (using headphones) and rate murmurs. Each murmur was played for 10 seconds, and participants were asked to write down the murmur grade at the end of the 10-second interval without consulting the others in the group. The intervention group was then taught the heart sounds–based system for grading the intensity (0/6-3/6) of heart murmurs. It took approximately 3 minutes to teach the new grading system. The intervention group was then asked to use the heart sound–based grading system to grade another random sample of 20 heart murmurs that varied in intensity from grades 0/6 to 3/6. The control group was not taught the heart sounds–based grading system. Rather, after a 3-minute break, they were asked to grade the same second random sample of 20 murmurs. Participants recorded their responses on a standardized paper form.
The main outcomes of the study were the accuracy, consistency, and interrater agreement of the participants in grading heart murmurs before and after the educational intervention (or the 3-minute break for the controls).
Accuracy refers to the proportion of times that observers correctly identified the heart murmur grade. Because there is no objective reference standard for judging heart murmur intensity and because our goal was to evaluate the heart sounds–based system, we used that system as the reference standard for the heart murmur grades in determining accuracy (discussed in the “Heart Sound and Murmur Simulation” subsection of the “Methods” section). We calculated observer accuracy before and after the intervention (or the break) and compared the mean change in accuracy for the 2 groups using a 2-tailed t test.
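As a rough sketch of this calculation, the following computes an observer's accuracy against the reference grades and a t statistic for comparing the mean change in accuracy between 2 groups. The function names and the example numbers are hypothetical, and the pooled-variance (Student) form of the t test is an assumption, since the text does not state which variant was used.

```python
from math import sqrt
from statistics import mean, variance


def accuracy(ratings, reference):
    """Proportion of ratings that match the reference grade."""
    if not reference or len(ratings) != len(reference):
        raise ValueError("ratings and reference must be non-empty and equal in length")
    return sum(r == t for r, t in zip(ratings, reference)) / len(reference)


def pooled_t_statistic(changes_a, changes_b):
    """Student t statistic and degrees of freedom for 2 independent samples."""
    n_a, n_b = len(changes_a), len(changes_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(changes_a) + (n_b - 1) * variance(changes_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(changes_a) - mean(changes_b)) / standard_error, df


# Hypothetical per-observer changes in accuracy (after minus before).
intervention_changes = [0.2, 0.1, 0.3]
control_changes = [0.1, 0.0, 0.2]
t_stat, df = pooled_t_statistic(intervention_changes, control_changes)
```

The 2-tailed P value would then be read from a t distribution with `df` degrees of freedom, which requires a statistics package beyond the standard library.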
Consistency refers to the degree to which an individual observer assigns the same murmur the same grade. We calculated consistency using the κ statistic.7 Intrarater κ before and after the intervention (or the break) was calculated for each observer, and we compared the change in κ for the 2 groups using the Wilcoxon test. A nonparametric test was used because the true distribution of difference in change of intrarater κ is not known.
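A minimal sketch of an intrarater κ is shown below, assuming that consistency is measured by pairing an observer's first and second grades for recordings that were presented more than once. The text does not spell out the pairing, so that detail and the function name are assumptions.

```python
from collections import Counter


def cohen_kappa(first, second):
    """Chance-corrected agreement between 2 sets of grades for the same recordings."""
    n = len(first)
    if n == 0 or n != len(second):
        raise ValueError("both rating lists must be non-empty and equal in length")
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first, counts_second = Counter(first), Counter(second)
    # Agreement expected by chance from the marginal grade frequencies.
    expected = sum(counts_first[g] * counts_second[g] for g in counts_first) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when only one grade is ever used")
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates that the observer gave every repeated recording the same grade, and a value of 0 indicates agreement no better than chance.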
Interrater agreement refers to the degree to which the observers agreed (beyond chance agreement) in their rating of the heart murmur grade for each heart murmur. Using methods described by Fleiss et al7 for multiple observers and multiple ratings, we calculated interrater agreement before and after the intervention (or break) by means of the interrater κ statistic. There are no established statistical methods for comparing changes in interrater agreement κ statistics, so we performed significance testing using a permutation test.8 First, we randomly assigned the intervention and control subjects into 2 new comparison groups. We then calculated the difference in change (κ) between these new comparison groups. We repeated this process 500 times, each time calculating the difference in change (κ), and were able to calculate a mean difference in change (κ, with the 95% confidence interval [CI] around the mean) produced by random assignment of subjects who received and did not receive the intervention into comparison groups. If the actual difference in change between the intervention and control groups (overall or in subgroup comparisons) was outside the 95% CI produced by the permutation test, we considered that difference statistically significant.
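The procedure can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: the function names and data layout are hypothetical, and the interval is taken as the central 95% of the permutation distribution, which is one reading of the description above.

```python
import random
from collections import Counter


def fleiss_kappa(item_ratings):
    """Fleiss kappa; item_ratings[i] holds the grades all raters gave item i."""
    n_items, n_raters = len(item_ratings), len(item_ratings[0])
    if n_raters < 2:
        raise ValueError("at least 2 raters are required")
    totals, agreement_sum = Counter(), 0.0
    for grades in item_ratings:
        counts = Counter(grades)
        totals.update(counts)
        pairs_agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs_agreeing / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1:
        raise ValueError("kappa is undefined when only one grade is ever used")
    return (p_bar - p_e) / (1 - p_e)


def change_in_kappa(raters):
    """raters: list of (before_grades, after_grades) pairs, one pair per rater."""
    before = list(zip(*(b for b, _ in raters)))  # transpose to items x raters
    after = list(zip(*(a for _, a in raters)))
    return fleiss_kappa(after) - fleiss_kappa(before)


def permutation_interval(group_a, group_b, n_perm=500, seed=0):
    """Central 95% of the difference in change (kappa) under random regrouping."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        new_a, new_b = pooled[:len(group_a)], pooled[len(group_a):]
        diffs.append(change_in_kappa(new_a) - change_in_kappa(new_b))
    diffs.sort()
    return diffs[int(0.025 * n_perm)], diffs[int(0.975 * n_perm) - 1]


# Hypothetical data: 4 raters, each grading 4 recordings before and after.
r1 = ([0, 1, 2, 3], [0, 1, 2, 3])
r2 = ([0, 1, 1, 3], [0, 1, 2, 3])
r3 = ([0, 2, 2, 3], [0, 1, 2, 2])
r4 = ([1, 1, 2, 3], [0, 1, 2, 3])
low, high = permutation_interval([r1, r2], [r3, r4], n_perm=200)
```

An observed difference in change that falls outside the interval `(low, high)` would be treated as statistically significant under this reading.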
Subgroup analyses were performed stratified by level of participant training (attending physician, resident, and student) and heart murmur grade (0/6-3/6).
The murmurs and heart sounds were recorded from a cardiovascular patient simulator (“Harvey”; University of Miami Center for Research in Medical Education, Miami, Fla). Normal heart sounds over the aortic valve were used for the 0/6 recording. We selected 2 midsystolic murmurs, an innocent murmur over the pulmonic valve (a crescendo-decrescendo murmur) and mitral valve prolapse (MVP) over the mitral valve (a decrescendo murmur), and used spectral analysis software (SIGVIEW; available at http://www.sigview.com/index.htm) to compare the intensity (amplitude in decibels) of the murmurs with the heart sounds. The innocent murmur was approximately half the intensity of the heart sounds, and the MVP murmur was approximately equal in intensity to the heart sounds. In the innocent murmur and MVP recordings, the S1 and S2 heart sounds were approximately equal in intensity. We then used an audio editor (AVS Media.com; available at http://www.avsmedia.com/index.aspx) to isolate and modulate the innocent and MVP murmurs to produce additional murmurs that, compared with the heart sounds, were approximately 50% lower intensity (grade 1), equal intensity (grade 2), and 50% greater intensity (grade 3). A total of 7 recordings were produced, including normal heart sounds (0/6) and 2 types each of grades 1/6, 2/6, and 3/6 murmurs. To simulate the overall decreased intensity of heart sounds and murmurs in patients with thicker chest walls, we used the audio editor to generate an additional 7 recordings with approximately 50% decreased overall intensity of heart sound and murmurs.
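The amplitude adjustments described above can be illustrated with a short sketch. It assumes that "intensity" is represented as linear root-mean-square (RMS) amplitude of the audio samples, whereas the study compared amplitudes in decibels using an audio editor; the function names and sample values are hypothetical.

```python
from math import sqrt


def rms(samples):
    """Root-mean-square amplitude of a list of audio samples."""
    return sqrt(sum(s * s for s in samples) / len(samples))


def scale_to_ratio(murmur, heart_sounds, target_ratio):
    """Scale the murmur so its RMS is target_ratio times the heart sound RMS."""
    gain = target_ratio * rms(heart_sounds) / rms(murmur)
    return [gain * s for s in murmur]


def attenuate(samples, factor=0.5):
    """Reduce overall amplitude, as for the simulated thicker chest wall."""
    return [factor * s for s in samples]


# Toy waveforms standing in for isolated heart sounds and a murmur.
heart = [1.0, -1.0, 1.0, -1.0]
murmur = [0.2, -0.2, 0.2, -0.2]

grade_3_murmur = scale_to_ratio(murmur, heart, 1.5)  # about 50% greater
ratio_standard = rms(grade_3_murmur) / rms(heart)
ratio_decreased = rms(attenuate(grade_3_murmur)) / rms(attenuate(heart))
```

Attenuating the whole recording leaves the murmur to heart sound ratio unchanged, so a grade defined relative to the heart sounds is the same for the standard and decreased-amplitude versions.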
To generate 2 sequences of 20 randomly ordered murmurs, we used a random number generator to select recordings (without replacement) from a group consisting of 3 sets of standard and 2 sets of decreased amplitude recordings of murmurs graded 0/6 through 3/6. The heart sound and murmur recordings were pilot tested on 2 attending pediatricians who accurately graded the murmurs using the heart sounds–based system more than 90% of the time.
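One way to reproduce this sampling step is sketched below. The labels, function names, and seeds are hypothetical, and drawing each sequence independently from the full pool is an assumption; the text states only that recordings were selected without replacement.

```python
import random

# The 7 base recordings: normal heart sounds plus 2 murmur types at grades 1 to 3.
RECORDINGS = [("normal", 0)] + [
    (kind, grade) for grade in (1, 2, 3) for kind in ("innocent", "MVP")
]


def build_pool(standard_sets=3, decreased_sets=2):
    """Pool of 3 standard and 2 decreased-amplitude sets of the 7 recordings."""
    pool = []
    for amplitude, n_sets in (("standard", standard_sets), ("decreased", decreased_sets)):
        for copy in range(n_sets):
            for kind, grade in RECORDINGS:
                pool.append((amplitude, copy, kind, grade))
    return pool


def draw_sequence(pool, length=20, rng=None):
    """Randomly ordered sequence drawn from the pool without replacement."""
    rng = rng or random.Random()
    return rng.sample(pool, length)


pool = build_pool()
first_sequence = draw_sequence(pool, rng=random.Random(1))
second_sequence = draw_sequence(pool, rng=random.Random(2))
```

With 5 sets of 7 recordings the pool holds 35 entries, so each 20-item sequence can contain the same murmur more than once at the same or a different overall amplitude.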
Participant responses were entered into a Microsoft Excel spreadsheet (Microsoft, Redmond, Wash), and statistical analyses were performed using Statistical Analysis Software (SAS Institute Inc, Cary, NC). The study was granted exemption from the sponsoring institutional review board.
There were no significant differences in the sex, medical department, or training of the subjects in the intervention and control groups (Table 2).
After reviewing the reported methods used by participants to rate heart murmur intensity, we identified 7 distinct systems. Eighty-one participants (81%) rated heart murmur intensity using a system that closely resembled the Levine system in description, although none reported descriptions of grades 1 through 3 murmurs that corresponded exactly with those described by Levine. Other systems used (in decreasing frequency) were based on the level of training (eg, “only a cardiologist can hear” for grade 1/6), setting (eg, “can hear only if patient is sleeping” for grade 1/6), prior personal experience with the murmur (eg, “I always hear this” for grade 3/6), the location of the murmur (eg, “can hear only in 1 location” vs “all locations” for grades 1/6 and 3/6, respectively), and its association with maneuvers (eg, “can hear only with maneuvers” for grade 1/6). Some participants used a combination of systems to grade various intensity murmurs.
Table 3 presents grading performance before and after participants received training on a heart sounds–based system for grading murmur intensity compared with no additional training. There was a strong training effect such that accuracy, interrater agreement, and consistency improved with experience alone in both groups. Overall, grading accuracy improved more in the intervention group (67% to 84%) than in the control group (65% to 77%; Δ improvement, 5%; 95% CI, −0.1% to 10%), but this did not quite reach statistical significance.
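The Δ improvement reported here is the difference between the 2 groups' changes in accuracy. A small sketch of that arithmetic, using the rounded group percentages quoted above and a hypothetical function name:

```python
def difference_in_improvement(intervention_before, intervention_after,
                              control_before, control_after):
    """Improvement in the intervention group minus improvement in the control group."""
    return (intervention_after - intervention_before) - (control_after - control_before)


# (84 - 67) - (77 - 65) = 17 - 12 percentage points
delta = difference_in_improvement(67, 84, 65, 77)
print(delta)  # prints 5
```

The confidence interval cannot be recovered from these summary values alone because it depends on the variability of the individual observers' changes.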
Improvements in interrater agreement and consistency were not significantly different between the intervention and control groups overall. However, compared with no instruction, the heart sounds–based system resulted in a significant incremental improvement in grading performance in providers with certain levels of training and for specific murmur intensities.
The incremental improvement in accuracy with training in the heart sounds–based system was most pronounced among attending physicians (Δ improvement, 11%; 95% CI, 0.4%-22%) and students (Δ improvement, 12%; 95% CI, 3%-20%). Relatively greater improvements in interrater agreement were observed among attending physicians (Δ improvement, 0.07; 95% CI, −0.03 to 0.17) and students (Δ improvement, 0.11; 95% CI, −0.02 to 0.24), but the differences did not achieve statistical significance. Compared with their control counterparts, attending physicians clearly had greater improvement in consistency with the heart sounds–based system (Δ improvement, 0.17; 95% CI, 0.01-0.32) (Table 4).
Greater improvements with the heart sounds–based system were observed in accuracy for grade 2 murmurs (Δ improvement, 20%; 95% CI, 10%-31%) and consistency for grades 2 (Δ improvement, 0.22; 95% CI, 0.09-0.36) and 3 murmurs (Δ improvement, 0.16; 95% CI, 0.05-0.28). There was an incremental improvement in interrater agreement with the heart sounds–based system for grades 2 (Δ improvement, 0.08; 95% CI, −0.22 to 0.38) and 3 murmurs (Δ improvement, 0.07; 95% CI, −0.17 to 0.31), but this did not reach statistical significance (Table 5).
In this sample of medical students, residents, and attending physicians, the most commonly used system for grading heart murmur intensity was some variation of the Levine system, which characterizes the intensity of murmurs relative to one another. Compared with controls, using the heart sounds as an internal reference for grading murmur intensity resulted in greater improvements in accuracy for attending physicians and students and grade 2 murmurs, and greater improvements in consistency for attending physicians and grades 2 and 3 murmurs. The heart sounds–based system also produced greater improvements in interrater agreement for attending physicians, students, and grades 2 and 3 murmurs, but these trends were not statistically significant.
To our knowledge, this is the first study to describe methods and abilities for grading heart murmur intensity and the first to test an alternative to the Levine method in an experimental design. Multiple studies have demonstrated that cardiac auscultation skills of residents and attending physicians are suboptimal,9-13 and several studies have evaluated the effectiveness of educational interventions to improve skills in recognizing common cardiac lesions and innocent murmurs.14-19 However, none of these specifically addressed how inaccurate and unreliable intensity grading might contribute to poor recognition of cardiac conditions.
This study suggests that the heart sounds–based method may be a promising alternative to the Levine system, but the results must be interpreted in the context of the study’s limitations. First, the sample size was small and thus may have limited our ability to detect smaller differences in improvement, particularly those favoring the heart sounds–based system in the subgroup analyses. Second, because participants were allocated to the intervention and control groups in an alternating rather than a random sequence, it is possible that selection bias was introduced in the allocation of participants. The fact that the intervention group consistently started with higher accuracy, interrater agreement, and consistency scores (before the intervention) suggests that they may have had better baseline cardiac auscultation skills. However, the baseline differences in grading performance were not statistically significant between the 2 groups, and their other baseline characteristics were otherwise very similar. Furthermore, it is unclear how the research assistant (M.T.), a medical student from another institution, could have known the participants’ baseline auscultation skills, or whether better baseline auscultation skills in the intervention group would have increased or diminished their improvement in grading performance compared with those of the control group. Third, we did not teach the Levine system to the controls and compare their improvement in accuracy, interrater agreement, and consistency with those of the subjects who were taught the heart sounds–based system. Therefore we cannot claim the superiority of one system over the other. Reminding participants of the exact definitions of murmur grades according to the Levine system might have produced different results. Fourth, a very strong training effect was observed in both the intervention and control groups, which may have limited our ability to discern the full effect of the intervention. 
Although we limited the preintervention period to an evaluation of 20 murmurs of varying intensity and types (innocent murmur and MVP), as evidenced by the improvements observed in the control group, participants quickly became familiar with the murmurs and improved their grading performance with the benefit of time alone. Despite this training effect, the intervention provided additional improvement in grading performance for some participants and certain murmurs. Finally, because the reference standard for determining grading accuracy was defined with respect to the heart sounds, the comparison of changes in accuracy in the intervention group (who used the heart sounds as an internal reference for grading murmur intensity) and the control group (who did not) is not entirely fair. However, the fact that accuracy improved more in the intervention group than in the control group demonstrates that users can quickly learn and correctly apply the system.
The lack of consistent differences in preintervention grading performance by level of training was surprising but suggests that existing heart murmur intensity grading systems are equally problematic for attending physicians, residents, and medical students. The differences in grading performance improvement observed among participants of different training levels were also unexpected and difficult to explain. We can only speculate that the intervention was ineffective among residents because they were too tired, too busy, or too distracted to absorb and apply a new skill. On the other hand, the intervention might have been effective among attending physicians and students because the attending physicians have the most experience with cardiac auscultation and thus a firm foundation from which to learn new techniques, whereas the medical students have so little experience grading heart murmurs that any organizing principles quickly improve their grading performance. The fact that accuracy, interrater agreement, and consistency improved the most with the intervention for grades 2 and 3 murmurs suggests that these are the most difficult ones to grade with existing rating methods. This has important clinical significance, as grade 3 murmurs are thought more likely to be associated with cardiac pathology.2-6
More research is needed to evaluate the effectiveness of the heart sounds–based grading system. Future clinical trials should compare it directly with the Levine system using a wider variety of murmurs, without a run-in period or with a longer washout period before the intervention to avoid the training effect observed in this study and thus isolate the effect of the grading system. Trials with larger samples representing providers from various institutions at different levels of training will enhance the generalizability of these results. The heart sounds–based system needs to be tested in actual patients and its utility in identifying cardiac pathology also needs to be evaluated, as murmurs graded 3/6 with this system may not be associated with cardiac pathology the same way that those graded 3/6 with the Levine system are. The heart sounds–based system also needs to be studied in actual patients to determine if changes in hemodynamic state produce differential changes in heart sound and murmur intensity that might diminish consistency and interrater agreement. If our findings are substantiated in other studies, then medical schools and resident training programs should teach the heart sounds–based system as an alternative to the Levine system for grading heart murmur intensity.
Correspondence: Ron Keren, MD, MPH, The Children’s Hospital of Philadelphia, 3535 Market St, Room 1524, Philadelphia, PA 19104 (email@example.com).
Accepted for Publication: September 16, 2004.
Funding/Support: This study was supported by grant K23 HD043179 from the National Institute of Child Health and Human Development, Bethesda, Md (Dr Keren).
Acknowledgment: We would like to thank Chris Feudtner, MD, PhD, MPH, and Neena Desai, MHS, for providing helpful comments on earlier drafts of the manuscript.
Ron Keren, Michele Tereschuk, Xianqun Luan. Evaluation of a Novel Method for Grading Heart Murmur Intensity. Arch Pediatr Adolesc Med. 2005;159(4):329–334. doi:10.1001/archpedi.159.4.329