Original Investigation
August 4, 2021

Performance of a Convolutional Neural Network and Explainability Technique for 12-Lead Electrocardiogram Interpretation

Author Affiliations
  • 1RISE Lab, Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley
  • 2Department of Medicine, Division of Cardiology, University of California, San Francisco, San Francisco
  • 3Cardiovascular Research Institute, San Francisco, California
  • 4Department of Laboratory Medicine, University of California, San Francisco, San Francisco
  • 5Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco
JAMA Cardiol. 2021;6(11):1285-1295. doi:10.1001/jamacardio.2021.2746
Key Points

Question  Can readily available electrocardiogram (ECG) data be used to train a high-performing convolutional neural network (CNN) across a large range of 12-lead ECG diagnoses when compared with clinical standards of care?

Findings  In this cross-sectional study of 992 748 ECGs from 365 009 adult patients, a CNN was trained to predict 38 diagnostic classes with strong overall performance. Compared with a consensus committee of cardiac electrophysiologists, the CNN performed comparably to or exceeded cardiologist clinical diagnoses and the MUSE (GE Healthcare) system’s automated ECG diagnosis for most classes.

Meaning  In this cross-sectional study, a CNN trained on readily available ECG data achieved comparable performance to cardiologists and exceeded the performance of MUSE automated analysis for most diagnoses.


Importance  Millions of clinicians rely daily on automated preliminary electrocardiogram (ECG) interpretation. Critical comparisons of machine learning–based automated analysis against clinically accepted standards of care are lacking.

Objective  To use readily available 12-lead ECG data to train and apply an explainability technique to a convolutional neural network (CNN) that achieves high performance against clinical standards of care.

Design, Setting, and Participants  This cross-sectional study was conducted using data from January 1, 2003, to December 31, 2018. Data were obtained in a commonly available 12-lead ECG format from a single-center tertiary care institution. All patients aged 18 years or older who received ECGs at the University of California, San Francisco, were included, yielding a total of 365 009 patients. Data were analyzed from January 1, 2019, to March 2, 2021.

Exposures  A CNN was trained to predict the presence of 38 diagnostic classes in 5 categories from 12-lead ECG data. A CNN explainability technique called LIME (Local Interpretable Model-Agnostic Explanations) was used to visualize ECG segments contributing to CNN diagnoses.
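
To make the idea of a LIME-style explanation concrete, the following is a minimal sketch and not the study's implementation. It splits a one-dimensional signal into segments, randomly masks segments, queries a model on each perturbed copy, and fits a proximity-weighted linear surrogate whose coefficients rank the segments. The function names, the `toy_model` stand-in for the CNN, the zero-masking scheme, the distance kernel, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def toy_model(signal):
    """Hypothetical stand-in for a trained CNN: score depends only on samples 40-59."""
    return sum(abs(x) for x in signal[40:60]) / 20.0


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def lime_segment_weights(signal, model, n_segments=5, n_samples=200,
                         kernel_width=0.75, ridge=1e-3, seed=0):
    """Return one surrogate coefficient per segment (larger = more influential)."""
    rng = random.Random(seed)
    seg_len = len(signal) // n_segments
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # 1 keeps a segment, 0 replaces it with zeros (assumed masking scheme).
        mask = [rng.randint(0, 1) for _ in range(n_segments)]
        perturbed = [
            x if mask[min(i // seg_len, n_segments - 1)] else 0.0
            for i, x in enumerate(signal)
        ]
        # Proximity to the unperturbed signal: fraction of segments switched off.
        dist = 1.0 - sum(mask) / n_segments
        weights.append(math.exp(-(dist ** 2) / (kernel_width ** 2)))
        rows.append(mask + [1])  # trailing 1 is the intercept column
        targets.append(model(perturbed))
    # Weighted ridge regression via the normal equations.
    d = n_segments + 1
    xtwx = [
        [sum(w * r[i] * r[j] for r, w in zip(rows, weights))
         + (ridge if i == j else 0.0) for j in range(d)]
        for i in range(d)
    ]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
            for i in range(d)]
    return solve_linear_system(xtwx, xtwy)[:n_segments]


# Synthetic 100-sample signal with a bump in samples 40-59 (segment index 2).
example_signal = [1.0 if 40 <= i < 60 else 0.1 for i in range(100)]
segment_weights = lime_segment_weights(example_signal, toy_model)
```

In this toy setup the surrogate assigns nearly all of its weight to the segment that the stand-in model actually depends on, which is the behavior an explanation of this kind is meant to surface.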

Main Outcomes and Measures  Area under the receiver operating characteristic curve (AUC), sensitivity, and specificity were calculated for the CNN in the holdout test data set against cardiologist clinical diagnoses. For a second validation, 3 electrophysiologists provided consensus committee diagnoses against which the CNN, cardiologist clinical diagnosis, and MUSE (GE Healthcare) automated analysis performance was compared using the F1 score; AUC, sensitivity, and specificity were also calculated for the CNN against the consensus committee.
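
The outcome measures named above can be sketched for a single diagnostic class as follows. This is a minimal illustration under stated assumptions, not the study's code: the labels, scores, and the 0.5 decision threshold are made up, and the function names are hypothetical. AUC is computed as the probability that a randomly chosen positive ECG scores higher than a randomly chosen negative one, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = class present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def f1_score(y_true, y_pred):
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)


def auc(y_true, scores):
    """Rank-based AUC: fraction of positive/negative pairs ordered correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up reference labels and model scores for eight ECGs (illustration only).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5
```

Sensitivity, specificity, and F1 depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason both kinds of measure are reported.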

Results  A total of 992 748 ECGs from 365 009 adult patients (mean [SD] age, 56.2 [17.6] years; 183 600 women [50.3%]; and 175 277 White patients [48.0%]) were included in the analysis. In 91 440 test data set ECGs, the CNN demonstrated an AUC of at least 0.960 for 32 of 38 classes (84.2%). Against the consensus committee diagnoses, the CNN had higher frequency-weighted mean F1 scores than both cardiologists and MUSE in all 5 categories (CNN frequency-weighted F1 score for rhythm, 0.812; conduction, 0.729; chamber diagnosis, 0.598; infarct, 0.674; and other diagnosis, 0.875). For 32 of 38 classes (84.2%), the CNN had AUCs of at least 0.910 and demonstrated comparable F1 scores and higher sensitivity than cardiologists, except for atrial fibrillation (CNN F1 score, 0.847 vs cardiologist F1 score, 0.881), junctional rhythm (0.526 vs 0.727), premature ventricular complex (0.786 vs 0.800), and Wolff-Parkinson-White (0.800 vs 0.842). Compared with MUSE, the CNN had higher F1 scores for all classes except supraventricular tachycardia (CNN F1 score, 0.696 vs MUSE F1 score, 0.714). The LIME technique highlighted physiologically relevant ECG segments.
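
A frequency-weighted mean F1 score can be sketched as a per-class F1 average in which each class is weighted by how often it occurs. The snippet below assumes the weight is the number of ECGs carrying each diagnosis in the reference set; the abstract does not spell out the exact weighting, so that is an assumption. The four F1 values are the CNN scores quoted above, while the class counts are hypothetical, so the result is not comparable to the category-level figures reported in the study.

```python
def frequency_weighted_mean(f1_by_class, count_by_class):
    """Average per-class F1 scores, weighting each class by its frequency."""
    total = sum(count_by_class[name] for name in f1_by_class)
    weighted_sum = sum(f1_by_class[name] * count_by_class[name] for name in f1_by_class)
    return weighted_sum / total


# CNN F1 scores quoted in the Results for four classes.
cnn_f1 = {
    "atrial fibrillation": 0.847,
    "junctional rhythm": 0.526,
    "premature ventricular complex": 0.786,
    "supraventricular tachycardia": 0.696,
}

# Hypothetical class counts, chosen only to illustrate the weighting.
hypothetical_counts = {
    "atrial fibrillation": 50,
    "junctional rhythm": 10,
    "premature ventricular complex": 30,
    "supraventricular tachycardia": 10,
}

weighted_f1 = frequency_weighted_mean(cnn_f1, hypothetical_counts)
unweighted_f1 = sum(cnn_f1.values()) / len(cnn_f1)
```

With these illustrative counts the weighted mean sits above the unweighted mean, because the most frequent class also has the highest F1 score; rare classes with low scores pull a frequency-weighted summary down less than they would a simple average.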

Conclusions and Relevance  The results of this cross-sectional study suggest that readily available ECG data can be used to train a CNN algorithm to achieve comparable performance to clinical cardiologists and exceed the performance of MUSE automated analysis for most diagnoses, with some exceptions. The LIME explainability technique applied to CNNs highlights physiologically relevant ECG segments that contribute to the CNN’s diagnoses.
