General approach
The results section presents three sets of experiments (see Fig. 1). In the baseline experiment, our published ECG-Vanilla net4 was tested on either the labeled NYU dataset of high-quality scans (https://education.med.nyu.edu/ecg-database/app) or unlabeled mobile device-acquired 12-lead ECG images. Because only the scanned 12-lead ECG images were labeled, they were vital for achieving high accuracy in heart disease classification. These experiments aimed to assess the feature-extraction capability of ECG-Vanilla and to justify the adversarial approach. As expected, ECG-Vanilla performed poorly when tested on images whose distribution and/or format differed from those of the NYU dataset. The main experiment tested ECG-Adversarial on mobile device-acquired 12-lead ECG images. Additional experiments assessed the performance of the ECG-Adversarial net on images generated from signals and on various unseen formats.
Detection of multiple heart conditions from 12-lead ECG images using the convolutional neural network ECG-Vanilla
In this section we first present 12-lead ECG analysis results using a convolutional network, named ECG-Vanilla4, that was trained and tested using perfectly scanned images from the NYU dataset. ECG interpretation must be able to identify and differentiate between cardiac comorbidities, as well as rare diseases. To this end, we developed ECG-Vanilla with a generic per-disease architecture that is separately trained for binary identification of each condition (see “Methods”). Table 1 shows the accuracy and F1 score (the harmonic mean of the positive predictive value and sensitivity) for detection of 14 diseases, with the cardiologist labeling taken as the ground truth. Both training and test sets were drawn from the same distribution (i.e., scanned images from the NYU dataset). The detection accuracy and F1 score of the network on the test set were 0.83–0.99 and 0.82–0.99, respectively, depending on the disease type and the amount of training data available for that disease. The network achieved ROC-AUC > 0.9 for all heart conditions (Fig. 2). The class-weighted average ROC-AUC was 0.97. Supplementary Table S1 presents additional statistical measures.
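The per-disease metrics could be reproduced with a routine along the following lines (a minimal sketch, not the authors' code; the function name, the 0.5 decision threshold, and the array layout are our assumptions):

```python
# Minimal sketch of computing per-disease accuracy, F1 and ROC-AUC.
# Assumes y_true holds cardiologist labels (0/1) and y_prob holds the
# per-disease probability outputs of a binary classifier such as ECG-Vanilla.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_disease(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Return accuracy, F1 (harmonic mean of precision and recall), and ROC-AUC."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),
    }

# Usage: one binary head per disease, each evaluated independently, e.g.
# metrics = {d: evaluate_disease(labels[d], probs[d]) for d in diseases}
```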
The receiver operating characteristic area under the curve (ROC-AUC) of cardiac condition identification by ECG-Vanilla. (A) Atrial fibrillation (AFIB) 0.99, (B) Premature ventricular contraction (VPC) 0.98, (C) Left-axis deviation (LAD) 0.97, (D) Left bundle branch block (LBBB) 1.0, (E) Sinus tachycardia 0.99, (F) Left atrial enlargement (LAE) 0.98, (G) ST changes—Nonspecific 0.9, (H) Left ventricular hypertrophy (LVH) 0.99, (I) Sinus bradycardia 0.99, (J) Sinus arrhythmia 0.96, (K) ST-elevation due to myocardial infarction 0.94, (L) Right bundle branch block (RBBB) 0.99, (M) Normal variant 0.96, (N) Prolonged QT interval 0.94.
To compare the performance of ECG-Vanilla to that of a state-of-the-art net, the same training routine (see “Methods” section) was used for both ECG-Vanilla and ResNet18. ECG-Vanilla was slightly superior to ResNet18 (Table S2).
Detection of multiple disorders from unlabeled mobile device-acquired 12-lead ECG images with distortion, using the convolutional network ECG-Adversarial
ECG-Vanilla proved very efficient in detecting the tested diseases, as shown by the high accuracy and high F1 scores (Table 1). However, when tested on images of the same format that were captured by a mobile device (to mimic clinical conditions), its performance was relatively poor (Table 2). To improve performance, we designed a domain-adversarial neural network called ECG-Adversarial, comprising both a label predictor and a domain classifier. In this context, domain adversarial training attempts to extract features that accurately identify cardiac conditions (label predictor), while at the same time intentionally deteriorating the ability of the domain classifier to determine whether an example is an original ECG image or an ECG image with mobile device acquisition distortion. As a result, the extracted features are only those that do not rely on a perfectly scanned image, thus training the network to find features that also enable diagnosis of mobile-captured images.
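A minimal PyTorch sketch of this idea follows, using the standard gradient reversal layer of domain-adversarial training9; the class names, head sizes, and the single-logit domain head are illustrative, not the exact ECG-Adversarial implementation:

```python
# Sketch of a gradient reversal layer and a two-headed network
# (label predictor + domain classifier) sharing one feature extractor.
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DANN(nn.Module):
    def __init__(self, feature_extractor: nn.Module, n_features: int):
        super().__init__()
        # The backbone is assumed to return a flat (batch, n_features) tensor.
        self.features = feature_extractor
        self.label_head = nn.Linear(n_features, 1)    # disease present / absent
        self.domain_head = nn.Linear(n_features, 1)   # scanned vs. mobile-captured

    def forward(self, x, lambd: float = 1.0):
        f = self.features(x)
        y_logit = self.label_head(f)                  # trained to predict the disease
        # Reversed gradient makes the extractor "confuse" the domain classifier.
        d_logit = self.domain_head(GradientReversal.apply(f, lambd))
        return y_logit, d_logit
```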
The proprietary format used for plots generated by an ECG machine of a specific vendor can include dozens of features, such as grid scale, grid color, lead beginning and ending with calibration plots, lead placement, and others. In ML jargon, such a format is called a domain, and the challenge is to transfer learned analysis capabilities from a domain in which plenty of high-quality, labeled data is available to domains in which only poor-quality, and perhaps unlabeled, training data is available. Another type of domain consists of images of 12-lead ECG plots acquired using mobile devices. Here, too, the challenge is to enable accurate analysis of data from that domain while training is carried out using data from the domain of high-quality, scanned images with no distortions or artifacts. To meet both challenges, ECG-Adversarial makes use of unlabeled 12-lead ECG samples from target domains (various formats, mobile camera-captured images) together with samples from the high-quality, labeled data domain. See the high-level description in Fig. 3.
Architecture of ECG-Adversarial. Two datasets were used for training and testing: the New York University (NYU) dataset containing perfectly scanned and labeled 12-lead ECG images, and a dataset containing mobile-captured 12-lead ECG images. The latter was generated by printing the NYU dataset images on paper, capturing them with a mobile camera so as to introduce various artifacts, and splitting the resulting video into frames. The input 12-lead ECG images from the NYU dataset are propagated forward and backward (blue and green, respectively) through the feature extractor. The same images, together with additional images captured by a mobile device, are also propagated forward and backward through the domain classifier (purple and green, respectively). When they are backpropagated through the domain classifier, adaptive gradient reversal is applied, essentially causing the feature extractor to ignore domain-specific features9. The input from the second dataset, namely mobile-captured 12-lead ECG images, is propagated forward and backward (purple and green, respectively) through the domain classifier path only. When these are backpropagated through the domain classifier (purple), gradient reversal is applied9. As a result of this domain adversarial method, the feature extractor is forced to ignore domain-specific features. In our case, domain-specific features either belong only to “clean” images that were not mobile-captured or are format-specific. The domain adversarial method thus allows the extractor to focus on features that generalize better to other domains (e.g., other 12-lead ECG formats and/or mobile-captured 12-lead ECG images), even though it was not trained using labeled data from these domains.
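One training step consistent with this data flow might look as follows (a hedged sketch reusing the DANN module above; the equal loss weighting, optimizer handling, and tensor shapes are our assumptions rather than the published training routine):

```python
# Sketch of one adversarial training step: labeled scanned images drive both
# heads, unlabeled mobile-captured images drive only the domain classifier.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, scanned_x, scanned_y, mobile_x, lambd=1.0):
    # scanned_y: float tensor of shape (batch, 1) with 0/1 disease labels.
    optimizer.zero_grad()

    # Labeled, perfectly scanned images: label loss + domain loss (domain = 0).
    y_logit, d_logit_src = model(scanned_x, lambd)
    label_loss = F.binary_cross_entropy_with_logits(y_logit, scanned_y)
    domain_loss_src = F.binary_cross_entropy_with_logits(
        d_logit_src, torch.zeros_like(d_logit_src))

    # Unlabeled mobile-captured images: domain loss only (domain = 1).
    _, d_logit_tgt = model(mobile_x, lambd)
    domain_loss_tgt = F.binary_cross_entropy_with_logits(
        d_logit_tgt, torch.ones_like(d_logit_tgt))

    # Gradient reversal inside the model turns the domain loss into an
    # adversarial signal for the feature extractor.
    loss = label_loss + domain_loss_src + domain_loss_tgt
    loss.backward()
    optimizer.step()
    return loss.item()
```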
In our experiments, ECG-Adversarial was trained on a dataset comprising 79,226 scanned and labeled 12-lead ECGs from the NYU School of Medicine Emergency Care Database and a dataset comprising 39,613 unlabeled mobile device-captured 12-lead ECG images. The test set here included only labeled, mobile device-captured 12-lead ECG images.
The first configuration tested did not include augmentation (a method for generating additional, perturbed training images) and used low dropout (a method for avoiding overfitting and achieving regularization) with early disturbance (training the domain classifier starting from the third epoch). The second configuration included augmentation and low dropout with early disturbance. Although one may assume that augmentation of the input should always improve system performance and reduce overfitting, some data have shown otherwise10. Applying augmentation resulted in minor performance degradation for specific cardiac conditions (e.g., sinus tachycardia, left ventricular hypertrophy and left atrial enlargement; Table 2, ROC-AUC without augmentation vs. ROC-AUC with augmentation, dropout = 0.15). We therefore tested a third configuration with adjusted training hyperparameters (ROC-AUC with augmentation, dropout = 0.25, training of the domain head starting from the fifth epoch).
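For reference, the three configurations can be written out as a simple hyperparameter table, together with the gate that delays training of the domain head (field names are ours; the values follow the text):

```python
# The three training configurations described above (illustrative field names).
configurations = [
    {"name": "config_1", "augmentation": False, "dropout": 0.15, "domain_head_start_epoch": 3},
    {"name": "config_2", "augmentation": True,  "dropout": 0.15, "domain_head_start_epoch": 3},
    {"name": "config_3", "augmentation": True,  "dropout": 0.25, "domain_head_start_epoch": 5},
]

def domain_head_active(epoch: int, config: dict) -> bool:
    """Early disturbance: the domain classifier is trained only from the
    configured (1-indexed) epoch onward; before that, only the label head learns."""
    return epoch >= config["domain_head_start_epoch"]
```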
The third configuration, using domain adversarial training on mobile-acquired NYU ECG images, exhibited the best performance (Table 2 and Fig. 4). The network achieved ROC-AUC > 0.82 for all heart conditions. Depending on the heart condition, the accuracy on the test set was 0.77–0.93 (Table S3), the F1 score was 0.79–0.93, and the ROC-AUC was 0.84–0.96.
The receiver operating characteristic area under the curve (ROC-AUC) of ECG-Adversarial on a hidden test set of mobile device-acquired 12-lead ECG images. (A) Atrial fibrillation (AFIB), (B) Premature ventricular contraction (VPC), (C) Left-axis deviation (LAD), (D) Left bundle branch block (LBBB), (E) Sinus tachycardia, (F) Left atrial enlargement (LAE), (G) ST Changes—Nonspecific, (H) Left ventricular hypertrophy (LVH), (I) Sinus bradycardia, (J) Sinus arrhythmia, (K) ST-Elevation due to myocardial infarction, (L) Right bundle branch block (RBBB), (M) Normal variant, (N) Prolonged QT interval. The average ROC-AUC of the best configuration tested was 0.91.
Note that the training and test databases include both ideally scanned images and augmented ones. Thus, the system is compatible with both scanned images and mobile device-captured images.
ECG-Adversarial detection of multiple disorders from an unseen format of 12-lead ECG images with mobile device acquisition distortion
12-lead ECG images do not have a uniform format; e.g., lead placement on the image may vary, and sometimes the long lead is not even present. In fact, obtaining training data covering all existing formats is an infeasible undertaking. To overcome this difficulty, we used our domain transfer approach, as in the case of mobile device-captured vs. perfectly scanned images, to train ECG-Adversarial to ignore format-specific features. To this end, the network's domain classification head was fed images of different formats, and gradient reversal was used during backpropagation. We then tested whether ECG-Adversarial performs well on 12-lead ECG images with mobile device acquisition distortion whose format was not seen by the network during training (referred to here as unseen formats). For testing, we used the third configuration of ECG-Adversarial, which achieved the highest average ROC-AUC scores in the mobile device capture experiments.
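One way to realize this is sketched below under our own assumptions (the paper does not specify whether the domain head is binary or multi-class for the format experiments): the domain classifier distinguishes several domains, and the gradient reversal layer from the earlier sketch is reused.

```python
# Illustrative multi-domain variant of the domain head (reuses the
# GradientReversal class from the sketch above). Class indices, feature size,
# and the multi-class head are assumptions, not the paper's exact design.
import torch.nn as nn
import torch.nn.functional as F

n_features = 512   # illustrative size of the shared feature vector
n_domains = 3      # e.g., 0: scanned NYU format, 1: mobile-captured, 2: alternative format
domain_head = nn.Linear(n_features, n_domains)

def format_domain_loss(features, domain_ids, lambd=1.0):
    # features: output of the shared extractor; domain_ids: LongTensor of domain indices.
    # Gradient reversal pushes the extractor to discard format-specific cues.
    logits = domain_head(GradientReversal.apply(features, lambd))
    return F.cross_entropy(logits, domain_ids)
```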
First, we tested ECG-Adversarial using mobile-acquired images of 12-lead ECG data with various unseen formats. The resulting ROC-AUC scores fell within the range of 0.94–1 (Fig. S2). Specifically, the ROC-AUC for atrial fibrillation was 1.0, for sinus bradycardia 0.99, for left-axis deviation 0.99, for left ventricular hypertrophy 1.0 and for sinus tachycardia 0.94. Note that the availability of data with various formats was restricted to this set of cardiac diseases, further emphasizing the importance and necessity of our methods: formats for which there is no opportunity to train may suddenly be introduced as input during clinical operation.
One of the main advantages of the architecture and method proposed in this paper is that results improve when the network is further retrained, during training or even later during operation, using several (unlabeled) examples from the target domain. Indeed, retraining the network after adding a few samples with unseen formats improved the ROC-AUC score to 0.99–1 (Fig. S3).
ECG-Adversarial detection of multiple disorders from 12-lead ECG signals
New ECG machines produced in recent years may provide a digital version of the 12 leads, essentially time series of values stored as vectors, which we refer to here as signals. It is relatively simple to turn signals into 12-lead ECG images. The reverse direction, however, is not straightforward; signal extraction, especially from images with artifacts, is a challenging task11.
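For illustration only, a signal-to-image conversion along these lines could be implemented as follows; the grid spacing, colors, and one-lead-per-row layout are our assumptions, and the actual procedure is described in “Methods”:

```python
# Sketch: render digital 12-lead signals onto an ECG-paper-like background.
import numpy as np
import matplotlib.pyplot as plt

def plot_ecg_on_paper(signals: np.ndarray, fs: int = 500, fname: str = "ecg.png"):
    """signals: array of shape (12, n_samples), amplitudes in millivolts."""
    n_leads, n_samples = signals.shape
    t = np.arange(n_samples) / fs
    fig, axes = plt.subplots(n_leads, 1, figsize=(10, 14), sharex=True)
    for i, ax in enumerate(axes):
        # Large ECG-paper squares: 0.2 s horizontally, 0.5 mV vertically.
        ax.set_xticks(np.arange(0, t[-1], 0.2))
        ax.set_yticks(np.arange(-1.5, 1.5, 0.5))
        ax.grid(which="major", color="salmon", linewidth=0.6)
        ax.plot(t, signals[i], color="black", linewidth=0.8)
        ax.set_ylabel(f"lead {i + 1}", rotation=0, labelpad=20)
    fig.savefig(fname, dpi=200)
    plt.close(fig)
```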
To validate that our approach can operate in a hybrid environment where the input comprises both images and signals, we used input consisting of 12-lead ECG signals plotted on an ECG paper background (see “Methods” section). Table S4 shows four performance assessments. In the first assessment, ECG-Adversarial was trained and tested using 12-lead ECG images with mobile device acquisition distortion. The average ROC-AUC score was 0.95. In the second assessment, ECG-Adversarial was trained similarly using 12-lead ECG images with mobile device acquisition distortion, but tested on images generated from signals from different databases (see “Methods”). Performance was similar to or higher than that reported above for unseen formats (see Table S4). The third assessment aimed to show that the classification accuracy for ECG images with mobile device acquisition distortion is not negatively affected by adding images generated from signals to the training set. The network was trained on a dataset consisting of both 12-lead ECG images with mobile device acquisition distortion and images generated from signals. The test was performed only on 12-lead ECG images with mobile device acquisition distortion. Diagnostic accuracy under these conditions was relatively close to that of the first assessment; the average ROC-AUC was 0.90. In the fourth assessment, the network was trained on both 12-lead ECG images with mobile device acquisition distortion and images generated from signals, and tested on images generated from signals. Under these conditions, the average ROC-AUC was similar to that of the third assessment, but substantially better than the accuracy achieved in the second assessment, where the training set did not include images generated from signals.
To summarize, our novel domain adversarial approach copes well with 12-lead ECG images captured by mobile device cameras as well as with digital signals plotted as an image. In addition, ECG-Adversarial network performance on signals can be further improved by including images generated from signals in the training process.