The workflow of the current study is presented in Fig. 1. The following sections are dedicated to the description of data acquisition, radiomic features extraction, and diagnostic modeling framework, including feature selection methods, machine learning algorithms, and the process of evaluation and comparison of the models.
Dataset and image acquisition
A total of 395 patients suspicious of coronary artery disease who underwent 2-day stress-rest protocol MPI SPECT were enrolled in this study. All the data were anonymized and used without any intervention on patients’ diagnosis, treatment, or management. The study was approved by the institutional review board (IRB) of Shahid Beheshti University of Medical Sciences (IRB code: IR.SBMU.MSP.REC.1399.368). Informed consent was waived for all subjects by the same IRB listed above. All methods were performed in accordance with the relevant guidelines and regulations. To emulate a real clinical scenario, we did not apply any conditional inclusion/exclusion criteria to the dataset. However, it is noteworthy to mention that the enrolled dataset did not include patients with myocardial infarction.
SPECT imaging was performed for all patients with a 2-day stress-rest myocardial perfusion protocol. Both rest and stress (induced by exercise, dipyridamole, or dobutamine) myocardial perfusion images were included in this study. On average, 555 to 925 MBq of 99mTc-MIBI was administered intravenously into patients based on published guidelines37, 38. For exercise stress protocol, the radiopharmaceutical was injected when the patient’s heart rate reached 85% of its maximum value. Exercise testing was continued for at least 1 min after injection of the radiopharmaceutical to maintain constant maximal cardiac oxygen demand. For the pharmacological stress test, dipyridamole was injected at a dose of 0.56 mg/kg over 4 min (or dobutamine at a dose of 5 to 10 µg per kilogram every 3 to 5 min), followed by the injection of the radiopharmaceutical after three minutes39. Image acquisition was performed after 15–20 and 60 min post-injection for the exercise and pharmacologic stress tests, respectively40.
The images were acquired on a single-head gamma camera (Intermedical- MULTICAM 1000, Germany) imaging system using 32 projections over a 180° arc from right anterior oblique to left posterior oblique, stepping 30 s for each projection, with a matrix size of 64 × 64 and pixel dimension of 5.357 × 5.357 mm2. Supine stress imaging began 15 to 60 min after stress.
Definition of ground truth
Two nuclear medicine physicians reviewed patients’ gated MPI SPECT, additional clinical information and history, and classified patients as normal or diagnosed with CAD. Moreover, CAD positive patients were classified into low-, intermediate-, and high-risk groups. The ground truth was established based on a consensus between two physicians, and in cases where there was no agreement, a senior nuclear medicine physician made the final decision. Patients’ clinical information included prior MPI SPECT, blood pressure, echocardiography results, ECG and exercise test results, hyperlipidemia, Body Mass Index (BMI), and diabetes mellitus status. It is noteworthy that the physician had access to the traditional quantitative SPECT scores, such as Summed Stress (SSS), Rest (SRS), and Difference Scores (SDS), etc., and wall motion and thickening information from the gated datasets and the raw SPECT projections.
The dataset included 78 normal and 317 CAD patients including 135 low-, 127 intermediate, and 55 high-risk patients. The patients’ demographic information is summarized in Table 1.
The left ventricle myocardium, excluding the cardiac cavity, was manually segmented using the 3D-slicer software package41 by a nuclear medicine technologist with more than ten years of experience and edited/verified by an experienced nuclear medicine physician.
The Image Biomarker Initiative Standardization (IBSI)42 suggests interpolating images to isotropic voxel sizes to obtain rotationally invariant also to standardize the voxel size of images. However, in our dataset, all scans already had isotropic voxel spacing of 5.357 × 5.357 × 5.357 mm3. Hence, we kept them intact to avoid further manipulation of intensities. In addition, intensity levels inside the VOI were discretized to 64 Gy levels to ease the calculation of texture features. The radiomic features were calculated using Standardized Environment for Radiomics Analysis (SERA)43, a MATLAB-based package compliant with the IBSI guideline. For the purpose of validating reproducibility, this package has been evaluated in multi-center standardization studies44. A total of 118 features, including 13 intensity-based, 12 intensity histogram (ih), 3 intensity volume histogram (ivh), and 90 3D textural features (25 Gy-level co-occurrence matrix (GLCM), 16 Gy-level run length matrix (GLRLM), 16 Gy-level size zone matrix (GLSZM), 12 Gy-level distance zone matrix (GLDZM), 5 neighborhood gray-tone difference matrix (NGTDM), and 16 neighborhood gray-level dependence matrix (NGLDM)) were extracted for each VOI. Absolute value First-order statistical features (min, max, average, etc.) were considered irrelevant since MPI SPECT images were not quantitative36. Morphological features were also irrelevant since the VOI was the whole left ventricle myocardium. Family, names, and abbreviations of the extracted features are listed in Supplementary Table S1.
In this section, we introduce different rings in the chain of the proposed automated diagnostic framework, including establishment of diagnostic tasks and feature sets, feature selection, classifiers, and models’ evaluation process.
Diagnostic tasks establishment
Two diagnostic tasks were defined in this study for the models.
(1) The first task is CAD diagnosis, including classification of patients into negative, and positive CAD (normal/abnormal classification).
(2) The second task is risk diagnosis, including classification of patients into low-risk (negative, and low-risk CAD) and high-risk (intermediate- and high-risk) patients. Table 2 lists the tasks and their descriptions.
Feature set establishment
Rest-, Stress-, Delta-, and combined (combination of all) -radiomics feature sets were added to clinical features, including age, sex, family history, diabetes status, smoking status, and ejection fraction (calculated from SPECT images) to be fed into different models for diagnosing tasks 1 and 2.
The data were randomly divided into 80% and 20% for training and testing partitions. In all models, features extracted from the training dataset were normalized using the Z-score, and the obtained mean and standard deviation were applied to the corresponding feature extracted from the test dataset. Many of the extracted features may not correlate with the investigated outcome (not relevant features) or may correlate highly with each other (redundant features). These features do not provide new information and should therefore be excluded. We used three different FS methods, one filter-based: Maximum Relevance Minimum Redundancy (mRMR)45, and two wrapper-based: Boruta46 and Recursive Feature Elimination47 with the Random Forest as the core machine (RF-RFE). Since the used dataset for task 1 was unbalanced (78 normal and 317 abnormal patients), after the features were selected, we applied Synthetic Minority Over-sampling Technique (SMOTE) on the training data with selected features to correct for plausible biases48.
Classification of the patients was performed using nine different machine learning methods, namely Decision Tree (DT), Gradient Boosting (GB), K-Nearest Neighbor (KNN), Logistic Regression (LR), Multi-Layer Perceptron (MLP), Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) and eXtreme Gradient Boosting (XGB) algorithms. The hyperparameters were optimized in fivefold cross-validation in the training data by random-search for models with more than 100 different parameter settings (XGB and Random Forest) and grid-search for models with less than 100 different parameter settings. Subsequently, the optimum parameters were applied to the test data with 1000 bootstraps. The hyperparameters for each classifier and the range of their values are presented in Table 3. All FS and ML models were selected based on their public availability to increase the reproducibility of the study.
The area under the ROC curve (AUC), accuracy (ACC), sensitivity (SEN), and specificity (SPE) metrics were used to evaluate the performance of the models. In addition, the performance of the best models was statistically compared using the DeLong test (significance threshold < 0.05). All analysis was performed using R 4.0 (mlr library version 2.18).