Study design and participants
The study of the imaging data was approved by the local ethics committee of the University of Heidelberg, and the requirement to obtain informed consent was waived due to the retrospective nature of the study and the thorough anonymization of the data.
In this retrospective multicentre study, we used CT-angiography data from 1179 patients previously treated at Heidelberg University Hospital (Heidelberg cohort) to develop and train a one-stage object-detection ANN for detecting and localizing vessel occlusions on CT-angiography. The Heidelberg cohort included 800 consecutive patients with AIS and confirmed vessel occlusion on CT-angiography who subsequently underwent EVT between 03/2010 and 02/2020, as well as 379 consecutive patients with a suspected diagnosis of stroke but no vessel occlusion (control group) who underwent CT-angiography between 10/2019 and 02/2020. Pseudo-prospective external testing of the ANN was performed on two different datasets: (i) the FAST cohort, comprising 358 consecutive patients who underwent CT-angiography between 01/2022 and 06/2022 for suspected AIS at three primary/secondary care hospitals of the regional stroke consortium Rhine-Neckar with acute teleneurology/teleradiology coverage through Heidelberg University Hospital, and (ii) the UKB cohort, comprising 323 patients who underwent CT-angiography between 09/2020 and 04/2021 for suspected AIS at the Department of Neuroradiology of Bonn University Hospital.
Figure 1 depicts the flowchart with the inclusion and exclusion criteria for patients in the Heidelberg, FAST and UKB cohorts. All patients underwent multimodal CT, including CT-angiography. The scanner and acquisition parameters are shown in Supplementary Table 3.
Figure 1 depicts the flowchart of the procedures performed for training, model development and testing of the ANN. For the Heidelberg cohort, imaging data was exported from the PACS system and converted to the NIfTI file format using dcm2niix (https://github.com/rordenlab/dcm2niix). All data was visually inspected, and cases with insufficient imaging quality (e.g., movement artifacts, insufficient vessel opacification) were excluded. Vessel occlusions were then labeled by ES and reviewed by GB, a neuroradiology resident with 6 years of experience, and PV, a board-certified neuroradiologist with 10 years of experience, using ITK-SNAP (http://www.itksnap.org/). All vessel occlusions within the CT-angiography acquisition were labeled using a spherical 3D-ROI with a diameter of 15 (MeVOs) or 30 (LVOs) voxels, placed with its center at the most proximal point of loss of contrast on one axial slice; the segmentation was then automatically propagated from the center point in 3D. Both the treated occlusion and incidental findings were included in the labeling. The original radiological report was reviewed in all cases to improve robustness. Four main classes of occlusions were defined:
Anterior LVOs—occlusions in the common carotid artery (CCA), internal carotid artery (ICA), M1-segment of the middle cerebral artery (MCA) and A1-segment of the anterior cerebral artery (ACA)19
Anterior MeVOs—occlusions of the M2-/M3-segment of the MCA, A2-/A3-segment of the ACA20
Posterior LVOs—occlusions in the vertebral artery (VA), basilar artery (BA) and the P1-segment of the posterior cerebral artery (PCA)19
Posterior MeVOs—occlusions of the P2/3-segment of the PCA20
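The spherical labeling scheme described above can be sketched as follows; this is an illustrative reimplementation, not the annotation tooling actually used, and the function name and interface are assumptions:

```python
import numpy as np

def spherical_roi(shape, center, diameter_vox):
    """Binary spherical mask of the given diameter (in voxels) around `center`.

    Sketch of the annotation scheme: a 3D sphere is propagated automatically
    from a single center point placed on one axial slice.
    """
    zz, yy, xx = np.ogrid[:shape[0], :shape[1], :shape[2]]
    r = diameter_vox / 2.0
    dist2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return dist2 <= r ** 2

# Example: a 30-voxel LVO label in a small synthetic volume
mask = spherical_roi((64, 64, 64), center=(32, 32, 32), diameter_vox=30)
```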
Next, patients with vessel occlusion were randomly split on a per-class basis into a training set (75%) and a test set (25%) to maintain the distribution of occlusion locations. Within the test set, a 1:1 distribution between patients with vs. without vessel occlusion was established, i.e., the same number of patients without vessel occlusion was added to the test set, whereas the remaining patients without vessel occlusion were added to the training set.
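The per-class split and the 1:1 balancing of the test set can be illustrated with the following minimal sketch (patient identifiers, class names, and the helper function are hypothetical):

```python
import random

def split_by_class(patients_by_class, train_frac=0.75, seed=42):
    """Per-class random split so the distribution of occlusion locations
    is preserved between training and test set (illustrative sketch)."""
    rng = random.Random(seed)
    train, test = [], []
    for ids in patients_by_class.values():
        ids = ids[:]
        rng.shuffle(ids)
        k = round(len(ids) * train_frac)
        train += ids[:k]
        test += ids[k:]
    return train, test

# Hypothetical example data
occluded = {"ant_LVO": [f"p{i}" for i in range(20)],
            "post_MeVO": [f"q{i}" for i in range(8)]}
controls = [f"ctrl{i}" for i in range(40)]

train, test = split_by_class(occluded)
# 1:1 balancing: as many controls in the test set as occluded test patients,
# the remaining controls go to the training set
test_controls = controls[:len(test)]
train_controls = controls[len(test):]
```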
Detecting objects based on coarse annotations is a fundamental problem of computer vision and can be tackled with various methods. Here, we based our study on the use of RetinaNet21,22, a single-stage object detector that is both simple in design and provides a solid foundation for robust performance across various clinical problems23. A confidence threshold for the ANN prediction was determined on the training set by maximizing the F2-score, thereby putting more attention on minimizing false-negatives rather than false-positives. Performance was evaluated using 5-fold cross-validation on the training set and using the ensemble model for predicting the test set.
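The F2-based threshold selection can be sketched as follows. The F-beta score with beta = 2 weights recall (sensitivity) four times as heavily as precision, which is why maximizing it penalizes false negatives more than false positives; the grid search and interface are illustrative assumptions, not the original implementation:

```python
import numpy as np

def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score: (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP)."""
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0

def best_threshold(scores, is_tp, n_gt, grid=None):
    """Pick the confidence cutoff maximizing the F2 score.

    `scores`/`is_tp`: per-prediction confidence and true-positive flag;
    `n_gt`: number of ground-truth occlusions (hypothetical interface).
    """
    grid = np.linspace(0, 1, 101) if grid is None else grid
    best_t, best_f2 = 0.0, -1.0
    for t in grid:
        keep = scores >= t
        tp = int(np.sum(keep & is_tp))
        fp = int(np.sum(keep & ~is_tp))
        f2 = f_beta(tp, fp, n_gt - tp)
        if f2 > best_f2:
            best_t, best_f2 = t, f2
    return best_t, best_f2

# Toy example: two confident true positives, two low-confidence false positives
scores = np.array([0.9, 0.8, 0.3, 0.2])
is_tp = np.array([True, True, False, False])
t_best, f2_best = best_threshold(scores, is_tp, n_gt=2)
```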
For the FAST cohort, imaging data from three regional hospitals (Mosbach, Sinsheim and Heppenheim, Germany) had previously been sent to Heidelberg University Hospital for teleradiological reporting and was therefore available in our local PACS system. Imaging data was exported in batch from the PACS system using ADIT (https://github.com/radexperts/adit) and converted to the NIfTI file format using dcm2niix. Patients with insufficient data quality or missing data were excluded. To simulate realistic usage of the developed ANN within a clinical scenario, CT-angiography data were then processed through the ANN (previously developed on the Heidelberg cohort) to detect vessel occlusions, producing both labels and confidence scores (Fig. 2).
Commercial software comparison
Benchmarking of the developed ANN was performed against two FDA-cleared and CE-marked commercial software solutions currently available for purchase (Software #1 and Software #2). Both software solutions were tested on the UKB cohort, but only Software #2 could be tested on the FAST dataset due to contractual limitations. The software names are blinded throughout the paper and cannot be disclosed; information on the architecture or mechanisms used by the software to detect vessel occlusions is also unavailable due to the proprietary nature of the software solutions. Briefly, the software provided binary predictions and localizations of vessel occlusions, without further measures of uncertainty or confidence scores. Comparisons were performed visually by GB and reviewed by PV and UN; disagreements were resolved through consensus discussion. Written reports within the PACS system were referenced during the review process to increase accuracy and avoid misdiagnosis.
In order to provide a fair comparison with the commercial software, which by design were both limited to detecting occlusions in the anterior circulation, we analyzed all patients by considering only the detection of anterior-circulation occlusions, specifically occlusions of the internal carotid artery (ICA) and the M1-segment of the middle cerebral artery for LVOs, and of the M2- and M3-segment of the middle cerebral artery for MeVOs. Findings were considered correct as long as they were labeled on the correct vessel, without considering the precise localization of the occlusion. Findings labeled in vascular territories not considered by the commercial software (e.g., posterior circulation, anterior cerebral artery) were ignored. McNemar’s test was used to compare specificity and sensitivity; a comparison of relative predictive values was used instead to compare PPV and NPV (rpv.test function of R’s DTComPair package).
In order to provide a full analysis of the capabilities of our algorithm while increasing comparability to previous studies5,6,7,8,9,10,14,16,24,25,26,27,28,29,30,31,32,33,34, the evaluation of the models was divided into (i) object-level and (ii) patient-level evaluation.
We performed automated and quantitative evaluation using the expert-generated segmentation masks and the predicted bounding boxes from the CNN as input. Specifically, as further explained below, the localization of a vessel occlusion was considered correct if the Intersection over Union (IoU) between the predicted bounding box and the ground truth bounding box exceeded 0.1027,28.
Briefly, for the calculation of patient-level performance, the localization information was ignored, and patient-level classification results were produced by selecting the maximum of the predicted confidence scores. All CTA scans with at least one marked VO were then considered positive findings. The AUC was used to evaluate the continuous predictions, while sensitivity, specificity, PPV, and NPV were calculated at the same confidence thresholds as the object-level evaluation. Only patients with a single VO were included in the patient-level evaluation when performance was assessed for each subgroup separately. Bootstrapping with 1000 iterations was used to determine percentile bootstrap 95% confidence intervals for the free-response operating characteristics (FROC35) and all other performance estimates.
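The reduction from object-level confidences to a patient-level score, and the percentile bootstrap, can be sketched as follows (function names and interfaces are illustrative assumptions):

```python
import random

def patient_score(object_confidences):
    """Patient-level classification score: the maximum predicted object
    confidence, or 0.0 when the network predicts no occlusion (sketch)."""
    return max(object_confidences, default=0.0)

def bootstrap_ci(values, stat, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap 95% confidence interval for a performance
    estimate: resample with replacement, recompute the statistic, and take
    the 2.5th/97.5th percentiles of the resampled statistics (sketch)."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_iter)
    )
    return stats[int(alpha / 2 * n_iter)], stats[int((1 - alpha / 2) * n_iter) - 1]

score = patient_score([0.2, 0.9, 0.5])
lo, hi = bootstrap_ci([0.8, 0.9, 1.0, 0.7, 0.85], lambda v: sum(v) / len(v))
```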
FROC36 is a commonly used metric to evaluate CAD systems and assesses the sensitivity at multiple working points with a varying number of false-positive predictions per image. To obtain a single performance score from the entire curve, the sensitivity values at [1/8, 1/4, 1/2, 1, 2, 4, 8] false positives per image were averaged. These values were selected in accordance with previous publications on CAD tools35,37 and account for the need for methods with high sensitivity in a screening setting while rewarding a low number of false positives. To account for the cubic decline of the IoU in three-dimensional data and the coarse annotation of the ROI, the localization of a VO was considered correct if the IoU between the predicted bounding box and ground truth bounding box exceeded 0.1023,38.
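The averaging over the FROC curve can be sketched as follows; the interface is a simplified assumption (predictions pooled over the whole dataset, one sensitivity value read off per false-positive rate):

```python
import numpy as np

def froc_score(scores, is_tp, n_gt, n_images,
               fp_rates=(1/8, 1/4, 1/2, 1, 2, 4, 8)):
    """Average sensitivity at fixed false-positives-per-image rates (sketch).

    Predictions are sorted by confidence; for each target FP rate, the
    highest sensitivity reachable without exceeding that rate is taken.
    """
    order = np.argsort(-np.asarray(scores))
    tps = np.cumsum(np.asarray(is_tp)[order])
    fps = np.cumsum(~np.asarray(is_tp)[order])
    sens = []
    for rate in fp_rates:
        allowed = fps <= rate * n_images
        sens.append(tps[allowed].max() / n_gt if allowed.any() else 0.0)
    return float(np.mean(sens))

# Toy example: 4 predictions over 4 images, 2 ground-truth occlusions
froc = froc_score(scores=[0.9, 0.8, 0.7, 0.6],
                  is_tp=[True, False, True, False],
                  n_gt=2, n_images=4)
```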
For the object-level evaluation, localization information was maintained, and the localization of a vessel occlusion was considered correct if the IoU between the predicted bounding box and ground truth bounding box exceeded 0.1039. Duplicate predictions of the same VO were considered false positives. Since the detection task was formulated as a binary detection task (differentiating vessel occlusions from background), false-positive predictions could not be assigned to a respective subgroup. Subgroup analysis on the object level was thus performed by computing sensitivity for each subgroup separately, while the number of false positives was counted across all subgroups.
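The IoU hit criterion used above can be made concrete with a short sketch. In 3D, a modest per-axis misalignment shrinks the IoU cubically, which motivates the permissive 0.10 threshold; the box convention (z1, y1, x1, z2, y2, x2) is an assumption for illustration:

```python
def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (z1, y1, x1, z2, y2, x2)."""
    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    inter = 1.0
    for d in range(3):
        lo, hi = max(box_a[d], box_b[d]), min(box_a[d + 3], box_b[d + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    return inter / (volume(box_a) + volume(box_b) - inter)

def is_hit(pred_box, gt_box, thresh=0.10):
    """A predicted box counts as a correct localization if IoU > 0.10."""
    return iou_3d(pred_box, gt_box) > thresh

# Half-overlap along one axis: IoU = 1/3 -> hit
hit = is_hit((0, 0, 0, 10, 10, 10), (5, 0, 0, 15, 10, 10))
# Half-overlap along all three axes: IoU = 1/15 -> already a miss
miss = is_hit((0, 0, 0, 10, 10, 10), (5, 5, 5, 15, 15, 15))
```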
As further listed in the results, it became apparent through the visual case review process that the network was also focusing on high-grade stenoses (HGS), labeling these as false-positive occlusions. High-grade stenoses constitute a clinically relevant vessel pathology which can cause stroke symptoms at presentation and requires additional considerations when evaluating stroke CTAs, as well as when planning the subsequent intervention. Therefore, we conducted a separate sub-analysis by also documenting high-grade vessel stenoses. HGS were labeled with the same procedures as VOs and were considered high-grade if luminal narrowing exceeded 70%; measurements at the carotid bifurcation were performed according to the NASCET trial standard40. Within this subgroup analysis, previous false-positive labels on confirmed high-grade stenoses were considered true positives. Conversely, any missed high-grade stenosis was considered a false negative at both the case and object level. Cross-referencing with the radiological report was always performed during the labeling procedures to increase accuracy.
Network training: image pre-processing
The target spacing was set to the median spacing of the training cohort (namely, 0.5 mm × 0.453 mm × 0.453 mm). Since the voxel density values of CT scans are measured on an absolute scale expressed in Hounsfield units, we employed a global normalization scheme for all cases in order to avoid loss of information41; specifically, statistical properties of the voxel intensities such as mean, standard deviation, and percentiles were collected across the entire training dataset and used to clip the voxel intensities to their 0.5 and 99.5 percentiles, followed by z-score normalization41.
The RetinaNet21 architecture consists of three main components: the encoder network, which consecutively downsamples the image to extract features at multiple resolutions; the decoder network, which progressively upsamples the obtained features to combine coarse (low-resolution) with fine-grained (high-resolution) features; and the detection heads, which are responsible for classifying and regressing the anchors. The ANN receives three-dimensional input patches of [192, 128, 128] voxels for processing. A detailed overview of the architecture can be found in Supplementary Fig. 1.
The encoder network utilizes 3 × 3 × 3 convolutions, Instance Normalisation and Leaky Rectified Linear Units to extract features. Strided convolutions at the beginning of each resolution stage are used to downsample the feature maps. The four deepest (i.e., lowest resolution) feature maps are used for further processing by the decoder network.
A Feature Pyramid Network42 is used to recombine information from different resolutions. First, each feature map is processed by a 1 × 1 × 1 convolution to reduce the number of channels to 128. Transposed convolutions are used to progressively upsample them and element wise addition is used to combine the features.
These feature maps are fed to a set of shared convolutions, usually referred to as the detection head. It is responsible for classifying and regressing the predefined set of anchors and consists of 3 × 3 × 3 convolutions, Group Normalisation43, and Leaky Rectified Linear Units.
Network training details
Anchors are an essential concept of several commonly used object detectors, as they act as initial estimates of objects and are used to formulate the detection task as a classification and regression problem. To account for the differently sized annotations, two anchors of size [8, 10, 10] and [15, 14, 14] were used during our experiments, which were derived by the planning procedure of nnDetection23. The assignment of ground truth objects to anchors during the training was conducted via Adaptive Training Sample Selection44. Binary cross-entropy loss was used to train the classification branch of the detection head and the regression branch was trained with the smooth L1 loss45.
To reduce overfitting and artificially increase the diversity of the training samples, online data augmentation was utilized throughout the entire training. In order to avoid artifacts at the edges when applying spatial augmentations, a patch size of [328, 249, 295] was extracted from the CTA scan and cropped to the training patch size of [192, 128, 128] after the spatial augmentations were applied. We utilized the same set of augmentations as nnU-Net41 except dropping the Simulation of Low-Resolution Samples due to the observation of a slightly reduced performance when using it.
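The oversized-patch strategy can be illustrated with a short sketch: a patch larger than the training size is extracted, spatial augmentations are applied to it, and the central region is then cropped so that interpolation artifacts at the patch border fall outside the training patch (the augmentation step itself is omitted here):

```python
import numpy as np

def center_crop(volume, target):
    """Crop the central `target` region from an oversized patch; the margin
    absorbs edge artifacts created by spatial augmentations (sketch)."""
    starts = [(s - t) // 2 for s, t in zip(volume.shape, target)]
    slices = tuple(slice(st, st + t) for st, t in zip(starts, target))
    return volume[slices]

# Oversized patch as in the training pipeline, cropped to the training size
patch = np.zeros((328, 249, 295), dtype=np.float32)
train_patch = center_crop(patch, (192, 128, 128))
```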
The ANN was trained with Focal loss21 and smooth L1 loss45 in a fivefold cross-validation fashion to differentiate between background and labeled VOs. The 5 folds were generated as stratified randomized folds considering all available classes, with the overall least frequent class present in a patient used as the basis for stratification. During our cross-validation experiments, we found that training for 60 epochs with 2500 batches each, using SGD with Nesterov momentum23,41 to update the weights, achieved the highest FROC score on our dataset. Here, the last 10 epochs were used for Stochastic Weight Averaging to further optimize the final model46. Training was performed on patches to overcome the memory limitations caused by the 3D model configuration; patches were set to a size of 192 × 128 × 128 voxels, with a batch size of 8, and were sampled from the CTA scans while ensuring an equal number of foreground and background patches per batch.
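The fold stratification by a patient's overall least frequent class can be sketched as follows; the round-robin dealing scheme, class names, and interface are illustrative assumptions rather than the original implementation:

```python
from collections import Counter, defaultdict

def stratified_folds(patient_classes, n_folds=5):
    """Stratified fold assignment: key each patient by its overall least
    frequent class, then deal each class group round-robin across folds."""
    freq = Counter(c for classes in patient_classes.values() for c in classes)
    by_label = defaultdict(list)
    for patient, classes in sorted(patient_classes.items()):
        # the rarest class present in this patient drives stratification
        by_label[min(classes, key=lambda c: freq[c])].append(patient)
    folds = [[] for _ in range(n_folds)]
    for _, ids in sorted(by_label.items()):
        for i, patient in enumerate(ids):
            folds[i % n_folds].append(patient)
    return folds

# Hypothetical example data
patients = {
    "p1": ["ant_LVO"], "p2": ["ant_LVO", "post_MeVO"], "p3": ["ant_LVO"],
    "p4": ["post_MeVO"], "p5": ["ant_MeVO"], "p6": ["control"],
    "p7": ["control"], "p8": ["control"], "p9": ["control"], "p10": ["ant_LVO"],
}
folds = stratified_folds(patients, n_folds=5)
```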
Since occlusion annotation depended on the occurrence of intravascular loss of contrast, multiple annotated occlusions could be present in a single patient. Frequently, this was caused by tandem occlusions, especially the simultaneous occlusion of the internal carotid artery and the middle cerebral artery47. Coincidental findings as well as combinations of recent and preexisting vessel occlusions were also included.
Network training: selection of the confidence threshold
To provide an analysis at one specific working point, a cutoff had to be determined for the continuous confidence scores produced by the network. At test time, one model from each fold was ensembled to form a single prediction. Since the models need to agree on a prediction, the distribution of the confidence scores implicitly differs between the fivefold cross-validation and the test set. To account for this shift, we designed an additional experiment on the training set data.
The training cohort (n = 835) was separated into a mini-training (n = 418) and mini-evaluation (n = 417) cohort. The mini-evaluation dataset contained 207 control patients and 210 patients with at least one vessel occlusion, see Supplementary Fig. 2. This experiment used the same hyperparameters for training and class balancing procedure to generate the folds as the primary experiment. After predicting the mini-evaluation dataset, the cutoff was determined to maximize the F2 score on the object level. The F2 score was previously used by ref. 48 to assess object level performance and was chosen in our experiment to adjust the tradeoff between sensitivity and precision. The best F2 score of 0.74 on the mini-evaluation set was reached at a confidence cutoff of 0.647, see Supplementary Fig. 3.
During testing and fivefold cross-validation, each patient was predicted via a sliding window scheme with 50% patch overlap. To suppress duplicate predictions of the same vessel occlusion, Non-Maximum Suppression with an Intersection Over Union Threshold of 0.3 was applied and predictions that were close to the center of the patch received a higher weighting than predictions close to the border. For each model, all predictions with a confidence score above 0.2 were selected for further ensembling. Weighted Box Clustering (WBC)38 (without restrictions on the area of the predictions) was used to combine predictions from the different models. Predictions that exceeded an Intersection over Union of 0.4 were considered clusters and merged into a single prediction. All bounding boxes which had any axis smaller than 7 voxels were removed from the final set of predictions.
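The duplicate suppression step can be sketched with a minimal greedy 3D non-maximum suppression; the patch-center weighting and Weighted Box Clustering are omitted, and the box convention (z1, y1, x1, z2, y2, x2) is an assumption:

```python
import numpy as np

def nms_3d(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression for axis-aligned 3D boxes (sketch).

    Keeps the highest-scoring box and drops overlapping duplicates, as done
    here for sliding-window predictions of the same vessel occlusion."""
    def iou(a, b):
        inter = 1.0
        for d in range(3):
            lo, hi = max(a[d], b[d]), min(a[d + 3], b[d + 3])
            if hi <= lo:
                return 0.0
            inter *= hi - lo
        va = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
        vb = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
        return inter / (va + vb - inter)

    order = list(np.argsort(-np.asarray(scores)))
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep

# Two near-duplicate boxes and one distant box: the duplicate is suppressed
keep = nms_3d(
    boxes=[(0, 0, 0, 10, 10, 10), (1, 1, 1, 11, 11, 11), (30, 30, 30, 40, 40, 40)],
    scores=[0.9, 0.8, 0.7],
)
```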
Model inference was performed on a NVIDIA DGX A100 (NVIDIA, Santa Clara, CA, USA) by using four GPUs with a sliding window scheme and 50% patch overlap23. Non-maximum suppression was applied in order to remove duplicate predictions from neighboring patches, with predictions near the center of a patch weighted with higher importance than predictions close to the borders. The final five models (one from each fold) were ensembled via WBC38. Postprocessing parameters were determined by empirical hyperparameter tuning on the training set as previously described in the literature23.
CTA phase correlation
To assess possible correlations between CTA scan phase and detection performance, as reported before8, test set scans were classified into five different groups, ranging from arterial (early arterial, peak arterial) and equilibrium to venous phase (peak venous, late venous). Classification was performed following a previously published method49.
The prediction of the test cases was performed on a DGX A100 running Ubuntu 20.04.4 LTS. The computer was equipped with two AMD EPYC 7742 CPUs with 64 physical cores each and 1 TB of random-access memory (RAM). Four NVIDIA A100 GPUs with 40 GB of video random-access memory (VRAM) each were used. The preprocessing step to normalize the data is the predominant bottleneck when multiple GPUs are used for ANN inference.
Distributed Data Parallel from PyTorch was used to predict a subset of the extracted patches for the sliding window approach on each GPU. The predictions from each GPU were gathered before the ensembling step. This approach allows for a flexible number of GPUs to be used to predict each patient with little synchronization overhead between different processes.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.