Detection performance of AI systems
Previous research has identified external validation as one of the main difficulties in the adoption of AI systems for screening mammography22,23. Our experiments with four state-of-the-art AI systems for screening mammography showed low to moderate performance on our independent, external validation data. Comparing the AUCs estimated by the original authors of each system (last column of Table 2) with our results (first column of Table 3) shows that the performance of all the systems decreased, with reductions in AUC ranging from 0.256 to 0.361. We argue that this reduction can be attributed to two main factors: differences in test populations and differences in experimental design. We elaborate on the implications of these differences below.
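The comparison above rests on the probabilistic interpretation of the AUC: it equals the probability that a randomly chosen cancer case receives a higher score than a randomly chosen control (the Mann–Whitney interpretation). As a minimal, self-contained sketch (not the study's actual evaluation pipeline, and using hypothetical scores), an externally estimated AUC can be computed directly from case and control scores and compared with an originally reported value:

```python
def auc(scores_pos, scores_neg):
    """AUC via pairwise comparison (Mann-Whitney): the fraction of
    (case, control) pairs where the case scores higher; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical scores for cancer cases and controls in an external test set.
internal_auc = 0.90  # AUC as reported by the original authors (hypothetical)
external = auc([0.8, 0.6, 0.7, 0.4], [0.5, 0.3, 0.6, 0.2])
print(f"external AUC = {external:.3f}, drop = {internal_auc - external:.3f}")
# prints: external AUC = 0.844, drop = 0.056
```

On large test sets this pairwise formulation is usually replaced by the equivalent rank-based computation, which is linearithmic rather than quadratic in the number of samples.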
Our test population differed from the populations used in the development of all the AI systems in this study. Breast cancer epidemiology can be significantly affected by the interplay of complex factors, including the population’s mean age, ethnicity, race, lifestyle, environment, socioeconomic status, and healthcare system24. To the best of our knowledge, this is the first study to assess the performance of AI systems for cancer detection in screening mammography in Finland. Our results highlight the importance of extensively testing AI systems in populations different from those used in the development of the systems.
There were also important differences in our experimental design that could affect the performance of the AI systems. First, we used a case–control design matched by age and mammographic system. Age is one of the strongest risk factors for breast cancer25. In the studies where the AI systems were developed, however, age was neither included in the experimental design nor accounted for in the statistical analysis. It is well known that age affects the radiological appearance of breast parenchyma26. This, in turn, could affect the performance of the AI systems: the systems may have relied, at least in part, on age-associated changes rather than on cancer-related changes in the breast. Moreover, previous studies have demonstrated that differences in mammographic systems can affect the reliability of computerized mammographic analysis algorithms27,28. In fact, recent research has demonstrated the impact of technological settings on the performance of AI systems for breast cancer screening29. Finally, a previous history of breast cancer is a strong risk factor30. Changes in breast parenchyma due to previous interventions (e.g., metal clips) and treatments (e.g., radiotherapy-associated changes) can serve as cues for AI systems. Because we excluded symptomatic women and women with previous findings or histories of breast cancer, we believe that our experimental setting represents a more challenging scenario for the detection of breast cancer.
In addition to the aforementioned factors, previous studies have pointed to overfitting and bias as plausible explanations for the inconsistent performance of AI systems in independent test data31. A recent meta-analysis of the external validation of AI systems for screening mammography found that most studies suggest a potential diagnostic improvement when the AI systems are used together with radiologists, but warned about the persistent risk of bias22.
Relevance of breast lesions in cancer detection
In this work, we defined the area of interest of an AI system as the region in a mammogram with a saliency level above a threshold. For each input mammogram, this threshold was determined automatically by maximizing the overlap between the area of interest and the breast lesions segmented by expert radiologists. Given the importance of breast lesions in clinical mammographic analysis by radiologists, our hypothesis was that the areas of interest of AI systems should overlap substantially with breast lesions. Our results, however, contradict this hypothesis. Specifically, the DSCs showed low overlap between the areas of interest and breast lesions (median DSC between 4.2% and 38.0%). In addition, the AI system with the highest performance, the End2End system with an AUC of 0.69, showed a remarkably low overlap, with a median DSC of 4.2% (IQR: 15.1%). Our results suggest that breast lesions are not as relevant to AI systems as they are to human readers when interpreting mammograms. Specifically, the low overlap between the areas of interest and breast lesions suggests that AI systems do not rely on breast lesions as their main decision cue in diagnostics.
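The thresholding procedure described above can be sketched as follows. This is an illustrative, simplified reimplementation (1-D masks, hypothetical values), not the study's actual code:

```python
def dice(a, b):
    """Dice similarity coefficient between two equally sized boolean
    masks (flattened lists); defined as 1.0 when both masks are empty."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 2.0 * inter / total

def area_of_interest(saliency, lesion, thresholds=None):
    """Binarize the saliency map at the threshold that maximizes the DSC
    with the radiologist-segmented lesion mask; return (threshold, DSC)."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(101)]
    return max(((t, dice([s >= t for s in saliency], lesion)) for t in thresholds),
               key=lambda td: td[1])

# Toy 1-D example: saliency values and a lesion mask over the first four pixels.
saliency = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.0]
lesion = [True] * 4 + [False] * 4
t, d = area_of_interest(saliency, lesion)
print(f"best DSC = {d:.3f}")  # prints: best DSC = 1.000
```

In practice the saliency map and lesion mask are 2-D images of equal size; flattening them row by row reduces that case to the sketch above.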
In recent years, the interpretability of AI systems has gained increasing attention in the machine learning community3,4,5,7,32. In medical imaging, XAI has been identified as one of the key factors in gaining radiologists’ acceptance and, ultimately, fostering the adoption of AI in clinical practice3. From the standpoint of explainability, a highly localized saliency would facilitate the understanding of which image regions or features are most relevant to the AI system. Surprisingly, in our experiments, the highest overlap between the areas of interest and the breast lesions was observed in systems with low detection performance (AUCs between 0.52 and 0.57). As shown in the last two columns of Fig. 3, the systems with the lowest performance, GMIC and GLAM, showed highly localized saliencies. The discussion of these results should take into account the training strategy adopted for the development of each AI system. Among the methods considered in this study, GMIC and GLAM were developed to improve the “interpretability” of the AI system by focusing the analysis on localized regions of interest using labeled data. On the one hand, this helps to explain why the areas of interest of these methods are highly concentrated in specific spatial regions. On the other hand, the lower performance of these methods raises the question of whether the interpretability of AI systems is attained at the expense of detection performance. Our results are highly relevant for the future development of AI systems, as they show that assigning high relevance to breast lesions does not translate into higher detection performance at the image level.
The relevance of large image regions to the outcome of computerized mammographic analysis has been reported before: in breast cancer risk assessment, the extraction of high-throughput quantitative imaging biomarkers from the whole breast region, namely radiomic analysis, has consistently shown promising performance in the prediction of future breast cancer33. Based on these findings, some researchers have asked whether small changes in radiological patterns that are inconspicuous to the human eye but occupy large regions of a mammogram could play a role in the detection capabilities of computerized systems34. The fact that AI systems use information found in large image regions not circumscribed to lesions is a plausible explanation for why the joint use of AI systems and radiologists outperforms both radiologists and AI systems alone14,22,35,36. Our results suggest that successful breast cancer detection by AI systems exploits non-localized image cues that are not limited to breast lesions.
Our finding has considerable clinical significance. Recent literature has proposed that stand-alone AI algorithms could, independently or in conjunction with a radiologist, detect breast cancer or triage mammograms. Triaged normal studies could be read in an adapted manner (e.g., by only one reader), and mammograms with suspicious findings could be prioritized37. AI systems that detect mammograms with findings suggestive of malignancy, albeit with limited ability to localize the tumor, would be especially beneficial for triaging mammograms. Such algorithms could also potentially replace one of the two readers. Nevertheless, a radiologist would still be needed to confirm the presence of the actual lesions. Conversely, AI systems that localize tumors more accurately, even if their image-level detection performance is lower, could be used to reduce missed diagnoses. Indeed, our results support the idea that when a new or existing AI system is reported, its authors ought to disclose both how well the system can detect mammograms with a high likelihood of breast cancer and how well it can localize the lesion.
Limitations and future work
We identify three main limitations in our work. First, the small sample did not allow for a saliency analysis according to histopathology and tumor grading. Recent works have pointed out how the localization performance of saliency methods changes according to certain tumor-related features, such as the shape and size of lesions38. Future research should explore the performance of AI systems while considering clinical information such as breast density, tumor biology, and previous interventions and treatments. This, however, would require a substantially larger sample with annotated lesions. In this regard, we would like to highlight the importance of current efforts in the construction of large screening datasets, including clinical data and image annotations39,40,41.
Another limitation of this study relates to the use of saliency analysis as a means of identifying the regions that most influence the decision-making process of AI systems. Among the existing state-of-the-art methods33,6, we selected a visualization-based method, since we were interested in establishing a connection between the outcome of the AI system and specific imaging features: breast lesions. While interpreting our results, however, one should bear in mind that the interpretability of AI systems remains an open problem, and saliency does not fully explain the decision-making process of AI systems6. Few studies have measured how explainability relates to the accuracy of the system. A recent review found that, of 179 works that used XAI, only one reported measures to evaluate the outcome of the XAI5. Research on the validity of XAI methods is also scarce6. In a recent study42, the authors compared four visualization methods for pneumonia detection in chest X-rays, and Grad-CAM yielded the best performance. Further investigation of saliency analysis in the context of screening mammography is warranted.
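For reference, the core of the Grad-CAM computation mentioned above can be sketched in a few lines. This is a framework-free illustration operating on hypothetical precomputed activations and gradients; in practice, both are obtained from a trained CNN via a forward pass and backpropagation:

```python
def grad_cam(activations, gradients):
    """Grad-CAM: weight each feature map by its global-average-pooled
    gradient (alpha_k), sum over channels, and apply a ReLU.
    activations, gradients: K feature maps, each an H x W nested list."""
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    # alpha_k: global average pooling of the gradient of each channel
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # ReLU of the gradient-weighted combination of forward activation maps
    return [[max(0.0, sum(alphas[c] * activations[c][i][j] for c in range(k)))
             for j in range(w)] for i in range(h)]

# Hypothetical 2-channel example on a 2x2 feature map.
acts = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 1.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],      # alpha_0 = 1.0
         [[-2.0, -2.0], [-2.0, -2.0]]]  # alpha_1 = -2.0
print(grad_cam(acts, grads))  # prints: [[1.0, 0.0], [1.0, 4.0]]
```

The resulting coarse map is typically upsampled to the input-image resolution and normalized before it is overlaid on the mammogram as a saliency map.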
Finally, an unavoidable limitation of our work is that we included only a limited number of state-of-the-art AI systems. These systems were selected because of their good performance in previous studies and their publicly available source code, which enabled the saliency analysis. As more AI systems become available, future research is warranted to corroborate our findings.