Prostate cancer detection in histological sections of multiregional prostate biopsies and Gleason grading of the detected carcinoma are routine, laborious pathology tasks. Artificial intelligence-based algorithms have proven to be accurate tools in many tumor types, including prostate cancer1,2,3,4,5,6,7,8,9,10,11,15. In this study, we clinically validate an AI-based tool (which recently received CE-IVD certification) for tumor detection and Gleason grading in histological sections of prostate core biopsies (Figs. 1a, b and 2 and Supplementary Figs. 1–3). This validation study is one of the largest clinical validation studies of an AI tool for digital pathology to date. It includes six datasets stemming from five pathology departments and comprising >5900 diagnostic slides, scanned with three different scanners and at two different magnifications (Fig. 1c). The heterogeneity captured by the study cohort with respect to laboratory techniques and the quality of cutting, staining, and digitization is enormous (Supplementary Figs. 4 and 5) and represents “real-world” practice without pre-selection of cases.
The AI tool showed high accuracy for prostate adenocarcinoma detection. In this study we tested two slightly different approaches to rendering single biopsy cores positive or negative for tumor (Fig. 3). Both approaches provided very similar tumor detection accuracy metrics (Fig. 3a, b). However, our approach, which aggregates the maximal probability of tissue regions being tumor, systematically provided a better balance between very high sensitivity (0.975–1.000) and negative predictive value (0.988–1.000) on the one hand and high specificity on the other. This was true in all six independent datasets used for validation. High NPV and sensitivity are naturally of particular importance for routine diagnostic cases. Importantly, an additional value of the AI tool was demonstrated by its detection of biopsy cores containing tumor tissue that was missed by pathologists during initial review (up to 13 cores per cohort, see “Results”). Although this did not alter the overall case status in our study, it certainly might, especially in pathology departments not sub-specialized in genitourinary pathology.
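For illustration, the per-core decision rule and the reported accuracy metrics can be sketched as follows. This is a minimal sketch, not the validated implementation; the operating threshold, variable names, and example probabilities are assumptions.

```python
# Sketch: aggregate patch-level tumor probabilities into a per-core call via the
# maximum, then compute sensitivity, specificity and NPV at a fixed threshold.
import numpy as np

def core_is_tumor(patch_probs: np.ndarray, threshold: float = 0.5) -> bool:
    """A core is called positive if any tissue region exceeds the threshold."""
    return float(np.max(patch_probs)) >= threshold

def detection_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Sensitivity, specificity and NPV from binary per-core labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
    }

# Invented example: three cores with ground-truth labels and patch probabilities
cores = [np.array([0.02, 0.10, 0.97]),   # tumor core
         np.array([0.01, 0.03, 0.04]),   # benign core
         np.array([0.20, 0.55, 0.30])]   # tumor core
y_true = np.array([1, 0, 1])
y_pred = np.array([core_is_tumor(c) for c in cores]).astype(int)
print(detection_metrics(y_true, y_pred))
```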
Most false positive tumor misclassifications issued by the AI tool stemmed from known mimickers of carcinoma or morphologically complex regions, which represent useful alerts for pathologists in clinical practice (Fig. 4a and Supplementary Fig. 7). False negative tumor detections occurred occasionally, with at least some of them arising in regions with mechanical/cutting artifacts or other quality issues (e.g., out-of-focus regions), a known problem for AI-based algorithms14. This warrants two strategies. First, AI tool predictions in the context of any artifacts should be interpreted by pathologists with additional caution. Second, running an automated quality control tool before processing slides with the tumor detection algorithm might be of additional benefit, as it would identify and highlight or mask all artificially altered regions prior to the tumor detection step.
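One possible form of such artifact handling is sketched below, assuming a hypothetical quality-control model that flags artifact patches; the masking rule, review threshold, and function names are illustrative assumptions and not part of the validated tool.

```python
# Sketch: exclude artifact patches from the max-probability aggregation and flag
# heavily affected cores for manual review.
import numpy as np

def aggregate_with_qc(patch_probs: np.ndarray,
                      artifact_mask: np.ndarray,
                      threshold: float = 0.5):
    """Return (core_call, review_flag). Artifact patches are ignored; the core is
    flagged for review if a large fraction of its tissue is affected."""
    usable = patch_probs[~artifact_mask]
    review_flag = artifact_mask.mean() > 0.2   # assumed review threshold
    if usable.size == 0:
        return None, True                      # nothing assessable, force review
    return bool(usable.max() >= threshold), review_flag

probs = np.array([0.1, 0.92, 0.05, 0.2])
mask = np.array([False, True, False, False])   # second patch is out of focus
print(aggregate_with_qc(probs, mask))
```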
Several studies published to date have validated clinical-grade AI-based algorithms for prostate cancer detection in histological sections using external data, summarized in Supplementary Table 1. Campanella et al.1 developed an algorithm based on a weakly supervised approach using 12,132 core needle biopsy slides, which was validated on an external dataset of another 12,727 slides, reaching an AUROC of 0.986. The AUROC might be a suboptimal metric for diagnostic tools in certain circumstances16 and does not allow a direct comparison to the results of the present study, as we use a fixed threshold (the AUROC value for our tool in the development study was 0.99211). An updated version of the Campanella et al. algorithm was validated clinically in three studies2,10,17. In the study of Raciti et al.17, a dataset consisting of 232 slides was used (slides with tumor n = 93, without intraductal carcinoma). The sensitivity and specificity of the algorithm for the detection of “suspicious” slides were 0.96 (4/93 slides with tumor missed) and 0.98, respectively. The authors showed improvements in pathologists’ sensitivity on the same cases, assessed after a wash-out period of 4 weeks, when assisted by the algorithm. In the study of da Silva et al.2, the sensitivity and specificity on a dataset containing 579 slides from 100 patients were 0.99 and 0.93, respectively, with some slides excluded from the analysis due to disagreement of pathologists on the ground truth status. In the study of Perincheri et al.10, the algorithm reached a sensitivity of 0.977 and a specificity of 0.993 for the detection of “suspicious” biopsy slides (n = 1876). In each of these three studies, all slides originated from a single pathology department. Importantly, the algorithm used in these three studies2,10,17 does not detect tumor per se but renders a slide as suspicious (presence of any of the following lesions: tumor, focal glandular atypia, atypical small acinar proliferation, or high-grade prostatic intraepithelial neoplasia with adjacent atypical glands, i.e., conditions with high interobserver variability in interpretation), which prevents an exact comparison to our results (we concentrated on tumor detection only). Even so, the AI tool in our study (>5900 diagnostic slides, >420 patient cases from five pathology departments) shows similar performance with high, diagnostically meaningful accuracy metrics for tumor detection. Also, for slides with unclear classification that were considered suspicious by pathologists, the AI tool provided positive alerts in a substantial number of cases (53.9%–75.0%, depending on the test cohort), allowing for a high level of awareness of such regions among pathologists.
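The distinction between a threshold-free AUROC and fixed-threshold operating points can be illustrated with a small sketch; the example data and thresholds are invented and do not represent results from any of the cited studies.

```python
# Sketch: the same per-slide probabilities yield one AUROC, but different
# sensitivity/specificity pairs depending on the chosen operating point.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.05, 0.30, 0.62, 0.55, 0.80, 0.95, 0.40, 0.10, 0.70, 0.20])

print("AUROC:", roc_auc_score(y_true, y_prob))   # threshold-free summary

for thr in (0.35, 0.60):                          # two possible operating points
    y_pred = (y_prob >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"thr={thr}: sensitivity={tp/(tp+fn):.2f}, specificity={tn/(tn+fp):.2f}")
```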
One other clinical-grade algorithm was validated in a study by Pantanowitz et al.9. The sensitivity and specificity were 0.996 and 0.901 on the internal dataset (same institute as the training data, n = 2501 slides) and 0.985 and 0.973 on the external dataset (n = 355 slides), respectively. Importantly, the authors first used additional slides from the external dataset to calibrate the algorithm to that dataset (to digitization, staining parameters, tissue quality, etc.), which is not a typical practice. Therefore, the real generalization capabilities of the algorithm to new/external data could not be estimated based on this study. The parameters of our algorithm were frozen at the beginning of the study without any form of accommodation of the algorithm to external data, which is also a regulatory requirement for clinical-grade tools.
In another study, by Ström et al.7, the authors report a sensitivity of 99.6% and a specificity of 86.6% on a reserved internal validation dataset (enriched for high-grade cases). Both studies used original semiautomatic labeling techniques to create annotations.
The second diagnostic aspect of our study is AI-based Gleason grading. Using two external sets of biopsy cores (n = 227 and 159 slides) representative of all Gleason grade groups and a large, international group of board-certified pathologists (n = 11; 2 general surgical pathologists, 9 experienced genitourinary pathologists) representing the diagnostic practices of different countries (Germany, Austria, USA, Netherlands, Israel, Japan, Vietnam, Russia), we showed that the developed algorithm performs on par with experienced genitourinary pathologists (Figs. 5 and 6). The average quadratically weighted kappa value for the AI tool was 0.77 in the first cohort (UKK; pathologists’ average kappa values 0.62–0.80) and 0.72 in the second cohort (WNS; pathologists’ average kappa values 0.64–0.76). Moreover, the agreement between the AI tool and pathologists was especially high in cases where consensus among pathologists could be reached (>0.855; Fig. 6c, d). Also, for the diagnostically critical Gleason grade group 1 (Gleason score 3 + 3 = 6; clinical decision: active surveillance vs. active therapy), the AI tool showed similarly high levels of agreement with the participating pathologists (Supplementary Fig. 12C, D). Several large studies evaluated the performance of AI-based tools for prostate cancer Gleason grading against human pathologists in a controlled setting using external datasets5,7, summarized in Supplementary Table 2. Studies by Ström et al.7 and Bulten et al.5 show levels of agreement among pathologists and the AI tool in external validation datasets (kappa values 0.60–0.72) similar to those in our study. Other studies showed that pathologists assisted by AI algorithms can provide more concordant and reliable grading6,15, mirroring the real diagnostic benefits of complementary human-AI interaction within the diagnostic process. Moreover, one large computational challenge (PANDA) addressed the development of Gleason grading algorithms in a competitive manner, releasing large datasets for training and validation8. In our study, we compared the developed algorithm with a winning solution of the PANDA challenge (Supplementary Tables 3 and 4), showing the superiority of our algorithm. To facilitate further academic research in the area of Gleason grading and interoperability studies of algorithms, we release part of our Gleason grading datasets (WNS, UKK) with the accompanying grading results by pathologists.
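For reference, the agreement metric used throughout these comparisons, quadratically weighted Cohen's kappa over Gleason grade groups, can be computed as sketched below; the example gradings are invented for illustration only.

```python
# Sketch: quadratically weighted Cohen's kappa between two raters assigning
# ISUP grade groups (1-5) to the same set of biopsy cores.
from sklearn.metrics import cohen_kappa_score

pathologist = [1, 2, 3, 5, 4, 1, 2, 3, 3, 5]
ai_tool     = [1, 2, 4, 5, 4, 1, 1, 3, 2, 5]

kappa = cohen_kappa_score(pathologist, ai_tool, weights="quadratic")
print(f"quadratically weighted kappa: {kappa:.2f}")
```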
Our study is not devoid of limitations. All cohorts analyzed in the study are retrospectively gathered archived cases. Further prospective evaluation with integration of the AI tool into the diagnostic routine of pathologists is necessary. The optimal ways for human pathologists and AI tools to interact to achieve maximal complementary effects are still an open field of research. Issues such as overly high confidence of pathologists in the predictions of the AI tool should be addressed by prospective evaluation. Although this study is one of the largest validation studies of AI tools for digital pathology to date, including five different departments, the heterogeneity of real-world pathology practice is huge. Additional validation with the inclusion of more pathology departments is warranted. The AI tool might still make diagnostic mistakes and miss tumors, as human diagnosticians also do. Further (continuous) development with the inclusion of difficult cases into the training data is a typical way to mitigate this problem. We used a small subset of cases from one department to extend our training data. These cases were temporally separated from the cases included in the test dataset and represent a negligibly small volume of training material compared to the training dataset and to the size of the remaining test dataset. We did not see any effect on the accuracy of the algorithm on this partially compromised dataset, especially compared to the four other, completely independent, external test datasets.
In this large, multi-institutional, international study we validate a clinical-grade AI tool for prostate cancer detection and Gleason grading on biopsy material from five pathology departments, digitized with three different scanners at two different magnifications. We show high levels of diagnostic accuracy for prostate cancer detection and agreement levels for Gleason grading comparable with those of experienced genitourinary pathologists.