As shown in Fig. 1, the literature searches returned a total of 1573 records, of which 557 were duplicates. Nine hundred and thirty records were excluded during the screening of titles and abstracts, and 41 were excluded based on full paper screening, including 3 records for which full articles could not be obtained. The remaining 45 studies were included in the review, of which 11 were conference papers and 34 were journal papers. All accepted studies were originally identified through searches of research databases, with no records from trial registries meeting the inclusion criteria. While the searches returned literature from as early as 1949, all of the research which met the inclusion criteria was published since 2010, with over 70% of the included literature published since 2020. Study characteristics are shown in Table 1. The 45 accepted articles contained 80 models of interest, details of which are shown in Table 2.
Risk of bias assessment
The results of the PROBAST assessments are shown in Table 3. While some studies contained multiple models of interest, none of these contained models with different risk of bias scores for any section of the PROBAST assessment, so one risk of bias analysis is presented per paper. All models showed either a high overall risk of bias (37/45) or an unclear overall risk of bias (8/45). Every high-risk model had a high-risk score in the analysis section (37/45), with several also being at high risk for participants (6/45), predictors (11/45), or outcomes (13/45). Less than half of the studies achieved a low risk of bias in any domain (21/45), with most low risks being found in the outcomes (16/45) and predictors (9/45) sections. Nearly all of the papers had an unclear risk of bias in at least one domain, most commonly the participants (36/45) and predictors (25/45) domains. Qualitative summaries are presented in Fig. 2.
Data synthesis results
Data in included literature
The number of participants in internal datasets varied by orders of magnitude, with each study including 1–776 ovarian cancer patients, and one study including over 10,000 total patients across a range of 32 malignancies15. Most research only used data from the five most common subtypes of ovarian carcinoma, though one recent study included the use of sex cord-stromal tumours16. Only one study explicitly included any prospective data collection, and this was only for a small subset which was not used for external validation17.
As shown in Fig. 3, the number of pathology slides used was often much greater than the number of patients included, with three studies using over 1000 slides from ovarian cancer patients18,19,20. In most of the studies, model development samples were WSIs containing resected or biopsied tissue (34/45), with others using individual tissue microarray (TMA) core images (5/45) or pre-cropped digital pathology images (3/45). Most studies used H&E-stained tissue (33/45) and others used a variety of IHC stains (11/45), with no two papers reporting the use of the same IHC stains. Some studies included multi-modal approaches, using genomics 17,21,22,23,24, proteomics21,24, transcriptomics24, and radiomics17 data alongside histopathological data.
The most commonly used data source was The Cancer Genome Atlas (TCGA) (18/45), a project from which over 30,000 digital pathology images from 33 malignancies are publicly available. The ovarian cancer subset, TCGA-OV25, contains 1481 WSIs from 590 cases of ovarian serous carcinoma (mostly, but not exclusively, high-grade), with corresponding genomic, transcriptomic, and clinical data. This includes slides from eight data centres in the United States, with most slides containing frozen tissue sections (1374/1481) rather than formalin-fixed, paraffin-embedded (FFPE) sections. Other recurring data sources were the University of British Columbia Ovarian Cancer Research Program (OVCARE) repository26,27,28, the Transcanadian study29,30, and clinical records at the Mayo Clinic31,32, Tri-Service General Hospital33,34,35, and Memorial Sloan Kettering Cancer Center17,36. All other researchers either used a unique data source (12/45) or did not report the provenance of their data (8/45). TCGA-OV, OVCARE, and the Transcanadian study are all multi-centre datasets. Aside from these, few studies reported the use of multi-centre data17,24,28,37,38,39. Only two studies reported the use of multiple slide scanners, with every slide scanned on one of two available scanners27,28. The countries from which data were sourced included Canada, China, Finland, France, Germany, Italy, Japan, the Netherlands, South Korea, Taiwan, the United Kingdom, and the United States of America.
Methods in included literature
There was a total of 80 models of interest in the 45 included papers, with each paper containing 1–6 such models. There were 37 diagnostic models, 22 prognostic models, and 21 other models predicting diagnostically relevant information. Diagnostic model outcomes included the classification of malignancy status (10/37), histological subtype (7/37), primary cancer type (5/37), genetic mutation status (4/37), tumour-stroma reaction level (3/37), grade (2/37), transcriptomic subtype (2/37), stage (1/37), microsatellite instability status (1/37), epithelial-mesenchymal transition status (1/37), and homologous recombination deficiency status (1/37). Prognostic models included the prediction of treatment response (11/23), overall survival (6/23), progression-free survival (3/23), and recurrence (2/23). The other models performed tasks that could be used to assist pathologists in analysing pathology images, including measuring the quantity/intensity of staining, generating segmentation masks, and classifying tissue/cell types.
A variety of models were used, with the most common types being convolutional neural network (CNN) (41/80), support vector machine (SVM) (10/80), and random forest (6/80). CNN architectures included GoogLeNet40, VGG1619,32, VGG1926,28, InceptionV333,34,35,38, ResNet1817,27,28,39,41,42, ResNet3443, ResNet5016,44,45, ResNet18236, and MaskRCNN32. Novel CNNs typically used multiple standardised blocks involving convolutional, normalisation, activation, and/or pooling layers22,46,47, with two studies also including attention modules20,35. One study generated their novel architecture by using a topology optimisation approach on a standard VGG1623.
Most researchers split their original images into patches to be separately processed, with patch sizes ranging from 60×60 to 2048×2048 pixels, the most common being 512×512 pixels (19/56) and 256×256 pixels (12/56). A range of feature extraction techniques were employed, including both hand-crafted/pre-defined features (23/80) and features that were automatically learned by the model (51/80). Hand-crafted features included a plethora of textural, chromatic, and cellular and nuclear morphological features. Hand-crafted features were commonly used as inputs to classical ML methods, such as SVM and random forest models. Learned features were typically extracted using a CNN, which was often also used for classification.
Despite the common use of patches, most models made predictions at the WSI level (29/80), TMA core level (18/80), or patient level (6/80), requiring aggregation of patch-level information. Two distinct aggregation approaches were used, one aggregating before modelling and one aggregating after modelling. The former approach requires the generation of slide-level features before modelling, the latter requires the aggregation of patch-level model outputs to make slide-level predictions. Slide-level features were generated using summation16, averaging21,24,36, attention-based weighted averaging20,41,42,44,45, concatenation15,30, as well as more complex embedding approaches using Fisher vector encoding29 and k-means clustering48. Patch-level model outputs were aggregated to generate slide-level predictions by taking the maximum22,35, median43, or average23, using voting strategies27,34, or using a random forest classifier28. These approaches are all examples of multiple instance learning (MIL), though few models of interest were reported using this terminology22,41,42,44.
Most studies included segmentation at some stage, with many of these analysing tumour/stain segmentation as a model outcome32,36,37,47,49,50,51,52,53,54. Some other studies used segmentation to determine regions of interest for further modelling, either simply separating tissue from background15,18,44,45, or using tumour segmentation to select the most relevant tissue regions33,34,35,55,56. One study also used segmentation to detect individual cells for classification57. Some studies also used segmentation in determining hand-crafted features relating to the quantity and morphology of different tissues, cells, and nuclei17,18,21,24,30,31.
While attention-based approaches have been applied to other malignancies for several years58,59, they were only seen in the most recent ovarian cancer studies20,28,33,34,35,41,42,44,45, and none of the methods included self-attention, an increasingly popular method for other malignancies60. Most models were deterministic, though hidden Markov trees51, probabilistic boosting trees52, and Gaussian mixture models61 were also used. Aside from the common use of low-resolution images to detect and remove non-tissue areas, images were typically analysed at a single resolution, with only six papers including multi-magnification techniques in their models of interest. Four of these combined features from different resolutions for modelling29,30,36,48, and the other two used different magnifications for selecting informative tissue regions and for modelling33,34. Out of the papers for which it could be determined, the most common modelling magnifications were ×20 (35/41) and ×40 (7/41). Few models integrated histopathology data with other modalities (6/80). Multi-modal approaches included the concatenation of separately extracted uni-modal features before modelling21,23,24, the amalgamation of uni-modal predictions from separate models17, and a teacher–student approach where multiple modalities were used in model training but only histopathology data was used for prediction22.
Analysis in included literature
Analyses were limited, with less than half of the model outcomes being evaluated with cross-validation (39/80) and with very few externally validated using independent ovarian cancer data (7/80), despite small internal cohort sizes. Cross-validation methods included k-fold (22/39) with 3–10 folds, Monte Carlo (12/39) with 3–15 repeats, and leave-one-patient-out cross-validations (5/39). Some other papers included cross-validation on the training set to select hyperparameters but used only a small unseen test set from the same data source for evaluation. Externally validated models were all trained with WSIs, with validations either performed on TMA cores (2/7) or WSIs from independent data sources (5/7), with two of these explicitly using different scanners to digitise internal and external data27,28. Some reported methods were externally validated with data from non-ovarian malignancies, but none of these included ovarian cancer data in any capacity, so were not included in the review. However, there was one method which trained with only gastrointestinal tumour data and externally validated with ovarian tumour data16.
Most classification models were evaluated using accuracy, balanced accuracy, and/or area under the receiver operating characteristic curve (AUC), with one exception where only a p-value was reported measuring the association between histological features and transcriptomic subtypes based on a Kruskal–Wallis test19. Some models were also evaluated using the F1-score, which we chose not to tabulate (in Fig. 3) as the other metrics were reported more consistently. Survival model performance was typically reported using AUC, with other metrics including p-value, accuracy, hazard ratios, and C-index, which is similar to AUC but can account for censoring. Segmentation models were almost all evaluated differently from each other, with different studies reporting AUC, accuracy, Dice coefficient, intersection over union, sensitivity, specificity, and qualitative evaluations. Regression models were all evaluated using the coefficient of determination (R2-statistic). For some models, performance was broken down per patient39,61, per subtype16, or per class15,24,32,57, without an aggregated, holistic measure of model performance.
The variability of model performance was not frequently reported (33/94), and when it was reported it was often incomplete. This included cases where it was unclear what the intervals represented (95% confidence interval, one standard deviation, variation, etc.), or not clear what the exact bounds of the interval were due to results being plotted but not explicitly stated. Within the entire review, there were only three examples in which variability was reported during external validation27,38,39, only one of which clearly reported both the bounds and the type of the interval38. No studies performed any Bayesian form of uncertainty quantification. Reported results are shown in Table 2, though direct comparisons between the performance of different models should be treated with caution due to the diversity of data and validation methods used to evaluate different models, the lack of variability measures, the consistently high risks of bias, and the heterogeneity in reported metrics.