The searches identified 13,857 references. After removing duplicates, 12,320 titles and abstracts were screened. Subsequent full-text screening of 268 studies identified 64 studies that met the eligibility criteria (Fig. 1 and Supplementary Material 1). No studies prior to 2012 met our eligibility criteria. This is in keeping with the paradigm shift towards the use of DL in ML research in 2012, when AlexNet (a type of DL) was shown to significantly outperform traditional ML image analysis methods10.
Risk of bias and applicability concerns for all 64 included studies were assessed using our modified QUADAS-2 framework (Supplementary Material 2). Overall, 59 studies (92%) were judged to be at high risk of bias and 62 studies (97%) raised high-level applicability concerns (Fig. 2 and Supplementary Material 3). Of the 9 studies that used external datasets to validate or test their DL algorithms, 8 (89%) were still at overall high risk of bias and all 9 (100%) raised overall high-level applicability concerns (Supplementary Material 4).
With respect to risk of bias, the participant and outcome domains were more commonly rated as high/unclear (92% and 83%, respectively) than the reference standard and index test domains (9% and 9%, respectively) (Fig. 2). For the reference standard, most studies (61%, n = 39) used datasets verified by at least one clinician. With respect to the index test, 91% (n = 58) accounted for overfitting, underfitting, and/or optimism when assessing DL algorithm performance against the reference standard.
All 64 studies scored high or unclear in the participant domain of applicability concerns (Fig. 2). All externally validated/tested DL algorithms (n = 9) also had high-level applicability concerns in this domain (Supplementary Material 4). There was poor reporting of participant characteristics such as Fitzpatrick skin type, age, and gender, as well as poor generalisability of the study settings. In the index test and outcome domains, 63% and 38% of all studies, respectively, had high/unclear applicability concerns (Fig. 2); these proportions fell to 0% and 11%, respectively, when considering only externally validated/tested DL algorithms (Supplementary Material 4).
General study characteristics
Overall, 144 skin diseases were studied. Of these, the most frequently studied diseases were acne (n = 30), psoriasis (n = 27), eczema (n = 22), rosacea (n = 12), vitiligo (n = 12) and urticaria (n = 8) (Tables 1 and 2). The most common skin disease categories were inflammatory, follicular, pigmentary and infectious disorders (Table 3).
Of the 64 included studies, 47 (73%) reported research funding, 6 (9%) did not and 11 (17%) were unclear (Supplementary Material 5). The authors were most frequently affiliated with institutions in China (n = 20), India (n = 9) and the USA (n = 5), and private datasets were mostly from Asia (73%, n = 35) (Supplementary Material 6).
Most studies (88%, n = 56) used retrospectively collected data and most (85%, n = 55) used the same image dataset for both training and validation/testing (Table 1). Few studies (14%, n = 9) used independent external data to validate or test their DL algorithms (Supplementary Material 7). Overall, 24 studies (37.5%) evaluated the algorithm in an independent dataset, a clinical setting or prospectively. No RCTs of DL in skin diseases were found.
DL algorithms were developed predominantly for disease diagnosis (81%, n = 52), rather than severity assessment (19%, n = 12). Diagnostic DL algorithms were most commonly developed for acne (n = 24), psoriasis (n = 23) and eczema (n = 21) (Table 1). Disease severity DL algorithms were most commonly developed for acne (n = 6) and psoriasis (n = 4).
Participants and images
Of those studies performing training (n = 60), internal/external validation (n = 34) and internal/external testing (n = 52) of DL algorithms, the number of participants was reported by 18% (median 2000 participants, IQR 416–5860; n = 11), 24% (median 626 participants, IQR 167–3102; n = 8) and 15% (median 185 participants, IQR 90–340; n = 8), respectively (Table 1). Participant age was reported in 13 (20%) studies and sex was reported in 12 (19%) studies.
In the minority of studies reporting participant ethnicity and/or Fitzpatrick skin type (19%, n = 12; Table 1), there was representation across most ethnicities and skin types. Of the 10 studies reporting Fitzpatrick skin types, 4 specified the number of participants per Fitzpatrick skin type group: most (>85%) participants had skin types II–IV. Of the remaining 6 studies, 5 specified that participants were mostly skin types III–IV and 1 stated that participants were mostly skin types II–III.
Most image datasets (88%, n = 56) comprised macroscopic images of skin, hair or nails. Dermoscopic images were most commonly used for psoriasis (n = 5) and eczema (n = 4) (Table 1). In contrast to participant characteristics, the number of images used in training, validation and testing datasets was reported by most studies: 60 (93%), 61 (91%) and 62 (96%) studies, respectively. Generally, a greater number of images was used to train DL algorithms (median 2555 images, IQR 902–8550) than to validate (median 1032 images, IQR 274–2000) or test (median 331 images, IQR 157–922) DL algorithms. The ratio of median number of images to participants was 1.3 for training datasets, 1.6 for validation and 1.8 for testing datasets. This indicates that a single participant contributed more than one image through, for example, multiple photographs of anatomically distinct sites or splitting/modification of an image (Table 1).
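The image-to-participant ratios above follow directly from the reported medians. As a minimal arithmetic check in Python (note that a ratio of medians is not the same as the median per-study ratio):

```python
# Median number of images and participants per dataset split,
# as reported in Table 1.
splits = {
    "training":   {"images": 2555, "participants": 2000},
    "validation": {"images": 1032, "participants": 626},
    "testing":    {"images": 331,  "participants": 185},
}

# Ratio of median images to median participants, to one decimal place.
ratios = {name: round(v["images"] / v["participants"], 1)
          for name, v in splits.items()}

print(ratios)  # {'training': 1.3, 'validation': 1.6, 'testing': 1.8}
```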
Five studies used more than one type of DL algorithm, hence the total number of algorithms was 69 across 64 studies. Overall, the commonest types of DL algorithm were convolutional neural networks (CNN) and deep convolutional neural networks (DCNN) (80%, n = 55 of 69 algorithms) (Fig. 3 and Table 1). CNN and DCNN are considered interchangeable terms, as ‘deep’ refers to the number of layers in the algorithm architecture and most modern CNNs consist of a large number of layers11. The first CNN/DCNN study included in our review appeared in 2017. By 2021, 85% (n = 17 of 20) of studies applied CNN/DCNN algorithms. Ensemble DL algorithms, which combine multiple DL algorithms to improve prediction performance, first appeared in 2018 but were used less frequently than CNN/DCNNs in subsequent years. Multilayer perceptrons (MLP) (3%, n = 2 of 69 algorithms) and artificial neural networks (ANN) (3%, n = 2 of 69 algorithms), which are now considered outdated types of DL, were also less commonly used.
Most studies (77%, n = 49) reported the reference standard of the DL algorithm; 36 (73%) used clinician assessment of images and, in 27 (75%) of these, the assessing clinicians were dermatologists. The remaining 27% (n = 13 of 49) used multiple reference standards inconsistently across datasets, or other reference standards including biopsies, blood tests and curated databases (Table 1). The severity scales used for disease severity grading DL algorithms varied (Supplementary Material 8).
Most studies (95%, n = 61) disclosed the source of images. The most common sources were hospital/university databases (47%, n = 30), and many studies also used public databases (22%, n = 14). Image datasets were fully or partially available in under one third of studies (31%, n = 20). DL algorithm code was available in 26 (41%) studies. Seven studies (11%) provided no details of the DL algorithm architecture (Table 1). With regard to transparency of reporting of primary and secondary outcomes, 26 of 64 studies (41%) provided the raw values used to calculate accuracy, sensitivity or specificity.
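Since the raw values behind these metrics were reported by only a minority of studies, it is worth noting how the reported outcomes relate to the underlying 2×2 counts. A minimal sketch for a binary classifier, with illustrative counts not drawn from any included study:

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard diagnostic-accuracy metrics from a 2x2 confusion matrix."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true-positive rate (recall)
        "specificity": tn / (tn + fp),   # true-negative rate
        "ppv":         tp / (tp + fp),   # positive predictive value
        "npv":         tn / (tn + fn),   # negative predictive value
    }

# Illustrative counts only: 80 true positives, 5 false positives,
# 90 true negatives, 25 false negatives.
m = binary_metrics(tp=80, fp=5, tn=90, fn=25)
print({k: round(v, 2) for k, v in m.items()})
# {'accuracy': 0.85, 'sensitivity': 0.76, 'specificity': 0.95,
#  'ppv': 0.94, 'npv': 0.78}
```

The asymmetry visible throughout the results (high specificity and NPV alongside variable sensitivity) is exactly the pattern these formulas make explicit: sensitivity depends only on how the algorithm handles true cases, specificity only on non-cases.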
Accuracy of diagnostic DL algorithms: six most studied diseases
Accuracy (the primary outcome) was the most commonly reported outcome for assessing the performance of DL algorithms (75%, n = 48). The median diagnostic accuracy of the DL algorithms for the six most studied diseases (acne, psoriasis, eczema, rosacea, vitiligo, urticaria) was high, ranging from 81% for urticaria (n = 2) to 94% for both acne (IQR 86–98, n = 11) and rosacea (IQR 90–97, n = 4) (Table 2). The accuracies of the externally validated/tested diagnostic DL algorithms were higher for acne (median 92%, n = 2) and eczema (96%, n = 2) than for psoriasis (74%, n = 1); however, direct comparison was limited by the small number of studies. Most diagnostic DL algorithms for the six most studied diseases performed multiclass classification (79%, n = 26 of 33) rather than binary classification (21%, n = 7 of 33) (Supplementary Material 9).
Accuracy of diagnostic DL algorithms: five categories of disease
The median diagnostic accuracy of DL algorithms for the five categories of skin diseases (inflammatory disorders, follicular disorders of skin, alopecia, pigmentary disorders, skin infections) was high, ranging from 88% for both skin infections (IQR 60–95, n = 17) and pigmentary disorders (IQR 80–99, n = 5) to 100% for alopecia (n = 2) (Table 3). The diagnostic accuracies of DL algorithms for inflammatory disorders (median 92%, IQR 80–96; n = 30) and follicular disorders of skin (median 93%, IQR 87–97; n = 16) were similarly high.
The median diagnostic accuracy of externally validated/tested DL algorithms was high for inflammatory disorders (83%, IQR 53–100; n = 6) and follicular disorders of skin (84%, n = 3), although numerically lower than that of all DL algorithms. Both studies reporting diagnostic accuracy of DL algorithms for alopecia used external testing and had an accuracy of 100%. In contrast, the median accuracy of externally validated/tested DL algorithms for diagnosing skin infections was low (59%, IQR 50–74; n = 7).
Accuracy of severity grading DL algorithms
The analysis of DL algorithms for disease severity grading was limited by a paucity of studies (n = 12, Supplementary Material 8). The accuracy of DL algorithms in grading psoriasis severity was 93–100% (n = 2); however, external validation/testing was not performed (Supplementary Material 10). The single study of a DL algorithm for grading eczema severity did perform external validation and reported 88% accuracy. Of the 4 studies assessing DL algorithms that grade acne severity (median accuracy 76%, IQR 68–85), one performed external testing and reported lower accuracy (68%).
Secondary outcomes of diagnostic DL algorithms: six most studied diseases
A total of 23 studies reported AUC. The median AUC of diagnostic DL algorithms was high, ranging from 0.90 (IQR 0.87–0.94, n = 4) for rosacea to 0.98 (IQR 0.93–0.99, n = 4) for acne (Table 2). The AUC of externally validated/tested DL algorithms was similarly high.
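AUC here is the area under the receiver operating characteristic curve, which equals the probability that a randomly chosen diseased image receives a higher predicted score than a randomly chosen non-diseased image. A minimal sketch of this rank-based (Mann–Whitney) formulation, assuming no tied scores and using purely illustrative values:

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC via the Mann-Whitney U statistic (assumes no tied scores)."""
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    # Rank all scores together (1-based) and sum the ranks of the positives.
    ranked = sorted(pos_scores + neg_scores)
    rank_sum = sum(ranked.index(s) + 1 for s in pos_scores)
    u = rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Illustrative scores only: a model that ranks most cases above non-cases.
print(auc_from_scores([0.9, 0.6], [0.7, 0.4]))  # 0.75
```

Unlike accuracy, this measure is threshold-free, which is why it is often preferred when classification cut-offs differ between studies.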
Overall, 29 studies reported specificity. The median specificity of diagnostic DL algorithms was high, ranging from 88% (IQR 80–98, n = 4) for vitiligo to 100% (n = 2) for urticaria (Table 2). Externally validated/tested algorithms had similarly high specificity, all above 96% (n = 6).
A total of 43 studies reported sensitivity. The median sensitivity of diagnostic DL algorithms was variable, ranging from 63% (IQR 42–92, n = 6) in rosacea to 91% (IQR 80–95, n = 5) in vitiligo. The range of sensitivity values for each disease was wide, in contrast with the narrower ranges for specificity. Externally validated/tested diagnostic DL algorithms generally had lower sensitivities than the overall dataset, ranging from 42% (n = 1) in rosacea to 87% (n = 2) in acne.
PPV and NPV were reported by 31 and 8 studies, respectively. The median PPV of diagnostic DL algorithms varied from 77% for urticaria (n = 2) to 91% for vitiligo (n = 3) (Table 2). In contrast, the NPV of diagnostic DL algorithms was >90% for all six diseases, a finding that also held for externally validated/tested DL algorithms.
Secondary outcomes of diagnostic DL algorithms: five categories of disease
In line with the above findings, diagnostic DL algorithms for the five disease categories (inflammatory disorders, follicular disorders of skin, alopecia, pigmentary disorders, skin infections) were broadly highly specific but had variable sensitivity.
The median specificity of diagnostic DL algorithms ranged from 97% (IQR 93–99, n = 15) for follicular disorders of skin to 100% (n = 2) for alopecia (Table 3). With respect to inflammatory skin diseases (the most frequently studied disease category), the median specificity of diagnostic DL algorithms was 98% (IQR 95–99, n = 40) and remained high when only externally validated/tested algorithms were considered (100%, IQR 98–100; n = 10).
The median sensitivity of diagnostic DL algorithms ranged from 77% in inflammatory skin diseases (IQR 63–92, n = 47) and skin infections (IQR 63–93, n = 33) to 87% (IQR 67–94, n = 19) in follicular disorders (Table 3). When considering only externally validated/tested diagnostic DL algorithms, the median sensitivities remained variable and were lowest in inflammatory disorders (58%, IQR 48–72; n = 12) and skin infections (70%, IQR 56–80; n = 17), compared with follicular disorders (87%, IQR 63–92; n = 4). The range of sensitivity values for each disease category was also wide, in contrast with the narrower ranges for specificity.
Secondary outcomes of severity grading DL algorithms
Although data were limited, the specificity of disease severity DL algorithms was high and ranged from 94–95% for acne (n = 2) to 97–100% for psoriasis (n = 2) (Supplementary Material 10). AUC was reported in only one study of a severity grading DL algorithm, in psoriasis (AUC 0.99). The sensitivity of disease severity grading DL algorithms ranged from 82–84% for acne (n = 2) to 93–96% for psoriasis (n = 2). PPV was reported for severity grading DL algorithms in acne (range 54–86%, n = 3) and psoriasis (93%, n = 1). No studies reported these metrics for externally validated/tested DL algorithms of disease severity (Supplementary Material 10).