Introduction
AI methodology overview
AI is a phenomenon in which non-living entities mimic human intelligence [12]. It is an umbrella term encompassing a spectrum of computing programs. ‘Rule-based’, ‘hard-coded’ or ‘symbolic’ AI has existed for many decades and underpins many conventional software systems, from traffic-light management to aircraft autopilots. In healthcare, symbolic AI has multiple applications, e.g. calculating a cardiovascular risk index or the estimated glomerular filtration rate (eGFR).
Machine learning (ML) is an AI subfield in which a program achieves a task by being exposed to large volumes of data and gradually learning to recognise patterns within them, allocating data points to distinct classes [13]. It involves ‘soft coding’: the model learns from examples instead of being programmed with explicit rules [12]. ML can be supervised (trained on data labelled by humans), unsupervised (grouping data by shared features, without labels), or reinforcement-based (the system improves through a reward function that provides feedback on its own actions) [14]. In medicine, supervised learning is the most common approach.
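The ‘soft coding’ idea can be sketched in a few lines: a minimal, hypothetical supervised learner that infers a decision threshold from labelled examples instead of having the rule written in by hand. The data, function names, and the lesion-count framing are illustrative assumptions, not taken from any cited tool.

```python
# Toy supervised learning: the model is given no rule; it derives a
# decision threshold from human-labelled examples ('soft coding').
# All data here are hypothetical.

def fit_threshold(values, labels):
    """Learn a boundary: the midpoint between the two class means."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(value, threshold):
    """Allocate a new data point to one of the two classes."""
    return 1 if value >= threshold else 0

# Labelled training set, e.g. a lesion count per image plus a grader's label
X = [1, 2, 3, 10, 11, 12]
y = [0, 0, 0, 1, 1, 1]
t = fit_threshold(X, y)   # learned from examples, not programmed
print(predict(9, t))      # → 1
```

A symbolic-AI version of the same task would instead hard-code the threshold; here it is recomputed whenever the labelled data change.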
Non-neural-network supervised ML algorithms are useful in healthcare for prediction modelling, evaluating associations and fitting lines between two variables (linear regression, parametric) or combining multiple variables (random forest, non-parametric). The latter combines different inputs using an ensemble of flowchart-like classifiers (known as decision trees); each tree produces an output, and the collective prediction is made by combining all the individual outputs [15]. Non-neural-network algorithms are often combined with deep neural network (DNN) architectures to achieve improved performance (Fig. 1) [16].
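The ensembling idea behind a random forest can be illustrated with a minimal sketch: many simple ‘trees’ (here depth-1 stumps, a deliberate simplification) are each trained on a bootstrap resample of the data, and the forest's prediction is the majority vote of their individual outputs. All data and names are hypothetical.

```python
import random

# Sketch of random-forest-style ensembling (hypothetical data): each
# 'tree' is fitted on a bootstrap resample and votes; the forest
# combines all the singular outputs into one collective prediction.

def train_stump(sample):
    """A depth-1 'tree': threshold at the midpoint of the class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def forest_predict(x, stumps):
    """Majority vote over all trees in the ensemble."""
    votes = sum(1 if x >= t else 0 for t in stumps)
    return 1 if votes > len(stumps) / 2 else 0

random.seed(0)
data = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
stumps = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]  # bootstrap resample
    # skip degenerate resamples that miss one of the two classes
    if any(y == 1 for _, y in sample) and any(y == 0 for _, y in sample):
        stumps.append(train_stump(sample))
print(forest_predict(9, stumps))  # collective vote of all trees
```

A real random forest also randomises the features each tree may split on and grows trees far deeper than one level; the bootstrap-plus-vote structure is the part shown here.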
Deep learning (DL) is a subdivision of ML defined by the presence of multiple layers of artificial neural networks (ANN) [17]. An ANN is composed of an input layer of multiple nodes—‘artificial neurons’—representing the characteristics to be analysed, e.g. pixels on an image, diagnoses (coded with the International Classification of Diseases (ICD)), age, nucleotide changes, etc., connected to one or more hidden layers that weight and sum all inputs and transmit a final value to an output layer (Fig. 2A).
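The input → hidden → output flow described above can be written out as a minimal forward pass. The weights and the three example input features are hypothetical; a trained network would learn these values from data.

```python
import math

# Minimal forward pass through a one-hidden-layer ANN (hypothetical
# weights): each hidden node sums all weighted inputs, applies a
# nonlinearity, and passes its value onward to the output layer.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    # hidden layer: one weighted sum + nonlinearity per hidden node
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    # output layer: sums the hidden activations the same way
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# e.g. three input features (a pixel intensity, an ICD-coded
# diagnosis flag, a scaled age) feeding two hidden nodes
x = [0.5, 1.0, 0.2]
W_hidden = [[0.4, -0.6, 0.1], [0.3, 0.8, -0.5]]
W_output = [1.2, -0.7]
print(round(forward(x, W_hidden, W_output), 3))
```

Training consists of adjusting `W_hidden` and `W_output` (by backpropagation) so the output value moves toward the correct label for each example.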
DNNs are multi-layered DL algorithms (often with over 100 hidden layers) and are currently the gold standard for image classification [15]. As more layers are added, an iterative training phenomenon occurs in which deeper layers combine the features passed on by earlier layers and compose new, more abstract features, improving the output layer and ultimately leading to better diagnoses [8].
The convolutional neural network (CNN) is a type of DNN particularly useful for image and video analysis [15]. These algorithms decompose the files into pixels, convert them into numbers, analyse them through multiple convolutional layers that filter, merge, mask, and/or multiply features, and feed the results to a dense (fully connected) network that creates the output layer [18]. Fully convolutional networks (FCN) produce the output layer directly from convolutional layers, without the final dense layers (Fig. 2A) [17].
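The core ‘filter, merge, multiply’ operation of a convolutional layer can be sketched directly: a small kernel slides over the pixel grid, multiplying and summing the values it overlaps to produce a feature map. The 4×4 ‘image’ and the edge-detecting kernel below are hypothetical; real CNNs learn their kernel values during training.

```python
# Single convolution step (hypothetical 4x4 'image'): a 2x2 kernel
# slides across the pixel grid; at each position the overlapping
# values are multiplied elementwise and summed into the feature map.

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge_kernel = [[1, -1],
               [1, -1]]   # responds only to vertical intensity edges
print(convolve2d(image, edge_kernel))
```

The resulting feature map is non-zero only at the dark-to-bright boundary in the middle of the image; stacking many such learned filters, interleaved with pooling and nonlinearities, is what lets deeper layers assemble edges into shapes and shapes into diagnoses.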
Useful concepts to better understand AI literature
Selected retinal diseases for which AI-based tools have been developed
Diabetic retinopathy (DR)
Condition | Imaging analysed | Database (n) | AI tool | Task | Performance (metrics provided by each paper) | Publication |
---|---|---|---|---|---|---|
DR | Colour | DiaretDB0 (130), DiaretDB1 (89), and DrimDB (125) | CNN | Referable/non-referable DR | Accuracy 99.17% (DiaretDB0), 98.53% (DiaretDB1), 99.18% (DrimDB) Sensitivity 100% (DiaretDB0), 99.2% (DiaretDB1), 100% (DrimDB) Specificity 98.4% (DiaretDB0), 97.97% (DiaretDB1), 98.44% (DrimDB) | Adem et al. [41] |
| | Colour | Kaggle (88,702), DiaretDB1 (89), and E-ophtha (107,799) | CNN | Referable/non-referable DR | AUC 0.954 (Kaggle), 0.949 (E-ophtha) | Quellec et al. [43] |
| | Colour | Kaggle (35,000) | CNN-ResNet34 | Referable/non-referable DR | Sensitivity 85% Specificity 86% | Esfahani et al. [44] |
| | Colour | Messidor-2 (1748), Kaggle (88,702), and DR2 (520) | CNN | Referable/non-referable DR | Accuracy 98.2% (Messidor-2), 98% (DR2) | Pires et al. [45] |
| | Colour | Own dataset (30,244) | CNN (Inception V3, Inception-Resnet-V2, and Resnet152) | Referable/non-referable DR | AUC 0.946 Accuracy 88.21% Sensitivity 85.57% Specificity 90.85% | Jiang et al. [46] |
| | Colour | Own dataset (60,000) and STARE (131) | CNN (WP-CNN, ResNet, SeNet, and DenseNet) | Referable/non-referable DR | AUC 0.9823 (Own dataset), 0.951 (STARE) Accuracy 94.23% (Own dataset), 90.84% (STARE) Sensitivity 90.94% (Own dataset) Specificity 90.85% (Own dataset) | Liu YP et al. [47] |
| | Colour | DIARETDB1 (89), DIARETDB0 (130), Kaggle (15,919), Messidor (1200), Messidor-2 (874), IDRiD (103), and DDR (4105) | CNN (VGG16, custom CNN) | Referable/non-referable DR | AUC 0.786 (Kaggle, Messidor), 0.764 (Messidor-2), 0.912 (IDRiD, DDR) Accuracy 82.1% (Kaggle, Messidor), 91.1% (Messidor-2), 94% (IDRiD, DDR) | Zago et al. [48] |
| | Colour | Messidor-2 (1748) | CNN | Different clinical stages of DR | AUC 0.98 Sensitivity 96.8% Specificity 87% | Abramoff et al. [49] |
| | Colour | Kaggle (80,000) | CNN | Different clinical stages of DR | Accuracy 75% Sensitivity 30% Specificity 95% | Pratt et al. [50] |
| | Colour | Kaggle (2000) | DNN, CNN (VGGNET architecture), BNN | Different clinical stages of DR | Accuracy: BNN = 42%, DNN = 86.3%, CNN = 78.3% | Dutta S et al. [51] |
| | Colour | Kaggle (166) | CNN (InceptionNet V3, AlexNet, and VGG16) | Different clinical stages of DR | Accuracy: AlexNet = 37.43%, VGG16 = 50.03%, InceptionNet V3 = 63.23% | Wang X et al. [52] |
| | Colour | Kaggle (35,126) | CNN (AlexNet, VggNet, GoogleNet, and ResNet) | Different clinical stages of DR | AUC 0.9786 (VggNet) Accuracy 95.68% (VggNet) Sensitivity 90.78% (VggNet) Specificity 97.43% (VggNet) | Wan S et al. [53] |
| | Colour | MESSIDOR (1200) | CNN (AlexNet, VggNet16, custom CNN) | Different clinical stages of DR | Accuracy 98.15% Sensitivity 98.94% Specificity 97.87% | Mobeen-ur-Rehman et al. [54] |
| | Colour | Own dataset (13,767) | CNN (ResNet50, InceptionV3, InceptionResNetV2, Xception, and DenseNets) | Different clinical stages of DR | 96.5%, 98.1%, 98.9% | Zhang et al. [55] |
| | Colour | Kaggle (22,700) and IDRiD (516) | CNN (AlexNet) | Different clinical stages of DR | Accuracy 90.07% | Harangi et al. [56] |
| | Colour | DDR (13,673) | CNN (GoogLeNet, ResNet-18, DenseNet-121, VGG-16, and SE-BN-Inception) | Different clinical stages of DR | Accuracy 82.84% | Li T et al. [57] |
| | Colour | Messidor (1190) | CNN (modified Alexnet) | Different clinical stages of DR | Accuracy 96.35% Sensitivity 92.35% Specificity 97.45% | Shanthi T et al. [58] |
| | Colour | Own dataset (9194) and Messidor (1200) | CNN | Different clinical stages of DR | Accuracy 92.95% (own dataset) Sensitivity 99.39% (own dataset), 99.93% (Messidor) Specificity 92.59% (own dataset), 96.2% (Messidor) | Wang J et al. [59] |
| | Colour | Messidor (1200) and IDRiD (516) | CNN (ResNet50) | Different clinical stages of DR | AUC 0.963 (Messidor) Accuracy 92.6% (Messidor), 65.1% (IDRiD) Sensitivity 92% (Messidor) | Li X et al. [60] |
AMD | Colour | 407 eyes with nonadvanced AMD | DL | Distinguishes between low and high-risk AMD by quantifying drusen location, area, and size | For drusen area: ICC > 0.85; for diameter: ICC = 0.69; for AMD risk assessment: ROC = 0.948 and 0.954 | Van Grinsven et al. [61] |
| | Colour, OCT, and IR | 278 eyes with/without reticular pseudodrusen (RPD) | DL | Automatic quantification of RPD | ROC = 0.94 and 0.958; κ agreement = 0.911; ICC = 0.704 | Van Grinsven et al. [62] |
| | Colour | 2951 subjects from AREDS (834 progressors) | DL | Association between genetic variants and transition to advanced AMD | AUC: 5 years = 0.885; 10 years = 0.915 | Seddon et al. [63] |
| | Colour and OCT | 280 eyes from 140 participants | DL | Prediction of progression to late AMD | AUC = 0.85 | Wu et al. [64] |
| | Colour and microperimetry | 280 eyes from 140 participants | DL | Predictive value of pointwise sensitivity and low luminance deficits for AMD progression | AUC = 0.8 | Wu et al. [65] |
| | Colour | > 4600 participants from AREDS | DL | Predict progression to advanced dry or wet AMD | Accuracy = 0.86 (1 year) and 0.86 (2 years); specificity = 0.85 (1 year) and 0.84 (2 years); sensitivity = 0.91 (1 year) and 0.92 (2 years) | Bhuiyan et al. [66] |
| | Colour | 1351 subjects from AREDS (> 31,000 images) | DL | Predict progression to advanced dry or wet AMD | AUC = 0.85 | Yan et al. [67] |
| | Colour | 67,401 colour fundus images from 4613 study participants | DL | Estimate 5-year risk of progression to wet AMD and geographic atrophy based on 9-step AREDS severity scale | Weighted κ scores = 0.77 for the 4-step and 0.74 for the 9-step AMD severity scales | Burlina et al. [68] |
| | Colour | 4507 AREDS participants and 2169 BMES participants | DL | Validation of a risk scoring system for prediction of progression | Sensitivity = 0.87; specificity = 0.73 | Chiu et al. [69] |
| | OCT | 2795 patients | DL | Prediction of conversion to nAMD within a 6-month window | AUC = 0.74 (conversion scan ground truth) and 0.886 (1st injection ground truth) | Yim et al. [70] |
| | OCT | 671 AMD fellow eyes with 13,954 observations | DL | Predict progression to wet AMD | AUC = 0.96 ± 0.02 (3 months); 0.97 ± 0.02 (21 months) | Banerjee et al. [71] |
| | OCT | 686 fellow eyes with non-neovascular AMD at baseline | DL | Predict conversion from non-neovascular to neovascular AMD | Drusen within 3 mm of fovea (HR = 1.45); mean drusen reflectivity (HR = 3.97) | Hallak et al. [72] |
| | OCT | 2146 OCT scans of 330 AMD eyes (244 patients) | DL | Predict neovascular AMD progression within 5 years | AUC = 0.74 (5 years), 0.92 (11 months), 0.86 (16 months), 0.7 (18 months), and 0.79 (48 months) | de Sisternes et al. [73] |
| | OCT | 71 eyes of patients with early AMD and contralateral neovascular AMD (9088 OCT B-scans) | CNN | Prediction of conversion from early/intermediate to advanced neovascular AMD | AUC = 0.87 (VGG16) and 0.91 (AMDnet) | Russakoff et al. [74] |
| | OCT | 495 eyes | DL | Predictive model to assess risk of conversion to advanced AMD | AUC = 0.68 for CNV and 0.8 for geographic atrophy | Schmidt-Erfurth et al. [75] |
| | OCT | 2712 OCT B-scans | DL | Segmentation of features associated with AMD | Dice = 0.63 ± 0.15; ICC = 0.66 ± 0.22 | Liefers et al. [76] |
| | OCT | 930 OCT B-scans from 93 eyes of patients with neovascular AMD | CNN | Segmentation of features associated with neovascular AMD | Dice = 0.78 (IRF), 0.82 (SRF), 0.75 (SHRM), and 0.8 (PED); ICCs = 0.98 (IRF), 0.98 (SRF), 0.97 (SHRM), and 0.98 (PED) | Lee et al. [77] |
RP | Colour | 1128 RP and 517 healthy | CNN | Diagnose RP | AUROC 96.74% | Chen et al. [78] |
| | Colour | 99 RP and 21 healthy | FCN | Diagnose RP | Accuracy 99.52% | Arsalan et al. [79] |
RP, best disease (BD), and Stargardt | FAF | 73 healthy, 125 Stargardt, 160 RP, 125 BD | CNN | Classify images into each group | Accuracy 0.95 | Miere et al. [80] |
Stargardt | OCT | 102 healthy (33 participants) and 647 Stargardt (60 patients) | CNN | Differentiate between Stargardt and healthy | Accuracy 99.6% | Shah et al. [81] |
BVMD and AVMD | FAF and OCT | 118 BVMD eyes and 96 AVMD eyes | CNN | Differentiate between BVMD and AVMD | AUROC 0.880 | Crincoli et al. [23] |
Stargardt and PRPH2-related pattern dystrophy | FAF | 304 Stargardt (40 patients) and 66 PRPH2 (9 patients) | CNN | Differentiate between Stargardt and PRPH2-related pattern dystrophy | AUROC 0.890 | Miere et al. [82] |
ABCA4-, RP1L1-, and EYS-related retinopathy | OCT | 58 IRD and 17 healthy | DL | Predict causative gene | ABCA4 100% accuracy; RP1L1 66.7 to 87.5%; EYS 82.4 to 100%; healthy 73.7 to 100% | Fujinami-Yokokawa et al. [83] |
Stargardt disease | FAF | 47 images (24 patients) | CNN | Segment flecks | Dice score: 0.54 ± 0.14 for diffuse speckled patterns; 0.71 ± 0.08 for discrete flecks | Charng et al. [84] |
Stargardt disease and AMD | FAF | 320 healthy, 320 AMD, and 100 Stargardt | CNN & FCN | Detect and segment atrophy | Atrophy screening: AMD 0.98 accuracy; Stargardt 0.95. Segmentation: AMD overlapping ratio 0.89 ± 0.06; Stargardt 0.78 ± 0.17 | Wang et al. [85] |
Stargardt and pattern dystrophy | FAF | 110 AMD, 204 Stargardt, and pattern dystrophy | CNN | Differentiate between AMD and IRD-associated macular atrophy | AUROC 0.981 | Miere et al. [86] |
Stargardt disease | OCT | 87 scan sets (22 patients) | FCN | Detect outer and inner limits of the retina | Mean difference: 2.10 µm and 0.059 mm3 in central macular thickness and volume between model and annotators | Kugelman et al. [87] |
| | AOSLO | 142 controls and 148 Stargardt | FCN | Identify cone photoreceptors | Dice score: 0.9431 ± 0.0482 | Davidson et al. [88] |
RP and CHM | OCT | 300 B-scans with RP and 300 with CHM | FCN | EZ segmentation | Similarity of 0.894 ± 0.102 automatic vs manual grading for RP; 0.912 ± 0.055 for CHM | Camino et al. [89] |
CHM | OCT | 16 eyes CHM and 5 healthy | Nonneural (RF) | EZ segmentation | 0.876 ± 0.066 Jaccard similarity index | Wang et al. [90] |
USH2A-related RP | OCT | 126 volume scans (126 patients) | CNN | EZ segmentation | Dice score 0.79 ± 0.27 | Loo et al. [91] |
| | OCT | 86 volume scans (86 patients) | CNN | EZ segmentation | Dice score 0.867 ± 0.105 | Wang et al. [92] |
RP | OCT and IR | 2918 (314 patients) | CNN and FCN | Predict VA below or above 20/40 | AUROC 0.85 | Liu et al. [93] |
Blue cone monochromacy (BCM) | OCT | 26 IRD, 16 BCM, 3 normal (patients) | Nonneural | Predict foveal sensitivity and VA | 0.174 RMSE for VA and 2.91 for sensitivity | Sumaroka et al. [94] |