Dataset
The retinal images used in the study were obtained from publicly available repositories. For retinal videos, the study was performed in accordance with the guidelines of the tenets of the Declaration of Helsinki and approved by the University of Technology Sydney’s Human Research Ethics Committee (ETH171392). Informed consent was obtained from each participant following an explanation of the nature of the study.
Two distinct sets of fundus videos and images were used to develop and test the performance of our proposed model. For fundus videos, a total of 185 videos were collected from 113 participants attending the Marsden Eye Clinic. All participants were recruited subject to the following inclusion/exclusion criteria:

Inclusion criteria

1. Adults (i.e., over 18 years of age).
2. A normal fundus on ophthalmoscopy with no visible vascular changes.
3. Clear ocular media with visual acuity better than 6/12.

Exclusion criteria

1. Persistent vision loss, blurred vision, or floaters.
2. History of laser treatment of the retina or injections into either eye, or any history of retinal surgery.
3. Anomalies of the ocular media that would preclude accurate imaging.
4. Participant is contraindicated for imaging by the fundus imaging systems used in the study (e.g., hypersensitive to light or on medication that causes photosensitivity).

Participants had a dilated fundoscopy and a minimum 3-second recording (30 frames per second at a 46/60 degree field of view of the retina and 2.2 image magnification) centered on the optic disc (Fig. 1a). Coauthors SMG and SS reviewed all videos and marked SVPs as present or absent. The occurrence of SVPs was only assessed within one disc diameter of the optic nerve head. Coauthor AA adjudicated any disagreement in the assessment between the two graders.
For fundus images, we used the DRIONS-DB database^{25}, a public dataset containing 110 fundus images with their annotated ground truth, for training the optic disc localization model (Fig. 1b).
SVP classification
To classify SVPs, we developed an end-to-end deep model called U3DNet. Figure 2 shows the overall structure of the model. The U3DNet receives fundus videos as input and classifies SVPs as present or absent. It consists of two main blocks: the Optic Disc Localizer and the Classifier. Since SVPs occur on (or adjacent to) the optic disc, the U3DNet has been tuned to focus on the optic disc. For this purpose, the U3DNet has an accurate and fast localizer that processes individual video frames and locates the optic disc in each image. The order of the frames, due to their synchronization with the cardiac frequency, is also an essential factor. This has been taken into account in the design of the localizer, which feeds sequential frames into the classifier. Therefore, SVPs are classified based on a batch of 30 sequential frames.
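The grouping of frames into batches of 30 can be sketched as follows. This is a minimal illustration only: the function name `sequential_clips`, the non-overlapping windows, and the dropping of trailing frames are assumptions, not details stated for the U3DNet.

```python
def sequential_clips(frames, clip_len=30):
    """Split an ordered list of frames into clips of clip_len frames.

    Assumption for illustration: clips do not overlap and any trailing
    frames that do not fill a whole clip are dropped.
    Frame order is preserved inside each clip.
    """
    last_start = len(frames) - clip_len
    return [frames[start:start + clip_len]
            for start in range(0, last_start + 1, clip_len)]


if __name__ == "__main__":
    # A 3-second recording at 30 frames per second has 90 frames.
    video = list(range(90))
    clips = sequential_clips(video)
    print(len(clips), [clip[0] for clip in clips])
```

Keeping the frames in order within each clip matters here because the temporal pattern of the pulsation is the signal the classifier relies on.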
Optic disc localizer
SVPs are mainly observable on the central retinal vein located on (or adjacent to) the optic disc. Therefore, prior to analyzing retinal videos for the presence or absence of SVPs, we developed a model that could localize optic discs in an image. For this purpose, attention mechanisms^{26} were combined with recurrent residual convolutional layers that are depthwise separable^{27}. A depthwise separable layer decreases the computational cost of the network. It first applies a depthwise (spatial) convolution separately to every input channel and then a pointwise \(1\times 1\) convolution. To obtain the outcome of each channel \((O_1, O_2, O_3, O_4)\), each of the convolution kernels \((K_1, K_2, K_3, K_4)\) is convolved with one of the input channels \((I_1, I_2, I_3, I_4)\). Ultimately, the outcomes from the different kernels are fused into one. The output of the \(i\)th kernel, \(O_i\), is defined as
$$\begin{aligned} O_i = K_i \otimes I_i \end{aligned}$$
(1)
where \(K_i\), \(I_i\), and \(O_i\) are the convolution kernel, the input channel, and the outcome of the \(i\)th channel, respectively.
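To make the depthwise and pointwise steps concrete, the sketch below applies one kernel per input channel and then mixes the results with \(1\times 1\) weights. It is a plain-Python illustration under stated assumptions: the helper names, the "valid" border handling, and the weight-count comparison are not taken from the paper, and the convolution is implemented as the cross-correlation used by deep learning libraries.

```python
def conv2d_valid(channel, kernel):
    """Cross-correlate one 2D channel with one 2D kernel (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(channel) - kh + 1
    out_w = len(channel[0]) - kw + 1
    return [[sum(channel[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def depthwise_separable(channels, depth_kernels, point_weights):
    """Depthwise step (one kernel per channel), then pointwise 1x1 mixing."""
    # Depthwise: each input channel is filtered by its own kernel.
    depth_out = [conv2d_valid(ch, k) for ch, k in zip(channels, depth_kernels)]
    h, w = len(depth_out[0]), len(depth_out[0][0])
    # Pointwise: every output map is a weighted sum over the depthwise outputs.
    return [[[sum(wt * depth_out[i][r][c] for i, wt in enumerate(weights))
              for c in range(w)]
             for r in range(h)]
            for weights in point_weights]


def weight_counts(k, c_in, c_out):
    """Weights of a standard k x k convolution vs. a depthwise separable one."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable


if __name__ == "__main__":
    print(weight_counts(3, 4, 8))
```

For a \(3\times 3\) kernel with 4 input and 8 output channels, the separable form needs 68 weights instead of 288, which is where the reduction in computational cost comes from.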
Equation (1) establishes the number of convolution operations needed for depthwise separable layers. Our proposed architecture for the optic disc localizer (Fig. 3) contains recurrent residual layers and an attention mechanism. In this architecture, we have modified the original U-Net^{28} by replacing its copying and cropping block with a concatenation operation, which improves both the structure and its performance. The fundamental idea of recurrent connections is to reuse feature maps or weights and retain some information. The output of a depthwise separable convolution layer is fed back to the layer's input before being passed to the next layer. In addition, a residual unit helps to avoid vanishing gradient problems during training. Hence, feature extraction with recurrent residual convolutional layers yields a stronger feature representation, enabling us to design a more accurate optic disc localizer. The localizer model trained with Attention Gates (AGs)^{26} learns to ignore unnecessary areas in an input image and to focus on distinctive features valuable for optic disc detection. AGs can be combined with recurrent residual convolutional layers at minimal computational cost while improving the model's accuracy.
Figure 4a displays the proposed AG. Attention values are computed for each pixel (u). We assumed that \(u_{l}^{down}\) and \(u_{l}^{up}\) are represented as \(u_l\) and \(g_l\), respectively. The gating signal \(g_l\) specifies the attention region per pixel. Additive attention^{29} is used to obtain the attention coefficient, as it achieves higher accuracy. It is formulated as follows:
$$\begin{aligned} Q_l = \psi (\sigma _1(W_u u_l + W_g g_l + b_g)) + b_\psi , \quad \alpha _l = \sigma _2 (Q_l) \end{aligned}$$
(2)
where \(W_u\) and \(W_g\) are the weights, \(\sigma _1\) and \(\sigma _2\) represent the ReLU and sigmoid activation functions, respectively, and \(b_g\) and \(b_\psi \) denote the biases. The AG parameters are updated and trained by backpropagation rather than by a sampling-based update process^{30}. Finally, the result of the AG is the multiplication of the attention coefficient and the feature map, as follows:
$$\begin{aligned} c_l^{up} = \alpha _l \times u_l \end{aligned}$$
(3)
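The additive attention of Eqs. (2) and (3) can be traced for a single pixel with the sketch below. It is illustrative only: treating \(\psi\) as a weight vector applied by a dot product, the argument names, and the list-based feature vectors are assumptions made for the example.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_gate(u, g, w_u, w_g, b_g, psi, b_psi):
    """Additive attention for one pixel.

    u, g     : feature vector and gating vector at the pixel (lists)
    w_u, w_g : weight matrices as lists of rows (one row per hidden unit)
    b_g      : bias per hidden unit; psi: hidden-to-scalar weights
    Returns the attention coefficient alpha and the gated features alpha * u.
    """
    # sigma_1(W_u u + W_g g + b_g), with sigma_1 = ReLU
    hidden = [relu(sum(a * x for a, x in zip(row_u, u))
                   + sum(a * x for a, x in zip(row_g, g)) + bias)
              for row_u, row_g, bias in zip(w_u, w_g, b_g)]
    # Q = psi(hidden) + b_psi, then alpha = sigma_2(Q) with sigma_2 = sigmoid
    q = sum(p * h for p, h in zip(psi, hidden)) + b_psi
    alpha = sigmoid(q)
    return alpha, [alpha * x for x in u]


if __name__ == "__main__":
    alpha, gated = attention_gate([2.0, 4.0], [1.0, 1.0],
                                  [[0.5, 0.0]], [[0.0, 0.5]], [0.0],
                                  [1.0], 0.0)
    print(alpha, gated)
```

Because the sigmoid keeps \(\alpha _l\) between 0 and 1, the gate can only attenuate a skip-connection feature, never amplify it.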
The construction of the RRL block is illustrated in Fig. 4b. Localization of the optic disc encompasses contracting and expansive paths. The input of the localization block, which consists of individual video frames, is initially passed through a depthwise separable convolutional layer with \(3 \times 3\) filters. Then, the recurrent convolutional layers are applied, and the final output of each recurrent convolution layer is passed on to the residual layer. We applied a time step of 1 second, indicating one forward convolution layer supported by one recurrent convolutional layer. Next, the ReLU activation function and a max-pooling operation are applied, reducing the input width and height. The image resolution is reduced by passing the image through this sequence of layers multiple times. The same convolution layers and settings are used on the expansive side, together with upsampling layers, which increase the image resolution. Information obtained from the contracting path is used in the attention gate to remove noisy and unnecessary responses in the skip connections. This is done directly before the concatenation step so that only relevant and important activations are merged.
The input of the optic disc localizer model is video frames, and the output is the segmentation map of the optic discs. As shown in Fig. 5, the region of the optic disc is determined by calculating the coordinates of the optic disc area (white pixels) in the segmentation map. Finally, by applying a cropping function to the frames of the video, the optic disc region is extracted as a sequence of cropped frames.
Of the DRIONS-DB dataset, 75% was used for training, 20% for validation, and the remaining 5% for testing the localizer model. An initial learning rate of 0.003 was used with a batch size of 6 and 100 epochs of training. The RMSprop algorithm was used to update the weights of the network iteratively. To train and evaluate the optic disc localization model, the Dice loss function was selected since it is commonly used in medical image segmentation. By learning an effective feature representation and weight parameters, the model learned how to locate optic discs in fundus images accurately.
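The cropping step can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the mask is assumed to be a binary grid with nonzero values for optic disc pixels, and a single bounding box is applied to all frames for simplicity.

```python
def disc_bounding_box(mask):
    """Return (top, bottom, left, right) enclosing all nonzero mask pixels.

    bottom and right are exclusive. Returns None if the mask is empty.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_frames(frames, box):
    """Crop every frame of a video to the same bounding box."""
    top, bottom, left, right = box
    return [[row[left:right] for row in frame[top:bottom]] for frame in frames]


if __name__ == "__main__":
    mask = [[0, 0, 0, 0, 0],
            [0, 0, 1, 0, 0],
            [0, 0, 0, 1, 0],
            [0, 0, 0, 0, 0]]
    frame = [[10 * r + c for c in range(5)] for r in range(4)]
    box = disc_bounding_box(mask)
    print(box, crop_frames([frame, frame], box)[0])
```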
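For reference, a common form of the Dice loss is sketched below. The smoothing constant and the flattening of masks into lists are assumptions for illustration; the exact variant used for training is not specified in the text.

```python
def dice_loss(pred, target, smooth=1.0):
    """Dice loss between a predicted and a ground-truth mask.

    pred and target are flattened masks with values in [0, 1].
    smooth is an assumed constant that avoids division by zero.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    dice = (2.0 * intersection + smooth) / (sum(pred) + sum(target) + smooth)
    return 1.0 - dice


if __name__ == "__main__":
    print(dice_loss([1, 0, 1, 0], [1, 0, 1, 0]))  # perfect overlap
    print(dice_loss([1, 0], [0, 1]))              # no overlap
```

The loss approaches 0 when the predicted optic disc mask overlaps the annotation completely and grows as the overlap shrinks, which makes it well suited to small foreground regions such as the optic disc.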
Classifier
Following localization of the optic disc, sequential frames of the input video are passed on to the classifier block of the U3DNet. Each video frame was resized to 64\(\times \)64 pixels and converted into grayscale to decrease computational time and complexity. Several deep learning networks were evaluated in this paper: 3D Inception, 3D DenseResNet, 3D ResNet, Long-term Recurrent Convolutional Network (LRCN), and ConvLSTM. These networks were chosen as they have been used widely for medical image and video tasks. All networks comprise layers such as convolutional layers, pooling layers, and a fully connected layer that produces a label for the input data. The final performance of each network is analyzed in the next part. Two characteristics are essential for video classification: spatial (static) features within each frame and temporal (dynamic) features between sequential frames.
To evaluate the performance of each classifier model in classifying SVPs, and to increase the number and variety of fundus videos, data augmentation using a 180-degree rotation of the original videos was used. Also, to remove any bias due to the chosen sets, K-fold cross-validation was used, in which the SVP dataset is randomly divided into five equal-sized folds (partitions). In this procedure, a single fold is chosen to serve as the test set, while the remaining four folds are combined to form the training set. This process is repeated five times, with each fold serving as the test set once. Rotating the test set between the folds guarantees that the algorithm is assessed on various subsets of the data. Finally, the average of the results is calculated, which gives a more reliable estimate of the model's performance and generalization capabilities. The structure of each classifier model, together with its details, is presented in what follows.
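The five-fold procedure can be sketched as below. It is a minimal illustration under stated assumptions: the function names and the random seed are hypothetical, and the split is made over video indices, which the text does not specify.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train, test) index splits over range(n_samples).

    Indices are shuffled once, dealt into k folds, and each fold serves
    as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def mean_score(fold_scores):
    """Average the per-fold results into one performance estimate."""
    return sum(fold_scores) / len(fold_scores)


if __name__ == "__main__":
    for train, test in k_fold_indices(185):
        print(len(train), len(test))
```

With 185 videos, each fold holds 37 test videos and 148 training videos.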
3D inception
One of the classifier models used includes Inception modules^{31}, 3D pooling, and 3D convolution layers to extract spatio-temporal features from the input videos in real time. As shown in Fig. 6a, the 3D Inception-based classifier block consists of different layers, including the input. Each Inception module is a combination of 3D convolution, batch normalization, and ReLU activation functions whose outputs are merged into a single vector that forms the input of the next layer. Max-pooling layers are interleaved with the convolutional layers. A dropout layer is also applied as a regularization operation to limit overfitting. Finally, fully connected layers are linked to an output layer, which classifies the SVP status.
3D DenseResNet
In this paper, we use the iterative refinement properties of ResNets to build densely connected residual networks for classifying SVPs, which we call 3D DenseResNet. In FC-DenseNets^{32}, the convolution layers are densely connected, whereas in DenseResNet we apply dense connectivity to ResNet modules. Therefore, the 3D DenseResNet model performs iterative refinement at each representation step (in a single ResNet) and uses dense connectivity to obtain refined multiscale feature representations. Combining FC-DenseNets and FC-ResNets into a single model thus allows the architecture to exploit the advantages of both dense connectivity and residual patterns, namely: iterative refinement of representations, gradient flow, multiscale feature combination, and deep supervision^{33}. The connectivity pattern of 3D DenseResNet is shown in Fig. 6b. First, the input is processed with a Conv3D convolution followed by a MaxPooling3D operation. After that, the output is fed to a Dense block organized by residual blocks based on ResNets.
In a CNN convolution, the number of kernels equals the number of input feature maps. To produce an output feature map, the results are summed and a bias term is added; the procedure is repeated with different kernels to obtain the desired number of output feature maps. These convolution layers are followed by Batch Normalization and a Rectified Linear Unit (ReLU) and are set to decrease the number of input feature maps at the output. The final output is followed by GlobalAveragePooling and Dense layers.
3D ResNet
The proposed 3D ResNet network is based on the ResNet structure^{34}. ResNets use shortcut connections that pass a signal from one layer to another, skipping the layers in between. These connections carry the gradient flow of the model from later layers to earlier layers, which facilitates the training of deep models. The structure of the proposed 3D ResNet is shown in Fig. 6c.
First, the input is processed with a Conv3D convolution followed by MaxPooling, BatchNormalization, and a Rectified Linear Unit (ReLU) to decrease the number of input feature maps at the output. After that, the output is fed to a residual block organized around a skip connection, and a bias term is added to produce the output feature map of each layer. The kernel size of the Conv3D layers in the residual blocks and in the first layer is \(3 \times 3 \times 3\). All other Conv3D layers in the 3D ResNet have a kernel size of \(1 \times 1 \times 1\). Finally, the output is followed by GlobalAveragePooling and Dense layers.
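The shortcut connection can be illustrated with the sketch below, in which the block output is the transformed signal plus the unchanged input. The function name and the stand-in transforms are hypothetical; in the actual network the transform would be the Conv3D layers of the residual block.

```python
def residual_block(x, transform):
    """Add the block input back onto the transformed signal: F(x) + x."""
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]


if __name__ == "__main__":
    # If the transform contributes nothing, the input passes through unchanged.
    print(residual_block([1.0, 2.0], lambda v: [0.0] * len(v)))
    print(residual_block([1.0, 2.0], lambda v: [2 * a for a in v]))
```

Because the input always reaches the output directly, gradients can flow back through the addition even when the transform's own gradient is small.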
Long-term recurrent convolutional network (LRCN)
One way to detect SVPs is to train a CNN model and an LSTM model individually. A CNN can be used to extract spatial features from the frames of the video, and a pretrained model can be fine-tuned for this task. The LSTM network can then use the features extracted by the CNN to predict the absence or presence of SVPs in the video. Here, however, a different method known as the Long-term Recurrent Convolutional Network (LRCN) has been used^{35}, which integrates CNN and LSTM layers in a single network (Fig. 6d). The convolutional layers extract spatial features from the video frames, and these spatial features are then fed to the LSTM layer(s) at each time step. This process performs temporal sequence modeling, and the model directly learns spatio-temporal features in a robust end-to-end model.
The TimeDistributed wrapper layer has also been used, which applies the same layer to every frame of the video separately. If the wrapped layer's input shape is (Width, Height, Num–of–Channels), the wrapper creates a layer that accepts input of shape (Num–of–Frames, Width, Height, Num–of–Channels), which is very advantageous as it allows the whole video to be fed into the network in a single shot. For training the proposed LRCN model, time-distributed Conv2D layers have been used, followed by Dropout layers and MaxPooling2D layers. The features extracted by the Conv2D layers are flattened using the Flatten layer. After that, the output is fed to an LSTM layer. A Dense layer with softmax activation then produces the final result from the LSTM output. In this model, the kernel size of the Conv2D layers is \(3 \times 3\), and the pooling size of MaxPooling2D is \(2 \times 2\).
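The idea behind the time-distributed wrapper can be shown without any deep learning library: one per-frame operation is applied independently to each frame, and the frame axis is kept. The names `time_distributed` and `flatten` below are illustrative stand-ins, not the Keras API.

```python
def flatten(frame):
    """Per-frame operation: turn a 2D frame into a 1D feature vector."""
    return [pixel for row in frame for pixel in row]


def time_distributed(layer, video):
    """Apply the same per-frame layer to every frame of a video.

    video has shape (num_frames, height, width); the result keeps the
    leading frame axis, giving one feature vector per time step.
    """
    return [layer(frame) for frame in video]


if __name__ == "__main__":
    video = [[[f] * 4 for _ in range(4)] for f in range(30)]  # 30 frames of 4x4
    features = time_distributed(flatten, video)
    print(len(features), len(features[0]))
```

The resulting sequence of per-frame vectors is the kind of input an LSTM layer consumes, one vector per time step.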
ConvLSTM
The other approach proposed for detecting the presence or absence of SVPs is based on ConvLSTM cells. A ConvLSTM cell is a kind of LSTM that includes convolution operations in the model. It is an LSTM with convolutions embedded in the network, which makes it able to identify spatial features of the data while considering the temporal relation. This method effectively captures the spatial relations within individual frames and the temporal relations across frames for video classification. Consequently, the ConvLSTM can take 3D input (Width, Height, Num–of–Channels), whereas a simple LSTM takes 1D input.
The overall structure of the proposed ConvLSTM cell is shown in Fig. 7, where \(\sigma \) is the sigmoid function, W denotes the weights of each layer, b is the bias, \(X_t\) is the input at time step t, and tanh is the hyperbolic tangent function. Also, the Hadamard product operator is denoted by \(\bigodot \), \(f_t\) is the forget gate, \(c_t\) is the cell state, \(i_t\) is the input gate, and \(O_t\) is the output gate.
The forget gate outputs the value obtained by applying the sigmoid function to \(x_t\) and \(h_{t-1}\). The range of the sigmoid function output is from 0 to 1. Information from the previous cell is forgotten if the output value is 0, and if it is 1, information from the previous cell is wholly memorized. Also, \(i_t \bigodot g_t\) is a gate for holding current information; it takes \(h_{t-1}\) and \(x_t\) and uses the sigmoid function.
After that, the input gate sends out the value obtained with the hyperbolic tangent (tanh) function and the Hadamard product. As the range of \(g_t\) is from -1 to 1 and that of \(i_t\) is from 0 to 1, they represent the direction and the intensity of storing current information, respectively. The formulation of the ConvLSTM cell is as follows:
$$\begin{aligned} f_t= & {} \sigma (W_{Xf} *X_t +W_{Hf} *H_{t-1} + W_{Cf} \odot C_{t-1} + b_f) \end{aligned}$$
(4)
$$\begin{aligned} i_t= & {} \sigma (W_{Xi} *X_t +W_{Hi} *H_{t-1} + W_{Ci} \odot C_{t-1} + b_{Hi}) \end{aligned}$$
(5)
$$\begin{aligned} g_t= & {} \tanh (W_{Xg} *X_t +W_{Hg} *H_{t-1} + b_{hg}) \end{aligned}$$
(6)
$$\begin{aligned} C_t= & {} (f_t \odot C_{t-1}) + (i_t \odot g_{t}) \end{aligned}$$
(7)
$$\begin{aligned} O_t= & {} \sigma (W_{Xo} *X_t +W_{Ho} *H_{t-1} + W_{Co} \odot C_{t} + b_{ho}) \end{aligned}$$
(8)
$$\begin{aligned} H_t= & {} O_t \odot \tanh (C_{t}) \end{aligned}$$
(9)
(9)
The hidden state H, input gate i, output gate O, cell state C, cell input X, and forget gate f are all 3D tensors, whereas in the original LSTM all these elements are 1D vectors. Also, all matrix multiplications are replaced by convolution operations, which means that the number of weights in each W of a cell can be smaller than in the original LSTM^{36}.
In our proposed model, Keras ConvLSTM2D layers have been used. The ConvLSTM2D layer takes the number of filters and the kernel size needed for the convolution operations. The output of the final layer is flattened and then fed to a Dense layer with softmax activation. Also, MaxPooling3D layers have been used to decrease the size of the frames and avoid unneeded calculations, and Dropout layers have been used to control overfitting of the proposed model.
As the architecture is simple, the number of trainable parameters is small. The overall structure of our proposed method based on ConvLSTM is shown in Fig. 6e. The kernel size of ConvLSTM2D is \(3 \times 3\), and the hyperbolic tangent (tanh) activation function is applied in the ConvLSTM2D layers. After each ConvLSTM2D layer, MaxPooling3D layers with a pooling size of \(1 \times 2 \times 2\) and Batch Normalization layers have been applied. The final result is passed through Flatten and Dense layers.
To identify the best-performing configuration of every classifier model, we ran several experiments modifying the number of epochs, batch size, and learning rate. Table 1 summarizes the characteristics of the proposed classifiers.
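One update step of Eqs. (4) to (9) can be traced with the sketch below, which uses the degenerate case of a single pixel and \(1 \times 1\) kernels so that every convolution and Hadamard product reduces to a scalar multiplication. The function name, the dictionary keys, and this scalar simplification are assumptions for illustration; a real ConvLSTM operates on 3D tensors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def convlstm_step_1x1(x, h_prev, c_prev, w, b):
    """One ConvLSTM step for a single pixel with 1x1 kernels (scalar case).

    w holds the weights keyed like the subscripts of Eqs. (4)-(9),
    b holds the biases of the forget, input, candidate, and output terms.
    """
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev + w["cf"] * c_prev + b["f"])  # Eq. (4)
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev + w["ci"] * c_prev + b["i"])  # Eq. (5)
    g = math.tanh(w["xg"] * x + w["hg"] * h_prev + b["g"])                   # Eq. (6)
    c = f * c_prev + i * g                                                   # Eq. (7)
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["co"] * c + b["o"])       # Eq. (8)
    h = o * math.tanh(c)                                                     # Eq. (9)
    return h, c


if __name__ == "__main__":
    keys = ("xf", "hf", "cf", "xi", "hi", "ci", "xg", "hg", "xo", "ho", "co")
    weights = {k: 0.1 for k in keys}
    biases = {k: 0.0 for k in "figo"}
    print(convlstm_step_1x1(1.0, 0.0, 0.0, weights, biases))
```

With all weights and biases set to zero, the gates sit at 0.5 and the candidate at 0, so the new cell state is simply half of the previous one, which makes the roles of the forget and input gates easy to check by hand.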
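The effect of the \(1 \times 2 \times 2\) pooling on a clip can be checked with simple shape arithmetic, as sketched below. The helper name is hypothetical, and the sketch assumes non-overlapping pooling windows without padding (stride equal to the pooling size) and two pooling stages; the number of stages in the actual model is not stated here.

```python
def max_pool3d_shape(shape, pool=(1, 2, 2)):
    """Output shape (frames, height, width) after non-overlapping 3D pooling.

    Assumes the stride equals the pooling size and no padding is used,
    so each dimension is divided by its pooling factor (rounded down).
    """
    return tuple(dim // p for dim, p in zip(shape, pool))


if __name__ == "__main__":
    shape = (30, 64, 64)  # 30 sequential frames of 64 x 64 pixels
    for stage in range(2):
        shape = max_pool3d_shape(shape)
        print(stage + 1, shape)
```

A pooling factor of 1 along the frame axis halves the height and width at each stage while keeping all 30 time steps, so the temporal resolution of the pulsation signal is preserved.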