Our objective is to train a method capable of inferring in which ETDRS ring different markers are located, using only 2D OCT slices and the associated slice-level binary annotations. In a 2D OCT slice, the ETDRS rings correspond to a set of non-contiguous vertical stripes (see Fig. 1). From the placement of the ETDRS rings on the OCT volume, we make three important observations: (1) depending on where an OCT slice is positioned in the volume, different ETDRS rings are visible in the slice; (2) the width of each ring depends on where the slice is positioned; and (3) ring symmetry is preserved regardless of the slice position. We explicitly leverage these observations to design and train our approach.

Specifically, instead of training our method to produce different outputs depending on the slice location, we predefine a partition of 2D OCT slices into image columns (see Fig. 2, left). That is, we train our method to produce predictions for each of these columns, regardless of the specific slice location within the volume. At the end of this section we describe the straightforward post-processing that maps column-level predictions to the ETDRS rings (as shown in Fig. 2, left).

### Model

Formally, we partition a 2D OCT slice, \(\textbf{x}\), into *C* equally spaced columns. We wish to train a model \(f:[0,1]^{H\times W} \rightarrow [0,1]^{(1+C)\times B}\) that maps \(\textbf{x}\) to a collection of probabilities \(\hat{\textbf{y}}\), where *B* is the number of possible marker types. For each marker \(b\in \{1,\dots ,B\}\), the collection \(\hat{\textbf{y}}\) contains both the probability of presence of *b* in the entire OCT slice, \(\hat{\textbf{y}}_{0,b}\), and the probability of presence of *b* in each column \(c\in \{1,\dots ,C\}\), denoted \(\hat{\textbf{y}}_{c, b}\). Our training data consists of tuples \((\textbf{x}, \textbf{y}_0)\), with OCT slice \(\textbf{x}\) and corresponding slice-level annotations \(\textbf{y}_0\in \{0, 1\}^B\), with no reference whatsoever to the ring or column in which the markers are located. A comprehensive list of all variables can be found in Table S1 in the supplementary material.

Figure 3 depicts our model architecture. The input OCT slice is processed by a CNN which produces a feature map \(\textbf{z}\in \mathbb {R}^{D_z\times H_z\times W_z}\) with width equal to the number of columns \(C = W_z\). We then apply a number of pooling operations over the feature map \(\textbf{z}\) to describe the entire OCT slice as well as every column *c*. In particular, to identify markers that may appear large or small in a given image, we set the descriptor of the entire OCT slice to be a \(2D_z\)-dimensional vector \(\textbf{d}_0=[\mathop {\mathrm {avg\_pool}}\limits (\textbf{z}), \mathop {\mathrm {max\_pool}}\limits (\textbf{z})]\) obtained as the concatenation of average pooling and maximum pooling over the spatial dimensions of \(\textbf{z}\). Likewise, the descriptor of every column *c* is another \(2D_z\)-dimensional vector \(\textbf{d}_c=[\mathop {\mathrm {avg\_pool}}\limits (\textbf{z}_{\cdot ,\cdot ,c}), \mathop {\mathrm {max\_pool}}\limits (\textbf{z}_{\cdot ,\cdot ,c})]\) obtained as the concatenation of the two pooling operators acting on the corresponding column of \(\textbf{z}\). The descriptor vectors are then processed by a multi-layer perceptron (MLP) followed by an element-wise sigmoid activation to produce the final probabilities,

$$\begin{aligned} \hat{\textbf{y}}_0 = \sigma \left( \text {MLP}(\textbf{d}_0)\right) ,\quad \quad \hat{\textbf{y}}_c = \sigma \left( \text {MLP}(\textbf{d}_c)\right) \quad \forall c. \end{aligned}$$

(1)
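To make the architecture concrete, the following PyTorch sketch implements the column-wise prediction head. It is a minimal sketch under illustrative assumptions: a single convolution stands in for the CNN backbone, and the feature dimension \(D_z\), MLP width, and number of markers are placeholder values, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class ColumnModel(nn.Module):
    """Sketch of the slice- and column-level prediction head (hypothetical sizes)."""
    def __init__(self, d_z=8, n_markers=3):
        super().__init__()
        # Stand-in backbone producing a (D_z, H_z, W_z) feature map z.
        self.cnn = nn.Conv2d(1, d_z, kernel_size=3, padding=1)
        # One MLP shared by the slice descriptor d_0 and the column descriptors d_c.
        self.mlp = nn.Sequential(nn.Linear(2 * d_z, 2 * d_z), nn.ReLU(),
                                 nn.Linear(2 * d_z, n_markers))

    def forward(self, x):                  # x: (N, 1, H, W)
        z = self.cnn(x)                    # (N, D_z, H_z, W_z), with C = W_z
        # Slice descriptor: concatenation of global average and max pooling.
        d0 = torch.cat([z.mean(dim=(2, 3)), z.amax(dim=(2, 3))], dim=1)
        # Column descriptors: pool over the height axis only, keep the width axis.
        dc = torch.cat([z.mean(dim=2), z.amax(dim=2)], dim=1)   # (N, 2*D_z, C)
        y0 = torch.sigmoid(self.mlp(d0))                        # (N, B)
        yc = torch.sigmoid(self.mlp(dc.transpose(1, 2)))        # (N, C, B)
        return y0, yc
```

Sharing a single MLP across the slice and column descriptors keeps the number of parameters independent of *C*, since `nn.Linear` acts on the last tensor axis.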

### Training

We use a combination of three loss terms to train our model. The first term uses the standard binary cross entropy (BCE) of the slice-level predictions \(\hat{\textbf{y}}_0\) with the slice-level ground-truth annotations \(\textbf{y}_0\),

$$\begin{aligned} \ell _1(\hat{\textbf{y}}, \textbf{y}_0) = \sum _{b} \text {BCE}(\hat{\textbf{y}}_{0,b}, \textbf{y}_{0,b}). \end{aligned}$$

(2)

The second term incorporates constraints on column-level predictions based on the image-level ground-truth. Specifically, when a biological marker is not present in the input image, \(\textbf{y}_{0,b} = 0\), we penalize high predicted probabilities for *b* in all the columns. On the other hand, if the marker is present, \(\textbf{y}_{0,b}=1\), we encourage a high probability for *b* for at least one column. Formally, we compute,

$$\begin{aligned} \ell _2(\hat{\textbf{y}}, \textbf{y}_0) = -\sum _b (1-\textbf{y}_{0,b})\dfrac{1}{C}\sum _c \log (1-\hat{\textbf{y}}_{c,b}) - \sum _b \textbf{y}_{0,b}\max _c \log \hat{\textbf{y}}_{c,b}. \end{aligned}$$

(3)
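A possible implementation of Eq. (3) for a single slice, with column probabilities of shape `(C, B)`. The `eps` clamp is our addition for numerical stability, and we exploit the identity \(\max _c \log \hat{\textbf{y}}_{c,b} = \log \max _c \hat{\textbf{y}}_{c,b}\) since the logarithm is monotone.

```python
import torch

def loss_l2(y_col, y0, eps=1e-7):
    """Column-level consistency term of Eq. (3).
    y_col: (C, B) predicted column probabilities; y0: (B,) slice-level labels."""
    # Absent marker (y0 = 0): push every column probability toward zero.
    absent = -((1 - y0) * torch.log(1 - y_col + eps).mean(dim=0)).sum()
    # Present marker (y0 = 1): at least one column must have high probability.
    present = -(y0 * torch.log(y_col + eps).amax(dim=0)).sum()
    return absent + present
```

The averaging over columns in the absent case spreads the penalty evenly, while the max over columns in the present case only rewards the best-matching column, so the loss does not force a present marker to appear everywhere.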

The last term imposes invariance to horizontal symmetry on the column-level probabilities. When our model receives a horizontally flipped image \(\textbf{x}'\), the predicted column-level probabilities \(\hat{\textbf{y}}'\) should also be flipped, and therefore \(\hat{\textbf{y}}_{c,b}\) should be equal to \(\hat{\textbf{y}}'_{C-c,b}\) for all *b*. To this end, we penalize a symmetric KL divergence between the corresponding probabilities,

$$\begin{aligned} \ell _3(\hat{\textbf{y}}, \hat{\textbf{y}}') = \dfrac{1}{2}\sum _{c, b} \left( D_{KL}\big ({\hat{\textbf{y}}_{c, b}}\Vert {{\hat{\textbf{y}}'_{C-c, b}}}\big ) + D_{KL}\big ({\hat{\textbf{y}}'_{c, b}}\Vert {{\hat{\textbf{y}}_{C-c, b}}}\big ) \right) . \end{aligned}$$

(4)
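Since each \(\hat{\textbf{y}}_{c,b}\) is a Bernoulli probability, each KL term in Eq. (4) reduces to the two-point formula \(p\log (p/q) + (1-p)\log ((1-p)/(1-q))\). A sketch follows, assuming 0-indexed columns so that reversing the column axis realizes the \(c \leftrightarrow C-c\) pairing; the `eps` clamp is our addition.

```python
import torch

def bernoulli_kl(p, q, eps=1e-7):
    # Element-wise KL divergence between Bernoulli(p) and Bernoulli(q).
    return (p * torch.log((p + eps) / (q + eps))
            + (1 - p) * torch.log((1 - p + eps) / (1 - q + eps)))

def loss_l3(y_col, y_col_flip):
    """Symmetry term of Eq. (4). y_col, y_col_flip: (C, B) column probabilities
    for the original and the horizontally flipped slice."""
    # Reverse the column axis of the flipped prediction so that column c of the
    # original lines up with column C - c of the flipped slice.
    q = torch.flip(y_col_flip, dims=[0])
    return 0.5 * (bernoulli_kl(y_col, q) + bernoulli_kl(q, y_col)).sum()
```

When the two predictions are perfectly mirror-consistent, reversing the flipped prediction recovers the original exactly and the loss vanishes.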

Specifically, \(\ell _3\) incorporates the symmetry of the ETDRS rings that we wish to induce in our model. Note that the desired horizontal symmetry cannot be obtained by random horizontal flipping augmentation alone: \(\ell _3\) explicitly enforces the column predictions to be consistent regardless of whether the image is flipped or not. Applying the same symmetry argument to \(\ell _1\) and \(\ell _2\), our final loss is,

$$\begin{aligned} \mathscr {L}(\hat{\textbf{y}}, \hat{\textbf{y}}', \textbf{y}_0) = \ell _1(\hat{\textbf{y}}, \textbf{y}_0) + \ell _1(\hat{\textbf{y}}', \textbf{y}_0) + \ell _2(\hat{\textbf{y}}, \textbf{y}_0) + \ell _2(\hat{\textbf{y}}', \textbf{y}_0) + \ell _3(\hat{\textbf{y}}, \hat{\textbf{y}}'), \end{aligned}$$

(5)

where \(\hat{\textbf{y}}\) and \(\hat{\textbf{y}}'\) are the predicted probabilities for the input image \(\textbf{x}\) and corresponding horizontally-flipped version \(\textbf{x}'\), respectively. Figure 4 shows a graphical explanation for \(\ell _2\) and \(\ell _3\).
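Putting the terms together, the following is a self-contained sketch of Eq. (5) operating directly on the model outputs for the original and flipped slice. Shapes, the `eps` clamp, and all names are illustrative assumptions; each term is re-implemented inline so the function stands on its own.

```python
import torch
import torch.nn.functional as F

def eq5_loss(p0, pc, p0f, pcf, y0, eps=1e-7):
    """Total loss of Eq. (5) for one slice.
    p0, pc: slice/column probabilities for x, shapes (B,) and (C, B);
    p0f, pcf: the same quantities for the horizontally flipped slice x'."""
    def l1(p):                                 # Eq. (2): slice-level BCE
        return F.binary_cross_entropy(p, y0, reduction="sum")
    def l2(p):                                 # Eq. (3): column consistency
        absent = -((1 - y0) * torch.log(1 - p + eps).mean(dim=0)).sum()
        present = -(y0 * torch.log(p + eps).amax(dim=0)).sum()
        return absent + present
    def kl(p, q):                              # Bernoulli KL, element-wise
        return (p * torch.log((p + eps) / (q + eps))
                + (1 - p) * torch.log((1 - p + eps) / (1 - q + eps)))
    q = torch.flip(pcf, dims=[0])              # align column c with column C - c
    l3 = 0.5 * (kl(pc, q) + kl(q, pc)).sum()   # Eq. (4): symmetry term
    return l1(p0) + l1(p0f) + l2(pc) + l2(pcf) + l3
```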

### Inference

At test time, once a slice has been evaluated by our network, we can infer the layout of the ETDRS rings in the slice. The correspondence between columns and rings is not one-to-one: a single ring usually contains several columns, and one column may be shared between two rings. Thus, to produce ring-level predictions, we compute the maximum of the probabilities of the columns contained in each ring. For columns spanning two rings, we weight the contribution of the column by the proportion of the column inside each ring, as shown in Fig. 2.
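This column-to-ring mapping can be sketched as follows. The ring boundaries, given in fractional column units, are a hypothetical input that would be derived from the ETDRS geometry of Fig. 2 for the slice position at hand.

```python
import numpy as np

def ring_probabilities(y_col, boundaries):
    """Map column probabilities to ring probabilities by a weighted maximum.
    y_col: (C, B) column probabilities; boundaries: ring edges in fractional
    column units, e.g. [0, 1.5, 4] for two rings over 4 columns (illustrative)."""
    C = y_col.shape[0]
    rings = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        probs = []
        for c in range(C):                     # column c covers [c, c + 1)
            # Fraction of column c lying inside the ring interval [lo, hi).
            overlap = max(0.0, min(c + 1, hi) - max(c, lo))
            if overlap > 0:
                # Columns straddling two rings contribute proportionally.
                probs.append(overlap * y_col[c])
        rings.append(np.max(np.stack(probs), axis=0))
    return np.stack(rings)                     # (n_rings, B)
```

A column fully inside a ring contributes its probability unchanged (overlap 1), while a column split between two rings contributes a down-weighted probability to each.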