The Institutional Review Board of Chung-Ang University Hospital approved this study (IRB No. 2029-028-19439) and waived the informed consent requirement because of its retrospective design. This study was conducted in accordance with the ethical standards outlined in the Declaration of Helsinki.
Participants
We obtained orbital CT scans (Philips Brilliance 256 Slice CT, Philips Healthcare Systems, Andover, MA, USA) without contrast from TAO patients diagnosed between January 2010 and October 2019. Our study included 432 TAO patients, of whom 144 were classified as active and 288 as inactive. TAO was diagnosed according to Bartley and Gorman’s criteria34. A seven-point modified CAS formula assessed inflammatory activity by assigning one point to each of the following items: retrobulbar pain, eye movement pain, eyelid redness, conjunctival injection, caruncle or plica inflammation, eyelid swelling, and chemosis. Patients with a CAS ≥ 3 were classified as active, while patients with a CAS < 3 were classified as inactive. The CT scan and the inflammatory activity evaluation were performed on the same day. Two ophthalmologists, each with more than five years of experience in oculoplasty, were blinded to patient information for the CT image analysis and the clinical inflammatory activity evaluation. In case of disagreement, the two experts jointly reviewed the images to reach a consensus. Patients under 18 years of age; those with a history of orbital surgery, orbital tumor, blowout fracture, or idiopathic orbital inflammation; those receiving IV steroid treatment or radiation therapy at the time of the CT scan; and those with incomplete CT scans were excluded.
Slice selection
Each patient’s orbital CT comprised 80 to 400 image slices. However, only a few slices from the axial, coronal, and sagittal planes were selected, to improve TAO activity evaluation performance by avoiding diagnostic model confusion from redundant information. First, we selected the axial slice with the largest lens (AX1), then the slices 3 mm above (AX2) and below (AX3) and 7 mm above (AX4) and below (AX5) AX1. Next, the slice with the largest lacrimal gland was chosen (AX6). In the coronal plane, the slice exhibiting the largest eyeball was selected first (CO1); we then picked the slices at 1/2 and 2/3 of the distance between CO1 and the orbit exit (CO2, CO3). Finally, for each eye, the sagittal slice with the largest eyeball was selected (SA_L, SA_R). In total, eleven CT slices were selected per subject: six axial, three coronal, and two sagittal.
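The 3 mm and 7 mm offsets from AX1 must be translated into slice indices using the scan’s slice spacing. A minimal sketch of this mapping (the function name and the example indices are ours, not from the study, and the sign convention for “above” versus “below” depends on scan orientation):

```python
def offset_slice_index(ref_index: int, offset_mm: float, spacing_mm: float) -> int:
    """Map a physical offset (in mm) from a reference slice to a slice index,
    rounding to the nearest acquired slice."""
    return ref_index + round(offset_mm / spacing_mm)

# Example: reference axial slice AX1 at index 120, 1 mm slice spacing.
ax2 = offset_slice_index(120, +3, 1.0)   # slice 3 mm to one side of AX1
ax5 = offset_slice_index(120, -7, 1.0)   # slice 7 mm to the other side
print(ax2, ax5)  # 123 113
```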
Data preparation and processing
After slice selection, we performed Hounsfield Unit windowing for better structure identification. We used the Pydicom library’s Value of Interest Look-Up Table (VOI LUT) function to convert the original CT pixel values into values ranging from 0 to 1, with the Window Center set to 0 and the Window Width to 200. Next, we segmented the identifiable structures from the 11 CT slices, including the eyeball, the four rectus muscles, the optic nerve, and the orbital fat, with a few exceptions. Because the superior rectus and levator palpebrae superioris muscles could not be reliably distinguished from each other, they were segmented together as a single muscle group, namely the levator-superior rectus (SR) complex. The oblique muscles were excluded because they are difficult to distinguish clearly in CT images. Since AX6 was selected to represent the lacrimal glands, only the lacrimal glands were segmented in AX6. In addition, we segmented the upper eyelid in the two sagittal slices because of its relevance to TAO. Finally, we segmented the entire orbit in the CO3 slice because it was difficult to distinguish among the four rectus muscles and the optic nerve in that slice (Table 4). As a result, we acquired 78 segmentation images from the eleven selected slices.
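The study applied the windowing through Pydicom’s VOI LUT machinery; the same linear window (WC = 0, WW = 200) can be reproduced with plain NumPy, which makes the mapping explicit. A minimal sketch (the function name is ours):

```python
import numpy as np

def window_to_unit(hu: np.ndarray, center: float = 0.0, width: float = 200.0) -> np.ndarray:
    """Linearly map HU values in [center - width/2, center + width/2] to [0, 1],
    clipping values outside the window (WC = 0, WW = 200 as in the study)."""
    lower = center - width / 2.0
    return np.clip((hu - lower) / width, 0.0, 1.0)

slice_hu = np.array([-300.0, -100.0, 0.0, 100.0, 250.0])
print(window_to_unit(slice_hu))  # [0.  0.  0.5 1.  1. ]
```

With this window, soft tissue around 0 HU lands mid-range while bone and air saturate at the clip boundaries.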
Neural network model
This study defined three considerations to achieve the best TAO activity evaluation performance when devising the proposed NN. First, input slices were chosen from three planes instead of one to capitalize on information from multiple views. Second, to exploit possible interactions among identifiable structures in the same slice, the proposed NN processed all segmented images from one slice through a single pipeline, with each segmented image encoded as a channel. Third, we conducted a pipeline-combination ablation study to avoid possible degradation in evaluation performance. Figure 3 illustrates the proposed NN’s overall architecture, which consists of pipeline heads, a bottleneck layer, and a model body.
Each CT slice head is an operation sequence consisting of 7 × 7 convolution, batch normalization (BN), and ReLU layers and one Cross Stage Partial (CSP) Block. The standard convolution layer output values were calculated with \({y}_{i,j,k}={{w}_{k}}^{T}{x}_{i,j}+{b}_{k}\), where \({x}_{i,j}\) was the input value subset centered at \((i, j)\), \({y}_{i,j,k}\) was the output value at \((i, j)\) in the \(k\)th feature map, and \({w}_{k}\) and \({b}_{k}\) were the \(k\)th filter’s weight vector and bias, respectively. The ReLU activation function was defined as \(ReLU(x)=max(x, 0)\), and the max and average pooling operation output values were \({y}_{i,j,k}={max}_{a\in {x}_{i,j,k}}a\) and \({y}_{i,j,k}=\frac{1}{|{x}_{i,j,k}|}{\sum }_{a\in {x}_{i,j,k}}a\), respectively, where \({x}_{i,j,k}\) was the input value subset centered at \((i, j)\) in the \(k\)th input feature map and \({y}_{i,j,k}\) was the output value at \((i, j)\) in the \(k\)th output feature map. The CSP Dense Block was a modified Dense Block that reused gradient information while mitigating the gradient duplication problem, with promising performance35,36,37,38,39. The Dense Block was a convolution block with multiple densely connected Dense layers40.
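The ReLU and pooling formulas above can be made concrete with a small NumPy illustration of a single feature map (this is an illustrative re-implementation of the standard operations, not the authors’ code; `pool2d` and its non-overlapping windows are our simplification):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(x, 0), applied element-wise
    return np.maximum(x, 0)

def pool2d(fmap: np.ndarray, k: int, mode: str = "max") -> np.ndarray:
    """Non-overlapping k x k pooling over one feature map (H, W):
    each output value is the max (or mean) of its k x k input window."""
    h, w = fmap.shape
    windows = fmap[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k)
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

fm = np.array([[1., -2., 3., 0.],
               [4., 5., -6., 7.],
               [0., 1., 2., 3.],
               [8., -1., 4., 5.]])
print(pool2d(relu(fm), 2, "max"))  # [[5. 7.] [8. 5.]]
```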
The bottleneck layer extracted essential information by compressing the head-generated feature maps. Each pipeline head output 64 feature maps, yielding (number of input CT slices) × 64 feature maps in total. The proposed NN then compressed these feature maps with a 1 \(\times \) 1 convolution; the bottleneck layer consisted of a BN layer, a ReLU layer, and the 1 \(\times \) 1 convolution, and feature maps passing through it were compressed into 64 channels. The model body processed the compressed feature maps through three CSP Dense Blocks containing 12, 24, and 16 Dense layers, respectively. Spatial features were then compressed into a one-dimensional vector by 7 × 7 average pooling. The proposed NN concatenated one-hot-encoded age and sex data with this one-dimensional vector; these clinical inputs were treated as an additional pipeline. Finally, the concatenated vector was fed into a fully connected layer and a sigmoid function to output class probabilities. The final active probability \(y\) was calculated by
$$y=F(B\left({h}_{1}\left({x}_{1}^{1}, \dots , {x}_{1}^{{n}_{1}}\right), \dots , {h}_{m}\left({x}_{m}^{1}, \dots ,{x}_{m}^{{n}_{m}} \right)\right), c)$$
where \({x}_{i}^{j}\) was the \(i\)th selected slice’s \(j\)th segmented image, \({h}_{i}\) was the \(i\)th head, \(B\) was a composition function from the bottleneck layer to the 7 × 7 average pooling, \(F\) was the fully connected layer, and \(c\) was the clinical data.
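Because a 1 × 1 convolution is simply a per-pixel linear map over channels, the bottleneck’s compression of the concatenated head outputs back to 64 channels can be sketched with an einsum (toy sizes and random weights, purely for shape illustration; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# m pipeline heads each emit 64 feature maps; a 1 x 1 convolution compresses
# the concatenated m * 64 channels back to 64. Spatial size is a toy value.
m, h, w = 3, 8, 8
heads_out = rng.standard_normal((m * 64, h, w))   # concatenated head outputs
w_1x1 = rng.standard_normal((64, m * 64))         # 1 x 1 conv weights

# For every pixel (i, j), output channel o = sum_c w[o, c] * heads_out[c, i, j]
compressed = np.einsum('oc,chw->ohw', w_1x1, heads_out)
print(compressed.shape)  # (64, 8, 8)
```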
Ablation study
The proposed NN individually examines its input data (11 slices, age, and sex) through 13 pipelines; thus, disabling uninfluential pipelines can distinguish the information essential for TAO activity evaluation. To achieve this, we devised a procedure that enables or disables the 13 pipelines based on validation performance to estimate the optimal pipeline set. First, the algorithm checks the 13 single-input pipeline performances, enabling one pipeline at a time, to determine the best single-input pipeline. Next, in addition to the best single-input pipeline from the previous stage, each disabled pipeline is enabled in turn to identify the best dual-input combination. The algorithm then repeats this step, enabling one more disabled pipeline alongside the current best combination, until all 13 pipelines have been considered. Finally, the algorithm outputs the pipeline combination with the best evaluation performance.
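This procedure is a greedy forward selection over pipelines. A sketch of its control flow, with a toy scoring function standing in for validation performance (the function names, pipeline labels, and scores are illustrative, not from the study):

```python
def greedy_pipeline_selection(pipelines, validate):
    """Greedy forward selection: start from the best single pipeline, then
    repeatedly enable the disabled pipeline that yields the best validation
    score; return the best combination seen at any stage.
    `validate(enabled_set)` must return a score (higher is better)."""
    enabled, best_combo, best_score = set(), None, float("-inf")
    remaining = set(pipelines)
    while remaining:
        candidate = max(remaining, key=lambda p: validate(enabled | {p}))
        enabled = enabled | {candidate}
        remaining.discard(candidate)
        score = validate(enabled)
        if score > best_score:
            best_combo, best_score = frozenset(enabled), score
    return best_combo, best_score

# Toy score: two pipelines carry signal, the rest add a small penalty.
useful = {"AX1": 0.6, "age": 0.3}
score = lambda s: sum(useful.get(p, -0.05) for p in s)
print(greedy_pipeline_selection(["AX1", "AX2", "CO1", "age"], score))
```

Note that, like any greedy search, this explores only 13 + 12 + … + 1 combinations rather than all 2^13 subsets, trading exhaustiveness for tractability.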
Neural network evaluation
We employed two recent state-of-the-art computer-vision NNs to verify the proposed NN’s performance: CSPDenseNet and ConvNeXt. For a fair comparison, both were slightly modified, like the proposed NN, to receive 224 \(\times \) 224 \(\times \) 78 CT slices, age, and sex. We implemented these models with the PyTorch (1.10.1) library, and all experiments were conducted on a GeForce RTX 3090 (24 GB). The hyper-parameters were a batch size of 32, 30 epochs, the AdamW optimizer, and a 1e-3 learning rate halved every ten epochs by a step learning-rate scheduler. We assessed the three NNs on six metrics: area under the receiver operating characteristic curve (AUROC), accuracy, F1-score, sensitivity, specificity, and precision. For training and evaluation, the 432 patients were divided into training, validation, and test sets at a 0.7 : 0.15 : 0.15 ratio through stratified random sampling. The data split, training, and testing were repeated ten times. Furthermore, we conducted additional model comparison experiments based on the ten selected pipeline subsets with the same settings (environment, hyper-parameters, evaluation metrics, data split, repetition). As a result, we could discern the essential CT slices to aid medical expert interpretation.
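The step schedule above (initial rate 1e-3, halved every ten epochs, which corresponds to PyTorch’s `StepLR` with `step_size=10, gamma=0.5`) has a simple closed form; a sketch in plain Python:

```python
def step_lr(epoch: int, base_lr: float = 1e-3, step: int = 10, gamma: float = 0.5) -> float:
    """Learning rate at a given epoch under a step schedule:
    base_lr is multiplied by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

print([step_lr(e) for e in (0, 9, 10, 20, 29)])
# [0.001, 0.001, 0.0005, 0.00025, 0.00025]
```

Over the 30-epoch run this gives three plateaus: 1e-3, 5e-4, and 2.5e-4.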
Visual explanation method
Gradient-weighted Class Activation Mapping (Grad-CAM) assisted our analysis of the experimental results41, as it generates visual explanations of the final NN’s decisions. Grad-CAM uses the gradients of the class probabilities to create a coarse score map highlighting the locations essential to a decision. Specifically, the average gradient value over a filter’s feature map in a specific layer is regarded as that feature map’s importance. Each feature map is then multiplied by its importance, and the weighted feature maps are averaged along the channel axis to produce the coarse score map. Since our model receives multiple input images, we used each head’s feature maps and gradients to visualize the corresponding slice.
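The three steps above reduce to a few lines of NumPy; a minimal sketch of the score-map computation (an illustrative re-implementation, not the study’s code; following the description above, the weighted maps are averaged rather than summed, which differs from the original Grad-CAM only by a constant factor):

```python
import numpy as np

def grad_cam(fmaps: np.ndarray, grads: np.ndarray) -> np.ndarray:
    """Coarse Grad-CAM score map from one layer's feature maps and the
    gradients of the class score with respect to them (both shaped (C, H, W)):
    1) each feature map's importance is the spatial mean of its gradients,
    2) each feature map is multiplied by its importance,
    3) the weighted maps are averaged along the channel axis and only
       positive evidence is kept (ReLU)."""
    weights = grads.mean(axis=(1, 2))                    # (C,) importances
    cam = (weights[:, None, None] * fmaps).mean(axis=0)  # (H, W) score map
    return np.maximum(cam, 0)

# Toy example: two constant feature maps with opposite gradient signs.
fmaps = np.ones((2, 4, 4))
grads = np.stack([np.full((4, 4), 0.5), np.full((4, 4), -0.2)])
print(grad_cam(fmaps, grads))
```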
Statistical analysis
Descriptive statistics are presented as the mean ± standard deviation for continuous variables and as the number and percentage for categorical variables. Continuous variables were compared using pairwise t-tests. We used Python’s open-source Scikit-learn library for all statistical analyses; p-values \(<\) 0.05 were considered statistically significant.