We use the Minkowski Engine proposed by Choy et al.5 as the backbone of a sparse CNN. Minkowski Engine was originally designed as a general-purpose tool for the analysis of 4D spatio-temporal data and uses sparse tensors as its basic data structure. A sparse tensor \(\mathscr {F}\) is a generalized representation of a sparse matrix in which most of the points are empty (zero). A third-order sparse tensor can be expressed as:
$$\begin{aligned} \mathscr {F}(x_i,y_i,z_i)=\left\{ \begin{matrix} f_i, & (x_i,y_i,z_i) \in \mathcal {C}\\ 0, & \text{otherwise} \end{matrix}\right. \end{aligned}$$
(1)
where \(\mathcal {C}\) is the coordinate matrix (row-wise concatenation of coordinates) of the non-empty points and \(f_i \in \mathbb {R}^{N_F}\) is the non-empty value at coordinate \((x_i,y_i,z_i)\). \({N_F}\) is the number of channels at this point. \(\mathcal {F}=\begin{Bmatrix} f_1,f_2,\ldots ,f_i,f_{i+1},\ldots \end{Bmatrix}\) is the feature vector. A sparse CNN relies only on \(\mathcal {C}\) and \(\mathcal {F}\) for feature computation. In our study, we apply the sparse CNN specifically to sparse binary volumes of static data, i.e., the skull images, which are typical examples of sparse tensors, since the majority of voxels in a skull image are zero. The input of the sparse CNN consists of a coordinate matrix \(\mathcal {C}_{in}\) and the associated feature vectors \(\mathcal {F}_{in}\):
$$\begin{aligned} \mathcal {C}_{in}=\begin{bmatrix} x_{1} & y_{1} & z_{1}\\ x_{2} & y_{2} & z_{2} \\ \vdots & \vdots & \vdots \\ x_{N} & y_{N} & z_{N} \end{bmatrix} , \mathcal {F}_{in}=\begin{bmatrix} 1\\ 1\\ \vdots \\ 1 \end{bmatrix} \end{aligned}$$
(2)
Here, N is the number of non-zero voxels in a skull image. Note that the coordinates used in our study are voxel grid coordinates (e.g., from [0, 0, 0] to [512, 512, Z]) rather than the world coordinates of point clouds. Since the skull data are binary and the number of channels per voxel is one (\({N_F}=1\)), the feature vector has the form \(\mathcal {F}_{in} \in \mathbb {R}^{N\times 1}\), and the elements in \(\mathcal {F}_{in}\) are all 1. For three-dimensional voxel grid coordinates, \(\mathcal {C}_{in} \in \mathbb {Z}^{N\times 3}\). A general data pre-processing step for using the sparse CNN is to format the input and ground truth skull images according to Eq. (2).
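A minimal sketch of this pre-processing step is shown below, assuming the skull is available as a binary NumPy volume. The file name and variable names are illustrative, and the MinkowskiEngine calls reflect our understanding of that library's API rather than the exact code used in this work.

```python
import numpy as np
import torch
import MinkowskiEngine as ME  # sparse CNN backbone used in this work


def to_sparse_input(skull_volume: np.ndarray):
    """Convert a binary skull volume into (C_in, F_in) as in Eq. (2)."""
    # Voxel grid coordinates of all non-zero voxels, shape (N, 3), integer valued.
    coords = np.argwhere(skull_volume > 0).astype(np.int32)
    # All-ones feature vector, shape (N, 1), since the data are binary (N_F = 1).
    feats = np.ones((coords.shape[0], 1), dtype=np.float32)
    return coords, feats


# Example: build a sparse tensor for a single sample (hypothetical file name).
skull = np.load("defective_skull.npy")
coords, feats = to_sparse_input(skull)
coords_b = ME.utils.batched_coordinates([coords])  # prepend the batch index
x = ME.SparseTensor(features=torch.from_numpy(feats), coordinates=coords_b)
```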
Similar to existing CNN methods for shape completion, we use an auto-encoder architecture for the task, but we replace the conventional dense convolutional layers with sparse convolutional layers5. Table 1 shows the configuration of each layer in the sparse CNN used for the experiments. ch is a list of the channel numbers for each layer. As the number of output channels (\(C^{out}\)) in each layer is no longer a constant 1 as in the input skull image, we use \(\mathcal {F}_{in}^i \in \mathbb {R}^{C_{i-1}^{out}}\) as a general notation for the feature vector entering intermediate layer i (i.e., the output of layer \(i-1\)). The convolution operation at a coordinate \(D \in \mathbb {Z}^3\) in the sparse CNN can therefore be defined similarly to that of a traditional dense CNN:
$$\begin{aligned} \mathcal {F}_{in}^{i+1}(D')=\sum w^i\mathcal {F}_{in}^i(D)+b_i, \end{aligned}$$
(3)
where \(w^i \in \mathbb {R}^{ C_{i}^{out} \times C_{i-1}^{out}}\) and \(b_i\) are the weight matrix and bias of intermediate layer i. \(D' \in \mathbb {Z}^3\) is the corresponding coordinate in layer \(i+1\) mapped from D. Note that, unlike a traditional dense CNN that operates on regular voxel grids sequentially, a sparse CNN requires an explicit coordinate mapping to know how D is mapped to \(D'\), since the non-zero voxels can be distributed arbitrarily and, by extracting only the non-zero voxels, the spatial context within an image is lost. For this coordinate mapping, Minkowski Engine uses a pair of voxel indices from the input and ground truth images to memorize the mapping relationship as in a regular voxel grid, leading to coordinate-related computation overhead, comparable to Li et al.25. In Minkowski Engine, the coordinates and the voxel indices are stored in a hash table (the hash function used is FNV64-1A), where the coordinates serve as hash keys to retrieve the original voxel indices of the associated elements in a feature vector. Even though the hash table is not directly involved in feature computation, it determines how an element from the input feature vector is mapped to an element computed according to Eq. (3) in the output feature vector.
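The role of this coordinate-to-index mapping can be illustrated with a plain Python dictionary, reusing coords and feats from the preprocessing sketch above. This is conceptual only; Minkowski Engine implements the map with an FNV64-1A hash table in C++/CUDA rather than as shown here.

```python
# Conceptual sketch: coordinates act as keys that retrieve the row index of the
# corresponding element in the feature vector.
coordinate_map = {tuple(c): row for row, c in enumerate(coords)}


def feature_at(x, y, z, feats, coordinate_map):
    """Return the feature stored at voxel (x, y, z), or 0 if the voxel is empty."""
    row = coordinate_map.get((x, y, z))
    return feats[row] if row is not None else 0.0
```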
Shape completion
For skull shape completion, the input is a defective skull and the output (ground truth) is the complete skull. It can be divided into two sub-tasks: reconstructing the original defective skull \(\begin{Bmatrix} \mathcal {C}_{in}, \mathcal {F}_{in} \end{Bmatrix}\) and restoring the missing skull bone (i.e., the implant) \(\begin{Bmatrix} \mathcal {C}_{imp}, \mathcal {F}_{imp} \end{Bmatrix}\):
$$\begin{aligned} \mathcal {C}_{out}^{sc}=\begin{bmatrix} \mathcal {C}_{in}\\ \mathcal {C}_{imp} \end{bmatrix} , \mathcal {F}_{out}^{sc}=\begin{bmatrix} \mathcal {F}_{in}\\ \mathcal {F}_{imp} \end{bmatrix} \end{aligned}$$
(4)
where
$$\begin{aligned} \mathcal {C}_{imp}=\begin{bmatrix} x_{N+1} & y_{N+1} & z_{N+1} \\ x_{N+2} & y_{N+2} & z_{N+2} \\ \vdots & \vdots & \vdots \\ x_{N+M} & y_{N+M} & z_{N+M} \end{bmatrix}, \mathcal {F}_{imp}=\begin{bmatrix} 1\\ 1\\ \vdots \\ 1 \end{bmatrix} \end{aligned}$$
(5)
M is a variable denoting the number of non-zero voxels in the generated set of coordinates \(\mathcal {C}_{imp}\). \(\mathcal {F}_{imp} \in \mathbb {R}^{M\times 1}\), and the elements in \(\mathcal {F}_{imp}\) are all 1. Obviously, M can differ between skull instances, considering the variations of skulls and defects. According to Eq. (4), the sparse CNN needs to generate new sets of coordinates \(\mathcal {C}_{imp}\) at which the values are non-zero for the skull shape completion task (\(M\ne 0\) such that \(\mathcal {C}_{in}\subset \mathcal {C}_{out}^{sc}\)). The generative sparse tensor decoder in Table 1 is composed of generative transposed convolutional layers32 that are capable of generating new non-zero points absent in the input. Given a sparse tensor \(\mathscr {F}\) as input, the output of a transposed convolution \(\mathscr {F}'\) can be written as:
$$\begin{aligned} \mathscr {F}'[x,y,z]=\sum _{i,j,k \in \mathcal {N}(x,y,z)}\mathcal {W}[x-i,y-j,z-k]\mathscr {F}[i,j,k] \end{aligned}$$
(6)
where \((x,y,z) \in \mathcal {C}'\) and \((i,j,k) \in \mathcal {C}\). \(\mathcal {W}\) is the kernel weight. \(\mathcal {C}\) and \(\mathcal {C}'\) are the input and output coordinate matrices, respectively, and they have the following relationship:
$$\begin{aligned} \mathcal {C}' =\mathcal {C}\otimes \left[ -\frac{Ks-1}{2},\ldots ,\frac{Ks-1}{2}\right] ^3 \end{aligned}$$
(7)
\(\otimes \) denotes the outer product. A point (x, y, z) generated by a transposed convolution is constrained with respect to the input coordinates (i, j, k) as follows, according to Eq. (6):
$$\begin{aligned} \mathcal {N}(x,y,z)= \begin{Bmatrix} (i,j,k)\,\big |\,|x-i|\leqslant \frac{Ks-1}{2},|y-j|\leqslant \frac{Ks-1}{2},|z-k|\leqslant \frac{Ks-1}{2} \end{Bmatrix} \end{aligned}$$
(8)
We can see from Eqs. (6), (7) and (8) that using a kernel size greater than two would expand the span (e.g., \([-Ks, Ks]\)) of the input coordinates, allowing a transposed convolution to dynamically generate new non-zero points for generative tasks like shape completion. In our specific task, the generative sparse tensor decoder in Table 1 is trained to generate M new points, while maintaining the original input coordinates.
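The coordinate expansion of Eq. (7) can be sketched in a few lines of NumPy for an odd kernel size Ks and unit stride; the function name is illustrative and is not part of Minkowski Engine.

```python
import numpy as np


def expand_coordinates(coords: np.ndarray, Ks: int = 3) -> np.ndarray:
    """Output coordinates C' reachable from input coordinates C by a
    generative transposed convolution with odd kernel size Ks (Eq. (7))."""
    r = (Ks - 1) // 2
    offsets = np.arange(-r, r + 1)
    # All 3D offsets in [-(Ks-1)/2, (Ks-1)/2]^3, shape (Ks^3, 3).
    grid = np.stack(np.meshgrid(offsets, offsets, offsets, indexing="ij"),
                    axis=-1).reshape(-1, 3)
    # Every input coordinate spawns Ks^3 candidate output coordinates.
    candidates = (coords[:, None, :] + grid[None, :, :]).reshape(-1, 3)
    return np.unique(candidates, axis=0)
```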
Each transposed convolution layer in Table 1 is followed by a pruning layer that prunes out undesirable new points, which is essential for maintaining a low memory and computation cost during the generative process. During training, the ground truth masks teach the network when to keep or prune a point. During inference, the ground truth masks are unavailable. The network prunes a point if its feature value is lower than a pre-defined threshold \(\tau \). In our network, we choose \(\tau =0\).
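A sketch of one such decoder stage is given below. It assumes MinkowskiEngine's generative transposed convolution and pruning modules; the kernel size, stride and the auxiliary 1-channel classification convolution are placeholders and do not reproduce the exact Table 1 configuration.

```python
import torch.nn as nn
import MinkowskiEngine as ME


class GenerativeBlock(nn.Module):
    """One decoder stage: generate candidate voxels, then prune with threshold tau."""

    def __init__(self, in_ch, out_ch, tau=0.0):
        super().__init__()
        self.upconv = ME.MinkowskiGenerativeConvolutionTranspose(
            in_ch, out_ch, kernel_size=3, stride=2, dimension=3)
        self.cls = ME.MinkowskiConvolution(out_ch, 1, kernel_size=1, dimension=3)
        self.pruning = ME.MinkowskiPruning()
        self.tau = tau

    def forward(self, x):
        out = self.upconv(x)  # generates new non-zero coordinates (Eqs. (6)-(8))
        # Inference-time rule: keep a point if its score exceeds tau; during
        # training, the keep mask would instead come from the ground truth.
        keep = (self.cls(out).F > self.tau).squeeze(1)
        return self.pruning(out, keep)  # drop undesirable points
```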
Shape super-resolution
Skull shape super-resolution refers to the process of transforming a (completed) coarse binary skull shape into its smooth, high-resolution representation with fine geometric details. The input is the completed skull at a low resolution \(\mathbb {Z}^a\), and the output is the same completed skull at a higher resolution \(\mathbb {Z}^b\), \(a<b\). Note that for the skull super-resolution task, the coarse and the high-resolution skull (i.e., the ground truth) have to be in the same coordinate system for the coordinate mapping in the sparse CNN to work properly, meaning that the coarse skull image needs to be up-scaled to the same size as the target high-resolution image, i.e., \(a=b\). The network would fail to converge when, for example, the input has a resolution of \(64\times 64\times (Z/8)\) while the ground truth has a resolution of \(256\times 256\times (Z/2)\). By \(a=b\) we do not mean that the input and the ground truth have the same resolution in terms of image quality. Rather, we mean that the input is interpolated to the same size as the ground truth. The up-scaled input still appears blurry and coarse, and lacks geometric details.
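A minimal sketch of this up-scaling step is shown below, using nearest-neighbour interpolation in PyTorch as one possible choice; the variable names, the target size and Z are illustrative.

```python
import torch
import torch.nn.functional as F

# coarse_volume: hypothetical binary array of shape (64, 64, Z // 8);
# the target size corresponds to the ground-truth grid.
coarse = torch.from_numpy(coarse_volume).float()[None, None]  # -> (1, 1, D, H, W)
upscaled = F.interpolate(coarse, size=(256, 256, Z // 2), mode="nearest")
upscaled = (upscaled.squeeze() > 0).numpy()  # back to a binary volume on the fine grid
```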
According to Ref.25, the difference between an (up-scaled) coarse skull voxel grid and a high-resolution voxel grid lies simply in the arrangement patterns of the zero and non-zero voxels, and, by rearranging the voxels, a coarse skull shape can be upgraded to its high-resolution representation. The total number of non-zero voxels shows no statistical difference between the two types of skulls. Therefore, we use the following to represent the ground truth coordinate matrix \(\mathcal {C}_{out}^{sr}\) and feature vector \(\mathcal {F}_{out}^{sr}\) for the super-resolution task:
$$\begin{aligned} \mathcal {C}_{out}^{sr}=\begin{bmatrix} x_{1} & y_{1} & z_{1} \\ x_{2} & y_{2} & z_{2} \\ \vdots & \vdots & \vdots \\ x_{N_0} & y_{N_0} & z_{N_0} \\ \vdots & \vdots & \vdots \\ x_{N}' & y_{N}' & z_{N}' \end{bmatrix} , \mathcal {F}_{out}^{sr}=\begin{bmatrix} 1\\ 1\\ \vdots \\ 1\\ \vdots \\ 1 \end{bmatrix}. \end{aligned}$$
(9)
If we assume that an up-scaled coarse skull and the ground truth each possess \(N_{sr}\) (a variable) non-zero points and that they share \(N_0\) (a variable) common points (\(N_0 < N_{sr}\)), then \(N_{sr}-N_0\) non-zero points in the input need to be pruned, while \(N_{sr}-N_0\) new non-zero points need to be generated at new coordinates. Therefore, the sparse CNN specified in Table 1 is still applicable to the super-resolution task.
Memory usage and computation complexity
The memory consumption of a neural network during training comes primarily from the following sources: (1) input and ground truth image batches, (2) the output of the intermediate layers (forward pass), (3) network parameters, (4) memory usage from back-propagation (errors and gradients for each parameter) and (5) optimizers. At test time, the network parameters, input image batches and intermediate layers' output are the main sources of memory usage. In our study, we compare the memory consumption of the sparse and dense CNN when the networks have the same configuration (Table 1) and number of parameters \(N_{param}\). For both the dense and sparse CNN configurations, \(N_{param}\) can be estimated as:
$$\begin{aligned} N_{param}=\sum _{i} \mathcal {C}^{out}_i\times \mathcal {C}^{in}_i\times Ks^3+\mathcal {C}^{out}_i. \end{aligned}$$
(10)
The number of bias terms in layer i is the same as the number of output channels of the layer, \(\mathcal {C}^{out}_i\). Assuming that the parameters are stored as float32 (32-bit), the sparse and dense CNN consume the same amount of memory for storing these parameters. However, the input and ground truth for a dense CNN are the original voxel grids, while for a sparse CNN only the valid non-zero voxels are required; thus, a sparse CNN consumes significantly less memory than a dense CNN for loading the input and ground truth image batches, as shown in Fig. 2. Similarly, the size of the output \(N_{f^i}\) of intermediate layer i is linear in the feature dimension of the \((i-1)\)th layer \(N_{f^{i-1}}\) and is calculated as:
$$\begin{aligned} N_{f^i}=\frac{1}{s}(N_{f^{i-1}}+2p-Ks), \end{aligned}$$
(11)
where p and s are the padding and stride size, respectively. According to Eq. (11), the memory consumption of the intermediate layers' output is also linear in the input image size (Fig. 2).
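Both estimates can be computed directly from the layer configuration. The sketch below assumes a simple list of (C_in, C_out) channel pairs and a single kernel size, stride and padding shared by all layers; the values are placeholders, not the exact Table 1 configuration.

```python
def count_parameters(channels, Ks=3):
    """N_param per Eq. (10): weights (C_out * C_in * Ks^3) plus biases (C_out)."""
    return sum(c_out * c_in * Ks ** 3 + c_out for c_in, c_out in channels)


def feature_size(n_prev, Ks=3, p=1, s=2):
    """N_{f^i} per Eq. (11), given the feature dimension of the previous layer."""
    return (n_prev + 2 * p - Ks) / s


# Example with placeholder channel numbers:
channels = [(1, 16), (16, 32), (32, 64)]
print(count_parameters(channels))  # total parameter count for this toy configuration
```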
The memory consumption related to back-propagation and the optimizer is difficult to calculate analytically. In our study, we estimate the overall GPU memory usage during training using the nvidia-smi command provided by NVIDIA. We query the system GPU memory usage at 50-millisecond intervals for \(N_{train}\) training iterations (\(N_{train}\) is the number of training samples, and the batch size was set to 1) and take the average of all the queried values as the final amount of memory consumed for training, considering that the number of non-zero voxels differs for each training sample. The static memory occupancy not caused by training the network was subtracted from the measurement. For inference, we used the same method, except that the measurement was taken while the network, loaded with the trained parameters, was run on the test set.
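The measurement procedure can be reproduced with a small polling loop around nvidia-smi. The query flags below are standard nvidia-smi options; the polling interval and averaging mirror the description above, and the function assumes a single (first) GPU.

```python
import subprocess
import time


def poll_gpu_memory(duration_s, interval_s=0.05, baseline_mib=0):
    """Average GPU memory usage (MiB) sampled every 50 ms, minus static occupancy."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"])
        samples.append(int(out.decode().splitlines()[0]))  # first GPU only
        time.sleep(interval_s)
    return sum(samples) / len(samples) - baseline_mib
```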
Floating-point operations (FLOPs) are commonly used to measure the computational complexity of a CNN. The FLOPs consumed in CNN layer i are proportional to the product of \(N_{f^i}\), Ks and \(\mathcal {C}^{out}_i\times \mathcal {C}^{in}_i\). Given the same network configuration (\(\mathcal {C}^{out}_i\), \(\mathcal {C}^{in}_i\), Ks), the FLOPs are linear in \(N_{f^i}\) and thus in the input image size (Fig. 2). The sparse CNN is therefore significantly faster than a dense CNN in both training and inference under the same configuration.
Figure: DSC (top) and RE (bottom, %) for the sparse CNN on the CT dataset at different resolutions for the shape completion task. The horizontal axis corresponds to the image resolutions: (a) \(64^2 \times (Z/8)\) (ch1); (b) \(64^2 \times (Z/8)\) (ch2); (c) \(128^2 \times (Z/4)\) (ch1); (d) \(128^2 \times (Z/4)\) (ch2); (e) \(256^2 \times (Z/2)\); (f) \(512^2 \times Z\).