Subject cohort
Our retrospective study was performed using knees selected from subjects in the publicly available Osteoarthritis Initiative (OAI) and Multicenter Osteoarthritis Study (MOST) databases. The OAI database contains demographic and clinical information, radiographs, and MRI examinations from 4796 subjects between 45 and 79 years of age with or at risk for knee OA evaluated at baseline and 12, 24, 36, 48, 60, 72, 84, and 108-month follow-up15. The MOST database contains the same information and imaging studies from 3026 subjects between 50 and 79 years of age with or at risk for knee OA evaluated at baseline and 15, 30, 60, 84, 144, and 168-month follow-up. The OAI and MOST were approved by the Internal Review Boards at University of California at San Francisco, Boston University Medical Center, and each individual clinical recruitment site and was performed in compliance with the Declaration of Helsinki. All subjects signed written informed consent.
Training and validation group
A training and validation group from the OAI database was selected to train the models that are not biased towards age, BMI, sex, and race, and use only features on radiographs and MRI to predict TKR. A balanced case–control cohort was selected by matching case subjects and control subjects using baseline demographic variables associated with knee OA progression including age, sex, ethnicity, and body mass index (BMI). Case subjects were defined as individuals who underwent a TKR in either knee after the baseline enrollment date, while control subjects were defined as individuals who appeared at the 108-month follow-up visit and had not undergone a TKR in either knee. If a patient underwent TKR in both knees during OAI data collection, the knee that first underwent TKR was included. Each case patient with TKR was matched to a control subject without TKR who was the same age, sex, and ethnicity and with an additional constraint on the baseline BMI within a 10% tolerance. The data set from case–control pairs contained either the left or right knee from each case and control subject.
A total of 353 case–control pairs were identified from the 4796 subjects in the OAI database. Subjects were excluded if they had TKR at baseline, received partial knee replacement over the course of follow-up, were missing baseline or 108-month follow-up information, or did not match with a case or control subject. A summary of the selection of case–control pairs is illustrated in Fig. 1. Study cohort characteristics are summarized in Supplementary Table 1. All 706 matched subjects had baseline standing posterior-anterior knee radiographs, knee MRI examinations, and clinical outcome measures including the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC)16 and Quality of Life from Knee Injury and Osteoarthritis Outcome Score (KOOS QoL)17. A subset of 270 subjects had baseline semi-quantitative MRI scores for the severity of cartilage loss and bone marrow edema lesions on 14 knee articular surfaces using the MRI Osteoarthritis Knee Score (MOAKS) system18 provided by central reading of the National Institute of Health OA Biomarkers Consortium Project.
Total knee replacement risk assessment models
The MRI examinations for all subjects in the training and validation group included coronal intermediate-weighted turbo spin-echo (IW-TSE), sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE), and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) sequences performed on a 3.0T whole-body scanner (Magnetom Trio, Siemens Healthcare, Erlangen, Germany) with the imaging parameters shown in Supplementary Table 2. To determine the best OA risk assessment model, multiple models were developed using analysis of each individual sequence and using two different approaches for combined analysis of multiple sequences. Three separate MRI models were created for the IW-TSE, FS-IW-TSE, and DESS images as each tissue contrast has advantages for evaluating different knee joint structures19. A multi-input MRI model was also developed by inherently combining information from FS-IW-TSE and DESS images in the same CNN architecture (Fig. 2). All MRI models were created with CNN architectures using conventional residual blocks with extension to 3D. The use of 3D CNN architectures allowed 3D analysis of the DESS sequence without the need to evaluate reformatted images in different planes. The outcome predictions for the models were a confidence value between 0 and 1 indicating the likelihood for TKR. Details regarding the CNN architectures used in the IW-TSE, FS-IW-TSE, DESS, and multi-input MRI models are included in the Supplementary Materials. The source code for this study is available at [Link provided after review].
A radiograph model for predicting TKR was created from a previous study using a CNN architecture adapted from ResNet with 34 layers, which analyzed baseline standing posterior-anterior knee radiographs and provided a confidence value between 0 and 1 as an outcome prediction indicating the likelihood for TKR20. The OAI cohort used in this work is a subset of the cohort from20 and during evaluation, it was ensured that there was no leakage between the training sets in20 and the test set in our work. An MRI ensemble model was created by averaging the outcome predictions of the IW-TSE, FS-IW-TSE, and DESS models. In comparison to the multi-input MRI model that analyzed the different MRI sequences together to predict TKR, the MRI ensemble model averaged the outcome predictions of the models analyzing the different sequences separately. An MRI and radiograph ensemble model was created by averaging the outcome predictions of the ensemble MRI and radiograph models.
A traditional risk assessment model for predicting TKR was created using multi-layer perceptrons, which analyzed baseline clinical risk factors including BMI, WOMAC and contralateral WOMAC, and KOOS QoL21,22. Input features were normalized to unit mean and zero variance using training dataset statistics. The multi-layer perceptrons passed the input features through 2 fully-connected hidden layers of size 256 with ReLU activation and batch normalization and provided a confidence value between 0 and 1 as an outcome prediction indicating the likelihood for TKR.
Model training and validation
Model training and validation was performed using sevenfold nested cross-validation. The 353 case–control pairs were split equally into 7 parts, with the number of pairs used for model training ranging between 250 and 256 and the number of pairs used for model validation and testing ranging between 47 and 52. The dataset splits were performed randomly using a random data generator in Python (version 2.7, Python Software Foundation, Wilmington, DE). Details regarding model training, evaluation, and dataset splits are included in the Supplementary Materials.
Model evaluation on internal and external testing groups
The models were evaluated on an internal hold-out testing group from the knees of the remaining 4090 subjects in the OAI database that were not involved in model training and validation and who were evaluated using the same MRI protocol and same MRI scanner. Among the remaining 4090 subjects, there were 32 case knees in 27 subjects that underwent TKR and 7891 control knees in 4034 subjects that did not undergo TKR between the baseline and 108-month follow-up periods. The remaining 257 knee in 216 subjects were excluded due to missing baseline or 108-month follow-up information. Study cohort characteristics are summarized in Supplemental Table 3.
The models were also evaluated on an external testing group from the MOST database, which consisted of a 270 case–control pairs of subjects with and without TKR performed between the baseline and 84-month follow-up periods. Case subjects and control subjects were matched using baseline age, sex, ethnicity, and BMI with identical inclusion and exclusion criteria as used for selection of the case–control pairs for the training and validation group. Study cohort characteristics are summarized in Supplemental Table 4. The MRI examinations for all subjects consisted of coronal short-tau inversion recovery (COR STIR) and sagittal fat-suppressed intermediate-weighted turbo spin-echo (SAG FS-IW-TSE) sequences performed on a 1.0T dedicated extremity scanner (ONI MSK Extreme; GE Healthcare, Waukesha, WI) with the imaging parameters shown in Supplementary Table 2.
Analysis of saliency maps
Saliency maps were created for the DESS model that showed the regions of discriminative high activation on the images on which the CNN based its interpretation. The saliency maps were produced by one minus the prediction probability of the DESS model with blocking 32 × 32 × 16 regions sequentially with stride of 8, 8, 4 in the 3 directions. The saliency maps were created for 50 randomly selected subjects and were overlaid on their corresponding DESS reformatted images in axial, coronal, and sagittal planes.
The axial, coronal, and sagittal overlaid saliency maps and DESS images for all image slices through the knee were reviewed by a fellowship trained musculoskeletal radiologist who was blinded to all subject information. The radiologist graded the presence and absence of regions of discriminative high activation in different anatomic locations of the knee including the central bone-cartilage interface and peripheral bone-cartilage interface of the medial and lateral femoral condyles and tibia plateau, medial and lateral meniscus, infrapatellar fat pad, intercondylar notch, and fluid filled joint space.
Statistical analysis
Statistical analysis was performed using R Project for Statistical Computing Software (R version 3.6.0, R-Project.org). Statistical significance was defined as a p value less than 0.05. The Holm method was used to adjust p values to account for multiple comparisons23.
Model evaluation was performed using sevenfold nested cross-validation on the testing and validation group with additional evaluation performed on an internal hold-out testing group from the OAI database and an external testing group from the MOST database. Models were evaluated for all knees combined and for knees of each individual Kellgren-Laurence (KL) grade for the internal hold-out testing group in the OAI database and for all knees combined for the external testing group in the MOST database. Receiver operator characteristic analysis with areas under the curve (AUC) and area under the precision-recall curve (AUPRC) was used to evaluate the diagnostic performance of the different models developed to predict TKR. 95% confidence intervals for AUCs and AUPRCs were calculated across 5000 bootstrap samples. The Youden index was used to determine optimal model sensitivity and specificity24. Statistical significance of improvements in diagnostic performance of the different models was analyzed by comparing AUC differences between models using the Delong test25.
Univariate and multi-variate conditional logistic regression models were used to determine the ability of different variables to predict TKR including the outcome prediction of the MRI ensemble model, outcome prediction of the radiograph model, and individual clinical and MRI risk factors for all knees combined and for knees with each individual KL grade. The risk values predicted by the models were normalized to zero mean and unit variance to ensure that differences in mean magnitude prediction are not leading to any discrepancies in odds ratio computation. Clinical risk factors included baseline BMI, WOMAC and contralateral WOMAC scores, and KOOS QoL scores, while MRI risk factors included baseline semi-quantitative MOAKS scores for the severity of cartilage loss and bone marrow edema lesions, which were available in a subset of 270 case–control pairs. The number of articular surfaces with cartilage loss and bone marrow edema lesions was used as MRI risk factors as these variables were shown to provide superior diagnostic performance for predicting incident knee OA than the total MOAK scores for cartilage loss and bone marrow edema lesions26. Crude odds ratio (OR) was used to assess the effect of a given variable when it was used as the only predictor, while adjusted OR was used to assess the effect of the given variable adjusted for the effects of all other variables included in the model. Wald test was used to assess the significance of individual variables.
The frequency of regions of discriminative high activation on the saliency maps at each anatomic location of the knee were calculated for case and control subjects. Fisher’s exact test was used to compare differences in the proportions of locations of discriminative high activation in different anatomic locations of the knee between subjects with and without TKR.