Training cohort
To develop a modeling framework for predicting expected weight loss from longitudinal weight data, we used data from a large real-life cohort (n = 1453) of participants in a web-based program, the healthy weight coaching (HWC)16. In brief, the dataset contains weight data collected during a 12-month interactive web-based intervention for weight management in obesity. Participants submit weight and diet logs through a web application, which provides automated feedback on the submissions, anonymous contact with other participants, and support from an assigned personal coach. Each participant provided written informed consent. The study was approved by the Coordinating Ethics Committee of the Helsinki University Hospital (Reference Number 327/13/03/00/2015). All methods were carried out in accordance with relevant guidelines and regulations. All participants were referred by a licensed physician and had to be 18 years or older, have a BMI of 25 kg/m2 or higher, have access to a computer or smartphone, and be willing to participate. In the present study, we excluded individuals fulfilling any of the following criteria, to remove outlier entries and the influence of weight-loss medication:
(1) known prescription of weight-loss medication
(2) fewer than two weight entries during the 9 months
(3) weight change of more than 1.5 kg/day at any point
(4) weight data available for fewer than 270 days
(5) height of 100 cm or less
This left us with data from 327 individuals for training the prediction models. The average number of weight data points for the selected individuals was 36 ± 19 (range: 3–149). The selection process is also shown graphically in the Supplementary Figure S1. Most data were removed due to criterion (4): as this cohort recruits continuously, many participants had not yet been enrolled for 270 days.
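The exclusion step above can be sketched as a simple filter. The `Participant` structure, the field names, and the reading of criterion (4) as "last entry at day 270 or later" are our own assumptions for illustration, not part of the published pipeline:

```python
from dataclasses import dataclass

@dataclass
class Participant:
    on_weight_loss_medication: bool   # criterion (1)
    entries: list                     # (day, weight_kg) pairs, sorted by day
    height_cm: float                  # criterion (5)

def passes_exclusion_criteria(p: Participant) -> bool:
    """True if the participant survives all five exclusion criteria."""
    if p.on_weight_loss_medication:                            # (1)
        return False
    if len([d for d, _ in p.entries if d <= 270]) < 2:         # (2)
        return False
    for (d0, w0), (d1, w1) in zip(p.entries, p.entries[1:]):
        if d1 > d0 and abs(w1 - w0) / (d1 - d0) > 1.5:         # (3)
            return False
    if p.entries[-1][0] < 270:                                 # (4)
        return False
    return p.height_cm > 100                                   # (5)
```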
Clustering for output variable
We trained our models for two prediction scenarios: (1) an overall three-class weight loss result at 9 months (weight gain, insufficient weight loss, weight loss) and (2) a more refined five-class prediction of the weight loss result at 9 months (high/low weight gain, high/moderate/insufficient weight loss), as explained in more detail below. Due to continuous online recruitment of new participants, not all participants had reached the end of the program (12-month time point) at the time of the data lock. Therefore, we selected the 9-month time point (one month defined as 30 days) to increase the number of individuals reaching the final time point of the current analyses (327 participants at 9 months versus 173 at 12 months).
To define the five weight change classes (high/low weight gain, high/moderate/insufficient weight loss), we applied three different implementations of agglomerative hierarchical clustering to the HWC weight change data. The distance function used was dynamic time warping (DTW), which is able to detect similarities between time series data17, 18. Our aim in using DTW as the distance function was to focus on the shape of the time series, not only the final weight change value. For example, an individual with high initial weight loss followed by a plateau should have a small distance to an individual with slow weight loss who plateaus at the same level. The first hierarchical clustering was implemented using the R (v 3.5.3)19 package dtwclust20. The other two were obtained using the Python (v 3.6.9)21 package sklearn (v 0.22.1)22 with the DTW implementation of either the dtaidistance (v 1.1.4)23 or the cdtw (v 0.0.1)24 package. We computed the mean weight change for each cluster considering only the individuals on whose cluster all three implementations agreed (59.5%). Finally, these mean values were refined by applying the k-means algorithm to the entire HWC dataset, using the means from the first step as initial centroids and DTW as the distance function. To obtain the three weight change classes for the simpler prediction scenario, we combined the high and moderate weight loss classes into one class and the high and low weight gain classes into another.
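The clustering step can be sketched as follows. For brevity we use a plain-Python DTW recursion instead of the dtwclust/dtaidistance/cdtw implementations, and SciPy's average-linkage agglomerative clustering on a precomputed DTW distance matrix; the toy series below are invented, and the function names are ours:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_hierarchical_labels(series, n_clusters):
    """Agglomerative clustering (average linkage) on a precomputed DTW matrix."""
    n = len(series)
    dmat = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dmat[i, j] = dmat[j, i] = dtw_distance(series[i], series[j])
    Z = linkage(squareform(dmat), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Because DTW aligns the series before comparing them, two trajectories with the same shape but slightly shifted timing end up close, which is exactly the behavior motivated in the text.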
Input variables
Each model was trained using weight change data from nine different time frames, all starting at baseline and spanning up to 0.5, 1, 2, 3, 4, 5, 6, 7, or 8 months. The weight changes were calculated in percent relative to the baseline weight. Data for the days between the weight entries were linearly interpolated. When applied to new individuals, predictions were made using the model trained for the time frame closest to the available weight change datum of that individual; for instance, if a weight change datum at three and a half months was available for an individual, the model trained for the 4-month time frame was used.
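A minimal sketch of this step, assuming 30-day months and breaking ties toward the later frame so that a 3.5-month datum maps to the 4-month model, as in the example above:

```python
import numpy as np

# Nine time frames (0.5 to 8 months, one month = 30 days), as in the text.
TIME_FRAMES_DAYS = [15, 30, 60, 90, 120, 150, 180, 210, 240]

def daily_weight_change_pct(days, weights):
    """Linearly interpolate weight between entries; return the daily grid and
    the weight change (%) relative to the baseline weight."""
    grid = np.arange(days[0], days[-1] + 1)
    interp = np.interp(grid, days, weights)
    return grid, 100.0 * (interp - weights[0]) / weights[0]

def closest_time_frame(last_entry_day):
    """Pick the time frame closest to the last available weight change datum;
    ties go to the later frame (105 days -> the 4-month model)."""
    return min(TIME_FRAMES_DAYS, key=lambda t: (abs(t - last_entry_day), -t))
```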
For each method, we tested three different types of input data to identify the best approach: (1) BMI at baseline together with the weight change on the last day of the respective time frame, (2) BMI at baseline together with the last weight change value from each time frame up until the current one, and (3) DTW distances between the observed series and the cluster means. For the multi-layer perceptron, we also tested the daily weight change values as an additional input data type. We then selected the input data that achieved the best average cross-validation prediction accuracy (Table 1). The fold assignment for the fivefold cross-validation was done by randomly sampling an equal number of observations from each prediction cluster. The flow from input data to outcome is illustrated in Fig. 1.
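The cluster-balanced fold assignment can be sketched as below; the function name and the use of a fixed seed are our own illustrative choices:

```python
import numpy as np

def stratified_fold_ids(cluster_labels, n_folds=5, seed=0):
    """Assign cross-validation fold ids so that each fold draws an (almost)
    equal number of observations from every prediction cluster."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(cluster_labels)
    folds = np.empty(labels.size, dtype=int)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)   # members of cluster c
        rng.shuffle(idx)
        folds[idx] = np.arange(idx.size) % n_folds
    return folds
```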
Model development
For predicting the cluster-based 9-month weight-loss results, we considered the following general statistical model
$$y = f(x_{t}) + \epsilon_{t}$$
where \(y\) is the observed weight loss outcome (either the five- or three-class weight loss result defined in section “Clustering for output variable” or the 9-month weight loss in percentages), \(f\) is an unknown function relating an input vector \({x}_{t}\) consisting of data available at time point \(t\) (defined in section “Input variables”) to \(y\), and \({\epsilon }_{t}\) is a random error term which is independent of \({x}_{t}\) and has mean zero and constant variance. To estimate \(f\), we applied six supervised machine learning methods: logistic regression, linear regression, naive Bayes classification, support vector classification, support vector regression, and multi-layer perceptron. In all approaches, our primary aim was to test the method for predicting the correct weight loss cluster instead of using the modelling coefficients for statistical inference and hence we did not strictly require, e.g., the independence of the input variables. All models were trained on the HWC dataset using fivefold cross-validation, and later validated with the independent Oxford Food and Activity Behaviours (OxFAB) cohort study dataset25. Details for all six models can be found in the Supplementary Text.
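As one concrete instance of estimating \(f\), a multinomial logistic regression with fivefold cross-validation can be sketched as follows. The synthetic data, the class thresholds, and the hyperparameters are placeholders, not the authors' configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-in for input type (1): baseline BMI plus the weight change (%)
# at the end of a time frame; y plays the role of the 3-class outcome.
rng = np.random.default_rng(0)
bmi = rng.normal(32, 4, size=150)
change = rng.normal(-3, 4, size=150)
X = np.column_stack([bmi, change])
y = np.digitize(change, [-5.0, 0.0])  # 0: weight loss, 1: insufficient, 2: gain

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)  # fivefold cross-validation accuracy
```

The other five methods slot into the same pattern by swapping `clf` for the corresponding sklearn estimator.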
The regression-type models predicted the weight loss in percent, which was afterwards converted into the classes using cutoff values on the weight change percentage; this allowed a direct comparison with the classification models. To compute the cutoff values, we sorted the observed 9-month weight change values of all individuals from high to low and selected as cutoff the value that minimized the number of classification errors. If several values yielded the same error, we selected the largest cutoff value.
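For a single boundary between two adjacent classes, this cutoff search can be sketched as below; the function name and the label encoding are our own assumptions:

```python
def best_cutoff(changes, labels, upper_class):
    """Scan the observed 9-month weight changes from high to low and return
    the cutoff minimising classification errors; the strict '<' keeps the
    largest (first-seen) cutoff on ties, as described in the text."""
    best, best_err = None, None
    for c in sorted(set(changes), reverse=True):
        # error whenever 'value >= cutoff' disagrees with the true class
        err = sum(
            (v >= c) != (lab == upper_class)
            for v, lab in zip(changes, labels)
        )
        if best_err is None or err < best_err:
            best, best_err = c, err
    return best
```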
Applying the models to real-world data
Predictions for an individual were calculated only when the following criteria were fulfilled for the current prediction time frame:
(1) at least two weight entries of the individual within the current prediction time frame
(2) the last weight entry within 30 days of the end of the current prediction time frame
This ensured that each prediction was only tested for cases for which a prediction could also be computed in clinical use.
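The two eligibility criteria can be written as a single predicate; the argument names are our own:

```python
def eligible_for_prediction(entry_days, frame_end_day):
    """Check the two criteria: at least two weight entries within the frame,
    and the last of them no more than 30 days before the frame's end."""
    in_frame = [d for d in entry_days if 0 <= d <= frame_end_day]
    if len(in_frame) < 2:                        # criterion (1)
        return False
    return frame_end_day - max(in_frame) <= 30   # criterion (2)
```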
Computational software
Data analysis and computational modeling were done in Python (v 3.6.9). For the machine learning models, we used the package sklearn (v 0.22.1). The dtaidistance package (v 1.1.4) was used for the DTW implementation.
Independent validation
All prediction models were validated with independent data from the previously published OxFAB cohort study (n = 1265). In addition to the baseline data, the OxFAB dataset also contains self-reported weight data up to 1 year of follow-up. Medication data were not available for the OxFAB dataset and, therefore, exclusion criterion (1) used for the HWC dataset was omitted here, while the remaining exclusion criteria were used to exclude the outliers in the validation dataset. We also used one additional criterion: at least one weight entry within the first 270 days.
This left us with data from 184 individuals for validating the prediction models. The average number of weight data points for the selected individuals was 16 ± 16 (range: 3–162). The selection process is also shown graphically in the Supplementary Figure S1. For the validation datasets, data were available as follows: n = 46 (0.5 months), n = 60 (1 month), n = 83 (2 months), n = 126 (3 months), n = 104 (4 months), n = 83 (5 months), n = 71 (6 months), n = 66 (7 months), and n = 62 (8 months). Again, missing weight data were linearly interpolated using adjacent weight entries. To match the situation in clinical use, we interpolated within the prediction time frame until the last available weight entry and then extrapolated the weight change after the last entry until the end of the prediction time frame using the average change of weight per day between the baseline and the last weight entry within the prediction time frame.
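The interpolate-then-extrapolate step can be sketched as follows, with the extrapolation slope taken as the average change per day between baseline and the last entry, as described above; the function name is our own:

```python
import numpy as np

def fill_time_frame(days, pct_changes, frame_end_day):
    """Interpolate daily weight change (%) between entries, then extrapolate
    past the last entry using the average change per day between baseline
    and the last entry within the frame."""
    days = np.asarray(days, dtype=float)
    pct = np.asarray(pct_changes, dtype=float)
    grid = np.arange(0, frame_end_day + 1)
    # clamp to the last observed day so np.interp never extrapolates itself
    series = np.interp(np.minimum(grid, days[-1]), days, pct)
    if 0 < days[-1] < frame_end_day:
        slope = (pct[-1] - pct[0]) / days[-1]   # avg change/day from baseline
        tail = grid > days[-1]
        series[tail] = pct[-1] + slope * (grid[tail] - days[-1])
    return series
```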
The ground truth against which each prediction was compared was determined by computing the DTW distance between the weight change data of each individual and the means of the five clusters derived from the training dataset. Each individual was assigned to the cluster with the smallest distance, as in the training data. Accuracy was defined as the percentage of predictions that matched this ground truth.
To compare the multi-class accuracies against a random model for both the three- and five-class prediction scenarios, we computed the percentage of participants in each class in the training data and randomly assigned each validation participant one class according to these probabilities. To obtain a good estimate of the average accuracy, this was repeated 10,000 times on the validation dataset and the results were averaged.
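This baseline can be sketched as below; the function name and the fixed seed are illustrative choices:

```python
import numpy as np

def random_baseline_accuracy(train_labels, val_labels, n_rep=10_000, seed=0):
    """Mean accuracy of guessing each validation label at random with the
    class frequencies observed in the training data."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(train_labels, return_counts=True)
    probs = counts / counts.sum()           # training-set class frequencies
    val = np.asarray(val_labels)
    accs = [
        (rng.choice(classes, size=val.size, p=probs) == val).mean()
        for _ in range(n_rep)
    ]
    return float(np.mean(accs))
```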
Prior presentation
Parts of the study were previously presented in abstract form at the Applied Bioinformatics in Life Sciences (3rd edition) conference 13–14 February 2020 in Leuven, Belgium.