### Overview of the Tromsø study cohort

To create a knowledge base for the lifestyle recommendation app, we model the associations between lifestyle-dependent factors and subjective health using data from the Tromsø population surveys. The Tromsø study is a cohort study initiated in 1974 that invites large representative samples of the population of the municipality of Tromsø, Norway^{38}. To achieve higher statistical power in our analysis, we model data from the last two iterations, Tromsø6 (2007–08, n = 12 981, 53.4% women and 46.6% men) and Tromsø7 (2015–16, n = 21 083, 52.5% women and 47.5% men). Earlier iterations were not included in our analysis because they used different questionnaire items to measure physical activity (PA), whereas the PA questionnaire items in the last two waves allowed us to analyse the interaction and joint effect of PA frequency and intensity. Protocols for participant sampling were designed to collect longitudinal data and ensure sufficiently large sample sizes within different age and gender cohorts^{39,40}. In Tromsø6, a 10% random sample of individuals aged 30–39 years, a 40% random sample of individuals aged 43–59, and everyone aged 60–87 years were invited. Of the Tromsø6 invitees, 65.7% participated, and their ages ranged from 30 to 87 years. In Tromsø7, all residents of Tromsø aged ≥ 40 years were invited, of whom 65.0% participated; the ages of participants ranged from 40 to 99 years. In both studies, questionnaires were sent to the participants by email, and physical examinations were carried out for those who chose to physically attend the study.

### Measurements and variables used for modelling

The dependent variable of the statistical model we use as a knowledge base in the app is self-rated health (SRH), which is a person’s response to the question “How do you, in general, consider your health to be?” with possible answers being 1. Very poor, 2. Poor, 3. Not so good, 4. Good, 5. Excellent. The “very poor” category had negligible responses (0.37% response rate, likely due to the difficulty for those with severe health problems to participate in surveys), and was therefore merged into the “poor” category. After merging, the categories were 1. Poor, 2. Not so good, 3. Good, 4. Excellent. Although SRH only takes discrete values, it reflects underlying states and processes which are continuous, which motivated us to model it as a continuous normally distributed variable. The SRH-predictions can therefore take intermediary values; e.g., if the model predicts a mean difference in SRH of 0.9 associated with a given difference in a covariate (such as age 45 versus age 55), then the model predicts that these groups differ on average by almost one SRH level, corresponding approximately to the difference in health between “1. Poor” and “2. Not so good”, or “2. Not so good” and “3. Good”. The modifiable predictors of SRH considered are the following: physical activity (PA) frequency and intensity, body mass index (BMI, categorised into underweight, normal weight, overweight and obese using cut-off values 18.5, 25, and 30 kg/m^{2}), mental health symptoms (10-item version of the Hopkins symptoms checklist, HSCL), high blood sugar levels (HbA1c ≥ 6.5%), and smoking status (Do you smoke currently? yes/no). As confounders, we included age, sex, education level (1. Primary/partly secondary education [up to 10 years of schooling], 2. Upper secondary education [a minimum of 3 years], 3. College/university), social support (do you have enough friends who can give you help and support when you need it?), household status (do you live with a partner/spouse?), and comorbid disease burden.
We modelled only those with data on all model variables.
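
The two recoding steps described above (merging the SRH levels and categorising BMI) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the handling of values that fall exactly on a BMI cut-off is an assumption, since the text gives only the cut-off values.

```python
def merge_srh(raw_srh: int) -> int:
    """Collapse the original 5-level SRH item (1-5) to the merged 4-level scale (1-4)."""
    if raw_srh not in (1, 2, 3, 4, 5):
        raise ValueError("raw SRH must be an integer from 1 to 5")
    # "Very poor" (1) and "Poor" (2) both map to the merged "Poor" (1);
    # every other level shifts down by one.
    return max(raw_srh - 1, 1)


def bmi_category(bmi: float) -> str:
    """Map a BMI value (kg/m^2) to one of the four categories named in the text."""
    # Assumption: a value equal to a cut-off belongs to the higher category
    # (for example, BMI 25.0 counts as "overweight").
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal weight"
    if bmi < 30:
        return "overweight"
    return "obese"
```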

*Comorbid disease burden* was measured using the health impact index (HII) proposed by Lorem et al.^{41}, which considers both the joint effect and severity of 11 illnesses, such as Cerebrovascular stroke, Migraine, Myocardial infarction, and Asthma. The presence or history of a condition is measured with questionnaire items of the form “Do you have or have you had …?”. The index is a weighted sum where each term represents the impact on SRH of a medical condition, and the weights have been calibrated based on their association with SRH. For example, the HII of someone who has had a Myocardial infarction (weight = 2) and suffers or has suffered from migraines (weight = 1) is 3. The scale ranges from 0 to 22. HII ≥ 3 is defined as being “seriously ill”^{41}.
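
As an illustration of the weighted sum, the sketch below reproduces the worked example from the text. Only the two weights stated above are included; the remaining nine conditions and their weights are defined in the cited reference and are deliberately not guessed here. The function and variable names are hypothetical.

```python
# Only the two weights quoted in the text; the full index covers 11 conditions.
EXAMPLE_WEIGHTS = {"myocardial_infarction": 2, "migraine": 1}

SERIOUSLY_ILL_THRESHOLD = 3  # HII >= 3 is defined as "seriously ill" in the text


def health_impact_index(conditions, weights=EXAMPLE_WEIGHTS):
    """Sum the weights of all conditions a person has or has had."""
    present = set(conditions)  # each condition counts at most once
    unknown = present - set(weights)
    if unknown:
        raise KeyError(f"no weight available for: {sorted(unknown)}")
    return sum(weights[c] for c in present)


def is_seriously_ill(hii):
    """Apply the HII >= 3 threshold."""
    return hii >= SERIOUSLY_ILL_THRESHOLD
```

With these weights, `health_impact_index(["myocardial_infarction", "migraine"])` gives 3, matching the example in the paragraph above.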

*PA* levels were measured using self-reported PA frequency (How often do you exercise?) and intensity (If you exercise—how hard do you exercise?). The PA frequency item had responses 1. Never, 2. Less than once a week, 3. Once a week, 4. 2–3 times a week, 5. Approximately every day. The PA intensity item had responses 1. Easy—you do not become short-winded or sweaty, 2. You become short-winded and sweaty, 3. Hard—you become exhausted. We merged the “never” category into the “less than once per week” category to reduce the number of PA frequency categories and simplify the model and analysis. PA frequency and intensity thus had 4 and 3 levels respectively, and there were 12 PA subgroups defined by the combination of PA frequency and intensity.

*Mental health* status was measured using a 10-item version of the HSCL^{42}, denoted HSCL-10. HSCL is a well-validated clinical questionnaire for quantifying mental health symptoms that has been developed using factor analysis^{43}. Each item has responses 1. No complaint, 2. Little complaint, 3. Pretty much, 4. Very much. The numerical values corresponding to the responses are averaged over the items to produce a summary of the individual’s mental health status. Items that lack responses are left out of the calculations, so if 8 items have responses, then the score is the mean over those 8 items. If fewer than seven items were answered, the HSCL-10 was defined as missing, and the participant’s data were excluded from further analysis. HSCL-10 ranges on a scale from 1 to 4, with a high index indicating psychological distress, and an index ≥ 1.85 is used as the cut-off for mental distress^{44}.
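
The scoring rule described above can be sketched as follows. This is an illustrative implementation under the stated rules (mean over answered items, missing if fewer than seven are answered, cut-off 1.85); the function names and the use of `None` for an unanswered item are assumptions.

```python
from typing import Optional, Sequence

MIN_ANSWERED = 7        # fewer answered items than this makes the score missing
DISTRESS_CUTOFF = 1.85  # HSCL-10 >= 1.85 indicates mental distress


def hscl10_score(responses: Sequence[Optional[int]]) -> Optional[float]:
    """Mean of the answered items (coded 1-4); None if too few were answered."""
    if len(responses) != 10:
        raise ValueError("HSCL-10 expects exactly 10 items")
    answered = [r for r in responses if r is not None]
    if any(r not in (1, 2, 3, 4) for r in answered):
        raise ValueError("item responses must be coded 1-4")
    if len(answered) < MIN_ANSWERED:
        return None  # treated as missing
    return sum(answered) / len(answered)


def has_mental_distress(score: Optional[float]) -> bool:
    """Apply the 1.85 cut-off; a missing score is never flagged."""
    return score is not None and score >= DISTRESS_CUTOFF
```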

### Proof-of-concept app for converting the model to individualised recommendations

The proof-of-concept app that integrates the SRH-model was developed using MATLAB’s App Designer. Figure 1 shows a flowchart describing how the app operates and how it relates to research on SRH. Using drop-down menus, it first queries the user for information on the variables that the SRH-model needs to generate predictions. Once this data is collected, the app identifies modifiable factors for which there is room for improvement. Then, for each such factor (e.g. PA), the estimated effect on SRH of reaching the related goals (e.g. moderate exercise 1–3 times per week) is calculated using the SRH-model. Finally, the goals are presented jointly in a visualisation with descriptive text for each goal placed next to a horizontal bar that indicates the estimated improvement to subjective health if it is achieved. The order of the goals is determined by sorting the goals into categories (exercise, weight loss, blood-sugar control, etc.), and then the categories are ordered by the impact of reaching the optimal goal or state (in terms of high SRH) for each respective category, such as a BMI in the normal range. The health effects of some goals, like those related to PA, depend on the degree of change, and we chose, for simplicity, to present only a subset of goals to represent that lifestyle factor. In these cases, we picked goals that seemed relevant for motivating the user, but always included the effect of reaching the theoretically optimal state to show the long-term potential room for improvement.

The app is open-sourced under the GNU Affero General Public License and is available at https://github.com/uit-hdl/health-diary-app. Collected user data is stored in local files.

### Statistical methods

#### Stratified analyses

We first perform a set of stratified analyses where the data is subdivided into disjoint groups and we calculate various summary statistics to provide a broad empirical overview of the datasets. Sample characteristics (number of participants in various strata and their relative sizes) are collected in Table 1. We calculate the rate of good–excellent SRH in various strata to provide an overview of the associations between the model variables and SRH, which can be seen in Table 2. To empirically assess how the modifiable model variables jointly impact SRH, we stratify the datasets based on the number of unhealthy modifiable factors (UMFs), and calculate the rate of poor or not so good SRH within each UMF stratum. The UMFs are defined as high blood-sugar (HbA1c ≥ 6.5%), overweight or obesity, being sedentary, current smoking, or symptoms of mental distress (HSCL-10 ≥ 1.85). The strata consist of those with 0, 1, 2, 3 and ≥ 4 UMFs respectively. The results of this analysis are collected in Table 3.
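
The UMF stratification can be sketched as below. The thresholds are those given in the text; the dictionary field names are hypothetical, and the sedentary flag is taken as a precomputed boolean because the text does not define it in terms of the PA items. SRH is assumed to be on the merged 1–4 scale, so “poor or not so good” corresponds to SRH ≤ 2.

```python
from collections import defaultdict


def count_umf(person: dict) -> int:
    """Count unhealthy modifiable factors for one participant record."""
    flags = [
        person["hba1c"] >= 6.5,     # high blood sugar
        person["bmi"] >= 25,        # overweight or obesity
        bool(person["sedentary"]),  # assumed to be precomputed
        bool(person["smoker"]),     # current smoking
        person["hscl10"] >= 1.85,   # symptoms of mental distress
    ]
    return sum(flags)


def umf_stratum(n_umf: int) -> str:
    """Strata are 0, 1, 2, 3 and >= 4 UMFs."""
    return "4+" if n_umf >= 4 else str(n_umf)


def poor_srh_rate_by_stratum(people) -> dict:
    """Rate of poor or not so good SRH (merged SRH <= 2) within each stratum."""
    totals = defaultdict(int)
    poor = defaultdict(int)
    for person in people:
        stratum = umf_stratum(count_umf(person))
        totals[stratum] += 1
        if person["srh"] <= 2:
            poor[stratum] += 1
    return {s: poor[s] / totals[s] for s in totals}
```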

Study participation can result in selection bias whereby healthier individuals are more likely to participate than seriously ill individuals, especially in our case where we model only those who chose to physically attend the study. If such sampling bias was present, then we would expect the participants who participated in both waves to have different study characteristics than those who only participated in the first wave. To test if a significant sampling bias is present, we performed a stratified analysis comparing age, sex ratios, HII, HSCL, and SRH, for the individuals who participated in both Tromsø6 and Tromsø7 against those who participated only in Tromsø6. The results are presented in the supplementary materials in Table S2. SRH, HSCL-10, BMI, and percentage female participants were similar for the two subgroups. There were significantly more participants who were seriously ill amongst those who did not return for the Tromsø7 survey: 16.34%, compared to 8.6%. Thus it seems that some degree of participation bias may be present. To account for this, we tested if high comorbid disease burden might modify the effect of lifestyle factors, see the subsection “Interaction effects to test generalisability”.

#### The mixed effects regression model

We aim to model mean SRH as a function of modifiable lifestyle factors. To account for dependency between data points in Tromsø6 and Tromsø7 that came from the same individual, we fitted a mixed-effects regression model with participant ID set as the grouping variable. Mixed effects models are suitable for handling longitudinal self-report data, since they can mitigate the effect of self-report bias by comparing consecutive data points collected from a single person and attributing a consistently high or low level to an individual bias rather than to a lifestyle factor.

Age, HSCL-10 and HII were modelled as continuous variables, and the remaining covariates were modelled as categorical. To select power terms for modelling non-linear relationships, we separately fitted univariate models with powers up to degree 4 and kept the powers with p-values < 0.05. With this method, we obtained a model where age was represented with a second-degree term only, HII was modelled with powers 1 and 2, and HSCL was modelled with powers 1–3.
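
For orientation, the resulting model can be written schematically as below. The notation is ours and is only a sketch: $i$ indexes participants, $j$ indexes survey waves, $\mathbf{x}_{ij}$ collects the dummy-coded categorical covariates, and $u_i$ is the participant-level random intercept. The normality of $u_i$ and $\varepsilon_{ij}$ is the conventional assumption for this model class, and any interaction terms are omitted.

$$
\mathrm{SRH}_{ij} = \beta_0 + \beta_{1}\,\mathrm{age}_{ij}^{2} + \sum_{k=1}^{2}\gamma_k\,\mathrm{HII}_{ij}^{k} + \sum_{k=1}^{3}\delta_k\,\mathrm{HSCL}_{ij}^{k} + \mathbf{x}_{ij}^{\top}\boldsymbol{\theta} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^{2}),\quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2})
$$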

We assume that the SRH categories are evenly spaced (equidistant), so that we can meaningfully talk about an increase in SRH of e.g. 0.6, and so that a fixed change in the covariate-based predictor produces the same change in SRH regardless of where on the SRH scale we are located. If this assumption holds, we expect to see a linear relationship between predictions and actual SRH, and departures from linearity would therefore indicate that the assumption is invalid. We visually inspect whether this holds by plotting predicted vs actual SRH along with a linear trendline that models predicted values as a linear function of actual values. Specifically, we compare the trendline against the mean predicted value for each SRH category. The result of this analysis can be seen in Fig. 4. In “Justification for treating SRH as a continuous variable” in the supplementary materials we further motivate our treatment of SRH as a continuous variable and the assumption of equidistance by comparing against a cumulative link model (see Fig. S2 for a comparison of coefficients).

#### Interaction effects to test generalisability

To test if differences in age or sex modified the effect of various factors, we utilised interaction terms, which can reveal if changing a factor X can have different effects within different subgroups. A positive interaction between the group A and the factor X means that a unit increase in X is associated with a larger benefit (or less harm) within group A than outside of group A. The interpretation for a negative interaction is similar, but reversed (association is less beneficial/more harmful). If the p-value of the interaction term is insignificant (> 0.05), we conclude that changes in X have the same effect in each subgroup. Specifically, we created an age ≥ 65y category to see if the association between SRH and PA and BMI changes after age 65, which is a commonly used age threshold in the epidemiological literature. We performed a similar analysis for high comorbid disease burden (HII ≥ 3 and HII ≥ 2 respectively) and sex, to see if these variables influence the effect that PA and BMI have on SRH; see “Investigating interaction effects” in the supplementary materials for more details on this analysis.

#### Model presentation and comparison of lifestyle factors

We present the relationships captured by the model in terms of model-predicted effects: the predicted mean change in SRH when a covariate X is changed from some reference value to a new value, assuming all other covariates are held fixed. To present an intuitive overview of the non-linear relationship between age and SRH, we predict the effect of age increasing from a reference value of 40 years up to 50 and 70 years respectively. The result can be seen in Fig. 2, where we treated HII and HSCL-10 similarly, using reference values HII = 0 and HSCL-10 = 1. For easier comparison of the impact of the most important variables in the model, we calculate the effects of changing HII, HSCL-10 and PA levels from some specified reference level to a new level (note: “change” and “effect” in this paper are used for convenience, and are not meant to imply that an effect is causal or that a variable is modifiable). The effects calculated in this analysis and the specific levels chosen for comparison can be seen in Table 5. Confidence intervals for effects were computed using the Delta method and parameter covariance estimates. Uncertainties are reported with p-values and 95% confidence intervals, with p-values below 0.05 considered significant.
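
When an effect is a linear combination of the fixed-effect coefficients, the Delta method reduces to the familiar form: the effect is $c^{\top}\hat\beta$ and its variance is $c^{\top}\hat\Sigma c$, where $c$ is a contrast vector and $\hat\Sigma$ is the estimated parameter covariance matrix. The sketch below illustrates this under a normal approximation; the function name and inputs are hypothetical and this is not the authors' implementation.

```python
import math


def effect_with_ci(contrast, beta, cov, z=1.96):
    """Return (estimate, lower, upper) for the linear effect contrast . beta.

    contrast: difference in design-row values between the new and reference state
    beta:     fixed-effect coefficient estimates
    cov:      covariance matrix of the coefficient estimates (list of lists)
    z:        normal quantile; 1.96 gives an approximate 95% interval
    """
    n = len(beta)
    estimate = sum(c * b for c, b in zip(contrast, beta))
    # Variance of a linear combination: c' * Sigma * c
    variance = sum(
        contrast[i] * cov[i][j] * contrast[j] for i in range(n) for j in range(n)
    )
    se = math.sqrt(variance)
    return estimate, estimate - z * se, estimate + z * se
```

For example, with age entering the model through a squared term only, the contrast for moving from age 40 to age 50 has the entry 50² − 40² = 900 in the position of the age² coefficient and zeros elsewhere.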

#### Investigating interactions between exercise intensity and frequency

Our dataset is well suited for exploring how PA intensity and frequency interact in their effect on SRH, since these variables are separated in Tromsø6 and Tromsø7. An increase in the number of weekly sessions will result in a higher increase in overall PA-volume if the intensity of the PA is high. It is therefore plausible that intensity and frequency will interact positively in their effect on SRH, with larger gains in mean SRH when the intensity is high. To explore if this is the case, we compute the effects of increasing the PA frequency from the baseline of < 1 time/week while holding PA intensity fixed. Interaction effects can then be analysed visually by comparing the slope of the trajectory associated with each intensity level (Fig. 3b). For an overview of the effect of various combinations of PA frequency and intensity, we also compute effects for each such combination with the sedentary group as the reference category (Fig. 3a).

#### Parameter sensitivity to selection of model variables

Several of the model covariates, such as PA, BMI, mental distress, and comorbid disease burden, influence each other through a multitude of different causal pathways. In the case of PA, it can be argued that BMI, comorbid disease burden, and mental health can be viewed as confounders but also as intermediary variables through which PA causally impacts health. A similar relationship can be argued for between HII and HSCL-10 with BMI, and BMI has considerable collinearity with blood-sugar levels. The complicated nature of these relationships makes deciding which variables to include in the model difficult and subjective. If the effects predicted by the model are highly sensitive to these choices, the validity of the model as a basis for recommending lifestyle goals is questionable. We therefore perform a sensitivity analysis where we investigate how the model predictions on the effect of changing PA levels or BMI are influenced by inclusion of these variables. Specifically, for PA, we fit a base model M0 that adjusts for age, education, sex, and smoking, and then create a nested sequence of models by adding one variable at a time, thus obtaining: M1 = M0 + BMI, M2 = M1 + HII, and M3 = M2 + HSCL-10. Using Models 1–3, we then calculate the effect PA has on SRH by comparing the difference in mean SRH between “mild PA < 1 time/week” and “hard PA ≥ 4 times/week”. For BMI, we do a similar analysis, adjusting for the same variables in the baseline model, with the sequence of models now being M1 = M0 + HII, M2 = M1 + HSCL-10, and M3 = M2 + HbA1c. The results of this analysis are shown in Fig. S1 in the Supplementary materials.

We have modelled SRH as a continuous normally distributed variable, and to assess the validity of this assumption we compare it to a matching discretized normal distribution. Specifically, we fit a normal distribution to SRH and use the fitted distribution to compute theoretical rates for each SRH level by computing the probability mass over the bins with endpoints [-∞, 1.5, 2.5, 3.5, ∞]. The goodness of fit was assessed by comparing theoretical (using the fitted normal distribution) and empirical probabilities. The results of this analysis can be seen in Fig. S4 in the supplementary materials.
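
The comparison described above can be sketched with the Python standard library. The bin endpoints are those given in the text; the function names are hypothetical, and fitting by the sample mean and sample standard deviation (what `NormalDist.from_samples` does) is an assumption about how the normal distribution was fitted.

```python
from collections import Counter
from statistics import NormalDist

BIN_EDGES = [1.5, 2.5, 3.5]  # interior endpoints; the outer ones are -inf and +inf
SRH_LEVELS = (1, 2, 3, 4)


def theoretical_rates(srh_values):
    """Probability mass of the fitted normal over the four SRH bins."""
    fitted = NormalDist.from_samples(srh_values)
    # CDF is 0 at -inf and 1 at +inf, so the outer endpoints need no evaluation.
    cdf = [0.0] + [fitted.cdf(edge) for edge in BIN_EDGES] + [1.0]
    return [upper - lower for lower, upper in zip(cdf, cdf[1:])]


def empirical_rates(srh_values):
    """Observed share of responses at each SRH level."""
    counts = Counter(srh_values)
    n = len(srh_values)
    return [counts[level] / n for level in SRH_LEVELS]
```

Comparing the two lists level by level gives the goodness-of-fit check described in the paragraph.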

#### Assessing model accuracy

To test for overfitting and assess the predictive capabilities of the SRH-model, we set aside a randomly selected test set of 200 participants prior to performing any analysis. To ensure that the training and test sets had no participants in common (thus ensuring complete independence), we included in the test set only Tromsø7 participants who had not participated in Tromsø6. Since the SRH-model predicts continuous values but SRH takes discrete values from 1 to 4, we convert the continuous SRH-predictions to discrete predictions by rounding them to the nearest integer in the range 1–4. As the primary performance metric, we compute the accuracy: the fraction of model predictions that were correct. We compare the performance metrics between the test set and the set used to fit the model to assess whether the model is overfitted to the dataset. We also calculate the area under the receiver operating characteristic curve (AUC), which reflects a model’s potential ability to separate the data into two non-overlapping classes. To test if prediction ability is symmetrical, i.e. if the model is equally useful for predicting the positive and negative ends of the SRH scale, we calculate the AUC for predicting poor SRH and excellent SRH respectively. To get a sense of how much of the variation in SRH is explained by the model, we calculate the marginal R^{2}: the proportion of the total variance explained by the fixed effect coefficients, which ranges from 0 to 1, with 1 indicating that the dependent variable can be perfectly predicted from the model covariates^{45}. Finally, we report the standard deviation of the random effects (the individual SRH baselines) as an estimate of how much variation there is in reporting behaviour between individuals.
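
The discretisation and the two metrics can be sketched as follows. The function names are hypothetical, and rounding exact halves upwards is an assumption (the text specifies only rounding to the nearest integer in 1–4). The AUC is computed from its pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
import math


def to_srh_level(prediction: float) -> int:
    """Round a continuous prediction to the nearest SRH level, clamped to 1-4."""
    # floor(x + 0.5) rounds halves upwards; Python's round() would round
    # halves to the nearest even integer instead.
    return min(4, max(1, math.floor(prediction + 0.5)))


def accuracy(predictions, actual) -> float:
    """Fraction of discretised predictions that equal the observed SRH level."""
    hits = sum(to_srh_level(p) == a for p, a in zip(predictions, actual))
    return hits / len(actual)


def auc(scores, is_positive) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

To evaluate prediction of excellent SRH, the continuous predictions can be used directly as scores with `is_positive` marking observed level 4; for poor SRH, the negated predictions serve as scores with `is_positive` marking observed level 1.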

#### Comparing models

Because presentation via an app simplifies the output equally regardless of model complexity, it is interesting to consider what performance can be gained by shifting the focus to pure performance optimisation with more complex and flexible models and machine learning techniques. This comparison also helps us analyse to what degree the model’s accuracy is limited by inefficient usage of data or by a lack of information in the input data. To this end, we compare the performance of the mixed-effects model in fivefold cross validation against two machine learning algorithms that have demonstrated high performance on structured data: Explainable Boosting Machines (EBM) and Extreme Gradient Boosting (XGBoost). We also tested adding data on variables (alcohol consumption and heart rate) that were not included in the main model due to concerns relating to effect interpretation. More details on this comparison can be found in the Supplementary materials.

### Ethics approval and consent to participate

The Regional Ethical Committee of Northern Norway gave ethical approval for this work (project number 89721). The Tromsø study was approved by the Norwegian Data Inspectorate and the Regional Ethical Committee of North Norway (REK). The methods of this study were performed in accordance with the relevant guidelines and regulations. The Tromsø Study collected written informed consent from all participants.