Monday, February 26, 2024

# The impact of Bayesian optimization on feature selection – Scientific Reports

In the following sections, we first provided the technical details of all the analytical methods that we have considered, and then presented the settings of simulation studies. Finally, we detailed the procedure adopted in the analysis of high-dimensional gene expression data obtained from ADNI.

### Technical details

As we predominantly focused on the impact of Bayesian optimization on feature selection, we first provided details of Bayesian optimization. We then presented commonly used feature selection methods, including those with and without hyper-parameters.

#### Bayesian optimization

Bayesian optimization18 aims at finding the combinations of hyper-parameters that maximizes the performance of the original problem. It defines its objective function in the same fashion as the original problem, and iteratively estimates the posterior distribution of the objective function as well as the subspaces of the hyper-parameters that are likely to achieve the optimized objective function. To be specific, for each iteration, Bayesian optimization first uses the uniform distribution to sample hyper-parameters from a given range and adopts the Gaussian process to estimate the posterior distribution of the objective function for each sampled hyper-parameter combination. It then identifies the region of hyper-parameters that are likely to optimize the objective function using the acquisition function, which balances between the posterior mean ($$\mu \left(x\right))$$ and variance ($$\sigma \left(x\right))$$ of the objective function using the tuning parameter $$\kappa$$, it can be expressed as Eq.Â (1):

$$AC\left(x\right)=\mu \left(x\right)+\kappa \sigma \left(x\right)$$

(1)

By utilizing the acquisition function to identify the optimal subspace of hyper-parameters, Bayesian optimization can avoid the region that achieves the best value of the objective function but has a considerable large variance. By iteratively estimating posterior distribution of the objective function and adaptively updating the subspace of hyper-parameters, Bayesian optimization can leverage the information from prior evaluations and only focus on promising subspaces, which not only simplifies the hyper-parameter searching process, but also improves the stability and tends to provide an optimal hyper-parameter configurations25.

#### Feature selection methods

We considered embedded-based methods (i.e., Lasso, Enet, and XGBoost) that require hyper-parameter tuning as well as filter-based (i.e., SIS and MRMR) and wrapper-based methods (i.e., sPLSda) that do not involve hyper-parameters for comparison purposes. Lasso, Enet and XGBoost exhibit greater versatility and have been experimentally applied in research on brain disorders and gene expression30,31. SIS, MRMR, and sPLSda are also widely utilized for feature selection in gene expression data32,33,34.

Lasso8 and Enet9 are penalized regression models with objective function of the form:

$${f}_{w}\left(x\right)+\lambda \rho {\Vert w\Vert }_{1}+\frac{\lambda \left(1-\rho \right)}{2}{\Vert w\Vert }_{2}^{2},$$

(2)

where $${f}_{w}\left(x\right)$$ is the loss function (e.g., mean square error) and $$\lambda$$ is a tuning parameter controlling the sparsity of the model. $$\rho$$ controls the relative contributions of the two penalties, where L1 penalty term (i.e., $${\Vert w\Vert }_{1}$$) encourages sparsity with coefficients of less important factors being shrunk to zero and L2 penalty term (i.e., $${\Vert w\Vert }_{2}^{2}$$) encourages shrinkage to reduce the impact of over-fitting and multi-collinearity. Lasso sets $$\rho =1$$ and Enet sets $$0<\rho <1$$. Both Lasso and Enet involve a hyper-parameter $$\lambda$$, which controls the sparsity of the model and normally is chosen in accordance with information criteria [e.g., Akaike information criterion (AIC)] and cross-validation (CV). In this research, we explored the benefits of using Bayesian optimization on the hyper-parameter $$\lambda$$ and $$\rho$$. We utilized Bayesian optimization to adjust the values of $$\lambda$$ for Lasso and $$\rho$$ and $$\lambda$$ for Enet, both of which fall within the range of (0, 1).

XGBoost11 is built based on the decision tree with an objective function of Eq.Â (3):

$${Obj}^{*}\approx \gamma T-\frac{1}{2}\sum_{j=1}^{T}\frac{{G}_{j}^{2}}{{H}_{j}+\lambda }$$

(3)

where $$\gamma$$ and $$\lambda$$ can control the strength of the penalty. $$T$$ is the total number of leaf nodes, and $${G}_{j}$$, $${H}_{j}$$ represent the sum of the first and second order gradients of all samples of the $$j$$-th node, respectively. XGBoost can be used as a variable selection tool as it can gauge the importance of each feature via the â€˜gainâ€™, which measures the improvement of the objective function once the feature is split and is defined as Eq.Â (4):

$$Gain= \frac{1}{2}\cdot \left[\frac{{G}_{L}^{2}}{{H}_{L}+\lambda }+\frac{{G}_{R}^{2}}{{H}_{R}+\lambda }-\frac{{\left({G}_{L}+{G}_{R}\right)}^{2}}{{H}_{L}+{H}_{R}+\lambda }\right]-\gamma$$

(4)

where subscripts L and R respectively denoting the left and right tree after splitting at the node. A larger value of the â€˜gainâ€™ signifies a superior split, highlighting the increased importance of the feature. In this study, we explored the advantages of Bayesian optimization on the optimization of hyper-parameters that are listed in Supplementary Table S1.

SIS5 is a filter-based feature selection method that ranks the importance of features by evaluating their correlations with the target. It first computes the association between each feature and the target, and then selects a subset of features according to the rank of their correlation values based on a pre-specified user-defined number of features.

MRMR6 ranks the features on the basis of their mutual information with the target and their mutual information with other, which can be expressed as:

$$MRMR\left(i\right)=MI\left(y,{x}_{i}\right)-\frac{1}{S}\cdot {\sum }MI\left({x}_{i},{x}_{j}\right)$$

(5)

where $$MI\left(\cdot ,\cdot \right)$$ represents mutual information. MRMR aims at selecting a subset of features that maximizes the correlation with the target while minimizing the redundancy among features. Similar to SIS, it selects the top features with the highest MRMR score for downstream analysis.

sPLSda35 predominantly aims at converting the original predictor into a small number of orthogonal latent variables that are constructed with a small set of inputs and maximizes the covariance between input and output. It uses the objective function of the partial least square discriminant analysis to form the latent variables and imposes a penalty term to allow that latent variables are constructed with a selected number of features in the original space. It defines its objective function as:

$$\underset{{\Vert \alpha \Vert }^{2}=1, {\Vert \beta \Vert }^{2}=1}{{\text{argmax}}}co{v}^{2}\left(T, H\right)\mathrm{ with }T=X\alpha \mathrm{\,and\, }H=Y\beta ,$$

(6)

where $$\alpha$$ is the penalty term. As sPLSda is designed for classification problem, we categorized the response into 50 categories following Bommert et al.36.

### Simulation studies

To evaluating whether the Bayesian-optimized feature selection method could yield more stable and reliable feature selection, we conducted simulation studies. We simulated the covariates $${{\varvec{X}}}_{{\varvec{i}}}={\left( {X}_{i1}, {X}_{i2},\cdots , {X}_{ip}\right)}^{{\text{T}}}$$ of the i-th individual as $${{\varvec{X}}}_{{\varvec{i}}}\boldsymbol{ }\sim \boldsymbol{ }N(0,{{\varvec{I}}}_{{\varvec{p}}})$$, where $${{\varvec{I}}}_{{\varvec{p}}}$$ is an identify matrix and p is set to be 5000. We considered both linear and non-linear models. For the linear effects, the continuous outcomes were simulated as:

$${Y}_{i}={{\varvec{X}}}_{{\varvec{i}}}{\varvec{w}}+{\varepsilon }_{i},$$

(7)

where $${{\varvec{X}}}_{{\varvec{i}}}={\left( {X}_{i1}, {X}_{i2},\cdots , {X}_{i{p}_{0}}\right)}^{{\text{T}}}$$ is the causal covariate, $${\varvec{w}}=({w}_{1},{w}_{2},\cdots , {w}_{{p}_{0}})$$ is their corresponding effect and $${\varepsilon }_{i} \sim N(\mathrm{0,1})$$. Following the same settings used in Fan and Lv5, we set the $${w}_{i}$$ as:

$${w}_{i}={\left(-1\right)}^{{U}_{i}}\cdot \left(\frac{5\cdot {\text{log}}\left(n\right)}{\sqrt{n}}+\left|{Z}_{i}\right|\right),$$

(8)

where $${U}_{i}\sim Ber(0.5)$$ and $${Z}_{i}\sim N\left(\mathrm{0,1}\right).$$ For the non-linear model, the continuous outcomes were simulated as:

$${Y}_{i}={\sum }_{j=1}^{{p}_{1} }\mathit{sin}\left({X}_{ij}\right)+1.8\times {\sum }_{j={p}_{0}-{p}_{1}+1}^{{p}_{0} }\mathit{cos}\left({X}_{ij}\right)+ {\varepsilon }_{i},$$

(9)

where $${p}_{1}=\frac{{p}_{0}}{2}$$, and $${\varepsilon }_{i}\sim N\left(\mathrm{0,0.1}\right)$$. To consider binary outcomes under both linear and non-linear effect models, we generated $${{Y}_{i}}$$ = $$\left\{\begin{array}{c}0, {Y}_{i} < {Y}_{mid}\\ 1, {Y}_{i} \ge {Y}_{mid}\end{array}\right.$$ and $${Y}_{mid}$$ is the median of the set $${Y}_{i}$$. We set the total sample size to be 1000, and gradually increased the number of causal features from 100 to 1000 (i.e., $${p}_{0}=\left(100, 200, 500, 1000\right)$$). We repeated each simulation setting 20 times.

We implemented Bayesian optimization on XGBoost (denoted as BO_XGBoost), Lasso (denoted as BO_Lasso), and Enet (denoted as BO_Enet), where hyper-parameter tuning is needed. We set the number of iterations to be 100 for Bayesian optimization. We also considered the above three methods, where the hyper-parameters were set based on their default settings of the corresponding R packages (i.e., xgboost37, glmnet38 and msaenet39). To make fair comparisons, we included additional three widely used feature selection methods (i.e., SIS, sPLSda, MRMR), where hyper-parameters are not involved. For each method, we varied the pre-specified number of selected features from 100 to 1000 (i.e., $${n}_{s}=(100, 200, 500, 1000))$$. We used the recall rate to evaluate the performance of each feature selection method.

### The analysis of gene expression data from ADNI

We focused on brain imaging traits from ANDI studies. These traits include subcortical volumes (hippocampus, accumbens, amygdala, caudate, pallidum, putamen, thalamus), the volumes of gray matter, white matter and brainstemâ€‰+â€‰4th ventricle from T1 structural brain MRI, and the volume of white matter hyperintensities from T2-weighted brain MRI. We have normalized the phenotype data, and the sample sizes for each phenotype in ADNI studies are summarized in Table 1. Demographic information is summarized in Supplementary Table S2.

Gene expression data were derived from blood samples collected from the 811 participants in the ADNI WGS cohort. Analysis was conducted using the Affymetrix Human Genome U219 Array (Affymetrix, Santa Clara, CA). We utilized the data after quality control from K.N.Â et al.44, and no additional quality control steps were applied. For our analysis, we extracted 49,386 gene expression profiles that pass the quality control criteria for subsequent modeling.

For ADNI data, we genuinely lack knowledge on which genes are related to AD, and thus we rely on the predictive performance to determine whether we have identified AD risk factors and achieved a robust and accurate AD risk prediction model. We split the data into a 70% training set and a 30% testing set. We employed all feature selection methods in simulation studies to select a pre-specified number of features from the training data, and then fed these features to the downstream prediction tasks, where support vector machine (SVM), Lasso, random forest (RF) and gradient boosting machine (GBM) are used to build prediction models. We evaluated the prediction performance based on testing data, where Pearson correlations and mean square errors are calculated. We repeated this process 100 times to avoid chance finding.