
Bayesian reference model in feature selection problems


POLITECNICO DI MILANO

School of Industrial and Information Engineering
Master of Science in Mathematical Engineering

Bayesian reference model in feature selection problems

Supervisor: Prof. Alessandra Guglielmi
Co-supervisors: Prof. Aki Vehtari, M.Sc. Juho Piironen

Author: Federico Pavone, student ID 875319


Contents

List of Figures
List of Tables
Abstract
Acknowledgments
Introduction
1 State of play
  1.1 The Bayesian approach
  1.2 The feature selection problem
    1.2.1 Feature selection as minimal subset
    1.2.2 Multiple hypothesis testing framework
  1.3 Generalised linear models
  1.4 Note on notation and terminology
2 Stability of feature selection algorithms
  2.1 Note on highly correlated features
3 On the Bayesian reference model
  3.1 Why the reference model for feature selection
  3.2 Guidelines: how to devise the reference model
    3.2.1 Shrinkage priors: the regularised horseshoe prior
    3.2.2 How to assess predictive performance
4 Multiple hypothesis testing in the normal means problem
  4.1 The normal means problem
  4.2 Simulated data
  4.3 Comparison
    4.3.1 Controlling the Q-value
    4.3.2 Credibility intervals
    4.3.3 Empirical Bayes: local false discovery rate
5 Projection predictive approach
  5.1 The original idea
    5.1.1 Kullback-Leibler divergence
    5.1.2 Projection of the predictive distribution
    5.1.3 Special case: generalised linear models
    5.1.4 Search strategies and correct cross-validation
  5.2 Iterated projection algorithm
  5.3 Comparison
  5.4 Limitations: cross-validation
6 Real world data: body fat data
Discussion and further research
Bibliography
A Stan models
  A.1 Reference model
  A.2 Model for the normal means problem with regularised horseshoe prior


List of Figures

3.1 Scatter plot of sample correlation of feature $x_j$ with noisy observed variable $y$ versus latent $f$ on the left and estimated $\hat f$ on the right
3.2 Plot of the shrinkage factor for the horseshoe prior
4.1 Scatter plot of sample correlation of feature $x_j$ with noisy observed variable $y$ and latent $f$
4.2 Posterior intervals at level 90% for the Bayesian model for the normal means problem
4.3 Number of discoveries versus the q-value with $\rho_0 = 0.2$, $n = 50$ and $\rho = 0.3$
4.4 Posterior intervals at level 90% for the normal means model with different simulated data
4.5 Number of discoveries versus q-value plot for different $\rho_0$ thresholds
4.6 Stability comparison for the control of the Q-value in the normal means problem
4.7 Stability comparison for the selection with credibility intervals in the normal means problem
4.8 Histogram of the z-values for simulated data with $n = 50$ and $\rho = 0.5$
4.9 Number of discoveries against local fdr averaged estimates after 150 data simulations
4.10 Stability estimates and 95% confidence interval controlling local fdr at level 0.2. Results after 150 data simulations
5.1 Example of variable selection plot for the projection predictive approach
5.2 Example of the different iterations for the iterated projection algorithm
5.3 Histogram of the size of the subset of selected features for the iterated projection and the iterated lasso
5.4 Stability estimator and 95% confidence intervals for the iterated projection and the iterated lasso
5.5 Averaged number of discoveries against the stopping rule threshold $\alpha$ for the iterated projection and iterated lasso algorithms
5.6 Histogram of the size of the subset of selected features for the iterated projection with cross-validation
5.7 Stability estimator for the iterated projection algorithm with cross-validation
5.8 Averaged number of discoveries against the stopping rule threshold $\alpha$ for the iterated projection with cross-validation
6.1 Correlation plot of the body fat data
6.2 Recall vs noisy discovery rate for the body fat data


List of Tables

4.1 Set of parameters used for the comparison in Section 4.3
4.2 Stability comparison with hypothesis testing for the control of the Q-value in the normal means problem
4.3 Stability comparison with hypothesis testing for the selection with credibility intervals in the normal means problem
4.4 False discovery rate, recall and number of no-discoveries for the selection with credibility intervals
4.5 Stability comparison with hypothesis testing for the control of the local false discovery rate
5.1 Stability comparison with hypothesis testing for the iterated projection and iterated lasso algorithms
5.2 Recall and number of empty selections (fails) for the iterated projection and the iterated lasso algorithms
5.3 Stability comparison with hypothesis testing for the iterated projection with cross-validation
6.1 Number of empty selections for the body fat data


Abstract

Feature selection is a well-known and still open problem in both frequentist and Bayesian statistics. It can be understood as the goal of finding either “the minimal subset of features with good enough predictive performance” or “the whole subset of relevant features”, depending on the task of the analysis. In this thesis we refer to the second definition.

In the first part of this work we carry out a comparison of different feature selection methods, used with and without what we call a reference model on top of the procedure, that is, a model that describes our data well. We measure the performance and the stability of each method using both the standard approach, which simply relies on the data, and the reference model one, showing overall improved results when we look at a proper model (i.e. the reference one) instead of the observed data. The comparison includes several procedures: selection through highest posterior density credibility intervals, selection by controlling the Q-value, and selection by controlling the local false discovery rate.

In the second part, we propose a novel algorithm to tackle the same selection problem, which we name iterated projection, based on the projection predictive approach. We compare this method with its natural counterpart that does not use a reference model, which we call iterated lasso. Finally, we compare these techniques using real-world data. The results again show increased stability and better selection performance in favour of the reference model approach.


Acknowledgments

Having arrived at the end of my university experience, I would like to thank the many people who made all this possible.

I would like to thank Professor Alessandra Guglielmi for being the first one to introduce me to Bayesian Statistics and also for patiently correcting the thesis.

After spending six amazing months there, if I had to name a second home, it would be Helsinki. Kiitos paljon to Professor Aki Vehtari or, better, Aki for accepting me in his research group, for supervising me the whole time and for teaching me many things about Stan and Bayesian Statistics during simple chats. Also thank you for the welcome and all the tips about Helsinki, Daddy Green’s pizza is the best! Thanks to Juho for guiding me during the first part of the thesis and for always being available to answer all my questions. I would also like to thank all the people I met in Aalto: Will, Gabriel, Kunal, Eero, Topi, Akash, Mans, Michael and all the others I have not named here. Thanks also to Pablo, for hosting me and having introduced me to his friends. Thanks to Anfisa, Ivan and Maria for the time spent together.

I managed to survive all these years of study also thanks to sport. Brazilian Jiu Jitsu has become a central part of my life; it has taught and keeps teaching me several life lessons. Kiitos to the Rebel Team of Delariva Finland: Risto, Delcio, Olli, Abu, Ray, Julle, Riku, Jarkko, Riku, Elias, Hari and all the other members. You welcomed me into your family and conquered a piece of my heart. I spent an amazing time in Helsinki also thanks to you.

Thanks to FCS, my home BJJ team. We have been training together for many years. Thanks to the instructor Luciano and all the members. A special thanks goes to the Sakido team and to Claudio, because everything started from there.

Five years of university are long, but not so long if you are in good company. Thanks to all my University friends, you are too many to be listed here. Everything got easier after discussing things with you. Thanks for the studying and the leisure time spent together.

Thanks to my friends from Lecco: Ange, Mancio, Simo, Gio, Pippo and all other members of the Cantina Social Club. Thanks also to my high school mates for the rare, but great, reunions. It is always a great time with you all.

And finally, the biggest thank you goes to my family. They have been supporting me the whole time and practically made all this happen. A special thanks to my uncle Salvatore, my mum, my dad and my brother!


Introduction

In statistical applications, one of the main steps of the modelling workflow is covariate or feature selection. Referring to the simple example of a linear regression, e.g. $y_i = \beta^t x_i + \epsilon_i$, with index $i$ identifying each statistical unit, $\beta$ being the vector of weights of the same dimension as the vector of covariates $x_i$ and $\epsilon_i$ an error term, the goal of feature selection is to reduce the number of covariates in the linear predictor, leaving out those that are not significantly related to the output $y_i$. In the literature different methods have been proposed, depending both on the final goal of the analysis and on whether the frequentist or the Bayesian approach is used. In this thesis we mainly rely on the Bayesian framework which, in addition to the model distribution, includes a prior distribution $p(\beta)$ for the parameters $\beta$, possibly encoding prior information about them. The inference proceeds following Bayes' rule, giving what is called the posterior distribution $p(\beta \mid D)$, where $D$ stands for the observed data.

From a full Bayesian perspective, when the goal of the analysis is prediction, no selection should be carried out and integration over all parameter uncertainties, i.e. posterior distributions, is the recommended procedure. However, a high dimensional parameter space brings difficulties such as computational burden and overfitting. In such cases, the choice of the prior distribution plays an important role. Besides these issues, another problem is that in several real-case scenarios integration over all parameters is not feasible, for many possible reasons such as the high cost of collecting all the features for a future observation. These are a few among many possible motivations to perform feature selection, here intended as the problem of “finding a minimal subset of covariates that yield a good predictive model for the target variable” (Piironen et al., 2018). For this purpose several different techniques have been proposed, such as feature selection induced by sparsifying priors (Rockova et al., 2012; Piironen and Vehtari, 2017c) or the projection predictive approach (Dupuis and Robert, 2003; Piironen and Vehtari, 2017a; Piironen et al., 2018).

Sometimes the task of the analysis is not prediction, but rather inference on which variables are significantly related to the phenomenon of interest. A common case is microarray data in medicine or biology, where, for example, for a very large number of genes, expression levels are measured for two sets of patients labeled as positive or negative to some disease (typically a small n, large p scenario, with n the number of patients and p the number of genes). The main goal of the analysis is to spot which subset of these genes is significantly related to the disease in order to perform further experiments. A possible way to proceed is to perform hypothesis testing to assess the significance of each gene, based on some modelling assumptions. Efron (2012) refers to this as the multiple hypothesis testing problem. Such an example can still be considered feature selection, this time intended as “identifying all features that are predictive about (that is, statistically related to) the target variable” (Piironen et al., 2018).

The subject and the title of the thesis are related to this last scenario, the multiple hypothesis testing problem, but it originally takes its cue from the former, that is, feature selection with a prediction goal. The idea is to benefit from the use of a reference model, i.e. a (possibly Bayesian) model that encompasses all information available in the data (all the covariates) and that performs sufficiently well in prediction, as proposed in the first work about the projection approach by Dupuis and Robert (2003) and later by Piironen and Vehtari (2017a) and Piironen et al. (2018). If the observed target values, the $y_i$'s in the regression example, are seen as noisy realisations of an underlying phenomenon of interest, the idea of the reference model is that it can give a better description of the underlying phenomenon than the observed target values do, because it cleans part of the noise present in the data. Thus the relevance of a given feature can also be better understood if it is somehow related to the reference model. Carrying on with the linear regression, a naive example of this is that the sample correlation between the features $x_j$ and the observed $y$ is more noisy than the one between $x_j$ and some reference model point predictions $\hat y$, as later shown in Chapter 3.

The thesis is organised as follows: first we briefly summarise the Bayesian approach and the state of play about feature selection and multiple hypothesis testing in Bayesian statistics (Chapter 1), then we illustrate the general idea of the reference model approach (Chapter 3) and compare it to the classical one in a multiple hypothesis testing framework for the normal means problem (Chapter 4). Such comparison is quantified by looking at indices such as the false discovery rate and recall, and also by estimating the stability of the procedures with the estimator presented in Chapter 2. After that, we briefly illustrate the original projection predictive idea and introduce a novel approach to solve the “whole subset of relevant features” problem through an iterated use of it (Chapter 5). In Chapter 6 we apply some of the discussed methods to a real world dataset. Finally, in the Discussion we summarise and comment on the results, outlining possible future directions for the research. The original contribution of the thesis consists in the use of the reference model approach for the multiple hypothesis testing framework of Chapter 4, in the implementation of the iterated projection algorithm of Chapter 5, and in all the numerical experiments reported in the aforementioned chapters and in Chapter 6.

We used the programming language R for experiments and analysis, whereas we fitted the Bayesian models with Stan. Most of the simulations have been run on the Triton system, the high-performance computing cluster provided by Aalto University.

The thesis has been developed in the Probabilistic Machine Learning group at Aalto University (Helsinki, Finland) under the supervision of Professor Aki Vehtari.


Chapter 1

State of play

In this chapter we give some background on concepts concerning the thesis. The main type of data we refer to is composed of a target random variable $Y$ and a set of $p$ covariates $X_j$, with $j$ going from 1 to $p$. We write the observed values in lower case, $y_i$ and $x_{ij}$, with subscript $i$ indexing each statistical unit, i.e. $i = 1, \dots, n$. We use the simplest model to describe $Y$ through the variables $X_j$, i.e. a linear model of the type:

$$y_i = \beta^t x_i + \epsilon_i, \qquad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2), \qquad i = 1, \dots, n \tag{1.1}$$

with $p$ being the dimension of both vectors $x_i$ and $\beta$. Formulation (1.1) is equivalent to assuming the following model distribution for our data:

$$Y_i \mid \beta, \sigma, x_i \overset{\text{ind}}{\sim} N(\beta^t x_i, \sigma^2), \qquad i = 1, \dots, n \tag{1.2}$$

This chapter is organised as follows: after giving an outline of the Bayesian approach (Section 1.1), we introduce the feature selection problem (Section 1.2), referring to model (1.1), and describe some of the available techniques. In Section 1.3, we introduce the generalised linear model, which we refer to in Chapter 5, and, finally, in Section 1.4 we highlight a few notes about the terminology and notation used in the thesis.

1.1 The Bayesian approach

Referring to the data described above, frequentist statisticians would write the model distribution as:

$$Y_i \mid x_i \overset{\text{ind}}{\sim} N(\beta^t x_i, \sigma^2), \qquad i = 1, \dots, n \tag{1.3}$$

while Bayesians would write:

$$Y_i \mid \beta, \sigma, x_i \overset{\text{ind}}{\sim} N(\beta^t x_i, \sigma^2), \qquad i = 1, \dots, n \tag{1.4}$$

The reason is that in the Bayesian approach the model parameters, in our example $\beta = (\beta_1, \dots, \beta_p)$ and $\sigma$, are considered random variables. The model distribution describes the observed data conditioning on such parameters. In order to generalise the discussion, from now on in this section we assume to have a generic model parameter $\theta \in \Theta$, with typically $\Theta \subset \mathbb{R}^p$, and a generic model distribution $p(y \mid \theta)$ for the vector of observed random variables $Y \in \mathbb{R}^n$, $n$ being the number of observations. For the sake of simplicity, we use a sloppy notation writing $p(y \mid \theta)$ instead of $p(Y = y \mid \theta)$ and we also assume all distributions to be absolutely continuous with respect to the Lebesgue measure. We therefore write the distribution and the density with the same notation.

Expression (1.4) is rewritten as:

$$Y \mid \theta \sim p(y \mid \theta) \tag{1.5}$$

$p(y \mid \theta) = L(\theta \mid y)$ is called the likelihood when it is seen as a function of the parameters for a given vector of observations $y$. A distribution $\pi(\theta)$ is assigned to the random variable $\theta$, which is called the prior distribution, and together with the model distribution it constitutes what is called a Bayesian model:

$$Y \mid \theta \sim p(y \mid \theta), \qquad \theta \sim \pi(\theta) \tag{1.6}$$

with $\pi(\theta)$ possibly encoding prior knowledge. The inference proceeds by computing the posterior distribution $\pi(\theta \mid y)$ of the parameter $\theta$ by Bayes' theorem:

$$\pi(\theta \mid y) = \frac{p(y \mid \theta)\,\pi(\theta)}{p(y)} \tag{1.7}$$

where the marginal distribution $p(y)$ is computed integrating out $\theta$ from the model distribution:

$$p(y) = \int_\Theta p(y \mid \theta)\,\pi(\theta)\,d\theta \tag{1.8}$$

The quantities of interest are usually the posterior predictive distribution for a new unobserved variable $\tilde Y$, denoted by $p(\tilde y \mid y)$, or the posterior expectation of some reasonable, i.e. regular enough, function $h(\cdot)$ of $\theta$. Under the assumption of independence of the variables $Y_i$ when conditioning on $\theta$:

$$Y_i \overset{\text{iid}}{\sim} p(y_i \mid \theta), \qquad i = 1, \dots, n \tag{1.9}$$

the posterior predictive distribution can be computed as follows:

$$p(\tilde y \mid y) = \int_\Theta p(\tilde y \mid \theta)\,\pi(\theta \mid y)\,d\theta \tag{1.10}$$


whereas posterior expectations are simply computed averaging over the posterior distribution:

$$E[h(\theta) \mid y] = \int_\Theta h(\theta)\,\pi(\theta \mid y)\,d\theta \tag{1.11}$$

An analytical expression of the posterior distribution is available only for a small number of models. One case is when the prior is conjugate to the model, i.e. the posterior distribution belongs to the same parametrised family of distributions as the prior, and the posterior computation simply reduces to updating the relevant hyperparameters. However, in most cases we can only aim to sample from such a distribution and to compute expressions such as (1.11) via Monte Carlo methods. Suppose we have i.i.d. samples $\{\theta^{(s)}\}_{s=1}^{S}$ from the posterior distribution; Monte Carlo integration approximates integrals with summations over the samples:

$$\int_\Theta h(\theta)\,p(\theta \mid y)\,d\theta \approx \frac{1}{S}\sum_{s=1}^{S} h(\theta^{(s)}) \tag{1.12}$$

Assuming we are also able to sample from the model distribution $p(y \mid \theta)$, we can retrieve samples from (1.10) by marginalisation: for each $\theta^{(s)}$, sample $\tilde Y^{(s)}$ from $p(\tilde y \mid \theta^{(s)})$, so that:

$$\{(\theta^{(s)}, \tilde Y^{(s)})\}_{s=1}^{S} \sim p(\tilde y, \theta \mid y)$$

and therefore:

$$\{\tilde Y^{(s)}\}_{s=1}^{S} \sim p(\tilde y \mid y)$$

There are different methods to generate samples from $p(\theta \mid y)$: it can be done using some approximating distribution, as in variational inference (VI) algorithms, or using Markov chains (MCMC algorithms), which produce correlated samples asymptotically belonging to the true posterior. Metropolis-Hastings and the Gibbs sampler are classical MCMC algorithms. In the case of continuous parameters only, the state-of-the-art sampler is HMC (Hamiltonian Monte Carlo, details in Betancourt, 2017). The software Stan (Carpenter et al., 2017) implements HMC-NUTS (the No-U-Turn Sampler described in Hoffman and Gelman, 2014), a refined version of HMC that adaptively sets some of the parameters of the algorithm.
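As a minimal illustration of Monte Carlo integration (1.12) and of posterior predictive sampling by marginalisation, the following R sketch uses a conjugate Beta-Binomial model (not one of the models used in the thesis) so that exact posterior draws are available without MCMC; the prior and data are purely illustrative.

```r
# Monte Carlo approximation of a posterior expectation, as in (1.12)
set.seed(1)
y  <- 7; n <- 10                                # observed successes out of n trials
a0 <- 1; b0 <- 1                                # Beta(1, 1) prior
theta_s <- rbeta(4000, a0 + y, b0 + n - y)      # S = 4000 posterior samples

h <- function(theta) log(theta / (1 - theta))   # h(theta): log-odds
mean(h(theta_s))                                # Monte Carlo estimate of E[h(theta) | y]

# Posterior predictive draws by marginalisation: one replicate per posterior sample
y_rep <- rbinom(length(theta_s), size = n, prob = theta_s)
```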

1.2 The feature selection problem

As mentioned in the Introduction, variable (i.e. feature or covariate) selection aims at reducing the number of covariates included in a model through some criteria. Depending on the goal of the analysis, the selection leads to different types of subsets of features. Following Piironen et al. (2018), we identify two possible tasks: when the goal of the analysis is focused on prediction, feature selection aims at “finding a minimal subset of covariates that yields a good predictive model for the target variable”, whereas sometimes the interest is in “identifying all features that are predictive about (that is, statistically related to) the target variable”. We refer to the latter problem with the general expression multiple hypothesis testing, regardless of whether the method used is based on hypothesis testing or not.

In the following sections, we review some of the available methods for both the minimal subset problem (Section 1.2.1) and the multiple hypothesis testing one (Section 1.2.2). All methods are described assuming the model (1.2).

1.2.1 Feature selection as minimal subset

In the Bayesian context, many variable selection techniques are based on suitable choices of the prior distribution of the regression parameters. A general approach is to use sparsifying priors (Rockova et al., 2012; Piironen and Vehtari, 2017c), among which the spike-and-slab prior is considered the gold standard. The spike-and-slab prior was originally formulated by Mitchell and Beauchamp (1988) using a mixture of a delta component in zero (the spike) and a normal distribution (the slab):

$$\beta_j \mid \gamma_j, \tau_j \overset{\text{iid}}{\sim} \gamma_j N(0, \tau_j^2) + (1 - \gamma_j)\,\delta_0, \qquad j = 1, \dots, p \tag{1.13}$$

with typically $\gamma_j \overset{\text{iid}}{\sim} \text{Be}(\gamma_0)$, $j = 1, \dots, p$. Later, George and McCulloch (1993) proposed a normal distribution with small variance for the spike component, resulting in an absolutely continuous prior distribution. The spike-and-slab prior is theoretically appealing; however, variable selection results can be sensitive to the choices of slab width and prior inclusion probability (e.g. one rule could be to include all those features such that $p(\gamma_j = 1 \mid y)$ is larger than some value).
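As a minimal sketch, the following R code draws coefficients from the spike-and-slab prior (1.13) with a point mass at zero; the inclusion probability gamma0 and slab scale tau are illustrative values, not settings used in the thesis.

```r
# Draw p coefficients from the spike-and-slab prior (1.13)
set.seed(1)
p      <- 1000
gamma0 <- 0.1                        # prior inclusion probability (illustrative)
tau    <- 1                          # slab standard deviation (illustrative)
gamma  <- rbinom(p, 1, gamma0)       # spike/slab indicators gamma_j
beta   <- gamma * rnorm(p, 0, tau)   # exactly zero when gamma_j = 0
mean(beta == 0)                      # fraction of coefficients falling in the spike
```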

A large family of sparsifying priors can be built as scale mixtures of normal distributions with local-global scale parameters (Polson and Scott, 2012). The prior for the parameters $\beta_j$ can be written as:

$$\begin{aligned} \beta_j \mid \lambda_j, \tau &\overset{\text{iid}}{\sim} N(0, \lambda_j \tau), \qquad j = 1, \dots, p \\ \lambda_j &\overset{\text{iid}}{\sim} \pi_1 \\ \tau &\sim \pi_2 \end{aligned} \tag{1.14}$$

The global scale $\tau$, common to all parameters, shrinks all signals toward zero, while the local scale $\lambda_j$ allows the important ones to escape this effect. Polson and Scott (2012) suggest using a prior with substantial mass around zero for $\tau$, and a heavy-tailed distribution for the local scale $\lambda_j$. Different choices for $\lambda_j$ define different priors, such as the Laplace with $\lambda_j \sim$ Exponential (Park and Casella, 2008), the Student-t with $\lambda_j \sim$ Inv-$\chi^2$ (Tipping, 2001) and the horseshoe proposed by Carvalho et al. (2009), discussed in Section 3.2.1. Further examples can be found in Polson and Scott (2012). Such priors do not perform the selection themselves, thus it is non-trivial how to proceed from posterior distributions to feature selection. A common decision rule consists in selecting those features whose HPD (highest posterior density) marginal intervals do not include zero at some credibility level. However, several issues arise, such as how to choose the credibility level, how to deal with high dimensionality and correlated features, where marginal intervals can be misleading as shown in Piironen et al. (2018), and also what the distribution of the parameters after the selection should be, e.g. the full model posterior restricted to the selected parameters or the posterior obtained by fitting the model again including only the selected features.

Another possible approach is to consider hypothesis testing to assess whether a signal is null or not:

$$H_0: \beta_j = 0 \qquad \text{vs} \qquad H_1: \beta_j \neq 0 \tag{1.15}$$

Some of the aforementioned issues are still present in this case as well, such as the choice of the credibility level or of the Bayes factor threshold. Moreover, in a full Bayesian framework null hypotheses of the form $H_0: \beta_j = 0$ lose meaning because of the continuity of the random variable $\beta_j$.

Variable selection is strongly connected to model selection, since different subsets of features can be seen as different models to be compared in order to select the best among them based on some utility estimate. Such utility is usually based on predictive performance, such as the root mean squared error or the expected log predictive density (Vehtari et al., 2017). For example, another common procedure is stepwise regression: starting from the empty model, at each step the variable that most increases the utility estimate is included in the model, until some stopping criterion is satisfied. In the case of a large number of parameters, which in the model selection context means a large number of candidate models, the selection result can have high variability if the utility estimates have large variance, as is the case when the utility estimation is based on cross-validation and the number of observations is low, with the consequent risk of selecting a nonoptimal submodel and of having, for such a model, a biased utility estimate, as shown in Piironen and Vehtari (2017a).

1.2.2 Multiple hypothesis testing framework

As previously said, here feature selection is intended as the problem of selecting the whole subset of relevant features through some procedure, which can be a traditional hypothesis test or something else.

Naive approaches to multiple hypothesis testing consist simply in standard hypothesis testing, such as control of the type-I error in the frequentist setting. However, when multiple comparisons are carried out simultaneously, the risk of seeing something when there is nothing can be dangerously high. Indeed, let us call $\alpha$ the level at which the type-I error is controlled, write $P(\cdot \mid H_0)$ meaning that we condition on the event “$H_0$ is true” and consider the event “discovery” when we reject $H_0$; the type-I error is thus defined as follows:

$$\text{Type-I error} = P(\text{discovery} \mid H_0) \leq \alpha \tag{1.16}$$

Consider then $M$ simultaneous comparisons at the same level. The probability of claiming at least one discovery when there is nothing is:

$$\begin{aligned} P(\text{at least one discovery} \mid \forall H_0) &= 1 - P(\text{no discoveries} \mid \forall H_0) \\ &= 1 - \left(1 - P(\text{discovery} \mid H_0)\right)^M \\ &\leq 1 - (1 - \alpha)^M \end{aligned} \tag{1.17}$$

where we assume the tests to be independent under true $H_0$. This means that if $\alpha = 0.05$ and $M = 10$, the probability of finding something in the worst case is only bounded at level 0.4, and such bound quickly approaches 1 as $M$ grows. This problem is a big challenge and sometimes, quoting Jorge Luis Borges's short story, it is named the “garden of forking paths”. Despite the intrinsic difficulty, some solutions have been proposed in the literature. In the frequentist setting, one of the first solutions was the Bonferroni correction of the type-I error (Rupert Jr, 2012). But the revolution came with false discovery rate (FDR) control through the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995), of which we now give a brief explanation. Given a set of $m$ hypothesis tests $\{H_i\}_{i=1}^m$ and some procedure to reject or not the null hypothesis, we have the following table:

                 H0 true    H0 false    total
  H0 accepted       U           T       m − R
  H0 rejected       V           S         R
  total            m_0       m − m_0      m

with $U$, $T$, $V$, $S$, $R$ and $m_0$ being the numbers of tests in each cell of the table. The FDR and the pFDR (positive false discovery rate) are then defined as:

$$\text{FDR} := E\!\left[\frac{V}{R} \,\middle|\, R > 0\right] P(R > 0) \tag{1.18}$$

$$\text{pFDR} := E\!\left[\frac{V}{R} \,\middle|\, R > 0\right] \tag{1.19}$$

The BH procedure consists in ordering the p-values of the tests and then marking as discoveries all those tests whose p-value is not larger than the largest included $p_{(i)}$, where the parentheses in the subscript refer to the (increasing) ordered indexing of the p-values. The idea can be easily understood through Bayes' rule:

$$\text{FDR} = P(H_0 \mid \text{discovery}) = \frac{P(\text{discovery} \mid H_0)\,P(H_0)}{P(\text{discovery})} \tag{1.20}$$

If the procedure includes all the p-values below the value $p_{(i)}$, then by definition of p-value:

$$P(\text{discovery} \mid H_0) = p_{(i)} \tag{1.21}$$

The prior probability of $H_0$ can be bounded above by 1. The marginal probability $P(\text{discovery})$ can be empirically estimated as $i/N$, since we reject $i$ hypotheses out of $N$. Therefore we have:

$$\text{FDR} = \frac{p_{(i)}\,\pi_0}{i/N} \leq \frac{p_{(i)}}{i/N} \leq q \tag{1.22}$$

so that controlling at level $q$ means stopping the selection of the ordered tests when the largest included p-value satisfies:

$$p_{(i)} \leq \frac{q\,i}{N} \tag{1.23}$$
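A minimal R sketch of the BH rule (1.23) on toy p-values; the equivalent selection can also be obtained with base R's p.adjust.

```r
# Benjamini-Hochberg selection at level q, following (1.23)
set.seed(1)
N <- 1000
p_values <- c(runif(900), rbeta(100, 1, 50))      # 900 null and 100 small p-values (toy data)
q <- 0.1

ps    <- sort(p_values)
i_max <- max(c(0, which(ps <= q * seq_len(N) / N)))           # largest i with p_(i) <= q*i/N
discoveries <- if (i_max > 0) which(p_values <= ps[i_max]) else integer(0)

# Equivalent selection with base R
discoveries_bh <- which(p.adjust(p_values, method = "BH") <= q)
```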

Storey (2003) defines the Q-value as the analogue of the p-value for the pFDR. Given a test statistic $T = t$, the Q-value is defined as follows:

$$Q(t) = \inf_{\Gamma_\alpha :\, t \in \Gamma_\alpha} \text{pFDR}(\Gamma_\alpha) \tag{1.24}$$

We use the control of the Q-value as one of the methods in the comparison in Section 4.3.1. Further explanations about the Q-value can be found in that section.

In some cases, the multiple hypothesis testing problem is rephrased in the context of a normal means problem, where we want to infer the true mean $\theta_j$ of a noisy observation $z_j$, that is, some summary of the data related to the variable $x_j$, for each $j$:

$$z_j = \theta_j + \epsilon_j, \qquad \epsilon_j \overset{\text{iid}}{\sim} N(0, \sigma_0^2), \qquad j = 1, \dots, p \tag{1.25}$$

Such a problem is further discussed in Chapter 4 with examples of how to move from the abstract inference formulation to this one. Efron (2008) proposes an empirical Bayes approach to control the local false discovery rate for the normal means problem, which is one of the methods used in the comparison of Section 4.3.3.

Different possible full Bayesian approaches have been proposed for the normal means problem. The overall idea is to fit a linear model to (1.25) using some sparsifying prior over the latent means $\theta_j$ and then infer the relevant features by looking at their posterior distributions. Bhattacharya et al. (2015) propose the Dirichlet-Laplace prior (another example of a scale mixture of normal distributions) and use k-means clustering with k = 2 over the expected posterior means, identifying two groups of features, respectively relevant and irrelevant. In our experiments of Chapter 4 we rely on the normal means problem formulation and use the regularised horseshoe prior combined with a few different methods (see Section 4.3).

1.3 Generalised linear models

These models have many important theoretical and computational properties. Since they are not of main interest for this thesis, here we give a brief and simple introduction; further information can be found in many statistical books, for example Hardin and Hilbe (2007).

The generalised linear model (GLM) generalises linear regression to a target variable that is not normally distributed, i.e. the likelihood is not Gaussian. Traditional examples are logistic regression and Poisson regression. The model distribution for the target variable is only constrained to belong to the exponential family, which includes, for example, the exponential, the Bernoulli, and the gamma distribution.

Referring to a single statistical unit and a vector of covariates $x$, a linear predictor $\eta = x^t\beta$ models the expected value of the response variable $Y$ through a link function $g(\cdot)$ as follows:

$$E[Y] = \mu = g^{-1}(\eta) \tag{1.26}$$


As an example, suppose we observe some binary output $y = (y_1, \dots, y_n)$ and a group of covariates $\{x_j\}_{j=1}^p$. A reasonable likelihood for these data is:

$$Y_i \mid \beta, x_i \overset{\text{iid}}{\sim} \text{Be}\!\left(g^{-1}(x_i^t\beta)\right), \qquad i = 1, \dots, n \tag{1.27}$$

The function $g(\cdot)$ plays the role of the link function. The expected value of a Bernoulli distribution is the probability of success, thus it is a value in the interval $[0, 1]$. Therefore, in this case the inverse link function $g^{-1}$ has to be a map from $\mathbb{R}$ to $[0, 1]$ in order to give a meaning to expression (1.27). A common choice is the logit function; more generally, the inverse link function can be any cumulative distribution function (e.g. the probit model corresponds to the standard normal cumulative distribution function).
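As a minimal illustration on simulated toy data (not the data used in the thesis), a logistic regression of this form can be fitted in R with the base glm function:

```r
# Toy logistic regression: binary target, two covariates, logit link
set.seed(1)
n   <- 200
x1  <- rnorm(n); x2 <- rnorm(n)
eta <- 0.5 + 1.0 * x1 - 0.8 * x2          # linear predictor (illustrative coefficients)
y   <- rbinom(n, 1, plogis(eta))          # plogis is the inverse logit, i.e. g^{-1}

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
summary(fit)$coefficients
```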

1.4 Note on notation and terminology

As is probably already evident, we use the terms feature, variable and covariate as synonyms, referring to the quantities $x_j$ of model (1.2). We call $\{\beta_j\}_{j=1}^p$ weights, signals or parameters. The quantities $n$ and $p$ always refer respectively to the sample size and the number of parameters in the model.

When we write the distributions of a model or of a list of different random variables on different rows, each row is assumed independent from the others. E.g.:

$$X \sim \pi_X$$
$$Y \sim \pi_Y$$
$$Z \sim \pi_Z$$

means that $X, Y, Z$ are jointly independent.

Even if it is not specified, we assume every probability distribution of a continuous random variable to be absolutely continuous with respect to the Lebesgue measure. As previously said, if we have a variable $Y \sim p(\cdot)$, we also write $Y \sim p(y)$ and we use the same expression $p(y)$ for the probability distribution and for the density evaluated at the point $y$; it will be clear from the context which of the two it is.


Chapter 2

Stability of feature selection algorithms

We include in our comparison an analysis of the stability of a feature selection procedure. We use the definitions and the methods described in Nogueira et al. (2017).

In the literature there have been several attempts to propose methods to study and estimate the stability of variable selection algorithms. Nogueira et al. (2017) propose a theoretical framework listing the properties that a stability measure, or estimator, should satisfy. They also propose their own estimator, which is a proper generalisation of several existing ones.

The task of feature selection is to identify a subset of features, which can be the minimal one with good enough predictive performance, or the whole subset of relevant features. The input of any given procedure is data and possibly some parameters of the selection algorithm. It is assumed that the data are a finite sample from a generating distribution. Different data samples can lead to different subsets of selected features; stability is the measure of such variation in the selection procedure. A typical approach, and what Nogueira et al. do, is to generate $M$ bootstrap samples of the data and perform the feature selection for each one of them. The estimate of the stability comes from the analysis of the variability of the results over the different bootstrap samples. Let $Z = \{S_1, \dots, S_M\}$ be the collection of feature sets selected from each of the $M$ bootstrap samples. Let $\hat\Phi$ be a function taking as input $Z$ and returning a stability value. It is possible to see $Z$ also as an $M \times d$ binary matrix, where $d$ is the dimension of the parameter space:

$$Z = \begin{pmatrix} z_{11} & z_{12} & \cdots \\ z_{21} & z_{22} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$

Each row is the result of a bootstrap selection and an entry equal to 1 corresponds to a selected feature. It is possible to link the selection of each feature, seen as a stochastic process, to a Bernoulli variable $Z_j$, with $j$ indexing each variable and $Z_j = 1$ meaning that the feature has been selected, whose mean can be estimated by the average of the $j$-th column of the matrix $Z$. Denote by $s_j^2$ the unbiased sample variance of $Z_j$. According to Nogueira et al. (2017), a good stability measure should have the following properties:

1. Fully defined. The stability estimator $\hat\Phi$ should be defined for any collection $Z$ of feature sets, thus allowing the total number of features selected to vary.

2. Strict monotonicity. The stability estimator $\hat\Phi$ should be a strictly decreasing function of the sample variances $s_j^2$ of the variables $Z_j$.

3. Bounds. The stability $\hat\Phi$ should be upper/lower bounded by constants not dependent on the overall number of features or on the number of features selected.

4. Maximum stability and deterministic selection. A measure should achieve its maximum if and only if all feature sets in $Z$ are identical.

5. Correction for chance. Under the Null Model of Feature Selection $H_0$, the expected value of $\hat\Phi$ should be constant.

Nogueira et al. propose the following stability measure and prove that it satisfies all the properties and that the estimator converges in distribution to a normal.

Definition 1. (Proposed Measure) The stability measure is defined as:

$$\hat\Phi(Z) = 1 - \frac{\frac{1}{d}\sum_{j=1}^{d} s_j^2}{\frac{\bar k}{d}\left(1 - \frac{\bar k}{d}\right)}$$

where $\bar k$ is the average number of selected features.
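A minimal R sketch of this estimator, computed directly from a binary selection matrix Z with one row per bootstrap selection and one column per feature; the toy matrix below is purely illustrative.

```r
# Stability estimate of Definition 1 from an M x d binary selection matrix Z
stability_hat <- function(Z) {
  d     <- ncol(Z)
  s2    <- apply(Z, 2, var)     # unbiased sample variance of each column
  k_bar <- mean(rowSums(Z))     # average number of selected features
  1 - mean(s2) / ((k_bar / d) * (1 - k_bar / d))
}

# Toy example: M = 4 bootstrap selections over d = 6 features
Z <- rbind(c(1, 1, 0, 0, 0, 0),
           c(1, 1, 0, 0, 0, 0),
           c(1, 0, 1, 0, 0, 0),
           c(1, 1, 0, 0, 0, 0))
stability_hat(Z)
```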

Theorem 1. (Asymptotic Distribution) As $M \to \infty$, the statistic $\hat\Phi$ weakly converges to a normal distribution, that is

$$\frac{\hat\Phi - \Phi}{\sqrt{v(\hat\Phi)}} \overset{L}{\to} N(0, 1)$$

The value $v(\hat\Phi)$ is an estimate of the variance of $\hat\Phi$; for further information see the cited work. This result allows us to build asymptotic confidence intervals at level $1 - \alpha$ of the form:

$$\left(\hat\Phi - z_{1-\frac{\alpha}{2}}\sqrt{v(\hat\Phi)},\ \hat\Phi + z_{1-\frac{\alpha}{2}}\sqrt{v(\hat\Phi)}\right)$$

where $z_{1-\frac{\alpha}{2}}$ is the inverse cumulative distribution function of a standard normal evaluated at $1 - \frac{\alpha}{2}$.

Nogueira et al. also report hypothesis testing results, in order to compare the stability of two different feature selection algorithms. In the next sections we use symmetric hypothesis tests of the form:

$$H_0: \Phi_1 = \Phi_2 \qquad \text{vs} \qquad H_1: \Phi_1 \neq \Phi_2$$

and thus, using the asymptotic distribution, the following theorem holds:

and thus, using the asymptotic distribution, it holds true the following the-orem:

Theorem 2. The test statistic for comparing stabilities is

$$T_M = \frac{\hat\Phi(Z_2) - \hat\Phi(Z_1)}{\sqrt{v(\hat\Phi(Z_1)) + v(\hat\Phi(Z_2))}}$$

Under $H_0$, the statistic $T_M$ asymptotically ($M \to \infty$) follows a standard normal distribution.
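Given two stability estimates and estimates of their variances (the variance estimator $v(\cdot)$ is not reproduced here, see Nogueira et al., 2017), the test of Theorem 2 reduces to a few lines of R; this is a minimal sketch with made-up input values.

```r
# Two-sided test of H0: Phi_1 = Phi_2 based on Theorem 2
stability_test <- function(phi1, phi2, v1, v2) {
  t_m <- (phi2 - phi1) / sqrt(v1 + v2)
  c(statistic = t_m, p_value = 2 * pnorm(-abs(t_m)))
}

stability_test(phi1 = 0.62, phi2 = 0.74, v1 = 0.0011, v2 = 0.0009)  # illustrative numbers
```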

2.1 Note on highly correlated features

We want to highlight that this estimator does not take into account possible correlation between the features. When the objective is the whole subset of relevant features, this is not a problem. On the other hand, when the goal is to find the minimal subset for predictive purposes, if there is a group of highly correlated features, it would be reasonable not to count as instability possible swaps of those features within the selected subset. Therefore, in this case the estimator can be used only if such groups of correlated features are not too large.


Chapter 3

On the Bayesian reference model

Generally, in Statistics a model is a mathematical description, or, better, an approximation of a true data generating process, to which we refer as the underlying phenomenon of interest. The term “reference” in the expression “reference model” encapsulates the idea of using the description of the data provided by the model as a point of reference in the variable selection process, instead of the unprocessed observed data.

In this chapter we aim at giving a motivation for how the reference model can improve the selection and, finally, we give some advice on how to fit such a model, assuming a Bayesian perspective.

3.1 Why the reference model for feature selection

The original idea of using a reference model to address the feature selection problem is already present in some works from the 1960s, such as Lindley (1968). In this thesis we take inspiration from the projection predictive approach (Piironen et al., 2018), initially introduced by Dupuis and Robert (2003), where the selection goal is to find the smallest model with good predictive performance.

Generally, model selection approaches attempt to solve the prediction and feature selection problems simultaneously. In the projection approach these steps are carried out in two separate stages: first the reference model, i.e. an encompassing model with good predictive performance, is fit, and then the possible submodels are explored. We explain the projection approach in more detail in Chapter 5. The assumption made is that the reference model is the best description available of future data, assuming the M-complete setting (Bernardo and Smith, 2009; Vehtari and Ojanen, 2012). Therefore, assuming we have a proper, possibly Bayesian, reference model, its posterior predictive distribution describes the underlying phenomenon of interest better than the observed data do. We can see this in the data plotted in Figure 3.1. The data generation model is taken from Piironen et al. (2018) and it is also used in the comparisons carried out in Chapters 4 and 5. Each statistical unit is generated as follows:

$$\begin{aligned} f &\sim N(0, 1) \\ Y \mid f &\sim N(f, 1) \\ X_j \mid f &\overset{\text{iid}}{\sim} N(\sqrt{\rho}\, f,\ 1 - \rho), \qquad j = 1, \dots, k \\ X_j \mid f &\overset{\text{iid}}{\sim} N(0, 1), \qquad j = k + 1, \dots, p \end{aligned} \tag{3.1}$$

with $\rho = 0.5$ and $k = 100$, for a total of 50 observations. The latent value $f$ is not observed and corresponds to the underlying phenomenon of interest, while $y$ is the corresponding noisy observation. The first $k$ covariates are truly predictive, whereas the others are uncorrelated with the output $Y$. In the left plot, the sample correlation is computed between each feature and both the variable $Y$ and the unobservable variable $f$; we observe that the noise in the observed values leads to less separation between relevant and spurious, i.e. irrelevant, features. The reference model is used to obtain an estimate $\hat f$ of $f$ through the posterior predictive mean, and the correlation between each feature and $\hat f$ is represented in the right plot. Looking at the marginal plots on the axes, we can see that the two groups of features are better separated when $\hat f$ is used instead of $y$. In this example the reference model used is (4.10), presented in Section 4.3.

3.2 Guidelines: how to devise the reference model

Building a proper reference model can be non-trivial. This is a central issue for the reference model approach in feature selection and there is no definite answer: it depends on the data and there are plenty of different choices. It requires fitting a Bayesian model, and therefore choosing a proper model and prior distribution. A good practice is to include all the features available and proceed in a fully Bayesian way. When the number of features is large, it is suggested to use some regularising prior in order to avoid overfitting and to get good prediction performance. Note that in fitting the reference model we do not care about feature selection. However, sometimes the number of features is large and computational resources are limited; in such cases it can be helpful to use some screening or dimension reduction techniques like PCA or supervised PCA (Bair et al., 2006; Piironen and Vehtari, 2017b).

Figure 3.1: Scatter plot of sample correlation of feature $x_j$ with noisy observed variable $y$ versus latent $f$ on the left and estimated $\hat f$ on the right. Features correlated with the target are highlighted in red. $\rho = 0.5$, $k = 100$, $p = 1000$ and $n = 50$.

3.2.1 Shrinkage priors: the regularised horseshoe prior

As we said, in models with a large number of parameters it is advisable to use shrinkage, or regularising, priors. The problem of the prior choice in high dimensional spaces has been largely studied in the literature, see for example Piironen and Vehtari (2017c) and Rockova et al. (2012), and sometimes these priors have also been used to address the selection, as mentioned in Section 1.2. Here we describe, in the context of the linear regression model (1.2), the regularised horseshoe prior, introduced by Piironen and Vehtari (2017c) as a variant of the horseshoe prior.

The standard horseshoe prior introduced by Carvalho et al. (2009) is defined as a mixture of normal distributions through a local ($\lambda_j$) and a global ($\tau$) scale parameter:

$$\begin{aligned} \beta_j \mid \lambda_j, \tau &\overset{\text{iid}}{\sim} N(0, \tau^2\lambda_j^2), \qquad j = 1, \dots, p \\ \lambda_j &\overset{\text{iid}}{\sim} C^+(0, 1) \end{aligned} \tag{3.2}$$


Figure 3.2: Plot of the shrinkage factor $\kappa$ using the functional form derived in Piironen and Vehtari (2017c) on the left and a histogram of samples from the prior on the right.

The hyperparameter $\tau$ is responsible for the global shrinkage: it can be fixed to a value $\tau_0$ or, more generally, it can have a hyperprior distribution. The heavy-tailed half-Cauchy prior for the local shrinkage $\lambda_j$ allows $\beta_j$ to have enough mass away from zero and thus lets the significant signals escape the shrinkage. Assuming a normal likelihood for $(y_1, \dots, y_n)$ as in (1.2) and the predictors being uncorrelated with zero mean and variances $s_j^2$, following Piironen and Vehtari (2017c) the mean of the posterior distribution of $\beta_j$ given the hyperparameters and the data can be written as:

$$\bar\beta_j = (1 - \kappa_j)\,\hat\beta_j \tag{3.3}$$

where $\hat\beta_j = (X^TX)^{-1}X^Ty$ is the maximum likelihood estimator (assuming $X^TX$ invertible) and:

$$\kappa_j = \frac{1}{1 + n\sigma^{-2}\tau^2 s_j^2\lambda_j^2} \tag{3.4}$$

We call $\kappa_j$ the shrinkage factor, since it describes how the coefficient $\beta_j$ is shrunk toward zero ($\kappa_j \to 1$) or not ($\kappa_j \to 0$). Result (3.4) is quite general: it holds for any prior that can be written as a normal mixture, see Piironen and Vehtari (2017c). When $\lambda_j$ has a half-Cauchy prior and $\tau$ and $\sigma$ are fixed, the resulting prior over $\kappa_j$ looks like a horseshoe (Figure 3.2). The result is that the shrinkage factor has its mass concentrated really close to zero or really close to one, mimicking almost complete or null shrinkage as in the spike-and-slab prior.
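A minimal R sketch of this behaviour, drawing $\kappa_j$ from its prior under (3.4) with illustrative fixed values of $n$, $\sigma$, $\tau$ and $s_j$ (not the settings of the thesis experiments): the histogram piles up near 0 and near 1, giving the horseshoe shape.

```r
# Prior draws of the shrinkage factor kappa_j in (3.4) with lambda_j ~ C+(0, 1)
set.seed(1)
n <- 50; sigma <- 1; tau <- 1; s_j <- 1        # illustrative fixed values
lambda <- abs(rcauchy(1e5))                    # half-Cauchy local scales
kappa  <- 1 / (1 + n * sigma^-2 * tau^2 * s_j^2 * lambda^2)
hist(kappa, breaks = 50)                       # mass concentrates near 0 and near 1
```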

Piironen and Vehtari (2017c) highlight one drawback of the horseshoe prior: very large signals are not shrunk enough, leading to problems when parameters are weakly identified by the data. Here the regularised horseshoe prior proposed by Piironen and Vehtari (2017c) comes to help: it introduces a sort of “slab width” to control the shrinkage of the largest coefficients. The regularised horseshoe prior is formulated as follows:

$$\begin{aligned} \beta_j \mid \lambda_j, \tau, c &\overset{\text{iid}}{\sim} N(0, \tau^2\tilde\lambda_j^2), \qquad j = 1, \dots, p \\ \tilde\lambda_j^2 &= \frac{c^2\lambda_j^2}{c^2 + \tau^2\lambda_j^2} \\ \lambda_j &\overset{\text{iid}}{\sim} C^+(0, 1) \end{aligned} \tag{3.5}$$

The intuition for this modified version of the horseshoe prior is that, as Piironen and Vehtari write, when $\tau^2\lambda_j^2 \ll c^2$, i.e. $\beta_j$ is not large, the prior is close to the original horseshoe, whereas when $\tau^2\lambda_j^2 \gg c^2$, i.e. for large signals, the prior approaches $N(0, c^2)$, leading to a regularising effect on $\beta_j$.

3.2.2 How to assess predictive performance

In the projection predictive approach and in our algorithm presented in Chapter 5, a central issue is the predictive evaluation of the reference model. The ideal validation consists in using a validation set, but most of the time the number of observations is not large enough and cross-validation is the only way to proceed. We can divide cross-validation into leave-one-out (LOO) and k-fold. The former gives less biased estimates, but requires more computational resources. Sometimes such burden can be eased with approximations such as PSIS-LOO (Vehtari et al., 2017), i.e. Pareto-smoothed importance sampling LOO. However, k-fold still remains a valid option: it can be relatively fast and it always cross-validates the entire fit, while PSIS-LOO cannot be used if the way the covariates are built varies from fold to fold (e.g. if they are principal components).

But how should the predictive performance be measured? Since the Bayesian framework returns a full predictive distribution, it is advisable to look at the expected log predictive density (elpd) (Vehtari et al., 2017; Bernardo and Smith, 2009; Gneiting and Raftery, 2007), defined as:

$$\text{elpd} = \int \log p(\tilde y \mid y)\, p_t(\tilde y)\, d\tilde y \tag{3.6}$$

with $p(\tilde y \mid y)$ the posterior predictive distribution of the model and $p_t(\tilde y)$ the true, unknown, data generation mechanism. The unknown quantity $p_t(\tilde y)$ can be approximated using the observed data as a sample from it. Therefore we can compute an estimate of the elpd using, for example, LOO cross-validation as follows:

$$\widehat{\text{elpd}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}) \tag{3.7}$$

where $p(y_i \mid y_{-i})$ denotes the out-of-sample posterior predictive density evaluated at the observed value $y_i$, conditioned on all the data except $y_i$ or, in the k-fold cross-validation case, on the data not belonging to the same fold as $i$.

Besides the elpd, it is useful to check the root mean squared error (RMSE) or the accuracy, i.e. the ratio of correctly classified data, for continuous and binary target variables respectively.
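The following R sketch shows the mechanics of the k-fold estimate (3.7) on toy data. For brevity the Bayesian posterior predictive is replaced here by a plug-in Gaussian density from lm(); in the thesis the reference models are fitted with Stan and the predictive density would come from the posterior draws.

```r
# k-fold estimate of the elpd (3.7) with a plug-in Gaussian predictive density
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat  <- data.frame(x = x, y = y)
K    <- 10
fold <- sample(rep(1:K, length.out = n))     # random fold assignment

log_pred <- numeric(n)
for (k in 1:K) {
  train <- dat[fold != k, ]
  test  <- dat[fold == k, ]
  fit   <- lm(y ~ x, data = train)           # stand-in for the Bayesian reference model
  mu    <- predict(fit, newdata = test)
  log_pred[fold == k] <- dnorm(test$y, mean = mu, sd = summary(fit)$sigma, log = TRUE)
}
elpd_kfold <- sum(log_pred)
elpd_kfold
```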


Chapter 4

Multiple hypothesis testing in the normal means problem

4.1 The normal means problem

A common example used to describe methods in the multiple hypothesis testing framework is the normal means estimation problem. The abstract formulation is the one in equation (1.25). Therefore, the model distribution for the z-values is:

$$Z_j \mid \theta_j \overset{\text{iid}}{\sim} N(\theta_j, \sigma_0^2), \qquad j = 1, \dots, p \tag{4.1}$$

where $p$ is also the number of covariates of some observed data from which the $z_j$ are usually computed. The goal of the inference is to estimate the parameters $\theta_j$, which are the means of the observations $z_j$. Note that $Z$ and $\theta$ are listed with the same index $j$, thus it is assumed that there is just one vector of observations, the z-values, of the same dimension as the parameter vector. An example of how to retrieve the z-values from observed data is again the gene expression microarray data described in the Introduction: a very common approach is to take the difference of the sample means of each gene between the positive and negative groups of patients divided by the pooled estimate of the standard error (two-sample t-statistic); for more details see Efron (2012). The resulting t-statistic can be transformed into normally distributed z-values by:

$$z_j = \Phi^{-1}\!\left(T_{n-2}(t_j)\right) \tag{4.2}$$

where $\Phi$ stands for the standard normal c.d.f., $T_{n-2}$ for the Student-t c.d.f. with $n - 2$ degrees of freedom, $n$ is the total number of observations, and $t_j$ is the computed two-sample t-statistic for the $j$-th gene. Note again that $n$ stands for the number of observations in the original data set, while $p$ stands for the number of covariates.

Another example is when there is a target random variable $Y$ and $p$ features $X_j$. The relevance of each feature can be assessed by looking at its correlation with $Y$. In such a case, if we call $\rho_j$ the true correlation between $X_j$ and $Y$ and assume we have a sample of $n$ observations $(y_i, x_{i1}, \dots, x_{ip})_{i=1}^n$, the Fisher transformation ($T_F$) of $r_j$, the sample correlation between the $n$-dimensional vector of observed target values $y$ and the observed $j$-th covariate values $x_j$, is approximately normally distributed and can be used as a z-value as follows:

$$\rho_j = \text{Cor}(X_j, Y), \qquad j = 1, \dots, p \tag{4.3}$$

$$r_j = \frac{\sum_{i=1}^{n}(x_{ij} - \bar x_j)(y_i - \bar y)}{(n-1)\,s_{x_j} s_y} \tag{4.4}$$

with $\bar x_j$ and $\bar y$ denoting the respective sample means. Define the Fisher transformation as:

$$T_F: (-1, 1) \to \mathbb{R}, \qquad T_F: x \mapsto \tanh^{-1}(x) \tag{4.5}$$

then the following approximation holds:

$$z_j = T_F(r_j) \approx N\!\left(T_F(\rho_j),\ \frac{1}{n-3}\right), \qquad j = 1, \dots, p \tag{4.6}$$
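A minimal R sketch of (4.4)–(4.6) on toy data: the z-values are the Fisher-transformed sample correlations, with approximate standard deviation $1/\sqrt{n-3}$.

```r
# Fisher-transformed sample correlations as z-values, following (4.4)-(4.6)
set.seed(1)
n <- 50; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)                    # toy target, correlated with the first feature

r    <- cor(X, y)[, 1]                    # sample correlations r_j
z    <- atanh(r)                          # Fisher transformation T_F
sd_z <- 1 / sqrt(n - 3)                   # approximate standard deviation of each z_j
p_values <- 2 * pnorm(-abs(z / sd_z))     # two-sided tests of rho_j = 0
```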

un-correlated with the target variable, the corresponding normal distribution is centred in zero. From a full Bayesian perspective, we can fit a regression model giving priors distribution to the parameters θj:

Zj|θj iid

∼ N (θj, 1

n − 3) j = 1, .., p (4.7)

θi ∼ π

and then proceed to complete the feature selection with some of the methods described in Section 1.2.2.

Our claim is that, if the underlying data are accessible, e.g. the $(y_i, x_{i1}, \dots, x_{ip})_{i=1}^n$ used to generate the $z_j$, the z-values can be computed using a reference model instead of the observed data, and this generally leads to better results in the selection. In the correlation example, it would be possible to proceed in the following way:


1. Fit a Bayesian model that leads to good predictive performance (the reference model), e.g. a linear regression model:

$$\begin{aligned} Y_i \mid \beta, \sigma, x_i &\overset{\text{iid}}{\sim} N(x_i^t\beta, \sigma^2), \qquad i = 1, \dots, n \\ \beta_j &\sim \pi_\beta, \qquad j = 1, \dots, p \\ \sigma &\sim \pi_\sigma \end{aligned}$$

2. Compute the posterior predictive distribution for each observation $i$:

$$p(Y_i \mid x_i, y) = \int p(Y_i \mid \beta, \sigma, x_i)\, p(\beta, \sigma \mid y)\, d\beta\, d\sigma$$

3. Choose a point estimator that fits the problem, e.g. the expected posterior predictive mean:

$$y_i^* = E[Y_i \mid x_i, y]$$

4. Finally, compute the z-values using $y_i^*$ instead of the observed data $y_i$:

$$r_j^* = \frac{\sum_{i=1}^{n}(x_{ij} - \bar x_j)(y_i^* - \bar y^*)}{(n-1)\,s_{x_j} s_{y^*}}, \qquad z_j^* = \tanh^{-1}(r_j^*)$$

Note that the predictive distributions are computed using the full posterior, therefore the information of each predicted value has already been used in the training set. Nevertheless, this is not a problem, since the main goal is not to assess the predictive performance for such observations, and we prefer introducing a little bias rather than incurring additional computational costs.

In the next sections we compare the performance and the stability, using the estimator described in Chapter 2, of a few different selection methods for the normal means problem using the reference model estimates $y_i^*$ versus the raw data $y_i$.

4.2 Simulated data

We carry out the comparison using simulated data. We simulate a scenario where there is a set of features $\{X_j\}_{j=1}^k$ correlated with a target variable $Y$ (the relevant ones) and a set of uncorrelated ones $\{X_j\}_{j=k+1}^p$ (the irrelevant ones). The correlated features are also correlated among themselves. We believe that this kind of data reflects many real-case scenarios. The data generating model is the same used in Piironen et al. (2018). This is accomplished through a latent variable $f$ and a given correlation level $\rho$, generating each statistical unit independently in the following way:

$$\begin{aligned} f &\sim N(0, 1) \\ Y \mid f &\sim N(f, 1) \\ X_j \mid f &\overset{\text{iid}}{\sim} N(\sqrt{\rho}\, f,\ 1 - \rho), \qquad j = 1, \dots, k \\ X_j \mid f &\overset{\text{iid}}{\sim} N(0, 1), \qquad j = k + 1, \dots, p \end{aligned} \tag{4.8}$$

where $k$ denotes the sparsity, i.e. the number of relevant features. It follows that:

$$\begin{aligned} \text{Cor}(X_j, X_k) &= \rho, \qquad \forall (j, k) \in \{1, \dots, k\} \\ \text{Cor}(X_j, Y) &= \sqrt{\rho}, \qquad j = 1, \dots, k \\ X_j &\perp Y, \qquad j = k + 1, \dots, p \end{aligned} \tag{4.9}$$

We simulate data with the combinations of parameters illustrated in Table 4.1.

    n      p      k     ρ
   50    1000    100   0.3
   50    1000    100   0.5
   70    1000    100   0.3
   70    1000    100   0.5
  100    1000    100   0.3
  100    1000    100   0.5

Table 4.1: Set of parameters used for the comparison in Section 4.3
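A minimal R sketch of this generating mechanism for a single data set; the default arguments correspond to the first row of Table 4.1.

```r
# Simulate one data set from (4.8): k relevant and p - k irrelevant features
simulate_data <- function(n = 50, p = 1000, k = 100, rho = 0.3) {
  f <- rnorm(n)                                          # latent variable
  y <- rnorm(n, mean = f, sd = 1)                        # noisy observed target
  X <- matrix(rnorm(n * p), n, p)                        # independent N(0, 1) features
  X[, 1:k] <- sqrt(rho) * f + sqrt(1 - rho) * X[, 1:k]   # relevant block, unit variance
  list(X = X, y = y, f = f)
}

set.seed(1)
dat <- simulate_data()
```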

The different numbers of observations and correlation levels vary the difficulty of identifying the relevant features. This can be seen in Figure 4.1, which shows an example of the generated data, comparing the sample correlation of each feature with the target variable $y$ and with the latent variable $f$. The correlation with $y$ is the actual noisy relationship that we can observe, while the correlation with $f$ is not observable and corresponds to an “oracle” point of view. It is possible to see from the rug plot (marginal plot) on the x-axis that smaller $n$ and a lower correlation level $\rho$ give more overlap between the relevant (red) and spurious (black) features, resulting in a more challenging problem.

Figure 4.1: Scatter plot of sample correlation of feature $x_j$ with noisy observed variable $y$ and latent $f$, respectively on the x and y axes. Features correlated with the target are highlighted in red. Data generated for different values of $n$ and $\rho$ according to Table 4.1.


4.3 Comparison

As we said, our claim is that, regardless of the methods chosen to address the problem of finding all the significant features, if a reference model can be used it leads to better results in the selection.

We show this using the data generation mechanism presented in Section 4.2. We formulate the selection problem as a normal means problem retrieved from the correlation level of each feature by the Fisher transformation (see approximation (4.6)), as explained in the second part of Section 4.1. We compare the results with and without the use of a reference model. For the reference model approach we compute correlations using the expected posterior predictive mean. The comparison can be summarised in the following way:

With reference model:

1. Fit the Bayesian reference model using linear regression: y_i = β^T x_i + ε_i, i = 1,..,n.
2. Compute the expected posterior predictive mean for each observation: y_i* = E[Y_i | x_i, y].
3. Compute the sample correlation r_j* between feature x_j and y*, and the score z_j* through the Fisher transformation: z_j* = T_F(r_j*).
4. If necessary, fit the Bayesian model for the normal means problem: Z_j | θ_j ~iid N(θ_j, 1/(n−3)), j = 1,..,p, with prior θ_j ∼ π.
5. Complete the feature selection procedure.

Only data:

1. Compute the sample correlation r_j between feature x_j and y, and the score z_j through the Fisher transformation: z_j = T_F(r_j).
2. If necessary, fit the Bayesian model for the normal means problem: Z_j | θ_j ~iid N(θ_j, 1/(n−3)), j = 1,..,p, with prior θ_j ∼ π.
3. Complete the feature selection procedure.

A minimal code sketch of the shared correlation and Fisher-transformation step is given after the two lists.
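The shared correlation and Fisher-transformation step can be sketched in a few lines of Python (a sketch of ours, not taken from the thesis; the reference model's posterior predictive means y* are assumed to be computed elsewhere, e.g. from the fitted Stan model):

```python
import numpy as np

def fisher_z(r):
    """Fisher transformation T_F(r) = atanh(r) = 0.5 * log((1 + r) / (1 - r))."""
    return np.arctanh(r)

def correlation_z_scores(X, target, scale=False):
    """Sample correlation of each column of X with `target`, Fisher-transformed.
    With scale=True the scores are multiplied by sqrt(n - 3), so that they are
    approximately N(sqrt(n - 3) * T_F(rho), 1), as used later in model (4.11)."""
    n, p = X.shape
    r = np.array([np.corrcoef(X[:, j], target)[0, 1] for j in range(p)])
    z = fisher_z(r)
    return np.sqrt(n - 3) * z if scale else z

# Only-data approach: z-scores from the noisy observations y.
# Reference approach: the same routine applied to the reference model's
# posterior predictive means y_star.
# z_data = correlation_z_scores(X, y)
# z_ref  = correlation_z_scores(X, y_star)
```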


The inference about the subset of relevant features is carried out with different approaches from the multiple hypothesis testing framework: we consider controlling the Q-value (Storey, 2003; Käll et al., 2007), which we discuss later in this section, posterior inclusion based on 80%, 90% and 95% credibility intervals, to which we refer simply as credibility intervals, and the local false discovery rate.

The reference model in these examples is a Bayesian regression on the first five supervised principal components (SPCs; Piironen and Vehtari, 2017b). We refer to the SPCs as u_j with j = 1,..,5, so the reference model can be written as follows:

$$
\begin{aligned}
Y_i \mid \beta, \sigma^2, u_i &\overset{\text{ind}}{\sim} N(u_i^T \beta, \sigma^2), \qquad i = 1, \dots, n \\
\beta_j \mid \tau &\overset{\text{iid}}{\sim} N(0, \tau^2), \qquad j = 1, \dots, 5 \\
\tau &\sim t^{+}_{4}(0, s_{\max}^{-2}) \\
\sigma &\sim t^{+}_{3}(0, 10)
\end{aligned}
\tag{4.10}
$$

where s_max denotes the sample standard deviation of the largest SPC. This model has already been proposed and used with the same data generation process by Piironen et al. (2018), where it showed good predictive performance, making it a good choice as reference model.

Due to the Fisher transformation, the computed z-scores are approximately distributed as Z ∼ N(T_F(ρ), (n − 3)^{-1}). This distribution is used as likelihood for two of the three methods in the comparison, i.e. credibility intervals and controlling the Q-value, where we fit a Bayesian model for the normal means problem. As mentioned in Section 1.2.2, it is common to use a sparsifying prior for this kind of problem. We thus fit the normal means model using the regularised horseshoe prior and, following the suggestions in Piironen and Vehtari (2017c), we scale the data to have unit variance. This is simply accomplished by multiplying the z_j by √(n − 3), resulting in the model distribution N(√(n − 3) T_F(ρ), 1). To avoid introducing too many letters, we still refer to these values with the letter z. The Bayesian model for the normal means problem with scaled z-values and regularised horseshoe prior is therefore:


$$
\begin{aligned}
Z_j \mid \theta_j &\overset{\text{iid}}{\sim} N(\theta_j, 1), \qquad j = 1, \dots, p \\
\theta_j \mid \lambda_j, \tau, c &\overset{\text{iid}}{\sim} N(0, \tau^2 \tilde{\lambda}_j^2), \qquad
\tilde{\lambda}_j^2 = \frac{c^2 \lambda_j^2}{c^2 + \tau^2 \lambda_j^2} \\
\lambda_j &\overset{\text{iid}}{\sim} C^{+}(0, 1) \\
\tau &\sim t^{+}_{3}(0, \tau_0) \\
c^2 &\sim \mathrm{IG}(\nu/2, \nu/2)
\end{aligned}
\tag{4.11}
$$

where the global scale τ_0 is chosen according to the indications in Piironen and Vehtari (2017c), with a prior guess of one effective nonzero parameter, in order to achieve stronger shrinkage.
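As an illustration, the following minimal Python sketch (ours, with hypothetical function names) computes such a global scale with the formula τ_0 = p_0/(p − p_0) · σ/√n of Piironen and Vehtari (2017c), together with the regularised shrinkage factor of (4.11); reading σ = 1 and n = 1 for the scaled normal means problem is our assumption:

```python
import numpy as np

def horseshoe_tau0(p, p0=1, sigma=1.0, n_obs=1):
    """Global scale tau0 = p0 / (p - p0) * sigma / sqrt(n_obs).
    For the scaled normal means problem each z_j is a single unit-variance
    observation of theta_j, so sigma = 1 and n_obs = 1."""
    return p0 / (p - p0) * sigma / np.sqrt(n_obs)

def regularised_shrinkage(lam, tau, c):
    """lambda_tilde^2 from (4.11): large signals are regularised towards a
    slab of scale c instead of escaping shrinkage completely."""
    return c**2 * lam**2 / (c**2 + tau**2 * lam**2)

tau0 = horseshoe_tau0(p=1000)   # about 0.001 for p = 1000 and a guess of p0 = 1
```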

We fit the models using Stan (Carpenter et al., 2017); the code can be found in Appendix A. In Figure 4.2, posterior 90% HPD intervals of the normal means model are plotted for both approaches, the reference one and the only-data one, after simulating data with ρ = 0.5 and n = 50. Relevant features are highlighted in red. It is already evident that the reference model posterior distributions achieve better separation. Nevertheless, we observe that some of the spurious parameters are not shrunk enough toward zero, and this makes the inference still difficult even when the reference model is used. Note that the parameters θ_j have been back-transformed to the correlation scale.

4.3.1 Controlling the Q-value

One of the possible ways to deal with multiple comparisons is controlling the false discovery rate (Benjamini and Hochberg, 1995). In the usual setting this requires specifying a null hypothesis corresponding to a null effect. This raises several issues and criticisms, such as the fact that, in real-world cases, it is hard to believe that an effect is exactly zero, and a point null is even less meaningful in a fully Bayesian setting. Gelman and Carlin (2014) advise instead to control the type-S (sign) and type-M (magnitude) errors. With this perspective, we control a Bayesian version of the Q-value (Storey, 2003), which is related to the pFDR, but in our case defined through the magnitudes of the effects.

Assume we are able to choose a threshold ρ_0 such that features whose absolute correlation with the target variable is below ρ_0 are considered irrelevant (or, better, poorly relevant). We define the posterior error probability of feature j as the posterior mass over values below the chosen threshold.


Note again that θ refers here to the parameters back-transformed to the correlation scale:
$$
PEP_j := P(|\theta_j| < \rho_0 \mid y)
\tag{4.12}
$$

Figure 4.2: Posterior intervals at level 90% for the Bayesian model for the normal means problem. Parameters have been back-transformed to the correlation scale. On the left the result using the reference model, on the right the one using only the observed data. The points (sample means) and intervals of the relevant features (in red) are better separated on the left than on the right.

Consider reordering the PEP_j values from smallest to largest,
$$
PEP_{(1)} \leq PEP_{(2)} \leq \dots \leq PEP_{(p)},
\tag{4.13}
$$
and define the cumulative mean of the first k ordered PEP_{(j)} as
$$
Q(k) := \frac{1}{k} \sum_{t=1}^{k} PEP_{(t)}.
\tag{4.14}
$$


Q(k) is therefore the estimated average error made when the first k ordered features are considered relevant.
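For concreteness, PEP_j and Q(k) can be computed directly from posterior draws of the back-transformed parameters θ_j; the following minimal Python sketch (ours, with hypothetical names) implements equations (4.12)–(4.14):

```python
import numpy as np

def q_values(theta_samples, rho0):
    """Posterior error probabilities and cumulative-mean Q-values.

    theta_samples: array of shape (n_draws, p) with posterior draws of the
    back-transformed correlations theta_j."""
    pep = np.mean(np.abs(theta_samples) < rho0, axis=0)    # PEP_j, eq. (4.12)
    order = np.argsort(pep)                                # rank features by PEP
    q = np.cumsum(pep[order]) / np.arange(1, len(pep) + 1) # Q(k), eq. (4.14)
    return order, q

# Selection at a chosen control level, e.g. Q = 0.05:
# order, q = q_values(theta_samples, rho0=0.2)
# selected = order[q <= 0.05]
```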

Figure 4.3 plots the relationship between the Q-values and the number of claimed discoveries k. Results are for n = 50, ρ = 0.3 and ρ_0 = 0.2, averaged over 20 data realisations. The dashed lines refer to the actual true FDR under the ranking induced by the PEP reordering. For any level at which we want to control the Q-value, using the reference model allows a larger number of discoveries. We also note that the estimated curve is closer to the true one for the reference model than when only the data are used. The Q-value curve of the reference approach first overestimates and then, for Q larger than 0.13, underestimates the actual FDR, while the curve of the data approach constantly overestimates it. Underestimating any kind of error is always undesirable, but since useful Q-values can hardly be considered outside the range (0, 0.2), the underestimation of the reference approach over those values is not too severe. Moreover, we have set the relevance correlation threshold at 0.2; changing this value moves the Q-value curves closer to or further away from the FDR curves, but in any case the reference model outperforms the standard approach based only on the data. This selection method requires choosing several thresholds, such as ρ_0 and the control rate Q, and making these choices is non-trivial.

It is interesting to note that the estimated Q-value curves proceed in parallel from around Q = 0.1. At that point, the number of discoveries is high enough that for the reference approach almost all the relevant features have already been selected, leaving out only the spurious ones, while for the only-data approach both the remaining relevant and irrelevant features have posterior distributions placing a comparable amount of mass close to zero. This can be seen qualitatively in Figure 4.2, where the shrinkage of the uncorrelated features is virtually the same for both approaches. Therefore the PEP contribution to the Q-value of further added covariates, when Q(k) ≳ 0.1, is equivalent for both methods, and the plotted curves in Figure 4.3 proceed in parallel.

Observe also that, even when the number of discoveries is almost zero, the estimated Q-value for the data approach does not reach the origin, while the reference model's curve starts exactly from the origin and grows concavely with a very steep initial trend. One could argue that this is due only to the Q-value estimator, but a similar behaviour can also be noticed for the actual false discovery rate curves.

The same analysis is repeated for the different values of correlation and number of observations listed in Table 4.1.


Figure 4.3: Number of discoveries versus the Q-value with ρ_0 = 0.2, n = 50 and ρ = 0.3. The solid lines refer to Q-value estimates, while the dashed ones refer to the actual true FDR under the ranking induced by the PEP reordering. The dashed vertical line corresponds to Q = 0.05. The result is the average over 20 data realisations.

We report the posterior intervals of the normal means model in Figure 4.4; the reference model approach achieves qualitatively better separation for every combination of the parameters used. Nevertheless, when the correlation is low, meaning covariates weakly correlated among themselves and also with the response variable, and the number of observations is small, the problem becomes much harder and neither of the two approaches is able to isolate the relevant features satisfactorily.

In Figure 4.5, the Q-value curves are reported for the different simulations, each using four different correlation thresholds ρ_0 (0.05, 0.1, 0.2, 0.3). The curves of the reference approach are marked with triangles. As in Figure 4.3, the more a curve lies towards the top-left, the better, i.e. more discoveries at an expected lower Q-value. Again, the reference model improves the results in almost all scenarios. The gap closes as n and ρ grow and the threshold ρ_0 diminishes; only for a few combinations does the data approach slightly outperform the reference one at some Q-values. However, in such cases the difference is small and it occurs only for quite large Q-values (almost always greater than 0.3), i.e. large estimates of the false discovery rate. Overall we think that the reference approach shows an important improvement, especially when the correlation is low (ρ = 0.3).


Figure 4.4: Posterior intervals at level 90%. Parameters have been back-transformed to the correlation scale. Pairs of columns correspond to different correlation levels (0.3 and 0.5), while rows correspond to different numbers of observations (50, 70 or 100).

Figure 4.5: Number of discoveries versus Q-value for different ρ_0 thresholds. Results after 20 data simulations.

Fixing the control of the Q-value at 0.05 gives a complete procedure for the selection of the features. In Figure 4.6 we compare the stability of each approach using the estimator and the results from Chapter 2, after 100 simulations. Point estimates and 95% confidence intervals are plotted for different values of the correlation threshold ρ_0 (on the y-axis).

Overall, improved stability can be observed for the reference approach. This is more evident with lower correlation levels and numbers of observations. The difference in stability is negligible almost every time when ρ = 0.5, except for n = 50 and ρ_0 = 0.3. Moreover, we observe that the uncertainty of the stability estimator for the reference approach is almost always lower than that for the data approach, the only exception being the case n = 50, ρ = 0.3 and ρ_0 = 0.3. Results of pairwise comparison tests are listed in Table 4.2: p-values and test statistics are reported along with the decision at the 0.95 significance level. The results show that using the reference model can bring benefits or, in the worst case, be neutral, but never a disadvantage, in the selection process.
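The stability estimator of Chapter 2 is not reproduced here; purely as an illustration of the general idea of measuring agreement between repeated selections, a simple pairwise Jaccard index could be computed as follows (this is a swapped-in, hypothetical illustration, not the estimator actually used in the comparison):

```python
import numpy as np
from itertools import combinations

def jaccard_stability(selections):
    """Average pairwise Jaccard similarity between selected feature subsets.
    `selections`: list of sets of selected feature indices, one per data realisation."""
    pairs = combinations(selections, 2)
    scores = [len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs]
    return float(np.mean(scores))

# stability_ref  = jaccard_stability(selected_sets_reference)
# stability_data = jaccard_stability(selected_sets_only_data)
```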

Figure 4.6: Stability comparison. Stability values on the x-axis, correlation thresholds ρ_0 on the y-axis. Rows correspond to different numbers of observations, columns to different correlation levels. Confidence intervals are at level 95% and results are after 100 data realisations.

4.3.2 Credibility intervals

To carry on with the comparison, we now look at posterior credibility intervals as an alternative selection procedure. Referring to Figure 4.4, this corresponds to selecting those features whose highest posterior density interval, at some credibility level, does not include zero. This is one of the simplest methods to address the feature selection problem, but, as a drawback, it does not take into account the multiple comparison problem, i.e. "the garden of forking paths".
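A minimal sketch (ours, with hypothetical names) of this selection rule from posterior draws, using the shortest-interval estimate of the HPD interval:

```python
import numpy as np

def hpd_interval(draws, level=0.90):
    """Shortest interval containing `level` posterior mass, estimated from MCMC draws."""
    x = np.sort(draws)
    m = int(np.ceil(level * len(x)))              # number of draws inside the interval
    widths = x[m - 1:] - x[:len(x) - m + 1]       # widths of all candidate intervals
    lo = int(np.argmin(widths))                   # shortest one
    return x[lo], x[lo + m - 1]

def select_by_interval(theta_samples, level=0.90):
    """Select features whose HPD interval (in the correlation scale) excludes zero."""
    intervals = np.array([hpd_interval(theta_samples[:, j], level)
                          for j in range(theta_samples.shape[1])])
    return np.where((intervals[:, 0] > 0) | (intervals[:, 1] < 0))[0]
```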
