Variable selection in robust model-based learning from contaminated data

Selezione di variabili nella stima robusta di modelli per dati contaminati
Andrea Cappozzo, Francesca Greselin, and Thomas Brendan Murphy
Abstract Several contributions to the recent literature have shown that supervised learning is greatly enhanced when only the most relevant features are selected for building the discrimination rule. Unfortunately, outliers and wrongly labelled units may undermine the determination of relevant predictors, and almost no dedicated methodologies have been developed to face this issue. In the present paper, we introduce a new robust variable selection approach that embeds a classifier within a greedy-forward procedure. An experiment on synthetic data is provided, to underline the benefits of the proposed method in comparison with non-robust solutions. Abstract Recenti risultati in letteratura hanno dimostrato che l'apprendimento supervisionato migliora notevolmente quando si scelgono le variabili più rilevanti per la costruzione della regola discriminante. La presenza di valori anomali e di unità erroneamente classificate nel learning set può severamente minare la determinazione dei predittori rilevanti e sfortunatamente quasi nessuna metodologia affronta questo problema. Il presente contributo propone un nuovo approccio robusto, che incorpora un classificatore all'interno di un metodo incrementale di selezione delle variabili. Risultati simulativi mostrano i vantaggi del nuovo metodo, in comparazione con soluzioni non robuste.
Key words: Variable Selection, Model-Based Classification, Label Noise, Outlier Detection, Wrapper Approach, Impartial Trimming, Robust Estimation
Andrea Cappozzo, Francesca Greselin
Department of Statistics and Quantitative Methods, University of Milano-Bicocca, e-mail: a.cappozzo@campus.unimib.it, francesca.greselin@unimib.it
Thomas Brendan Murphy
School of Mathematics & Statistics and Insight Research Centre, University College Dublin, e-mail: brendan.murphy@ucd.ie
1 Introduction
Nowadays, hundreds or thousands of variables per sample are available in fields like chemometrics, computer vision, engineering, genetics, and many other scientific domains. Feature selection techniques have been introduced in data analysis mainly to build simpler models that are easier for researchers and users to interpret, with shorter training times. Models based on the right selection of variables avoid the curse of dimensionality, reduce overfitting, and prevent identifiability problems that may arise in high-dimensional spaces. This has been known for a long time, as demonstrated by the dedicated literature reviews on the topic in machine learning, data mining, bioinformatics, genomics, and statistics. Surprisingly, the impact that outliers and wrongly labelled units have on the determination of relevant predictors has received far less attention. Indeed, contaminated data can heavily damage a classifier's performance [6], and most variable selection methods rely on the implicit assumption of dealing with an uncontaminated training set.
The present paper aims at filling this gap. We propose a new robust variable selection method for model-based classification, embedding a robust classifier, recently introduced in the literature, in a greedy-forward stepwise procedure for model selection. Section 2 recalls the problem of variable selection in model-based discriminant analysis and the Robust Eigenvalue Decomposition Discriminant Analysis (REDDA), and then introduces the robust variable selection technique. Section 3 presents the comparison of several feature selection procedures within a simulation study in an artificially contaminated scenario. A discussion of our results concludes the paper, outlining some remarks and future research directions.
2 Robust variable selection in model-based classification
Model-based discriminant analysis is a probabilistic framework for supervised classification, in which a classifier is built from a complete set of N learning observations (i.e., the training set):
(x, l) = {(x_1, l_1), ..., (x_N, l_N)};  x_n ∈ R^P,  l_n = (l_{n1}, ..., l_{nG})′ ∈ {0,1}^G;  n = 1, ..., N   (1)

where x_n is a P-dimensional continuous predictor and l_n is its associated class label, such that l_{ng} = 1 if observation n belongs to group g and 0 otherwise, with, clearly, ∑_{g=1}^{G} l_{ng} = 1 for all n = 1, ..., N. We assume that the prior probability of class g is τ_g > 0, with ∑_{g=1}^{G} τ_g = 1. The gth class-conditional density is modeled with a P-dimensional Gaussian distribution with mean vector μ_g ∈ R^P and covariance matrix Σ_g ∈ PD(P): x_n | l_{ng} = 1 ∼ N_P(μ_g, Σ_g). Therefore, the joint density of (x_n, l_n) is given by:

p(x_n, l_n; θ) = ∏_{g=1}^{G} [τ_g φ(x_n; μ_g, Σ_g)]^{l_{ng}}   (2)

where φ(·; μ_g, Σ_g) denotes the multivariate normal density and θ represents the collection of parameters to be estimated, θ = {τ_1, ..., τ_G, μ_1, ..., μ_G, Σ_1, ..., Σ_G}.
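As an illustration, the classification rule implied by the joint density above assigns a new observation to the class with the highest posterior score. The following is a minimal sketch (not the authors' implementation), assuming the class parameters are already known:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the MAP rule implied by the model: assign x to the class g
# maximizing tau_g * phi(x; mu_g, Sigma_g), i.e. the largest log-score.
def classify(x, taus, means, covs):
    scores = [np.log(t) + multivariate_normal.logpdf(x, m, S)
              for t, m, S in zip(taus, means, covs)]
    return int(np.argmax(scores))

# toy example with two well-separated Gaussian classes
taus = [0.5, 0.5]
means = [np.zeros(2), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
print(classify(np.array([0.2, -0.1]), taus, means, covs))  # → 0
```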
Eigenvalue Decomposition Discriminant Analysis (EDDA) is a family of classifiers developed from the probabilistic structure in (2), wherein different assumptions about the covariance matrices are considered. Particularly, EDDA is based on the following eigenvalue decomposition:
Σ_g = λ_g D_g A_g D_g′   (3)

where D_g is an orthogonal matrix of eigenvectors, A_g is a diagonal matrix such that |A_g| = 1, and λ_g = |Σ_g|^{1/P}. Allowing each parameter in (3) to be equal or different across groups, a family of 14 patterned models arises. To protect parameter estimates against label noise and outliers, [1] introduced a robust version of EDDA, called REDDA, by means of the maximization of a trimmed mixture log-likelihood [4], in which an impartial trimming level γ_l is enforced in the estimation procedure.
The next step is therefore to include a robust variable selection procedure within REDDA. We proceed in a stepwise manner, by considering the inclusion of extra variables into the model, and also the removal of existing variables from the model, one at a time, conditioning on their discriminating power. We start from the empty set and then, in each step of the algorithm, we partition the learning observations x_n, n = 1, ..., N, into three parts x_n = (x_n^c, x_n^p, x_n^o), where:

• x_n^c indicates the set of variables currently included in the model,
• x_n^p the variable proposed for inclusion,
• x_n^o the remaining variables.
To decide whether to include the proposed variable x_n^p, we compare the following two competing models:

• Grouping (M_GR): p(x_n | l_n) = p(x_n^c, x_n^p, x_n^o | l_n) = p(x_n^c, x_n^p | l_n) p(x_n^o | x_n^p, x_n^c)
• No Grouping (M_NG): p(x_n | l_n) = p(x_n^c, x_n^p, x_n^o | l_n) = p(x_n^c | l_n) p(x_n^p | x_n^r ⊆ x_n^c) p(x_n^o | x_n^p, x_n^c)

where x_n^r denotes a subset of the currently included variables x_n^c. The Grouping model specifies that x_n^p provides extra grouping information beyond that provided by x_n^c, whereas the No Grouping model specifies that x_n^p is conditionally independent of the group membership given x_n^r. We consider x_n^r in the conditional distribution because x_n^p might be related to only a subset of the grouping variables x_n^c [3].
The differences between the two models are graphically illustrated in Figure 1. The model structure of p(x_n^o | x_n^p, x_n^c) is assumed to be the same under both the Grouping and the No Grouping specification, and we let p(x_n^c, x_n^p | l_n) and p(x_n^c | l_n) be normal densities with parsimonious covariance structure. Additionally, we assume p(x_n^p | x_n^r ⊆ x_n^c) to be a normal linear regression model, as results from conditional multivariate normal means. The selection of which model to prefer is carried out employing a robust approximation to the Bayes factor B_{GR,NG}, given by the ratio between the integrated likelihoods of the two competing models. Along the lines of [5], twice the logarithm of B_{GR,NG} can be approximated with

2 log B_{GR,NG} ≈ BIC(Grouping) − BIC(No Grouping)   (4)
Fig. 1: Graphical representation of the Grouping (M_GR) and the No Grouping (M_NG) models
and a variable x_n^p with a positive difference BIC(Grouping) − BIC(No Grouping) is a candidate for being added to the model. A robust version of the BIC is employed here, to avoid the detrimental effect that class and attribute noise might produce in the variable selection procedure. The Trimmed BIC (TBIC), first introduced in [4], is employed as a robust proxy for the quantities in (4). Let us define:
TBIC(Grouping) = 2 ∑_{n=1}^{N} ζ(x_n^c, x_n^p) ∑_{g=1}^{G} l_{ng} log[ τ̂_g^{cp} φ(x_n^c, x_n^p; μ̂_g^{cp}, Σ̂_g^{cp}) ] − v^{cp} log(N*)   (5)

TBIC(No Grouping) = 2 ∑_{n=1}^{N} ι(x_n^c, x_n^p) ∑_{g=1}^{G} l_{ng} log[ τ̂_g^c φ(x_n^c; μ̂_g^c, Σ̂_g^c) ] − v^c log(N*)
                  + 2 ∑_{n=1}^{N} ι(x_n^c, x_n^p) log φ(x_n^p; α̂ + β̂′ x_n^r, σ̂²) − v^p log(N*)   (6)

The first sum in (5) is twice the trimmed maximized log-likelihood of p(x_n^c, x_n^p, l_n); the two sums in (6) are twice the trimmed maximized log-likelihoods of p(x_n^c, l_n) and of p(x_n^p | x_n^r ⊆ x_n^c), respectively.
The penalty terms v^{cp} and v^c indicate the number of parameters of a REDDA model estimated on the sets of variables (x_n^c, x_n^p) and x_n^c, respectively, while v^p accounts for the number of parameters in the linear regression of x_n^p on x_n^r. The 0-1 indicator functions ζ(·) and ι(·) identify the subset of observations that have null weight in the trimmed likelihood under the Grouping and No Grouping models, with N* = ∑_{n=1}^{N} ζ(x_n) = ∑_{n=1}^{N} ι(x_n). Accordingly, at each iteration of the procedure that leads to the final robust estimates, we discard the ⌊Nγ_l⌋ observations with the lowest contribution to the conditional likelihood, under the No Grouping model. Once the concentration step is enforced, the set of parameters is estimated on the retained observations, and a stepwise method is employed for automatically choosing the subset of regressors x_n^r.
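The trimming (concentration) step described above can be sketched as follows. This is a stylized, hypothetical helper (not the authors' code), assuming Gaussian class densities, known labels, and current parameter estimates; the actual estimation alternates this step with parameter updates:

```python
import numpy as np
from scipy.stats import multivariate_normal

# One concentration step: retain the N - floor(N * gamma_l) observations
# with the highest contribution to the conditional log-likelihood.
def concentration_step(X, labels, means, covs, taus, gamma_l):
    N = X.shape[0]
    n_trim = int(np.floor(N * gamma_l))
    # per-observation log-density under its own (labelled) class
    ll = np.array([
        np.log(taus[g]) + multivariate_normal.logpdf(X[n], means[g], covs[g])
        for n, g in enumerate(labels)
    ])
    keep = np.argsort(ll)[n_trim:]   # drop the n_trim smallest contributions
    return np.sort(keep)

# toy example: the gross outlier at (10, 10) is the one trimmed
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0]])
print(concentration_step(X, [0, 0, 0], [np.zeros(2)], [np.eye(2)], [1.0], 0.34))  # → [0 1]
```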
After each addition stage, we make use of the same procedure described above to check whether an already chosen variable in x_n^c should be removed: in this case x_n^p takes the role of the variable to be dropped, and a positive difference in terms of TBIC implies the exclusion of x_n^p from the set of currently included variables. The procedure iterates between the variable addition and removal stages until two consecutive steps have been rejected, then it stops. Notice that, whenever γ_l = 0, BIC and TBIC coincide and the entire approach reduces to the methodology described in [3].
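The resulting forward-backward search can be summarized with the following high-level sketch. The interface is hypothetical: `tbic_difference` stands in for the TBIC(Grouping) − TBIC(No Grouping) comparison, which the paper computes via (5) and (6):

```python
def stepwise_selection(variables, tbic_difference):
    """Greedy forward-backward search driven by a TBIC difference.

    tbic_difference(selected, v) returns TBIC(Grouping) - TBIC(No Grouping)
    for candidate variable v given the currently selected set; a positive
    value favours keeping v in the model.
    """
    selected = []
    rejected = 0
    while rejected < 2:
        # addition stage: best candidate not yet in the model
        candidates = [v for v in variables if v not in selected]
        scores = {v: tbic_difference(selected, v) for v in candidates}
        best = max(scores, key=scores.get, default=None)
        if best is not None and scores[best] > 0:
            selected.append(best)
            rejected = 0
        else:
            rejected += 1
            if rejected >= 2:
                break
        # removal stage: drop the weakest selected variable if its TBIC
        # difference is no longer positive
        scores = {v: tbic_difference([u for u in selected if u != v], v)
                  for v in selected}
        worst = min(scores, key=scores.get, default=None)
        if worst is not None and scores[worst] <= 0:
            selected.remove(worst)
            rejected = 0
        else:
            rejected += 1
    return selected

# toy usage: only variables 0 and 1 carry grouping information
demo = stepwise_selection(range(4), lambda sel, v: 1.0 if v in (0, 1) else -1.0)
print(sorted(demo))  # → [0, 1]
```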
3 Simulation study
The aim of this simulated example is to numerically assess the effectiveness of the new methodology, whilst investigating the effect that a (small) percentage of contamination has on standard variable selection procedures. We adopt the data generating process (DGP) in [3], and add some attribute and class noise to the original experiment. A total of B = 100 Monte Carlo (MC) experiments are conducted as follows. From the DGP outlined in [3], N = 500 units are generated and their group membership retained for constructing the training set, while M = 5000 unlabelled observations compose the test set. Subsequently, label noise is simulated by wrongly assigning 20 units coming from the fourth group to the third class. In addition, 5 uniformly distributed outliers, having squared Mahalanobis distances from μ_g greater than χ²_{3,0.975} for all g ∈ {1, 2, 3, 4}, are appended to the training set, with randomly assigned labels. These contaminations produce, in each MC replication, a total of 25 adulterated units, which account for slightly less than 5% of the entire learning set. We validate the performance of our novel method in correctly retrieving the relevant variables; the comparison is carried out considering the following methods:

• TBIC: the new robust stepwise greedy-forward approach via TBIC
• SRUW: stepwise greedy-forward approach via BIC [3]
• SelvarMix: variable selection in model-based discriminant analysis with a regularization approach [2].
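The outlier-generation part of the contamination scheme described above can be sketched as follows (an illustrative simplification, not the exact DGP of [3]): uniform candidate points are accepted only when far, in Mahalanobis distance, from every class mean. The helper name and the two-dimensional toy setup are assumptions for the example:

```python
import numpy as np
from scipy.stats import chi2

def draw_outliers(means, covs, n_out, low, high, rng):
    """Accept uniform draws whose squared Mahalanobis distance from
    every class mean exceeds the chi-square 0.975 quantile."""
    P = len(means[0])
    cutoff = chi2.ppf(0.975, df=P)
    inv_covs = [np.linalg.inv(S) for S in covs]
    out = []
    while len(out) < n_out:
        x = rng.uniform(low, high, size=P)
        d2 = [(x - m) @ S_inv @ (x - m) for m, S_inv in zip(means, inv_covs)]
        if min(d2) > cutoff:          # far from every class: keep as outlier
            out.append(x)
    return np.array(out)

rng = np.random.default_rng(1)
outs = draw_outliers([np.zeros(2), np.array([3.0, 3.0])],
                     [np.eye(2), np.eye(2)], n_out=5, low=-20.0, high=20.0, rng=rng)
```

In the paper's experiment the cutoff is χ²_{3,0.975} and the outliers receive randomly assigned labels before being appended to the training set.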
Once the important variables have been identified, the associated classifier (i.e., REDDA, with trimming level γ_l = 0.05, for the robust variable selection criterion; EDDA for the non-robust ones) is trained, and its predictive performance is assessed on the test set in terms of misclassification error.
Fig. 2: Boxplots of the misclassification error, varying variable selection and model-based classification methods (TBIC, REDDA No Var Sel, SelvarMix, EDDA No Var Sel, SRUW).
Table 1: Average misclassification errors, followed by their standard deviations.
Method       TBIC      REDDA NoVarSel  EDDA NoVarSel  SRUW      SelvarMix
Misc. Error  0.0409    0.051           0.073          0.072     0.0639
             (0.0026)  (0.0026)        (0.0026)       (0.0037)  (0.0028)
As reported in Figure 2 and Table 1, the robust TBIC approach achieves the lowest misclassification error, while the non-robust selection procedures are hampered by the presence of noisy variables, as also shown by the poor performance of EDDA with no feature selection. Further research will be devoted to the development of a methodology that automatically assesses the contamination rate present in a sample, as the a-priori specification of the trimming level still remains an open issue in this field, and is particularly delicate for high-dimensional data.
References
[1] A. Cappozzo, F. Greselin, and T. B. Murphy. A robust approach to model-based classification based on trimming and constraints. Advances in Data Analysis and Classification, 2019.
[2] G. Celeux, C. Maugis-Rabusseau, and M. Sedki. Variable selection in model-based clustering and discriminant analysis with a regularization approach. Advances in Data Analysis and Classification, 13(1):259–278, 2019.
[3] C. Maugis, G. Celeux, and M. L. Martin-Magniette. Variable selection in model-based discriminant analysis. Journal of Multivariate Analysis, 102(10):1374–1387, 2011.
[4] N. M. Neykov, P. Filzmoser, R. I. Dimova, and P. N. Neytchev. Robust fitting of mixtures using the trimmed likelihood estimator. Computational Statistics & Data Analysis, 52(1):299–308, 2007.
[5] A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178, 2006.