A Composite Wrapper for Feature Selection


A Composite Wrapper for Feature Selection

Elena Roglia and Rosa Meo

Università di Torino - Dipartimento di Informatica

Abstract. In many bioinformatics problems the number of features is very large. In classification especially, feature selection plays an essential role and improves classification results.

In this paper we propose a new method for feature selection based on the ensemble approach. Our method is a composite wrapper: it simultaneously takes into account the sets of features selected by different classifiers.

Using a cross-validation strategy, each learner produces a feature ranking; we subsequently merge these rankings to determine the ideal subset. We report classification results on some biological datasets, considering several state-of-the-art classifiers and the features selected by our composite method.

Experiments demonstrate that the proposed method is often effective and overall results are encouraging.

Key words: Feature selection, wrapper, classification

1 Introduction

Feature selection is an essential part of bioinformatics classification problems: it consists in choosing a representative subset of the data attributes that can be used to correctly predict the target class of new examples. It has many beneficial effects: it reduces both measurement errors and the cost of extracting data from biological domains.

Attributes can be selected using two different approaches. The former assesses the feature subset entirely on the basis of general characteristics of the data, independently of the particular learner. It is called the filter method, because the attribute set is filtered to produce the most promising feature subset before the learning step. The latter evaluates the feature subset using the machine learning algorithm that will ultimately be employed for learning. This is the wrapper method, because the learning algorithm is wrapped into the selection procedure.

Filter methods use an evaluation function that relies solely on properties of the data and thus operates independently of any learning algorithm. Undesirable features are filtered out of the data before learning begins. These algorithms use heuristics based on general characteristics of the data to evaluate the merit of feature subsets. Wrapper methods use the learning algorithm to estimate the value of a given subset: they use the prediction performance of a given learning machine to assess the relative usefulness of subsets of features. In practice, one needs to define: (i) how to search the space of all possible subsets of features; (ii) how to assess the prediction performance of a learning machine in order to guide the search and halt it; and (iii) which predictor to use. A wrapper uses an induction algorithm along with a statistical re-sampling technique, such as cross-validation, to estimate the final accuracy of a learner on a given subset of features. This paper focuses on wrapper methods and on how to combine the predictions of different learners.

It is well known that learners often achieve better accuracy with wrapper methods, but they can reach very different accuracy results according to the features selected. Moreover, different learners in a wrapper approach do not determine a unique set of selected features. We believe this is misleading for the user, who wishes instead to understand which features are most relevant for the task, independently of the learner. From the analyst's viewpoint, it is more interesting to understand which features of the same dataset different learners decided to use. This set of features, built by taking into consideration not a single learner but a collection of learners, can give more insight into the specific utility of the features for the classification task and guarantees that the choice is not biased by a particular learning scheme. This is of particular importance in bioinformatics, in which the number of features is often huge and needs to be reduced to simplify both the data collection and integration tasks and the computational learning task.

Furthermore, the user/analyst often does not know the actual utility of each feature for the target and simply needs a tool able to suggest the core set of high-utility features, based on the evidence and the history of a set of learning sessions that worked on the same problem.

In this paper we propose an innovative scheme that we believe is beneficial because the feature selection is not biased by any single learner but considers multiple learners at the same time. A forward-search method with Gain Ratio is initially applied as a search and evaluation method on the whole attribute set. This first step is not guided by any learner but by the intrinsic characteristics of the data: it prefers the features that subdivide the data into partitions that are homogeneous in the target class. It produces a ranked feature list whose subsets are later evaluated using the learning schemes of multiple classifiers. This second step follows an ensemble approach and takes into consideration the features chosen by multiple learners: the ideal subset of features is obtained by merging the feature rankings obtained by different classifiers. In this paper we report results on the following classifiers: a multi-layer perceptron (MLP) [3], a radial basis function network (RBF), a support vector machine [12], a decision tree induced by the C4.5 algorithm [5] and a random forest (RF) [8].

The choice of the best subset of these features must, however, balance two goals: the number of features should be reduced as much as possible, but this reduction should not affect the classification performance. In this paper, we present our novel, composite procedure and its experimental results.

In our approach we aim to make a feature selection decision that is not biased by a specific learner but takes different learners into consideration.

We initially rank the features and select a feature subset that gives the best accuracy result for each isolated learner, by a simple wrapper method (we call this a single-learner wrapper). Then, we combine the rankings to form a unique set. The obtained set is thus the result of a collective choice of features made by the ensemble of the simple wrappers.

The aim of this paper is to demonstrate that ensemble feature selection is effective. In the experiments we present the classification accuracy obtained by many learners on the same feature subset selected by our composite wrapper.

These results are compared with those of the same learners on the feature subset determined by the single-learner wrapper. Other results concern accuracy on the whole set of features and accuracy after a filter feature selection. The experimental results will demonstrate the effectiveness of this new approach.

This paper is structured as follows. In Sections 1.1 and 1.2 we present an overview of feature subset search methods and classification algorithms. In Section 2 we present our composite wrapper method. In Section 3 we discuss the feature selection strategy and present the experimental results. Finally, Section 4 presents the main conclusions and future work.

1.1 Main approach to wrapper methodology

Feature selection wrapper methods try to find a subset of features according to their usefulness to a given predictor. The method conducts a search for a good subset using the learning algorithm itself as part of the evaluation function. Typically, the space of attribute subsets is searched greedily in one of two directions. The bottom-up strategy, called forward selection, starts with no attributes and adds them one at a time. The top-down strategy is backward elimination: it starts with the full set and deletes attributes one at a time. In forward selection, each attribute that is not already in the current subset is tentatively added and the resulting set is evaluated using, for instance, cross-validation. This evaluation produces a numeric measure of the expected performance of the classifier with the subset. Backward elimination operates in an analogous fashion, but subsets are evaluated in the reverse order.

One common wrapper method is greedy forward stepwise selection. To determine which feature to add at each step, a model is trained by adding each unused feature, one at a time, to the previously selected feature set. The feature that provides the best performance on the held-out feature selection set is added to the set. For an exhaustive description of the cited methods please refer to [2].
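
As an illustrative sketch (not the paper's Weka implementation), greedy forward stepwise selection with cross-validated accuracy might look as follows in Python with scikit-learn; the estimator and the arrays X, y are placeholders.

from sklearn.model_selection import cross_val_score

def forward_selection(estimator, X, y, n_folds=5):
    # Greedily add, at each step, the unused feature that most improves
    # cross-validated accuracy; stop when no feature improves the score.
    remaining = set(range(X.shape[1]))
    selected, best_score = [], 0.0
    improved = True
    while improved and remaining:
        improved = False
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y,
                                     cv=n_folds).mean() for f in remaining}
        best_f = max(scores, key=scores.get)
        if scores[best_f] > best_score:
            best_score = scores[best_f]
            selected.append(best_f)
            remaining.remove(best_f)
            improved = True
    return selected, best_score

Note that Function BaselineWrapper, presented in Section 2, differs from this generic scheme: it adds features in Gain Ratio order instead of re-evaluating all remaining candidates at every step.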

1.2 Overview of some state of the art classifiers

For the sake of completeness we introduce some of the basic characteristics of the adopted learning models. A complete overview can be found in [5], [8], [9].

Decision trees. A decision tree is a structure whose internal nodes apply a test condition based on the value of some of the record attributes. Each outcome of the test condition leads either to another internal node or to a leaf node, which contains a class value. That class value is the prediction of the decision tree for the set of data records that reach that final node. Thus, the outcomes of the attribute tests separate data records with different characteristics into disjoint partitions that are homogeneous in the class value.

The induction step in C4.5 follows a greedy strategy that grows a decision tree by progressively partitioning the training data into smaller partitions until each of them is homogeneous in the value of the class label. The C4.5 algorithm induces the form of the decision tree, i.e., it chooses the test condition at each node t using an entropy measure. Informally, the entropy of a dataset measures the degree of disorder in the class value of the data.

The best attribute test at node t is selected as the one that yields the largest difference between the entropy at the parent node t (before the test) and the entropy at the child nodes (after the test).
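
As a minimal usage sketch (assuming scikit-learn, whose CART trees differ from C4.5 in details such as pruning), an entropy-driven tree can be grown as follows; the dataset is a stand-in.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)  # entropy-based splits
print(tree.get_depth(), tree.score(X, y))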

Random forest. Significant improvements in classification accuracy have resulted from a family of methods called ensemble methods. They consist in generating multiple base classifiers from the training data and subsequently combining their predictions at test time. Random forest is a special ensemble learner, which is also suitable for problems involving a large number of features. In a random forest a large number of decision trees is grown, where each tree depends on the values of a random vector, sampled independently and with the same distribution for all trees. Random vectors are generated using an ensemble method (called bagging) which randomly selects N training examples, with replacement, from the original training set. Each tree in a random forest is grown at least partially at random, by injecting randomness in one of the following ways: (1) by growing each tree on a different random subsample of the training data; (2) by selecting the attribute test at any node partially at random. When multiple trees are generated, their predictions are usually combined by majority voting, so that the most popular class among them is predicted (where the majority is possibly weighted by giving more weight to the more accurate trees). It is this combination step that makes the learning an ensemble method.
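
A minimal scikit-learn sketch combining both sources of randomness described above; the F = log2(d) + 1 rule anticipates the discussion of [8] in Section 3, and the dataset is a stand-in.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
F = int(np.log2(X.shape[1])) + 1             # features tried per split, as in [8]
rf = RandomForestClassifier(n_estimators=100,
                            bootstrap=True,  # (1) random subsample of the data
                            max_features=F,  # (2) random attribute choice per node
                            random_state=0).fit(X, y)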

Multi-layer perceptron. A multi-layer perceptron is a feed-forward neural network with multiple layers of processing neurons. The neurons in one layer have directed connections to the neurons in the next layer. Each intermediate layer is a hidden layer. The network can use different activation functions, which determine how the output of a neuron at one layer is communicated to each neuron at the successive layer: linear, sigmoid or hyperbolic tangent functions. Each training example can be described by a vector in an m-dimensional space; the vector components are the input values of the neurons at the first layer.

The output of each neuron is determined by applying the activation function to the weighted sum of the node inputs. In turn, each node passes its output as input to the neurons at the next layer, and so on, until the output-layer neurons produce their prediction. The goal of the MLP learning algorithm is to determine a set of weights to be assigned to the connections between neurons.

The right weights minimize the total error at the output-layer neurons over the training set. The network is trained by supervised learning using the iterative back-propagation algorithm, according to the gradient descent rule [3].
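
A minimal sketch of such a network with scikit-learn; the layer size, activation and dataset are illustrative choices, not the paper's Weka configuration.

from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)      # stand-in dataset
mlp = MLPClassifier(hidden_layer_sizes=(20,),   # one hidden layer
                    activation="logistic",      # sigmoid activation
                    max_iter=2000, random_state=0).fit(X, y)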

Radial basis function networks. Radial basis function networks are feed-forward neural networks that differ from a multi-layer perceptron in the way the hidden units perform computations. Each hidden unit represents a particular point in the input space, just as each instance does. Its output for a given instance depends on the distance between its point and the instance point. A commonly used activation function is a bell-shaped Gaussian applied to the Euclidean distance between the input vector and the center of that unit. Hidden units are called RBFs because the points in the instance space for which a given hidden unit produces the same activation form a hypersphere or hyperellipsoid. While the hidden layer of an RBF network is non-linear, the output layer is linear. The parameters of the network are (a) the centers and widths of the RBFs and (b) the weights used to form the linear combination of the outputs obtained from the hidden layer. An excellent introduction to RBF networks is provided in [4].
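
Since the structure is just (a) Gaussian hidden units and (b) a linear output layer, a small RBF network can be sketched from scratch; k-means centers and a fixed width are illustrative assumptions, not the paper's training procedure.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def rbf_hidden_layer(X, centers, width):
    # Gaussian activation: exp(-||x - c||^2 / (2 * width^2)) per hidden unit
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_rbf_network(X, y, n_hidden=10, width=1.0):
    centers = KMeans(n_clusters=n_hidden, n_init=10,
                     random_state=0).fit(X).cluster_centers_   # (a) centers
    H = rbf_hidden_layer(X, centers, width)                    # non-linear layer
    output = LogisticRegression(max_iter=1000).fit(H, y)       # (b) linear layer
    return centers, output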

Support vector machine. The support vector machine was first introduced by Vladimir Vapnik in 1979 [14]. It is based on an algorithm that finds a special kind of linear model: the maximum-margin hyperplane. The main idea of an SVM is to construct a maximum-margin hyperplane as a linear discriminant function, in such a way that the margin of separation between classes is maximized. Support vector machines select a certain number of critical boundary instances, called support vectors, from each class and build a hyperplane that separates them as widely as possible. Support vectors are the instances that are closest to the maximum-margin hyperplane, and the set of support vectors uniquely defines the maximum-margin hyperplane for the learning problem. There is always at least one support vector for each class, and often there are more. Central to the construction of the support vector learning algorithm is the inner-product kernel between a support vector and a vector x drawn from the input space. The learning problem can be formulated as a convex optimization problem, for which efficient algorithms are available. A general description, including the generalization to the case in which the data is not linearly separable, has been published in [15].
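
A short sketch showing the support vectors of a fitted linear SVM (assuming scikit-learn; the data are a toy stand-in):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)  # toy two-class data
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_)  # the boundary instances that define the hyperplane
print(svm.n_support_)        # at least one support vector per class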

2 Our Composite Wrapper Method

In our approach the feature selection decision is taken according to an ensemble of learners. In this Section we discuss the pseudo-code of the method. Procedure CompositeWrapper returns a list of features (resultList) selected using a consensus strategy: it is the set of features selected by more than 20% of the classifiers used (i.e., with five classifiers, by at least two of them).

Function CompositeWrapper()
{
 1) counter[];             -- counter[i] is the number of classifiers
                           -- that selected feature number i
 2) num_classifier=0;      -- counts the number of classifiers used
 3) resultList=emptyList;
 4) for every classifier C do:     -- obtain a ranked list generated
                                   -- using a single-learner wrapper
 5)   subset=BaselineWrapper(C, nfolds);
                           -- nfolds is the number of folds of the cross-validation
 6)   for all features with index i from 1 to nTotFeatures do:
                           -- nTotFeatures is the total number of features
 7)     if subset contains feature number i then counter[i]++;
 8)     end if
 9)   end for;
10)   num_classifier++;
11) end for;
12) for all i from 1 to nTotFeatures do:
                           -- presence-percentage computation and discard step
13)   if (counter[i]/num_classifier) > 0.20 then
        resultList = resultList append <i>;
14)   end if
15) end for;
16) return resultList;
}

BaselineWrapper returns the best subset of features according to a single learning scheme. First it produces a ranking of features (step c); then it follows a forward-search method and takes an increasing portion of the top part of the feature ranking (steps g–n). At each step it evaluates the performance of the classifier with the given feature subset (steps i–m). The initial ranking is produced by evaluating each single feature with an objective function. Ranking of features is possible since a large number of feature evaluation measures are available (see, for instance, [10, 11] for a survey of some of them). We used a Gain Ratio single-attribute evaluator (function gainratio) with a five-fold cross-validation method (steps b–t). The Gain Ratio GR of feature fj is defined as follows:

\[ GR(f_j) = \frac{H(class) - H(class \mid f_j)}{H(f_j)} \tag{1} \]

where H(class) is the entropy of the target class in the training set, H(class | fj) is the entropy of the class given feature fj, and H(fj) is the entropy of feature fj. This step is similar to a filter method: no learner is involved, and the simple characteristics of the data determine the feature ranking.
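
As a concrete illustration, a minimal numpy sketch of Eq. (1) for a categorical feature is given below; f is a 1-D array of feature values and y the class labels. Continuous attributes would first need to be discretized; this is an assumption of the sketch, not a description of Weka's evaluator.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(f, y):
    h_class = entropy(y)                               # H(class)
    h_cond = sum((f == v).mean() * entropy(y[f == v])  # H(class | f)
                 for v in np.unique(f))
    return (h_class - h_cond) / entropy(f)             # divided by H(f)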

Function BaselineWrapper(classifier C, nfolds)
{
a) counter[];              -- counter[i] is the number of folds
                           -- in which the i-th feature is selected
b) for all j from 1 to nfolds do:
c)   rank[]=gainratio(j);  -- rank[i] is the index of the feature in the
                           -- i-th position of the ranking by Gain Ratio
d)   subset=emptySubset;
e)   bestSubset=emptySubset;
f)   minerror=100%;
g)   for all i from 1 to nTotFeatures do:    -- forward selection
h)     subset=subset union {rank[i]};
i)     error=classify(C, fold j, subset);
j)     if minerror>error then:
k)       minerror=error;
l)       bestSubset=subset;
m)     end if;
n)   end for;
o)   for all i from 1 to nTotFeatures do:
                           -- counts the number of folds in which the
                           -- i-th feature is present
p)     if bestSubset contains rank[i] then counter[i]++;
q)     end if;
r)   end for;
s) end for;
t) resultSet=emptySet;
u) for all i from 1 to nTotFeatures do:
                           -- select the attributes with a presence percentage
                           -- above 80% of the total number of folds
v)   if (counter[i]/nfolds) > 0.8 then
       resultSet=resultSet union {i};
w)   end if;
y) end for;
z) return resultSet;
}

Procedure BaselineWrapper is called for each learner and extracts the feature subset that gives the best accuracy result for that learner. The five-fold cross-validation produces a modified feature ranking with a score: the score assigned to each feature is the number of folds in which the feature has been selected by the wrapper (steps o–r).

In the following lines (t–y) we combine the feature subsets coming from the different folds into a unique set. Our rule is the following: from the ranking we select only the features with a presence percentage of more than 80% over the total number of folds (steps v–w). We then combine the rankings (the resultSet returned by Function BaselineWrapper) by taking the union of the selected features in Function CompositeWrapper.

We remove the features that have a presence percentage less than or equal to 20% (steps 13–14). Thus a unique ranking is generated (step 16). In the experiments, in which we evaluated the effectiveness of the feature subsets, classification accuracy was measured by passing the final feature subset to each classifier. These results are presented in the following experimental Section.

Since our method is not biased by a single classifier, we expect the subset of features generated by this procedure to be more meaningful and sound with respect to the specific domain. Indeed, the chosen feature subset is not due to a single, specific classification method. We verified, by comparison with the classification accuracy obtainable with a single learner, that classification performance improves. This is due to the reduction of redundancy in the instance description and to noise removal.
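
For concreteness, the consensus step of CompositeWrapper can be sketched in Python as follows; baseline_wrapper is a stand-in for Function BaselineWrapper and is assumed to return the set of feature indices selected for one learner.

from collections import Counter

def composite_wrapper(classifiers, baseline_wrapper, n_features,
                      n_folds=5, threshold=0.20):
    # Count, for each feature, how many single-learner wrappers kept it.
    votes = Counter()
    for clf in classifiers:
        votes.update(baseline_wrapper(clf, n_folds))
    # Keep the features chosen by more than `threshold` of the classifiers.
    return [i for i in range(n_features)
            if votes[i] / len(classifiers) > threshold]

With the five learners of Section 3 and threshold 0.20, a feature is kept as soon as two of the single-learner wrappers select it.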

3 Experiments

In order to illustrate the effectiveness of the procedure, we have chosen six datasets from MLRepository [17]: Ionosphere, Sonar, Hepatitis, Heart-Statlog, Upsala1, Upsala2. The main reason for using these datasets is that they differ significantly in the number of instances, the number of attributes, the amount of missing data and the biological domain.

Heart-Statlog is a dataset of 270 instances, with 13 categorical and real attributes and no missing values. The variable to be predicted is the presence or absence of heart disease.

Hepatitis is a dataset of 155 instances, with 19 categorical, integer and real attributes and 5.7% missing values. The task is to determine whether patients with hepatitis will live or die.

Ionosphere is a dataset of 351 instances, with 34 integer and real attributes and no missing values. The target class describes radar returns from the ionosphere. Radar data were collected in Goose Bay, Labrador, by a system consisting of 16 high-frequency antennas with a total transmitted power of about 6.4 kW. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of structure in the ionosphere; "bad" returns are those that do not return anything, i.e., their signals pass through the ionosphere. The 34 features comprise two attributes for each of 17 pulse numbers, corresponding to the complex values returned by the function describing the received electromagnetic signal. The target class is split into two classes: 225 objects belong to the good class and 126 objects belong to the bad class.

Sonar is a dataset of 208 instances, with 60 real attributes and no missing values. The task is to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock; the dataset thus consists of two classes. The first class contains 111 objects obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions. The second class contains 97 objects obtained from rocks under similar conditions. Each object is a set of 60 numbers in the range 0.0 to 1.0, so the data are 60-dimensional. Each number (feature) represents the energy within a particular frequency band, integrated over a certain period of time.

Upsala1 and Upsala2 [6] are two datasets composed of gene expression values from microarray experiments. Both datasets have 80 probe set IDs (real attributes) that we extracted from Cimino et al. [7] as a set of genes that could be used as independent prognostic markers and potential drug targets for breast cancer. In Upsala1 we considered 346 patients, whereas in Upsala2 we considered 249 patients. For these datasets the expression values were calculated from the raw .CEL files with the gcRMA Bioconductor function, and intensities were scaled on the median values for each gene. Neither dataset has missing values.

We tested the procedure using the default settings of the Weka suite of data mining and machine learning tools [12] and a ten-fold cross-validation.

We compared the classification accuracy obtained by an MLP, an RBF, a binary decision tree (C4.5), a random forest (RF) and John Platt's sequential minimal optimization algorithm for training a support vector classifier (SMO) [13] on each of the above datasets.

We report the experimental results in Table 1. It contains the number of features (#F), the number of features selected by our method (#CW), and the number of features selected by a simple wrapper feature selection method, guided by Gain Ratio, but without the feature-ranking combination phase. The results of this wrapper are obtained by coupling the wrapper with each classifier separately, and it is taken as an obvious baseline comparison method. For the sake of completeness we also compare the accuracy of our composite wrapper with another feature selection method based on ranking: we rank the attributes using a ReliefF evaluator. The key idea of Relief is to estimate the quality of features according to how well their values distinguish between instances of the same and of different classes that are near each other [16]. ReliefFAttributeEval used in conjunction with the Ranker search method of Weka generates a ranked list. ReliefFAttributeEval is instance-based: it samples instances randomly and checks nearby instances of the same and of different classes. The last column in Table 1 (ReliefF) is the number of features to retain for this method. It is computed as the average of the number of features selected by each simple wrapper method guided by Gain Ratio (the MLP–SMO columns).
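
To make the comparison concrete, the core weight update of the original binary-class Relief [16] can be sketched as follows (plain numpy, features assumed scaled to [0, 1]); the experiments themselves use Weka's ReliefFAttributeEval.

import numpy as np

def relief_scores(X, y, n_samples=100, seed=0):
    # Reward features that differ on the nearest miss and penalize
    # those that differ on the nearest hit of each sampled instance.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_samples):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance to all points
        dist[i] = np.inf                     # exclude the instance itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))
        miss = np.argmin(np.where(y != y[i], dist, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_samples
    return w                                 # higher score = better feature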

We compare the classification accuracy obtained by each method in the following situations:

1. when no feature selection occurs

2. when the feature selection is made by the simple baseline wrapper (Function BaselineWrapper),

3. when the feature selection is made by our composite wrapper method (Function CompositeWrapper), and

4. when the feature selection is made by the rank search method with ReliefF.

Dataset        #F  #CW  MLP  RBF  C4.5  RF  SMO  ReliefF
Heart-Statlog  13    6    3    8     4   5    5        5
Hepatitis      19    3    1   12     1   1    1        3
Ionosphere     34   11    7   11     7  20   12       11
Sonar          60   36   55    2    12  35   25       26
Upsala1        80   38   28    8    39  44   16       27
Upsala2        80    2    3    1     1   1    1        1

Table 1. Number of features selected by each method.

Tables 2, 3, 4, 5, 6 show the classification accuracy obtained for each of the above classifiers.

Bold font marks the method with the best classification accuracy on the dataset for a given learner; >> << marks the overall best classification accuracy on the dataset across all the learners.

MLP            No feature selection  Simple Wrapper  Composite Wrapper  ReliefF
Heart-Statlog  78.15 %               85.19 %         78.89 %            83.33 %
Hepatitis      80 %                  85.16 %         85.16 %            80.65 %
Ionosphere     91.17 %               91.45 %         93.73 %            >> 94.02 % <<
Sonar          82.21 %               83.17 %         >> 85.58 % <<      81.73 %
Upsala1        84.10 %               84.97 %         84.39 %            88.15 %
Upsala2        57.83 %               60.24 %         61.45 %            61.45 %

Table 2. Percentage of instances correctly classified by an MLP.

RBF            No feature selection  Simple Wrapper  Composite Wrapper  ReliefF
Heart-Statlog  84.07 %               85.19 %         82.59 %            82.96 %
Hepatitis      >> 85.81 % <<         85.16 %         83.87 %            80.65 %
Ionosphere     92.59 %               93.45 %         93.73 %            91.45 %
Sonar          72.12 %               75.96 %         76.44 %            77.40 %
Upsala1        65.32 %               59.83 %         58.38 %            69.07 %
Upsala2        59.84 %               >> 65.06 % <<   61.04 %            63.86 %

Table 3. Percentage of instances correctly classified by an RBF.

A comparison between the first and the third columns shows that the composite wrapper substantially improves the original classification accuracy (with no feature selection).

The multilayer perceptron always performs better than when no features are selected or a single wrapper is applied. SMO and RBF give encouraging results. Only when the random forest algorithm is applied does performance decrease. This is an expected result, because RF already performs a random feature selection embedded in the algorithm, which selects F features with F = log2(d) + 1, where d is the number of input features, as described in [8]. In a similar way, we noticed that C4.5 decision trees often work better with the single wrapper: the additional features that the composite wrapper allows may increase the mean depth of the trees, which in turn increases overfitting. Thus C4.5 does not seem robust with respect to the additional features that the composite wrapper allows.

C4.5           No feature selection  Simple Wrapper  Composite Wrapper  ReliefF
Heart-Statlog  76.67 %               81.48 %         79.63 %            81.48 %
Hepatitis      83.87 %               84.52 %         81.29 %            79.35 %
Ionosphere     91.45 %               92.88 %         92.59 %            92.31 %
Sonar          71.15 %               75 %            73.56 %            75 %
Upsala1        83.24 %               82.37 %         84.39 %            82.95 %
Upsala2        61.04 %               64.26 %         >> 64.26 % <<      64.26 %

Table 4. Percentage of instances correctly classified by C4.5.

RF             No feature selection  Simple Wrapper  Composite Wrapper  ReliefF
Heart-Statlog  78.15 %               79.63 %         80.37 %            81.48 %
Hepatitis      82.58 %               84.52 %         80 %               81.29 %
Ionosphere     92.88 %               92.59 %         92.59 %            93.73 %
Sonar          80.77 %               82.69 %         79.33 %            78.37 %
Upsala1        >> 95.38 % <<         92.20 %         95.09 %            93.93 %
Upsala2        63.86 %               63.05 %         63.05 %            57.03 %

Table 5. Percentage of instances correctly classified by an RF.

SMO            No feature selection  Simple Wrapper  Composite Wrapper  ReliefF
Heart-Statlog  84.07 %               84.81 %         >> 85.19 % <<      81.11 %
Hepatitis      85.16 %               84.52 %         84.52 %            79.35 %
Ionosphere     88.60 %               88.60 %         88.60 %            85.76 %
Sonar          75.96 %               78.37 %         79.81 %            76.44 %
Upsala1        62.72 %               61.56 %         62.43 %            66.19 %
Upsala2        60.24 %               64.26 %         64.26 %            64.26 %

Table 6. Percentage of instances correctly classified by SMO.

The ReliefF results are comparable with those of our composite wrapper method.

We performed a paired two-tailed t-test on the statistical significance of the difference in accuracy between the simple and composite wrappers. The null hypothesis is that the mean difference between the simple and composite wrapper is equal to zero. We observe that the differences are not statistically significant.
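
A sketch of this test with scipy, using the MLP accuracies of Table 2 as the paired samples (one pair per dataset):

from scipy.stats import ttest_rel

# Simple vs. composite wrapper accuracies for the MLP (Table 2), paired
# by dataset: Heart-Statlog, Hepatitis, Ionosphere, Sonar, Upsala1, Upsala2.
simple    = [85.19, 85.16, 91.45, 83.17, 84.97, 60.24]
composite = [78.89, 85.16, 93.73, 85.58, 84.39, 61.45]
t, p = ttest_rel(simple, composite)
print(f"t = {t:.3f}, p = {p:.3f}")  # a large p-value indicates the difference
                                    # is not statistically significant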

The results show that the composite wrapper obtains the best overall classification accuracy in the following cases:

Heart-Statlog: we obtain the best results with SMO.
Sonar: we obtain the best results with MLP.
Upsala2: we obtain the best results with SMO and C4.5.

The results indicate that the composite wrapper is a promising technique for feature selection. With the composite wrapper all the classifiers can achieve better performance than with the original data. The performance of the composite wrapper is comparable with the ReliefF method and, in some cases, it is even better.

Backward feature elimination. For the sake of completeness, observing that Hepatitis and Upsala1 register the highest classification accuracy when no feature selection is done, we evaluated a variant of the composite wrapper that replaces the forward search in the space of feature subsets with a greedy backward elimination method.

Table 7 shows the number of features selected by this variant of our method (#CWB), the number of features selected by a simple wrapper feature selection method with greedy backward elimination and without the feature-ranking combination phase, and the number of features used for a rank search with a ReliefF attribute evaluator. The latter is computed as the average of the values in the five classifier columns (MLP–SMO).

Dataset    #CWB  MLP  RBF  C4.5  RF  SMO  ReliefF
Hepatitis    16   12   10     2  17   12       11
Upsala1      80   78   75    16  80   53       60

Table 7. Number of features selected by each method with backward elimination.

Tables 8 and 9 show the classification accuracy obtained. For Hepatitis we obtain the best results with SMO, whereas for Upsala1, since the number of features selected is the same as the original feature set, the results do not change using our composite wrapper. This is a degenerate case due to the fact that backward elimination usually selects a bigger subset of suitable features compared to a forward search. The composite wrapper performs better than (or equal to) the single wrapper in numerous cases and obtains the best classification accuracy with SMO in the first case. The results for Upsala1 show that ReliefF used with a rank search method gives better performance compared with our composite wrapper. This is due to the fact that ReliefF, in contrast to the majority of heuristic measures for estimating the quality of attributes, can correctly estimate the quality of attributes in problems with strong dependencies between attributes: indeed, it does not assume the conditional independence of the attributes (given the target variable). We believe this is the case for Upsala1.

Hepatitis  No feature selection  Simple Wrapper  Composite Wrapper  ReliefF
MLP        80 %                  83.23 %         83.23 %            84.52 %
RBF        85.81 %               87.10 %         85.81 %            85.81 %
C4.5       83.87 %               79.35 %         81.94 %            79.35 %
RF         82.58 %               82.58 %         82.58 %            83.87 %
SMO        85.16 %               87.10 %         >> 87.10 % <<      83.23 %

Table 8. Classification accuracy for Hepatitis with the backward elimination method.

Upsala1  No feature selection  Simple Wrapper  Composite Wrapper  ReliefF
MLP      84.10 %               84.68 %         84.10 %            86.99 %
RBF      65.32 %               65.32 %         65.32 %            65.61 %
C4.5     83.24 %               82.95 %         83.24 %            84.10 %
RF       95.38 %               95.38 %         95.38 %            >> 95.66 % <<
SMO      62.72 %               63.87 %         62.72 %            61.27 %

Table 9. Classification accuracy for Upsala1 with the backward elimination method.

4 Conclusions

In this work we have presented a new ensemble method for feature selection. It is based on the multiple rankings of features that multiple types of learners produce with a wrapper approach. The ideal subset of features is produced by merging the rankings and selecting the attributes that are present in more than one ranking and in at least four of the five folds of the adopted cross-validation. In this paper we compared classification accuracy on several publicly available datasets, four of which are biological.

This procedure produces an improvement of the classification accuracy with respect to the absence of feature selection, and it selects a subset of attributes that is sound and grounded in the domain, since it is not biased by a particular choice of learner.

The results are encouraging and confirm that this procedure is valid for the majority of the classifiers (with the exception of learners such as RF and C4.5, in which feature selection is already embedded in the learning scheme).

The next step will be to investigate variants of this methodology, such as employing a backward elimination strategy, which in some cases might produce better results, and feature evaluation measures other than Gain Ratio, such as chi-square or Pearson correlation with the target class.

References

1. Elena Roglia, Rossella Cancelliere, and Rosa Meo. Classification of chestnuts with feature selection by noise resilient classifiers. European Symposium on Artificial Neural Networks, Bruges, April, 2008.

2. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

3. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pp. 318–362, 1986.

4. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

5. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1992.

6. Lance D. Miller, J. Smeds, J. George, V. B. Vega, L. Vergara, A. Ploner, Y. Pawitan, P. Hall, S. Klaar, E. T. Liu and J. Bergh. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci, 102, pp. 13550–13555, 2005.

7. D. Cimino, L. Fuso, C. Sfiligoi, N. Biglia, R. Ponzone, F. Maggiorotto, G. Russo, L. Cicatiello, A. Weisz, D. Taverna, P. Sismondi and M. De Bortoli. Identification of new genes associated with breast cancer progression by gene expression analysis of predefined sets of neoplastic tissues. Int J Cancer, 123, pp. 1327–1338, 2008.

8. Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.

9. P-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison- Wesley, 2005.

10. M. Dash and H. Liu. Feature selection for classification. Intell. Data Analysis, 1(3), 1997.

11. Ken McGarry. A survey of interestingness measures for knowledge discovery. Knowl. Eng. Rev., 20(1):39–61, 2005.

12. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition, 2005.

13. J. Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods - Support Vector Learning, B. Schoelkopf, C. Burges, and A. Smola, eds., MIT Press, 1998.

14. V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

15. C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

16. K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Proceedings of the Ninth International Workshop on Machine Learning, Aberdeen, Scotland, pp. 249–258. Morgan Kaufmann, 1992.

17. MLRepository. http://www.mlearn.ics.uci.edu/MLRepository.html
