
2.4 Classification

In this section we introduce some simple classification techniques. We present these algorithms because in the rest of the thesis we use them to evaluate our results, so a short description of each method is given in the following subsections.

2.4.1 Naive Bayes

The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular attribute of a class is unrelated to the presence (or absence) of any other attribute.

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood. Techniques similar to Laplacian smoothing [79] have been adopted in naive Bayes in order to estimate the probability of attribute values that never occur in the training set. This is particularly useful in text classification, where the classifier must handle words that appear in the test set but were absent from the training set. In spite of their naive design and apparently oversimplified assumptions, naive Bayes classifiers often work much better in complex real-world situations than one might expect. In the following we denote by n the number of attributes, by F_i a generic attribute and by C the class attribute.

Using Bayes’ theorem, we write

\[ p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)} \]

We can rewrite the previous equation in the following way:

\[ \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} \]

The evidence does not depend on the class, so it is a constant for all posteriors; therefore:

\[ \text{posterior} \propto \text{prior} \times \text{likelihood} \]

The naive Bayes algorithm uses a strong independence assumption: given the class, each attribute is independent of every other attribute. With this assumption we can rewrite the likelihood of the data as

\[ p(F_1, \dots, F_n \mid C) = \prod_{i=1}^{n} p(F_i \mid C) \]

and the posterior as

\[ \text{posterior} \propto p(C) \prod_{i=1}^{n} p(F_i \mid C) \]

One should notice that the independence assumption may lead to some unexpected results. Although the far-reaching independence assumptions are often inaccurate, the naive Bayes classifier has several properties that make it surprisingly useful in practice. In particular, the decoupling of the class-conditional attribute distributions means that each distribution can be estimated independently as a one-dimensional distribution. This in turn helps to alleviate the curse of dimensionality [39], whereby a high-dimensional feature space degrades the performance of the learning algorithm.
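To make the estimation concrete, the following is a minimal sketch (in Python, with purely illustrative names) of a categorical naive Bayes classifier: it combines the prior p(C) with the per-attribute likelihoods p(F_i|C) and applies add-one (Laplacian) smoothing so that attribute values never seen in the training set do not receive zero probability. It is a sketch of the general technique under these assumptions, not of a specific implementation used in this thesis.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSketch:
    """Categorical naive Bayes with add-one (Laplacian) smoothing."""

    def fit(self, X, y):
        # X: list of attribute-value tuples, y: list of class labels.
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        self.n_examples = len(y)
        # counts[c][i][v] = number of training examples of class c
        # whose i-th attribute takes value v.
        self.counts = {c: defaultdict(Counter) for c in self.classes}
        self.values = defaultdict(set)  # distinct observed values per attribute
        for xi, c in zip(X, y):
            for i, v in enumerate(xi):
                self.counts[c][i][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, x):
        # posterior ∝ p(C) * prod_i p(F_i | C), computed in log space.
        best_class, best_score = None, float("-inf")
        for c in self.classes:
            score = math.log(self.class_counts[c] / self.n_examples)  # log prior
            for i, v in enumerate(x):
                # Add-one smoothing: unseen values still get a small probability.
                num = self.counts[c][i][v] + 1
                den = self.class_counts[c] + len(self.values[i])
                score += math.log(num / den)
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

Working in log space avoids numerical underflow when the number of attributes is large.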

2.4.2 Decision Tree

Decision tree learning is a practical method for inductive inference. This technique learns a hierarchical structure in a recursive way. Examples of algorithms that use this learning strategy are CART, C4.5 and ID3 [65].


Figure 2.2: Example of Decision Tree

The decision tree approach recursively learns a set of partitions and sub-partitions. Starting from the main partition that contains all the training data, at each step every existing region is split into subregions. This recursive splitting procedure continues until a region meets a local termination criterion.

To split a region, a particular attribute is selected and its values are used to divide the region into subregions; when all regions have met the termination criterion, the decision tree stops learning partitions. For instance, the ID3 algorithm uses information theory to select, for each region, the best attribute on which to split it, as sketched below. Figure 2.2 shows an example of this procedure. The algorithm starts by analyzing the whole dataset and splits it on the attribute OUTLOOK. At each level the algorithm chooses a specific attribute to split the node: again in Figure 2.2, at the second level it chooses the attribute HUMIDITY for the left branch and the attribute WINDY for the right branch. The process is repeated until the termination criterion is met; in the example, the algorithm continues until each leaf contains only elements of the same class.
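As an illustration of how an attribute can be chosen to split a region, the following sketch (in Python; the function names are ours and purely illustrative) computes the entropy-based information gain that ID3 uses and selects the attribute that maximizes it.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Reduction in entropy obtained by splitting on one attribute."""
    base = entropy(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(sub) / len(labels) * entropy(sub)
                    for sub in subsets.values())
    return base - remainder

def best_attribute(rows, labels):
    """ID3-style choice: the attribute with the highest information gain."""
    n_attrs = len(rows[0])
    return max(range(n_attrs), key=lambda i: information_gain(rows, labels, i))
```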

From the learned tree we can extract a set of if-then rules to improve human readability. After the splitting phase is completed, a pruning procedure is usually employed: for instance, some approaches recombine adjacent regions in a bottom-up way, using the misclassification risk estimated by cross-validation to decide when to stop pruning. The decision tree approach is an interesting way to learn a model with intrinsic structure, but it has some limitations. The first is that, as the partitioning proceeds, the number of training observations in successive regions becomes smaller and smaller, and at the end of the process the remaining data may be insufficient to provide meaningful estimates. Another limitation regards the splitting procedure: in [72] it is shown that a small perturbation of the data can have a large influence on decision tree performance. An explanation of this phenomenon is the following: if the noise in the data affects the attributes that the decision tree chooses as splitting attributes, the resulting tree can change substantially and the performance of the algorithm can degrade.

2.4.3 Nearest Neighbors

The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms [65]. It finds the k nearest neighbors of a query point x_q in the training set, and then predicts the class label of x_q in the following way:

\[ f(x_q) = \arg\max_{c \in C} \sum_{i=1}^{k} \sigma(c, f(x_i)) \]

where:

• σ(a, b) = 1 if a = b and σ(a, b) = 0 otherwise

• f(x_i) is the function that returns the correct classification of an object x_i


• f(x_q) returns the classification of an object x_q obtained by this classifier.

An object is classified by a majority vote of its neighbors, the object being assigned to the most common class amongst its k nearest neighbors; k is a positive integer, typically small. If k = 1, then the object is simply assigned to the class of its nearest neighbor, and the resulting classification function is not very smooth. As k increases, the classification function becomes smoother. In binary (two-class) classification problems it is helpful to choose k to be an odd number, as this avoids tied votes. The same method can be used for regression, by simply assigning to the object the average of the values of its k nearest neighbors. It can be useful to weigh the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

In order to identify neighbors, the objects are represented by vectors in a multidimensional feature space. It is usual to use the Euclidean distance, though other distance measures, such as the Manhattan distance, could in principle be used instead. The k-nearest neighbor algorithm is sensitive to the local structure of the data. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

For this reason we can categorize this classifier as a lazy algorithm. In the actual classification phase, the test sample (whose class is not known) is represented as a vector in the same feature space; the distances from this new vector to all stored vectors are computed and the k closest samples are selected.
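The following is a minimal sketch of this procedure, assuming numerical feature vectors and the Euclidean distance (the names are illustrative):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numerical feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, xq, k=3):
    """Majority vote among the k stored training points closest to the query xq."""
    # "Training" is only storage; all the work happens at query time (lazy learning).
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Since all the work happens at query time, the cost of a single prediction grows linearly with the size of the stored training set.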

2.4.4 Artificial Neural Network

Artificial Neural Networks (ANNs) provide a robust method to approximate vector-valued target functions; in general, an ANN can be regarded as a universal function approximator. ANNs are well suited to problems in which the data contain a lot of noise. As far as data representation is concerned, instances are represented as numerical vectors of predefined features. The output of the target function may be discrete-valued or real-valued, and this learning method is also used in many regression tasks where the variable to predict is continuous. The basic building block of an ANN is the perceptron, which simply computes a weighted sum of the input vector. This yields a linear classifier, whose small hypothesis space does not allow the algorithm to manage data characterized by a non-linear structure.

\[ f(x) = \sum_{i=0}^{n} w_i \cdot x_i \]

The perceptron returns 1 if the weighted sum is greater than a given threshold, and -1 otherwise. This simple algorithm is well suited for binary classification problems in which the data space is linearly separable; for instance, it cannot handle the XOR problem [65]. To overcome the perceptron's inability to deal with non-linearly separable data, ANNs have been introduced. An ANN is a multi-layer network of perceptrons: in general the first layer is the input layer, followed by a non-predefined number of hidden layers and, at the end, an output layer that produces the final value for the class variable. This approach uses the backpropagation strategy to train the network: it uses gradient descent to search the hypothesis space. Gradient descent is useful because it allows the ANN to search through hypothesis spaces containing many different types of continuously parameterized hypotheses.

The backpropagation algorithm propagates the prediction error of the output layer backwards and proportionally redistributes it to all the nodes inside the network.
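As a concrete illustration of the basic building block, the following is a minimal perceptron sketch: a weighted sum followed by a threshold, trained with the classical error-driven update rule. The learning rate and the number of epochs are arbitrary illustrative values, and this is not an implementation of a full multi-layer network with backpropagation.

```python
def perceptron_output(weights, x):
    """Weighted sum of the input; x[0] is a constant 1 for the bias weight w0."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else -1

def train_perceptron(X, y, epochs=10, lr=0.1):
    """Classical perceptron rule: adjust the weights only on misclassified examples.
    X: feature vectors (without bias term), y: labels in {-1, +1}."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, target in zip(X, y):
            x = [1.0] + list(xi)                  # prepend bias input x0 = 1
            if perceptron_output(weights, x) != target:
                weights = [w + lr * target * xj for w, xj in zip(weights, x)]
    return weights
```

The update converges only when the data are linearly separable, which is exactly the limitation that motivates multi-layer networks trained with backpropagation.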


2.4.5 Support Vector Machine

The Support Vector Machine (SVM) was originally proposed by Vapnik [81] as a binary classifier. The SVM methodology can be employed for classification and regression problems with good generalization results. SVM is a supervised learning technique, and there are two stages in creating and applying an SVM classifier: training and prediction. In the training (or learning) stage, labeled instances are given in order to determine the support vectors that are used to predict the desired outputs. Also for this approach, instances are represented as numerical vectors of predefined features. The output of this phase is a set of support vectors, which constitute the prediction model. The underlying optimization problem introduces slack variables, which relax the constraints in order to obtain a soft separating hyperplane. To handle data that are not linearly separable, the kernel trick was introduced: the instance vectors are mapped, in general by a non-linear function, into a higher-dimensional space in which the data become linearly separable. Using a kernel function, the similarity between two objects is evaluated directly in the original feature space, avoiding the explicit construction of the new features. Several kernel functions have been proposed in the literature; the most common ones are the following:

• Polynomial (homogeneous): k(x, x') = (x · x')^d

• Polynomial (inhomogeneous): k(x, x') = (x · x' + 1)^d

• Radial basis function: k(x, x') = exp(−γ‖x − x'‖²), for γ > 0

• Gaussian radial basis function: k(x, x') = exp(−‖x − x'‖² / (2σ²))

• Hyperbolic tangent: k(x, x') = tanh(κ x · x' + c), for some (not every) κ > 0 and c < 0
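For concreteness, the kernels listed above can be written as plain NumPy functions; the default parameter values (d, γ, σ, κ, c) are arbitrary and only illustrative.

```python
import numpy as np

def poly_kernel(x, x2, d=2, c=0.0):
    """Polynomial kernel: homogeneous when c = 0, inhomogeneous when c = 1."""
    return (np.dot(x, x2) + c) ** d

def rbf_kernel(x, x2, gamma=0.5):
    """Radial basis function kernel, with gamma > 0."""
    return np.exp(-gamma * np.linalg.norm(np.asarray(x) - np.asarray(x2)) ** 2)

def gaussian_rbf_kernel(x, x2, sigma=1.0):
    """Gaussian RBF kernel, equivalent to rbf_kernel with gamma = 1 / (2 * sigma**2)."""
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(x2)) ** 2 / (2 * sigma ** 2))

def tanh_kernel(x, x2, kappa=1.0, c=-1.0):
    """Hyperbolic tangent (sigmoid) kernel, valid only for some kappa > 0 and c < 0."""
    return np.tanh(kappa * np.dot(x, x2) + c)
```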

In the prediction phase, the prediction model is applied to predict the class variable for previously unseen instance vectors; this is done using the generated hyperplane in the transformed feature space. Support Vector Machines are very useful and obtain good performance with respect to other state-of-the-art methods.
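A minimal sketch of the two stages (training and prediction) using the SVC implementation from scikit-learn, assuming that library is available; the toy data and the hyperparameter values are purely illustrative.

```python
from sklearn.svm import SVC

# Toy training set: two features per instance, binary labels (illustrative values only).
X_train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
y_train = [0, 0, 1, 1]

# Training stage: learn the support vectors with an RBF kernel and slack penalty C.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

# Prediction stage: classify previously unseen instance vectors.
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))
```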