• Non ci sono risultati.

Contingency table between variable I and D

Contingency tables are used to record and analyze relationships between two variables, especially categorical variables. If we analyze two categorical variables, I (with values I1 . . . Im) and D (with values D1 . . . Dn ). We can build a contingency table with m rows (as many rows as the number of values of the variable I) and n columns (as many columns as the number of values of the variable D). We denote each cell of the contingency table with nij.

The value in each cell nij of the contingency table represents the counting of how many times the value Di appears with the value Ij.

D1 D2 ... Dj Dn Total

I1 n11 n12 ... n1j n1n I1 Total I2 n21 n22 ... n2j n2n I2 Total

... ... ... ... ... ... ...

Ii ni1 ni2 ... nij nin Ii Total

... ... ... ... ... ... ...

Im nm1 nm2 ... nmj nmn Im Total

Total D1 D2 Dj Dn Matrix

Total Total ... Total Total Total Figure 2.1: Contingency table between variable I and D.

This tool is very useful to study the distribution of two variables. Using this simple representation we can compute many interesting measures that can be useful to understand the association between variables.

2.2 Evaluation measures

2.2.1 χ

2

The χ2-statistic is a coefficient that can be helpful to interpret the relation-ship between two variables. It measures the degree of independence between two variables. The χ2-statistic coefficient is measured with the following formula:

χ2 = Xm

i=1

Xn j=1

(Oij − Eij)2 Eij

where

• Oij is the observed frequency;

• Eij is the expected (theoretical) frequency, asserted by the null hypoth-esis;

2.2. EVALUATION MEASURES

• m · n is the number of possible outcomes of each event.

Using the contingency table in Fig. 2.1 we can define the χ2-statistic in the following way:

χ2 = Xm

i=1

Xn j=1

(nij− (IiM atrixT otalT otal·DjT otal))2 (IiM atrixT otalT otal·DjT otal) In our case :

• nij is the observed frequency (Oij);

• (IiM atrixT otalT otal·DjT otal) is the expected (theoretical) frequency, asserted by the null hypothesis (Eij);

• m · n is the number of possible outcomes.

This statistic gives us a test to measure how much the observed frequency differs from the theoretical frequency obtained by the null hypothesis. The null hypothesis is the independence assumption where the expected frequency is equal to P (Ii) · P (Dj), where P (Ii) and P (Dj) are the two marginals. The χ2-statistic can be used to calculate a p-value by comparing the value of the statistic to the χ2 distribution. Using the previous hypothesis the degree of freedom is equal to: (number of rows - 1) · (number of columns - 1). Small values of the χ2 statistic indicate that the distribution of the two variables is very close to the one used in the null hypothesis. Otherwise big values of the χ2 statistic indicate that the distribution of the two variables is far from the one used in the null hypothesis.

2.2.2 Entropy

Entropy, in information theory, is a measure of uncertainty associated with a random variable[41]. The term in this field refers to the Shannon entropy:

the average amount of information needed to transmit the state of a random

variable. In the context of data mining, entropy quantifies the uncertainty in the distribution of a random variable. It is very useful to understand the distribution of a variable in a dataset. For this reason entropy, and measures derived by it, are employed in feature selection task.

Given a random variable D we denote its entropy as:

H(D) = − X

Dj∈D

P (Dj) log2(P (Dj))

using the previous example each DjT otal is equal to summation over the column. DjT otal = Pm

i=1nij

We introduced the concept of entropy in terms of average amount of information necessary to specify the state of a random variable. Entropy is not a properly measure derived from a contingency table. If we extend the computation of entropy to the case of two variables H(D, I), we can use a contingency table to compute this measure.

H(D, I) = − X

Dj∈D

X

Ii∈I

P (Dj, Ii) log2(P (Dj, Ii))

The entropy definition is also the basic block to understand the concept of Mutual information showed in the following section.

2.2.3 Mutual Information

In probability and information theory, the mutual information of two random variables is a quantity that measures the reciprocal dependence of the two variables.

Formally, the mutual information of two discrete random variables I and D can be defined as:

MI(I; D) = X

2.2. EVALUATION MEASURES where P (Ii, Dj) is the joint probability distribution function of I and D, and P (Ii) and P (Dj) are the marginal probability distribution functions of I and D respectively.

Mutual information can be equivalently expressed as

MI(I; D) = H(I) − H(I|D)

= H(D) − H(D|I)

= H(I) + H(D) − H(I, D)

= H(I, D) − H(I|D) − H(D|I)

where H(I) and H(D) are the marginal entropies, H(I|D) and H(D|I) are the conditional entropies, and H(I, D) is the joint entropy of X and Y.

The conditional entropy of a variable I given the variable D is defined as follows:

H(I|D) = − X

Dj∈D

P (Dj)X

Ii∈I

P (Ii|Dj) log2P (Ii|Dj)

Intuitively, if entropy H(I) is the measure of uncertainty about a random variable I, then H(I|D) is the amount of information that D does not convey on I. This is the amount of uncertainty remaining about I when D is known.

Thus the right side of the first of these equalities can be read as the amount of uncertainty in I, minus the amount of uncertainty in I which remains after D is known, which is equivalent to the amount of uncertainty in I which is removed by knowing D. This corroborates the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other. Mutual Information is a symmetrical measure and It can also be expressed as a Kullback-Leibler divergence, of the product P (I) × P (D) of the marginal distributions of the two random variables I and D, from the joint distribution P(I,D) of the two random variables:

MI(I; D) = DivKL(P (I, D)kP (I)P (D)) where DivKL(P kQ) is equal to

DKL(P kQ) =X

i

P (i) log2 P (i) Q(i) Furthermore, let p(I|D) = p(I, D) / p(D). Then

MI(I; D) = X

Dj∈D

p(Dj)X

Ii∈I

p(Ii|Dj) log2 p(Ii|Dj) p(Ii)

= X

Dj∈D

p(Dj) DivKL(p(Ii|Dj)kp(Ii))

= ED{DivKL(p(Ii|Dj)kp(Ii))}

Mutual information can also be understood as the expectation of the Kullback-Leibler divergence of the univariate distribution P (I) of I from the conditional distribution P (I|D) of I given the D: more different are the distributions P (I|D) and P (I) the greater is the mutual information.

2.2.4 Symmetrical Uncertainty

Another important measure is the Symmetrical Uncertainty [42]. This mea-sure is a normalized version of mutual information. Symmetrical Uncertainty is defined as follows:

SU(I, D) = 2 · MI(I; D) H(I) + H(D)

This measure varies between 0 and 1 (1 indicates that knowledge of the value of either I or D completely predicts the value of the other variable; 0 indicates that I and D are independent). The advantage of normalized version of mutual information is that this measure is not biased by the number of

2.2. EVALUATION MEASURES values of each attribute as occurs for entropy and mutual information.

2.2.5 Normalize Mutual Information

Normalized Mutual Information [78] is another normalized measure obtained from the mutual Information.

Normalized Mutual Information is defined as follows:

NMI(I, D) = MI(I; D) pH(I) × H(D)

This measure varies between 0 and 1 (1 indicates that knowledge of the value of either I or D completely predicts the value of the other variable; 0 indicates that I and D are independent). This measure has the same advantages of the Symmetrical Uncertainty.

2.2.6 Goodman-Kruskal τ

Goodman-Kruskal τ can be described as a measure of proportional reduc-tion in the predicreduc-tion error [36]. It describes the associareduc-tion between two categorical variables - a dependent and an independent one - in terms of the increase in the probability of correct prediction of the category of the depen-dent variable when information is supplied about the category of the other, independent variable. τ is intended for use with any categorical variable while other measures (eg., γ) are intended for ordinal ones.

In order to review the meaning of τ consider the table in Figure 2.1.

τ determines the predictive power of a variable I, considered as an inde-pendent variable, for the prediction of the deinde-pendent variable D. The predic-tion power of I is computed as a funcpredic-tion of the error in the classificapredic-tion of D.

The prediction error in the classification of D is first computed when we do not have any knowledge on the variable I. ED denotes this error here. The

reduction of this error allowed by I is obtained by subtraction from ED of the error in the classification of D that we make when we have the knowledge of the value of I in any database example. ED|I denotes this latter error. The proportional reduction in prediction error of D given I, here called τI D, is computed by:

ED and ED|I are computed by a predictor which uses information from the cross-classification frequencies and tries to reduce as much as possible the prediction error. In the prediction, it also preserves the dependent vari-able distribution (relative frequencies of the predicted categories Dj) in the following way: when no knowledge is given on the category of I, category Dj is predicted with the relative frequency determined by DT otaljT otal; otherwise, when it is known that Ii is the category of variable I, category Dj is pre-dicted with the relative frequency determined by I nij

iT otal. Therefore, error ED We can see that τ satisfies many desirable properties for a measure of association. For instance, it is invariant by rows and columns permutation.

Secondarily, it seems a good measure for the association between two fea-tures since it takes values between (0,1) and it is 0 if and only if there is independence between the features (solves the problem of zero uniqueness).

Furthermore, it has the benefit that can be interpreted since it has an oper-ational meaning - it corresponds to the relative reduction in the prediction

2.3. FEATURE SELECTION error in a way that preserves the class distribution.

2.3 Feature Selection

Feature selection is an important area of research in data mining and ma-chine learning communities. Nowadays data are represented by hundred to tens of thousands of variables (features). For this reason the task of extract relevant and not redundant subset of features is very important. In [39] an introduction to variable and feature selection is found. There are different approaches to feature selection and different situations in which feature selec-tion can be applied. In general feature selecselec-tion was introduced in machine learning in the supervised setting. But nowadays there are studies about feature selection also for the unsupervised setting. The two problems are different from each other. In the supervised analyses we can consider the class information that guides the search of a suitable subset of features. In the unsupervised analyses we can use only the information on how the data are distributed. An example of a feature selection for unsupervised learning is the work presented in [95]. We can obtain a first categorization of the feature selection approaches:

• supervised feature selection

• unsupervised feature selection

2.3.1 Supervised Feature Selection

Many works on feature selection were about the supervised approach. In this setting we can use two different approaches:

1 Rank all the attributes w.r.t. their correlation with the class label.

With this method we can obtain a ranking of the features and filter those attributes with a value over a user-defined threshold.

2 Exploring the possible subsets of the features and choose the one that maximizes the accuracy of a given classifier.

The first one is named Filter approach. In this setting all features are ranked w.r.t. to the class. After the rank step some approaches, for instance [90], apply a second step that tries to erase redundant attributes in order to remove possible noise.

The second one is named Wrapper approach. In this setting a classifier is combined with a search mechanism that tries to explore the features space.

The space is defined by the set of all possible feature subsets. Each subset chosen by the search method is evaluated by a predefined classifier. The search mechanism is used like a wrapper for the classifier. The subset that obtains the best accuracy result w.r.t. the chosen classifier is the subset of features returned by the Wrapper feature selection approach.

2.3.2 Unsupervised Feature Selection

When no knowledge about the class variable is supplied, supervised feature selection cannot be applied. Feature selection for clustering or in general for unsupervised learning is a very important problem. This task is important because today huge quantities of data are available (for example web data), and all these data do not carry information on the belonging class. There are approaches that try to manage this problem [10], [86]. In recent years some interesting feature selection algorithms for unsupervised learning have been proposed. These new methods build a graph representation of the dataset. This graph representation is an embedding of the density of the dataset. Using algebraic properties that hold on the matrix that represents the connections between the nodes in the graph they are able to resolve the feature selection problem for unsupervised learning. An example of this approach is presented in [95].

2.4. CLASSIFICATION

2.4 Classification

In this section we introduce some simple classification techniques. We in-troduce these algorithms because in the rest of the thesis we use them to evaluate our results. For this reason a short presentation of these methods is given in the following section.

2.4.1 Naive Bayes

The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions. In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular attribute of a class is unrelated to the presence (or absence) of any other attribute.

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood. The techniques similar to Lapla-cian Smoothing [79] have been adopted in naive Bayes in order to estimate the probability of occurrence of attribute values that have never been met in the training set examples. This is particular useful in text classification when the classification algorithm manages in the test set words that were absent in the training set. In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in many com-plex real-world situations than one might expect. In this explanation we denote with n the number of attribute, Fi a generic attribute and C the class attribute.

Using Bayes’ theorem, we write

p(C|F1, . . . , Fn) = p(C) p(F1, . . . , Fn|C) p(F1, . . . , Fn)

We can rewrite the previous equation in the following way:

posterior = prior × likelihood evidence

The evidence is fixed for each posterior, for this reason:

posterior ∝ prior × likelihood

The naive Bayes algorithm use a strong independence assumption: each attribute is independent from each other attribute. With this assumption we can rewrite the likelihood of the data in this way:

p(F1, . . . , Fn|C) = Yn

i=1

p(Fi|C) And the posterior in this way:

posterior ∝ p(C) Yn i=1

p(Fi|C)

One should notice that the independence assumption may lead to some unexpected results. Despite the fact that the far-reaching independence as-sumptions are often inaccurate, the naive Bayes classifier has several proper-ties that make it surprisingly useful in practice. In particular, the decoupling of the class conditional attribute distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate the problem of curse of dimensionality [39], in which the high-dimensional space degrades the performance of the learning algo-rithm.

2.4.2 Decision Tree

Decision tree learning is a practical method for inductive inference. This technique learns a hierarchical structure in a recursive way. Example of

al-2.4. CLASSIFICATION

Figure 2.2: Example of Decision Tree

gorithms that use this learning strategy are CART, C4.5 and ID3 [65]. The Decsion tree approach learns in a recursive way a set of partitions and sub-paritions. Starting from the main partition that contains all training data, at each step every existing region is split into subregions. This recursive split-ting procedure is continued until a region meets a local termination criterion.

Then It applies a first split using the values of a chosen attribute. When all regions have met the termination criterion, the decision tree stops to learn partitions. To split the regions a particular attribute is selected and its values are used to split the region in subregions. For instance ID3 algorithm uses information theory to select the best attribute, given a region, to split the region. In figure 2.4.2 there is an example of this procedure. The algorithm starts to analyze the whole dataset, then the decision tree splits the dataset choosing the attribute OUTLOOK. At each level the algorithm chooses a specific attribute to split the node. For instance always in figure 2.4.2, at the second level the algorithm chooses the attribute HUMIDITY for the left branch and the attribute WINDY for the right branch. Generally the process is repeated until the termination criterion is met. In the example the algorithm went on until each leaf contains all elements from the same class.

From the learned trees we can extract a sets of if-then rules to improve human readability. After the splitting phase is completed a pruning procedure is usually employed. For instance some approaches recombines adjacent regions in a bottom-up way using misclassification risk from cross-validation in order to determine when to stop the pruning. The decision tree approach is an interesting way to learn a model with intrinsic structure, but it has some limitations. The first limitation is given by the fact that as the partitioning proceeds the number of training observations in successive regions becomes smaller and smaller. At the end of the process the remaining data should be insufficient to provide meaningful estimates. Another limitation regards the splitting procedure. In [72] we can see that a little perturbation of the data can have a big influence on the Decision tree performance. An explanation of this phenomena is the follow: suppose that the noise in the data affects those attributes that the Decision tree chooses as splitting attributes. This fact can change in a determinant way the resulting decision tree and can degrade the performance of the algorithm.

2.4.3 Nearest Neighbors

The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms[65]. It finds the K nearest neighbors of a query point xq

in the training set, and then predicts the class label of the point xq in the following way:

f(xq) = arg max

c∈C

Xk i=1

σ(c, f (xi)) where:

• σ(a, b) = 1 if a = b and σ(a, b) = 0 otherwise

• f (xi) is the function that returns the correct classification of an object xi

2.4. CLASSIFICATION

• f(xq) returns the classification of an object xqobtained by this classifier.

An object is classified by a majority vote of its neighbors, with the object being assigned to the most common class amongst its k nearest neighbors. k is a positive integer, typically small. If k = 1, then the object is simply assigned to the class of its nearest neighbor. In this case the classification function is not so smooth. As the number of k increases the classification function becomes more smoother. In binary (two class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. The same method can be used for regression, by simply assigning to the object the average of the values of its k nearest neighbors. It can be useful to weigh the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

In order to identify neighbors, the objects are represented by vectors in a multidimensional feature space. It is usual to use the Euclidean distance, though other distance measures, such as the Manhattan distance, could in principle be used instead. The k-nearest neighbor algorithm is sensitive to the local structure of the data. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

For this fact we can categorize this classifier as a lazy algorithm. In the actual classification phase, the test sample (whose class is not known) is represented as a vector in the same feature space. Distances from the new vector to all stored vectors are computed and k closest samples are selected.

2.4.4 Artificial Neural Network

Artificial Neural Networks (ANN) provide a robust method to approximate vector-valued target functions. In general ANN are considered like an univer-sal approximation function. In general ANNs are well suited to problems in

which there is a lot of noise inside the data. As far as the data representation

which there is a lot of noise inside the data. As far as the data representation