
Università degli Studi di Torino
Scuola di Dottorato in Scienza e Alta Tecnologia

PhD Thesis in Science and High Technology
Specialization: Computer Science

UNSUPERVISED APPROACHES FOR THE GENERATION OF STRUCTURES ON LARGE DATA

Dino Ienco

Tutor: Prof. Rosa Meo

XXII Cycle, January 2010


Acknowledgments

During all these years of PhD studies I have been helped, encouraged and supported by several people whom I would like to acknowledge.

I have grown up as a researcher inside the atmosphere of the Department of Computer Science of the University of Torino. My advisor Rosa Meo has supervised my growth with patient guidance, keeping trust in my capabilities, and she gave me a lot of opportunities to study and to improve my skills.

Marco Botta has been like a second tutor for my PhD program and he helped me with interesting discussions about some topics of my works. Roberto Esposito has been a very good friend and I want to thank Roberto for his great (working and moral) help and for the useful and interesting discussions about machine learning, data mining and not only scientific topics. Ruggero Pensa has been like a scientific brother for me with the joint work in the last year. He helped me to finalize and concretize the distance learning approach over categorical data and the Hierarchical Co-Clustering algorithm. Francesca Cordero has been an elder sister for me: she seriously contributed to my growth with her decisive character and her way of working. Elena Roglia, a beautiful roommate, endured me for three years of PhD, laughing, joking and sharing many moments of real life.

Last but not least, my real family. My brother Luca, another computer scientist, who wants to follow in my footsteps. My parents Pasquale and Anna Maria: I am well aware that I would have had no chance to complete my studies without their support. In particular my father: he supported me in the decision to start my University studies first, and my PhD studies afterwards. He has always been like a light in the fog, and he will continue to be a light even now that he lives on in my memory.

A special thanks to Enrica, my girlfriend: she was always near me when I needed her, and in recent years we have grown together, addressing the difficulties of life.


List of Figures

2.1 Contingency table between variable I and D. . . 20

2.2 Example of Decision Tree . . . 31

3.1 T-Test (df: 999) for clustering evaluation on randomized datasets . . . 59

3.2 Mean accuracy allowed by different proximity measures . . . . 60

3.3 Features of the Mushroom data-set . . . 61

3.4 Dendrogram of the features of data-set Mushroom . . . 62

3.5 Mean Accuracy with our FS of three classifiers: Naive Bayes, J48 decision Tree and K nearest neighbors (k = 7). . . 63

3.6 Comparing the accuracy of Naive Bayes coupled with our method (HCL FS) vs. FCBF and ReliefF . . . 67

3.7 Comparison on Feature Reduction . . . 68

3.8 Average accuracy: Comparison between our approach and Wrapper methods . . . 68

3.9 Comparing the accuracy of J48 decision tree coupled with our method (HCL FS) vs. FCBF and ReliefF . . . 69

4.1 Overview of the system . . . 80

4.2 Accuracy of three different classifiers on the ACM Categories and on clusters of KH. . . 82

5.1 Person: a sample dataset with categorical attributes (a) and its related contingency table (b). . . 85

5.2 Datasets characteristics . . . 102

5.3 Purity results on real and synthetic data . . . 104

5.4 NMI results on real and synthetic data . . . 105

5.5 Adjusted Rand Index results on real and synthetic data . . . . 106

5.6 Stability analysis on sigma parameter. . . 107

5.7 Time performance w.r.t. the number of instances (a) and attributes (b). . . 108

5.8 Mean number of attributes in the context for DILCARR . . . 109

5.9 Computational time for each hierarchical clustering phase. . . 110

6.1 An example dataset (a) and a related co-clustering (b). . . 117


List of Tables

6.1 Datasets characteristics . . . 127

6.2 Average values of External Indices . . . 130

6.3 Comparison between ITCC and HiCC with the same number of cluster for each level . . . 130

6.4 Complete view of performance parameters for re1 . . . 131

6.5 Average τS, mean depth of hierarchies and mean time . . . 132

6.6 Row hierarchy of oh15 . . . 132

6.7 Column hierarchy of re1 . . . 133


List of Algorithms

1 HCL-Ward(M) . . . 56

2 HCL-FS(dendrogram, classifier, dataSet D) . . . 57

3 getLevelPartition(dendrogram, l) . . . 57

4 getRepresenter(cluster C) . . . 57

5 HCL-Ward(M) . . . 77

6 BuildDistanceMatrix(TDlist) . . . 77

7 LabelNodeByPageRank(n) . . . 78

8 createTaxonomy(C) . . . 78

9 preProcessData(C, minWfreq, minCfreq) . . . 79

10 computeCorrelationMatrix(D) . . . 97

11 DILCAM(VectorSUY, Y, σ) . . . 97

12 DILCARR(matrixSU, Y) . . . 98

13 DistanceComputation(Y, context(Y)) . . . 98

14 τCoClust(D, Niter) . . . 121

15 optimizeRowCluster(R, C, T ) . . . 121

16 HiCC(D, Niter) . . . 122


Contents

1 Introduction 13

1.1 Outline of the Thesis . . . 17

2 Background 19
2.1 Contingency Table . . . 19

2.2 Evaluation measures . . . 20

2.2.1 χ2 . . . 20

2.2.2 Entropy . . . 21

2.2.3 Mutual Information . . . 22

2.2.4 Symmetrical Uncertainty . . . 24

2.2.5 Normalized Mutual Information . . . 25

2.2.6 Goodman-Kruskal τ . . . 25

2.3 Feature Selection . . . 27

2.3.1 Supervised Feature Selection . . . 27

2.3.2 Unsupervised Feature Selection . . . 28

2.4 Classification . . . 29

2.4.1 Naive Bayes . . . 29

2.4.2 Decision Tree . . . 30

2.4.3 Nearest Neighbors . . . 32

2.4.4 Artificial Neural Network . . . 33

2.4.5 Support Vector Machine . . . 35

2.5 Clustering . . . 36

2.5.1 Partitional . . . 36

2.5.2 Hierarchical . . . 37

2.5.3 Density-Based . . . 41

2.5.4 Model Based . . . 42

2.6 Co-Clustering . . . 43

2.7 Evaluation Measures for Clustering and Co-Clustering Results . . . 45
2.7.1 Normalized Mutual Information . . . 46

2.7.2 Purity . . . 46

2.7.3 Adjusted Rand Index . . . 46

3 Exploration and Reduction of the Feature Space by Hierarchical Clustering 49
3.1 Introduction . . . 49


3.2 Related Work . . . 50

3.3 Feature selection by hierarchical clustering . . . 52

3.3.1 A hybrid feature selection approach . . . 53

3.4 Experimental evaluation . . . 56

3.4.1 Statistical Validation of Clustering . . . 56

3.4.2 Comparisons of τ with other proximity measures . . . . 59

3.4.3 Inspection of the Dendrogram . . . 60

3.4.4 Benefits of the Feature Selection . . . 62

3.4.5 Comparison with Filters . . . 63

3.4.6 Comparison with Wrappers . . . 64

3.5 Discussion . . . 65

4 Towards the Automatic Construction of Conceptual Taxonomies 71
4.1 Introduction . . . 72

4.2 Related Work . . . 73

4.2.1 PageRank . . . 75

4.3 Algorithm . . . 77

4.4 Evaluation of the Obtained KH Taxonomy . . . 78

4.5 Discussion . . . 82

5 Context-based Distance Learning for Hierarchical Categorical Data Clustering 83
5.1 Introduction . . . 83

5.2 Related Work . . . 87

5.3 The DILCA Approach . . . 90

5.3.1 Context selection . . . 91

5.3.2 Distance computation . . . 95

5.4 DILCA Implementation . . . 95

5.4.1 Complexity . . . 97

5.5 Experiments and Results . . . 99

5.5.1 Clustering Evaluation Metrics . . . 100

5.5.2 Datasets for Categorical Clustering Evaluation . . . 102

5.5.3 Experimental Settings . . . 103

5.5.4 Results . . . 104

5.6 Discussion . . . 110


6 Parameter-free Hierarchical Co-Clustering by n-Ary Splits 113

6.1 Introduction . . . 113

6.2 Related work . . . 116

6.3 An introductory example to co-clustering . . . 116

6.4 τ CoClust Algorithm . . . 119

6.5 Hierarchical Co-Clustering . . . 123

6.5.1 Notations . . . 123

6.5.2 Algorithm description . . . 124

6.5.3 Complexity Discussion . . . 124

6.6 Experimental validation . . . 126

6.6.1 Datasets for evaluation . . . 126

6.6.2 Comparison Results . . . 127

6.6.3 Inspecting hierarchies . . . 129

6.7 Discussion . . . 132

7 Conclusions 135
7.1 Contributions . . . 135

7.2 Future Works . . . 136


Smoke whirls

After the passage of a train.

Young foliage.

Shiki Masaoka (1867-1902)

1 Introduction

In recent years the problem of managing huge amounts of data has grown.

Research in computer science, especially in the database and data mining fields, is focusing its attention on problems related to storing, indexing and extracting knowledge from data. For instance, the large quantity of information available from the web allows industry to better understand human and social behaviors and to use this information to fit user needs. Many automatic techniques have been developed to explore, study and take advantage of this growing source of information. Data Mining is an interdisciplinary research field that aims at identifying valid, novel, potentially useful, and understandable patterns from data. It involves knowledge and approaches from, among others, Machine Learning, Statistics, Databases and Human Computer Interaction. In this thesis I focus on Machine Learning techniques, in particular on unsupervised learning algorithms. For this reason we introduce some notions about machine learning. The main goal of machine learning techniques is to help humans construct models of an interesting phenomenon, or of a piece of the real world, that cannot be built easily starting directly from the users' experience in a specific domain, and to build programs that learn a model of a portion of the real world and improve it from experience. Another goal of Machine Learning (ML) is to provide computational models for human learning. From the previous description we can see that ML is a complex and faceted research field whose goal is to develop algorithms that allow computers to learn. Machine learning algorithms are already at the cornerstone of many important applications across a great number of disciplines, such as medical diagnostics, bio-informatics, computer vision, speech and handwriting recognition, stock market modeling, fraud and intrusion detection. These kinds of algorithms try to automatically learn to recognize complex patterns and make intelligent decisions based on data. The input of these algorithms is a representation of a set of data, and their output is a simple or structured prediction that represents the answer of the model or the category to which the observed objects belong.

From the point of view of the degree of supervision provided to the learning algorithms we can distinguish among:

• Supervised Learning

• Semi-Supervised Learning

• Unsupervised learning

In the Supervised Learning setting a dataset is constituted by pairs of input objects (typically vectors) and desired outputs. The output of the model can be a continuous value (and the task is called regression), a class label (the classification task) or a structured output built on the basis of the input object (the structured prediction task). The final goal of a supervised learning algorithm is to learn a model (classifier) that is able to predict with high accuracy the correct classification of any new input object. We use a training set to learn the model and a test set to evaluate the accuracy obtained by the trained model. This phase is often referred to as the determination of the "generalization error" of the classifier. In fact, the classification model, learnt on the training set, is used to predict the class of new data in the test set by means of a generalization operation, under the hypothesis that the observations in the training set are general enough to represent the whole object population.

Commonly, in order to validate and select an appropriate model, a set of different approaches has been proposed in the literature [11]. In general the most used framework to perform model selection is the k-cross-validation approach. K-cross-validation is a technique in which we divide the whole dataset into K partitions and then build K models. Each model is trained on K−1 partitions (the training set) and tested on the remaining partition (the test set). At the end of the k-cross-validation the average generalization error is computed and the accuracy of the model on the given dataset is obtained. Using this technique we are able to select the best model (classifier) w.r.t. the generalization error.
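As a concrete illustration, the following minimal Python sketch implements this validation scheme. It is only illustrative: the build_classifier factory and its fit/predict interface are assumptions of the sketch, not part of the thesis.

import random

def k_cross_validation(data, labels, build_classifier, k=10, seed=0):
    """Estimate the generalization accuracy of a classifier by k-cross-validation."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]            # k disjoint partitions
    accuracies = []
    for fold in folds:
        test = set(fold)
        train = [i for i in indices if i not in test]
        model = build_classifier()                       # assumed fit/predict interface
        model.fit([data[i] for i in train], [labels[i] for i in train])
        predictions = model.predict([data[i] for i in fold])
        correct = sum(p == labels[i] for p, i in zip(predictions, fold))
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k                           # average accuracy over the k folds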

In the Semi-Supervised Learning setting a dataset is constituted by incomplete pairs of input objects and desired outputs. The incompleteness is given by the fact that only a small portion of the input objects are coupled with outputs. In this area of research the target of the learning algorithm is to propagate the class information from the labeled objects to the unlabeled objects. Many approaches that deal with this problem build a graph representation of the data set where each vertex represents a specific object. Each edge represents the similarity between two objects. This similarity is obtained using a predefined similarity measure. From this representation we can see that there are some nodes with an associated label (we refer to this set of nodes as L) and other nodes without class information (we refer to this set of nodes as U). The algorithms propagate, through this graph, the information from the labeled nodes L to the unlabeled nodes and assign a label to all nodes in U.

In the Unsupervised Learning setting a dataset is constituted by a set of input objects only. In this area of research the target is to learn a model without any assumption on the output space. For instance, we may want to group objects so that objects within the same group are highly similar to each other and highly dissimilar to the objects that belong to different groups. This area of research has many applications because nowadays most of the datasets from which we want to extract knowledge are not labelled with class information, since labeling each object is a time consuming task. A very important approach within the realm of unsupervised learning is known as cluster analysis. It extracts clusters of similar objects, or cluster representatives, and explores the whole dataset with the purpose of detecting similarities in the data or representing common characteristics.

If we want to position my work within machine learning research, it falls under unsupervised learning, and in particular under hierarchical cluster analysis. I chose to focus my research on hierarchical clustering because:

• with the growth of the web, unsupervised learning will become predominant;

• hierarchies are very useful because they give the user a structured output which can help to better understand the problem domain.

In particular, the use of hierarchies helps people to obtain:

• a structured representation of the whole results;

• an easy way to browse the obtained results;

• a useful tool to visualize and organize data;

• a way to infer specialization relationships between the entities represented in the data;

• a content based index.

Hierarchies help the user to index, manage and explore the data. Being able to automatically derive hierarchies (or also taxonomies) from data can help to understand the hidden structure that is concealed behind the data.

For instance, if we analyze the results returned by search engines, they typically consist of a long list of search results in response to a user-issued query. Although the most authoritative results may be ranked high among all results under a proper ranking algorithm, it remains a question whether the most authoritative pages are what the user actually wants. In particular, when a query is inherently ambiguous and the first pages are not the users' intended results, the users may have difficulty in subsequently browsing the page contents. For example, if a user wants to find the benefits of the apple fruit and issues the query apple, all results on the first page are focused on the computer company Apple rather than on the intended topic of the apple fruit. This is because the Apple company is so well-known that Web pages about that sense, that is the Apple company, are more likely to be well linked and thus ranked higher. A solution to this problem is to automatically organize the results into hierarchies according to the different potential meanings; in this way the user may obtain many advantages in the exploration of the search results.

The techniques that we developed during our research are well suited to be applied to the previous examples. They mainly focus on extracting, in an unsupervised way, hierarchical structures that highlight the common characteristics between both objects and features. In particular we focused our attention on data sets represented by categorical or discrete feature spaces, represented for instance by the well-known word-document matrix.

1.1 Outline of the Thesis

In Chapter 2 of this thesis I introduce the definitions and background knowledge shared by the various approaches. We introduce the notion of contingency table and some association measures that we can apply on this kind of data representation. Furthermore, some common classification algorithms are presented, and at the end of the chapter I introduce some notions on clustering and co-clustering that are useful for the rest of the dissertation.


In Chapter 3 I introduce a technique to combine clustering (in particular a hierarchical one) with feature selection. In this chapter we perform cluster analysis on the feature space and, with this clustering framework, we help the algorithm to choose a good subset of the feature space to improve classification.

In Chapter 4 I use the technique applied in the previous chapter to build, in an automatic way, a word taxonomy over a set of documents. This approach is interesting because it allows the user to index the data and browse the structured output. An easy way to automatically label each node of the generated taxonomy is also given.

In Chapter 5 I study how to learn the distance between the values of the same categorical attribute. This study is very useful because, with this result, we can re-use over categorical data all the hierarchical (and non-hierarchical) clustering algorithms developed for continuous data. We show that this approach improves the clustering quality in this domain by using a local feature selection criterion.

In Chapter 6 we introduce a new approach to simultaneously build hierarchies on two dimensions. In this work we combine the idea from Chapter 3 (building hierarchies over the feature space) with the idea from Chapter 5 (building hierarchies over the object space). To do this we extend a previous co-clustering algorithm to a hierarchical setting. We applied this approach to a document-word dataset and we obtained good quantitative and qualitative results.

Conclusions and future work are presented in Chapter 7.


For the things we have to learn before we can do them, we learn by doing them.

Aristotele

2 Background

In this chapter we introduce some preliminary background knowledge. This section helps the reader understand the concepts at the basis of this thesis.

Notions we are going to introduce are about:

• Contingency table and measures that we use over it

• Feature Selection approaches

• Classification algorithms

• Clustering and Co-Clustering techniques

• Evaluation measures for clustering results

2.1 Contingency Table

Contingency tables are used to record and analyze the relationship between two variables, especially categorical variables. If we analyze two categorical variables, I (with values I1 . . . Im) and D (with values D1 . . . Dn), we can build a contingency table with m rows (as many rows as the number of values of the variable I) and n columns (as many columns as the number of values of the variable D). We denote each cell of the contingency table with nij.


The value in each cell nij of the contingency table counts how many times the value Ii appears together with the value Dj.

        D1        D2        ...  Dj        ...  Dn        Total
I1      n11       n12       ...  n1j       ...  n1n       I1 Total
I2      n21       n22       ...  n2j       ...  n2n       I2 Total
...     ...       ...       ...  ...       ...  ...       ...
Ii      ni1       ni2       ...  nij       ...  nin       Ii Total
...     ...       ...       ...  ...       ...  ...       ...
Im      nm1       nm2       ...  nmj       ...  nmn       Im Total
Total   D1 Total  D2 Total  ...  Dj Total  ...  Dn Total  Matrix Total

Figure 2.1: Contingency table between variables I and D.

This tool is very useful to study the distribution of two variables. Using this simple representation we can compute many interesting measures that can be useful to understand the association between variables.
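As an illustration of this representation, the short Python sketch below builds a contingency table from two toy categorical variables; the data and the function name are purely illustrative and not taken from the thesis.

from collections import Counter

def contingency_table(I_values, D_values):
    """Return row labels, column labels and the m x n matrix of counts n_ij."""
    rows = sorted(set(I_values))
    cols = sorted(set(D_values))
    counts = Counter(zip(I_values, D_values))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Example usage with two toy categorical variables.
I = ["red", "red", "blue", "blue", "red"]
D = ["small", "large", "small", "small", "small"]
rows, cols, table = contingency_table(I, D)
print(rows, cols, table)   # n_ij counts how often value I_i co-occurs with value D_j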

2.2 Evaluation measures

2.2.1 χ²

The χ²-statistic is a coefficient that can be helpful to interpret the relationship between two variables: it measures how far their observed joint distribution departs from independence. The χ²-statistic is computed with the following formula:

$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where

• Oij is the observed frequency;

• Eij is the expected (theoretical) frequency, asserted by the null hypothesis;

• m · n is the number of possible outcomes of each event.

Using the contingency table in Fig. 2.1 we can define the χ²-statistic in the following way:

$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\left(n_{ij} - \frac{I_{i\,Total} \cdot D_{j\,Total}}{Matrix\,Total}\right)^2}{\frac{I_{i\,Total} \cdot D_{j\,Total}}{Matrix\,Total}}$$

In our case:

• nij is the observed frequency (Oij);

• $\frac{I_{i\,Total} \cdot D_{j\,Total}}{Matrix\,Total}$ is the expected (theoretical) frequency, asserted by the null hypothesis (Eij);

• m · n is the number of possible outcomes.

This statistic gives us a test to measure how much the observed frequencies differ from the theoretical frequencies obtained under the null hypothesis. The null hypothesis is the independence assumption, under which the expected relative frequency is equal to P(Ii) · P(Dj), where P(Ii) and P(Dj) are the two marginals. The χ²-statistic can be used to calculate a p-value by comparing the value of the statistic to the χ² distribution. Under the previous hypothesis the number of degrees of freedom is equal to (number of rows − 1) · (number of columns − 1). Small values of the χ² statistic indicate that the distribution of the two variables is very close to the one assumed by the null hypothesis, while large values indicate that it is far from it.
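For illustration, the following Python sketch computes the χ²-statistic and the degrees of freedom directly from a table of counts n_ij, following the formula above; the toy table is an arbitrary example.

def chi_square(table):
    """Chi-square statistic of a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total   # E_ij under independence
            chi2 += (n_ij - expected) ** 2 / expected
    return chi2

table = [[10, 20], [30, 40]]
degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
print(chi_square(table), degrees_of_freedom)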

2.2.2 Entropy

Entropy, in information theory, is a measure of the uncertainty associated with a random variable [41]. The term in this field refers to the Shannon entropy: the average amount of information needed to transmit the state of a random variable. In the context of data mining, entropy quantifies the uncertainty in the distribution of a random variable. It is very useful to understand the distribution of a variable in a dataset. For this reason entropy, and measures derived from it, are employed in feature selection tasks.

Given a random variable D we denote its entropy as:

H(D) = − X

Dj∈D

P (Dj) log2(P (Dj))

using the previous example each DjT otal is equal to summation over the column. DjT otal = Pm

i=1nij

We introduced the concept of entropy in terms of the average amount of information necessary to specify the state of a random variable. Entropy is not, properly speaking, a measure derived from a contingency table. However, if we extend the computation of entropy to the case of two variables, H(D, I), we can use a contingency table to compute this measure.

$$H(D, I) = -\sum_{D_j \in D} \sum_{I_i \in I} P(D_j, I_i) \log_2 P(D_j, I_i)$$

The entropy definition is also the basic building block for understanding the concept of Mutual Information presented in the following section.

2.2.3 Mutual Information

In probability and information theory, the mutual information of two random variables is a quantity that measures the reciprocal dependence of the two variables.

Formally, the mutual information of two discrete random variables I and D can be defined as:

$$MI(I; D) = \sum_{D_j \in D} \sum_{I_i \in I} P(I_i, D_j) \log_2 \frac{P(I_i, D_j)}{P(I_i)\, P(D_j)}$$

where P(Ii, Dj) is the joint probability distribution function of I and D, and P(Ii) and P(Dj) are the marginal probability distribution functions of I and D respectively.

Mutual information can be equivalently expressed as

$$\begin{aligned}
MI(I; D) &= H(I) - H(I|D) \\
&= H(D) - H(D|I) \\
&= H(I) + H(D) - H(I, D) \\
&= H(I, D) - H(I|D) - H(D|I)
\end{aligned}$$

where H(I) and H(D) are the marginal entropies, H(I|D) and H(D|I) are the conditional entropies, and H(I, D) is the joint entropy of I and D.

The conditional entropy of a variable I given the variable D is defined as follows:

$$H(I|D) = -\sum_{D_j \in D} P(D_j) \sum_{I_i \in I} P(I_i|D_j) \log_2 P(I_i|D_j)$$

Intuitively, if entropy H(I) is the measure of uncertainty about a random variable I, then H(I|D) is the amount of information that D does not convey on I. This is the amount of uncertainty remaining about I when D is known.

Thus the right side of the first of these equalities can be read as the amount of uncertainty in I, minus the amount of uncertainty in I which remains after D is known, which is equivalent to the amount of uncertainty in I which is removed by knowing D. This corroborates the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other. Mutual Information is a symmetric measure and it can also be expressed as the Kullback-Leibler divergence of the joint distribution P(I, D) of the two random variables from the product P(I) × P(D) of their marginal distributions:

$$MI(I; D) = Div_{KL}\big(P(I, D)\,\|\,P(I)P(D)\big)$$

where $Div_{KL}(P\|Q)$ is equal to

$$Div_{KL}(P\|Q) = \sum_{i} P(i) \log_2 \frac{P(i)}{Q(i)}$$

Furthermore, let $p(I|D) = p(I, D)/p(D)$. Then

$$\begin{aligned}
MI(I; D) &= \sum_{D_j \in D} p(D_j) \sum_{I_i \in I} p(I_i|D_j) \log_2 \frac{p(I_i|D_j)}{p(I_i)} \\
&= \sum_{D_j \in D} p(D_j)\, Div_{KL}\big(p(I|D_j)\,\|\,p(I)\big) \\
&= E_D\big\{Div_{KL}\big(p(I|D)\,\|\,p(I)\big)\big\}
\end{aligned}$$

Mutual information can also be understood as the expectation of the Kullback-Leibler divergence of the conditional distribution P(I|D) of I given D from the univariate distribution P(I) of I: the more different the distributions P(I|D) and P(I) are, the greater the mutual information.

2.2.4 Symmetrical Uncertainty

Another important measure is the Symmetrical Uncertainty [42]. This measure is a normalized version of mutual information. Symmetrical Uncertainty is defined as follows:

$$SU(I, D) = \frac{2 \cdot MI(I; D)}{H(I) + H(D)}$$

This measure varies between 0 and 1 (1 indicates that knowledge of the value of either I or D completely predicts the value of the other variable; 0 indicates that I and D are independent). The advantage of this normalized version of mutual information is that it is not biased by the number of values of each attribute, as occurs for entropy and mutual information.

2.2.5 Normalized Mutual Information

Normalized Mutual Information [78] is another normalized measure obtained from the Mutual Information.

Normalized Mutual Information is defined as follows:

$$NMI(I, D) = \frac{MI(I; D)}{\sqrt{H(I) \times H(D)}}$$

This measure varies between 0 and 1 (1 indicates that knowledge of the value of either I or D completely predicts the value of the other variable; 0 indicates that I and D are independent). This measure has the same advantages as the Symmetrical Uncertainty.
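The measures of Sections 2.2.2-2.2.5 can all be estimated from a contingency table by replacing probabilities with relative frequencies. The Python sketch below is a minimal illustration of such an estimation; the toy table is arbitrary and not taken from the thesis.

from math import log2, sqrt

def _probs(table):
    """Turn a table of counts n_ij into joint relative frequencies."""
    total = sum(sum(row) for row in table)
    return [[n / total for n in row] for row in table]

def entropy(marginal):
    return -sum(p * log2(p) for p in marginal if p > 0)

def mutual_information(table):
    P = _probs(table)
    p_i = [sum(row) for row in P]                 # marginal of I (rows)
    p_j = [sum(col) for col in zip(*P)]           # marginal of D (columns)
    mi = sum(P[i][j] * log2(P[i][j] / (p_i[i] * p_j[j]))
             for i in range(len(P)) for j in range(len(P[0])) if P[i][j] > 0)
    return mi, entropy(p_i), entropy(p_j)

def symmetrical_uncertainty(table):
    mi, h_i, h_d = mutual_information(table)
    return 2 * mi / (h_i + h_d)

def normalized_mutual_information(table):
    mi, h_i, h_d = mutual_information(table)
    return mi / sqrt(h_i * h_d)

table = [[30, 10], [5, 55]]                       # toy contingency table
print(symmetrical_uncertainty(table), normalized_mutual_information(table))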

2.2.6 Goodman-Kruskal τ

Goodman-Kruskal τ can be described as a measure of proportional reduction in the prediction error [36]. It describes the association between two categorical variables, a dependent and an independent one, in terms of the increase in the probability of correct prediction of the category of the dependent variable when information is supplied about the category of the other, independent variable. τ is intended for use with any categorical variable, while other measures (e.g., γ) are intended for ordinal ones.

In order to review the meaning of τ consider the table in Figure 2.1.

τ determines the predictive power of a variable I, considered as an independent variable, for the prediction of the dependent variable D. The prediction power of I is computed as a function of the error in the classification of D.

The prediction error in the classification of D is first computed when we do not have any knowledge on the variable I. ED denotes this error here. The reduction of this error allowed by I is obtained by subtracting from ED the error in the classification of D that we make when we know the value of I in any database example. ED|I denotes this latter error. The proportional reduction in the prediction error of D given I, here called $\tau_{I\,D}$, is computed by:

$$\tau_{I\,D} = \frac{E_D - E_{D|I}}{E_D} \qquad (2.1)$$

ED and ED|I are computed by a predictor which uses information from the cross-classification frequencies and tries to reduce the prediction error as much as possible. In the prediction, it also preserves the dependent variable distribution (the relative frequencies of the predicted categories Dj) in the following way: when no knowledge is given on the category of I, category Dj is predicted with the relative frequency determined by $\frac{D_{j\,Total}}{Total}$; otherwise, when it is known that Ii is the category of variable I, category Dj is predicted with the relative frequency determined by $\frac{n_{ij}}{I_{i\,Total}}$. Therefore, error ED is determined by:

$$E_D = \sum_{j} \frac{Total - D_{j\,Total}}{Total} \cdot D_{j\,Total} \qquad (2.2)$$

while error ED|I is:

$$E_{D|I} = \sum_{j} \sum_{i} \frac{I_{i\,Total} - n_{ij}}{I_{i\,Total}} \cdot n_{ij} \qquad (2.3)$$

We can see that τ satisfies many desirable properties for a measure of association. For instance, it is invariant under row and column permutations.

Secondly, it seems a good measure of the association between two features, since it takes values in [0, 1] and it is 0 if and only if there is independence between the features (which solves the problem of zero uniqueness). Furthermore, it has the benefit that it can be interpreted, since it has an operational meaning: it corresponds to the relative reduction in the prediction error, in a way that preserves the class distribution.
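As an illustration, the Python sketch below computes τ from a contingency table with rows indexed by I and columns by D, following the proportional predictor and Equations 2.1-2.3 as reconstructed above; the toy counts are arbitrary.

def goodman_kruskal_tau(table):
    row_totals = [sum(row) for row in table]            # I_i Total
    col_totals = [sum(col) for col in zip(*table)]      # D_j Total
    total = sum(row_totals)
    # Error in predicting D with no knowledge of I (Equation 2.2).
    e_d = sum(c * (total - c) / total for c in col_totals)
    # Error in predicting D when the category of I is known (Equation 2.3).
    e_d_given_i = sum(n * (row_totals[i] - n) / row_totals[i]
                      for i, row in enumerate(table) for n in row
                      if row_totals[i] > 0)
    return (e_d - e_d_given_i) / e_d

table = [[30, 10], [5, 55]]
print(goodman_kruskal_tau(table))    # in [0, 1]; 0 iff I and D are independent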

2.3 Feature Selection

Feature selection is an important area of research in the data mining and machine learning communities. Nowadays data are represented by hundreds to tens of thousands of variables (features). For this reason the task of extracting a relevant and non-redundant subset of features is very important. In [39] an introduction to variable and feature selection can be found. There are different approaches to feature selection and different situations in which feature selection can be applied. In general, feature selection was introduced in machine learning in the supervised setting, but nowadays there are also studies about feature selection for the unsupervised setting. The two problems are different from each other. In the supervised analyses we can consider the class information that guides the search for a suitable subset of features. In the unsupervised analyses we can use only the information on how the data are distributed. An example of feature selection for unsupervised learning is the work presented in [95]. We can obtain a first categorization of the feature selection approaches:

• supervised feature selection

• unsupervised feature selection

2.3.1 Supervised Feature Selection

Many works on feature selection were about the supervised approach. In this setting we can use two different approaches:

1 Rank all the attributes w.r.t. their correlation with the class label.

With this method we can obtain a ranking of the features and filter those attributes with a value over a user-defined threshold.


2 Explore the possible subsets of the features and choose the one that maximizes the accuracy of a given classifier.

The first one is named the Filter approach. In this setting all features are ranked w.r.t. the class. After the ranking step, some approaches, for instance [90], apply a second step that tries to remove redundant attributes in order to eliminate possible noise.

The second one is named the Wrapper approach. In this setting a classifier is combined with a search mechanism that tries to explore the feature space. The space is defined by the set of all possible feature subsets. Each subset chosen by the search method is evaluated by a predefined classifier. The search mechanism is used as a wrapper around the classifier. The subset that obtains the best accuracy result w.r.t. the chosen classifier is the subset of features returned by the Wrapper feature selection approach.

2.3.2 Unsupervised Feature Selection

When no knowledge about the class variable is supplied, supervised feature selection cannot be applied. Feature selection for clustering, or in general for unsupervised learning, is a very important problem. This task is important because today huge quantities of data are available (for example web data), and all these data do not carry class information. There are approaches that try to manage this problem [10], [86]. In recent years some interesting feature selection algorithms for unsupervised learning have been proposed. These new methods build a graph representation of the dataset. This graph representation is an embedding of the density of the dataset. Using algebraic properties that hold on the matrix that represents the connections between the nodes in the graph, they are able to solve the feature selection problem for unsupervised learning. An example of this approach is presented in [95].


2.4 Classification

In this section we introduce some simple classification techniques. We introduce these algorithms because in the rest of the thesis we use them to evaluate our results. For this reason a short presentation of these methods is given in the following sections.

2.4.1 Naive Bayes

The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. In simple terms, a naive Bayes classifier assumes that, given the class, the presence (or absence) of a particular attribute is unrelated to the presence (or absence) of any other attribute.

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood. Techniques similar to Laplacian Smoothing [79] have been adopted in naive Bayes in order to estimate the probability of occurrence of attribute values that have never been met in the training set examples. This is particularly useful in text classification, when the classification algorithm encounters in the test set words that were absent from the training set. In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in many complex real-world situations than one might expect. In this explanation we denote with n the number of attributes, with Fi a generic attribute and with C the class attribute.

Using Bayes’ theorem, we write

$$p(C|F_1, \ldots, F_n) = \frac{p(C)\, p(F_1, \ldots, F_n|C)}{p(F_1, \ldots, F_n)}$$

We can rewrite the previous equation in the following way:

$$posterior = \frac{prior \times likelihood}{evidence}$$

The evidence does not depend on the class, so it is the same for every posterior; for this reason:

$$posterior \propto prior \times likelihood$$

The naive Bayes algorithm uses a strong independence assumption: given the class, each attribute is independent of every other attribute. With this assumption we can rewrite the likelihood of the data in this way:

$$p(F_1, \ldots, F_n|C) = \prod_{i=1}^{n} p(F_i|C)$$

and the posterior in this way:

$$posterior \propto p(C) \prod_{i=1}^{n} p(F_i|C)$$

One should notice that the independence assumption may lead to some unexpected results. Despite the fact that the far-reaching independence assumptions are often inaccurate, the naive Bayes classifier has several properties that make it surprisingly useful in practice. In particular, the decoupling of the class-conditional attribute distributions means that each distribution can be independently estimated as a one-dimensional distribution. This in turn helps to alleviate the problem of the curse of dimensionality [39], in which the high-dimensional space degrades the performance of the learning algorithm.
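For illustration, the following Python sketch implements a naive Bayes classifier for categorical attributes with Laplace-style smoothing; the class name, the toy data and the smoothing parameter alpha are assumptions of the sketch, not of the thesis.

from collections import Counter, defaultdict
from math import log

class CategoricalNaiveBayes:
    def fit(self, X, y, alpha=1.0):
        self.alpha = alpha
        self.classes = sorted(set(y))
        self.class_counts = Counter(y)
        self.n = len(y)
        # value_counts[(attribute index, class)][value] = frequency in training data
        self.value_counts = defaultdict(Counter)
        self.values = defaultdict(set)
        for xi, c in zip(X, y):
            for a, v in enumerate(xi):
                self.value_counts[(a, c)][v] += 1
                self.values[a].add(v)
        return self

    def predict(self, x):
        best, best_score = None, float("-inf")
        for c in self.classes:
            score = log(self.class_counts[c] / self.n)           # log prior
            for a, v in enumerate(x):
                num = self.value_counts[(a, c)][v] + self.alpha  # smoothed count
                den = self.class_counts[c] + self.alpha * len(self.values[a])
                score += log(num / den)                          # log likelihood
            if score > best_score:
                best, best_score = c, score
        return best

X = [["sunny", "hot"], ["rainy", "mild"], ["sunny", "mild"], ["rainy", "hot"]]
y = ["no", "yes", "yes", "no"]
print(CategoricalNaiveBayes().fit(X, y).predict(["sunny", "mild"]))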

2.4.2 Decision Tree

Decision tree learning is a practical method for inductive inference. This technique learns a hierarchical structure in a recursive way. Examples of algorithms that use this learning strategy are CART, C4.5 and ID3 [65]. The decision tree approach learns, in a recursive way, a set of partitions and subpartitions. Starting from the main partition, which contains all training data, at each step every existing region is split into subregions. This recursive splitting procedure continues until a region meets a local termination criterion. The first split is applied using the values of a chosen attribute; when all regions have met the termination criterion, the decision tree stops learning partitions. To split a region, a particular attribute is selected and its values are used to split the region into subregions. For instance, the ID3 algorithm uses information theory to select, given a region, the best attribute on which to split it.

Figure 2.2: Example of Decision Tree

In Figure 2.2 there is an example of this procedure. The algorithm starts by analyzing the whole dataset; then the decision tree splits the dataset choosing the attribute OUTLOOK. At each level the algorithm chooses a specific attribute to split the node. For instance, again in Figure 2.2, at the second level the algorithm chooses the attribute HUMIDITY for the left branch and the attribute WINDY for the right branch. Generally the process is repeated until the termination criterion is met. In the example the algorithm went on until each leaf contained only elements from the same class.

From the learned trees we can extract a set of if-then rules to improve human readability. After the splitting phase is completed, a pruning procedure is usually employed. For instance, some approaches recombine adjacent regions in a bottom-up way, using the misclassification risk from cross-validation in order to determine when to stop the pruning. The decision tree approach is an interesting way to learn a model with an intrinsic structure, but it has some limitations. The first limitation is given by the fact that, as the partitioning proceeds, the number of training observations in successive regions becomes smaller and smaller. At the end of the process the remaining data may be insufficient to provide meaningful estimates. Another limitation regards the splitting procedure. In [72] we can see that a small perturbation of the data can have a big influence on the decision tree performance. An explanation of this phenomenon is the following: suppose that the noise in the data affects those attributes that the decision tree chooses as splitting attributes. This fact can decisively change the resulting decision tree and can degrade the performance of the algorithm.
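As a small illustration of the entropy-based splitting criterion mentioned for ID3, the Python sketch below selects, for a given region, the attribute with the highest information gain; the attribute indices and the toy data are illustrative only.

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the region on `attribute`."""
    base = entropy(labels)
    subsets = Counter(row[attribute] for row in rows)
    remainder = 0.0
    for value, count in subsets.items():
        subset_labels = [l for row, l in zip(rows, labels) if row[attribute] == value]
        remainder += count / len(rows) * entropy(subset_labels)
    return base - remainder

def best_attribute(rows, labels):
    return max(range(len(rows[0])), key=lambda a: information_gain(rows, labels, a))

rows = [["sunny", "high"], ["sunny", "normal"], ["rain", "high"], ["rain", "normal"]]
labels = ["no", "yes", "no", "yes"]
print(best_attribute(rows, labels))   # index of the attribute chosen for the split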

2.4.3 Nearest Neighbors

The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms [65]. It finds the k nearest neighbors of a query point xq in the training set, and then predicts the class label of the point xq in the following way:

$$\hat{f}(x_q) = \arg\max_{c \in C} \sum_{i=1}^{k} \sigma(c, f(x_i))$$

where:

• σ(a, b) = 1 if a = b and σ(a, b) = 0 otherwise;

• f(xi) is the function that returns the correct classification of an object xi;

• $\hat{f}(x_q)$ returns the classification of the object xq obtained by this classifier.

An object is classified by a majority vote of its neighbors, with the object being assigned to the most common class amongst its k nearest neighbors. k is a positive integer, typically small. If k = 1, then the object is simply assigned to the class of its nearest neighbor; in this case the classification function is not very smooth. As k increases, the classification function becomes smoother. In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes. The same method can be used for regression, by simply assigning to the object the average of the values of its k nearest neighbors. It can be useful to weigh the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

In order to identify neighbors, the objects are represented by vectors in a multidimensional feature space. It is usual to use the Euclidean distance, though other distance measures, such as the Manhattan distance, could in principle be used instead. The k-nearest neighbor algorithm is sensitive to the local structure of the data. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

For this reason we can categorize this classifier as a lazy algorithm. In the actual classification phase, the test sample (whose class is not known) is represented as a vector in the same feature space. The distances from the new vector to all stored vectors are computed and the k closest samples are selected.
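A minimal Python sketch of this classification rule is given below, using Euclidean distance and a majority vote among the k closest training samples; the toy points are illustrative only.

from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(x_q, training_set, training_labels, k=7):
    # Rank training points by distance to the query point x_q.
    ranked = sorted(range(len(training_set)), key=lambda i: dist(x_q, training_set[i]))
    neighbors = ranked[:k]
    votes = Counter(training_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]      # most common class among the k neighbors

X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
y = ["a", "a", "b", "b", "b"]
print(knn_classify((0.95, 1.0), X, y, k=3))   # -> "b"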

2.4.4 Artificial Neural Network

Artificial Neural Networks (ANN) provide a robust method to approximate vector-valued target functions. ANNs are generally considered universal function approximators and are well suited to problems in which there is a lot of noise in the data. As far as the data representation is concerned, instances are represented as numerical vectors of predefined features. In general the output of the target function may be discrete-valued or real-valued. This learning method is used in many regression tasks where the variable to predict is continuous. The basic block of an ANN is the perceptron. The perceptron simply computes a weighted sum of the input vector. This simple algorithm yields a linear classifier, whose small hypothesis space does not allow the algorithm to manage data characterized by a non-linear structure.

$$f(x) = \sum_{i=0}^{n} w_i \cdot x_i$$

The perceptron returns 1 if the weighted sum is greater than a given threshold; otherwise it returns -1. This simple algorithm is well suited for binary classification problems when the data space is linearly separable. For instance, this method cannot manage the XOR problem [65]. To overcome the perceptron's inability to manage non-linearly separable data spaces, ANNs have been introduced. An ANN is a network of perceptrons with many levels. In general the first level is the input level, followed by a non-predefined number of hidden levels and, at the end, an output level that produces the final value for the class variable. This approach uses the backpropagation strategy in order to train the network: it uses gradient descent to search inside the hypothesis space. Gradient descent is useful because it allows the ANN to search through hypothesis spaces containing many different types of continuously parameterized hypotheses. The backpropagation algorithm propagates the prediction error of the output level downwards and tries to proportionally re-distribute the error to all the nodes inside the network.


2.4.5 Support Vector Machine

Support Vector Machines (SVM) were originally proposed by Vapnik [81] as binary classifiers. The SVM methodology can be employed for classification and regression problems with good generalization results. SVM is a supervised learning technique. There are two stages in order to create and apply an SVM classifier: training and prediction. In the training stage, or learning stage, instances with labels are given in order to determine the support vectors which are used to predict the desired outputs. Also for this approach, instances are represented as numerical vectors of predefined features. The output of this phase is a set of support vectors. These support vectors constitute the prediction model. The optimization problem introduces the use of slack variables. These variables help relax the constraints to obtain a soft separating hyperplane. To deal with data that are not linearly separable, the kernel trick was introduced. This method maps the instance vectors using, in general, a non-linear function. This mapping is done by a kernel function, which maps the input features into a higher dimensional space such that the data become linearly separable. Using this mapping function, the similarity between two objects is evaluated in the original feature space, avoiding the explicit construction of the new features. Several kernel functions have been proposed in the literature. The most usual kernels are the following:

• Polynomial (homogeneous): $k(x, x') = (x \cdot x')^d$

• Polynomial (inhomogeneous): $k(x, x') = (x \cdot x' + 1)^d$

• Radial Basis Function: $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, for $\gamma > 0$

• Gaussian Radial Basis Function: $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$

• Hyperbolic tangent: $k(x, x') = \tanh(\kappa\, x \cdot x' + c)$, for some (not every) $\kappa > 0$ and $c < 0$


In the prediction phase, the prediction model is applied to predict the class variable for previously unseen instance vectors. This is done using the generated hyperplane in the transformed feature space. Support Vector Machines are very useful and they obtain good performance w.r.t. other state-of-the-art methods.
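For illustration, the kernels listed above can be written as plain Python functions, as in the sketch below; the parameter values (degree, gamma, sigma, kappa, c) are arbitrary choices of the sketch.

from math import exp, tanh

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, c=1.0):
    return (dot(x, z) + c) ** degree                 # inhomogeneous for c > 0

def rbf_kernel(x, z, gamma=0.5):
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def gaussian_kernel(x, z, sigma=1.0):
    return exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=0.1, c=-1.0):
    return tanh(kappa * dot(x, z) + c)

x, z = (1.0, 2.0), (0.5, -1.0)
print(polynomial_kernel(x, z), rbf_kernel(x, z), sigmoid_kernel(x, z))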

2.5 Clustering

Clustering is the process of grouping a set of objects into groups of similar objects. A cluster is a set of objects such that the objects within the cluster are similar to each other and are dissimilar to the objects in other clusters.

The measure adopted to ascertain the degree of similarity between the objects depends on the particular application. Clustering is a common task in the field of unsupervised learning, where the knowledge of the objects' class is not available. Unlike classification, clustering is a form of learning by observation, rather than learning by examples as in supervised classification.

There are many types of clustering techniques:

• Partitional

• Hierarchical

• Density-based

• Model-based

2.5.1 Partitional

Partitional methods are divided into two major subcategories: the centroid and the medoid algorithms. The centroid algorithms represent each cluster by the center of gravity of its instances. The most well-known centroid algorithm is k-means [4]. The medoid algorithms represent the clusters by means of the instances closest to the center of gravity (or center of mass). The k-means method partitions the dataset into k subsets such that all points in a given subset are closest to the same centroid. It randomly selects k of the instances to represent the clusters. Based on the selected instances, all remaining ones are assigned to their closest centroid. Then it computes the new centroids by taking the mean point of all data points belonging to the same cluster. The operation is iterated until a termination criterion is satisfied.

Generally, the k-means algorithm has the following important properties:

Strengths:

• It has a linear computational complexity in space and time. The number of iterations required for the convergence is very often limited and can be safely bounded.

Weakness:

• It often terminates at a local optimum

• The clusters have spherical shapes

• It is sensitive to outliers

Traditional partitional approaches generate partitions as the name suggests.

In a partition, each object belongs to one and only one cluster. Hence, the clusters are disjoint and the clustering is hard. On the contrary, fuzzy clustering gives each point the possibility of belonging to more than one cluster. Fuzzy clustering associates each point with every cluster using a membership function. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. One widely used algorithm is the Fuzzy C-Means (FCM) algorithm [4], which is based on k-means.
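As an illustration of the k-means procedure described in this subsection, the following minimal Python sketch performs the assignment and update steps until the centroids stop changing; the toy points and the random initialization are illustrative only.

import random
from math import dist

def kmeans(points, k, iterations=100, seed=0):
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[closest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:       # termination criterion: no change
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(centroids)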

2.5.2 Hierarchical

The hierarchical methods group data instances into a tree of clusters. There are two major methods under this category. One is the agglomerative method, which forms the clusters in a bottom-up fashion. The other method works in the opposite way, i.e., in a divisive way: it starts from a unique cluster including all the data instances and splits up the clusters into smaller clusters in a top-down fashion until each cluster contains only one instance. The outcome of both the agglomerative and divisive methods can be represented by dendrograms, in which each node (or branch) of the dendrogram represents a cluster.

The dendrogram is one of the major advantages of the hierarchical methods because it can be inspected easily by the domain experts. Indeed, the dendrogram provides a wide choice of available clusters, selected at the different abstraction levels in the tree. Furthermore, studying the order in which the clusters have been merged or split tells the user something about the degree of similarity between the clusters and their objects. However, both methods suffer from their inability to perform adjustments once the splitting or merging decision is made. Indeed, they are not guided by a global objective function, but by a "local" one: they decide at each step which is the best decision (cluster split or merging) on the current clusters, without changing previous decisions. Other advantages are:

• they do not require the number of clusters to be known in advance;

• a flat partition can be derived afterwards (e.g. via a cut through the dendrogram);

• they compute a complete hierarchy of clusters;

• the resulting dendrogram can be easily inspected by the end-user and can be integrated easily into other methods.

Hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be joined (or split, for divisive approaches). For agglomerative hierarchical techniques, the criterion is typically to merge the closest pair of clusters, where closeness is defined by a specified measure of cluster proximity. There are three definitions of the closeness between two clusters:

• single-link

• complete-link

• average-link

The single-link similarity between two clusters is the similarity between the two most similar instances, where each instance comes from one cluster of the pair. Single-link is good at handling non-elliptical shapes, but it is sensitive to noise and outliers. The complete-link similarity is the similarity between the two most dissimilar instances, one from each cluster. Complete-link is less susceptible to noise and outliers, but it can break large clusters and has trouble with convex shapes. The average-link similarity is a compromise between the two: it consists in computing the average similarity between all the pairs of points coming from the two clusters. Some of the most well-known hierarchical clustering algorithms are: Ward's agglomerative hierarchical clustering [4], Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) [94] and Clustering Using REpresentatives (CURE) [37].

Ward's hierarchical method [4] is one of the hierarchical clustering methods most used in the literature. It is a greedy, iterative and agglomerative hierarchical method. An objective function determines the best candidate clusters to be merged into a new cluster at each iteration of the algorithm (corresponding to a dendrogram level). The objective function is based on the computation of an overall, global measure of goodness of each candidate clustering solution. In Ward's method, the proximity between two clusters is defined as the variation in the overall, global measure of clustering goodness that is measured when the two clusters are merged. In turn, the global measure of each solution is given by a summation, over all clusters of the solution, of a measure of cluster cohesion. Cluster cohesion can be computed in many ways. One well-known measure is the sum of squared errors (SSE) within each cluster or, differently viewed, the variance of the cluster elements (let us call it Wi for cluster i). When two clusters are merged, in the agglomerative hierarchical clustering, the overall global measure increases.

All the candidate solutions are computed, but the best one is determined by the minimum increase in the objective function. At the core of the method lies the computation of the cluster cohesion of each new cluster and the update of the matrix of distances for the inclusion of the new cluster. The new cluster cohesion is computed on the basis of the cohesions of the clusters that are considered for the merge, as follows

$$W_{ir} = \frac{(|i| + |p|)\, W_{ip} + (|i| + |q|)\, W_{iq} - |i|\, W_{pq}}{|i| + |r|} \qquad (2.4)$$

where p and q represent the two clusters being merged, r represents the new cluster formed by the fusion of clusters p and q, while i represents any other cluster; |i| denotes the cardinality of cluster i and Wij denotes the cohesion of the cluster that would be obtained if clusters i and j were merged.
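As an illustration of Equation 2.4, the Python sketch below updates the cohesion of a newly merged cluster with respect to any other cluster using only already-computed values; the toy cohesion values and cluster sizes are arbitrary.

def ward_update(W, sizes, p, q, i):
    """W maps frozenset({a, b}) to the cohesion of the cluster obtained by merging a and b."""
    size_r = sizes[p] + sizes[q]
    numerator = ((sizes[i] + sizes[p]) * W[frozenset((i, p))]
                 + (sizes[i] + sizes[q]) * W[frozenset((i, q))]
                 - sizes[i] * W[frozenset((p, q))])
    return numerator / (sizes[i] + size_r)

# Toy example with three clusters labelled 0, 1, 2 and unit sizes.
sizes = {0: 1, 1: 1, 2: 1}
W = {frozenset((0, 1)): 0.5, frozenset((0, 2)): 8.0, frozenset((1, 2)): 9.0}
# Merge clusters 0 and 1 (the cheapest pair) into a new cluster r; update W_ir for i = 2.
print(ward_update(W, sizes, p=0, q=1, i=2))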

BIRCH [94] uses a hierarchical data structure called CF-tree for partitioning the incoming data points in an incremental and dynamic way. The CF-tree is a height-balanced tree which stores the clustering features and is based on two parameters: the branching factor B and the threshold T, which refers to the maximum distance between two instances belonging to the same cluster. A CF-tree is built as the data is scanned. As each data point is encountered, the CF-tree is traversed, starting from the root and choosing the closest node at each level. When the closest leaf cluster for the current data point is finally identified, a test is performed to see if adding the data item to the candidate cluster will result in a new cluster with a diameter greater than the given threshold T. BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans. It can also handle noise effectively. BIRCH has one drawback: it may not work well when clusters are not spherical, because it uses the concept of radius or diameter to control the boundary of a cluster. In CURE [37], instead of using a single centroid to represent a cluster, a constant number of representative points are chosen to represent a cluster. The number of points chosen is a parameter, c, but it was found that a value of 10 or more works well. The similarity between two clusters is measured by the similarity of the closest pair of representative points belonging to different clusters. Unlike centroid/medoid-based methods, CURE is capable of finding clusters of arbitrary shapes (e.g. ellipsoidal, spiral, cylindrical, non-convex) and sizes, as it represents each cluster via multiple representative points. Shrinking the representative points towards the centroid helps CURE to avoid the problem of noise present in the single-link method. However, it cannot be applied directly to large data sets. For this reason, CURE takes a random sample and performs the hierarchical clustering on the sampled data points.

2.5.3 Density-Based

Density-based clustering algorithms try to find clusters based on the density of data points in a region. The key idea of density-based clustering is that for each instance of a cluster a neighborhood is defined: the neighborhood is given by the area included within a circle centered around the data point with a given radius (Eps). The density of the neighborhood is given by counting the number of data instances inside the circle. A lower threshold for the density is given (MinPts). One of the most well-known density-based clustering algorithms is DBSCAN [30]. DBSCAN separates data points into three classes:

• Core points. These are points that are at the interior of a cluster. A point is an interior point if there are enough points in its neighborhood.

• Border points. A border point is a point that is not a core point, i.e., there are not enough points in its neighborhood, but it falls within the neighborhood of a core point.


• Noise points. A noise point is any point that is not a core point or a border point.

To find a cluster, DBSCAN starts from an arbitrary instance p of the data set D and retrieves all instances of D that are density-reachable from p with respect to Eps and MinPts; if p is a core point this procedure yields a cluster, otherwise DBSCAN simply moves on to the next instance.
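As an illustration only, a minimal Python sketch of this procedure is shown below; the brute-force neighborhood search and the function names are simplifying assumptions, not the original DBSCAN implementation.

import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns one cluster id per point, or -1 for noise."""
    n = len(X)
    labels = np.full(n, -1)               # -1 marks noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region_query(i):
        # indices of all points within distance eps of point i (its Eps-neighborhood)
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(region_query(i))
        if len(neighbors) < min_pts:
            continue                      # noise, unless later reached by a core point
        labels[i] = cluster_id            # i is a core point: grow a new cluster
        seeds = neighbors
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id    # border or core point reached from the cluster
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(j)
                if len(j_neighbors) >= min_pts:   # j is itself a core point: expand
                    seeds.extend(j_neighbors)
        cluster_id += 1
    return labels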

In [6] a new algorithm, OPTICS, is introduced, which creates an ordering of the data set representing its density-based clustering structure and is a versatile basis for interactive cluster analysis. Another density-based algorithm is DENCLUE [47]. The basic idea of DENCLUE is to model the overall point density analytically as the sum of the influence functions of the data points. The influence function is a function that describes the impact of a data point within its neighborhood. Clusters can then be identified by determining the local maxima of the overall density function. The algorithm allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets and is significantly faster than the other density-based clustering algorithms. Moreover, DENCLUE produces good clustering results even when a large amount of noise is present. As in most other approaches, the quality of the resulting clustering depends on an adequate choice of the parameter values. In this approach there are two important parameters, namely σ and ε: σ determines the influence of a point in its neighborhood, while ε describes whether a density-attractor is significant. Density-attractors are local maxima of the overall density function.
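To make these definitions concrete, with the Gaussian influence function that is commonly adopted for DENCLUE (a hedged illustration, not the only possible choice), the influence of a data point y on a point x and the resulting overall density over a data set of N points x_1, ..., x_N can be written as

f_{\mathrm{Gauss}}(x, y) = \exp\!\left(-\frac{d(x, y)^2}{2\sigma^2}\right), \qquad f^{D}(x) = \sum_{i=1}^{N} f_{\mathrm{Gauss}}(x, x_i)

Density-attractors are then the local maxima of f^{D}(x), and an attractor is considered significant only when its density reaches at least ε.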

2.5.4 Model-Based

An example of model-based clustering is the Expectation-Maximization (EM) algorithm [4]. It assumes an underlying cluster probability model with parameters that describe the probability that an instance belongs to a certain cluster. The strategy in this algorithm is to start with initial guesses for the mixture model parameters. These values are then used to calculate the cluster probabilities for each instance, and these probabilities are in turn used to re-estimate the parameters using the Maximum Likelihood principle; the process is iteratively repeated.
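A hedged sketch of one such iteration for a mixture of spherical Gaussians is shown below; the spherical-covariance assumption and the function name are illustrative choices, not part of the general EM formulation.

import numpy as np

def em_step(X, means, variances, weights):
    """One EM iteration for a mixture of spherical Gaussians (illustrative sketch)."""
    n, d = X.shape
    k = len(means)
    resp = np.zeros((n, k))
    # E-step: cluster membership probabilities of every instance
    for j in range(k):
        sq_dist = np.sum((X - means[j]) ** 2, axis=1)
        resp[:, j] = weights[j] * np.exp(-sq_dist / (2 * variances[j])) \
                     / np.sqrt(2 * np.pi * variances[j]) ** d
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: Maximum Likelihood re-estimation of the mixture parameters
    nk = resp.sum(axis=0)
    weights = nk / n
    means = (resp.T @ X) / nk[:, None]
    variances = np.array([
        np.sum(resp[:, j] * np.sum((X - means[j]) ** 2, axis=1)) / (nk[j] * d)
        for j in range(k)
    ])
    return means, variances, weights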


A drawback of such algorithms is that they tend to be computationally expensive. Another model-based method is the Self-Organizing Map (SOM) [61]. A SOM can be thought of as a two-layer neural network. Each neuron is represented by an n-dimensional weight vector x = (x_1, ..., x_n), where n is equal to the dimension of the input vectors. The neurons of the SOM are themselves cluster centers; in order to ease the interpretation of the map, units can be combined to form bigger clusters.

The SOM is trained iteratively. In each training step, one sample vector x from the input data set is chosen randomly and the distance between it and all the weight vectors of the SOM is calculated using a distance measure, e.g., the Euclidean distance. After finding the Best-Matching Unit (BMU), i.e., the neuron whose weight vector is closest to the input vector, the weight vectors of the SOM are updated so that the BMU is moved closer to the input vector in the input space. The topological neighbors of the BMU on the map are treated in a similar way. An important property of the SOM is that it is very robust with respect to the presence of noise: outliers can easily be detected from the map, since their distance in the input space from the other units is large.
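A minimal Python sketch of a single training step, assuming a rectangular map of weight vectors and a Gaussian neighborhood function (the learning rate lr and the neighborhood width sigma are illustrative parameters, not values prescribed here), could look as follows. In a full training run both parameters are usually decreased over the iterations.

import numpy as np

def som_train_step(weights, grid_pos, x, lr=0.1, sigma=1.0):
    """One SOM update: find the BMU for the sample x and pull it, together with
    its neighbors on the map, closer to x.

    weights  : (n_units, n_features) array of neuron weight vectors
    grid_pos : (n_units, 2) array with the position of each neuron on the map
    """
    # Best-Matching Unit: the neuron whose weight vector is closest to x
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Gaussian neighborhood on the map: neurons close to the BMU are updated more
    grid_dist = np.linalg.norm(grid_pos - grid_pos[bmu], axis=1)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
    # move the weight vectors towards the input sample
    weights += lr * h[:, None] * (x - weights)
    return weights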

2.6 Co-Clustering

Co-Clustering is a recent field of research in the Data Mining area. This approach is based on a matrix representation of the data: normally the matrix rows represent the objects and the matrix columns represent the features that describe the objects. The technique forms clusters simultaneously on both rows and columns. In general, co-clustering algorithms are iterative and they use the clustering information obtained on one side of the matrix to cluster the other side. One of the earliest co-clustering formulations was introduced by Hartigan [43].


Hartigan's algorithm begins with the entire data in a single cluster or block and then, at each stage, finds how to split every row or column block in two pieces, choosing the block that produces the largest reduction in the total within-block variance. The splitting is continued until the reduction of the within-block variance due to further splitting is less than a given threshold. Kluger et al. [58] propose a spectral co-clustering method. The intuition is that the correlation between two columns is better estimated by the mean expression level of each column w.r.t. a partition of the rows. First, they perform an adequate normalization of the data, because they must remove systematic biases in the gene expression. To normalize the data they apply a technique similar to Principal Component Analysis (PCA) to reduce the dimensionality and the noise of the data; they call this process bistochastization. The result is a rectangular matrix that has a doubly stochastic-like structure: all rows sum to a constant and all columns sum to a (different) constant. On this new matrix they apply a decomposition algorithm to obtain the eigenvectors of the matrix, which they interpret in a probabilistic way. With this interpretation they obtain the bi-partition of both genes and conditions. Their algorithm critically depends on the normalization procedure.
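A simplified, hedged sketch of such an alternating row/column rescaling (a Sinkhorn-style iteration, assuming a non-negative data matrix, and not the exact procedure of [58]) is:

import numpy as np

def bistochastize(M, n_iter=50, eps=1e-9):
    """Alternately rescale rows and columns so that the row sums and the column
    sums each converge towards a constant (assumes a non-negative matrix)."""
    A = np.asarray(M, dtype=float)
    for _ in range(n_iter):
        A = A / (A.sum(axis=1, keepdims=True) + eps)   # equalize row sums
        A = A / (A.sum(axis=0, keepdims=True) + eps)   # equalize column sums
    return A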

Dhillon et al. [22] and Robardet et al. [70] have considered the two searched partitions as discrete random variables whose association must be maximized. Different measures can be used: whereas Cocluster [22] uses the loss in mutual information, Bi-Clust [70] uses Goodman-Kruskal's τ coefficient to evaluate the strength of the link between the two variables. In both algorithms, a local optimization method is used to optimize the measure by alternately changing one partition while the other one is kept fixed. The main difference between these two approaches is that the τ measure is independent of the number of co-clusters, and thus Bi-Clust can automatically determine the number of co-clusters.
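The alternating scheme shared by these algorithms can be sketched as follows; this is a simplified illustration that uses k-means on aggregated row and column profiles as a stand-in objective, not the actual Cocluster or Bi-Clust code, and it assumes that the numbers of row and column clusters are fixed in advance and that every cluster stays non-empty.

import numpy as np
from sklearn.cluster import KMeans

def alternate_coclustering(M, k_rows, k_cols, n_iter=10, seed=0):
    """Alternating co-clustering sketch: refine the row partition with the column
    partition fixed, then the column partition with the row partition fixed."""
    rng = np.random.RandomState(seed)
    col_labels = rng.randint(k_cols, size=M.shape[1])
    for _ in range(n_iter):
        # cluster rows on their profiles aggregated over the current column clusters
        row_profiles = np.column_stack(
            [M[:, col_labels == c].mean(axis=1) for c in range(k_cols)])
        row_labels = KMeans(n_clusters=k_rows, n_init=10,
                            random_state=seed).fit_predict(row_profiles)
        # cluster columns on their profiles aggregated over the current row clusters
        col_profiles = np.column_stack(
            [M[row_labels == r, :].mean(axis=0) for r in range(k_rows)])
        col_labels = KMeans(n_clusters=k_cols, n_init=10,
                            random_state=seed).fit_predict(col_profiles)
    return row_labels, col_labels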

Another co-clustering formulation was presented in [17]: the authors propose two different residue measures and introduce a co-clustering algorithm which optimizes the sum-squared residue function. Recently, Banerjee et al.
