
2.5 Clustering

2.5.2 Hierarchical

The hierarchical methods group data instances into a tree of clusters. There are two major methods under this category. One is the agglomerative method, which forms the clusters in a bottom-up fashion. The other method works in the opposite, divisive way: it starts from a single cluster including all the data instances and splits clusters into smaller ones in a top-down fashion until each cluster contains only one instance. The outcome of both the agglomerative and divisive methods can be represented by a dendrogram, in which each node (or branch) of the dendrogram represents a cluster.

The dendrogram is one of the major advantages of the hierarchical methods because it can be inspected easily by domain experts. Indeed, the dendrogram provides a wide choice of clusters, selected at different abstraction levels in the tree. Furthermore, studying the order in which the clusters have been merged or split tells the user something about the degree of similarity between the clusters and their objects. However, both methods suffer from their inability to perform adjustments once a splitting or merging decision is made. Indeed, they are not guided by a global objective function but by a "local" one: at each step they decide which is the best decision (cluster split or merge) on the current clusters, without changing previous decisions. Other advantages are:

• they do not require the number of clusters to be known in advance

• a flat partition can be derived afterwards (e.g., via a cut through the dendrogram, as sketched after this list)

• they compute a complete hierarchy of clusters

• the resulting dendrogram can be easily inspected by the end-user and can be integrated easily into other methods
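As a concrete illustration of these points, the minimal sketch below (assuming SciPy's scipy.cluster.hierarchy module; the toy data and parameter choices are purely illustrative) builds the complete hierarchy of clusters and then derives a flat partition by cutting the dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
# Two small Gaussian blobs as illustrative (hypothetical) data.
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# Agglomerative clustering: Z encodes the full merge tree (the dendrogram).
Z = linkage(X, method="average")

# A flat partition derived afterwards by cutting the tree into two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would draw the tree for visual inspection by a domain expert.
```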

Hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be joined (or split, for divisive approaches). For agglomerative hierarchical techniques, the criterion is typically to merge the closest pair of clusters, where "close" is defined by a specified measure of cluster proximity. There are three definitions of the closeness between two clusters:

• single-link

• complete-link

• average-link

The single-link similarity between two clusters is the similarity between the two most similar instances, where each instance comes from a different cluster of the pair. Single link is good at handling non-elliptical shapes, but is sensitive to noise and outliers. The complete-link similarity is the similarity between the two most dissimilar instances, one from each cluster. Complete link is less susceptible to noise and outliers, but can break large clusters and is biased towards globular (convex) shapes. The average-link similarity is a compromise between the two: it consists of computing the average similarity between all the pairs of points coming from the two clusters. Some of the best-known hierarchical clustering algorithms are Ward's agglomerative hierarchical clustering [4], Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) [94] and Clustering Using REpresentatives (CURE) [37].
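The three proximity definitions translate directly into code. The sketch below is a minimal NumPy-only illustration, under the assumption that similarity is expressed as Euclidean distance (smaller means more similar); the function names are hypothetical.

```python
import numpy as np

def cross_distances(A, B):
    """All Euclidean distances between points of cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):
    # Distance between the two closest (most similar) cross-cluster instances.
    return cross_distances(A, B).min()

def complete_link(A, B):
    # Distance between the two farthest (most dissimilar) cross-cluster instances.
    return cross_distances(A, B).max()

def average_link(A, B):
    # Average distance over all cross-cluster pairs: a compromise between the two.
    return cross_distances(A, B).mean()

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B))  # 2.0 4.0 3.0
```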

Ward’s hierarchical method [4] is one of the hierarchical clustering methods most used in the literature. It is a greedy, iterative and agglomerative hierarchical method. At each iteration of the algorithm (corresponding to a dendrogram level), an objective function determines the best candidate clusters to be merged into a new cluster. The objective function is based on the computation of an overall, global measure of goodness of each candidate clustering solution. In Ward’s method, the proximity between two clusters is defined as the variation in this overall, global measure of clustering goodness that results when the two clusters are merged. In turn, the global measure of each solution is given by a summation, over all clusters of the solution, of a measure of cluster cohesion. Cluster cohesion could be computed in many ways. One well-known measure is the sum of squared errors (SSE) within each cluster or, viewed differently, the variance of the cluster elements (let us call it W_i for cluster i). When two clusters are merged in agglomerative hierarchical clustering, the overall global measure increases. All the candidate solutions are computed, but the best one is determined by the minimum increase in the objective function. At the core of the method lie the computation of the cluster cohesion of each new cluster and the update of the matrix of distances for the inclusion of the new cluster. The new cluster cohesion is computed on the basis of the cohesions of the clusters that are considered for the merge, as follows:

W_{ir} = \frac{(|i| + |p|)\, W_{ip} + (|i| + |q|)\, W_{iq} - |i|\, W_{pq}}{|i| + |r|}    (2.4)

where p and q represent the two clusters being merged, r represents the new cluster formed by their fusion, and i represents any other cluster; |i| denotes the cardinality of cluster i, while W_{ij} denotes the cohesion of the cluster that would be obtained if clusters i and j were merged.
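The update in Eq. (2.4) amounts to a one-line computation. The sketch below is a direct, minimal transcription of the formula (the variable names are hypothetical; it assumes the cluster cardinalities and the pairwise cohesions W_{ip}, W_{iq}, W_{pq} are already known).

```python
def ward_update(w_ip: float, w_iq: float, w_pq: float,
                n_i: int, n_p: int, n_q: int) -> float:
    """Cohesion W_ir of merging cluster i with r = p ∪ q, per Eq. (2.4)."""
    n_r = n_p + n_q  # the new cluster r contains all points of p and q
    return ((n_i + n_p) * w_ip + (n_i + n_q) * w_iq - n_i * w_pq) / (n_i + n_r)
```

At each agglomeration step, the algorithm evaluates this update for every other cluster i and inserts the resulting entries into the distance matrix in place of those of p and q.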

BIRCH [94] uses a hierarchical data structure called CF-tree for partitioning the incoming data points in an incremental and dynamic way. The CF-tree is a height-balanced tree which stores the clustering features (CF) and is based on two parameters: the branching factor B and the threshold T, which refers to the maximum distance between two instances belonging to the same cluster. A CF-tree is built as the data is scanned. As each data point is encountered, the CF-tree is traversed, starting from the root and choosing the closest node at each level. When the closest leaf cluster for the current data point is finally identified, a test is performed to see whether adding the data item to the candidate cluster would result in a new cluster with a diameter greater than the given threshold T. BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans. It can also handle noise effectively. BIRCH has one drawback: it may not work well when clusters are not spherical, because it uses the concept of radius or diameter to control the boundary of a cluster.
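A minimal sketch of running BIRCH, assuming scikit-learn's Birch implementation and hypothetical toy data; here threshold and branching_factor play the roles of T and B in the description above.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(1)
# Two small Gaussian blobs as illustrative (hypothetical) data.
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

# threshold ~ T, branching_factor ~ B; n_clusters sets the final flat clustering.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)   # a single scan builds the CF-tree
# Further batches could be added incrementally with model.partial_fit(...).
print(np.bincount(labels))
```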

In CURE [37], instead of using a single centroid to represent a cluster, a constant number of representative points is chosen to represent each cluster. The number of points chosen is a parameter, c; it was found that a value of 10 or more works well. The similarity between two clusters is measured by the similarity of the closest pair of representative points belonging to different clusters.

Unlike centroid/medoid-based methods, CURE is capable of finding clusters of arbitrary shapes (e.g., ellipsoidal, spiral, cylindrical, non-convex) and sizes, as it represents each cluster via multiple representative points. Shrinking the representative points towards the centroid helps CURE to avoid the noise-sensitivity problem of the single-link method. However, hierarchical clustering cannot be applied directly to large data sets; for this reason, CURE takes a random sample and performs the hierarchical clustering on the sampled data points.
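The selection and shrinking of representative points is the core of CURE. The sketch below is a from-scratch illustration rather than the reference implementation: it picks c well-scattered points of a cluster by greedy farthest-point selection and then shrinks them towards the centroid by a factor alpha; both the function name and the selection heuristic are assumptions of this sketch.

```python
import numpy as np

def cure_representatives(points: np.ndarray, c: int = 10, alpha: float = 0.3):
    """Pick c scattered representative points and shrink them towards the centroid."""
    centroid = points.mean(axis=0)
    # Start from the point farthest from the centroid, then greedily add the
    # point farthest from the representatives chosen so far.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])
    reps = np.array(reps)
    # Shrinking dampens the effect of outliers among the representatives.
    return reps + alpha * (centroid - reps)
```

The similarity between two clusters would then be measured between their shrunken representative sets, e.g., as the minimum distance over all cross-cluster pairs of representatives.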