Comparison of S-DBSCAN and SF-DBSCAN - AlessandroRenda Algorithmsandtechniquesfordatastreammini

(Algorithm 10): the core objects in pList that belong to the same cluster of y (line 3) and the border objects that have a non-zero membership to the cluster of y (line 6) are investigated: clusters of core objects are updated and the membership degree of border objects to the new cluster is evaluated. The CMember row related to the deprecated cluster is set to zero (line 8), thus not affecting the subsequent analysis.

4.3 Comparison of S-DBSCAN and SF-DBSCAN

A preliminary experimental study has been carried out on three synthetic datasets showing concept drift with the aim of comparing the performance of SF-DBSCAN and S-DBSCAN. In the following, we first describe the datasets, and then we discuss the results.

Datasets

For our tests, we referred to the recently proposed synthetic benchmark Gaussian-MotionData (GMD) collection (Márquez et al., 2018). It consists of nine datasets with concept drift: in each dataset, distinct clusters are generated from distinct Gaussian distributions that change their mean and/or covariance over time. As we aimed to visually show the behaviour of SF-DBSCAN, we selected from GMD the two datasets with two-dimensional data and with a limited number of objects, namely GMD-1C2D1kLinear and GMD-4C2D800Linear. The former contains 1000 objects with one single cluster, which shows a progressive drift in the bidimensional feature space. The latter has 800 objects and four clusters, which change their co-variance and move at a steady pace towards a central point. We also generated yet another dataset, 3C2D7500Merge, by properly setting the data generation param-eters so to show how clusters can be merged during the streaming data analysis.

3C2D7500Merge consists of three clusters with 7500 objects: the centers of two clus-ters get closer over time until they merge with each other.

Results

Table 4.1 shows the parameter values adopted for SF-DBSCAN on each test dataset.

Their choice is the result of a grid search over a reasonable set of values for εmin, εmax and MinPts, using ARI as a reference metric for the analysis. This sort of opti-mization of the parameters configuration is clearly unfeasible in genuine clustering applications. However, the goal of the present analysis is just to explore the poten-tial of the two algorithms; we will broadly discuss the issue of parameter tuning in the next chapter.

A preliminary insight into the modeling capability of SF-DBSCAN is shown in Fig. 4.2. It reports the results obtained on the 25%, 50%, 75% and 100% of the

GMD-30 Towards fuzzy solutions for clustering data streams

Table 4.1: Parameter configuration for SF-DBSCAN Dataset εmin εmax minPts GMD-1C2D1kLinear 1.00 2.00 10 GMD-4C2D800Linear 1.45 5.00 6

3C2D7500Merge 1.70 7.00 8

1C2D1kLinear dataset. The algorithm tracks the evolution of data along the stream-ing analysis. Obviously, the dataset with a sstream-ingle cluster can be perfectly modeled with a high enough value of εmin: in this case we show how, even with a relatively small radius, some objects are initially considered as outliers (green triangles) and later incorporated into the unique cluster.

Figure 4.2: Results of SF-DBSCAN after analyzing 25%, 50%, 75% and 100% of the dataset GMD-1C2D1kLinear. Figure from (Aliperti et al., 2019).

Datasets GMD-4C2D800Linear and 3C2D7500Merge are used for measuring the quality of the partitions obtained with SF-DBSCAN and S-DBSCAN in terms of Ad-justed Rand Index (ARI). As for SF-DBSCAN, the ARI is evaluated after a defuzzi-fication process: the cluster label for a border object is assigned on the basis of its highest membership degree

For a fair comparison with its fuzzy counterpart, S-DBSCAN is executed with three different parameter configurations: we set MinPts coherently with SF-DBSCAN, while ε is equal to εmin, εmax, and their average value, respectively. Results for datasets GMD-4C2D800Linear and 3C2D7500Merge are reported in Tables 4.2 and 4.3, re-spectively. Dataset GMD-1C2D1kLinear is excluded from the numerical evaluation since it contains a unique cluster.

Results suggest that SF-DBSCAN outperforms, on average, the relative crisp base-line: the high sensitivity of S-DBSCAN to the parameters setting makes it less suit-able to handle evolving data streams, especially when clusters are not well spaced.

A small value of the neighborhood radius (εmin) leads to structures that are signif-icantly less accurate than those detected by SF-DBSCAN over the entire streaming process. Conversely, higher values of ε (εavg and εmax) make S-DBSCAN group to-gether close clusters, thus dramatically affecting the clustering performance.

4.3 Comparison of S-DBSCAN and SF-DBSCAN 31

Table 4.2: Adjusted Rand Index for dataset GMD-4C2D800Linear GMD-4C2D800Linear

25% 50% 75% 100% Avg

SF-DBSCAN - εmin, εmax 0.940 0.993 0.993 0.982 0.909 S-DBSCAN - εmin 0.698 0.865 0.911 0.924 0.798 S-DBSCAN - εmax 0.987 0.713 0.000 0.000 0.611 S-DBSCAN - εavg 0.975 0.993 0.993 0.000 0.763 Table 4.3: Adjusted Rand Index for dataset 3C2D7500Merge

3C2D7500Merge

25% 50% 75% 100% Avg

SF-DBSCAN - εmin, εmax 0.998 0.984 0.565 0.569 0.752 S-DBSCAN - εmin 0.885 0.919 0.537 0.548 0.655 S-DBSCAN - εmax 0.544 0.000 0.000 0.000 0.200 S-DBSCAN - εavg 0.543 0.000 0.000 0.000 0.294

A further evaluation of the two algorithms is performed through a visual analy-sis of intermediate and final partitions.

(a) S-DBSCAN: 25% (b) S-DBSCAN: 50% (c) S-DBSCAN: 75% (d) S-DBSCAN: 100%

(e) SF-DBSCAN: 25% (f) SF-DBSCAN: 50% (g) SF-DBSCAN: 75% (h) SF-DBSCAN: 100%

Figure 4.3: Dataset GMD-4C2D800Linear partitioned with S-DBSCAN (first row, a-d) and SF-DBSCAN (second row, e-h). Figure from (Aliperti et al., 2019).

In Figures 4.3 and 4.4 we compare the clustering quality of S-DBSCAN and SF-DBSCAN, showing the structures obtained after analyzing the 25%, 50%, 75% and 100%of the objects in GMD-4C2D800Linear and 3C2D7500Merge, respectively. As

32 Towards fuzzy solutions for clustering data streams

(a) S-DBSCAN: 25% (b) S-DBSCAN: 50% (c) S-DBSCAN: 75% (d) S-DBSCAN: 100%

(e) SF-DBSCAN: 25% (f) SF-DBSCAN: 50% (g) SF-DBSCAN: 75% (h) SF-DBSCAN: 100%

Figure 4.4: Dataset 3C2D7500Merge partitioned with S-DBSCAN (first row, a-d) and SF-DBSCAN (second row, e-h). Figure from (Aliperti et al., 2019).

for S-DBSCAN, only partitions corresponding to the best Adjusted Rand Index are shown.

Fig. 4.3e shows how, after analyzing a few objects, SF-DBSCAN partitions the data into four clusters, each associated with a Gaussian distribution. In Fig. 4.3f (snapshot after 50% of the objects), Fig. 4.3g (snapshot after 75% of the objects) and Fig. 4.3h (all objects analyzed) we can notice that SF-DBSCAN successfully follows and models the concept drift by properly elongating the clusters. Interestingly, when all objects have been analyzed (Fig. 4.3h), thanks to the fuzzy borders, SF-DBSCAN identifies a limited number of outliers, modelling more effectively the data distribu-tion. Figures 4.4e, 4.4f, 4.4g and 4.4h show that SF-DBSCAN models correctly the scenario in which one cluster drifts towards another, merging them into a new clus-ter. It is worth pointing out that the crisp version of the algorithm labels as outliers many more objects than its fuzzy counterpart does, both for GMD-4C2D800Linear (Figures 4.3a, 4.3b, 4.3c and 4.3d) and for 3C2D7500Merge (Figures 4.4a, 4.4b, 4.4c and 4.4d). Relaxing the constraint on the neighborhood size determines the inclu-sion of marginal points of the distributions into the fuzzy border of clusters.

Figure 4.5 shows a detail of the output fuzzy partition produced by SF-DBSCAN on 3C2D7500Merge after analyzing half of the dataset. Cluster membership is shown in detail for 7 objects, those contained in the black box in the inner figure. It can be observed that some objects have a non-zero degree of membership to multiple clus-ters, thus implementing the abstraction of fuzzy overlapping borders. Border objects are represented with an opacity proportional to their highest degree of membership

4.3 Comparison of S-DBSCAN and SF-DBSCAN 33

Figure 4.5: Detail on the fuzzy partition captured by SF-DBSCAN on 3C2D7500Merge after analyzing 50% of the dataset: clusters with fuzzy over-lapping borders.

to a cluster.

Computational Complexity

The time complexity of S-DBSCAN and SF-DBSCAN is evaluated at the arrival of a new object p. Let n be the number of objects of the stream collected so far, ˆk the average number of objects in the kernel-neighborhood of an object and f the data dimensionality. For both the algorithms, the time complexity is mainly influenced by the regionQuery procedure, that, in the worst case, must be performed ˆk times.

Typically, ˆk and f are small with respect to n, resulting in a complexity of O(_n). It can be reduced to O(log n) if a spatial indexing structure is adopted (Ester et al., 1996).

Border objects membership assessment policy

As anticipated in Section 3.2, the most suitable policy for the assignment of the de-gree of membership of an object to a cluster has been investigated. In the original implementation of FuzzyBorder (Ienco and Bordogna, 2018) a border object y gets a

34 Towards fuzzy solutions for clustering data streams

membership degree to the fuzzy border of a cluster C defined as follows:

µ_C(y) = min_x_∈_neighcore₍_y₎µx(y) _(4.2) where neighcore(y) = {x ∈ core(c) ∧µx(y) >0}. Given the monotonically decreas-ing membership function defined in Equation 4.1 and depicted in Fig. 3.2, Equation 4.2 can be interpreted as the implementation of a farthest-core policy, according to which the farthest core object is considered for the cluster membership assessment of a border object.

In SF-DBSCAN instead, we adopt a nearest-core policy. A border object y gets a membership degree to the fuzzy border of a cluster C defined as follows:

µC(y) = max_x_∈_neighcore₍_y₎µx(y) _(4.3) where neighcore(y)is defined as above.

A visual comparison of the approaches is proposed to appreciate the substan-tial differences between them: Figure 4.6 shows the border objects (along with con-nected core objects) captured in the two scenarios on dataset 3C2D7500Merge. In addition, the histogram of membership degree of border objects is also reported.

It is evident that in the case of the farthest-core policy (Fig. 4.6a), the border objects have in general a very low membership degree. The histogram suggests that the membership degree values are distributed close to zero and border objects are barely visible in the bidimensional attribute space. Connected core objects, those that determine the membership degree of at least one border object (represented as black dots) are spread within clusters. The dependence on inner core objects rather than peripheral ones can lead to incosistent modeling of fuzzy borders, especially when clusters have a non-convex shape.

Conversely, the nearest-core policy leads to a more intuitive modeling of fuzzy borders. Figure 4.6b shows how membership degree values are more evenly dis-tributed in the [0, 1]range than in the farthest-core case. The core objects on which the assignment depends are always those closest to the boundary of the clusters.

Nel documento AlessandroRenda Algorithmsandtechniquesfordatastreammining (pagine 34-39)