Advantages and limitations of SF-DBSCAN - AlessandroRenda Algorithmsandtechniquesfordatastreamm

34 Towards fuzzy solutions for clustering data streams

membership degree to the fuzzy border of a cluster C defined as follows:

µ_C(y) = min_x_∈_neighcore₍_y₎µx(y) _(4.2) where neighcore(y) = {x ∈ core(c) ∧µx(y) >0}. Given the monotonically decreas-ing membership function defined in Equation 4.1 and depicted in Fig. 3.2, Equation 4.2 can be interpreted as the implementation of a farthest-core policy, according to which the farthest core object is considered for the cluster membership assessment of a border object.

In SF-DBSCAN instead, we adopt a nearest-core policy. A border object y gets a membership degree to the fuzzy border of a cluster C defined as follows:

µC(y) = max_x_∈_neighcore₍_y₎µx(y) _(4.3) where neighcore(y)is defined as above.

A visual comparison of the approaches is proposed to appreciate the substan-tial differences between them: Figure 4.6 shows the border objects (along with con-nected core objects) captured in the two scenarios on dataset 3C2D7500Merge. In addition, the histogram of membership degree of border objects is also reported.

It is evident that in the case of the farthest-core policy (Fig. 4.6a), the border objects have in general a very low membership degree. The histogram suggests that the membership degree values are distributed close to zero and border objects are barely visible in the bidimensional attribute space. Connected core objects, those that determine the membership degree of at least one border object (represented as black dots) are spread within clusters. The dependence on inner core objects rather than peripheral ones can lead to incosistent modeling of fuzzy borders, especially when clusters have a non-convex shape.

Conversely, the nearest-core policy leads to a more intuitive modeling of fuzzy borders. Figure 4.6b shows how membership degree values are more evenly dis-tributed in the [0, 1]range than in the farthest-core case. The core objects on which the assignment depends are always those closest to the boundary of the clusters.

4.4 Advantages and limitations of SF-DBSCAN 35

(a) Farthest core policy (b) Nearest core policy

Figure 4.6: Comparison between farthest-core and nearest-core policies. Borders ob-ject are represented with an opacity proportional to the degree of membership to the connected core object. A core object is shown (as a black dot) only if it is the farthest one (a) or the nearest one (b) to a border object, i.e. it determines its fuzzy membership. Dataset 3C2D7500Merge is considered.

the membership assessment of border objects. In other words, the introduction of εmax does not affect the identification of core objects, yet mitigating the parameter sensitivity of crisp DBSCAN towards the identification of cluster borders.

On the other hand, it is worth underlining the main limitations of SF-DBSCAN:

• SF-DBSCAN keeps all objects in memory to evaluate the partition. As high-lighted in Section 2.1, this is not feasible in real streaming applications, where the stream can be unbounded;

• although the introduction of fuzziness allows to better handle concept drift compared to the crisp implementation, a proper adaptation to evolving sce-narios is not possible unless minor importance is given to older points;

• finding a proper parameter setting for SF-DBSCAN (and even for the DB-SCAN itself) can still be considered an open issue.

The analysis of these drawbacks has prompted us to development of a novel proposal, which is discussed in the next chapter.

Chapter 5 TSF-DBSCAN: a fuzzy density-based clustering algorithm for data stream

In the previous chapter we have described a first fuzzy adaptation of DBSCAN clustering algorithm for the streaming setting, namely SF-DBSCAN. The discussion presented in this chapter stems from the strengths and limitations of SF-DBSCAN and leads to the proposal of a novel fuzzy density-based clustering algorithm for data streams dubbed DBSCAN (Temporal Streaming Fuzzy DBSCAN). TSF-DBSCAN has been presented in (Bechini et al., 2020c). In the final part of the chap-ter we focus on the open issue of paramechap-ter setting, which is crucial in DBSCAN as well as in its extensions, and we investigate a novel automatic tuning procedure currently restricted to the static case.

5.1 Motivations and background on baseline algorithms

In this section we describe the rationale behind the development of the proposed TSF-DBSCAN algorithm. In particular we briefly recall the algorithms that inspired our solution along with their benefits and limitations.

The very forerunner of the algorithms that led to developing TSF-DBSCAN is DBSCAN: designed for static datasets, it determines clusters by searching for density connected objects. In Chapter 4 we have introduced two extensions of DBSCAN algorithm aimed at managing data streams, i.e. S-DBSCAN and SF-DBSCAN. In both of them, the partition is updated at the arrival of each new object p by properly handling one out of the following scenarios: i) p is labeled as an outlier, ii) p joins an existing cluster, iii) p determines the merging of two existing clusters, iv) p initiates a new cluster. Unlike S-DBSCAN, however, SF-DBSCAN captures clusters with fuzzy borders by relaxing the constraint on the neighborhood size, thus allowing marginal

38 TSF-DBSCAN: a fuzzy density-based clustering algorithm for data stream

objects of data distribution to be included into fuzzy overlapping borders of clusters.

Experimental comparisons indicate that SF-DBSCAN is characterized by a higher modelling capability than S-DBSCAN thanks to the introduction of fuzzy borders.

Since all data are kept in memory, however, the sequence of objects to analyze has necessarily to be bounded, and the capability to handle non-stationary data streams is compromised.

TSF-DBSCAN extends and improves SF-DBSCAN by introducing a forgetting mechanism by means of a damped window model, thus enabling adaptation to evolving scenarios and ensuring compliance with memory and computational con-straints. As a result, TSF-DBSCAN integrates concepts of fuzzy set theory, forgetting mechanisms and density-based clustering for the analysis of potentially unbounded data streams.

The main features of these algorithms are summarised in Table 5.1.

Table 5.1: Comparison among DBSCAN and its extensions (S-DBSCAN, SF-DBSCAN and TSF-SF-DBSCAN) with respect to possible streaming data clustering re-quirements

Requirement DBSCAN S-DBSCAN SF-DBSCAN TSF-DBSCAN

Detection of arbitrarily shaped clusters ! ! ! !

Number of clusters not required as input parameter ! ! ! !

Ability to handle streaming data ! ! !

Enhanced modelling capability due to fuzzy borders ! !

Ability to deal with unbounded sequences of objects !

Ability to adapt to concept drift !

Nel documento AlessandroRenda Algorithmsandtechniquesfordatastreammining (pagine 39-43)