
PhD Program in Smart Computing

DIPARTIMENTO DI INGEGNERIA DELL'INFORMAZIONE (DINFO)

Algorithms and techniques for data stream mining

Alessandro Renda

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Smart Computing


PhD Program in Smart Computing

University of Florence, University of Pisa, University of Siena

Algorithms and techniques for data stream mining

Alessandro Renda

Advisor:

Prof. Alessio Bechini

Head of the PhD Program:

Prof. Paolo Frasconi

Evaluation Committee:

Prof. José Antonio Gámez Martín, Universidad de Castilla-La Mancha, Spain
Prof. Stefano Rovetta, Università di Genova, Italy

XXXIII ciclo — October 2020


I dedicate this thesis to Giulio Regeni


Acknowledgments

First, I would like to thank the members of the Evaluation Committee, Professor José Antonio Gámez Martín and Professor Stefano Rovetta, for their valuable feedback about this manuscript.

I would like to express my sincere thanks to the people who have supported, inspired and helped me during this period. The first thought goes to my supervisor, Ing. Alessio Bechini, for his continuous and priceless support and for his invaluable scientific advice. I express my gratitude to Professor Francesco Marcelloni for his enlightening guidance and inspiring vision. I would like to sincerely thank Professor Pietro Ducange for his precious and instructive advice, constant help and understanding. I would like to thank Professor Beatrice Lazzerini for giving me the opportunity to approach the prestigious activity of teaching, and for everything she has taught me. I would like to thank Professor Alessandro Lenci for his kindness and constructive remarks as a member of the Supervisory Committee over the past three years. Without them this experience would not have been as formative, challenging and rewarding. I would like to thank all my office mates and colleagues for the valuable cooperation or, simply, for their friendly presence. I wish to take this opportunity to thank Professor Paolo Frasconi for his helpfulness as head of the PhD program, and, finally, the administrative staff of the Universities of Pisa and Florence, in particular Mmes Simona Altamura, Maura Pancanti and Antonella Righi.

I can only conclude this short section with a huge thank you to my family, for having always been present, warm and encouraging. I would like to extend these thanks to all my closest friends: I have learned a lot and drawn positive energy from them.


Abstract

The abstraction of data streams encompasses a vast range of diverse applications that continuously generate data and therefore require dedicated algorithms and approaches for exploitation and mining. In this framework both unsupervised and supervised approaches are generally employed, depending on the task and on the availability of annotated data. This thesis proposes novel algorithms and techniques specifically tailored for the streaming setting and for knowledge discovery from Social Networks.

In the first part of this work we propose a novel clustering algorithm for data streams. Our investigation stems from the discussion of general challenges posed by cluster analysis and of those purely related to the streaming setting.

First, we propose SF-DBSCAN (streaming fuzzy DBSCAN), a preliminary solution conceived as an extension of the popular DBSCAN algorithm. SF-DBSCAN handles the arrival of new objects and continuously updates the clustering result by taking advantage of concepts from fuzzy set theory. However, it gives equal importance to every collected object and therefore is not suitable to manage unbounded data streams and to adapt to evolving settings. Then, we introduce TSF-DBSCAN, a novel "temporal" adaptation of streaming fuzzy DBSCAN: it overcomes the limits of the previous proposal and proves to be effective in handling evolving and potentially unbounded data streams, discovering clusters with fuzzy overlapping borders.

In the second part of the thesis we explore a supervised learning application: the goal of our analysis is to discover the public opinion towards the vaccination topic in Italy, by exploiting the popular Twitter platform as data source. First, we discuss the design and development of a system for stance detection from text. The deployment of the classification model for the online monitoring of the public opinion, however, cannot ignore that tweets can be seen as a particular form of a temporal data stream. Then, we discuss the importance of leveraging user-related information, which enables the design of a set of techniques aimed at deepening and enhancing the analysis. Finally, we compare different learning schemes for addressing concept drift, i.e. a change in the underlying data distribution, in a dynamic environment affected by the occurrence of real-world, context-related events. In this case study and throughout the thesis, the proposal of algorithms and techniques is supported by in-depth experimental analysis.


Contents

Contents 1

1 Introduction 3

1.1 Thesis contribution . . . 4

2 Data stream mining: Background 7

2.1 Overview . . . 7

2.2 Concept Drift . . . 8

2.3 Data stream clustering . . . 9

2.4 Assessment of clustering quality . . . 11

2.5 General challenges of clustering in the static setting . . . 12

3 Related Work 15

3.1 Data stream clustering algorithms . . . 15

3.2 Fuzzy clustering . . . 18

4 Towards fuzzy solutions for clustering data streams 23

4.1 S-DBSCAN: A preliminary incremental adaptation of DBSCAN . . . 23

4.2 SF-DBSCAN: Streaming Fuzzy DBSCAN . . . 24

4.3 Comparison of S-DBSCAN and SF-DBSCAN . . . 29

4.4 Advantages and limitations of SF-DBSCAN . . . 34

5 TSF-DBSCAN: a fuzzy density-based clustering algorithm for data stream 37

5.1 Motivations and background on baseline algorithms . . . 37

5.2 The Temporal Streaming Fuzzy DBSCAN . . . 38

5.3 Experimental Evaluation . . . 47

5.4 The open issue of parameter tuning: a preliminary study . . . 63

6 Learning from Twitter stream: stance detection from tweets 77

6.1 Motivation for the supervised learning approach . . . 78

6.2 Background and Related Works . . . 82

6.3 Stance Detection System . . . 85

6.4 The vaccination topic case study: Experimental campaign . . . 89

6.5 Outcomes and Perspectives . . . 97

7 Public opinion mining from Twitter: enhancing the analysis 101

7.1 Twitter vaccination dataset: a statistical overview . . . 102

7.2 Characterizing Twitter users: stance analysis . . . 104

8 Addressing Event-Driven Concept Drift in Twitter Stream 113

8.1 Related Works: Adaptive solutions for text stream classification . . . 115

8.2 Problem definition and baseline learning schemes . . . 116

8.3 The proposed Semantic learning scheme . . . 120

8.4 Experimental Analysis . . . 124

9 Conclusion and future works 135

A Publications 137

Bibliography 139


Chapter 1

Introduction

Recent developments in software and hardware technologies have enabled the generation of a huge amount of data at a very high speed. The web and social networks pervade the everyday life of most of us: every minute approximately half a million tweets are posted on Twitter, 5 million searches are handled by Google, and more than 100 million emails are sent, most of which are spam1. Internet of Things (IoT) devices and sensors are driving the Fourth Industrial Revolution (Industry 4.0) and serve a myriad of diverse applications that continuously produce data streams. The advent of wearable devices, which constantly acquire and transmit measurements of body signals, has fostered novel approaches in the healthcare domain. The need to process and extract knowledge from such big data is prompting the development of fast and efficient methods, since traditional algorithms and technologies are often unsuited to handle datasets characterized by large volume, variety and velocity (Laney, 2001). The property of 'velocity', in particular, gives rise to the notion of data streams, as continuously generated and potentially unbounded sequences of objects (Cano and Krawczyk, 2020).

Batch processing algorithms are typically exploited for knowledge discovery from large, yet finite, amounts of data, typically generated according to stationary probability distributions. Conversely, the algorithmic abstraction of data stream makes it possible to frame the main features and challenges of big data real-time analytics (Bifet et al., 2018): data streams are sequences of objects, each with an associated timestamp.

Accordingly, data mining algorithms and techniques tailored to the streaming scenario must account for the following aspects (Silva et al., 2013): (i) data objects arrive continuously; (ii) information should be extracted in real time; (iii) the stream is potentially unbounded; (iv) the data generating process may evolve over time, thus making the corresponding distributions non-stationary.

From a machine learning perspective, both supervised and unsupervised learning from data streams have attracted the attention of researchers. Clustering has gained popularity in this specific field, as it deals with unlabeled instances: in fact, the difficulty and the cost of annotating objects generated at a rapid rate typically hamper the adoption of supervised learning approaches in many real-world applications.

1 https://www.internetlivestats.com/one-second/. (Last visited October 2020)

The development of advanced clustering algorithms, able to cope with the challenges posed by the streaming setting, is still considered essential for this area of research. The present investigation stems from the intuition that the combination of a density-based approach with concepts from fuzzy set theory may represent an unexplored contribution in the context of clustering data streams. In the first part of this thesis, we describe the design, development and validation of a novel clustering algorithm for data streams.

For other classes of applications, instead, supervised approaches are generally used: this is the case, for example, of sentiment analysis and opinion mining, in which the class labels are typically predefined. In the second part of the thesis, we deeply explore an application of knowledge discovery from Social Networks with a focus on the stream of tweets published on Twitter2. The widespread diffusion of Twitter has recently enabled a huge number of data mining applications based on text classification algorithms, therefore pertaining to the supervised learning domain. However, very often the investigations carried out are limited to the static analysis of plain text and fail to deliver accurate and detailed insights from such a noisy and evolving environment. We analyze the case study of mining the public opinion towards the vaccination topic: we first describe the design of a text classification system, then we discuss a set of techniques and approaches for enhancing and sharpening our analysis.

1.1 Thesis contribution

This thesis proposes novel solutions for mining knowledge from data streams and from Social Networks, both from an algorithmic and an application point of view.

The main contributions of the thesis are the following:

• the design and development of a novel density-based fuzzy clustering algorithm for data streams;

• a step-by-step experimental analysis for the validation of the proposed approach;

• the proposal of a set of algorithms and techniques for an application of data mining from the Twitter stream: the detection of users’ stance towards the vaccination topic.

2https://twitter.com/


The thesis is organized as follows. Chapter 2 provides an overview of data stream mining, with a focus on concepts and challenges related to cluster analysis. Particular attention is given to the discussion of concept drift, a phenomenon typical of evolving settings, which relates to both supervised and unsupervised learning. Background on a popular and established clustering algorithm, namely DBSCAN, is also provided, since it represents the very first forerunner of our proposal. Chapter 3 reviews the related works in the research fields relevant to our analysis. Particular attention is dedicated to state-of-the-art approaches for streaming clustering, but we also recall the most relevant fuzzy extensions of the DBSCAN algorithm. Chapter 4 describes the preliminary steps towards our proposal of a clustering algorithm for the streaming setting. We first introduce a naive online version of DBSCAN. Then, we enhance its modelling capability with the introduction of fuzzy concepts. A simple experimental analysis highlights the benefits of the fuzzy proposal, but at the same time it sheds light on the limits of the proposed approach. Chapter 5 thoroughly describes TSF-DBSCAN, a novel fuzzy density-based clustering algorithm, designed to overcome some of the limits of the existing approaches. A wide experimental analysis compares the proposed approach with its forerunners and with state-of-the-art clustering algorithms for data streams. The chapter concludes with the description of some open issues and perspectives, and with the proposal of a preliminary solution addressing the problem of input parameter setting, common to most data mining algorithms.

The second part of the thesis presents a data mining application of stance detection from social networks. After motivating the transition to the supervised learning paradigm, we describe the general framework of knowledge discovery from the Twitter stream in Chapter 6: the goal is to uncover the public opinion expressed by Italian users on the popular platform. The data collection and processing pipeline is illustrated along with the steps that led to the design of a stance classification system from text. Preliminary results from an online monitoring campaign are reported. Chapters 7 and 8 describe two orthogonal improvements of the stance classification system presented in Chapter 6. In Chapter 7 we describe a sophisticated pipeline to get a sharper picture of the public opinion, obtained by taking into consideration user-related information. In particular, we show the results of a stance monitoring campaign with advanced analysis on temporal and spatial scales. In Chapter 8 we discuss the importance of addressing concept drift in the Twitter stream, and we compare different approaches for learning in such a non-stationary environment. Chapter 9 concludes the thesis providing a general discussion of the presented work.


Chapter 2

Data stream mining: Background

This chapter describes the key notions of data stream mining. The main concepts and issues are introduced: they serve as background both for the discussion of the related works and for the algorithms and techniques discussed in the thesis. In particular, we provide the definition of concept drift and discuss two fundamental aspects of data stream clustering, namely data structures for summarization and time window models. Such concepts are extensively covered in the data stream mining and clustering literature: in this chapter we make particular reference to the most recent works, namely (Silva et al., 2013; Gama et al., 2014; Ramírez-Gallego et al., 2017; Bifet et al., 2018; Moulton et al., 2019; Carnein and Trautmann, 2019; Zubaroğlu and Atalay, 2020; Cano and Krawczyk, 2020). We also make a brief digression into the static scenario to recall the key concepts of one of the best known clustering algorithms, i.e. DBSCAN, from which our analysis originates.

2.1 Overview

Formally, a data stream DS is an ordered sequence of data objects DS = {x_1, x_2, ..., x_N}, where each object x_i is described as an n-dimensional feature vector x_i = (x_i^1, x_i^2, ..., x_i^n) and has an associated timestamp. Supervised learning approaches assume that each object x_i also has an associated label y_i, e.g. a categorical value in classification problems. Conversely, ground-truth labels are not available in unsupervised learning applications.

A variety of machine learning and statistics methods have been proposed to extract information from the growing amount of available data. Only in the early 1990s did this research field come to be referred to as data mining. Traditional data mining techniques, however, assume the availability of static and resident data repositories, and turn out to be inadequate for processing large volumes of data streams.

Data stream mining literature (Guha et al., 2003; Silva et al., 2013; Nguyen et al., 2015; Bifet et al., 2018; Ramírez-Gallego et al., 2017; Zubaroğlu and Atalay, 2020; Cano and Krawczyk, 2020) has consistently underlined a number of challenges peculiar to the streaming setting:

• data objects arrive continuously and information should be extracted in real time;

• the size of the stream is potentially unbounded (N → ∞);

• the data generating process may evolve over time leading to non-stationary distributions.

Therefore, the intrinsic nature of data streams imposes some restrictions on the learning system. Firstly, the amount of time for processing instances must be limited to offer adequate responsiveness. Furthermore, it is unfeasible to store the entire dataset and to process each object multiple times: typically each sample is read, processed and immediately discarded, contributing only to the maintenance of a stream synopsis. Sometimes, this constraint is relaxed through the implementation of dedicated buffers that store the most recent objects. Finally, the algorithm must be able to detect and/or adapt to changes in the underlying data generating process, a phenomenon widely referred to as concept drift.
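The read-process-discard pattern described above can be illustrated with a minimal constant-memory synopsis. The sketch below (ours; the class name and the choice of Welford's algorithm as the synopsis are illustrative, not part of the thesis) maintains count, mean and variance of a one-dimensional stream in a single pass, without ever storing past objects:

```python
class RunningStats:
    """Constant-memory stream synopsis: count, mean and variance,
    updated one object at a time (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0        # number of objects seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # sum of squared deviations from the mean

    def update(self, x):
        # Each object is read once, folded into the synopsis, then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the objects processed so far.
        return self.m2 / self.n if self.n else 0.0
```

Each `update` call runs in constant time and memory, which is exactly the kind of per-object cost that the responsiveness constraint demands.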

2.2 Concept Drift

Given a set of input variables X and a target variable Y, a concept is defined as the set of prior probabilities and class-conditional density functions for all classes Y = {y_1, ..., y_C} (Nguyen et al., 2015):

S = {(p(y_1), p(X|y_1)); ...; (p(y_C), p(X|y_C))}   (2.1)

In the static setting, the entire dataset is generated by a single concept, and training and test sets are supposed to be identically distributed. The intrinsic dynamic nature of the streaming setting drops this assumption and leads to the occurrence of multiple concepts over time.

A formal general definition of concept drift is given in (Gama et al., 2014). Concept drift between two timestamps t_0 and t_1 can be defined as:

∃X : p_t0(X, Y) ≠ p_t1(X, Y)   (2.2)

where p_ti represents the joint probability distribution at timestamp t_i between the set of input variables X and the target variable Y. Such drift may occur due to a change in p(X) or in p(Y|X) (Moulton et al., 2019).


In classification tasks it is of the utmost importance to react to the changes that affect the prediction: drift can be handled explicitly, i.e. resorting to a change detector, or implicitly, i.e. endowing the model with an intrinsic ability to adapt (e.g. through a sliding window) (Cano and Krawczyk, 2020). However, concept drift is widely discussed in the clustering literature as well (Moulton et al., 2019; Silva et al., 2013; Zubaroğlu and Atalay, 2020): due to the absence of class labels, the only relevant phenomenon is the change over time of the underlying data distribution p(X).

Figure 2.1: Different types of data distribution changes over time. Figure adapted from (Gama et al., 2014).

In practical applications the adaptation strategy depends on the particular type of concept drift: a qualitative conceptualization of concept drift is provided in (Gama et al., 2014), where the authors distinguish between different forms of changes in data distribution over time. Fig. 2.1 reports one-dimensional toy examples for several types of drift, namely sudden, incremental, gradual and reoccurring. Sudden or abrupt concept drift refers to the scenario in which the mean of the data distribution suddenly switches from one value (or concept) to another, without exploring intermediate values. This also happens in gradual concept drift, but in this case both concepts are maintained during a transitional period. Conversely, in incremental concept drift the data distribution switches from one concept to another exploring many intermediate concepts. Notably, data distribution concepts may exhibit redundancy or periodicity: when the drift restores already seen concepts, it is referred to as reoccurring concept drift.
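The one-dimensional toy examples of Fig. 2.1 can be reproduced with a small synthetic generator. The sketch below is ours (function name, concept means and noise level are illustrative assumptions, not taken from the thesis) and produces a stream whose mean moves from one concept to another according to the four drift types:

```python
import numpy as np

rng = np.random.default_rng(0)

def drift_stream(kind, n=1000, m0=0.0, m1=5.0):
    """1-D toy stream whose mean moves from concept m0 to concept m1
    according to the given drift type (hypothetical generator)."""
    t = np.arange(n)
    if kind == "sudden":
        # abrupt switch at mid-stream, no intermediate values
        mean = np.where(t < n // 2, m0, m1)
    elif kind == "incremental":
        # many intermediate concepts between m0 and m1
        mean = m0 + (m1 - m0) * t / (n - 1)
    elif kind == "gradual":
        # both concepts coexist; the new one becomes progressively more likely
        p_new = t / (n - 1)
        mean = np.where(rng.random(n) < p_new, m1, m0)
    elif kind == "reoccurring":
        # previously seen concept is periodically restored
        mean = np.where((t // (n // 4)) % 2 == 0, m0, m1)
    else:
        raise ValueError(kind)
    return mean + rng.normal(0.0, 0.5, n)   # observation noise
```

Streams generated this way are a convenient test bed for the adaptation mechanisms (e.g. window models) discussed later in this chapter.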

2.3 Data stream clustering

Clustering can be regarded as an optimization task (Carnein and Trautmann, 2019), where the goal is to simultaneously maximize the intra-cluster homogeneity and the inter-cluster heterogeneity. In the streaming setting such an optimization process obeys the order of arrival of the items and takes place under resource and time constraints.

Two main computational approaches have been proposed in the literature for clustering data streams: we refer to these approaches as online clustering and two-stage clustering. In the former, the partition is updated at each new occurrence of an object. In the latter, the algorithm keeps track of the received data by calculating a data synopsis, or summarization, in the online stage, and then exploits such a compact representation for the evaluation of the actual partition in the offline clustering stage, to be executed on demand or periodically; typically, the offline stage leverages traditional clustering algorithms, such as DBSCAN (Ester et al., 1996) or k-means (MacQueen et al., 1967).

Data structures

In most streaming applications, restrictions imposed by memory constraints do not allow saving all the incoming objects: as a consequence, the output partition is often evaluated on particular data structures that summarize the input stream. Authors in (Silva et al., 2013) identified the most commonly used data structures: feature vectors, prototype arrays, coreset trees and grids.

A feature vector (introduced in (Zhang et al., 1996)) includes the number, the linear sum and the square sum of the data objects in the cluster. Such an abstraction can be efficiently updated as new objects arrive, and conveniently used to calculate both cluster centroids and radii. More recently, authors in (Aggarwal et al., 2003) extended the original feature vector by including temporal information in the data summarization step, thus introducing the concept of microcluster. A prototype array consists of a collection of representative instances, e.g. centroids or medoids, while coreset trees organize prototypes as binary trees. In grid-based summarization the object space is discretized along each dimension, thus defining a grid structure.

The clustering operation is performed on the grid cells, which are populated in the online stage with the stream of objects.
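The feature-vector triplet described above (number, linear sum, square sum) is easy to sketch in code. The following minimal class is ours (the class name and radius definition are illustrative; the radius is one common formulation, the root-mean-square distance of members from the centroid):

```python
import numpy as np

class ClusteringFeature:
    """CF triplet (N, LS, SS) in the spirit of (Zhang et al., 1996):
    count, linear sum of the objects, and sum of their squared norms."""

    def __init__(self, dim):
        self.n = 0                  # number of objects absorbed
        self.ls = np.zeros(dim)     # linear sum, one entry per feature
        self.ss = 0.0               # sum of squared norms

    def insert(self, x):
        # Constant-time update: no individual object is stored.
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # RMS distance of members from the centroid,
        # derivable from (N, LS, SS) alone.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - float(c @ c), 0.0)))
```

Two CF vectors for the same feature space can also be merged by summing their components, which is what makes this summary attractive for hierarchical and streaming schemes.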

Time window models

In addition to ensuring compliance with available computational resources, streaming clustering algorithms must implement mechanisms of adaptation to evolving settings, pursuing a higher model quality. For this reason, higher importance is given to the most recent objects, while outdated information can be discarded. Time window models fulfill this purpose; the most common implementations are the sliding, landmark and damped window models (Silva et al., 2013; Nguyen et al., 2015; Zubaroğlu and Atalay, 2020).

In the sliding window model a first-in-first-out policy is adopted: only the most recent N objects are stored and equal importance is given to all of them. The window length N is a user-defined input parameter. Conversely, the damped (or fading) window model gives a different weight to each instance based on its age: recent samples are assigned a higher weight than older ones. Among possible decay functions, a popular solution is w(t) = 2^(−λt), where the weight assigned to past instances depends on the value of the decay factor λ. In the landmark window model the data stream is divided into non-overlapping chunks separated by landmarks (relevant objects defined in terms of elapsed time or number of collected instances); similarly to the sliding window model, in the landmark window model all objects between two landmarks have the same weight.
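The sliding and damped models above can be sketched in a few lines. In this illustrative snippet (ours; parameter names are assumptions), the sliding window is a bounded FIFO buffer, while the fading weight implements the decay function w(t) = 2^(−λt):

```python
from collections import deque

def fading_weight(age, decay=0.01):
    """Damped window model: weight of an object 'age' time units old,
    w(t) = 2^(-lambda * t), with 'decay' playing the role of lambda."""
    return 2.0 ** (-decay * age)

# Sliding window model: only the N most recent objects are kept,
# all with equal weight; the oldest object is evicted automatically.
N = 3
window = deque(maxlen=N)
for x in [1, 2, 3, 4, 5]:
    window.append(x)
# window now holds the 3 most recent objects: 3, 4, 5
```

With decay = 0.01, an object 100 time units old carries exactly half the weight of a brand-new one, which shows how λ controls how quickly the model forgets.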

2.4 Assessment of clustering quality

Various metrics have been proposed for the evaluation of clustering algorithms. In genuine clustering applications, where the availability of true class labels cannot be assumed, internal measures provide indications about the quality of the clustering, e.g. Sum of Squared Error (SSE), Davies-Bouldin index (Davies and Bouldin, 1979), Xie-Beni index (Xie and Beni, 1991) and Silhouette index (Rousseeuw, 1987). Such measures are often based on the criteria of compactness (i.e. the intra-cluster homogeneity) and separation (i.e. the inter-cluster heterogeneity); as a consequence, most of them are meaningful only for convex clusters, and thus are not suitable for evaluating partitions with arbitrarily shaped clusters (Liu et al., 2013).

On the other hand, external measures evaluate the consistency between clustering results and ground-truth groups, and are widely used as criteria to validate and compare different algorithms. Most of such measures are borrowed or adapted from information retrieval and classification tasks. Accuracy, F-score, adjusted Rand index, normalized mutual information and purity are frequently adopted. In the following, we just provide the definition of the adjusted Rand index (ARI) (Hubert and Arabie, 1985), a widely accepted external metric extensively used in the present work. ARI is a corrected-for-chance version of the Rand Index, which measures the agreement between clustering results and ground-truth labels:

ARI = [ Σ_ij C(n_ij, 2) − Σ_i C(a_i, 2) · Σ_j C(b_j, 2) / C(n, 2) ] / [ (1/2) (Σ_i C(a_i, 2) + Σ_j C(b_j, 2)) − Σ_i C(a_i, 2) · Σ_j C(b_j, 2) / C(n, 2) ]

where C(·, 2) denotes the binomial coefficient, and a_i, b_j and n_ij are the number of objects in class i, the number of objects in cluster j, and the number of objects in both class i and cluster j, respectively. ARI is bounded above by 1 in the event of complete agreement between the clustering result and the real structure, while a value of 0 indicates that the agreement can be entirely explained by chance. It may also assume negative values, i.e. lower than the index expected for random partitions.

Notably, the correction for chance implemented in ARI prevents the index from being biased towards a large number of clusters, as shown in (Vinh et al., 2010): this property is particularly relevant when dealing with density-based approaches, in which the number of clusters is not fixed in advance. A popular counterexample is represented by purity, which is highly sensitive to the number of clusters and attains its maximum value for the trivial solution of one cluster per data object.


2.5 General challenges of clustering in the static setting

As a key data mining task, clustering has been extensively covered in the literature. The goal of this section is just to highlight the typical requirements of clustering, also valid for the static case, and to discuss a notable density-based algorithm, namely DBSCAN, as crucial background for the rest of the thesis.

Typical general requirements of clustering in data mining are the following (Han et al., 2011):

• scalability;

• ability to deal with different types of attributes;

• discovery of cluster of arbitrary shape;

• minimal requirements for domain knowledge to determine input parameters;

• ability to deal with noisy data;

• ability to handle incremental updates and insensitivity to input order;

• capability of clustering high-dimensional data;

• ability to perform constraint-based clustering;

• interpretability and usability.

Many clustering algorithms have been proposed to address, at least to some extent, these challenges. Traditionally, a rough categorization of clustering algorithms considers four broad families: partitioning, hierarchical, density-based and model-based methods. Among them, density-based methods have gained popularity because of their ability to detect clusters with arbitrary shapes, to handle outliers, and to operate without a preliminary specification of the number of clusters. Indeed, they are based on the notion of density, defined as the number of objects in a region of the attribute space, and they identify clusters as high-density regions separated by low-density regions.

A notable example of a density-based clustering algorithm: DBSCAN

DBSCAN was originally proposed by Ester et al. in (Ester et al., 1996) and is certainly one of the most popular density-based clustering algorithms. The pseudocode of the algorithm is reported in Algorithms 1 and 2.


Algorithm 1: DBSCAN(D)
Given: D - dataset

1: C = 0
2: Clusters = ∅
3: for p ∈ D s.t. p is unvisited do
4:   mark p as visited
5:   p.neighbors = regionQuery(p, ε)
6:   if cardinality(p.neighbors) < MinPts then
7:     label p as outlier
8:   else
9:     C = nextCluster
10:    Clusters = Clusters ∪ expandCluster(p, p.neighbors, C)
11:  end if
12: end for
13: return Clusters

Algorithm 2: expandCluster(p, p.neighbors, C)
Given: p - object marked as visited, just recognized as core point
Given: p.neighbors - set of objects in the ε-neighborhood of p
Given: C - current cluster

1: add p to C as a core point
2: for s ∈ p.neighbors do
3:   if s is not visited then
4:     mark s as visited
5:     s.neighbors = regionQuery(s, ε)
6:     if cardinality(s.neighbors) ≥ MinPts then
7:       p.neighbors = p.neighbors ∪ s.neighbors
8:       add s to C as core
9:     end if
10:  end if
11:  if s is not member of any cluster then
12:    add s to C
13:  end if
14: end for
15: return C

DBSCAN identifies clusters by inspecting the neighborhood of each object p, on the basis of two parameters: the radius ε, which defines the neighborhood around p, and the minimum number MinPts of neighbors required within distance ε. Each object p with at least MinPts neighbors is marked as a core object. An object s that is not a core object, but belongs to the ε-neighborhood of some core object p, is denoted as a border object. Finally, an object that is neither a core nor a border object is identified as a noise object, or an outlier.

DBSCAN searches for clusters of density connected objects. Two objects p and s are density connected if there exists a core object t, such that both p and s are density reachable from t, that is, there exists a set of core objects leading from t to p and from t to s. Notably, DBSCAN does not require the prior knowledge of the number of clusters. However, the clustering result strongly depends on the parameters ε and MinPts.
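As a concrete companion to Algorithms 1 and 2, the procedure can be sketched in plain Python. The implementation below is ours and intentionally non-optimized (the naive regionQuery scans the whole dataset, giving quadratic cost); names follow the pseudocode where possible:

```python
import math

def region_query(D, p, eps):
    """Indices of the objects within distance eps of object p (p included)."""
    return [q for q in range(len(D)) if math.dist(D[p], D[q]) <= eps]

def dbscan(D, eps, min_pts):
    """Minimal DBSCAN: returns one label per object,
    a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(D)          # None marks unvisited objects
    c = -1                            # id of the current cluster
    for p in range(len(D)):
        if labels[p] is not None:
            continue                  # already visited
        neighbors = region_query(D, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = -1            # noise (may later become a border object)
            continue
        c += 1                        # p is a core object: start a new cluster
        labels[p] = c
        seeds = list(neighbors)       # expansion frontier (cf. expandCluster)
        while seeds:
            s = seeds.pop()
            if labels[s] == -1:
                labels[s] = c         # former noise reclassified as border
            if labels[s] is not None:
                continue              # already assigned to some cluster
            labels[s] = c
            s_neighbors = region_query(D, s, eps)
            if len(s_neighbors) >= min_pts:
                seeds.extend(s_neighbors)   # s is core too: grow the cluster
    return labels
```

On a toy dataset of two tight groups plus one isolated point, `dbscan(D, eps=0.5, min_pts=3)` assigns each group its own label and marks the isolated point as noise, illustrating both the arbitrary number of clusters and the outlier handling discussed above.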


Chapter 3

Related Work

Surveys on data stream clustering algorithms, challenges and applications have been presented in (Silva et al., 2013; Amini et al., 2014; Nguyen et al., 2015; Carnein and Trautmann, 2019; Moulton et al., 2019). In this chapter we just recall the most significant proposals in the context of density-based and fuzzy algorithms for clustering data streams, including recent ones. Part of the content of this chapter has been presented in (Bechini et al., 2020c).

3.1 Data stream clustering algorithms

As highlighted in Chapter 2, most clustering algorithms adopt either a two-stage approach or an online approach. A schematic comparison of the two approaches is depicted in Fig. 3.1.

In this section, an overview of the most relevant works, representative of the two approaches, is provided.

Two-stage approaches

The first proposal of a summarization technique to handle very large datasets (and also streaming ones) is due to Zhang et al. (Zhang et al., 1996), who introduced the Clustering Feature (CF) vector, i.e. a triplet that keeps summarized data information to support clustering operations, with no need to store all the objects. A CF vector includes the number, the linear sum and the square sum of the data objects in the cluster. Such an abstraction can be efficiently updated as new objects arrive, and conveniently used to calculate both cluster centroids and radii.

Authors in (Aggarwal et al., 2003) extended the original CF vector by including temporal information in the data summarization step. They proposed CluStream, a framework for clustering evolving data streams. CluStream includes the notions of microcluster, in a two-stage approach for handling evolving data streams, and of pyramidal time frame, to let users obtain a data partition over different time-horizons. Just like CF vectors, microclusters are data structures that summarize a set of objects from the stream: less memory is thus required, and the computational efficiency is improved. The macro-clustering process over the pyramidal time frame produces the global partition of data, allowing the user to capture the evolution of clusters over different time periods. The offline stage of CluStream is based on a variant of the k-means algorithm, and this determines two important drawbacks: the number of macroclusters must be specified in advance, and only spherical clusters can be captured.

[Figure 3.1 here: flow diagrams of (a) online clustering and (b) two-stage clustering; in both, each object read from the stream updates the stored object information/synopsis, but in (b) the partition is evaluated only upon user request or on a periodic condition.]

Figure 3.1: Schematic comparison between the two main computational approaches for clustering data streams.

DenStream (Cao et al., 2006) adapts the two-stage framework to the case of a density-based clustering algorithm. It is able to uncover arbitrary-shaped clusters without prior specification of the number of clusters. Furthermore, it handles potentially unbounded evolving data streams by adopting a damped window model. For the online summarization step, two types of microclusters are defined: potential-core microclusters (p-mc), and outlier microclusters (o-mc). Each new object o is absorbed in the closest p-mc, if the absorption does not increase the radius beyond a predefined threshold ε. Otherwise, o is absorbed in the closest o-mc or initializes a new o-mc. A p-mc differs from an o-mc because it exceeds a weight threshold, defined by two user-defined parameters β and µ. The arrival of new objects and the weight decay strategy can clearly lead to changes of status for existing microclusters. Thus, the weight of each p-mc is periodically checked and a pruning strategy is applied to free the outlier buffer. In the offline stage the notions of density-reachability and density-connectedness are adopted for microclusters, similarly to what happens in


the classical DBSCAN algorithm.
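The online absorption step and the damped window model can be sketched as follows (an illustrative simplification of ours, not the authors' reference implementation: names, constants, and the RMS radius are our choices):

```python
import numpy as np

LAMBDA = 0.01   # decay factor of the damped window model
EPSILON = 0.5   # maximum microcluster radius

class MicroCluster:
    """Weighted CF-style synopsis whose statistics fade over time."""
    def __init__(self, x, t):
        x = np.asarray(x, dtype=float)
        self.w, self.ls, self.ss, self.t = 1.0, x.copy(), float(np.dot(x, x)), t

    def fade(self, t):
        """Exponential decay 2^(-lambda * dt) of all weighted statistics."""
        f = 2.0 ** (-LAMBDA * (t - self.t))
        self.w, self.ls, self.ss, self.t = self.w * f, self.ls * f, self.ss * f, t

    def radius_if_added(self, x):
        """Radius the microcluster would have after absorbing x."""
        w, ls, ss = self.w + 1.0, self.ls + x, self.ss + float(np.dot(x, x))
        c = ls / w
        return np.sqrt(max(ss / w - np.dot(c, c), 0.0))

def try_absorb(mc, x, t):
    """Absorb x into mc only if the resulting radius stays within EPSILON."""
    mc.fade(t)
    x = np.asarray(x, dtype=float)
    if mc.radius_if_added(x) <= EPSILON:
        mc.w += 1.0
        mc.ls += x
        mc.ss += float(np.dot(x, x))
        return True
    return False  # the caller falls back to the closest o-mc or creates a new one
```

The fading makes stale microclusters lose weight, which is what eventually demotes a p-mc below the β·µ threshold and triggers pruning.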

A shared density graph has been recently proposed for better handling the low density regions between nearby microclusters (Hahsler and Bolaños, 2016). Such an approach, named DBStream, differently from previous works, does not assume a Gaussian distribution of objects around a microcluster center. The notion of shared density is introduced as the weight of objects that lie in the intersection between two microclusters, divided by the area of the intersection itself. The shared density graph is a weighted graph with microclusters as vertices, and edges representing the shared density between pairs of microclusters. Similarly to DenStream, exponential fading is adopted for dealing with evolving data streams. Furthermore, weak microclusters and weak entries in the shared density graph are removed to speed up the computation. During the reclustering stage, two microclusters are assigned to the same macrocluster only if their shared density exceeds a predefined threshold.

Thanks to the definition of a connectivity graph, DBStream can also detect clusters of different density.

Grids represent an alternative to microclusters as a structure for incrementally summarizing the data stream. In grid-based methods the object space is discretized along each dimension, thus defining a grid structure. The clustering operation is performed on the grid cells, which are populated in the online stage with the stream of objects. A representative case of grid-based solutions is D-Stream (Chen and Tu, 2007), where the importance of each grid cell reflects the local density of the incoming flow of objects and varies according to the fading model applied to the stream.

A later extension (Tu and Chen, 2009) deals with the issue of low density regions between adjacent grid cells, introducing also the concept of attraction.
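The grid summarization step can be sketched as follows (our own illustrative version, not the authors' code: cell size, decay constant, and the lazy-decay bookkeeping are assumptions):

```python
import math
from collections import defaultdict

CELL_SIZE = 1.0
DECAY = 0.998  # per-time-step fading factor applied to cell densities

class Grid:
    """Each object updates the density of the cell it falls into;
    densities decay over time, applied lazily on access."""

    def __init__(self):
        self.density = defaultdict(float)    # cell -> faded density
        self.last_update = defaultdict(int)  # cell -> last update time

    def add(self, x, t):
        # map the object to its grid cell by discretizing each dimension
        cell = tuple(math.floor(v / CELL_SIZE) for v in x)
        dt = t - self.last_update[cell]
        # lazily apply the decay accumulated since the cell's last update
        self.density[cell] = self.density[cell] * (DECAY ** dt) + 1.0
        self.last_update[cell] = t
        return cell
```

The lazy update means no global sweep is needed at each time step: a cell's density is brought up to date only when the cell is touched.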

MuDi-Stream (Amini et al., 2016) copes with the challenge of clustering multi-density data in the streaming settings, following the online-offline paradigm. In the online phase, it summarizes information on the data stream by leveraging a hybrid method based on both grids and microclusters; in the offline phase, it applies a variant of DBSCAN for macro-clustering. During the online phase, the evolving data stream is organized in the form of core-miniclusters. The proposed offline component does not require the definition of a constant distance threshold ε, but relies on the notion of local cluster density: intuitively, a core-minicluster is merged to an existing macro-cluster if it has a coherent value of mean distance from its core neighbors, which are identified thanks to the grid partitioning of the space. In MR-Stream (Wan et al., 2009), grid cells are recursively halved along each dimension, resulting in a tree that allows discovering clusters at multiple resolutions. Furthermore, the authors propose a technique to trigger the offline stage at the proper time, instead of periodically, thus improving the overall performance.


Online approaches

The first proposal for an online extension of DBSCAN for dynamic data warehousing environments (Ester and Wittmann, 1998) required a complete partition update whenever an object was moved to/from the dataset.

More recently, other fully online techniques have been proposed. The CODAS (Clustering Online Data-streams into Arbitrary Shapes) algorithm (Hyde and Angelov, 2015) exploits the microcluster abstraction to perform online clustering. At the occurrence of a new object, data synopses are adjusted, and the overall partition is evaluated by checking overlapping microclusters. Thus, the global cluster membership of the new object can be immediately assessed. Nevertheless, CODAS does not implement any forgetting mechanism and is therefore unsuitable for handling evolving data streams. The CEDAS (Clustering Evolving Data-streams into Arbitrary Shapes) algorithm (Hyde et al., 2017) overcomes such limitation by introducing a time-decaying energy property for microclusters, thus allowing discarding objects with low energy. A graph structure for representing adjacency of microclusters is exploited to efficiently update the overall partition when microclusters are deleted. Recently, BOCEDS (Buffer-based Online Clustering for Evolving Data Streams) (Islam et al., 2019) was proposed as an improvement over the CEDAS algorithm. In addition to adapting the radius of each microcluster, it employs a buffer for temporarily irrelevant microclusters and redefines the energy function to include spatial information.

3.2 Fuzzy clustering

Crisp, or hard, clustering methods force each object in a dataset to belong at most to one cluster. The introduction of fuzzy set theory by Lotfi Zadeh (Zadeh, 1965) opened the way for alternative methods capable of modelling the uncertainty of belonging, through the definition of proper membership functions (Yang, 1993).

Fuzzy sets extend the concept of classical sets, allowing an object to belong to a set with any degree between 0 and 1. More formally, a fuzzy set A is defined on a set of elements U (universe of discourse) as a function µA : U → [0, 1]. The value µA(x) is the membership degree of an object x to the fuzzy set A. The adoption of fuzzy set theory in cluster analysis leads to the following outcomes: (i) an object belongs to a cluster with a membership degree in [0, 1], and (ii) an object may belong to multiple clusters with a non-zero membership degree.
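A toy example of such a membership function µA (the fuzzy set "tall people" and its breakpoints are an illustrative assumption of ours):

```python
def mu_tall(height_cm):
    """Membership degree to an illustrative fuzzy set 'tall people':
    0 below 160 cm, 1 above 180 cm, linear in between."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 180:
        return 1.0
    return (height_cm - 160) / 20.0
```

A person 170 cm tall thus belongs to the set with degree 0.5, rather than being forced into a crisp tall/not-tall decision.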

The analysis of related works carried out in this section does not provide an exhaustive survey on fuzzy clustering. Instead, it concerns two specific issues: recent proposals of fuzzy extensions of the DBSCAN algorithm, and fuzzy clustering algorithms for data streams.


Fuzzy extensions of DBSCAN algorithm for static datasets

Even in static settings, most soft extensions of the DBSCAN clustering algorithm aim to mitigate its high sensitivity to the setting of the ε and MinPts parameters. With this objective, authors in (Ienco and Bordogna, 2018) have proposed three fuzzy extensions able to capture clusters with variable density or with overlapping borders, thus relaxing the strong dependence on the parameter setting. Notably, these extensions do not target data streams, but require the whole dataset to be available offline.

The three versions adopt different ways to make use of fuzziness: (i) the FuzzyCore version generates clusters with fuzzy cores by introducing a soft constraint on the minimum number MinPts of objects in the neighborhood. This solution allows determining clusters with variable core point density; (ii) the FuzzyBorder version generates clusters with fuzzy borders by introducing a soft constraint on the neighborhood radius ε; (iii) the FuzzyDBSCAN version relaxes both the constraints on ε and MinPts and produces clusters with both fuzzy cores and fuzzy borders.

The empirical comparison carried out by the authors on real-world benchmark datasets shows that the FuzzyBorder solution achieves superior performance compared to the other extensions, thanks to its capability of modelling clusters with fuzzy overlapping borders. For this reason, the pseudocode of the algorithm is reported in Algorithms 3 and 4, adapting the original notation to the one used in this thesis.

The parameters MinPts, εmax and εmin are set in the initialization step.

Algorithm 3: FuzzyBorder_DBSCAN(D)
Given: D - dataset
 1: C = 0
 2: Clusters = ∅
 3: for p ∈ D s.t. p is unvisited do
 4:   mark p as visited
 5:   pMin = regionQuery(p, εmin)
 6:   if cardinality(pMin) < MinPts then
 7:     label p as outlier
 8:   else
 9:     C = nextCluster
10:     Clusters = Clusters ∪ expandCluster(p, pMin, C)
11:   end if
12: end for
13: return Clusters

Algorithm 4: expandCluster(p, pMin, C)
Given: p - object marked as visited, just recognized as core point
Given: pMin - set of objects in the εmin-neighborhood of p
Given: C - actual cluster
 1: add p to C as a core point
 2: pMinMax = regionQuery(p, εmax) \ pMin
 3: fuzzyBorder = pMinMax
 4: for s ∈ pMin do
 5:   mark s as visited
 6:   sMin = regionQuery(s, εmin)
 7:   if cardinality(sMin) ≥ MinPts then
 8:     pMin = pMin ∪ sMin
 9:     sMinMax = regionQuery(s, εmax) \ sMin
10:     fuzzyBorder = fuzzyBorder ∪ sMinMax
11:     add s to C as core
12:   else
13:     fuzzyBorder = fuzzyBorder ∪ {s}
14:   end if
15: end for
16: fuzzyBorder = fuzzyBorder \ C
17: for x ∈ fuzzyBorder do
18:   add x to C as border object with membership µxc, where c is the nearest core of C
19: end for
20: return C

The FuzzyBorder version exploits the membership function defined in Figure 3.2: an object x can belong to the neighborhood of another object y with a membership degree µ ∈ (0, 1]. In particular, if the distance d(x, y) between the two objects is lower than εmin, then µ = 1; if εmin < d(x, y) < εmax, then µ ∈ (0, 1). The strategy of the FuzzyBorder version mimics the original DBSCAN algorithm for the identification of core objects: only objects within εmin (µ = 1) are considered for the computation of the number of neighbours and the determination of the core objects. Differently from DBSCAN, the membership degree of a border object to a specific cluster can be lower than 1, depending on its distance from the farthest core object of the cluster. Actually, in Chapter 4 we will show that such a choice does not properly capture the intuition of fuzzy borders, especially when the cluster shapes are not convex: indeed, we will opt for adopting a closest-core policy for the assessment of the membership degree of border objects. Notably, one object can belong to the border of multiple clusters, each with a different membership degree. Indeed, the algorithm partitions data into clusters with fuzzy overlapping borders.

Fuzzy clustering algorithms for data streams

The recently proposed d-FuzzStream algorithm (Schick et al., 2018) extends the traditional offline-online framework by incorporating concepts from fuzzy set theory.


[Figure 3.2 here: plot of the membership degree (from 0 to 1) as a function of the distance between two objects, with breakpoints at εmin and εmax.]

Figure 3.2: The function that defines the membership degree of an object to the neighborhood of another object. Figure from (Aliperti et al., 2019).

During the online step, examples are summarized in fuzzy structures, i.e. fuzzy microclusters (f-mcs), which differ from crisp microclusters as each object may belong to multiple microclusters with a different membership degree. Compared to the former FuzzStream proposal (de Abreu Lopes and de Arruda Camargo, 2017), d-FuzzStream introduces the notions of fuzzy dispersion and similarity of microclusters (adaptations of microcluster radius and mutual distance, respectively) in order to reduce the summarization complexity and to promote absorption of examples and merging of f-mcs. d-FuzzStream poses a limit on the number of f-mcs kept in memory: whenever such limit is exceeded, the least recently updated f-mcs are deleted, thus ensuring both memory compliance and adaptation to evolving settings. The offline step organizes f-mcs in fuzzy macroclusters through a weighted fuzzy c-means (WFCM) procedure (Hathaway and Hu, 2008). Each f-mc is represented as a virtual point in the original space, located in the f-mc center and weighted by the sum of the membership degrees of the relative examples. WFCM requires specifying the expected number of macroclusters; the f-mcs with the highest weights are used to initialize the cluster prototypes. For the purpose of clustering evaluation, each original object is mapped onto its closest microcluster. Similarly, an alternative formulation of fuzzy microclusters has been proposed in (Laohakiat et al., 2018), maintaining the online-offline approach.
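The weighted fuzzy c-means step can be sketched as follows (a minimal version of ours, in the spirit of Hathaway and Hu (2008): initialization, stopping criterion, and all names are simplifying assumptions, not the d-FuzzStream implementation):

```python
import numpy as np

def wfcm(X, w, c, m=2.0, n_iter=50):
    """Weighted fuzzy c-means on summary points X with weights w.
    Each weight w_i can be, e.g., the sum of the membership degrees of
    the examples absorbed by the corresponding f-mc."""
    # deterministic init: c points spread across the dataset
    centers = X[np.linspace(0, len(X) - 1, c).astype(int)].astype(float)
    u = None
    for _ in range(n_iter):
        # distance of every point to every center (avoid exact zeros)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # FCM membership update: u_ik proportional to d_ik^(-2/(m-1))
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)
        # weighted prototype update: weights act as point multiplicities
        um = w[:, None] * u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
    return centers, u
```

Note that doubling the weight of a virtual point has the same effect on the prototypes as duplicating the point, which is exactly why a weighted variant can cluster f-mc centers in place of the original examples.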


Chapter 4

Towards fuzzy solutions for clustering data streams

This chapter presents two fully-online preliminary extensions of the DBSCAN clustering algorithm for streaming data. First, we describe S-DBSCAN (Streaming DBSCAN), a simple incremental adaptation of the crisp DBSCAN algorithm, which reacts to the continuous arrival of objects from a stream. Then, we leverage the introduction of a fuzzy membership function to propose a first streaming fuzzy adaptation of DBSCAN, i.e. SF-DBSCAN (Streaming Fuzzy DBSCAN). Experiments on synthetic datasets highlight the benefits of the introduction of fuzziness. Part of this work has been presented in (Aliperti et al., 2019). The chapter ends with a discussion of the limits of these approaches, which are the starting points for the development of a novel algorithm.

4.1 S-DBSCAN: A preliminary incremental adaptation of DBSCAN

Motivation and objectives

As a density-based clustering algorithm, DBSCAN is particularly suited for detecting clusters of arbitrary shape, is able to effectively handle noise and outliers, and does not require the number of clusters to be specified in advance. However, the original implementation assumes the offline availability of the whole dataset and therefore is not capable of handling streaming data.

S-DBSCAN and SF-DBSCAN pursue the goal of enriching DBSCAN with the ability to manage streaming data as input. For both algorithms, we update the clusters online at each collected object. Furthermore, we assume that the stream is bounded and can fit in memory. Although unrealistic for genuine streaming applications, this assumption allows us to compare crisp and fuzzy approaches under a simplified setting.

S-DBSCAN procedure

The pseudocode of S-DBSCAN is shown in Algorithms 5 and 6. The procedure requires to specify the values for the parameters MinPts and ε, and it updates the identified clusters at any occurrence of a new object. The algorithm makes use of the data structure pList to keep all the received objects, and any new object p is added as soon as it is processed. The newly collected object may turn out to be an outlier, a border object or a core object; notably, all the objects in its ε-neighborhood are examined (Algorithm 5, line 2), since its arrival may alter their condition. Our implementation pursues the capability of adapting to evolving contexts: indeed, if a border object lies in the neighborhood of core objects of different clusters, it is assigned to the most recent one (Algorithm 6, line 8).

Algorithm 5: S-DBSCAN(p)
Given: p - new object
 1: label p as outlier
 2: S = {p} ∪ regionQuery(ε, p, pList)
 3: add p to pList
 4: for s ∈ S do
 5:   s.neighbors = regionQuery(ε, s, pList)
 6:   if cardinality(s.neighbors) ≥ MinPts then
 7:     if wasCoreBefore(s) then
 8:       let Cs be the cluster of object s
 9:       Cs = Cs ∪ {p}
10:     else
11:       if all objects in s.neighbors are outliers then
12:         C = nextCluster
13:         add all objects in s.neighbors to C
14:       else
15:         merge(s, s.neighbors)
16:       end if
17:     end if
18:   end if
19: end for

Algorithm 6: merge(s, s.neighbors)
Given: s - object
Given: s.neighbors - set of objects in the ε-neighborhood of s
 1: C = nextCluster
 2: add s to C
 3: for p ∈ s.neighbors do
 4:   if isCore(p) then
 5:     let Cp be the cluster of object p
 6:     add all objects ∈ Cp to C
 7:   else
 8:     add p to C as border object
 9:   end if
10: end for

4.2 SF-DBSCAN: Streaming Fuzzy DBSCAN

SF-DBSCAN represents an adaptation of the FuzzyBorder variant of the DBSCAN algorithm for streaming data, introduced in (Ienco and Bordogna, 2018) and described in Section 3.2. The constraint on the neighborhood size (ε) is relaxed by defining two distance thresholds (εmin and εmax), thus decoupling the identification of core objects and the membership assessment of border objects. For this purpose, we adopt the same membership function proposed in the original FuzzyBorder. Formally, given d(x, y) the distance between two objects x and y, the fuzzy membership of y to the neighborhood of x, namely µx(y), is defined as follows:

µx(y) =
    1                                   if d(x, y) < εmin
    (εmax − d(x, y)) / (εmax − εmin)    if εmin ≤ d(x, y) ≤ εmax
    0                                   if d(x, y) > εmax
(4.1)

A visual comparison of crisp and fuzzy membership functions is provided in Fig. 4.1. According to the terminology usually adopted for microclusters (Hyde et al., 2017; Islam et al., 2019), given an object p, we define the region at a distance from p lower than a threshold εmin as kernel region, and the region at a distance from p lower than a threshold εmax but higher than εmin as shell region. When used as attributes of an object p, i.e. p.kernel and p.shell, they indicate the sets of objects in the kernel and shell regions, respectively.
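Equation (4.1) translates directly into code; the following function (names are ours) returns the membership degree given the distance between the two objects:

```python
def fuzzy_membership(dist, eps_min, eps_max):
    """Membership degree of Eq. (4.1): 1 inside the kernel region,
    linearly decreasing in the shell region, 0 beyond eps_max."""
    if dist < eps_min:
        return 1.0
    if dist > eps_max:
        return 0.0
    return (eps_max - dist) / (eps_max - eps_min)
```

For instance, with εmin = 1 and εmax = 2, an object at distance 1.5 gets membership 0.5 to the fuzzy neighborhood.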

As in S-DBSCAN, clusters are updated on the basis of every newly analyzed data object. The pseudocode of SF-DBSCAN is presented in Algorithms 7, 8, 9 and 10.

[Figure 4.1 here: two panels comparing (a) a crisp neighborhood and (b) a fuzzy neighborhood, each with the corresponding membership function.]

Figure 4.1: Comparison between crisp (a) and fuzzy (b) neighborhoods and membership functions. In the reported bidimensional case, the plot of the function can be considered along any straight line through the coordinates of the object, such as the dotted red one.

In addition to pList, SF-DBSCAN requires to store, for each fuzzy border object and for each cluster in the εmax-neighborhood, both the membership value and the closest core belonging to the cluster. We store this information by using an efficient data structure. For the sake of simplicity, in the pseudocode we adopt a matrix, namely CMember, which is updated during the execution. Given a cluster C and an object x, CMember[C, x] represents the membership degree of x to cluster C. Every time a new object is analyzed, a new column is added to the matrix. Every time a new cluster label is generated (nextCluster statement in Algorithms 7 and 8), a new row is initialized to zero.
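One simple realization of such an efficient data structure (our own sketch, not necessarily the one used in the actual implementation) is a sparse dict-of-dicts, which avoids materializing the mostly-zero rows and columns of a growing matrix:

```python
from collections import defaultdict

class CMember:
    """Sparse matrix of membership degrees indexed by (cluster, object).
    Absent entries are implicitly zero."""

    def __init__(self):
        self._rows = defaultdict(dict)  # cluster id -> {object id: membership}

    def get(self, cluster, obj):
        return self._rows[cluster].get(obj, 0.0)

    def update(self, cluster, obj, mu):
        # keep the maximum membership, as in fuzzyBorderUpdate
        if mu > self.get(cluster, obj):
            self._rows[cluster][obj] = mu

    def clear_cluster(self, cluster):
        # deprecate a merged cluster: equivalent to zeroing its row
        self._rows.pop(cluster, None)
```

With this layout, adding a column (a new object) or a row (a new cluster) costs nothing until a non-zero membership is actually written.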

The parameters εmin, εmax and MinPts are set in the initialization step, while pList and CMember are initialized as an empty list and matrix, respectively, and are dynamically updated at each new object: for the sake of readability, we avoid mentioning them as input in Algorithms 7, 8, 9 and 10. The well-known issue of parameter tuning in ordinary DBSCAN has been addressed with heuristic approaches (Ester et al., 1996): in realistic streaming settings, the data required by such heuristics could be gathered during a warm-up phase. Notably, the introduction of εmax does not affect the identification of core objects, while mitigating the parameter sensitivity of classic DBSCAN towards the identification of cluster borders.

When a new object p of the stream is analyzed, the object itself and those in its εmin-neighborhood are investigated by SF-DBSCAN (Algorithm 7). If the object x under investigation satisfies the condition to be a core object and had not already been labelled as such, two options are viable, based on the labels of objects in the εmin-neighborhood of x: either x starts a new cluster (if the condition at line 9 is verified) or it merges with an existing cluster (line 20, merge procedure). Otherwise (lines 28 to 33), if x is not a core and it lies in the fuzzy neighborhood of a core y,


Algorithm 7: SF-DBSCAN(p)
Given: p - new object
 1: label p as outlier
 2: p.kernel, p.shell = fuzzyQuery(εmin, εmax, p, pList)
 3: add p to pList
 4: S = {p} ∪ p.kernel
 5: CheckingList = ∅
 6: for x ∈ S do
 7:   x.kernel, x.shell = fuzzyQuery(εmin, εmax, x, pList)
 8:   if cardinality(x.kernel) ≥ MinPts and ¬wasCore(x) then
 9:     if all objects in x.kernel are outliers then
10:       C = nextCluster
11:       insert row CMember[C, :] filled with zeros
12:       add x to C as core point
13:       for k ∈ x.kernel do
14:         add k to C with CMember[C, k] = 1
15:       end for
16:       for s ∈ x.shell s.t. ¬isCore(s) do
17:         fuzzyBorderUpdate(s, x)
18:       end for
19:     else
20:       merge(x, x.kernel, x.shell)
21:     end if
22:   else
23:     if wasCoreBefore(x) then
24:       CheckingList = CheckingList ∪ {x}
25:     end if
26:   end if
27: end for
28: if ¬isCore(p) then
29:   CheckingList = CheckingList ∪ p.shell
30:   for y ∈ CheckingList s.t. isCore(y) do
31:     fuzzyBorderUpdate(p, y)
32:   end for
33: end if

i.e. at a distance lower than εmax from y, the fuzzyBorderUpdate procedure is called (Algorithm 9): border objects are treated as in the FuzzyBorder variant of DBSCAN (see Algorithms 3 and 4) and the CMember matrix is updated accordingly.

The merge procedure is described in Algorithm 8: whenever a new object x satisfies the core condition and is close to objects that belong to existing clusters, x generates a new cluster C. Non-core objects at a distance lower than εmax are updated according to fuzzyBorderUpdate. If existing core objects lie in the εmin-neighborhood of the core object x, the updateCluster procedure is called on each core y


Algorithm 8: merge(x, x.kernel, x.shell)
Given: x - object
Given: x.kernel - objects in the εmin-neighborhood of x
Given: x.shell - objects in the εmax-neighborhood of x, but not in its εmin-neighborhood
 1: C = nextCluster
 2: insert row CMember[C, :] filled with zeros
 3: add x to C as core object
 4: for k ∈ x.kernel do
 5:   if isCore(k) then
 6:     updateCluster(k, C)
 7:   else
 8:     fuzzyBorderUpdate(k, x)
 9:   end if
10: end for
11: for s ∈ x.shell do
12:   if ¬isCore(s) then
13:     fuzzyBorderUpdate(s, x)
14:   end if
15: end for

Algorithm 9: fuzzyBorderUpdate(b, c)
Given: b - border object
Given: c - core object
 1: µbc = fuzzyMembership(εmin, εmax)
 2: C = cluster of object c
 3: CMember[C, b] = max(CMember[C, b], µbc)

Algorithm 10: updateCluster(y, C)
Given: y - object to merge
Given: C - cluster
 1: oldC = argmax(CMember[:, y])
 2: for x ∈ pList \ {y} do
 3:   if isCore(x) and argmax(CMember[:, x]) == oldC then
 4:     add x to C as core object
 5:   else
 6:     if ¬isCore(x) and CMember[oldC, x] ≠ 0 then
 7:       CMember[C, x] = max(CMember[C, x], CMember[oldC, x])
 8:       CMember[oldC, x] = 0
 9:     end if
10:   end if
11: end for
12: add y to C


(Algorithm 10): the core objects in pList that belong to the same cluster of y (line 3) and the border objects that have a non-zero membership to the cluster of y (line 6) are investigated: clusters of core objects are updated and the membership degree of border objects to the new cluster is evaluated. The CMember row related to the deprecated cluster is set to zero (line 8), thus not affecting the subsequent analysis.

4.3 Comparison of S-DBSCAN and SF-DBSCAN

A preliminary experimental study has been carried out on three synthetic datasets showing concept drift with the aim of comparing the performance of SF-DBSCAN and S-DBSCAN. In the following, we first describe the datasets, and then we discuss the results.

Datasets

For our tests, we referred to the recently proposed synthetic benchmark GaussianMotionData (GMD) collection (Márquez et al., 2018). It consists of nine datasets with concept drift: in each dataset, distinct clusters are generated from distinct Gaussian distributions that change their mean and/or covariance over time. As we aimed to visually show the behaviour of SF-DBSCAN, we selected from GMD the two datasets with two-dimensional data and with a limited number of objects, namely GMD-1C2D1kLinear and GMD-4C2D800Linear. The former contains 1000 objects with one single cluster, which shows a progressive drift in the bidimensional feature space. The latter has 800 objects and four clusters, which change their covariance and move at a steady pace towards a central point. We also generated yet another dataset, 3C2D7500Merge, by properly setting the data generation parameters so as to show how clusters can be merged during the streaming data analysis. 3C2D7500Merge consists of three clusters with 7500 objects: the centers of two clusters get closer over time until they merge with each other.

Results

Table 4.1 shows the parameter values adopted for SF-DBSCAN on each test dataset. Their choice is the result of a grid search over a reasonable set of values for εmin, εmax and MinPts, using ARI as a reference metric for the analysis. This sort of optimization of the parameter configuration is clearly unfeasible in genuine clustering applications. However, the goal of the present analysis is just to explore the potential of the two algorithms; we will broadly discuss the issue of parameter tuning in the next chapter.
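Such a grid search can be sketched as follows (assuming scikit-learn for the ARI computation; run_clustering stands in for an execution of the algorithm under test, and the candidate values are placeholders, not the grids actually used):

```python
import itertools
import numpy as np
from sklearn.metrics import adjusted_rand_score

def grid_search(X, y_true, run_clustering,
                eps_min_grid=(0.5, 1.0, 1.5),
                eps_max_grid=(2.0, 5.0, 7.0),
                min_pts_grid=(5, 8, 10)):
    """Evaluate every (eps_min, eps_max, MinPts) configuration against the
    ground-truth labels via ARI; return the best score and configuration."""
    best = (-1.0, None)
    for eps_min, eps_max, min_pts in itertools.product(
            eps_min_grid, eps_max_grid, min_pts_grid):
        if eps_min >= eps_max:
            continue  # the shell region must be non-empty
        y_pred = run_clustering(X, eps_min, eps_max, min_pts)
        ari = adjusted_rand_score(y_true, y_pred)
        if ari > best[0]:
            best = (ari, (eps_min, eps_max, min_pts))
    return best
```

Since ARI needs ground-truth labels, exactly this dependence is what makes the procedure unfeasible outside benchmark settings.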

Table 4.1: Parameter configuration for SF-DBSCAN

Dataset              εmin   εmax   MinPts
GMD-1C2D1kLinear     1.00   2.00   10
GMD-4C2D800Linear    1.45   5.00   6
3C2D7500Merge        1.70   7.00   8

A preliminary insight into the modeling capability of SF-DBSCAN is shown in Fig. 4.2. It reports the results obtained on the 25%, 50%, 75% and 100% of the GMD-1C2D1kLinear dataset. The algorithm tracks the evolution of data along the streaming analysis. Obviously, the dataset with a single cluster can be perfectly modeled with a high enough value of εmin: in this case we show how, even with a relatively small radius, some objects are initially considered as outliers (green triangles) and later incorporated into the unique cluster.

Figure 4.2: Results of SF-DBSCAN after analyzing 25%, 50%, 75% and 100% of the dataset GMD-1C2D1kLinear. Figure from (Aliperti et al., 2019).

Datasets GMD-4C2D800Linear and 3C2D7500Merge are used for measuring the quality of the partitions obtained with SF-DBSCAN and S-DBSCAN in terms of Adjusted Rand Index (ARI). As for SF-DBSCAN, the ARI is evaluated after a defuzzification process: the cluster label of a border object is assigned on the basis of its highest membership degree.
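This defuzzification step amounts to a column-wise argmax over the membership matrix; a minimal sketch (the array layout, clusters on rows and objects on columns as in CMember, is our assumption):

```python
import numpy as np

def defuzzify(memberships):
    """memberships: array of shape (n_clusters, n_objects) with degrees in
    [0, 1]; returns the crisp cluster label of each object."""
    return np.argmax(memberships, axis=0)
```

For example, an object with memberships 0.3 and 0.7 to clusters 0 and 1 is assigned to cluster 1 before the ARI is computed.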

For a fair comparison with its fuzzy counterpart, S-DBSCAN is executed with three different parameter configurations: we set MinPts coherently with SF-DBSCAN, while ε is set to εmin, εmax, and their average value, respectively. Results for datasets GMD-4C2D800Linear and 3C2D7500Merge are reported in Tables 4.2 and 4.3, respectively. Dataset GMD-1C2D1kLinear is excluded from the numerical evaluation since it contains a unique cluster.

Results suggest that SF-DBSCAN outperforms, on average, its crisp baseline: the high sensitivity of S-DBSCAN to the parameter setting makes it less suitable to handle evolving data streams, especially when clusters are not well spaced. A small value of the neighborhood radius (εmin) leads to structures that are significantly less accurate than those detected by SF-DBSCAN over the entire streaming process. Conversely, higher values of ε (εavg and εmax) make S-DBSCAN group close clusters together, thus dramatically affecting the clustering performance.


Table 4.2: Adjusted Rand Index for dataset GMD-4C2D800Linear

                          25%     50%     75%     100%    Avg
SF-DBSCAN - εmin, εmax    0.940   0.993   0.993   0.982   0.909
S-DBSCAN - εmin           0.698   0.865   0.911   0.924   0.798
S-DBSCAN - εmax           0.987   0.713   0.000   0.000   0.611
S-DBSCAN - εavg           0.975   0.993   0.993   0.000   0.763

Table 4.3: Adjusted Rand Index for dataset 3C2D7500Merge

                          25%     50%     75%     100%    Avg
SF-DBSCAN - εmin, εmax    0.998   0.984   0.565   0.569   0.752
S-DBSCAN - εmin           0.885   0.919   0.537   0.548   0.655
S-DBSCAN - εmax           0.544   0.000   0.000   0.000   0.200
S-DBSCAN - εavg           0.543   0.000   0.000   0.000   0.294

A further evaluation of the two algorithms is performed through a visual analysis of intermediate and final partitions.

[Figure 4.3 here: eight panels showing dataset GMD-4C2D800Linear partitioned by S-DBSCAN (a-d) and SF-DBSCAN (e-h) after 25%, 50%, 75% and 100% of the stream.]

Figure 4.3: Dataset GMD-4C2D800Linear partitioned with S-DBSCAN (first row, a-d) and SF-DBSCAN (second row, e-h). Figure from (Aliperti et al., 2019).

In Figures 4.3 and 4.4 we compare the clustering quality of S-DBSCAN and SF-DBSCAN, showing the structures obtained after analyzing the 25%, 50%, 75% and 100% of the objects in GMD-4C2D800Linear and 3C2D7500Merge, respectively. As
