
Dipartimento di Ingegneria dell'Informazione
Laurea Magistrale in Computer Engineering

Development and Analysis of

Distributed Fuzzy Decision Tree

on Spark

Supervisors (Relatori): Ing. Alessio Bechini, Prof. Francesco Marcelloni
Candidate (Il Candidato): Lorenzo Baglini

Dipartimento di Ingegneria dell'Informazione
Academic year: 2021


Contents

1 Introduction
  1.1 Decision Tree
  1.2 Fuzzy Decision Tree
  1.3 Contribution of this thesis
  1.4 Framework selection

2 Fuzzy Decision Tree
  2.1 FDT Background
  2.2 FDT Discretization
  2.3 FDT Learning process

3 Issues on FDT
  3.1 Membership function shape
    3.1.1 Triangular set
    3.1.2 Pseudogaussian set
    3.1.3 Trapezoidal set
  3.2 Binary FDT vs multiway FDT vs crisp DT
    3.2.1 FDT: binary vs multi-way split
    3.2.2 Fuzzy tree vs crisp tree
  3.3 Changing the minimum threshold of number of instances
  3.4 Limiting the number of fuzzy sets
  3.5 Proposing a new classification method

4 Experimental setting
  4.1 Hardware/software configuration
  4.2 Dataset description
  4.3 Extension of the implementation

5 Results
  5.1 Small datasets
    5.1.1 Results on discretization
    5.1.2 Results on varying minimum number of instances
    5.1.3 Results on fuzzy vs crisp decision tree
  5.2 Big datasets
    5.2.1 Results on discretization
    5.2.2 Results on limiting the number of fuzzy sets
    5.2.3 Results on fuzzy vs crisp
    5.2.4 Results on new classification method

6 Conclusions

7 Future Works

Acknowledgments


Chapter 1

Introduction

Machine Learning represents one of the fundamental areas of Artificial Intelligence and deals with the creation of systems that use observations or examples as data for the synthesis of new knowledge (classifications, generalizations, reformulations). Machine Learning is used for multiple purposes where the machine is required to make decisions about the data. In marketing, for example, it is used to classify potential or current targets (i.e. customers); in medicine it is used to diagnose diseases or pathologies; in e-mail systems it is used for the detection of spam messages; in security it is used for face recognition in anti-terrorism systems. To solve classification problems we have a series of classifiers: Support Vector Machines, Naive Bayes, Logistic Regression, Neural Networks, and so on. Depending on the problem to be solved, it is better to use one classifier rather than another. Among these classifiers we also have Decision Trees, which reach a very good performance in terms of correct classification rate and are known to be the most interpretable classifiers, even for people who are not experts in this field. Comprehensibility is easy when the tree makes decisions on very precise concepts, for example the type of a colour or a person's gender. But when a numerical decision is needed, for example on the speed of a vehicle, the tree lacks comprehensibility. For this reason, we need to assign some symbolic meaning to these decisions. A well-known symbolic framework for knowledge comprehensibility is fuzzy logic. Fuzzy logic can be seen as an extension of classical bivalent logic, in which a concept can only be true or false.

Figure 1.1: Example of fuzzy sets (slow, medium, high) defined over the speed of a vehicle (km/h), with the membership degree on the vertical axis.

In classical logic we assign a truth value of either 0 (false) or 1 (true), while in fuzzy logic we can assign truth values from 0 to 1. For this reason, the latter can deal with vague and imprecise information. With reference to the above example of the speed of a vehicle, we can state that the speed can be slow, medium or fast. These three terms are called linguistic terms. In this case, we mapped the continuous interval of speed into three sets, called fuzzy sets. Fuzzy sets are sets in which membership is no longer simply 0 or 1, but is described by a function called membership function, which represents the membership degree of a certain value for a given set and takes values between 0 and 1. In Figure 1.1 we can see an example. Using fuzzy logic together with decision trees we obtain fuzzy decision trees, from which we can extract a set of fuzzy rules which describe the process of decision making.

The procedure of mapping a numerical continuous space into a set of finite values is called discretization. This process deeply influences the overall accuracy of the classifier, and for this reason an investigation on this aspect is required. Along with this aspect, we also investigate the parameters that influence the complexity of the trees, expressed by the number of nodes, in order to understand how to preserve the fundamental strength of decision trees, namely their interpretability.

1.1 Decision Tree

Decision trees (DTs) are widely used classifiers, employed in many different application domains such as security assessment, health systems and road traffic congestion. The popularity of decision trees is mainly due to the simplicity of the learning schema and to their interpretability [9], as they can explain how an output is inferred from the inputs.

Decision tree induction assumes that one variable is the target (for prediction, this variable concerns a future value, which is unknown at the time of the decision) and the other variables are input variables. For known values of the inputs, we are able to estimate the value of the target variable. When using decision trees, we can distinguish between two phases. In the first phase, the so-called training phase, the decision tree is learned from historical data. Learning can be interactive or automatic. The result of the first phase is a decision tree. In the second phase, the decision tree is used. It can be used as knowledge (a manager makes a decision with the knowledge of the decision tree) or as a predictive or classification model (we estimate the future value of the target variable, or a class, for known values of the input variables).

Figure 1.2: Example of a decision tree for deciding whether the weather allows a tennis match to be played.

Decision trees can handle both categorical and continuous variables. A categorical variable is a variable which can take only one value from a set of fixed values, while a continuous variable can take values from a continuous scale. For example, the weight, height, and age of respondents in a survey would represent continuous variables. However, a person's gender, occupation, or marital status are categorical: a person is either male or female, never married, married, or divorced, etc.

In a decision tree, each internal node denotes a test on an attribute, each branch represents the outcome of the test, and each leaf node holds a class label. The topmost node is called the root node.

1.2 Fuzzy Decision Tree

In this thesis, we dealt with fuzzy decision trees (FDTs), which integrate decision trees with fuzzy set theory. Like classical decision trees, FDTs can be categorized into two main groups, binary split trees and multi-way split trees, depending on the splitting method. Binary split trees are characterized by recursively partitioning the attribute space into two subspaces, while multi-way split trees partition the space into a number of subspaces, which can be more than two.

Fuzzy decision trees differ from traditional crisp decision trees in three aspects:

1. The fuzzy sets representing the data have to be defined.

2. Their inference procedures are different.

3. They use splitting criteria based on fuzzy restrictions.

The first point is discussed in Section 2.2 and is one of the issues faced in this thesis, the second point is discussed in Section 2.1, while the third point is discussed in more detail in Section 3.2.1.

1.3 Contribution of this thesis

FDT learning schemes require that a fuzzy partition has already been defined upon each continuous attribute. So continuous attributes should be discretized, and this operation drastically affects the accuracy of the classifier [8]. Many methods have been discussed, which are mainly summarized in [14]. As we can see from this last paper, the analysis investigates how different discretization approaches can influence the accuracy and the complexity (in terms of number of nodes) of the generated FDT. In addition, to the best of our knowledge, only a few classifiers proposed for managing big data employ fuzzy sets [11]. Since the use of fuzzy reasoning can help to mitigate the effect of a Boolean decision, exploring this field is very important. For this reason, we think that a deeper investigation of the impact of discretization in a fuzzy decision tree classifier is needed.

In this thesis, we analyze the impact of the discretization method based on fuzzy entropy, using different fuzzy set shapes such as triangular, trapezoidal and pseudogaussian, on small and large datasets. We also experiment with varying the minimum number of instances on a node, in order to see whether we can obtain a good accuracy with simpler trees, so as to preserve the main feature of a decision tree, which is interpretability. Linked to this last concept, there is another experiment on limiting the number of fuzzy sets defined over the partitions in output of the discretizer, in a big dataset context. We also compare the performance of the fuzzy binary tree with the fuzzy multi-way tree, making also a comparative analysis with the crisp decision tree. Finally, we investigate a newly proposed method to assign class labels to unlabeled instances.

1.4 Framework selection

To deal with Big Data (a term which identifies large and complex datasets), the most famous programming model is MapReduce [6], proposed by Google to simplify the distribution of computation tasks over clusters.

Figure 1.3: MapReduce model example.

In this model, the input data is split into chunks, which are processed by map tasks in a parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. After this step, we have the output of the MapReduce job.

This model is used in some very well-known computing frameworks, such as Apache Hadoop [1] and Apache Spark [2][5]. However, Hadoop is not appropriate for the tree learning process, and in general for applications that need iterative computations, because it is inadequate for in-memory computations. For this reason, Apache Spark is more suitable for these types of tasks. Spark is a fast in-memory computational engine, which minimizes read/write cycles to disk, speeding up the whole process. Its architectural foundation is the RDD (Resilient Distributed Dataset), a fault-tolerant collection of elements distributed over the cluster of machines. Operations called tasks are applied to RDDs, which can be cached in memory to speed up computations. Spark uses a master/worker architecture, where a driver process sends tasks to the workers, called executors. The driver then waits for the executors to run these tasks, and finally collects the results.
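To make the RDD-based execution model more concrete, the following minimal Scala sketch (not taken from the thesis code; the file path and values are illustrative assumptions) loads a dataset into an RDD, caches it in memory, and runs two actions on it, which is exactly the access pattern that benefits iterative algorithms such as tree learning.

```scala
import org.apache.spark.sql.SparkSession

object RddCachingExample {
  def main(args: Array[String]): Unit = {
    // The driver builds the execution plan and sends tasks to the executors.
    val spark = SparkSession.builder().appName("rdd-caching-example").getOrCreate()
    val sc = spark.sparkContext

    // An RDD is a fault-tolerant, partitioned collection distributed over the cluster.
    val samples = sc.textFile("hdfs:///data/train.csv")   // hypothetical path
      .map(_.split(",").map(_.toDouble))
      .cache()                                            // keep partitions in memory for reuse

    // Iterative algorithms re-read the same data many times:
    // caching avoids re-reading it from disk at every iteration.
    val n = samples.count()
    val sumFirstFeature = samples.map(_.head).sum()
    println(s"instances = $n, mean of first feature = ${sumFirstFeature / n}")

    spark.stop()
  }
}
```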


This framework provides APIs built on top of this processing engine, such as Spark SQL, Spark Streaming, and MLlib. The latter is the most popular machine learning library for Spark, and implements many algorithms for classification, clustering and regression problems, as well as other data mining tools.


Chapter 2

Fuzzy Decision Tree

The Distributed Fuzzy Decision Tree developed in this work consists of two processes. The first is the discretization process, which generates fuzzy partitions on the continuous attributes, while the second is the learning process, which generates the decision tree. In this chapter, we introduce some background on Fuzzy Decision Trees, and then we explain the fuzzy discretization and the fuzzy learning process. With this chapter, the reader will have the basics of the model behind the Fuzzy Decision Tree and will better understand the issues explained in the following chapter.

2.1 FDT Background

The main purpose of classification is to assign a class label $C_m$ from a predefined set $C = \{C_1, \dots, C_M\}$ (where $M$ is the number of class labels) to an unlabeled instance. Each instance is described by numerical or categorical features, for example a set $X = \{X_1, \dots, X_F\}$ of $F$ features. If the feature is numerical, then $X_f$ is defined on a universe $U_f \subset \mathbb{R}$, while if it is categorical, the feature is defined over a predefined set $L_f = \{L_{f,1}, \dots, L_{f,T_f}\}$ of values.

In a CDT (crisp decision tree), every leaf node holds a class label. In an FDT, a leaf node can hold one or more class labels, and each of them has an associated weight $w_m$, which denotes the strength of the label in that leaf. So, when an unlabeled instance $x$ is presented to the tree, it activates multiple paths, from the root down to the leaves, while in a CDT only one path is activated. In fact, nodes in an FDT are fuzzy sets, and instances reach the leaves with a certain matching degree, which is related to the membership degrees of the fuzzy sets activated along the path. The formula to compute the matching degree $md_{CN}(x)$ is:

$md_{CN}(x) = TN\left(\mu_{CN}(x_f),\ md_{PN}(x)\right)$    (2.1)

where $TN$ is a T-norm, $\mu_{CN}(x_f)$ is the membership degree of feature $x_f$ to the current node $CN$, which uses $X_f$ as splitting attribute, and $md_{PN}(x)$ is the matching degree of the instance at the parent node $PN$. On the leaves, we compute the association degree of the unlabeled instance with class $C_m$ as:

$AD^{LN}_m(x) = md_{LN}(x) \cdot w^{LN}_m$    (2.2)

where $w^{LN}_m$ is the weight associated with class $C_m$ at leaf node $LN$. To classify the instance, there are two approaches: maximum matching, in which we associate the instance with the class with the maximum activation degree, and weighted vote, in which we associate the instance with the class with the maximum total of activation degrees, obtained by summing the activation degrees per class at each leaf reached by the instance.
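As an illustration of the two inference strategies, the following minimal Scala sketch (not the thesis implementation; names and types are assumptions) derives the output class from the association degrees produced by the activated leaves.

```scala
// `leafDegrees` holds, for each activated leaf, a map from class label to
// the association degree AD_m(x) already computed with Eq. 2.2.
def maximumMatching(leafDegrees: Seq[Map[Int, Double]]): Int =
  leafDegrees.flatten            // flatten into (class, degree) pairs
    .maxBy(_._2)._1              // class with the single highest association degree

def weightedVote(leafDegrees: Seq[Map[Int, Double]]): Int =
  leafDegrees.flatten
    .groupBy(_._1)               // group the degrees by class
    .mapValues(_.map(_._2).sum)  // total strength of vote per class
    .maxBy(_._2)._1              // class with the highest total
```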

2.2 FDT Discretization

In this section, we illustrate the discretization (partitioning) of continuous attributes in the FDT. The discretization is based on fuzzy entropy, so we briefly recall some definitions and concepts about it.

Let $TR_f = [x_{1,f}, \dots, x_{N,f}]^T$ be the projection of the training set $TR$ along attribute $X_f$. We assume that the values $x_{i,f}$ are sorted in increasing order. Let $I_f$ be an interval defined on the universe of attribute $X_f$, and let $l_f$ and $u_f$ be the lower and upper bounds of $I_f$. Let $S_f$ be the set of values $x_{i,f} \in TR_f$ contained in $I_f$. Let us assume that a fuzzy partition $P_{I_f} = \{B_{f,1}, \dots, B_{f,K_{P_{I_f}}}\}$, where $K_{P_{I_f}}$ is the number of fuzzy sets in $P_{I_f}$, is defined on $I_f$. Let $S_{f,1}, \dots, S_{f,K_{P_{I_f}}}$ be the subsets of points in $S_f$ contained in the supports of $B_{f,1}, \dots, B_{f,K_{P_{I_f}}}$, respectively. The weighted fuzzy entropy $WFEnt(P_{I_f}, I_f)$ of partition $P_{I_f}$ is defined as:

$WFEnt(P_{I_f}, I_f) = \sum_{j=1}^{K_{P_{I_f}}} \frac{|B_{f,j}|}{|S_f|} \cdot FEnt(B_{f,j})$    (2.3)

where $|B_{f,j}|$ is the fuzzy cardinality of fuzzy set $B_{f,j}$, $|S_f|$ is the cardinality of set $S_f$, and $FEnt(B_{f,j})$ is the fuzzy entropy of $B_{f,j}$.

We recall that the fuzzy cardinality of a fuzzy set $B_{f,j}$ is computed as

$|B_{f,j}| = \sum_{i=1}^{N_{f,j}} \mu_{B_{f,j}}(x_{i,f})$    (2.4)

where $N_{f,j}$ is the number of points in $S_{f,j}$ and $\mu_{B_{f,j}}(x_{i,f})$ is the membership degree of $x_{i,f}$ to fuzzy set $B_{f,j}$. The fuzzy entropy of $B_{f,j}$ is defined as

$FEnt(B_{f,j}) = \sum_{m=1}^{M} -\frac{|B_{f,j,C_m}|}{|B_{f,j}|} \cdot \log_2\left(\frac{|B_{f,j,C_m}|}{|B_{f,j}|}\right)$    (2.5)

where the fuzzy cardinality $|B_{f,j,C_m}|$ is computed on the set of instances in $S_{f,j}$ with class label $C_m$.
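A minimal, non-distributed Scala sketch of Eqs. 2.3-2.5 may help to fix the notation; the real implementation works on Spark RDDs, and the helper names used here are illustrative.

```scala
// `membership(x)` is the membership degree of value x to fuzzy set B_{f,j};
// `points` are the (value, classLabel) pairs of S_{f,j}, i.e. the training
// values falling in the support of B_{f,j}.
def fuzzyCardinality(points: Seq[(Double, Int)], membership: Double => Double): Double =
  points.map { case (x, _) => membership(x) }.sum                       // Eq. 2.4

def fuzzyEntropy(points: Seq[(Double, Int)], membership: Double => Double): Double = {
  val total = fuzzyCardinality(points, membership)
  points.groupBy(_._2).values.map { ofOneClass =>                       // Eq. 2.5
    val p = fuzzyCardinality(ofOneClass, membership) / total
    if (p == 0.0) 0.0 else -p * (math.log(p) / math.log(2.0))
  }.sum
}

// Eq. 2.3: weighted fuzzy entropy of a candidate partition {B_{f,1}, ..., B_{f,K}};
// each element of `subsets` pairs the points of S_{f,j} with the membership function of B_{f,j}.
def weightedFuzzyEntropy(subsets: Seq[(Seq[(Double, Int)], Double => Double)],
                         sizeOfSf: Double): Double =
  subsets.map { case (pts, mu) =>
    fuzzyCardinality(pts, mu) / sizeOfSf * fuzzyEntropy(pts, mu)
  }.sum
```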

Suppose we have to run the partitioning process with triangular fuzzy sets. At the beginning, $I_f$ coincides with the universe of $X_f$ and $S_f = TR_f$. For each value $x_{i,f}$ between $l_f$ and $u_f$ (at the beginning of the partitioning procedure, $i = 1, \dots, N$), we define a strong fuzzy partition $P_{x_{i,f}}$ on $I_f$ by using three triangular fuzzy sets, for example $B_{f,1}$, $B_{f,2}$, $B_{f,3}$. The cores of $B_{f,1}$, $B_{f,2}$ and $B_{f,3}$ coincide with $l_f$, $x_{i,f}$ and $u_f$, respectively. Let $S_{f,1}$, $S_{f,2}$ and $S_{f,3}$ be the subsets of points in $S_f$ contained in the supports of $B_{f,1}$, $B_{f,2}$ and $B_{f,3}$.

For each partition $P_{x_{i,f}}$ induced by $x_{i,f}$, we compute the weighted fuzzy entropy $WFEnt(P_{x_{i,f}}, I_f)$ using Eq. 2.3. The optimal value $x^0_{i,f}$, which minimizes $WFEnt(P_{x_{i,f}}, I_f)$ over all possible candidate fuzzy partitions, is then selected. This value identifies the fuzzy partition $P_{x^0_{i,f}} = \{B^0_{f,1}, B^0_{f,2}, B^0_{f,3}\}$. Then we recursively apply the procedure for determining the optimal strong fuzzy partition to the intervals $I^1_f = [l_f, x^0_{i,f}]$ and $I^2_f = [x^0_{i,f}, u_f]$.

Figure 2.1: An example of fuzzy partition defined on $x_{2,f}$ (the horizontal axis shows $l_f$, $x_{1,f}$, $x_{2,f}$, $x_{3,f}$ and $u_f$).

This recursive process is repeated for each subset until the stopping condition is met [7]. Given that $WFEnt(P^0_{I_f}, I_f)$ is the weighted fuzzy entropy at step 0, and $WFEnt(P_{x_c}, I_f)$ is the weighted fuzzy entropy of the partition originated by cutpoint $x_c$, we define the fuzzy information gain as:

$FGain(P_{x_c}, I_f) = WFEnt(P^0_{I_f}, I_f) - WFEnt(P_{x_c}, I_f)$

The stopping condition gives us a fuzzy gain threshold below which the partitioning process ends. This formula is used not only to stop the partitioning process, but also to perform feature selection: in fact, if the fuzzy information gain obtained with the first split is too low, the feature considered is discarded.

2.3 FDT Learning process

The algorithm selects the splitting attribute which maximizes the fuzzy information gain and creates children based on this split. As described in the introduction, the FDT has two ways to generate a split: the binary and the multi-way split. In the case of a binary split, we fuse adjacent fuzzy sets of the selected attribute into two disjoint groups, while in the multi-way split we create a new child node for each fuzzy set. We will illustrate the algorithm in detail in Section 3.2.1.

The learning process continues until one of the following stopping conditions is met (a minimal sketch of the resulting recursive loop is given after the list):

1. the node contains only instances of the same class;

2. the node contains a number of instances lower than a fixed threshold λ;

3. the tree has reached a maximum fixed depth β;
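The following high-level Scala sketch shows the recursive learning loop and the three stopping conditions; the split-selection helpers are reduced to trivial stubs and are not the actual library code.

```scala
case class Instance(features: Array[Double], label: Int)
case class Node(instances: Seq[Instance], depth: Int, var children: Seq[Node] = Nil)

object FdtLearningSketch {
  // Stub: in the real algorithm this is the fuzzy information gain of Eq. 3.1.
  def fuzzyInformationGain(node: Node, attribute: Int): Double =
    node.instances.map(_.features(attribute)).distinct.size.toDouble

  // Stub: in the real algorithm this builds binary or multi-way children from fuzzy sets.
  def split(node: Node, attribute: Int): Seq[Node] = {
    val (left, right) = node.instances.partition(_.features(attribute) <= 0.5)
    Seq(left, right).filter(_.nonEmpty).map(Node(_, node.depth + 1))
  }

  def grow(node: Node, lambda: Int, beta: Int): Unit = {
    val sameClass = node.instances.map(_.label).distinct.size <= 1  // stopping condition 1
    val tooFew    = node.instances.size < lambda                    // stopping condition 2
    val tooDeep   = node.depth >= beta                              // stopping condition 3
    if (sameClass || tooFew || tooDeep) return                      // the node becomes a leaf

    val nFeatures = node.instances.head.features.length
    val best = (0 until nFeatures).maxBy(fuzzyInformationGain(node, _))
    node.children = split(node, best)
    if (node.children.size > 1) node.children.foreach(grow(_, lambda, beta))
  }
}
```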


Chapter 3

Issues on FDT

Since many aspects can be analyzed in experiments with decision trees, in this chapter we present a list of issues that we have to face when dealing with FDTs. We will see in the next chapter that the experiments are aimed at further investigating these issues and at pointing out some relationships that may exist between them.

3.1 Membership function shape

Discretization can deeply influence the performance of the classification [8]. In the FDT, the first phase is a discretization process which defines membership functions upon each continuous attribute. These membership functions can have many shapes and, of course, depending on the shape, the overall classification is affected. Among all possible shapes, we decided to select triangular, pseudogaussian and trapezoidal membership functions.

Another aspect to be considered is how to fuse adjacent sets into one. This comes for free with simple shapes like the triangular or pseudogaussian ones, but with more complex shapes like the trapezoidal one, the partitioning scheme needs some adjustments.

3.1.1 Triangular set

The triangular set [13] is a classical fuzzy set and the simplest one we can use in the discretization process. It is defined by three parameters, two delimiting the ends of the segment and one indicating the peak. As we can see from Figure 3.1a, defining sets over partitions and fusing adjacent partitions is very easy. In the example in Figure 3.1a, we can see the two selected cutpoints, marked with dashed lines. As a general rule, the number of fuzzy sets obtained is the number of cutpoints plus two, accounting for the lower and upper bounds of the universe on which the feature is defined. In fact, in the example we end up with 4 fuzzy sets.
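A minimal Scala sketch of a triangular membership function and of a strong partition built from one cutpoint is shown below (the code is illustrative, not the library implementation).

```scala
// Triangular membership function defined by its three parameters:
// left end `a`, core `b`, right end `c`.
def triangular(a: Double, b: Double, c: Double)(x: Double): Double =
  if (x < a || x > c) 0.0
  else if (x < b) (x - a) / (b - a)
  else if (x > b) (c - x) / (c - b)
  else 1.0

// Example: a strong partition on [0, 10] with one cutpoint at 4, i.e. cores at 0, 4 and 10
// (one cutpoint plus the two bounds gives three sets). Membership degrees of adjacent sets
// sum to 1 at every point of the universe.
val sets = Seq(triangular(0, 0, 4) _, triangular(0, 4, 10) _, triangular(4, 10, 10) _)
println(sets.map(_(3.0)))  // List(0.25, 0.75, 0.0)
```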

3.1.2 Pseudogaussian set

The gaussian fuzzy set is another very well-known fuzzy set shape, but to implement it in the discretizer, we have to make some modifications. In fact, the gaussian function never reaches the value 0 within an interval (it goes to zero only at infinity). For this reason, it cannot be used as it is, because in the strong partition generated, the function should take all values between 0 and 1. So we have to approximate the shape of the gaussian function with another, simpler function, which is the parabola. In the end, because of the relation with the gaussian set, we decided to call it "pseudogaussian".

Figure 3.1: Examples of fuzzy partitions built by the discretizer: (a) triangular fuzzy sets, (b) pseudogaussian fuzzy sets, (c) trapezoidal fuzzy sets. In each plot, c1 and c2 are the selected cutpoints.

To compute membership degrees, we use the equation of the parabola in vertex form:

$f(x) = a(x - h)^2 + k$

Parameters $h$ and $k$ are respectively the x-coordinate and y-coordinate of the vertex, while $a$ is the same parameter as in the well-known formula $f(x) = ax^2 + bx + c$.

Looking at Figure 3.1b, let us consider the left part of the plot. We can imagine dividing it into four quadrants, whose center is the meeting point of the lines. We compute the equation of the "half" parabola with vertex in ul passing through point A in order to obtain the $a$ parameter. For symmetry reasons, this parameter is the same as that of the parabola with vertex in c1 passing through point A, and the opposite of that of the parabola with vertex in B. In this way, we compute the parameter only once, and then we can derive the four equations used to compute membership degrees during the discretization process.
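The following Scala sketch shows one possible formulation consistent with the description above, assuming that the rising side of the pseudogaussian between a partition bound and a core is made of two half-parabolas joined at the midpoint (where the membership degree is 0.5); the exact formulation used in the extended code may differ in its details.

```scala
// Rising side of a pseudogaussian set: membership goes from 0 at the bound `l`
// to 1 at the core `c`, using two half-parabolas with the same |a| parameter.
def pseudogaussianRisingSide(l: Double, c: Double)(x: Double): Double = {
  val mid = (l + c) / 2.0
  val a = 0.5 / math.pow((c - l) / 2.0, 2)        // same |a| for both half-parabolas
  if (x <= l) 0.0
  else if (x >= c) 1.0
  else if (x <= mid) a * math.pow(x - l, 2)       // parabola with vertex in (l, 0)
  else 1.0 - a * math.pow(x - c, 2)               // parabola with vertex in (c, 1)
}

// In a strong partition, the membership to the adjacent set on the same interval
// is the complement: 1.0 - pseudogaussianRisingSide(l, c)(x).
```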

Again, defining fuzzy sets on partitions and fusing adjacent sets is not too complex, because this type of shape is very similar to the triangular one.

3.1.3 Trapezoidal set

The trapezoidal set is yet another classical fuzzy set, although it is a bit different from the previous sets because it is characterized by four parameters instead of three. When we select a cutpoint, we have to choose a distance (let us call it d), in terms of values, used to generate the fuzzy sets on the two new partitions identified. In this way, we can fuse adjacent fuzzy sets and continue the partitioning process. This process is slightly different from the previous ones, because when we repeat the procedure on the generated subsets, we now have to consider that not all values in a subset can be selected as cutpoints.

For example, looking at Figure 3.1c, we have two cutpoints: c1 and c2. When we consider the partition that goes from c1 to c2, we can select as cutpoints only the values between point C and point D.

3.2 Binary FDT vs multiway FDT vs crisp DT

3.2.1 FDT: binary vs multi-way split

The FDT classifier uses two different methods for splitting a node: the binary split partitions the attribute space into two subspaces, while the multi-way split partitions it into a number of subspaces that can be more than two. Both trees use the same criterion to select the splitting attribute: we select the feature with the maximum fuzzy information gain $FGain$, computed for a generic attribute $X_f$ as:

$FGain(P_f; I_G) = FEnt(G) - WFEnt(P_f; I_G)$    (3.1)

where $I_G$ is the support of fuzzy set $G$.

Once the attribute is selected, in the multi-way split we generate a child node for each fuzzy set defined over the splitting attribute. In the binary split, we have to generate only two nodes. For this reason, we need to group adjacent fuzzy sets into two disjoint groups $Z_1$ and $Z_2$. Given that $G_1$ and $G_2$ are the two subsets containing the supports of the two groups, and that $T_f$ is the number of fuzzy sets defined over attribute $X_f$, we start with $Z_1 = \{A_{f,1}\}$ and $Z_2 = \{A_{f,2}, \dots, A_{f,T_f}\}$. Then we compute the fuzzy information gain with Eq. 3.1, with $P_f = \{Z_1, Z_2\}$ and cardinalities:

$|G_1| = \sum_{i=1}^{N_1} TN\left(\mu_{A_{f,1}}(x_{f,i}),\ \mu_G(x_i)\right)$  and  $|G_2| = \sum_{i=1}^{N_2} TN\left(\mu_{A_{f,2}}(x_{f,i}) + \dots + \mu_{A_{f,T_f}}(x_{f,i}),\ \mu_G(x_i)\right)$

where $N_1$ and $N_2$ are the numbers of instances in the supports of the fuzzy sets in $Z_1$ and $Z_2$, respectively, and $\mu_G(x_i)$ is the membership degree of instance $x_i$ to the parent node. We iterate, moving one fuzzy set at a time from $Z_2$ to $Z_1$, until we have $Z_2 = \{A_{f,T_f}\}$. We select the partition $P_f$ with the highest information gain and we create two child nodes using this partitioning.

In the case of categorical attributes, the fuzzy binary tree still performs binary splits. However, since a categorical attribute with $L$ values generates $2^{L-1}-1$ candidates, the computational cost can become prohibitive for a large number of values. In the case of binary classification, we can reduce the number of candidates to $L - 1$ by sorting the categorical values according to the probability of membership to the positive class. As proved in [4], this approach gives the optimal split in terms of entropy. In the case of multiclass classification, we adopt the heuristic method proposed in [10] to approximate the best split: the number of candidates is reduced to $L - 1$ by sorting the categorical values according to their impurity.
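A simplified Scala sketch of the search over adjacent groupings is shown below; the fuzzy information gain is left as an abstract parameter (in the thesis it is Eq. 3.1 computed with the cardinalities above), and the names are illustrative.

```scala
// Adjacent fuzzy sets of the selected attribute are grouped into two disjoint groups
// Z1 and Z2, every adjacent grouping is scored, and the one with the highest fuzzy
// information gain is kept.
def bestBinaryGrouping(fuzzySets: Seq[Int],                   // indices of A_{f,1}..A_{f,T_f}
                       gain: (Seq[Int], Seq[Int]) => Double   // fuzzy information gain of {Z1, Z2}
                      ): (Seq[Int], Seq[Int]) = {
  val candidates = (1 until fuzzySets.size).map(k => fuzzySets.splitAt(k)) // keeps sets adjacent
  candidates.maxBy { case (z1, z2) => gain(z1, z2) }
}

// Example with four fuzzy sets and a dummy gain that favours balanced groups:
// bestBinaryGrouping(Seq(1, 2, 3, 4), (z1, z2) => -(z1.size - z2.size).abs.toDouble)
// returns (Seq(1, 2), Seq(3, 4)).
```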

From these techniques for generating the tree, we can expect the multi-way tree to be generally more complex than the binary tree, because it creates more nodes at the higher levels of the tree, but also shorter, because it uses all the fuzzy sets available for generating nodes, while the binary tree can only partition the fuzzy sets into two subsets, generating a deeper tree.

3.2.2 Fuzzy tree vs crisp tree

Another interesting aspect of fuzzy decision trees is the comparison with the crisp decision tree. We want to investigate whether the accuracy of the fuzzy decision tree is similar to that of the crisp version, and, in conjunction with the issue described in Section 3.3, whether the decrease in accuracy of fuzzy trees is less pronounced than that of the crisp tree, leveraging the fact that approximate reasoning can bring some benefit when the considered parameter is increased.

3.3 Changing the minimum threshold of number of instances

We know that the tree learning process stops when there are fewer than a fixed number of instances on a node (as stated in Section 2.3, stopping condition number 2). In our experiments, we want to see what happens when varying this threshold: maybe we can reduce the number of nodes (and so the complexity of the tree) without reducing the accuracy, and maybe we can observe an optimal threshold percentage value which can be used.

In order to run this experiment, we tested a post-pruning approach. In a first phase, we trained a fully-grown tree, without any limitation on the minimum number of instances which a node must hold before being selected as a leaf. Then, from this tree, we built multiple trees, varying the threshold on the percentage of instances on a node w.r.t. the size of the training set.

Figure 3.2: Accuracy vs % instances on a node.

We expect that at lower percentages the accuracy will decrease less quickly than at higher ones, while the complexity of the tree (measured in terms of number of nodes) will start decreasing quickly already at low percentages. In this way, we obtain a simpler tree, with almost the same accuracy but with a great reduction in complexity. An example of a possible accuracy profile is shown in Figure 3.2, while a possible profile of the number of nodes is shown in Figure 3.3.

Figure 3.3: #Nodes vs % instances on a node.

3.4 Limiting the number of fuzzy sets

Typically, in big datasets, the number of fuzzy sets generated after the discretization step is high. This leads to increased complexity during the training of the tree, especially when we adopt the multi-way split, where we create a new child node for each fuzzy set. A higher number of fuzzy sets also makes the partitions less interpretable. So we want to investigate what happens if we limit the number of fuzzy sets in the discretizer. We want to discover whether a high number of fuzzy sets is really necessary to obtain the best accuracy, or whether a lower number, depending on the distribution of the attributes, can reach an acceptable or maybe even higher accuracy. In fact, since the distribution of the attributes can influence the discretization, a discretization with fewer fuzzy sets can work better, for example, in the case of a skewed distribution.

3.5 Proposing a new classification method

As a last issue, we want to investigate a new experimental classification method, presented in this section (tested only on big datasets).


Unlike crisp decision trees, with the FDT, both in the binary split and in the multi-way split case, we label each leaf node with all the classes that have at least one example in the leaf node. Each class $C_m$ has an associated weight $w^{LN}_m$ proportional to the fuzzy cardinality of the training instances of the $m$th class in the node. More formally, $w^{LN}_m = \frac{|G_{C_m}|}{|G|}$, where $G_{C_m}$ is the set of instances in $G$ with class label equal to $C_m$. Then the FDT adopts the weighted vote for deciding the class to be output for the unlabeled instance. For each class, the vote is computed as the sum of the association degrees determined by any leaf node of the tree for that class, where the association degree is calculated by Eq. 2.2. Each activated leaf produces a list of class association degrees, which are summed up to compute the strength of vote for each class. The unlabeled instance is associated with the class with the highest strength of vote.

The idea behind the new experimental classification method is to have a histogram of the class distribution, with buckets of matching degree, computed at each leaf of the tree.

Figure 3.4: Histogram of the number of instances per class over matching-degree buckets (matching degree on the horizontal axis, number of instances of the classes on the vertical axis).

Taking Figure 3.4 as an example, every bar of the histogram contains a set of instances of some classes for a given matching degree. The histogram is equi-width, and the number of buckets is ten.

When an unlabeled instance has to be classified, we compute the matching degree of all the paths activated by the instance towards the leaves. Then we look at the interval corresponding to the matching degree in the histogram of each activated leaf, and we compute the relative frequency of the classes in that interval. Again, a list of relative frequencies for each class is produced by every activated leaf. In the end, we sum up every relative frequency for every class, and we classify the instance as the class with the highest sum of relative frequencies.
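The following Scala sketch illustrates this inference scheme under the assumption that each leaf stores, for every class, a histogram with ten equi-width matching-degree buckets filled during training; names and types are illustrative.

```scala
// Index of the equi-width bucket corresponding to a matching degree in [0, 1].
def bucketOf(md: Double, nBuckets: Int = 10): Int =
  math.min((md * nBuckets).toInt, nBuckets - 1)

// `activatedLeaves` pairs, for each activated leaf, the matching degree of the instance
// with that leaf's histograms (class label -> per-bucket instance counts).
def classify(activatedLeaves: Seq[(Double, Map[Int, Array[Int]])]): Int = {
  val votes: Seq[(Int, Double)] = activatedLeaves.flatMap { case (md, leafHistograms) =>
    val b = bucketOf(md)
    val totalInBucket = leafHistograms.values.map(_(b)).sum.toDouble
    if (totalInBucket == 0.0) Seq.empty[(Int, Double)]
    else leafHistograms.toSeq.map { case (cls, hist) => cls -> hist(b) / totalInBucket } // relative frequencies
  }
  votes.groupBy(_._1).mapValues(_.map(_._2).sum).maxBy(_._2)._1  // class with the highest summed frequency
}
```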

We want to investigate this new approach because we want to test whether instances belonging to the same class activate the same buckets in the histograms. If this is true, we can find common "activation degree patterns" that can maybe increase accuracy. In the weighted vote approach, instead, all matching degrees are taken into account, losing the concept of "common pattern" when they are all multiplied by the weights and summed up.


Chapter 4

Experimental setting

Given the complexity of the problem, an analytical approach is impossible to follow; for this reason, we follow the experimental method. After having presented the list of issues, we now describe the experimental environment used to run the experiments, including the modifications made to the available code, and then the results obtained.

4.1 Hardware/software configuration

All experiments have been executed on an OpenStack cluster provided by the University of Pisa. We use one master node equipped with 4 VCPUs, 7 GB of RAM and a 160 GB hard drive, and four slave nodes equipped with the same amount of resources. The master node hosts a DataNode, which stores all the training datasets used in the experiments, a NameNode and the Spark driver process, while each slave node runs two executors. On the master node, we reserved 2 VCPUs and 3 GB of RAM for running the Hadoop services, and on the slave nodes we reserved 1 GB of RAM for the OS. In the end, the experiments have been deployed upon Apache Spark 2.2.0 as data-processing framework, using a total of twelve executor processes. All virtual machines run Ubuntu 14.04.3 LTS as operating system.
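For illustration only, a Spark session with this kind of resource layout could be configured as in the sketch below; the exact values used in the experiments were set at submission time and are not reported here, so the numbers in the snippet are assumptions derived from the hardware description above.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative configuration: two executors per slave node, leaving 1 GB per node to the OS.
val spark = SparkSession.builder()
  .appName("distributed-fdt-experiments")
  .config("spark.executor.instances", "8")   // assumption: 2 executors on each of the 4 slaves
  .config("spark.executor.cores", "2")       // 4 VCPUs per node / 2 executors
  .config("spark.executor.memory", "3g")     // (7 GB - 1 GB for the OS) / 2 executors
  .config("spark.driver.memory", "3g")
  .getOrCreate()
```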

4.2 Dataset description

In Table 4.1, we can see the small datasets used in the experiments, taken from the KEEL repository. We mainly chose datasets with numerical and real attributes because, as a first objective, we want to test the effect of a discretization process on continuous features. However, we also chose one dataset (Vowel) for investigating the behaviour of the other issues. The datasets are already partitioned for 5-fold cross validation.

Dataset   # Instances   # Continuous/categorical attributes   # Classes
Iris          150                    4/0                           3
Glass         214                    9/0                           7
Segment      2310                   19/0                           7
Vehicle       846                   18/0                           4
Vowel         990                   10/3                          11
Wine          178                   13/0                           3

Table 4.1: Small datasets used in the experiments.

• Iris: this is perhaps the best known database to be found in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

• Glass: from USA Forensic Science Service; 6 types of glass which can be found in the crime scene, defined in terms of their oxide content (i.e. Na, Fe, K, etc).

• Segment: this database contains instances drawn randomly from a database of 7 outdoor images (classes). The images were hand-segmented to create a classification for every pixel. Each instance encodes a 3x3 region. The attributes encode some colour properties of the regions. The task is to determine the type of surface of each region.

• Vehicle: the purpose is to classify a given silhouette as one of four types of vehicle (van, saab, bus, opel), using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

• Vowel : this database contains information about speaker independent recognition of the eleven steady state vowels of British English using a specified training set of LPC derived log area ratios.

• Wine: these data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The parameters used in simulations are listed in Table 4.2 and Table 4.3.


Parameter                    Value
Impurity                     Entropy
Minimum information gain     10^-6

Table 4.2: Small dataset parameters used in the discretization process.

Parameter                    Value
Impurity                     Entropy
Minimum information gain     10^-6
Minimum instances per node   1
Maximum tree depth           15
T-norm                       Product

Table 4.3: Small dataset parameters used in the classification process.

In Table 4.4, we can see the big datasets used in the experiments. Even in this setting, we chose one dataset with some categorical features. All the experiments are again run using 5-fold cross validation.

Dataset     # Instances   # Continuous/categorical attributes   # Classes
Covertype       581,012                 10/44                        2
ECO E         4,178,504                 16/0                        10
Susy          5,000,000                 18/0                         2

Table 4.4: Big datasets used in the experiments.

• Covertype: Covertype is about predicting forest cover type from cartographic variables. A given observation of the forest cover type is a 30 x 30 meter cell. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and US Forest Service (USFS) data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types).

• ECO E: ECO (Electricity Consumption and Occupancy) contains aggregate electricity consumption data – including real and reactive power for each of the three phases – and plug-level measurements of selected household appliances.

• Susy: the data has been produced using Monte Carlo simulations. The first 8 features are kinematic properties measured by the particle detectors in the accelerator. The last ten features are functions of the first 8 features; these are high-level features derived by physicists to help discriminate between the two classes. This is a classification problem to distinguish between a signal process which produces supersymmetric particles and a background process which does not.

The parameters used in the big dataset simulations are listed in Table 4.5 and Table 4.6. In the first table, we can see some differences w.r.t. Table 4.2. In fact, during the discretization process on big datasets, we cannot proceed with the method explained in Section 2.2. That discretization process iterates over every single point of the considered feature, and this is of course not a scalable solution when dealing with Big Data. For this reason, we limit the number of possible candidate partitions, splitting the domain of the continuous attributes into a fixed number of equi-frequency bins. Then we generate candidate fuzzy partitions using each bin, represented by its central value.
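A minimal sketch of this binning strategy is shown below (plain Scala collections are used instead of RDDs, and the function name is illustrative).

```scala
// The sorted values of a continuous attribute are divided into bins holding (roughly) the
// same number of points, and each bin is represented by its central value, which becomes
// a candidate cutpoint for the fuzzy partitioning.
def equiFrequencyCandidates(values: Seq[Double], maxBins: Int): Seq[Double] = {
  val sorted = values.sorted
  val binSize = math.max(1, math.ceil(sorted.size.toDouble / maxBins).toInt)
  sorted.grouped(binSize)              // consecutive bins with ~binSize points each
    .map(bin => bin(bin.size / 2))     // central value represents the bin
    .toSeq
}

// Example: 1000 distinct values and maxBins = 10 give 10 candidate cutpoints instead of 1000.
// equiFrequencyCandidates((1 to 1000).map(_.toDouble), 10)
```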


Parameter                    Value
Impurity                     Entropy
Binning strategy             Equi-frequency
Maximum bins                 1000
Subsampling fraction         1%
Minimum information gain     10^-6

Table 4.5: Big dataset parameters used in the discretization process.

Parameter                    Value
Impurity                     Entropy
Minimum information gain     10^-6
Minimum instances per node   10^-4 %
Maximum tree depth           15 (binary), 5 (multiway)
T-norm                       Product

Table 4.6: Big dataset parameters used in the classification process.

4.3 Extension of the implementation

In order to run the experiments, we initially had to reorganize the code and implement some new functionalities. We extended the code, written in Scala [12] and published on GitHub, implementing some classes for the new discretization fuzzy set shapes (pseudogaussian, trapezoidal, trapezoidal-adaptive). In such cases, one would usually apply the decorator pattern, but, for simplicity, we preferred to extend the code in the way presented in Figure 4.1. FuzzyMDLFilter is now an abstract class, while in the previous version of the code it implemented the only available discretizer fuzzy set, the triangular one. The library provides discretization API methods for Scala and Python. The filter method is invoked, and the calculateCutPoints method is applied to the continuous attributes of the dataset. calculatePriorEntropy is called at each step of the algorithm, computing the starting entropy of the partition, while, for each possible cutpoint in the partition, the method calculateNewCardinality is called. We select the best cutpoint, which minimizes the fuzzy entropy for the current partition, and if the mdlStopCondition is not met (i.e., the fuzzy information gain is large enough), the partitioning process continues recursively, calling calculateCutPoints again for the partitions on the right and on the left of the selected cutpoint. Depending on the fuzzy set shape, some methods are overridden to implement the correct behaviour.

To test the limitation of the number of fuzzy sets generated by the discretizer, we run the same algorithm and, at the end, we take the best cutpoints (among all the cutpoints selected) which generate the required number of fuzzy sets with the lowest entropy. For example, suppose we want to limit the discretizer to five fuzzy sets in output. In triangular partitioning, at the beginning we have zero cutpoints and two fuzzy sets. If we represent the process as a tree, we are at depth 0. Then, after the first split, we have one cutpoint and three fuzzy sets, and we call the partitioning again on the right and on the left of the cutpoint; at this point, we are at depth 1. If both sides split again, we end up with another two cutpoints, so globally we have three cutpoints and five fuzzy sets. But we cannot stop here, because the two branches of the recursion are independent and run in parallel to get the best performance. We could simply run the algorithm until it stops and select the best cutpoints among all of them, but this can be useless and can also waste resources, because we would have to compare all the combinations of groups of cutpoints to get the requested number of fuzzy sets. So, to optimize the process, in our example we can stop at the next step, when we are at depth 3: in fact, we know that if we are at depth 3, three cutpoints have already been selected along that branch, regardless of the decisions of the other branches. In the end we select the best cutpoints among all of them, but their number is reduced thanks to this optimization.

Figure 4.1: UML diagram of the extended code: the SupervisedAttributeFilter interface (with method filter(RDD[LabeledPoint])) is implemented by the abstract class FuzzyMDLFilter (with methods filter, calculateCutPoints, calculateNewCardinality, calculatePriorEntropy and mdlStopCondition), which is extended by DiscretizerTriangularSet, DiscretizerPseudogaussianSet, DiscretizerTrapezoidalSet and DiscretizerTrapezoidalAdaptiveSet, each overriding the methods needed for its shape.
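The hierarchy of Figure 4.1 can be summarized by the following Scala skeleton; class and method names come from the UML diagram, while signatures and bodies are trivial placeholders and not the actual library code.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

trait SupervisedAttributeFilter {
  def filter(data: RDD[LabeledPoint]): Array[Array[Double]]     // cutpoints per attribute
}

abstract class FuzzyMDLFilter extends SupervisedAttributeFilter {
  override def filter(data: RDD[LabeledPoint]): Array[Array[Double]] = calculateCutPoints(data)
  protected def calculateCutPoints(data: RDD[LabeledPoint]): Array[Array[Double]] = Array.empty
  protected def calculatePriorEntropy(partition: Seq[LabeledPoint]): Double = 0.0
  protected def calculateNewCardinality(partition: Seq[LabeledPoint], cutpoint: Double): Double = 0.0
  protected def mdlStopCondition(fuzzyGain: Double, threshold: Double): Boolean = fuzzyGain < threshold
}

// Each shape overrides only the methods whose behaviour differs from the triangular case.
class DiscretizerTriangularSet extends FuzzyMDLFilter
class DiscretizerPseudogaussianSet extends FuzzyMDLFilter {
  override protected def calculatePriorEntropy(p: Seq[LabeledPoint]): Double = 0.0
  override protected def calculateNewCardinality(p: Seq[LabeledPoint], c: Double): Double = 0.0
}
class DiscretizerTrapezoidalSet extends FuzzyMDLFilter {
  override protected def calculateCutPoints(data: RDD[LabeledPoint]): Array[Array[Double]] = Array.empty
}
class DiscretizerTrapezoidalAdaptiveSet extends DiscretizerTrapezoidalSet
```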

We also implemented some more code, extending the other part of the library, also published on GitHub, to test the issue regarding the minimum number of instances on a node. In this case, we decided to adopt a post-pruning approach, iterating once over all nodes of the tree, which is already trained with a very low threshold on the minimum number of instances. During this iteration, every node is tested against a given threshold (from 1% to 20%), and we store in an array whether the node is a leaf or not at that percentage. We also provide a method which returns the tree at a certain threshold, passed as an argument.
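A minimal sketch of this post-pruning bookkeeping is shown below, assuming a simple tree representation in which every node knows the fraction of training instances it holds; names are illustrative.

```scala
case class TreeNode(instanceFraction: Double, children: Seq[TreeNode]) {
  def nodes: Seq[TreeNode] = this +: children.flatMap(_.nodes)
}

// Thresholds from 1% to 20%, as in the experiments.
val thresholds: Seq[Double] = (1 to 20).map(_ / 100.0)

// For every node of a fully-grown tree, store one flag per threshold: true when the node
// would become a leaf at that threshold, without retraining the tree.
def leafFlags(fullyGrownTree: TreeNode): Seq[(TreeNode, Seq[Boolean])] =
  fullyGrownTree.nodes.map { node =>
    // a node is a leaf at threshold t when it holds fewer instances than t (it would not
    // have been split) or when it has no children in the fully-grown tree
    node -> thresholds.map(t => node.children.isEmpty || node.instanceFraction < t)
  }

// A pruned view of the tree at a given threshold is then obtained by treating every
// flagged node as a leaf.
```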


Chapter 5

Results

5.1 Small datasets

When testing the tree discretizer on small data, we have to say something about the stopping condition discussed in Section 2.2. In fact, this condition is specific to big datasets, and in the very first tests with small data it tends to discard too many features. This is a problem, because if we want to compare the accuracy and the complexity of the tree while varying the discretization technique, the results are influenced by the number of features used for the classification: fewer features, less accuracy. This problem is even more evident when comparing with the crisp decision tree, which does not use any type of feature selection. For this reason, we decided to avoid the use of the stopping condition when dealing with small data, limiting instead the number of fuzzy sets in output of the discretizer (5 membership functions).


5.1.1 Results on discretization

In this test, we simply compare the accuracy and the complexity, measured in terms of number of nodes (lower is better), of the three discretization shapes: triangular, trapezoidal and pseudogaussian.

Looking at Figure 5.1, regarding accuracy, we can say that the pseudogaussian and triangular discretizations outperform the trapezoidal one in all cases: this is very clear on the Segment and Vowel datasets.

Regarding the complexity, the trapezoidal discretization leads to trees with fewer nodes, even if in some cases this means less accuracy. The pseudogaussian and trapezoidal discretizations are again almost equal.

When comparing the binary vs the multiway split, we can immediately see that in terms of performance there are no substantial differences, but if we look at the number of nodes, we can note that the multiway trees have more nodes than the binary trees. This behaviour was expected, as discussed in Section 3.2.1. We can notice that the largest increase is on Segment, which is the small dataset with the highest number of instances.

5.1.2 Results on varying minimum number of instances

In this test, we compare the same six small datasets when varying the threshold percentage of the minimum number of instances on a node. Since the triangular and pseudogaussian shapes are similar and they again outperform the trapezoidal one, we decided to report here only the plots for the triangular shape.

(a) Binary split results on small datasets, using the three discretizer shapes. (b) Multiway split results on small datasets, using the three discretizer shapes.
Figure 5.1: Results on small data. On the primary vertical axis we find the accuracy (%) and on the secondary vertical axis the number of nodes.

(a) Binary split results on small datasets, using triangular discretization, varying the percentage of the minimum number of instances on a node. (b) Multiway split results on small datasets, using triangular discretization, varying the percentage of the minimum number of instances on a node.
Figure 5.2: Results on small data when varying the minimum number of instances on a node (triangular discretization).

We note immediately that the accuracy decreases as the threshold increases, while the number of nodes decreases as well, as expected. In fact, due to the post-pruning approach, there are fewer nodes and the leaves hold more instances, so misclassification is more probable. Average profiles of accuracy and node number are reported in Figure 5.3, and they are similar to the ones sketched in Figure 3.2 and in Figure 3.3.

On datasets with a low number of instances, like Iris and Wine, we see that the reduction in terms of nodes is not so significant, so the accuracy is almost the same. On the other datasets, where the variation of the percentage is more meaningful, it seems that the node reduction beyond 1% is very large: for example at 5% it is more than half on average, while the accuracy decreases only a little. With percentages greater than 5%, the number of nodes does not decrease much further, while the classification error becomes unacceptable. Finally, we can note that on Vowel the decrease in accuracy is more accentuated w.r.t. the other datasets, but we have to remember that this dataset is the only one in our tests that also has categorical attributes, for which other classifiers, like the associative one [3], are shown to be more effective.

5.1.3 Results on fuzzy vs crisp decision tree

In this last test for small data, we compared the fuzzy decision tree (binary and multiway) with the crisp decision tree, in an experiment in which we also include the variation of the minimum threshold of instances on nodes, as explained in Section 3.2.2. The results are presented in Figure 5.4.

On the horizontal axis we find the datasets: every dataset is reported three times with three different thresholds of the minimum number of instances on nodes. Again we report only the triangular discretization (obviously meaningful only in the case of fuzzy trees), to focus the comparison on the different classifiers.

(a) Accuracy profile obtained by averaging over all datasets. (b) Node number profile obtained by averaging over all datasets.
Figure 5.3: Experimental profiles of accuracy and node number obtained in the experiment varying the minimum number of instances on nodes.

Figure 5.4: Comparison between FDT binary, crisp tree and FDT multiway.


On datasets with lower cardinality, i.e., Glass, Iris, and Wine, the reduction in accuracy of all three types of decision tree is not so remarkable, even with the crisp tree. However, we can note that on Wine the accuracy reduction of the crisp tree is greater than on the other two datasets, while its node reduction is not too large. Instead, for the fuzzy binary decision tree, the node reduction is slightly larger, while the accuracy does not decrease too much.

On datasets with more instances, such as Segment or Vehicle, we see that the decrease in accuracy of the crisp tree is larger, while the binary tree is more robust than the other two trees. The node reduction is again very significant in the case of the fuzzy trees, while in the case of the crisp tree it is very low.

In the end, we have to say that the crisp decision tree uses fewer nodes on average, reaching the same accuracy as the fuzzy trees at lower percentages, while, at increasing percentages, the accuracy of the fuzzy trees outperforms the crisp one. This result confirms the expectations discussed in Section 3.2.2, but due to the reduced number of nodes used by the crisp tree, we can conclude that this type of tree is a strong contender.

5.2 Big datasets

Differently from the small data case, the stopping condition discussed in Section 2.2 is used for Big Data. In this context, we performed the same tests as for small data, adding some experiments regarding the number of fuzzy sets in output of the discretizer and the newly proposed classification method. Since a classifier (and the discretization method) is influenced by the distribution of the input variables, in order to carry out a complete analysis of the experiments, we plotted the distributions of the feature attributes of the three datasets.


We report the plots here in Figure 5.5.

5.2.1 Results on discretization

In this experiment, we compare the accuracy and the number of nodes of the three discretization methods (triangular, trapezoidal, pseudogaussian). The results are shown in Figures 5.6, 5.7 and 5.8.

Again, in all results, we can see that the trapezoidal discretization cannot achieve performance as good as the triangular and pseudogaussian ones. These other two are similar in terms of number of nodes and accuracy, and they differ only on the Covtype dataset, where the pseudogaussian outperforms the triangular one in accuracy.

Comparing the binary with the multiway split, we can see that the multiway trees have many more nodes, more than four times those of the binary trees on average. This is reasonable because the multiway split is able to investigate a higher number of correlations between attributes by generating a higher number of nodes at each level. This increased complexity offers advantages on Covtype and Eco, where the accuracy is better than the binary counterpart, while on Susy the elevated number of nodes does not provide any advantage in achieving better accuracy. This can be explained by looking at the distribution of the features: in Susy we have many skewed distributions, because 3 attributes out of 18 are discarded (the three drawn as uniform distributions), so a good classification is hard to achieve. In Eco the attributes cover the universe of the feature values broadly, so the multiway split can analyze more correlations, as said before. In Covertype there are some binary categorical features, which can generate approximation errors when selecting the best split in binary trees, as stated in Section 3.2.1, and these errors are then propagated to the child nodes. For this reason, the multiway tree outperforms the binary tree in this case.

Figure 5.5: Attribute distributions: (a) Susy, (b) Eco.

Figure 5.6: Discretization experiment on Susy.

Figure 5.7: Discretization experiment on Eco.

Figure 5.8: Discretization experiment on Covtype.

5.2.2 Results on limiting the number of fuzzy sets

In this test, we decided to focus on limiting the number of fuzzy sets, and we also test the minimum number of instances threshold on Big Data. In Tables 5.1, 5.2 and 5.3 we find the results obtained. Since the pseudogaussian has the best accuracy for all the datasets, this type of discretization is selected for this test. MF equal to zero means that there is no limit in output of the discretizer. We report only data from the binary decision tree, since performances for the multiway split are similar. In Table 5.1, we can see that we can obtain quite a good accuracy even if we use a threshold of 1%. We greatly reduce the number of nodes created by the tree, especially when we limit the number of membership functions. Looking at the other two tables, we see that the accuracy decreases more quickly already at 1%, and when limiting the number of fuzzy sets the situation gets worse.

This behaviour can be explained by looking at the features of the datasets: attributes in Susy have a skewed but smoother distribution, and the discretization method does not create many fuzzy sets when they are not limited, as we can see from Table 5.1, while, for example, attributes in eCO e are not so smooth, so a good classification requires many fuzzy sets and more nodes to achieve an acceptable performance.

Dataset  MF  Threshold  Acc (Tr)      Acc (Tst)     #Node  Depth             #FuzzySet
susy     3   1%         76.932±0.022  76.931±0.016  302.8  (14.0,2.0,9.187)  (3,3,3)
             5%         76.629±0.040  76.626±0.045   78.4  (9.0,2.0,6.308)   (3,3,3)
             10%        76.456±0.054  76.451±0.070   48.0  (9.0,2.0,5.4)     (3,3,3)
             15%        76.088±0.008  76.087±0.028   24.0  (6.0,2.0,4.077)   (3,3,3)
             20%        76.611±0.009  76.610±0.031   20.0  (5.0,2.0,3.727)   (3,3,3)
susy     5   1%         74.848±0.095  74.844±0.074  111.6  (15.0,3.0,8.285)  (5,4,4)
             5%         72.701±0.037  72.697±0.053   46.4  (9.0,3.0,5.355)   (5,4,4)
             10%        71.445±0.025  71.447±0.041   26.8  (5.4,3.0,4.018)   (5,4,4)
             15%        71.725±0.036  71.729±0.023   16.0  (4.0,2.0,3.333)   (5,4,4)
             20%        71.725±0.036  71.729±0.023   14.0  (4.0,2.0,3.125)   (5,4,4)
susy     7   1%         76.335±0.103  76.320±0.087  271.2  (15.0,2.0,9.38)   (7,4,6)
             5%         75.401±0.162  75.397±0.180   78.0  (13.8,2.0,7.213)  (7,4,6)
             10%        75.121±0.192  75.109±0.220   38.0  (10.4,2.0,5.698)  (7,4,6)
             15%        75.131±0.250  75.120±0.309   24.4  (9.6,1.0,5.543)   (7,4,6)
             20%        75.026±0.348  75.010±0.380   19.2  (7.0,1.0,4.458)   (7,4,6)
susy     0   1%         78.785±0.020  78.769±0.029  396.8  (14.6,3.6,9.243)  (27,4,18)
             5%         76.963±0.027  76.970±0.032   80.8  (8.6,3.0,6.04)    (27,4,18)
             10%        76.895±0.092  76.902±0.108   47.2  (7.4,3.0,5.093)   (27,4,18)
             15%        76.585±0.012  76.591±0.039   34.0  (6.2,3.0,4.546)   (27,4,18)
             20%        76.516±0.014  76.519±0.040   26.0  (6.0,2.0,4.214)   (27,4,18)

Table 5.1: Accuracy, node number, tree depth and number of fuzzy sets on the Susy dataset with a binary fuzzy tree (MF = limit on the number of membership functions; MF = 0 means no limit).

5.2.3

Result on fuzzy vs crisp

In this section we will discuss about comparison between fuzzy decision tree and crisp decision tree. We find in Table 5.4 the accuracy with training and testing dataset and the number of nodes obtained. We can see that at same percentage of number of instances on a node, the accuracy of the fuzzy is

(54)

Dataset MF Threshold Acc (Tr) Acc (Tst) #Node Depth #FuzzySet eCO e 3 1% 64.825±0.182 64.815±0.208 331.2 (15.0,4.0,9.457) (3,3,3) 5% 63.633±0.102 63.636±0.131 84.0 (15.0,3.0,8.512) (3,3,3) 10% 63.285±0.113 63.278±0.145 64.0 (15.0,3.0,8.636) (3,3,3) 15% 61.560±0.011 61.551±0.039 49.2 (13.6,2.0,8.044) (3,3,3) 20% 59.708±0.017 59.715±0.025 32.0 (9.0,2.0,5.941) (3,3,3) eCO e 5 1% 67.843±2.613 67.834±2.605 238.8 (15.0,3.2,9.102) (5,5,5) 5% 64.352±1.844 64.335±1.841 70.8 (11.8,2.6,6.8) (5,5,5) 10% 62.123±2.270 62.114±2.272 38.8 (9.8,2.6,5.449) (5,5,5) 15% 61.197±1.742 61.188±1.743 23.6 (8.6,2.0,4.912) (5,5,5) 20% 61.458±1.103 61.446±1.105 17.2 (7.0,2.0,4.183) (5,5,5) eCO e 7 1% 63.645±0.045 63.659±0.074 389.2 (15.0,3.0,9.553) (7,7,7) 5% 62.876±0.016 62.893±0.027 76.0 (10.0,3.0,6.026) (7,7,7) 10% 60.953±0.010 60.967±0.035 32.0 (8.0,3.0,4.647) (7,7,7) 15% 59.956±0.020 59.970±0.044 22.0 (6.0,3.0,3.917) (7,7,7) 20% 58.357±0.053 58.368±0.068 18.0 (6.0,2.0,3.8) (7,7,7) eCO e 0 1% 81.881±0.092 81.879±0.067 438.4 (15.0,3.0,9.519) (41,35,38) 5% 76.601±0.379 76.591±0.356 90.4 (11.8,2.0,6.817) (41,35,38) 10% 75.539±0.338 75.518±0.321 47.2 (10.4,2.0,5.724) (41,35,38) 15% 74.493±0.079 74.487±0.060 31.2 (8.6,2.0,5.061) (41,35,38) 20% 73.397±0.664 73.401±0.709 17.2 (4.0,2.0,3.433) (41,35,38)

Table 5.2: Accuracy, node number, tree depth and number of fuzzy sets in Eco dataset with a binary fuzzy tree.

higher, but the number of nodes is more than twice in some cases in binary tree and even more in multiway split. Only with higher percentage (15-20%) the fuzzy trees outperforms the crisp tree. In fact, if we look at eCO e, at 15% and at 20% the binary tree has an higher accuracy and lower number of nodes than crisp decision tree at 5%. In other cases, we can confirm again that the crisp tree is a strong contender even in a Big Data context, using a few nodes and acquiring a comparable percentage of classification rate w.r.t fuzzy trees. A small note on binary vs multiway when increasing the threshold: we can see that in most of the cases, the reduction in accuracy is higher in multiway rather than binary split. This is maybe due to the fact that when we increase the threshold and we do not split a node, with the multiway we are merging at least more leaves node, so the aggregation has

(55)

Dataset MF Threshold Acc (Tr) Acc (Tst) #Node Depth #FuzzySet covtype 3 1% 67.855±0.042 67.914±0.101 76.8 (15.0,2.0,9.222) (3,3,3) 5% 67.866±0.041 67.925±0.102 24.0 (8.0,2.0,4.615) (3,3,3) 10% 67.598±0.049 67.634±0.122 14.0 (4.0,2.0,3.125) (3,3,3) 15% 67.598±0.049 67.634±0.122 14.0 (4.0,2.0,3.125) (3,3,3) 20% 66.610±0.040 66.640±0.102 8.0 (3.0,1.0,2.6) (3,3,3) covtype 5 1% 61.605±0.140 61.566±0.160 220.0 (15.0,2.0,11.076) (5,5,5) 5% 60.769±0.187 60.722±0.176 75.2 (12.6,1.0,8.781) (5,5,5) 10% 59.258±0.875 59.129±0.844 44.8 (11.2,1.0,7.387) (5,5,5) 15% 58.556±0.037 58.434±0.137 31.2 (10.0,1.0,6.39) (5,5,5) 20% 58.342±0.070 58.220±0.171 24.0 (10.0,1.0,5.923) (5,5,5) covtype 7 1% 62.565±1.114 62.478±1.025 261.2 (15.0,2.0,10.56) (7,6,6) 5% 61.148±0.590 61.116±0.515 71.6 (14.4,2.0,7.819) (7,6,6) 10% 60.247±0.184 60.190±0.135 36.0 (10.6,1.0,6.411) (7,6,6) 15% 59.879±0.186 59.832±0.055 23.2 (7.8,1.0,5.176) (7,6,6) 20% 59.616±0.118 59.608±0.126 17.2 (7.6,1.0,4.844) (7,6,6) covtype 0 1% 76.350±0.117 76.320±0.283 289.6 (15.0,3.0,9.297) (22,6,11) 5% 74.556±0.023 74.536±0.145 72.4 (12.6,3.0,6.626) (22,6,11) 10% 73.563±0.029 73.532±0.087 43.6 (10.6,3.0,5.714) (22,6,11) 15% 73.158±0.025 73.122±0.112 31.2 (10.2,2.0,5.614) (22,6,11) 20% 73.151±0.025 73.119±0.110 29.6 (9.8,2.0,5.443) (22,6,11)

Table 5.3: Accuracy, node number, tree depth and number of fuzzy sets in the Covtype dataset with a binary fuzzy tree.

This is likely due to the fact that, when we increase the threshold and a node is not split, the multiway tree merges all of its would-be child leaves at once, so the aggregation has a coarser granularity than in the binary case, in which at least two nodes are merged.
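To fix the idea, the following is a minimal sketch of the pre-pruning rule discussed above; the names and structure are illustrative assumptions, not the implementation used in this work.

object MinInstancesStopSketch {

  // Statistics available at a node before deciding whether to split it.
  final case class NodeStats(numInstances: Long, childrenIfSplit: Int)

  // A node becomes a leaf when it covers fewer instances than the
  // minimum-instances threshold (expressed as a fraction of the dataset).
  def stopSplitting(node: NodeStats, totalInstances: Long, minFraction: Double): Boolean =
    node.numInstances < minFraction * totalInstances

  // In a binary FDT a suppressed split collapses two would-be children into
  // one leaf; in a multiway FDT it collapses one child per fuzzy set of the
  // selected attribute, i.e. a coarser aggregation for the same threshold.
}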

5.2.4 Result on new classification method

Results in Table 5.5 show that no significant improvements are obtained at lower percentages, while at higher percentages the accuracy is worse than with the previous method. Recall that the histogram of the class distribution uses equi-width buckets of activation degree; in our experiments we use ten buckets. The idea behind the proposed method is to spot common activation patterns in the leaves that identify instances of similar classes.


                      FBDT                                   FMDT                                   CDT
Dataset   Threshold   Acc (Tr)      Acc (Tst)     #Node      Acc (Tr)      Acc (Tst)     #Node      Acc (Tr)      Acc (Tst)     #Node
susy      1%          78.785±0.020  78.769±0.029  396.8      76.902±0.009  76.901±0.030   667.6     77.808±0.024  77.802±0.037  152.6
susy      5%          76.963±0.027  76.970±0.032   80.8      74.226±0.313  74.226±0.331   198.6     76.482±0.028  76.481±0.057   27.0
susy      10%         76.895±0.092  76.902±0.108   47.2      72.616±0.071  72.612±0.072    45.0     75.332±0.169  75.332±0.158   14.6
susy      15%         76.585±0.012  76.591±0.039   34.0      72.477±0.007  72.476±0.029    24.0     74.476±0.084  74.478±0.094    8.2
susy      20%         76.516±0.014  76.519±0.040   26.0      72.477±0.007  72.476±0.029    24.0     74.471±0.055  74.475±0.093    7.0
eCO e     1%          81.881±0.092  81.879±0.067  438.4      82.438±0.146  82.405±0.133  1228.2     77.709±0.603  77.682±0.627  145.8
eCO e     5%          76.601±0.379  76.591±0.356   90.4      72.794±0.248  72.830±0.302    38.4     72.475±0.194  72.486±0.206   29.4
eCO e     10%         75.539±0.338  75.518±0.321   47.2      72.794±0.248  72.830±0.302    38.4     69.391±0.239  69.414±0.272   13.0
eCO e     15%         74.493±0.079  74.487±0.060   31.2      72.794±0.248  72.830±0.302    38.4     62.855±0.185  62.861±0.175    9.0
eCO e     20%         73.397±0.664  73.401±0.709   17.2      72.794±0.248  72.830±0.302    38.4     62.219±0.137  62.237±0.117    7.0
covtype   1%          76.350±0.117  76.320±0.283  289.6      75.628±0.092  75.648±0.230   537.4     75.183±0.091  75.108±0.168  150.6
covtype   5%          74.556±0.023  74.536±0.145   72.4      74.030±0.033  74.015±0.156   110.8     72.569±0.341  72.515±0.238   28.2
covtype   10%         73.563±0.029  73.532±0.087   43.6      73.144±0.024  73.092±0.108    37.0     71.864±0.182  71.868±0.270   13.8
covtype   15%         73.158±0.025  73.122±0.112   31.2      73.144±0.024  73.092±0.108    24.4     70.155±0.068  70.183±0.083   11.0
covtype   20%         73.151±0.025  73.119±0.110   29.6      73.144±0.024  73.092±0.108    21.8     66.685±0.105  66.728±0.275    5.0

Table 5.4: Accuracy and number of nodes in fuzzy (pseudogaussian discretization) vs crisp.

But when we increase the threshold of minimum instances per node, the leaves fill up with many heterogeneous instances, which tends to blur these common patterns. For this reason the accuracy degrades as the threshold percentage increases.
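To make the idea concrete, the following is a minimal sketch (hypothetical names, not the actual code developed for this thesis) of how a per-leaf class histogram over equi-width activation-degree buckets could be built and queried; it assumes activation degrees normalized to [0, 1] and the ten buckets used in our experiments.

object ActivationHistogramSketch {

  val NumBuckets = 10  // equi-width buckets, as in the experiments above

  // Map an activation degree in [0, 1] to its bucket index.
  def bucketOf(activation: Double): Int =
    math.min((activation * NumBuckets).toInt, NumBuckets - 1)

  // counts(bucket)(classLabel): fuzzy cardinality accumulated at training time.
  final class LeafHistogram(numClasses: Int) {
    private val counts = Array.fill(NumBuckets, numClasses)(0.0)

    // Training: every instance reaching the leaf contributes its membership degree
    // to the bucket corresponding to that degree.
    def add(activation: Double, label: Int): Unit =
      counts(bucketOf(activation))(label) += activation

    // Classification: look up the class distribution of the bucket into which the
    // new instance's activation degree falls.
    def classScores(activation: Double): Array[Double] =
      counts(bucketOf(activation))
  }
}

The scores returned by the activated leaves would then be combined across leaves to produce the final prediction; this is one plausible reading of the bucketed lookup described above, not a description of the implemented code.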


                      FBDT                                   FMDT
Dataset   Threshold   Acc (Tr)      Acc (Tst)     #Node      Acc (Tr)      Acc (Tst)     #Node
susy      1%          78.248±0.143  78.235±0.138  396.8      71.966±0.234  71.962±0.259   667.6
susy      5%          76.868±0.066  76.872±0.065   80.8      69.593±1.027  69.599±0.984   198.6
susy      10%         76.386±0.295  76.391±0.300   47.2      53.414±3.831  53.412±3.830    45.0
susy      15%         76.003±0.058  76.010±0.076   34.0      45.757±0.013  45.757±0.051    24.0
susy      20%         75.848±0.015  75.846±0.042   26.0      45.757±0.013  45.757±0.051    24.0
eCO e     1%          79.039±0.210  79.029±0.228  438.4      70.587±0.442  70.604±0.399  1228.2
eCO e     5%          72.982±0.229  72.981±0.250   90.4      48.481±0.000  48.481±0.000    38.4
eCO e     10%         71.658±0.173  71.663±0.187   47.2      48.481±0.000  48.481±0.000    38.4
eCO e     15%         61.754±0.067  61.775±0.093   31.2      48.481±0.000  48.481±0.000    38.4
eCO e     20%         61.837±0.051  61.855±0.040   17.2      48.481±0.000  48.481±0.000    38.4
covtype   1%          75.204±0.220  75.165±0.227  289.6      72.766±0.063  72.716±0.171   537.4
covtype   5%          73.744±0.160  73.682±0.111   72.4      68.611±0.022  68.610±0.084   110.8
covtype   10%         73.091±0.060  72.999±0.115   43.6      63.161±2.737  63.133±2.823    37.0
covtype   15%         72.669±0.189  72.573±0.117   31.2      54.234±5.987  54.260±6.039    24.4
covtype   20%         72.633±0.192  72.541±0.083   29.6      51.240±0.000  51.240±0.000    21.8

Table 5.5: Accuracy and number of nodes obtained with the new classification method.


Chapter 6

Conclusions

In this work we discussed and analyzed in detail the performance of fuzzy decision trees, varying several factors: the discretization method, the minimum percentage of instances per node, the limit on the number of fuzzy sets and the classification method (the last two only in the Big Data experiments), and we also compared the results with the existing crisp decision trees.

The first aim of this thesis was to contribute to future research on how to handle small and big datasets with FDTs. No general rules can be extracted from our experimental data, but we can offer some considerations that help avoid common errors when dealing with each specific issue.

With small datasets, we can safely say that the pseudogaussian and the triangular discretizations give the best performance, while the trapezoidal discretization is still immature, probably because of its irregular shape. The binary split is the best choice because it yields higher accuracy while limiting the complexity of the tree, whereas the multiway split, in this small-data context, adds unnecessary complexity without any real advantage. Regarding the minimum percentage of instances per node, an optimal value can be found between 1 and 5 percent: within this interval the classification rate is maintained while the number of nodes is reduced. With higher percentages the performance degrades too much, while the complexity remains almost the same. Finally, when comparing the fuzzy tree with the crisp tree, we must admit that at lower percentages the crisp tree has low complexity and an accuracy comparable with that of the fuzzy trees, confirming that the crisp tree already implemented in MLlib is a strong contender.

In a Big Data context, again the pseudogaussian and the triangular discretizations outperform the trapezoidal one in all cases. Comparing the binary tree with the multiway tree, we can see that both the complexity and the accuracy of the latter are higher, especially on datasets with categorical attributes, in which the approximation made by the binary tree to avoid too many splits accumulates errors. When limiting the number of fuzzy sets, we have seen that the conclusions are deeply influenced by the distribution of the attributes of the considered dataset. In Susy the distribution is skewed but smooth, so it does not require an elevated number of nodes to reach a good classification rate, while eCO e has an irregular distribution that needs more complexity to avoid degrading the performance too much. We can therefore conclude that limiting the number of fuzzy sets produced by the discretizer is viable when dealing with a big dataset with a smooth distribution, where interpretability can be preserved, whereas for very irregular datasets we cannot give up that complexity without losing useful knowledge. Regarding the comparison with the crisp tree, only at higher percentages do we obtain fuzzy trees that outperform the crisp ones both in classification rate and in complexity; at lower percentages the fuzzy trees gain 3-4% more accuracy, but to do so they use more than twice the number of nodes of the crisp tree. Comparing the decrease in accuracy as the threshold on the number of instances grows, we found that for the binary split the decrease is less pronounced than for the multiway split, because in the first case we aggregate at least two leaves, while in the second case we aggregate all the child leaves of the node whose split is suppressed. Finally, the newly proposed classification method does not bring real benefits. As said previously, at lower percentages of the minimum number of instances per node the performance is the same as with the already implemented classification method; when the threshold increases, the leaves contain more instances, so the "common patterns" of activation degree that the proposed method relies on become blurred, leading to unsatisfactory results.


Chapter 7

Future Works

In my opinion, another approach that should be investigated is the generation of the fuzzy partitioning contextually, while building the tree. This approach is very complex in a Big Data context, because it adds a lot of computation that may require more memory and a longer learning time. The great advantage is that tree building and discretization would no longer be two separate processes belonging to two different "decision domains": the tree, especially at deeper levels, could choose a better partitioning than the one fixed during the preprocessing phase, improving the final classification rate.

To improve the proposed classification method based on histograms of the class distribution, we could generate equi-frequency histograms instead of equi-width ones. With equi-width binning we implicitly assume that the activation degrees are uniformly distributed, which can be a strong assumption depending on the dataset; its advantage is that it is very light on memory. Equi-frequency binning can instead adapt to other distributions, but it is computationally heavier, because the frequencies of the activation degrees have to be recomputed for each new node generated during the training phase of the tree.
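As a rough illustration of the difference, the following sketch (illustrative names only, under the assumption that activation degrees lie in [0, 1]) computes the bucket edges under the two strategies:

object BinningSketch {

  // Equi-width: edges depend only on the number of buckets (no data pass needed).
  def equiWidthEdges(numBuckets: Int): Array[Double] =
    (0 to numBuckets).map(_.toDouble / numBuckets).toArray

  // Equi-frequency: edges are data-dependent quantiles, so they must be
  // recomputed from the activation degrees observed at each node.
  def equiFrequencyEdges(activations: Array[Double], numBuckets: Int): Array[Double] = {
    val sorted = activations.sorted
    val inner  = (1 until numBuckets).map(i => sorted(i * sorted.length / numBuckets))
    (0.0 +: inner :+ 1.0).toArray
  }
}

Note that with very skewed activation distributions the equi-frequency edges may collapse onto repeated values, which a real implementation would have to deduplicate.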


Acknowledgments

I would like to thank Ing. Alessio Bechini and Prof. Francesco Marcelloni for guiding me over the past months, never letting me go without their support, and for helping and accompanying me throughout the realization of this thesis.

A special thanks goes to Prof. Alberto Fernández, who guided me during the thesis with his valuable advice, and to Ing. Carlo Vallati and Ing. Marco Barsacchi, who, with their patience and availability, gave me all the tools to carry out the experiments.

Finally, thanks to my family for giving me serenity and endless support, greatly easing my university career.


