The phases in which classifier’s design can be divided Weka


(1)

The phases in which classifier’s design can be divided are reflected in WEKA’s Explorer structure:

Data pre-processing (filtering) and representation

Supervised Learning/Classification, or

Unsupervised Learning / Clustering, or

Rule-based classification

Visualization

Weka

( http://www.cs.waikato.ac.nz/ml/weka/ )

(2)

Introduction to Weka

Online course on WEKA (http://weka.waikato.ac.nz)

5 lessons of 6 modules each (7-10 minutes + exercises): about 15 hours of total engagement including time for assignments.

Lessons are also accessible on YouTube

(3)

Weka

WEKA lets one choose among a huge number of modules which implement the main steps needed to develop a classification/data mining application (filters, classifiers, clustering algorithms, graphical analysis of results).

Modules can be accessed:

using a command-line interface

through one of the GUIs provided by Weka

within one's own code, as functions of a Java package (a sketch of this last route is given below)
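As an illustration of the last option, here is a minimal, hypothetical sketch of calling WEKA modules (a filter and a classifier) from one's own Java code. It assumes the WEKA jar is on the classpath; the file name data.arff is only an example and is not part of these slides.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.classifiers.trees.J48;

public class WekaFromCode {
    public static void main(String[] args) throws Exception {
        // Load a dataset (the file name is only an example)
        Instances data = new DataSource("data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // last attribute = class

        // A filter module: normalize all numeric attributes to [0,1]
        Normalize norm = new Normalize();
        norm.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, norm);

        // A classifier module: build a J48 decision tree on the filtered data
        J48 tree = new J48();
        tree.buildClassifier(filtered);
        System.out.println(tree);   // prints the learned tree
    }
}
```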

(4)

Weka

WEKA includes three GUIs:

◦ Explorer

Interactive pre-processing, data classification and algorithm configuration

◦ Experimenter

Environment for organizing multiple executions of algorithms (on different data sets or using different random seeds), with final generation of a summary of the results

◦ Knowledge Flow

GUI for 'visual' building of an application as a cascade of modules

as well as a Command-Line Interface from which it is possible to run the different modules as independent applications.

(5)

Introduction to Weka

Explorer

Main functions and options

An example

Classification of 2-dimensional data (easy to visualize!)

Bayes Classifiers

Naïve Bayes

Though it is based on the hypothesis that the features are independent, which is very rarely true, it often performs reasonably well and is very often taken as a «baseline reference» for more sophisticated classifiers

(6)

An example

Classify the dataset saved in the files:

circletrain.arff // training set
circletest.arff // test set

This is a binary problem: one needs to distinguish between patterns belonging to 2 classes.

Problem: distinguish the points belonging to:

1. the circle of radius 1, centered at (0,0)

2. the square with sides of length L = 2.5, centered at (0,0), from which region 1 has been removed
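A minimal sketch of how this experiment could be run from Java code instead of the Explorer, assuming circletrain.arff and circletest.arff have the class label as their last attribute:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;

public class CircleSquareNaiveBayes {
    public static void main(String[] args) throws Exception {
        // Load training and test sets (class assumed to be the last attribute)
        Instances train = new DataSource("circletrain.arff").getDataSet();
        Instances test  = new DataSource("circletest.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Train a Naive Bayes classifier on the training set
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        // Evaluate it on the independent test set
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}
```

The summary and confusion matrix printed here are the same kind of output the Explorer shows in its Classify tab.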

(7)

Recalling what we said in the previous lesson, the classical learning scheme based on training/(validation)/test sets may still provide very inaccurate estimates of a classifier's performance:

Too few data

Data unevenly distributed among classes and between the training and test set, with some classes over- or under-represented

Presence of outliers

N-fold Crossvalidation

(8)

In the circle/square example the training and test sets have been sampled under the best possible conditions:

Samples drawn randomly according to a uniform distribution for both the x and y coordinates => x and y are statistically independent (that is why the naïve Bayes classifier works reasonably well)

No noise added to features (x and y)

No wrong/noisy expected output/label (aka teaching input)

N-fold Crossvalidation

(9)

Still we have 53/47 (circle/square) training patterns … and 60/40 (circle/square) testing patterns, versus an expected ratio of π/(6.25 − π) ≈ 3.14/3.11 ≈ 1.01 (π is the area of the unit circle, 6.25 − π the area of the square once the circle has been removed).

It is no surprise we obtain different results on the training and test sets …

And that different random drawings of the training and test sets also lead to rather different estimates of the classifier’s performance.

N-fold Crossvalidation

(10)

What improvements can we introduce ?

Stratification: the drawing is made such that the distribution of samples of different classes is the same in each set

This solves one problem… but cannot do much when data are few and different drawings can apparently tell different stories. In particular, the smaller the sample size, the stronger the impact a single outlier has on the estimates.

N-fold Crossvalidation

(11)

After all:

We have just a small data set

If we use all the data for training, we will definitely overfit them.

So what can we do if we want to make the most of our data?

N-fold Crossvalidation

(12)

Divide the data set into N subsets

For N times:

leave one subset out and use it as the test set, training a classifier on the other N-1 subsets

Estimate the performance of the classifier over the union of the results obtained on the N disjoint test sets (see the sketch below).

Extreme case: ‘leave-one-out’, in which one has as many folds as samples.

N-fold Crossvalidation
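A minimal sketch of N-fold cross-validation with WEKA's Evaluation class, assuming a dataset loaded as in the previous example (the file name is just a placeholder). crossValidateModel builds N classifiers, each trained on N-1 folds and tested on the remaining one, and accumulates the results over the N disjoint test folds; it also randomizes and, for nominal classes, stratifies the data.

```java
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder file name: any ARFF dataset with the class as last attribute
        Instances data = new DataSource("circletrain.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        int N = 10;                       // number of folds
        Evaluation eval = new Evaluation(data);
        // Trains/tests N classifiers and accumulates results over the N test folds
        eval.crossValidateModel(new NaiveBayes(), data, N, new Random(1));
        System.out.println(eval.toSummaryString("=== " + N + "-fold cross-validation ===", false));
    }
}
```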

(13)

Observations

All data used as test when not involved in training

The procedure can be used only to evaluate the method (estimate a classifier’s error, compare it to others) but not to design a classifier: we have produced N different classifiers, none of which has a reason for being preferred to the others.

N-fold Crossvalidation

(14)

The decision taken by a binary classifier produces one of four possible outcomes:

True Positive (TP): decision = 1, correct decision C = 1

True Negative (TN): decision = 0, correct decision C = 0

False Positive (FP): decision = 1, correct decision C = 0

False Negative (FN): decision = 0, correct decision C = 1

Quality criteria for binary classifiers

(see also Prof. Bononi’s notes on Bayesian Prediction)

(15)

Given a data set with Nd patterns to which one applies a binary classifier, TP + TN + FP + FN = Nd

Accuracy: 100 * (TP+TN)/(TP+TN+FP+FN) = 100 * (TP+TN)/Nd
Estimate of P(correct decision). Correct decision rate.

Recall / Sensitivity / TP rate: 100 * TP/(TP+FN)
Estimate of P(decision=1 | C=1). Correct decision rate when C=1.

Specificity = 1 − FP rate: 100 * TN/(TN+FP)
Estimate of P(decision=0 | C=0). Rate of negative decisions when C=0.

Precision / Positive predictivity: 100 * TP/(TP+FP)
Estimate of P(C=1 | decision=1). Rate of actual positives when the decision is 1.

F-Measure = 2 * Precision * Recall / (Precision + Recall)
Global performance index; with Precision and Recall expressed as fractions, F-Measure ∈ [0,1].

Quality criteria for binary classifiers

(see also Prof. Bononi’s notes on Bayesian Prediction)
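A minimal sketch computing these indices from the four counts; the TP/TN/FP/FN values below are arbitrary illustrative numbers, not taken from the slides.

```java
// Quality indices of a binary classifier from the four counts
public class BinaryQuality {
    public static void main(String[] args) {
        // Illustrative counts (not from the slides)
        double TP = 40, TN = 45, FP = 5, FN = 10;
        double Nd = TP + TN + FP + FN;

        double accuracy    = 100 * (TP + TN) / Nd;       // correct decision rate
        double recall      = 100 * TP / (TP + FN);       // sensitivity / TP rate
        double specificity = 100 * TN / (TN + FP);       // 1 - FP rate
        double precision   = 100 * TP / (TP + FP);       // positive predictivity

        // F-measure uses precision and recall as fractions, so it lies in [0,1]
        double p = precision / 100, r = recall / 100;
        double fMeasure = 2 * p * r / (p + r);

        System.out.printf("Acc=%.1f%% Rec=%.1f%% Spec=%.1f%% Prec=%.1f%% F=%.3f%n",
                accuracy, recall, specificity, precision, fMeasure);
    }
}
```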

(16)

The Confusion Matrix (CM) associated with an N-class classifier is a square NxN matrix whose element Aij represents the number (frequency, if normalized by the number of samples of class i) of patterns belonging to class i classified as belonging to class j.

Properties:

The CM associated with an ideal classifier is diagonal

The sum of row i, Σj Aij, equals the number of patterns belonging to class i (or 1, if frequencies are used)

Using the elements of row i it is possible to compute the quality indices for a binary classifier whose decision is ‘ i / not i ’:

TPi = Aii (Recall, if frequencies are used)

FNi = Σj Aij (1 − δij), i.e. the sum of the off-diagonal elements of row i (Homework: derive TNi and FPi)

Quality criteria for binary classifiers

(see also Prof. Bononi’s notes on Bayesian Prediction)
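A minimal sketch computing TPi and FNi from a confusion matrix, using the 4-class matrix shown on the next slide; TNi and FPi are left out, since deriving them is the homework.

```java
// Per-class TP_i and FN_i from an NxN confusion matrix A,
// where A[i][j] = number of class-i patterns classified as class j.
public class PerClassCounts {
    public static void main(String[] args) {
        // The 4-class example matrix of the next slide (rows: class, cols: decision)
        int[][] A = {
            {190,   6,  2,  2},
            { 16, 356,  8, 20},
            {  1,   1, 48,  0},
            {  4,   4,  1, 91}
        };
        for (int i = 0; i < A.length; i++) {
            int TPi = A[i][i];           // diagonal element
            int FNi = 0;                 // off-diagonal elements of row i
            for (int j = 0; j < A[i].length; j++) {
                if (j != i) FNi += A[i][j];
            }
            // (TN_i and FP_i -- the homework -- follow from the column sums
            //  and the total number of patterns.)
            System.out.printf("class %d: TP=%d FN=%d%n", i + 1, TPi, FNi);
        }
    }
}
```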

(17)

        C1    C2    C3    C4    Ni
C1     190     6     2     2   200
C2      16   356     8    20   400
C3       1     1    48     0    50
C4       4     4     1    91   100

Confusion Matrix (rows: actual class, columns: decision)


(19)

        C1    C2    C3    C4
C1     95%    3%    1%    1%
C2      4%   89%    2%    5%
C3      2%    2%   96%    0%
C4      4%    4%    1%   91%

Confusion Matrix (%) (rows: actual class, columns: decision; each row normalized by Ni)
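A minimal sketch showing how the percentage matrix above is obtained from the count matrix of the previous slide, by normalizing each row by Ni:

```java
// Row-normalizing a confusion matrix of counts into percentages
public class NormalizeConfusionMatrix {
    public static void main(String[] args) {
        int[][] counts = {
            {190,   6,  2,  2},
            { 16, 356,  8, 20},
            {  1,   1, 48,  0},
            {  4,   4,  1, 91}
        };
        for (int[] row : counts) {
            int Ni = 0;
            for (int c : row) Ni += c;              // Ni = row sum = patterns of class i
            StringBuilder line = new StringBuilder();
            for (int c : row) line.append(String.format("%5.1f%% ", 100.0 * c / Ni));
            System.out.println(line);
        }
    }
}
```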

(20)

        1    0
1      TP   FN
0      FP   TN

Normalizing row-wise by Np = TP + FN and Nn = FP + TN, respectively, one obtains the TP (FN, FP, TN) rates.

Confusion Matrix (Binary Classifier) (rows: actual class, columns: decision)

(21)

The general goal for a classifier is to maximize accuracy.

However, that single figure does not tell the whole story.

Only if accuracy is close to 1 can we be sure that the classifier is actually good.

As an example, consider a classifier that must recognize the digit ‘8’ (positive) against all other digits (negatives).

If the sample is uniform over the ten digits, the priors are P(n) = 0.9 and P(p) = 0.1.

So a “classifier” which always decides 0 has an accuracy of 90%, but it is definitely not a good classifier!

ROC (receiver operating characteristic) curve

(22)

In terms of optimization, designing a classifier is a multi-objective optimization problem in which one must maximize two criteria, Sensitivity (Recall) and Specificity, i.e.:

Maximize TP (Minimize FN)
Minimize FP (Maximize TN)

ROC (receiver operating characteristic) curve

(23)

ROC curves are used in signal detection to show the tradeoff between hit rate (TP rate) and false alarm rate (FP rate) over noisy channels. They were defined when radar theory was first developed.

The y axis represents the percentage of true positives in the sample; the x axis represents the percentage of false positives in the sample.

It is an estimate of the actual ROC curve, in which:

the y axis represents P(ŷ=1 | y=1) (Sensitivity / TP rate)

the x axis represents P(ŷ=1 | y=0) (FP rate)

its slope is equal to the likelihood ratio used in the Bayes decision rule (see Prof. Bononi’s slides, page B13)

ROC (receiver operating characteristic) curve

(24)

(R)OC (receiver operating characteristic) curve

What we obtain, formally, is not the ROC, but the so-called OC (operating characteristic) curve.

In fact, there are cases (see Duda-Hart, page …), e.g., with multi-modal distributions, in which using a single threshold is not the right way to go for receiver/classifier design…

Still a curve (which may not even be convex) can always be built according to the same rules.

(25)

Imagine the following processing chain is used:

signal → Preprocessing → V → Thresholding → 0/1

The curve is obtained by plotting the points (FP rate, TP rate) one would obtain by sweeping the threshold from +∞ (ŷ = 0 for all patterns, FP rate = TP rate = 0) to −∞ (ŷ = 1 for all patterns, FP rate = TP rate = 1).

(R)OC (receiver operating characteristic) curve

(26)

(R)OC (receiver operating characteristic) curve

How can we estimate the curve if we cannot “set” the threshold and we have no explicit model of the classifier behaviour ?

We can use the results obtained in the test set.

Supposing the classifier (as most do) outputs, for each pattern, a value V which is the closer to 1 the more likely the pattern is to be a positive one, one can:

1. Sort {Vi , i=1, …, Np} in descending order

2. For i = 1, …, Np

- move up by 1/Npos if yi = 1

- move right by 1/Nneg if yi = 0

where Npos and Nneg are the numbers of positive and negative patterns, so that the curve ends in (1,1); a sketch is given below.
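A minimal sketch of this construction, with hypothetical scores and labels (not taken from the slides); it uses steps of 1/Npos and 1/Nneg so that the curve ends in (1, 1).

```java
import java.util.Arrays;
import java.util.Comparator;

// Building the empirical (R)OC curve from classifier scores
public class RocPoints {
    public static void main(String[] args) {
        // Hypothetical example: score V (higher = more likely positive) and true label y
        double[] V = {0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20};
        int[]    y = {1,    1,    0,    1,    0,    1,    0,    0   };

        // Sort pattern indices by descending score
        Integer[] idx = new Integer[V.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> V[i]).reversed());

        int nPos = 0, nNeg = 0;
        for (int label : y) { if (label == 1) nPos++; else nNeg++; }

        // Sweep the threshold: move up 1/nPos for a positive, right 1/nNeg for a negative
        double tpRate = 0, fpRate = 0;
        System.out.printf("(%.2f, %.2f)%n", fpRate, tpRate);
        for (int i : idx) {
            if (y[i] == 1) tpRate += 1.0 / nPos;
            else           fpRate += 1.0 / nNeg;
            System.out.printf("(%.2f, %.2f)%n", fpRate, tpRate);
        }
    }
}
```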

(27)

(R)OC (receiver operating characteristic) curve

If we use data from a single test set, we will obtain a jagged curve; if we use cross-validation the curve should be smoother.

(28)

(R)OC (receiver operating characteristic) curve

Using (for instance) logistic regression, the likelihood of each observation is the corresponding value of the logistic function. All observations with likelihood > 0.5 (located to the right of the dotted line) are classified as 1.

(29)

(R)OC (receiver operating characteristic) curve

All points to the right of the dotted line are classified as positive, all points to its left are classified as negative.

o’s to the left = TN, o’s to the right = FP

x’s to the right = TP, x’s to the left = FN

(30)

(R)OC (receiver operating characteristic) curve

If we shift the logistic function to the extreme right, the dotted line will shift with it and the classifier will always output 0. If we then start sweeping the curve leftwards, the likeliest observations will start to be classified as positive, so the OC curve will grow vertically whenever a positive (x) crosses the dotted line and horizontally whenever a negative (o) does.

(31)

(R)OC (receiver operating characteristic) curve

Reasoning with distributions:

(32)

(R)OC (receiver operating characteristic) curve

The area under the curve (AUC) thus constructed is a quality index for the classifier: the closer it is to one, the better.

In fact, an ideal classifier separates positives from negatives perfectly.

In that case, the OC curve will actually follow two sides of the unit square, since it will grow first only vertically until it reaches (FP rate, TP rate) = (0,1), then only horizontally to reach (1,1).

Homework: write a simple program (very easy in Matlab) to compute the AUC for a classifier.
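One possible approach to the homework, sketched in Java rather than Matlab: during the same sweep used to build the curve, every rightward step (a negative pattern) adds a rectangle of width 1/Nneg and height equal to the current TP rate. The scores and labels below are hypothetical, and ties between scores are ignored for simplicity.

```java
// AUC of the empirical (R)OC curve, accumulated during the threshold sweep
public class AucSketch {
    // scores: higher = more likely positive; labels: 1 = positive, 0 = negative
    static double auc(double[] scores, int[] labels) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        java.util.Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));

        int nPos = 0, nNeg = 0;
        for (int l : labels) { if (l == 1) nPos++; else nNeg++; }

        double tpRate = 0, area = 0;
        for (int i : idx) {
            if (labels[i] == 1) tpRate += 1.0 / nPos;    // move up
            else                area   += tpRate / nNeg; // move right: add a rectangle
        }
        return area;
    }

    public static void main(String[] args) {
        // Hypothetical scores and labels
        double[] V = {0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20};
        int[]    y = {1,    1,    0,    1,    0,    1,    0,    0   };
        System.out.println("AUC = " + auc(V, y));
    }
}
```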

(33)

Kappa statistic

Given two confusion matrices for a 3-class problem:

actual predictor (left) vs. random predictor (right)

Number of successes: sum of entries in diagonal (D)

● Kappa statistic: (Dobserved − Drandom) / (Dperfect − Drandom)

measures the relative improvement over the random predictor

Most probably, the same value would be obtained if AUC were used instead of D.

Quality criteria for binary classifiers

(used by WEKA)

Random predictor: P(a|x) = P(a), P(b|x) = P(b), P(c|x) = P(c)
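A minimal sketch of the Kappa computation from a confusion matrix, assuming the random predictor defined above (each class a predicted with probability P(a), estimated from the class frequencies) and Dperfect equal to the total number of patterns. Note that Cohen's kappa, as implemented in most tools, uses the predictor's own output marginals instead of the class priors, so the value may differ slightly.

```java
// Kappa statistic from a confusion matrix, with the slide's random predictor
public class KappaStatistic {
    static double kappa(int[][] A) {
        int n = A.length;
        double total = 0, dObserved = 0;
        double[] rowSum = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                total += A[i][j];
                rowSum[i] += A[i][j];
            }
            dObserved += A[i][i];              // D_observed: diagonal sum
        }
        // D_random: expected diagonal sum when class i is predicted
        // with probability P(i) = rowSum[i] / total
        double dRandom = 0;
        for (int i = 0; i < n; i++) dRandom += rowSum[i] * (rowSum[i] / total);
        double dPerfect = total;               // ideal classifier: all on the diagonal
        return (dObserved - dRandom) / (dPerfect - dRandom);
    }

    public static void main(String[] args) {
        // The 4-class confusion matrix used earlier in these slides
        int[][] A = {{190, 6, 2, 2}, {16, 356, 8, 20}, {1, 1, 48, 0}, {4, 4, 1, 91}};
        System.out.println("kappa = " + kappa(A));
    }
}
```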

(34)

Difference: error measures

Actual target values: a1 a2 … an
Predicted target values: p1 p2 … pn

Mean Squared error:
MSE = ((p1 − a1)² + ... + (pn − an)²) / n

Root Mean Squared (RMS) error:
RMSE = sqrt( ((p1 − a1)² + ... + (pn − an)²) / n )

Mostly relevant for regression problems (continuous target values)

Quality criteria for binary classifiers

(used by WEKA)

(35)

Difference: error measures

Actual target values: a1 a2 …an

Predicted target values: p1 p2 … pn

Mean Absolute error:
MAE = (|p1 − a1| + ... + |pn − an|) / n

Less sensitive to outliers.

Quality criteria for binary classifiers

(used by WEKA)

(36)

Relative errors

Relative Absolute error:
RAE = (|p1 − a1| + ... + |pn − an|) / (|ā − a1| + ... + |ā − an|), where ā = average of a1 … an

Relative Squared error:
RSE = ((p1 − a1)² + ... + (pn − an)²) / ((ā − a1)² + ... + (ā − an)²)

Quality criteria for binary classifiers

(used by WEKA)
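A minimal sketch computing all the error measures of the last three slides; the actual/predicted values below are illustrative only.

```java
// MAE, MSE, RMSE, RAE and RSE for hypothetical actual (a) and predicted (p) values
public class ErrorMeasures {
    public static void main(String[] args) {
        double[] a = {3.0, -0.5, 2.0, 7.0};   // actual targets (illustrative)
        double[] p = {2.5,  0.0, 2.0, 8.0};   // predicted targets (illustrative)
        int n = a.length;

        double mean = 0;
        for (double v : a) mean += v;
        mean /= n;                             // a-bar, the average of the actual values

        double sumAbs = 0, sumSq = 0, sumAbsRef = 0, sumSqRef = 0;
        for (int i = 0; i < n; i++) {
            sumAbs    += Math.abs(p[i] - a[i]);
            sumSq     += (p[i] - a[i]) * (p[i] - a[i]);
            sumAbsRef += Math.abs(mean - a[i]);
            sumSqRef  += (mean - a[i]) * (mean - a[i]);
        }

        double mae  = sumAbs / n;
        double mse  = sumSq / n;
        double rmse = Math.sqrt(mse);
        double rae  = sumAbs / sumAbsRef;      // relative absolute error
        double rse  = sumSq / sumSqRef;        // relative squared error

        System.out.printf("MAE=%.3f MSE=%.3f RMSE=%.3f RAE=%.3f RSE=%.3f%n",
                mae, mse, rmse, rae, rse);
    }
}
```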

(37)

First two units of the online course (6+6 modules) on Weka: they include a revision of some concepts we introduced in the previous lesson.

Slides associated with chapter 5 of the book “Data Mining” by Witten, Frank, and Hall, available online on our course site.

Suggested readings/videos
