The phases in which classifier’s design can be divided Weka


(1)

The phases in which classifier’s design can be divided are reflected in WEKA’s Explorer structure:

Data pre-processing (filtering) and representation

Supervised Learning/Classification, or

Unsupervised Learning / Clustering, or

Rule-based classification

Visualization

Weka

( http://www.cs.waikato.ac.nz/ml/weka/ )

(2)

Introduction to Weka

Online course on WEKA (http://weka.waikato.ac.nz)

5 lessons of 6 modules each (7-10 minutes + exercises): about 15 hours of total engagement including time for assignments.

Lessons are also accessible on YouTube

(3)

Weka

WEKA lets one choose among a huge number of modules which implement the main steps needed to develop a classification/data mining application (filters, classifiers, clustering algorithms, graphical analysis of results).

Modules can be accessed:

using a command-line interface

through one of the GUIs provided by Weka

within one's own code, as functions of a Java package (a sketch of this last route is given below)
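As an illustration of the last option, here is a minimal, hypothetical sketch of calling WEKA modules (a filter and a classifier) from one's own Java code. It assumes the WEKA jar is on the classpath; the file name data.arff is only an example and is not part of these slides.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.classifiers.trees.J48;

public class WekaFromCode {
    public static void main(String[] args) throws Exception {
        // Load a dataset (the file name is only an example)
        Instances data = new DataSource("data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // last attribute = class

        // A filter module: normalize all numeric attributes to [0,1]
        Normalize norm = new Normalize();
        norm.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, norm);

        // A classifier module: build a J48 decision tree on the filtered data
        J48 tree = new J48();
        tree.buildClassifier(filtered);
        System.out.println(tree);   // prints the learned tree
    }
}
```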

(4)

Weka

WEKA includes three GUIs:

◦ Explorer

Interactive pre-processing, data classification and algorithm configuration

◦ Experimenter

Environment for organizing multiple executions of algorithms (on different data sets or using different random seeds), with final generation of a summary of the results

◦ Knowledge Flow

GUI for 'visual' building of an application as a cascade of modules

as well as a Command-Line Interface from which it is possible to run the different modules as independent applications.

(5)

Introduction to Weka

Explorer

Main functions and options

An example

Classification of 2-dimensional data (easy to visualize!)

Bayes Classifiers

Naïve Bayes

Though it is based on the hypothesis that the features are independent, which is very rarely true, it often performs reasonably well and is very often taken as a «baseline reference» for more sophisticated classifiers

(6)

An example

Classify the dataset saved in the files:

circletrain.arff // training set
circletest.arff // test set

This is a binary problem: one needs to distinguish between patterns belonging to 2 classes.

Problem: distinguish the points belonging to:

1. the circle of radius 1, centered at (0,0)

2. the square with sides of length L = 2.5, centered at (0,0), from which region 1 has been removed
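A minimal sketch of how this experiment could be run from Java code instead of the Explorer, assuming circletrain.arff and circletest.arff have the class label as their last attribute:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;

public class CircleSquareNaiveBayes {
    public static void main(String[] args) throws Exception {
        // Load training and test sets (class assumed to be the last attribute)
        Instances train = new DataSource("circletrain.arff").getDataSet();
        Instances test  = new DataSource("circletest.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Train a Naive Bayes classifier on the training set
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        // Evaluate it on the independent test set
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}
```

The summary and confusion matrix printed here are the same kind of output the Explorer shows in its Classify tab.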

(7)

Recalling what we said in the previous lesson, the classical learning scheme based on training/(validation)/test sets may still provide very inaccurate estimates of a classifier's performance:

Too few data

Data unevenly distributed among classes and between the training and test set, with some classes over- or under-represented

Presence of outliers

N-fold Crossvalidation

(8)

In the circle/square example the training and test sets have been sampled under the best possible conditions:

Samples drawn randomly according to a uniform distribution for both the x and y coordinates => x and y are statistically independent (that is why the naïve Bayes classifier works reasonably well)

No noise added to features (x and y)

No wrong/noisy expected output/label (aka teaching input)

N-fold Crossvalidation

(9)

Still we have 53/47 (circle/square) training patterns … and 60/40 (circle/square) testing patterns, versus an expected ratio of π/(6.25 − π) ≈ 3.14/3.11 ≈ 1.01 (π is the area of the unit circle, 6.25 − π the area of the square once the circle has been removed).

It is no surprise we obtain different results on the training and test sets …

And that different random drawings of the training and test sets also lead to rather different estimates of the classifier’s performance.

N-fold Crossvalidation

(10)

What improvements can we introduce ?

Stratification: the drawing is made such that the distribution of samples of different classes is the same in each set

This solves one problem… but cannot do much when data are few and different drawings can apparently tell different stories. In particular, the smaller the sample size, the stronger the impact a single outlier has on the estimates.

N-fold Crossvalidation

(11)

After all:

We have just a small data set

If we use all the data for training, we will definitely overfit them.

So what can we do if we want to make the most of our data?

N-fold Crossvalidation

(12)

Divide the data set into N subsets

For N times:

leave one subset out and use it as the test set, training a classifier on the other N-1 subsets

Estimate the performance of the classifier over the union of the results obtained on the N disjoint test sets (see the sketch below).

Extreme case: ‘leave-one-out’, in which one has as many folds as samples.

N-fold Crossvalidation
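A minimal sketch of N-fold cross-validation with WEKA's Evaluation class, assuming a dataset loaded as in the previous example (the file name is just a placeholder). crossValidateModel builds N classifiers, each trained on N-1 folds and tested on the remaining one, and accumulates the results over the N disjoint test folds; it also randomizes and, for nominal classes, stratifies the data.

```java
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder file name: any ARFF dataset with the class as last attribute
        Instances data = new DataSource("circletrain.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        int N = 10;                       // number of folds
        Evaluation eval = new Evaluation(data);
        // Trains/tests N classifiers and accumulates results over the N test folds
        eval.crossValidateModel(new NaiveBayes(), data, N, new Random(1));
        System.out.println(eval.toSummaryString("=== " + N + "-fold cross-validation ===", false));
    }
}
```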

(13)

Observations

All data used as test when not involved in training

The procedure can be used only to evaluate the method (estimate a classifier’s error, compare it to others) but not to design a classifier: we have produced N different classifiers, none of which has a reason for being preferred to the others.

N-fold Crossvalidation

(14)

The decision taken by a binary classifier produces one of four possible outcomes:

True Positive (TP): decision = 1, correct decision C = 1

True Negative (TN): decision = 0, correct decision C = 0

False Positive (FP): decision = 1, correct decision C = 0

False Negative (FN): decision = 0, correct decision C = 1

Quality criteria for binary classifiers

(see also Prof. Bononi’s notes on Bayesian Prediction)

(15)

Given a data set with Nd patterns to which one applies a binary classifier, TP + TN + FP + FN = Nd

Accuracy: 100 * (TP+TN)/(TP+TN+FP+FN) = 100 * (TP+TN)/Nd
Estimate of P(correct decision). Correct decision rate.

Recall / Sensitivity / TP rate: 100 * TP/(TP+FN)
Estimate of P(decision=1 | C=1). Correct decision rate when C=1.

Specificity = 1 − FP rate: 100 * TN/(TN+FP)
Estimate of P(decision=0 | C=0). Rate of negative decisions when C=0.

Precision / Positive predictivity: 100 * TP/(TP+FP)
Estimate of P(C=1 | decision=1). Rate of actual positives when the decision is 1.

F-Measure = 2 * Precision * Recall / (Precision + Recall)
Global performance index; with Precision and Recall expressed as fractions, F-Measure ∈ [0,1].

Quality criteria for binary classifiers

(see also Prof. Bononi’s notes on Bayesian Prediction)
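A minimal sketch computing these indices from the four counts; the TP/TN/FP/FN values below are arbitrary illustrative numbers, not taken from the slides.

```java
// Quality indices of a binary classifier from the four counts
public class BinaryQuality {
    public static void main(String[] args) {
        // Illustrative counts (not from the slides)
        double TP = 40, TN = 45, FP = 5, FN = 10;
        double Nd = TP + TN + FP + FN;

        double accuracy    = 100 * (TP + TN) / Nd;       // correct decision rate
        double recall      = 100 * TP / (TP + FN);       // sensitivity / TP rate
        double specificity = 100 * TN / (TN + FP);       // 1 - FP rate
        double precision   = 100 * TP / (TP + FP);       // positive predictivity

        // F-measure uses precision and recall as fractions, so it lies in [0,1]
        double p = precision / 100, r = recall / 100;
        double fMeasure = 2 * p * r / (p + r);

        System.out.printf("Acc=%.1f%% Rec=%.1f%% Spec=%.1f%% Prec=%.1f%% F=%.3f%n",
                accuracy, recall, specificity, precision, fMeasure);
    }
}
```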

(16)

The Confusion Matrix (CM) associated with an N-class classifier is a square NxN matrix whose element Aij represents the number (frequency, if normalized by the number of samples of class i) of patterns belonging to class i classified as belonging to class j.

Properties:

The CM associated with an ideal classifier is diagonal

The sum of row i, Σj Aij, equals the number of patterns belonging to class i (or 1, if frequencies are used)

Using the elements of row i it is possible to compute the quality indices for a binary classifier whose decision is ‘ i / not i ’:

TPi = Aii (Recall, if frequencies are used)

FNi = Σj Aij (1 − δij), i.e. the sum of the off-diagonal elements of row i (Homework: derive TNi and FPi)

Quality criteria for binary classifiers

(see also Prof. Bononi’s notes on Bayesian Prediction)
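A minimal sketch computing TPi and FNi from a confusion matrix, using the 4-class matrix shown on the next slide; TNi and FPi are left out, since deriving them is the homework.

```java
// Per-class TP_i and FN_i from an NxN confusion matrix A,
// where A[i][j] = number of class-i patterns classified as class j.
public class PerClassCounts {
    public static void main(String[] args) {
        // The 4-class example matrix of the next slide (rows: class, cols: decision)
        int[][] A = {
            {190,   6,  2,  2},
            { 16, 356,  8, 20},
            {  1,   1, 48,  0},
            {  4,   4,  1, 91}
        };
        for (int i = 0; i < A.length; i++) {
            int TPi = A[i][i];           // diagonal element
            int FNi = 0;                 // off-diagonal elements of row i
            for (int j = 0; j < A[i].length; j++) {
                if (j != i) FNi += A[i][j];
            }
            // (TN_i and FP_i -- the homework -- follow from the column sums
            //  and the total number of patterns.)
            System.out.printf("class %d: TP=%d FN=%d%n", i + 1, TPi, FNi);
        }
    }
}
```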

(17)

        C1    C2    C3    C4    Ni
C1     190     6     2     2   200
C2      16   356     8    20   400
C3       1     1    48     0    50
C4       4     4     1    91   100

Confusion Matrix (rows: actual class, columns: decision)


(19)

        C1    C2    C3    C4
C1     95%    3%    1%    1%
C2      4%   89%    2%    5%
C3      2%    2%   96%    0%
C4      4%    4%    1%   91%

Confusion Matrix (%) (rows: actual class, columns: decision; each row normalized by Ni)
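A minimal sketch showing how the percentage matrix above is obtained from the count matrix of the previous slide, by normalizing each row by Ni:

```java
// Row-normalizing a confusion matrix of counts into percentages
public class NormalizeConfusionMatrix {
    public static void main(String[] args) {
        int[][] counts = {
            {190,   6,  2,  2},
            { 16, 356,  8, 20},
            {  1,   1, 48,  0},
            {  4,   4,  1, 91}
        };
        for (int[] row : counts) {
            int Ni = 0;
            for (int c : row) Ni += c;              // Ni = row sum = patterns of class i
            StringBuilder line = new StringBuilder();
            for (int c : row) line.append(String.format("%5.1f%% ", 100.0 * c / Ni));
            System.out.println(line);
        }
    }
}
```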

(20)

        1    0
1      TP   FN
0      FP   TN

Normalizing row-wise by Np = TP + FN and Nn = FP + TN, respectively, one obtains the TP (FN, FP, TN) rates.

Confusion Matrix (Binary Classifier) (rows: actual class, columns: decision)

(21)

The general goal for a classifier is to maximize accuracy.

However, that single figure does not tell the whole story.

Only if accuracy is close to 1 can we be sure that the classifier is actually good.

As an example, consider a classifier that must recognize the digit ‘8’ (positive) against all other digits (negatives).

If the sample is uniform over the ten digits, the priors are P(n) = 0.9 and P(p) = 0.1.

So a “classifier” which always decides 0 has an accuracy of 90%, but it is definitely not a good classifier!

ROC (receiver operating characteristic) curve

(22)

In terms of optimization, designing a classifier is a multi-objective optimization problem in which one must maximize two criteria, Sensitivity (Recall) and Specificity, i.e.:

Maximize TP (Minimize FN)
Minimize FP (Maximize TN)

ROC (receiver operating characteristic) curve

(23)

ROC curves are used in signal detection to show the tradeoff between hit rate (TP rate) and false alarm rate (FP rate) over noisy channels. They were defined when radar theory was first developed.

The y axis represents the percentage of true positives in the sample; the x axis represents the percentage of false positives in the sample.

It is an estimate of the actual ROC curve, in which:

the y axis represents P(ŷ=1 | y=1) (Sensitivity / TP rate)

the x axis represents P(ŷ=1 | y=0) (FP rate)

its slope is equal to the likelihood ratio used in the Bayes decision rule (see Prof. Bononi’s slides, page B13)

ROC (receiver operating characteristic) curve

(24)

(R)OC (receiver operating characteristic) curve

What we obtain, formally, is not the ROC, but the so-called OC (operating characteristic) curve.

In fact, there are cases (see Duda-Hart, page …), e.g., with multi-modal distributions, in which using a single threshold is not the right way to go for receiver/classifier design…

Still a curve (which may not even be convex) can always be built according to the same rules.

(25)

Imagine the following processing chain is used:

signal → Preprocessing → V → Thresholding → 0/1

The curve is obtained by plotting the points (FP rate, TP rate) one would obtain by sweeping the threshold from +∞ (ŷ = 0 for all patterns, FP rate = TP rate = 0) to −∞ (ŷ = 1 for all patterns, FP rate = TP rate = 1).

(R)OC (receiver operating characteristic) curve

(26)

(R)OC (receiver operating characteristic) curve

How can we estimate the curve if we cannot “set” the threshold and we have no explicit model of the classifier behaviour ?

We can use the results obtained in the test set.

Supposing the classifier (as most do) outputs, for each pattern, a value V which is the closer to 1 the more likely the pattern is to be a positive one, one can:

1. Sort {Vi , i=1, …, Np} in descending order

2. For i = 1, …, Np

- move up by 1/Npos if yi = 1

- move right by 1/Nneg if yi = 0

where Npos and Nneg are the numbers of positive and negative patterns, so that the curve ends in (1,1); a sketch is given below.
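A minimal sketch of this construction, with hypothetical scores and labels (not taken from the slides); it uses steps of 1/Npos and 1/Nneg so that the curve ends in (1, 1).

```java
import java.util.Arrays;
import java.util.Comparator;

// Building the empirical (R)OC curve from classifier scores
public class RocPoints {
    public static void main(String[] args) {
        // Hypothetical example: score V (higher = more likely positive) and true label y
        double[] V = {0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20};
        int[]    y = {1,    1,    0,    1,    0,    1,    0,    0   };

        // Sort pattern indices by descending score
        Integer[] idx = new Integer[V.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> V[i]).reversed());

        int nPos = 0, nNeg = 0;
        for (int label : y) { if (label == 1) nPos++; else nNeg++; }

        // Sweep the threshold: move up 1/nPos for a positive, right 1/nNeg for a negative
        double tpRate = 0, fpRate = 0;
        System.out.printf("(%.2f, %.2f)%n", fpRate, tpRate);
        for (int i : idx) {
            if (y[i] == 1) tpRate += 1.0 / nPos;
            else           fpRate += 1.0 / nNeg;
            System.out.printf("(%.2f, %.2f)%n", fpRate, tpRate);
        }
    }
}
```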

(27)

(R)OC (receiver operating characteristic) curve

If we use data from a single test set, we will obtain a jagged curve; if we use cross-validation the curve should be smoother.

(28)

(R)OC (receiver operating characteristic) curve

Using (for instance) logistic regression, the likelihood of each observation is the corresponding value of the logistic function. All observations with likelihood > 0.5 (located to the right of the dotted line) are classified as 1.

(29)

(R)OC (receiver operating characteristic) curve

All points to the right of the dotted line are classified as positive, all points to its left are classified as negative.

o’s to the left = TN, o’s to the right = FP

x’s to the right = TP, x’s to the left = FN

(30)

(R)OC (receiver operating characteristic) curve

If we shift the logistic function to the extreme right, the dotted line will shift with it and the classifier will always output 0. If we then start sweeping the curve leftwards, the likeliest observations will start to be classified as positive, so the OC curve will grow vertically whenever a positive (x) crosses the dotted line and horizontally whenever a negative (o) does.

(31)

(R)OC (receiver operating characteristic) curve

Reasoning with distributions:

(32)

(R)OC (receiver operating characteristic) curve

The area under the curve (AUC) thus constructed is a quality index for the classifier: the closer it is to one, the better.

In fact, an ideal classifier separates positives from negatives perfectly.

In that case, the OC curve will actually follow two sides of the unit square, since it will grow first only vertically until it reaches (FP rate, TP rate) = (0,1), then only horizontally to reach (1,1).

Homework: write a simple program (very easy in Matlab) to compute the AUC for a classifier.
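One possible approach to the homework, sketched in Java rather than Matlab: during the same sweep used to build the curve, every rightward step (a negative pattern) adds a rectangle of width 1/Nneg and height equal to the current TP rate. The scores and labels below are hypothetical, and ties between scores are ignored for simplicity.

```java
// AUC of the empirical (R)OC curve, accumulated during the threshold sweep
public class AucSketch {
    // scores: higher = more likely positive; labels: 1 = positive, 0 = negative
    static double auc(double[] scores, int[] labels) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        java.util.Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));

        int nPos = 0, nNeg = 0;
        for (int l : labels) { if (l == 1) nPos++; else nNeg++; }

        double tpRate = 0, area = 0;
        for (int i : idx) {
            if (labels[i] == 1) tpRate += 1.0 / nPos;    // move up
            else                area   += tpRate / nNeg; // move right: add a rectangle
        }
        return area;
    }

    public static void main(String[] args) {
        // Hypothetical scores and labels
        double[] V = {0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20};
        int[]    y = {1,    1,    0,    1,    0,    1,    0,    0   };
        System.out.println("AUC = " + auc(V, y));
    }
}
```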

(33)

Kappa statistic

Given two confusion matrices for a 3-class problem:

actual predictor (left) vs. random predictor (right)

Number of successes: sum of entries in diagonal (D)

● Kappa statistic: (Dobserved − Drandom) / (Dperfect − Drandom)

measures the relative improvement over the random predictor

Most probably, the same value would be obtained if AUC were used instead of D.

Quality criteria for binary classifiers

(used by WEKA)

Random predictor: P(a|x) = P(a), P(b|x) = P(b), P(c|x) = P(c)
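A minimal sketch of the Kappa computation from a confusion matrix, assuming the random predictor defined above (each class a predicted with probability P(a), estimated from the class frequencies) and Dperfect equal to the total number of patterns. Note that Cohen's kappa, as implemented in most tools, uses the predictor's own output marginals instead of the class priors, so the value may differ slightly.

```java
// Kappa statistic from a confusion matrix, with the slide's random predictor
public class KappaStatistic {
    static double kappa(int[][] A) {
        int n = A.length;
        double total = 0, dObserved = 0;
        double[] rowSum = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                total += A[i][j];
                rowSum[i] += A[i][j];
            }
            dObserved += A[i][i];              // D_observed: diagonal sum
        }
        // D_random: expected diagonal sum when class i is predicted
        // with probability P(i) = rowSum[i] / total
        double dRandom = 0;
        for (int i = 0; i < n; i++) dRandom += rowSum[i] * (rowSum[i] / total);
        double dPerfect = total;               // ideal classifier: all on the diagonal
        return (dObserved - dRandom) / (dPerfect - dRandom);
    }

    public static void main(String[] args) {
        // The 4-class confusion matrix used earlier in these slides
        int[][] A = {{190, 6, 2, 2}, {16, 356, 8, 20}, {1, 1, 48, 0}, {4, 4, 1, 91}};
        System.out.println("kappa = " + kappa(A));
    }
}
```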

(34)

Difference: error measures

Actual target values: a1 a2 … an
Predicted target values: p1 p2 … pn

Mean Squared error:
MSE = ((p1 − a1)² + ... + (pn − an)²) / n

Root Mean Squared (RMS) error:
RMSE = sqrt( ((p1 − a1)² + ... + (pn − an)²) / n )

Mostly relevant for regression problems (continuous target values)

Quality criteria for binary classifiers

(used by WEKA)

(35)

Difference: error measures

Actual target values: a1 a2 …an

Predicted target values: p1 p2 … pn

Mean Absolute error:
MAE = (|p1 − a1| + ... + |pn − an|) / n

Less sensitive to outliers.

Quality criteria for binary classifiers

(used by WEKA)

(36)

Relative errors

Relative Absolute error:
RAE = (|p1 − a1| + ... + |pn − an|) / (|ā − a1| + ... + |ā − an|), where ā = average of a1 … an

Relative Squared error:
RSE = ((p1 − a1)² + ... + (pn − an)²) / ((ā − a1)² + ... + (ā − an)²)

Quality criteria for binary classifiers

(used by WEKA)
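A minimal sketch computing all the error measures of the last three slides; the actual/predicted values below are illustrative only.

```java
// MAE, MSE, RMSE, RAE and RSE for hypothetical actual (a) and predicted (p) values
public class ErrorMeasures {
    public static void main(String[] args) {
        double[] a = {3.0, -0.5, 2.0, 7.0};   // actual targets (illustrative)
        double[] p = {2.5,  0.0, 2.0, 8.0};   // predicted targets (illustrative)
        int n = a.length;

        double mean = 0;
        for (double v : a) mean += v;
        mean /= n;                             // a-bar, the average of the actual values

        double sumAbs = 0, sumSq = 0, sumAbsRef = 0, sumSqRef = 0;
        for (int i = 0; i < n; i++) {
            sumAbs    += Math.abs(p[i] - a[i]);
            sumSq     += (p[i] - a[i]) * (p[i] - a[i]);
            sumAbsRef += Math.abs(mean - a[i]);
            sumSqRef  += (mean - a[i]) * (mean - a[i]);
        }

        double mae  = sumAbs / n;
        double mse  = sumSq / n;
        double rmse = Math.sqrt(mse);
        double rae  = sumAbs / sumAbsRef;      // relative absolute error
        double rse  = sumSq / sumSqRef;        // relative squared error

        System.out.printf("MAE=%.3f MSE=%.3f RMSE=%.3f RAE=%.3f RSE=%.3f%n",
                mae, mse, rmse, rae, rse);
    }
}
```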

(37)

First two units of the online course (6+6 modules) on Weka: they include a revision of some concepts we introduced in the previous lesson.

Slides associated with chapter 5 of the book “Data Mining” by Witten, Frank, and Hall, available online on our course site.

Suggested readings/videos
