Learning / 2

Machine Learning (ML) and Data Mining (DM) – Concepts and Techniques for automatic knowledge acquisition

Laurea in INFORMATICA MAGISTRALE

Course: INTELLIGENZA ARTIFICIALE

Stefano Ferilli

These slides have been prepared for teaching purposes. They contain original material owned by the Università degli Studi di Bari and/or figures owned by other authors, companies and organizations, whose references are reported. All or part of the material may be photocopied for personal or teaching use, but may not be distributed for commercial use. Any other use requires specific authorization from the Università degli Studi di Bari and the other authors involved.

Objectives

Framing the subject

Definitions and fundamental concepts

Overview of approaches, techniques and systems

YES features, peculiarities, differences

YES general strategies

NO technical details

NO exhaustive listing

Ability to choose depending on needs

Ability to run significant experiments

Types (1)

Sub-symbolic

Mathematics/Statistics

– Brain structure

Physical connections

– Latent semantics

– Perception

Symbolic

Logics

– Mind structure

Concepts

– Explicit semantics

– Reasoning

Cooperation!

Types (1)

Examples

Sub-symbolic

– Neural Networks

– Hidden Markov Models

– Clustering

– Frequent Itemsets

Symbolic

– Classification Rules

– Decision Trees

– Association Rules

Knowledge Representation

Several formalisms

2 large categories

0-order (propositional logics)

– Feature vectors

– Attribute-value pairs

1-order (predicate logics)

– Complex schemes

– Objects, Attributes, Relationships

0-order Learning

Geometric representation

d-dimensional Euclidean space E^d (pattern space)

Instance = point of coordinates (x_1, x_2, …, x_d)

– or vector x from origin to the point

Set of points separated by “decision surfaces”

– Split E^d into R “decision regions”

2 dimensions → curves in E^2

[Figure: pattern space with d = 2 and R = 3 decision regions; axes x_1, x_2; example point (5, -3)]

0-order Learning

A classifier’s decision surfaces may be implicitly defined by a set of R functions of pattern X

– g_1(X), g_2(X), …, g_R(X) (discriminative functions) selected such that

– X ∈ R_i : g_i(X) > g_j(X) for i, j = 1, …, R, i ≠ j

in R_i the i-th function has its maximum value

For R = 2 (belonging or not belonging to a class c) the larger between g_1(X) and g_2(X) is selected

– The discriminative functions are continuous across the decision surfaces

The decision surface separating two adjacent regions R_i and R_j is given by g_i(X) – g_j(X) = 0

Many ways for selecting discriminative functions

– Complete a-priori knowledge or qualitative knowledge about the patterns
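A minimal sketch (not from these slides) of how such a classifier operates, assuming hypothetical linear discriminative functions g_i(X) = w_i·X + b_i with arbitrarily chosen weights: the pattern is assigned to the region whose function is largest.

    import numpy as np

    # Hypothetical linear discriminative functions g_i(X) = w_i . X + b_i
    # (weights chosen arbitrarily for illustration; d = 2, R = 3)
    W = np.array([[ 1.0,  0.5],
                  [-1.0,  1.0],
                  [ 0.0, -1.0]])
    b = np.array([0.2, 1.0, 0.5])

    def classify(x):
        # X belongs to region R_i iff g_i(X) > g_j(X) for all j != i
        g = W @ x + b
        return int(np.argmax(g))

    print(classify(np.array([5.0, -3.0])))   # region of pattern (5, -3)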

0-order Learning

Assumption: Extremely poor a-priori or

background knowledge about the patterns to be classified

Adaptation process takes place during training phase

Large (or significant) number of typical patterns required as the “training set”

– i.e., the classification of such instances is known

1-order Learning

First-Order Logics

Variable number of objects

Relationships among objects

Inductive Logic Programming (ILP)

Example

Predict train direction (East bound / West bound)

1-order Learning

Arch problem [Patrick Winston]

– Learn a definition for “arch” in terms of supports and

“touch” relationships

arch(A, B, C) :-
    brick(B),
    left_of(B, C),
    brick(C),
    supports(B, A),
    supports(C, A),
    not(touches(B, C)).

[Figure: block-world scenes of arches built from numbered bricks, positive and negative examples]

Mutagenesis (!)

Numeric and Symbolic Approaches

Different description language and, as a consequence, computational model

– Inputs (examples or observations)

Numeric: Feature Vectors

Symbolic: (attribute-value) pairs, Logics

– Experience or knowledge (in general)

Numeric: Heuristics (functions)

Symbolic: Theory/knowledge expressed in logic languages

– Concepts or classes

Numeric: Extensional form, functions

Symbolic: Intensional form (heuristic)

Thus

– Numeric approach centered on 0-order representation with finite sets

– Symbolic approach centered on higher-order representations and potentially infinite classes of instances (logic “variables”)

Symbolic Approaches

Concept

Extensional approach: represents a category

– A set of objects (instances) sharing properties

Intensional approach:

– Classical view [Aristotle]: the name that corresponds to a set of necessary and sufficient conditions for belonging to a category

– Weakened view [Wittgenstein, 1954]: considers only sufficient conditions

– Heuristic view “by prototypes” [Rosch, 1973]: the

description of a (real or virtual) example characterized by

the “typical” properties appearing in the set of instances


“Conceptual” Inductive Learning

Goal: satisfy the

Understandability Postulate

– “Results of computer-made inductions should be symbolic descriptions of entities, that are structurally and semantically similar to those that a human would produce.

– The components of such descriptions should be understandable as single information units.”

[Michalski, 1975]

“Conceptual” Inductive Learning

Exploits

A logic(-like) description language for inputs and outputs

Different systems to represent experience and background knowledge, used to derive classification rules that define and characterize class patterns

Allows conceptual descriptions of patterns, similar to those that human experts would produce when observing the same data

“Conceptual” Inductive Learning

Definitions

Concept

– Abstract description of a class

Formation

– Grouping of instances into (any) significant set of classes

Attainment

– Search for defining attributes that distinguish instances of different classes

Formation is the first step toward Attainment

[J.S. Bruner, J.J. Goodnow, G.A. Austin “A Study of Thinking” - John Wiley & Sons, 1956, p. 22]

Data Mining

Branch of Artificial Intelligence derived from Machine Learning

Complementary to Machine Learning

Aimed at knowledge discovery

Specifically

– From big data

– For business purposes

Ensemble Learning

Learning of many models from one dataset

Use of portions of the data or of the features

Final classification = Combination of predictions of the learned models

Examples

Random Forests

Ensemble Learning

Bagging (Bootstrap Aggregation)

Goal: Reduce variance in predictions

Strategy: generate additional training data from the original dataset

– Sampling with replacement to produce multisets with the same cardinality as the original data

Selection of subset: Random

Combination: (weighted) average

Increasing the size of the training set

– Cannot improve the predictive ability of the model!

– Can only reduce variance

Make the prediction better fit the expected solution

Useful in numeric-statistical approaches
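A minimal sketch of bagging under simple assumptions (not the slides' notation): the base model is a least-squares line, each model is fit on a bootstrap multiset of the same cardinality as the data, and predictions are combined by a plain average.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data for illustration: a noisy line
    X = np.linspace(0, 10, 50)
    y = 2 * X + 1 + rng.normal(0, 2, size=X.shape)

    def bagged_fit(X, y, n_models=20):
        n = len(X)
        models = []
        for _ in range(n_models):
            # Bootstrap: sample with replacement, same cardinality as the data
            idx = rng.integers(0, n, size=n)
            models.append(np.polyfit(X[idx], y[idx], deg=1))
        return models

    def bagged_predict(models, x):
        # Combination: plain average of the individual predictions
        return np.mean([np.polyval(m, x) for m in models])

    models = bagged_fit(X, y)
    print(bagged_predict(models, 5.0))   # near 2*5 + 1; variance is reduced, not bias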


Ensemble Learning

Boosting

Goal: Improve predictive power

Selection of subset: Preference for instances that are misclassified by previous models

– Use of subsets of the original data to obtain a series of poorly performing models

– Use of a cost function (majority vote)

– Not random: depends on the performance of the previous model

– Every new subset includes the elements that were (probably) misclassified by the previous models

Combination: (weighted) majority vote

– “boosts” performance of models by combining them with each other

Ensemble Learning

Stacking

– Similar to boosting

Goal: Bagging + Boosting

– Possibility of applying various models to the original data

Selection of subset: Various

Combination: meta-level, using another model/approach for weight estimation

– To determine which models have good performance and which ones have bad performance given the input data

– No easy empirical formula for the weight function

– Difficult in practice to define the meta-level

e.g., Logistic Regression

Deep Learning

Multi-level learning

Many concepts

Many representation levels

Hierarchically organized concepts

– Higher-level concepts defined in terms of lower-level ones

Generic definition, but taken over by neural approaches

Types (2)

Definitions

Observation

– Description of an object or situation in a given formalism

Example

– Class label associated with an element that is present in an observation

Types (2)

Unsupervised

Based on observations

– Autonomous

(Data Mining)

Supervised

Based on examples

– Trainer/Oracle

(Machine Learning)

Semi-supervised

Mix of labeled and unlabeled instances

Types (2)

Reinforcement

Feedback in response to the choices of the algorithm

– Maximization of cumulative rewards obtained by the agent while exploring the problem

Input-output pairs of known examples never shown

Sub-optimal choices never corrected

Active (semi-supervised)

Interactively asks an oracle about the correct label of unlabeled observations

– Particularly informative observations

– Useful when labeling the whole dataset would be too expensive


Types (2)

Example

Types (2)

Definitions

Relevance

Components of the observations of interest for a given goal

– Direct

Explicitly indicated by the trainer for a specific observation

– Indirect

Must be selected by the learner from the set of components associated to all available observations

Types (2)

Examples

Unsupervised

– Clustering

– Frequent Itemsets

– Association Rules

Semi-supervised

– Regression

Supervised

– Neural Networks

– Classification Rules

– Decision Trees

– k-Nearest Neighbors

Types (3)

Predictive

Identify a relationship between a set of predictive variables and a target variable

Supervised

Descriptive

Deepen knowledge of what happens in the data

Unsupervised

The best model may not be the most predictive one

Types (3)

Examples

Predictive

– Classification

– Regression

– Anomaly / Outlier detection

Descriptive

– Clustering

– Association Rules

– Sequential patterns

Types (4)

Lazy

Instance-based

– “Fictitious” models

“Industrious” (eager)

Learn (explicit) models

– “Real” models


Types (4)

Examples

Traditional

– Neural Networks

– Classification Rules

– Decision Trees

Lazy

– k-Nearest Neighbors

Types (5)

Batch

All observations/examples for learning available at the beginning of the learning process

Incremental

Observations/examples may become available subsequently

Types (5)

Batch

Simpler

Traditional

Training

Smart models

Incremental

More complex

Innovative

Tuning

Efficiency

Cooperation!

Types (5)

Definitions

Learning set

Set of (labeled) examples used for learning

– Training set: examples for batch learning

– Tuning set: examples for incremental learning

Test set

Set of (labeled) examples used for evaluating the performance of a model

Classification Observations

(Unlabeled) observations to which the model is to be applied

Types

Closed-loop system

A learning system that can take the learned knowledge as an input to another learning process

Open-loop system

Otherwise

Definitions

Order of examples

Selection

– Controlled by the learner

Batch

Reception

– Controlled by the context

Incremental

Learning strategies

Scanning

Focussing

– [J.S. Bruner, J.J. Goodnow, G.A. Austin “A Study of Thinking” - John Wiley & Sons, 1956, p. 22]


Regression

Guessing a function’s values

Regression equation

– Models the dependent variable as a function of the independent variables

Multiple regression: many independent variables

Parameters estimated so as to best describe the data

– Many error terms

Represent noise

Variation in the dependent variable that cannot be controlled or predicted

Modeled by a random variable

Learning

– Method of Least Squares

– ...
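A minimal sketch of the Method of Least Squares for one independent variable, on made-up data: the parameters of y = a·x + b are estimated so as to minimize the sum of squared error terms.

    import numpy as np

    # Toy data (illustrative): y depends roughly linearly on x
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Closed-form least squares for y = a*x + b
    A = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

    print(a, b)          # estimated parameters
    print(a * 4.5 + b)   # interpolation: inside the range of the training values
    print(a * 6.0 + b)   # extrapolation: outside that range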

Regression

Use

Interpolation

– Guessing values within the range of the training ones

Extrapolation

– Guessing values outside the range of the training ones

Applications

Predict the sales revenues of a new product based on investments in advertisement

Predict the trend of the stock market

Genetic Algorithms

Search for a solution inspired by biological evolution

Solution = chromosomes (feature vectors)

– Genes = bits or symbols that make up the vector

Initial population of solutions

Loop:

– Evaluation of solutions

Fitness measure

– Probabilistic selection of the fittest ones

Seeds for producing the next generation

– New generations

Selection of members for “reproduction”

Random mutation and Crossover operations

No guarantee to find the best solution

Genetic Algorithms

Mutation

Choice of genes random or aimed at improving fitness

Crossover

One-point

– Side portion of genes

Two-point

– Intermediate portion of genes

Uniform

– Random different genes

Arithmetic

– Logistic operations
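A minimal sketch of the loop described above, under toy assumptions: bit-string chromosomes, fitness = number of 1-bits, one-point crossover and random mutation. As noted, finding the best solution is not guaranteed.

    import random
    random.seed(0)

    GENES = 16

    def fitness(chrom):
        # Toy fitness measure (illustrative): count of 1-bits
        return sum(chrom)

    def crossover(p1, p2):
        # One-point crossover: offspring takes a side portion of genes from each parent
        cut = random.randrange(1, GENES)
        return p1[:cut] + p2[cut:]

    def mutate(chrom, rate=0.05):
        # Random mutation: flip each gene with small probability
        return [g ^ 1 if random.random() < rate else g for g in chrom]

    population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(30)]
    for generation in range(50):
        # Selection of the fittest as seeds for "reproduction"
        population.sort(key=fitness, reverse=True)
        seeds = population[:10]
        population = [mutate(crossover(random.choice(seeds), random.choice(seeds)))
                      for _ in range(30)]

    print(max(fitness(c) for c in population))   # best fitness found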

Clustering

Grouping of given observations into internally homogeneous and mutually heterogeneous subsets

Similarity / Distance

Aimed at organization

Concept formation

Number of subsets?

Use of quality measures

– Density, Cohesion, ...

Applications

Split customers in subsets to use as the targets of specific marketing activities

Clustering

“Direct”

– Directly distributes observations in a given number of groups

Hierarchical

– Generates a progressive partition tree from observations (dendrogram)

– Determines the “border” in the tree that defines the groups

Partitive

– Successive splittings starting from the whole set of observations

Agglomerative

– Successive groupings starting from the single observations


Clustering

Dendrogram

Clustering

Fuzzy

An observation may be assigned to many groups

– Different membership degrees

Conceptual

Learns an explicit model for each identified group

Concept formation + attainment

Clustering

Representatives of a cluster

Centroid

– Central position (center of mass)

May not be computable for non-metric spaces

Attribute-value descriptions

May not correspond to any observation

Medoid

– “Most central” element

Useful when the centroid is not a valid element

Predicate logics

Always corresponds to an observation

Clustering

Training

Agglomerative: Cluster merging

– Similarity above (Distance below) a given threshold

Most distant elements (complete link)

Closest elements (shortest link)

Centroids/Medoids

Average of similarities/distances

K-means

– Initialization: k observations chosen at random (seeds)

– Aggregation of each remaining observation to the most similar seed

– Loop

Recompute seeds as centroids/medoids of the groups obtained

Redistribute remaining observations

[Figure: k-means example with clusters A, B, C]

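A minimal sketch of the k-means loop just described, on synthetic 2-D points (the data and parameters are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)

    def k_means(points, k, iterations=10):
        # Initialization: k observations chosen at random as seeds
        seeds = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iterations):
            # Aggregate each observation to the most similar (closest) seed
            dists = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
            groups = np.argmin(dists, axis=1)
            # Recompute seeds as centroids of the groups obtained
            seeds = np.array([points[groups == i].mean(axis=0) for i in range(k)])
        return seeds, groups

    # Three well-separated blobs of 30 points each
    points = rng.normal(size=(90, 2)) + np.repeat([[0, 0], [5, 5], [0, 5]], 30, axis=0)
    centroids, groups = k_means(points, k=3)
    print(centroids)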

Clustering

Example

Process Mining

Extraction of process models from examples of process executions

Declarative

– Logic representation

– Constraints, rather than a rigid model

Uses

– Analysis

Discovery of criticalities

– Standardization

– Supervision / Compliance check

– (Prediction)

– Simulation

of industrial or business processes

(of human activities)

Process Mining

Formalisms

(Finite-State Automata, HMM)

Petri Nets

– Workflow Nets [van der Aalst]

Specific formalism

– WoMan

Systems

ProM

– https://svn.win.tue.nl/trac/prom/wiki/ProM66

WoMan

– Incremental

Frequent Itemsets

Discovery of subparts of descriptions that are frequent in a set of observations

Frequency threshold (% of observations)

– Support

Aimed at discovering regularities

Frequent Itemsets

Applications

Products that are often bought together

– Propose them at the same time

Behaviors often happening simultaneously

– Devise targeted products

E.g.: {travels, uses_credit_card, uses_foreign_currency}

Features often happening simultaneously

– Devise targeted products

E.g.: {rich, entrepreneur, from_northern_italy}

Anomaly detection

– By contraposition

Frequent Itemsets

Learning (APRIORI)

If an itemset is not frequent, no superset of it can be either

– Start: Frequent elements

– Expansion: add an item to each frequent itemset obtained in the previous step

Remove expanded ones

Save those whose expansion is not frequent

– Termination: no itemset expanded in the last step


Frequent Itemsets

Example

Supermarket receipts

– pasta, diapers, beer, salt, wine

– cookies, diapers, talc, sauce, milk

– diapers, salt, talc, beer, babyfood, fries

– sugar, diapers, talc, beer, salt, fries

– wine, diapers, fries, babyfood, pasta

“many” receipts include

– diapers, talc (3/5)

– diapers, beer (3/5)

– diapers, fries (3/5)
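A minimal sketch of the APRIORI expansion loop on the receipts above (for simplicity it keeps all frequent itemsets, rather than saving only those whose expansion is not frequent):

    receipts = [
        {"pasta", "diapers", "beer", "salt", "wine"},
        {"cookies", "diapers", "talc", "sauce", "milk"},
        {"diapers", "salt", "talc", "beer", "babyfood", "fries"},
        {"sugar", "diapers", "talc", "beer", "salt", "fries"},
        {"wine", "diapers", "fries", "babyfood", "pasta"},
    ]

    def support(itemset):
        # Support = fraction of receipts containing the itemset
        return sum(itemset <= r for r in receipts) / len(receipts)

    def apriori(min_support=3/5):
        items = {i for r in receipts for i in r}
        # Start: frequent single elements
        frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
        result = list(frequent)
        while frequent:
            # Expansion: add one item to each frequent itemset of the previous step;
            # no superset of an infrequent itemset can be frequent
            candidates = {f | {i} for f in frequent for i in items if i not in f}
            frequent = [c for c in candidates if support(c) >= min_support]
            result.extend(frequent)
        return result

    for s in apriori():
        print(sorted(s), support(s))   # includes {diapers, talc}, {diapers, beer}, ...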

Association Rules

Discovery of relevant co-occurrences of situations in a set of observations

Relevance threshold (% of co-occurrences)

– Confidence

Aimed at discovering regularities

A frequent itemset that is a subset of another and often appears in the same observations suggests that the items by which the larger itemset differs from the smaller one are correlated with the latter

Applications

Identify other products typically chosen by users that choose some products

– Advertise or recommend them together

Association Rules

Example

Supermarket receipts

– pasta, diapers, beer, salt, wine

– cookies, diapers, talc, sauce, milk

– diapers, salt, talc, beer, babyfood, fries

– sugar, diapers, talc, beer, salt, fries

– wine, diapers, fries, babyfood, pasta

“many” (3/5) receipts include

– diapers, talc

“many” (2/3) such receipts also include

– beer, fries

Association Rule:

– diapers, talc ⇒ beer, fries
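A minimal check of this rule's confidence, reusing the receipts of the sketch above:

    # Confidence of "diapers, talc => beer, fries"
    antecedent = {"diapers", "talc"}
    consequent = {"beer", "fries"}

    both = sum((antecedent | consequent) <= r for r in receipts)
    ante = sum(antecedent <= r for r in receipts)

    print(both / ante)   # 2/3: "many" such receipts also include beer and fries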

Anomaly/Outlier Detection

Identification of anomalous behaviors and deviations

May exploit other learning techniques

– Unsupervised

Clustering, Frequent Itemsets, Association Rules

– Supervised

k-NN, SVM, Ensemble

Systems

see ELKI (http://elki.dbs.ifi.lmu.de/)

Applications

Fraudulent use of credit cards

Illegal use of money

Anomalous financial products

Classification

Learning of models aimed at classifying new observations whose class is unknown

Aimed at prediction

Supervised

Binary / MultiClass

Mutually exclusive?

Classification

Applications

Predict customers that will subscribe to a certain product

Predict reliability of a customer in order to decide whether granting a mortgage or a credit card

Predict fraudulent use of a credit card

Predict customers who are inclined to join a competitor


k-Nearest Neighbor

“Learning” (lazy)

Collection of n classified instances

– Selection of most representative ones

Use

Compute the distance between the observation and each prototype instance

Consider the closest k

k odd, not large (~ √n)

– Euclidean distance, Cosine distance

What about First-Order Logics?

Assign the observation to the majority class among the k closest prototype instances
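A minimal sketch of this lazy classification step, with made-up prototypes and Euclidean distance:

    import numpy as np
    from collections import Counter

    def knn_classify(observation, prototypes, labels, k=3):
        # Distance between the observation and each prototype instance
        dists = np.linalg.norm(prototypes - observation, axis=1)
        # Consider the k closest (k odd, not large)
        closest = np.argsort(dists)[:k]
        # Majority class among the k closest prototype instances
        return Counter(labels[i] for i in closest).most_common(1)[0][0]

    prototypes = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
    labels = ["a", "a", "a", "b", "b", "b"]
    print(knn_classify(np.array([4.5, 5.2]), prototypes, labels))   # "b"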

Neural Networks

Observations

Feature Vectors

– Attribute-value

Propositional

Model

Directed Acyclic Graph

– Nodes with no incoming arcs

= input attributes

– Internal nodes

= intermediate (latent) concepts

– Nodes with no outgoing arcs

= output classes

– (weighted) arcs = (degree of) influence of a node on another

Neural Networks

Can be straightforwardly implemented in electronics

Use

Apply attribute values to corresponding input nodes

Repeat

– For each node in the graph that receives some input

If a pre-defined function of the input values passes a given threshold, propagate the result to the nodes to which it is connected by outgoing arcs

Collect the weight of the output nodes, expressing the likelihood of the corresponding class
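A minimal sketch of this propagation for one hidden layer, assuming a sigmoid as the pre-defined function and arbitrary weights (the slides leave both unspecified):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    # Hypothetical weighted arcs: 3 input nodes -> 4 hidden nodes -> 2 output classes
    rng = np.random.default_rng(2)
    W_hidden = rng.normal(size=(4, 3))
    W_output = rng.normal(size=(2, 4))

    def forward(attributes):
        # Apply attribute values to the input nodes, then propagate:
        # each node applies the function to its weighted inputs
        hidden = sigmoid(W_hidden @ attributes)
        output = sigmoid(W_output @ hidden)
        return output   # likelihood-like score for each output class

    print(forward(np.array([0.5, -1.0, 2.0])))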

Neural Networks

Learning

Setup of graph structure

Choice of function

Reinforcement of weight of arcs that lead from inputs to correct outputs

Deep Learning

Deep Neural Networks

– Many hidden layers

Bayesian Nets

Allow capturing probabilistic dependencies among many variables

Independent

Dependent

Observations

Feature vectors

– Attribute-value

Propositional

Model

Directed Acyclic Graph

– Nodes = random variables

– Arcs (weighted) = conditional dependencies

Bayesian Nets

Example

Systems

Hugin


Naive Bayes

Likelihood of a class C given the features of an observation O

Bayes’ Theorem

– P(C ∧ O) = P(C | O) P(O) = P(O | C) P(C)

– P(C | O) = P(O | C) P(C) / P(O) ∝ P(O | C) P(C)

(P(O) is fixed for any given O, independent of C)

With O = {F_1 = o_1, ..., F_n = o_n}:

P(O | C) = P(F_1 = o_1 ∧ ... ∧ F_n = o_n | C) = P(F_1 = o_1 | C) ⋅ ... ⋅ P(F_n = o_n | C)

“naive” assumption: statistically independent features (!)

P(F = o | C) = relative frequency in the training set

P(C) = relative frequency of C in the training set
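A minimal sketch of these estimates on a toy training set (features and classes are made up for illustration):

    import math

    # Toy training set: (features, class) pairs
    train = [
        ({"outlook": "sunny", "wind": "weak"}, "no"),
        ({"outlook": "sunny", "wind": "strong"}, "no"),
        ({"outlook": "overcast", "wind": "weak"}, "yes"),
        ({"outlook": "rain", "wind": "weak"}, "yes"),
        ({"outlook": "rain", "wind": "strong"}, "no"),
    ]

    def p_class(c):
        # P(C) = relative frequency of C in the training set
        return sum(cls == c for _, cls in train) / len(train)

    def p_feature(f, o, c):
        # P(F = o | C) = relative frequency among the examples of class C
        in_c = [feats for feats, cls in train if cls == c]
        return sum(feats[f] == o for feats in in_c) / len(in_c)

    def classify(obs):
        # "naive": P(O | C) factorized as the product of the P(F_i = o_i | C)
        classes = {cls for _, cls in train}
        score = {c: p_class(c) * math.prod(p_feature(f, o, c) for f, o in obs.items())
                 for c in classes}
        return max(score, key=score.get)

    print(classify({"outlook": "sunny", "wind": "weak"}))   # "no"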

Support Vector Machines

Binary classification based on

Structural Risk Minimization Principle

– Find a hypothesis that guarantees the minimum empirical error

Probability of making an error on an unknown test instance chosen at random

Vector Space

– Instances = points in the space

– Hypothesis = hyperplane (decision surface) that may best separate the points in 2 distinct classes

Area to find it limited by maximum margin

Support vector = any point within the maximum margin

Support Vector Machines

Learning

Identify the optimal hyperplane

– Maximum distance from the closest points of both classes

– Parallel to the margin hyperplanes

The problem might not be linearly separable

– Use of non-linear transformation functions (kernels) such that the problem in the transformed space is linearly separable

Use

Computation of instance’s position and distance with respect to the hyperplane

Classification Rules

Observations

Logic conjunctions

– Propositional, First-Order

Model

Set of implications

– Premises: conditions to be checked

Conjunction [+ disjunctions, negation]

– Conclusion: class

Classification Rules

Application

Given a rule

– Check whether the facts in the observation satisfy its conditions

If so, classify the observation according to the class reported in the conclusion

Forward

– For each rule

Backward

– For the rules of a given class

Classification Rules

Learning

Top-down

– Start: Most general rule

– Progressive expansion with additional conditions

Coverage of as many positive examples as possible

Non-coverage of any negative example

– Loop until the positive examples are exhausted

Bottom-up

– Start: specific examples

– Progressive generalization, removing conditions

Monotonic: without ever covering negative examples

Non-monotonic: specialization in case of covered negative examples, by adding conditions


Classification Rules

Systems

0-order

– AQ (Michalski 1969)

– CN2 (Clark & Niblett 1989)

AQ + ID3

1-order

– FOIL

– Progol

– Aleph

– InTheLEx

Incremental

Classification Rules

Example (InTheLEx)

class_elsevier(A) :-
    numero_pagine(A, [13, 26]),
    pagina_1(A, B),
    frame(B, C),
    height_medium_small(C),
    pos_center(C),
    pos_upper(C),
    tipo_misto(C),
    on_top(C, D),
    pos_center(D),
    pos_middle(D),
    tipo_testo(D),
    larghezza_pagina(B, [544.0, 612.0]),
    altezza_pagina(B, [743.0, 842.0]).

Decision Trees

Observations

Feature vectors

– Attribute-value

Propositional

Model

N-ary tree structures

– Non-leaf nodes = tests on attributes

– Arcs = possible results of tests

– Leaves = classes

Decision Trees

Use

Starting from the root

– Traverse the tree by running the tests indicated in the traversed nodes on the corresponding attributes of the observation

until reaching a leaf

– Classify the observation according to the class indicated by the leaf

Decision Trees

Training

Each node associated to a subset of examples

– Root: whole training set

– Children of a node: partition of the examples in the parent node based on the test associated to the node

Choice of the most discriminant attribute

Entropy, Information Gain, ...

– Leaves: subsets of examples all in the same class

– Pruning: removal of subtrees whose complexity does not justify the gain in accuracy

Simplifies the tree

Avoids overfitting
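A minimal sketch of the attribute choice via Information Gain, on a made-up fragment of the PlayTennis data used in the example below:

    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, labels, attribute):
        # Gain = entropy of the parent node minus the weighted entropy
        # of the partition induced by the test on the attribute
        total = len(labels)
        by_value = {}
        for ex, lab in zip(examples, labels):
            by_value.setdefault(ex[attribute], []).append(lab)
        remainder = sum(len(p) / total * entropy(p) for p in by_value.values())
        return entropy(labels) - remainder

    examples = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "overcast"},
                {"outlook": "rain"}, {"outlook": "rain"}]
    labels = ["no", "no", "yes", "yes", "no"]
    print(information_gain(examples, labels, "outlook"))   # ~0.57 bits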

Decision Trees

Example: input


Decision Trees

Example: tree

Decision Trees

Example: Use

Outlook = Sunny

Temperature = Hot

Humidity = High

Wind = Weak

PlayTennis = ?

– No

Decision Trees and

Classification Rules

Every path from root to leaf in a decision tree generates a classification rule

Example

Outlook = Sunny AND Humidity = High ⇒ PlayTennis = No

Outlook = Sunny AND Humidity = Normal ⇒ PlayTennis = Yes

Outlook = Overcast ⇒ PlayTennis = Yes

Outlook = Rain AND Wind = Strong ⇒ PlayTennis = No

Outlook = Rain AND Wind = Weak ⇒ PlayTennis = Yes

Decision Trees

Systems

0-order

– ID3 (Quinlan 1986)

– C4.5 (Quinlan 1993)

– ITI (Utgoff 1989)

Incremental

1-order

– TILDE

– MRDTL

Random Forests

Sets of decision trees

Each learned on a subset of features selected at random

Classification obtained as a combination of the classifications proposed by single trees

Ensemble Learning method

Hidden Markov Models

Observations

Sequences of symbols

– Propositional

Model

Directed graph

– 2 types of nodes

States (not observable)

Symbols (observable)

– 2 types of arcs (weighted with probabilities)

State-State (transition)

State-Symbol (output)


Hidden Markov Models

Application

Given a sequence, compute the probability that it is generated by the model

Given several models, choose the most suitable one to generate a sequence

Given a model, generate sequences for simulation purposes

– Example: speech recognition
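A minimal sketch of the first use (probability that the model generates a given sequence) via the standard forward algorithm; the 2-state model and its probabilities are made-up assumptions:

    import numpy as np

    # Hypothetical 2-state HMM over 2 symbols
    initial = np.array([0.6, 0.4])          # P(initial state)
    transition = np.array([[0.7, 0.3],      # State-State arcs
                           [0.4, 0.6]])
    output = np.array([[0.9, 0.1],          # State-Symbol arcs: P(symbol | state)
                       [0.2, 0.8]])

    def sequence_probability(symbols):
        # Forward algorithm: alpha[i] = P(observed prefix, current state = i)
        alpha = initial * output[:, symbols[0]]
        for s in symbols[1:]:
            alpha = (alpha @ transition) * output[:, s]
        return alpha.sum()

    print(sequence_probability([0, 1, 0]))   # P(model emits symbol sequence 0, 1, 0)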

Hidden Markov Models

Training

Parameters

– Weights of arcs that maximize the probability of obtaining the training sequences

Suitable numerical methods

Structure

– Open problem

First-Order Logics

Learning from Sequences

Systems and associated features:

Feature            | HMM | LoHMM | TildeCRF | MLN | Lynx
First-Order        |  -  |   X   |    X     |  X  |  X
Relationships      |  -  |   -   |    -     |  X  |  X
Multi-dimensional  |  -  |   -   |    -     |  ~  |  X
Concurrency        |  -  |   -   |    -     |  ~  |  X
Gap length         |  -  |   -   |    -     |  -  |  -
Long descriptions  |  -  |   -   |    ~     |  -  |  ~
Readability        |  ~  |   ~   |    ~     |  ~  |  ~

Experimental Setting

Goal definition

Choice of dataset

(choice of representation)

Choice of systems

Choice of performance measures

Choice of validation procedure

Choice of statistical test

Running the experiment

Analysis of results

Choice of Dataset

Real

Obtaining data is often problematic

Representative of the domain

Often suffering from noise

Artificial

Manual creation often complex

Allows to focus on specific features

“clean” data

Choice of Representation

Level

Propositional

– May be automatic, immediate (from sensors)

Relational

– Manual, or requires post-processing

Features

Problems

– Irrelevance

– Redundancy

– Curse of dimensionality

Solutions

– Selection

– Extraction


Choice of Systems

Depends on

Size of dataset

Representation of instances

– Number/type of features

Learning task/goals

Unbalanced instances

Reliability of data (noise)

Evaluation of Results

Possible criteria

Common to numeric and symbolic approaches

– Predictive performance of the classifier

– Speed of the learning process

– Speed of the classifier

– ...

Relevant for the numerical/statistical approach

– Robustness (to noise, uncertainty, etc.)

Relevant in the symbolic/conceptual approach

– Interpretability of the results

Performance Measures

Depends on

Learning goals

Unbalanced instances

Parameters

TP(R) : True Positive (Ratio)

TN(R) : True Negative (Ratio)

FP(R) : False Positive (Ratio)

– Type I error: false alarm

FN(R) : False Negative (Ratio)

– Type II error: miss

Performance Measures

Accuracy

A = (TP + TN) / (TP + TN + FP + FN)

Error Rate

E = (FP + FN) / (TP + TN + FP + FN) = 1 – A

Specificity

TN / (TN + FP)

Fall-out

FP / (FP + TN) = 1 – Specificity

Performance Measures

Precision

P = TP / (TP + FP)

Recall (Hit Rate, Sensitivity)

R = TP / (TP + FN)

F-measure

F = (1 + b^2) P R / (b^2 P + R)

F1-measure (b = 1)

F1 = 2 P R / (P + R)

– Harmonic mean between Precision and Recall
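A minimal sketch computing these measures from the four parameters, with made-up counts:

    def measures(tp, tn, fp, fn, b=1.0):
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = (1 + b**2) * precision * recall / (b**2 * precision + recall)
        return accuracy, precision, recall, f

    # Example: TP = 40, TN = 45, FP = 10, FN = 5
    print(measures(40, 45, 10, 5))   # accuracy 0.85, P 0.8, R ~0.889, F1 ~0.842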

Choice of Evaluation Procedure

Problem

One dataset -> result of one run would be unreliable

Solution

Generation of several training/test sets from the same dataset -> many runs -> reliable statistics

Factors

Sample size (dataset)

A-priori knowledge of the distribution of instances in the dataset

– Stratification

(Kind of statistical test to be applied)


Relevance of the Sample Size

How many examples?

Theorem [Blumer, Ehrenfeucht, Haussler & Warmuth]

– Consider a training set of m examples classified based on a theory L

– Suppose n hypotheses {H_i}_{i=1,...,n}, including L

– Let H_k be any hypothesis agreeing with L on all examples

– Supposing m ≥ (1/ε) ln(n/δ)

– The probability that H_k and L differ by more than ε is less than or equal to δ

ε is the discrepancy from the objective

1 – δ is the confidence
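A small worked computation of this bound, read as the usual sample-size requirement m ≥ (1/ε) ln(n/δ); the numbers are arbitrary:

    import math

    def required_examples(n_hypotheses, epsilon, delta):
        # m >= (1/epsilon) * ln(n/delta) examples suffice for any hypothesis
        # consistent with all of them to differ from L by more than epsilon
        # with probability at most delta
        return math.ceil(math.log(n_hypotheses / delta) / epsilon)

    print(required_examples(n_hypotheses=10**6, epsilon=0.05, delta=0.05))   # 337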

Evaluation Procedures

k-fold Cross-Validation

Instances partitioned in k “folds”

For all i = 1, ..., k:

– Training set = Dataset – i-th fold

– Learn a model

– Evaluate model on i-th fold used as a Test set

Features

– Little variability in composition and ordering of instances

– Size of training and test set depending on the number of repetitions

More repetitions, smaller test set
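A minimal sketch of the fold bookkeeping (the model-learning step is left as a comment):

    import numpy as np

    def k_fold_indices(n_instances, k, seed=0):
        # Partition instance indices into k folds; fold i serves once as the test set
        idx = np.random.default_rng(seed).permutation(n_instances)
        return np.array_split(idx, k)

    folds = k_fold_indices(100, k=5)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # learn a model on train_idx, evaluate it on test_idx
        print(f"fold {i}: train {len(train_idx)}, test {len(test_idx)}")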

Evaluation Procedures

Leave-One-Out

k-fold Cross Validation for k = n number of instances

– Each training set made up of all instances except one

– Each test set made up of just one instance (the remaining one)

Evaluation Procedures

Holdout (Random Split)

Define percentage p of instances to be used as a training set and number of repetitions r

For i = 1, ..., r

– Extract p% instances from the dataset to form the training set

– Learn a model

– Evaluate model on the remaining instances as the test set

Features

– Possibility of varying composition and order of instances

– Possibility of maintaining original distribution of classes in training and test sets

Evaluation Procedures

Prequential

Each instance used to test the model before updating it

Repeat

– Consider next instance

– Test previous version of the model

– (Update model)

Features

– Model always tested on new data

– Suitable for incremental learning

– No need for a test set

Statistical Test

Reliability value α required for outcomes

Typical values: 90%, 95%

Opposite hypotheses to check

Null hypothesis

– No difference between two “treatments”

– Performance value obtained is reliable

Satisfied if the value of the statistic is close to α

Alternate hypothesis

– There is a difference between two “treatments”

– Performance value obtained is not reliable

Satisfied if the value of the statistic is far from α


Choice of Statistical Test

T test (Student)

Parametric test to check hypothesis on mean value

– Pros

Effective for large data sample

Observations collected in an independent way

– Cons

Dependent on random fluctuations of averages in case of small-sized samples

Effective only for a Normal distribution of data

Choice of Statistical Test

Kolmogorov-Smirnov test

Non-parametric test that does not require a-priori hypotheses about the distributions, used in general for qualitative features

Advisable for k-fold Cross-Validation technique

– Order of examples increases risk of asymmetry in data

– Pros

Suitable for small data samples (<40)

Robust compared to Student’s T test

Effective on asymmetric data

– Cons

Loss in efficiency/power for large data samples

Choice of Statistical Test

Rank test (Wilcoxon)

Non-parametric test for checking hypotheses about the central trend

Advisable for the Data Splitting (Random Split) technique

– Random distribution of data reduces the risk of asymmetry in the data

– Pros

Ensures more general validity conditions

Suitable for a small data sample (<20)

Robust compared to Student’s t test

– Cons

Suffers greatly from asymmetry in the data

KNIME

www.knime.com

Allows to develop Data Science and Machine Learning solutions by “drag and drop” of functional blocks

Used in a wide variety of areas

– Provides templates (or blueprints) for example applications in some of these areas

– Example server with workflows that demonstrate how to use some of the techniques

WEKA

Waikato Environment for Knowledge Analysis

http://www.cs.waikato.ac.nz/ml/weka/

Collection of Machine Learning algorithms for Data Mining tasks

Open Source

– Interface (WEKA Knowledge Explorer)

– Java API

Functionality

– Pre-processing tools

– Learning tools

Classification, Regression, Clustering, Association Rules

– Visualization tools

WEKA

“Preprocess” panel

Starting point for exploring knowledge

– Loading the dataset

– Browsing features of attributes

– Applying any combination of unsupervised filters to the data


WEKA

“Classify” panel

Configuring and running any classifier on the current dataset

Running a test or cross-validation on a separate dataset

Visualizing errors in a pop-up tool

Graphic visualization of decision trees

WEKA

“Cluster” panel

Configuring and running any clusterer on the current dataset

Visualizing clusters in a pop-up tool

WEKA

“Associate” panel

Discovering association rules in the current dataset using associators

WEKA

“Select attributes” panel

Configuring and applying any combination of methods for attribute evaluation and search

– Selection of most appropriate ones

– If a selection scheme transforms the data, the transformed data may be visualized in a pop-up tool

WEKA

“Visualize” panel

Visualizing the scatter plot matrix of the current dataset

– Adjustable size of cells and points (cursors at the bottom)

– Changeable number of cells (“Select Attributes” button)

– On large datasets, possibility of visualizing only a sub-sample (for visualization performance)

– Enlargement of the graphics in the cells

Dataset visualization in one or two dimensions: if the coloring attribute is discrete, each value is visualized in a different color; if it is continuous, a spectrum is used to indicate the value

– Summary of the discriminating power of the single attributes

– Visualization of predictions in a window of the classification and clustering panels


Bibliography

Tom Mitchell, “Machine Learning”, McGraw-Hill, 1997

Christopher M. Bishop, “Pattern Recognition and Machine Learning”, Springer, 2007

Ian H. Witten, Eibe Frank, Mark A. Hall, “Data Mining: Practical Machine Learning Tools and Techniques”, Morgan Kaufmann, 2011 (3rd ed.)
