Laurea in INFORMATICA Laurea in INFORMATICA MAGISTRALE MAGISTRALE

(1)

1

Learning / 2 Learning / 2

Machine Learning (ML) and Data Mining (DM) – Machine Learning (ML) and Data Mining (DM) – Concepts and Techniques for automatic knowledge Concepts and Techniques for automatic knowledge

acquisition acquisition

Laurea in INFORMATICA Laurea in INFORMATICA

MAGISTRALE MAGISTRALE

Corso di

INTELLIGENZA ARTIFICIALE Stefano Ferilli

Questi lucidi sono stati preparati per uso didattico. Essi contengono materiale originale di proprietà dell'Università degli Studi di Bari e/o figure di proprietà di altri autori, società e organizzazioni di cui e' riportato il riferimento. Tutto o parte del materiale può essere fotocopiato per uso personale o didattico ma non può essere distribuito per uso commerciale. Qualunque altro uso richiede una specifica autorizzazione da parte dell'Università degli Studi di Bari e degli altri autori coinvolti.

Objectives

●

Framing the subject

●

Definitions and fundamental concepts

●

Overview of approaches, techniques and systems

●

YES features, peculiarities, differences

●

YES general strategies

●

NO technical details

●

NO exhaustive listing

●

Ability to choose depending on needs

●

Ability to run significant experiments

Types (1)

●

Sub-symbolic

●

Mathematics/Statistics

– Brain structure

●

Physical connections – Latent semantics

– Perception

●

Symbolic

●

Logics

– Mind structure

●

Concepts – Explicit semantics

– Reasoning

Cooperation!

Types (1)

●

Examples

●

Sub-symbolic

– Neural Networks

– Hidden Markov Models

– Clustering

– Frequent Itemsets

●

Symbolic

– Classification Rules

– Decision Trees

– Association Rules

Knowledge Representation

●

Several formalisms

●

2 large categories

●

0-order (propositional logics)

– Feature vectors

– Attribute-value pairs

●

1-order (predicate logics)

– Complex schemes

– Objects, Attributes, Relationships

0-order Learning

●

Geometric representation

●

d-dimensional Euclidean space, E

^d

(pattern space)

●

Instance = point of coordinates (x

₁

,x

₂

, …, x

_d

)

– or vector x from origin to the point

●

Set of points separated by “decision surfaces”

– Split E

^d

in R

“decision regions”

●

2 dimensions

→ curves in E

²

(5,-3)

X

₁

X

₂

d=2 R=3

ℝ

₃

ℝ

2

ℝ

1

ℝ

1

(2)

0-order Learning

●

A classifier’s decision surfaces may be implicitly defined by a set of R functions of pattern X

– g

₁

(X), g

₂

(X), …, g

_R

(X) (discriminative functions) selected such that

– X  ℝ

i

: g

_i

(X) > g

_j

(X) for i,j = 1, …, R, i ≠ j

●

in i the i-th function has its maximum value ℝ

●

For R=2 (belonging or not belonging to a class c) the larger between g

1

(X) e g

2

(X) is selected

– are continuous through the decision surfaces

●

The decision surface separating two adjacent regions ℝ

_i

and ℝ

_j

is given by g

_i

(X) – g

_j

(X) = 0

●

Many ways for selecting discriminative functions

– Complete a-priori knowledge or qualitative knowledge about the pattners

0-order Learning

●

Assumption: Extremely poor a-priori or

background knowledge about the patterns to be classified

●

Adaptation process takes place during training phase

●

Large (or significant) number of typical patterns required as the “training set”

– i.e., the classification of such instances is known

1-order Learning

●

First-Order Logics

●

Variable number of objects

●

Relationships among objects

●

Inductive Logic Programming (ILP)

●

Example

●

Predict train direction (East bound / West bound)

1-order Learning

●

Arch problem [Patrick Winston]

– Learn a definition for “arch” in terms of supports and

“touch” relationships

/ \ > arch(A, B, C) :- ______ _______ / \ > brick(B), | 11 | | 31 | / 41 \ > left_of(B, C), ~~~~~~ ~~~~~~~ ~~~~~~~ > brick(C), |~| |~| |~| |~| | | | |~| |~| > supports(B, A), 12| |13 22| |23 32|33 42| |43 > supports(C, A), | | | | |_|___|_| | | | | | | | > not(touches(B,C)).

| | | | | 21 | | | | | | | | ~ ~ ~~~~~~~ ~ ~ ~ ~ / \ ______ _______ _______ / \ | 51 | | 61 | | 71 | / 81 \ ~~~~~~ ~~~~~~~ ~~~~~~~ ~~~~~~~

|~| |~| |~| |~| |~| |~| |~| |~|

52| |53 62| |63 72| |73 82| |83 | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ~ ~ ~ ~ ~ ~ ~ ~

●

Mutagenesis (!)

Numeric and Symbolic Approaches

●

Different description language and, as a consequence, computational model

– Inputs (examples or observations)

●

Numeric: Feature Vectors

●

Symbolic: (attribute-value) Pairs, Logics – Experience or knowledge (in general)

●

Numeric: Heuristics (functions)

●

Symbolic: Theory/knowledge expressed in logic languages – Concepts or classes

●

Numeric: Extensional form, functions

●

Symbolic: Intensional form (heuristic)

●

Thus

– Numeric approach centered on 0-order representation with finite sets

– Symbolic approach centered on higher-order representations and potentially infinite classes of instances (logic “variables”)

Symbolic Approaches

●

Concept

●

Extensional approach: represents a category

– A set of objects (instances) sharing properties

●

Intensional approach:

– Classical view [Aristotle]: the name that corresponds to a set of necessary and sufficient conditions for belonging to a category

– Weakened view [Wittgenstein, 1954]: considers only sufficient conditions

– Heuristic view “by prototypes” [Rosch, 1973]: the

description of a (real or virtual) example characterized by

the “typical” properties appearing in the set of instances

(3)

“Conceptual” Inductive Learning

●

Goal: satisfy the

●

Understandability Postulate

– “Results of computer-made inductions should be symbolic description of entities, that are structurally and semantically similar to those that a human would produce.

– The components of such descriptions should be understandable as single information units.”

●

[Michalski, 1975]

“Conceptual” Inductive Learning

●

Exploits

●

A logic(-like) description language for inputs and outputs

●

Different systems to represent experience and background knowledge to assess classification rules that define, characterizing them, class patterns

●

Allows conceptual descriptions of patterns that are similar to those that human experts would do when observing the same data

“Conceptual” Inductive Learning

●

Definitions

●

Concept

●

Abstract description of a class – Formation

●

Grouping of instances in (any) significant set of classes – Attainment

●

Search for defining attributes that distinguish instances of different classes

●

Formation first step to Attainment

●

[J.S. Bruner, J.J. Goodnow, G.A. Austin “A Study of Thinking” - John Wiley & Sons, 1956, p. 22]

Data Mining

●

Branch of Artificial Intelligence derived from Machine Learning

●

Complementary to Machine Learning

●

Aimed at knowledge discovery

●

Specifically

– From big data

– For business purposes

Ensemble Learning

●

Learning of many models from one dataset

●

Use of portions of the data or of the features

●

Final classification = Combination of predictions of the learned models

●

Examples

●

Random Forests

Ensemble Learning

●

Bagging (Bootstrap Aggregation)

●

Goal: Reduce variance in predictions

●

Strategy: generate additional training data from the original dataset

– Combinations with replacement to produce multisets with the same cardinality as the original data

●

Selection of subset: Random

●

Combination: (weighted) average

●

Increasing the size of the training set

– Cannot improve the predictive ability of the model!

– Can only reduce variance

●

Make the prediction better fit the expected solution

●

Useful in numeric-statistical approaches

(4)

Ensemble Learning

●

Boosting

●

Goal: Improve predictive power

●

Selection of subset: Preference for instances that are misclassified by previous models

– Use of subsets of the original data to obtain a series of poorly performing models

– Use of a cost function (majority vote)

– Not random

●

Depends on the performance of the previous model – Every new subset includes the elements that were

(probably) misclassified by the previous models

●

Combination: (weighted) majority vote

– “boosts” performance of models by combining them with each other

Ensemble Learning

●

Stacking

– Similar to boosting

●

Goal: Bagging + Boosting

– Possibility of applying various models to the original data

●

Selection of subset: Various

●

Combination: meta-level, using another model/approch for weight estimation

– To determine which models have good performance and which ones have bad performance given the input data

– Not easy empirical formula for weight function

– Difficult in practice to define the meta-level

●

Logistic Regression

Deep Learning

●

Multi-level learning

●

Many concepts

●

Many representation levels

●

Hierarchically organized concepts

– Higher-level concepts defined in terms of lower-level ones

●

Generic definition, but

took over by neural approaches

Types (2)

●

Definitions

●

Observation

– Description of an object or situation in a given formalism

●

Example

– Class label associated to an element that is present in an observation

Types (2)

●

Unsupervised

●

Based on observations

– Autonomous

●

(Data Mining)

●

Supervised

●

Based on examples

– Trainer/Oracle

●

(Machine Learning)

●

Semi-supervised

●

Mix of labeled and unlabeled instances

Types (2)

●

Reinforcement

●

Feedback in response to the choices of the algorithm

– Maximization of cumulative rewards obtained by the agent while exploring the problem

●

Input-output pairs of known examples never shown

●

Sub-optimal choices never corrected

●

Active (semi-supervised)

●

Interactively asks an oracle about the correct label of unlabeled observations

– Particularly informative observations

– Useful when labeling the whole dataset would bee too

expensive

(5)

Types (2)

●

Example

Types (2)

●

Definitions

●

Relevance

●

Components of the observations of interest for a given goal – Direct

●

Explicitly indicated by the trainer for a specific observation – Indirect

●

Must be selected by the learner from the set of components associated to all available observations

Types (2)

●

Examples

●

Unsupervised

– Clustering

– Frequent Itemsets

– Association Rules

●

Semi-supervised

– Regression

●

Supervised

– Neural Networks

– Classification Rules

– Decision Trees

– k-Nearest Neighbors

Types (3)

●

Predictive

●

Identify a relationship between a set of predictive variables and a target variable

●

Supervised

●

Descriptive

●

Deepen knowledge of what happens in the data

●

Unsupervised

●

The best model may not be the most predictive one

Types (3)

●

Examples

●

Predictive

– Classification

– Regression

– Anomaly / Outlier detection

●

Descriptive

– Clustering

– Association Rules

– Sequential patterns

Types (4)

●

Lazy

●

Instance-based

– Fictitious

●

?“Industrious”?

●

Learn (explicit) models

– Real

(6)

Types (4)

●

Examples

●

Traditional

– Neural Networks

– Classification Rules

– Decision Trees

●

Lazy

– k-Nearest Neighbors

Types (5)

●

Batch

●

All observations/examples for learning available at the beginning of the learning process

●

Incremental

●

Observations/examples may become available subsequently

Types (5)

●

Batch

●

Simpler

●

Traditional

●

Training

●

Smart models

●

Incremental

●

More complex

●

Innovative

●

Tuning

●

Efficiency

Cooperation!

Types (5)

●

Definitions

●

Learning set

●

Set of (labeled) examples used for learning – Training set

●

Examples for batch learning – Tuning set

●

Examples for incremental learning

●

Test set

●

Set of (labeled) examples used for evaluating performance of a model (labeled)

●

Classification Observations

●

(Unlabeled) observations to which the model is to be applied

Types

●

Closed-loop system

●

A learning system that can take the learned knowledge as an input to another learning process

●

Open-loop system

●

Otherwise

Definitions

●

Order of examples

●

Selection

– Controlled by the learner

●

Batch

●

Reception

– Controlled by the context

●

Incremental

●

Learning strategies

●

Scanning

●

Focussing

– [J.S. Bruner, J.J. Goodnow, G.A. Austin “A Study of

Thinking” - John Wiley & Sons, 1956, p. 22]

(7)

Regression

●

Guessing of a function’s values

●

Regression equation

– Models the dependent variables as a function of the independent variables

●

Mutliple regression: many dependent variables

●

Parameters estimated so as to better describe the data – Many error terms

●

Represent noise

–

Variation cannot be controlled or predicted in the dependent variable

●

Modeled by a random variable

●

Learning

– Method of Least Squares

– ...

Regression

●

Use

●

Interpolation

– Guessing values within the range of the training ones

●

Extrapolation

– Guessing values outside the range of the training ones

●

Applications

●

Predict the sales volume revenues of a new product based on investments in advertisement

●

Predict the trend of stock market

Genetic Algorithms

●

Search for a solution inspired by biological evolution

●

Solution = cromosomes (feature vectors)

– Genes = bits or symbols that make up the vector

●

Initial population of solutions

●

Loop:

– Evaluation of solutions

●

Fitness measure

– Probabilistic selection of fittest ones

●

Seeds for producing the next generation – New generations

●

Selection of members for “reproduction”

●

Random mutation and Crossover operations

●

No guarantee to find the best solution

Genetic Algorithms

●

Mutation

●

Choice of genes random or aimed at improving fitness

●

Crossover

●

One-point

– Side portion of genes

●

Two-point

– Intermediate portion of genes

●

Uniform

– Random different genes

●

Arithmetic

– Logistic operations

Clustering

●

Grouping of given observations in internally homogeneous and mutually dishomogeneous subsets

●

Similarity / Distance

●

Aimed at organization

●

Concept formation

●

Number of subsets?

●

Use of quality measures

– Density, Cohesion, ...

●

Applications

●

Split customers in subsets to use as the targets of specific marketing activities

Clustering

●

“Direct”

– Directly distributes observations in a given number of groups

●

Hierarchical

– Generates a progressive partition tree from observations (dendrogram)

– Determines the “border” in the tree that defines the groups

●

Partitive

– Successive splittings starting from the whole set of observations

●

Agglomerative

– Successive groupings starging from the single

observations

(8)

Clustering

●

Dendrogram

Clustering

●

Fuzzy

●

An observation may be assigned to many groups

– Different membership degrees

●

Conceptual

●

Learns an explicit model for each identified group

●

Concept formation + attainment

Clustering

●

Representatives of a cluster

●

Centroid

– Central position (center of mass)

●

May be not computable for non-metric spaces

–

Attribute-value descriptions

●

May not correspond to any observation

●

Medoid

– “Most central” element

●

Useful when the centroid is not a valid element

–

Predicate logics

●

Always corresponds to an observation

Clustering

●

Training

●

Agglomerative: Cluster merging

– Similarity above (Distance below) a given threshold

●

Most distant elements (complete link)

●

Closest elements (shortest link)

●

Centroids/Medoids

●

Average of similarities/distances

●

K-means

– Initialization: k observations chosen at random (seeds)

– Aggregation of each remaining observation to the most similar seed

– Loop

●

Recompute seeds as centroids/medoids of the groups obtained

●

Redistribute remaining observations

B

A C

Clustering

●

Training

●

Agglomerative: Cluster merging

– Similarity above (Distance below) a given threshold

●

Most distant elements (complete link)

●

Closest elements (shortest link)

●

Centroids/Medoids

●

Average of similarities/distances

●

K-means

– Initialization: k observations chosen at random (seeds)

– Aggregation of each remaining observation to the most similar seed

– Loop

●

Recompute seeds as centroids/medoids of the groups obtained

●

Redistribute remaining observations

B

A C

Clustering

●

Training

●

Agglomerative: Cluster merging

– Similarity above (Distance below) a given threshold

●

Most distant elements (complete link)

●

Closest elements (shortest link)

●

Centroids/Medoids

●

Average of similarities/distances

●

K-means

– Initialization: k observations chosen at random (seeds)

– Aggregation of each remaining observation to the most similar seed

– Loop

●

Recompute seeds as centroids/medoids of the groups obtained

●

Redistribute remaining observations

B

A C

(9)

Clustering

●

Example

Process Mining

●

Extraction of process models from examples of process executions

●

Declarative

– Logic representation

– Constraints, rather than a rigid model

●

Uses

– Analysis

●

Discovery of criticalities – Standardization

– Supervision / Compliance check

– (Prediction)

– Simulation

●

of industrial or business processes

●

(of human activities)

Process Mining

●

Formalisms

●

(ASF, HMM)

●

Petri Nets

– Workflow Nets [van der Aalst]

●

Specific formalism

– WoMan

●

Systems

●

ProM

– https://svn.win.tue.nl/trac/prom/wiki/ProM66

●

WoMan

– Incremental

Frequent Itemsets

●

Discovery of subparts of descriptions that are frequent in a set of observations

●

Frequency threshold (% of observations)

– Support

●

Aimed at discovering regularities

Frequent Itemsets

●

Applications

●

Products that are often bougth together

– Propose them at the same time

●

Behaviors often happening simultaneously

– Devise targeted products

●

E.g.: {travels, uses_credit_card, uses_foreign_currency}

●

Features often happening simultaneously

– Devise targeted products

●

E.g.: {rich, entrepreneur, from_northern_italy}

●

Anomaly detection

– By contraposition

Frequent Itemsets

●

Learning (APRIORI)

●

If an itemset is not frequent, no superset of it can be either

– Start: Frequent elements

– Expansion: add an item to each frequent itemset obtained in the previous step

●

Remove expanded ones

●

Save those whose expansion is not frequent

– Termination: no itemset expanded in the last step

(10)

Frequent Itemsets

●

Example

●

Supermarket receipts

– pasta, diapers, beer, salt, wine

– cookies, diapers, talc, sauce, milk

– diapers, salt, talc, beer, babyfood, fries

– sugar, diapers, talc, beer, salt, fries

– wine, diapers, fries, babyfood, pasta

●

“many” receipts include

– diapers, talc (3/5)

– diapers, beer (3/5)

– diapers, fries (3/5)

Association Rules

●

Discovery of relevant co-occurrences of situations in a set of observations

●

Relevance threshold (% of co-occurrences)

– Confidence

●

Aimed at discovering regularities

●

A frequent itemset which is a subset of another and often appears in the same observations suggests that the difference of the latter from the former has some correlation with the former

●

Applications

●

Identify other products typically chosen by users that choose some products

– Advertise or recommend them together

Association Rules

●

Example

●

Supermarket receipts

– pasta, diapers, beer, salt, wine

– cookies, diapers, talc, sauce, milk

– diapers, salt, talc, beer, babyfood, fries

– sugar, diapers, talc, beer, salt, fries

– wine, diapers, fries, babyfood, pasta

●

“many” (3/5) receipts include

– diapers, talc

●

“many” (2/3) such receipts also include

– beer, fries

●

Association Rule:

– diapers, talc ⇒ beer, fries

Anomaly/Outlier Detection

●

Identification of anomalous behaviors and deviations

●

May exploit other learning techniques

– Unsupervised

●

Clustering, Frequent Itemsets, Association Rules – Supervised

●

K-NN, SVM, Ensemble

●

Systems

●

see ELKI (http://elki.dbs.ifi.lmu.de/)

●

Applications

●

Fraudulent use of credit cards

●

Illegal use of money

●

Anomalous financial products

Classification

●

Learning of models aimed at classifying new observations whose class is unknown

●

Aimed at prediction

●

Supervised

●

Binary / MultiClass

●

Mutually exclusive?

Classification

●

Applications

●

Predict customers that will subscribe a certain product

●

Predict reliability of a customer in order to decide whether granting a mortgage or a credit card

●

Predict fraudulent use of a credit card

●

Predict customers who are inclined to join a

competitor

(11)

k-Nearest Neighbor

●

“Learning” (lazy)

●

Collection of n classified instances

– Selection of most representative ones

●

Use

●

Compute the distance between the observation and each prototype instance

●

Consider the closest k

– k odd, not large (~ √n)

– Euclidean distance, Cosine distance

●

What about First-Order Logics?

●

Assign the observation to the majority class among the k closest prototype instances

Neural Networks

●

Observations

●

Feature Vectors

– Attribute-value

●

Propositional

●

Model

●

Directed Acyclic Graph

– Nodes with no incominc arcs

= input attributes

– Internal nodes

= intermediate (latent) concepts

– Nodes with no outgoing arcs

= output classes

– (weighted) arcs = (degree of) influence of a node on another

Neural Networks

●

Can be straightforwardly implemented in electronics

●

Use

●

Apply attribute values to corresponding input nodes

●

Repeat

– For each node in the graph that recevies some input

●

If a pre-defined function of the input values passes a given threshold, propagate the result to the nodes to which it is connected by outgoing arcs

●

Collect the weight of the output nodes, expressing the likelihood of the corresponding class

Neural Networks

●

Learning

●

Setup of graph structure

●

Choice of function

●

Reinforcement of weight of arcs that lead from inputs to correct outputs

●

Deep Learning

●

Deep Neural Networks

– Many hidden layers

Bayesian Nets

●

Allow to catch probabilistic dependences among many variables

●

Independent

●

Dependent

●

Observations

●

Feature vectors

– Attribute-value

●

Propositional

●

Model

●

Directed Acyclic Graph

– Nodes = random variables

– Arcs (weighted) = conditional dependencies

Bayesian Nets

●

Example

●

Systems

●

Hugin

(12)

Naive Bayes

●

Likelihood of a class C given the features of an observation O

●

Bayes’ Theorem

– P(C ∧ O) = P(C | O) P(O) = P(O | C) P(C)

– P(C | O) = P(O | C) P(C) / P(O) =

●

P(O) fixed for any given O (indipendent of C)

= P(O | C) P(C)

●

P(O | C) =

–

O = {F

₁

= o

₁

, ..., F

_n

= o

_n

}

= P(F

₁

= o

₁

∧ ... ∧ F

_n

= o

_n

| C) =

–

Assumption: statistically independent features (!)

●

“naive”

= P(F

1

= o

1

| C) ... P(F

n

= o

n

| C)

–

P(F = o | C) = relative frequency in the training set

●

P(C) = relative frequency of C in the training set

Support Vector Machines

●

Binary classification based on

●

Structural Risk Minimization Principle

– Find a hypothesis that guarantees the minimum empirical error

●

Probability of making an error on an unknown test instance chosen at random

●

Vector Space

– Instances = points in the space

– Hypothesis = hyperplane (decision surface) that may best separate the points in 2 distinct classes

●

Area to find it limited by maximum margin

●

Support vector = any point within the maximum margin

Support Vector Machines

●

Learning

●

Identify optimal hyperplane

– Propose maximum distance which is closest to both classes

– Parallel to the margin hyperplanes

●

The problem might not be linearly separable

– Use of non-linear transformation functions (kernels) such that the problem in the transformed space is linearly separable

●

Use

●

Computation of instance’s position and distance with respect to the hyperplane

Classification Rules

●

Observations

●

Logistic conjunctions

– Propositional, First-Order

●

Model

●

Set of implications

– Premises: conditions to be checked

●

Conjunction [+ disjunctions, negation]

– Conclusion: class

Classification Rules

●

Application

●

Given a rule

– Check whether facts in the observation satisfy condition

●

If so, classify the observation according to the class reported in the conclusion

●

Forward

– For each rule

●

Backward

– For the rules of a given class

Classification Rules

●

Learning

●

Top-down

– Start: Most general rule

– Progressive expansion with additional conditions

●

Coverage of as many positive examples as possible

●

Non coverage of any negative example – Loop until termination of positive examples

●

Bottom-up

– Start: specific examples

– Progressive generalization, removing conditions

●

Monotonic: without ever covering negative examples

●

Non-monotonic: specialization in case of covered negative

examples, by adding conditions

(13)

Classification Rules

●

Systems

●

0-order

– AQ (Michalski 1969)

– CN2 (Clark & Niblett 1989)

●

AQ + ID3

●

1-order

– FOIL

– Progol

– Aleph

– InTheLEx

●

Incremental

Classification Rules

●

Example (InTheLEx)

●

class_elsevier(A) :-

numero_pagine(A,[13,26]), pagina_1(A,B),

frame(B,C),

height_medium_small(C), pos_center(C),

pos_upper(C), tipo_misto(C), on_top(C,D), pos_center(D), pos_middle(D), tipo_testo(D),

larghezza_pagina(B,[544.0,612.0]), altezza_pagina(B,[743.0,842.0]).

Decision Trees

●

Observations

●

Feature vectors

– Attribute-value

●

Propositional

●

Model

●

N-ary tree structures

– Non-leaf nodes = tests on attributes

– Arcs = possible results of tests

– Leaves = classes

Decision Trees

●

Use

●

Starting from the root

– Traverse the tree by running the tests indicated in the traversed nodes on the corresponding attributes of the observation

●

until reaching a leaf

– Classify the observation according to the class indicated by the leaf

Decision Trees

●

Training

●

Each node associated to a subset of examples

– Root: whole training set

– Children of a node: partition of the examples in the parent node based on the test associated to the node

●

Choice of the most discriminant attribute

–

Entropy, Information Gain, ...

– Leaves: subsets of examples all in the same class

– Pruning: removal of subtrees whose complexity does not justify the gain in accuracy

●

Simplifies the tree

●

Avoids overfitting

Decision Trees

●

Example: input

(14)

Decision Trees

●

Example: tree

Decision Trees

●

Example: Use

●

Outlook = Sunny

●

Temperature = Hot

●

Humidity = High

●

Wind = Weak

●

PlayTennis = ?

– No

Decision Trees and

Classification Rules

●

Every path from root to leaf in a decision tree generates a classification rule

●

Example

●

Outlook = Sunny AND Humidity = High ⇒ PlayTennis = No

●

Outlook = Sunny AND Humidity = Normal ⇒ PlayTennis = Yes

●

Outlook = Overcast ⇒ PlayTennis = Yes

●

Outlook = Rain AND Wind = Strong ⇒ PlayTennis = No

●

Outlook = Rain AND Wind = Weak ⇒ PlayTennis = Yes

Decision Trees

●

Systems

●

0-order

– ID3 (Quinlan 1986)

– C4.5 (Quinlan 1993)

– ITI (Utgoff 1989)

●

Incremental

●

1-order

– TILDE

– MRDTL

Random Forests

●

Sets of decision trees

●

Each learned on a subset of features selected at random

●

Classification obtained as a combination of the classifications proposed by single trees

●

Ensemble Learning method

Hidden Markov Models

●

Observations

●

Sequences of symbols

– Propositional

●

Model

●

Directed graph

– 2 types of nodes

●

States (not observable)

●

Symbols (observable)

– 2 types of arcs (weighted with probabilities)

●

State-State (transition)

●

State-Symbol (output)

(15)

Hidden Markov Models

●

Application

●

Given a sequence, compute probability that it is generated by the model

●

Given several models, choice of the more suitable to generate a sequence

●

Given a model, generate sequences for simulation purposes

– Example: speech recognition

Hidden Markov Models

●

Training

●

Parameters

– Weights of arcs that maximize the probability of obtaining the training sequences

●

Suitable numerical methods

●

Structure

– Open problem

●

First-Order Logics

Learning from Sequences

HMM LoHMM TildeCRF MLN Lynx

First-Order - X X X X

Relationships - - - X X

Multi-dimensional - - - ~ X

Concurrency - - - ~ X

Gap length - - - - -

Long descriptions - - ~ - ~

Readability ~ ~ ~ ~ ~

●

Systems

●

and associated features

Experimental Setting

●

Goal definition

●

Choice of dataset

●

(choice of representation)

●

Choice of systems

●

Choice of performance measures

●

Choice of validation procedure

●

Choice of statistical test

●

Running the experiment

●

Analysis of results

Choice of Dataset

●

Real

●

Obtaining data is often problematic

●

Representative of the domain

●

Often suffering from noise

●

Artificial

●

Manual creation often complex

●

Allows to focus on specific features

●

“clean” data

Choice of Representation

●

Level

●

Propositional

– May be automatic, immediate (from sensors)

●

Relational

– Manual, or requires post-processing

●

Features

●

Problems

– Irrelevance

– Redundancy

– Curse of dimensionality

●

Solutions

– Selection

– Extraction

(16)

Choice of Systems

●

Depends on

●

Size of dataset

●

Representation of instances

– Number/type of features

●

Learning task/goals

●

Unbalanced instances

●

Reliability of data (noise)

Evaluation of Results

●

Possible criteria

●

Common to numeric and symbolic approaches

– Predictive performance of the classifier

– Speed of the learning process

– Speed of the classifier

– ...

●

Relevant for the numerical/statistical approach

– Robustness (to noise, uncertainty, etc.)

●

Relevant in the symbolic/conceptual approach

– Interpretability of the results

Performance Measures

●

Depends on

●

Learning goals

●

Unbalanced instances

●

Parameters

●

TP(R) : True Positive (Ratio)

●

TN(R) : True Negative (Ratio)

●

FP(R) : False Positive (Ratio)

– Type I error: false alarm

●

FN(R) : False Negative (Ratio)

– Type II error: miss

Performance Measures

●

Accuracy

●

A = (TP + TN) / (TP + TN + FP + FN)

●

Error Rate

●

E = (FP + FN) / (TP + TN + FP + FN) = 1 – A

●

Specificity

●

TN / (TN + FP)

●

Fall-out

●

FP / (FP + TN) = 1 – Specificity

Performance Measures

●

Precision

●

P = TP / (TP + FP)

●

Recall (Hit Rate, Sensitivity)

●

R = TP / (TP + FN)

●

F-measure

●

F = (1 + b

²

) P R / (b

²

P + R)

●

F1-measure (b = 1)

●

F1 = 2 P R / (P + R)

– Harmonic mean between Precision and Recall

Choice of Evaluation Procedure

●

Problem

●

One dataset -> result of one run would be unreliable

●

Solution

●

Generation of several training/test sets from the same dataset -> many runs -> reliable statistics

●

Factors

●

Sample size (dataset)

●

A-priori knowledge of the distribution of instances in the dataset

– Stratification

●

(Kind of statistical test to be applied)

(17)

Relevance of the Sample Size

●

How many examples?

●

Theorem [Blumer, Ehrenfeucht, Haussler &

Warnruth]

– Consider a training set of m examples classified based on a theory L

– Suppose n hypotheses {H

i

}

i=1,...,n

including L

– Let H

k

be any hypothesis agreeing with L on all examples

– Supposing m ≤ (1/e) ln(n/d)

– The probability that H

_k

and L differ for more than e is less than or equal to d

●

e is the discrepancy from the objective

●

1 – d is the confidence

Evaluation Procedures

●

k-fold Cross-Validation

●

Instances partitioned in k “folds”

●

For all i = 1, ..., k:

– Training set = Dataset – i-th fold

– Learn a model

– Evaluate model on i-th fold used as a Test set

●

Features

– Few variability in composition and sorting of instances

– Size of training and test set depending on the number of repetitions

●

More repetitions, smaller test set

Evaluation Procedures

●

Leave-One-Out

●

k-fold Cross Validation for k = n number of instances

– Each training set made up of all instances except one

– Each test set made up of just one instance (the remaining one)

Evaluation Procedures

●

Holdout (Random Split)

●

Define percentage p of instances to be used as a training set and number of repetitions r

●

For i = 1, ..., r

– Extract p% instances from the dataset to form the training set

– Learn a model

– Evaluate model on the remaining instances as the test set

●

Features

– Possibility of varying composition and order of instances

– Possibility of maintaining original distribution of classes in training and test sets

Evaluation Procedures

●

Prequential

●

Each instance used to test the model before updating it

●

Repeat

– Consider next instance

– Test previous version of the model

– (Update model)

●

Features

– Model always tested on new data

– Suitable for incremental learning

– No need for a test set

Statistical Test

●

Reliability value a required for outcomes

●

Typical values: 90%, 95%

●

Opposite hypotheses to check

●

Null hypothesis

– No difference between two “treatments”

– Performance value obtained is reliable

●

Satisfied if value of statistics is close to a

●

Alternate hypothesis

– There is a difference between two “treatments”

– Performance value obtained is not reliable

●

Satisfied if value of statistics is far from a

(18)

Choice of Statistical Test

●

T test (Student)

●

Parametric test to check hypothesis on mean value

– Pros

●

Effective for large data sample

●

Observations collected in an independent way – Cons

●

Depending on casual fluctuations in averages in case of small- sized samples

●

Effective only for a Normal distribution of data

Choice of Statistical Test

●

Kolmogorov-Smirnov test

●

Non-parametric test that does not require a-priori hypotheses about the distributions, used in general for qualitative features

●

Advisable for k-fold Cross-Validation technique

– Order of examples increases risk of asymmetry in data

– Pros

●

Suitable for small data samples (<40)

●

Robust compared to Student’s T test

●

Effective on asymmetric data – Cons

●

Loss in efficiency-power for large data samples

Choice of Statistical Test

●

Rank test (Wilcoxon)

●

Non-parametric test for checking hypotheses about the central trend

●

Advisable for Data Splitting Cross-Validation technique

– Casual distribution of data reduces the risk of asymmetry in the data

– Pros

●

Ensures more general validity conditions

●

Suitable for a small data sample (<20)

●

Robust compared to Student’s t test – Cons

●

Much suffers from asymmetry in the data

KNIME

●

www.knime.com

●

Allows to develop Data Science and Machine Learning solution using “drag and drop” of functional blocks

●

Used in a wide variety of areas

– Provides templates (or blue prints) for example applications for some of these areas

– Example server with workflows that demonstrate how to use some of the techniques

WEKA

●

Waikato Environment for Knowledge Acquisition

●

http://www.cs.waikato.ac.nz/ml/weka/

●

Collection of Machine Learning algorithms for Data Mining tasks

●

Open Source

– Interface (WEKA Knowledge Explorer)

– Java API

●

Functionality

– Pre-processing tools

– Learning tools

●

Classification, Regression, Clustering, Association Rules – Visualization tools

WEKA

●

“Preprocess” panel

●

Starting point for exploring knowledge

– Loading dataset

– Browsing features of attributes

– Applying any

combination of

unsupervised

filters to data

(19)

WEKA

●

“Classify” panel

●

Configuring and running any classifier on the current dataset

●

Running a test or cross-validation on a separate dataset

●

Visualizing errors in a pop-up tool

●

Graphic visualization of decision trees

WEKA

●

“Cluster” panel

●

Configuring and running any clusterer on the current dataset

●

Visualizing clusters in a pop-up tool

WEKA

●

“Associate” panel

●

Discovering association rules in the current dataset using associators

WEKA

●

“Select attributes” panel

●

Configuring and applying any combination of methods for attribute evaluation and search

– Selection of most appropriate ones

– If a selection scheme transforms the data, the transformed data may be visualized in a pop-up tool

WEKA

●

“Visualize” panel

●

Visualizing the confusion matrix of the current dataset

– Adjustable size of cells and points

●

Cursors at bottom – Changeable

number of cells

●

“Select Attributes”

button – On large dataset,

possibility of visualizing only a sub-sample

●

Visualization performance

WEKA

– Enlargement of graphics in the cells

●

Dataset visualization in one or two dimensions. If the coloring attribute is discrete, each value is visualized in a different color; if it is continuous, a spectrum is used to indicate the value – Summary of the

discriminating power of the single attributes

●

Visualization of

predictions in a

window of the

classification and

clustering panels

(20)

Bibliography

●

Tom Mitchell, “Machine Learning”, McGraw-Hill, 1997

●

Christopher M. Bishop, “Pattern Recognition and Machine Learning”, Springer, 2007

●