Machine Learning / 1
Techniques for Automatic Knowledge Acquisition – Concept formation: ordering the hypotheses space for search – Candidate elimination algorithm and Version Space
Laurea in INFORMATICA MAGISTRALE
Corso di ARTIFICIAL INTELLIGENCE
Stefano Ferilli
These slides were prepared for teaching purposes. They contain original material owned by the Università degli Studi di Bari and/or figures owned by other authors, companies and organizations, whose references are reported. All or part of the material may be photocopied for personal or teaching use, but may not be distributed for commercial use. Any other use requires specific authorization from the Università degli Studi di Bari and the other authors involved.
Knowledge Acquisition
●
Case 1
●
Expert interacts with knowledge engineer
KB System Expert Knowledge
Engineer
Inference Engine
Knowledge Base
Knowledge Acquisition
●
Case 2
●
Expert interacts directly with system
– Use of an editing program with dialogue capabilities and knowledge about the structure of the knowledge base
KB System Expert “intelligent” editing
program
Inference Engine
Knowledge Base
Knowledge Acquisition Bottleneck
●
Problem in Decision Support Systems
●
Experts unwilling to share their knowledge
●
Experts unable to formalize their knowledge
●
Experts unaware of details of their own knowledge
●
Solution: Machine Learning (ML)
●
Learning
– Knowledge Acquisition
●
Machine
– Carried out by computers
Knowledge Acquisition
●
Case 3
●
Expert replaced by a program that induces knowledge from large quantities of data
– Expert builds knowledge by analyzing world facts and typical “cases” in the literature
KB System Data Induction
program
Inference Engine
Knowledge Base
Knowledge Acquisition
●
Case 3
●
Expert
– Builds knowledge by analyzing world facts and typical
“cases” in the literature
replaced by a program that induces from big quantities of data
– Developed in Machine Learning research areas
– May find, using statistical routines and conceptual analysis
●
relationships among data,
●
regularities between data
●
rules
– Classification
– Prediction
– ...
Knowledge Acquisition
●
Case 4
●
Knowledge acquisition directly from textbooks
– Use of programs for understanding texts in natural language
KB System Textbooks Text understanding
program
Inference Engine
Knowledge Base
Machine Learning
●
Motivations
●
Knowledge Acquisition Bottleneck
– More comfortable, cheaper (and sometimes more reliable) to acquire knowledge automatically than to encode it manually
●
Advances in the area of algorithms and theoretical and experimental models
– More powerful machines
●
Relevant industrial applications
●
Increasing availability of on-line data
– Wikipedia, Wiktionary, Flickr, Twitter, &c.
– Blogs
– Social Networks
– ...
Learning
●
Extraction of deep knowledge
●
Not superficial
from a kind of situation that has occurred many times
●
Experience may modify the status of an organism in such a way that the new status makes the system capable of working better in a next situation
●
One of the features of intelligent systems
●
Adaptivity
Learning
●
Classical view
●
A system
– Natural or artificial
learns if, using its experience, changes something in its behavior/operation so as to improve its own performance in
– Solving a problem
– Attaining a goal
– Carrying out a task
Learning
●
Reference cognitive theories
●
Behaviorism
– Mental processes reduced to reactions to external stimuli
– Impossible to build a theory of them
●
Cognitivism
– Interaction of mind and brain
– Study of psychological aspects of learning mechanisms
●
Constructivism
– Dynamic mental processes which create new knowledge
●
Explain learning mechanisms
Machine Learning
●
Reproduction on a computer of the intelligent learning capabilities
●
Emulation
●
Simulation
●
“That may be exactly what’s needed for anybody who wants to go into this field [AI], namely, blind optimism with no reasonable basis for it.”
●
Raj Reddy (Turing Award 1994)
Machine Learning
●
A program
●
Learns from experience E
●
With respect to a given set of tasks T
●
And with a performance measure P if
●
Its performance on tasks T
●
Measured through P
●
Improves with experience E
●
Any learning program must identify and define
●
The class of tasks
●
The performance measure to improve
●
The source of experience
Machine Learning
●
Successful applications
●
Word recognition
– Lee, 1989; Waibel, 1989
●
Autonomous drive
– Pomerlau, 1989
●
Classification of new astronomic structures
– Fayyad et al., 1995
●
World-class Backgammon
– Tesauro, 1992; 1995
●
...
●
Go game
– Deepmind, 2016
Learning Agent
●
Experience gained through the agent’s
perceptions used to improve the system’s ability to act in the future
●
Experience → new status → performance
●
Learning takes place
●
As a result of the interaction between agent and world
●
From the agent’s self-observation of the way in which it takes decisions
Learning Agent
●
Components
●
Learning Element
– Learns and improves behavior
●
Performance Element
– The agent itself knows what to do and can evaluate and improve what it does
●
Problem Generator
– Suggests alternative actions to explore and carry out
●
Critical evaluation element
– Provides feedback about how the agent is behaving
Model of Machine Learning System in AI
●
[Simon, 1981]
– Environment provides information to the Learning element, that uses it to improve the base of explicit knowledge
– Knowledge base expresses explicit knowledge in declarative form
●
Indeed, it is separate from the learning module
– Performance element uses the knowledge to carry out at best its task
●
Additionally, the information acquired in the various attempts to carry out the task is used as feedback to the learning element
(Diagram: Environment → Learning element → Knowledge Base → Element for improving performance, with feedback back to the learning element)
Environment
●
Typical assumption:
The system
– Is directly immersed in the environment, from which it receives inputs through its sensors, in the most appropriate form
●
Representation language, grain size
or
– Interacts through an expert that plays the role of the system’s trainer and chooses what and how to represent inputs
●
Flow
Environment → Feature Extraction → Space of Features → Learning System → Rules or classification models, Theories, Relationships, Analogies, etc.
Knowledge Base
●
AI considers explicit forms of knowledge
●
Production rules, logic clauses, etc.
●
The various forms of knowledge representation must be evaluated based on
●
Expressiveness
– Power, understandability, etc.
●
Possibility of making inferences
●
Modifiability
– Update of knowledge guaranteeing conflicts are avoided
●
Extensibility
– Possibility of evaluating and changing a representation by adding new terms and operators
Performance Evaluation
●
Parameters
●
Complexity
– Related to task
●
Classification or prediction
●
Single concept vs multi-concept
●
Conjunctive or disjunctive concepts
–Polymorphism
●
...
– Related to integration
●
Possibility of adding / integrating new rules in a set of existing rules
– Related to incrementality
●
Possibility of adding new observations or new classes / concepts during learning
Performance Evaluation
●
Parameters
●
Feedback
– A learning system must be able to evaluate the hypotheses proposed by the learning element
●
E.g., in classification systems using the percentage of correct classification by exploiting a set of training examples and a set of testing examples
●
In other cases the utility / goodness / significance of the produced classifications can be evaluated
●
Transparency
– Possibility of the learning element to access internal actions of the performance element
●
Useful to
– Understand/evaluate/weigh differently the rules/theories/models in the knowledge base
– Compare alternative paths that might in the long term turn out to be better
Learning Programs
●
Tasks
●
Categorization
●
Modeling
●
Clustering
●
Classification
●
Prediction
Aims of Learning
●
In most cases the task is classification/prediction
●
Supervised learning problem
– Given from the trainer a set of training data as positive and negative examples of a given concept/class/category (belonging or not belonging)
– Find a classifier to predict whether future data belong to a given concept/class/category or not
●
Unsupervised learning problem
– Given, from the environment or a trainer, a set of observations (training data) and a number K of concepts/classes/categories to identify
– Find a classifier that defines how observations best group into the K concepts/classes/categories and formulates a model to predict whether future observations belong to a given concept/class/category or not
Machine Learning
●
Tasks, Goals, Systems
●
Classification and Prediction
– Goal: improve predictive accuracy of the system
●
Performance = recognition rate
●
One-step systems
●
Problem solving
– Goal: increase efficiency of the system that learns to solve problems
●
Multi-step inference systems
●
Knowledge discovery
(knowledge acquisition and concept formation)
– Goal: increase performance in understanding, controlling
and monitoring the environment
Machine Learning
●
What do machines learn?
●
For Classification and Prediction tasks:
– Mathematical models
●
Neural networks
●
Genetic algorithms
●
Bayesian learning
– Logic models
●
Production rules
●
Decision trees
●
Logic theories
–
For classification, prediction, etc.
Machine Learning
●
Classic model
●
Machine Learning algorithms
– Autonomously discover relationships among variables of the system
●
Input, output, hidden
starting from observations of the system
– Inspired by several disciplines
●
Inferential statistics
●
Mathematics
●
Physics
●
Neurosciences
●
Theoretical computer science
●
...
Machine Learning
●
Classic model
●
Task: classification
(Diagram: a System with
Input: x = (x1, x2, ..., xN)
Hidden variables: h = (h1, h2, ..., hK)
Output: y = (y1, y2, ..., yM)
and a Performance evaluator providing Feedback)
Trainable Pattern Classifiers
●
Traditional view of learning from examples
●
Basic model : a device with d input lines and an output line representing the class to which the pattern belongs (a signal that may assume R distinct values)
– Performance measure provides feedback on the learning function by adapting its parameters to improve its performance in time
●
Modeled as a black box that computes a classification function
– The trainer trains the system by providing pre-classified data (examples)
– The system can infer the model (mathematical function) and make predictions on new data
Trainable Pattern Classifiers
●
Some examples
Task | Input data | Output
Weather forecast | Weather measures | Forecasts
Recognition of handwritten characters | Optical signals | Name of character
Medical diagnosis | Symptoms | Disease
Speech recognition | Acoustic waveforms | Name of the word
Speaker recognition | Acoustic waveforms | Name of the speaker
Image classification | Optical signals | Image category
(Diagram: pattern to classify (x1, x2, ..., xd) → classifier → outcome i0 = 1 or 2 or 3 or ... R)
[N.J. Nilsson, “Mathematical Foundations of Learning Machines”, (1st ed. 1965), new ed. 1990, Morgan Kaufmann]
Machine Learning of Mathematical Models from Observations
(UNsupervised)
●
Black box model also applicable to systems that learn in an unsupervised way
●
To infer the classification model
●
The system starts from unclassified data
●
The trainer must provide the number of classes and define utility
●
The system can identify the best partitioning of the
space of data, determine a mathematical function
for classification and subsequently classify new
observations
Machine Learning
Some methods and techniques
●
Neural Modelling (1955-1965) (1980-...)
– Gradual change of probabilities that neuron-type elements transmit a signal
– Based on mathematical-statistical models
●
Techniques from Decision Theory (1955-1965)
– Estimation and updating of parameters in a classification and prediction model, starting from a set of training examples
●
Conceptual learning (1962-1983)
– Use of logics and more complex forms of knowledge representation to obtain high-level descriptions of concepts
●
Intensive knowledge learning (1976-...)
– Use of “deep” knowledge. Exploration of different learning strategies
– Learning with integration of different strategies
Machine Learning in Artificial Intelligence
●
Learning aims at
●
Acquiring knowledge in declarative form
●
Developing “abilities” through education and practice
●
Organizing knowledge in more effective and efficient representations
●
Discovering new facts/theories through observation and experimentation
Lessons Learned thanks to Artificial Intelligence
●
[Charniak E., McDermott D. “Artificial Intelligence”, Addison Wesley, 1984]
●
An organism may learn only if it already possesses much knowledge
●
Learning starts with organized knowledge that grows and becomes more and more organized
●
Nothing can be learned unless strong hints about what one is to learn are already available
Induction
●
The most powerful inference to automatically acquire knowledge
●
Goal: formulate general laws of the world starting from partial observations of the world itself
●
What do machines learn?
– Typical situation
●
Learning concept descriptions from examples
●
Learning classification functions from examples
– Or
●
Generating rules that make accurate predictions on future observations
Inductive Method
●
Given a set F of facts, and, possibly, a set of inference rules having general validity
●
Find an inductive hypothesis that is true on all facts F and is true in the future on all and only those facts that are similar to those in F
●
Example
– Facts
●
Socrates, Aristotle and Plato were philosophers
●
Socrates, Aristotle and Plato all died
– Inference rules
●
Philosophers are men
– Possible hypotheses
●
All philosophers will die
●
All men will die
Inductive Method
●
Used both in numerical/statistical methods and in symbolic/conceptual methods
– Stages
●
Self-organizing systems
– Change to adapt to the environment (Rosenblatt, 1957; Friedberg, 1958; Fogel, Owens & Walsh, 1966)
– A famous paper by Minsky & Papert (1969) highlighted their limits; research on adaptive systems became a branch of research on linear systems theory
●
Knowledge-based systems
– To improve their performance, they must possess specific knowledge of the domain in which they operate (Winston, 1970; Buchanan & Mitchell, 1978; Lenat, 1976)
●
Systems for automatic knowledge acquisition
– Research focuses on how to use several forms of learning to build, refine and maintain knowledge bases in expert systems
●
(see cognitive theories)
Deduction vs Induction
●
Deductive Arguments
●
If all premises are true, the conclusion is necessarily true
●
All information that is implicit in the
conclusion is already included in the premises
●
Inductive Arguments
●
If all premises are true, the conclusion is probably true
●
The conclusion includes information that is not included, not even implicitly, in the premises
(Diagram: Inductive Inference goes from Facts/Observations/Measures/Symptoms to Theories/Hypotheses/Rules/Algorithms; Deductive Inference goes in the opposite direction)
Stages of Learning
●
Training
●
Initial learning process
– Use of a large (or significant) number of typical instances, whose classification is known
●
Training set
●
Testing
●
Validation of the learned classification functions
– Use of additional instances whose classification is known
●
Test set
●
Tuning
●
Further adaptation (in incremental approaches)
– Use of further instances whose classification is known, that become available after a model was learned
Partitioning of Examples
●
Examples must be suitably split in different sets
●
Training set
– on which learning takes place
●
Test set
– including examples not in the training set, on which what was learned is checked
●
Validation set
– Similar role as the test set, but aimed at searching the best parameters for the learning algorithm
– Used when learning depends on empirical parameters
●
Tuning set
– Used in incremental approaches
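The partitioning above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the split fractions, the `partition` helper name and the fixed random seed are our own assumptions; the slides only require that the sets be disjoint and that test examples not appear in the training set.

```python
import random

def partition(examples, train_frac=0.6, valid_frac=0.2, seed=0):
    """Split labeled examples into disjoint training, validation and
    test sets (fractions are illustrative assumptions)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test

data = [(i, i % 2) for i in range(10)]   # toy labeled examples
tr, va, te = partition(data)
assert len(tr) == 6 and len(va) == 2 and len(te) == 2
assert set(tr).isdisjoint(te)            # test examples not in training
```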
Example-driven
Inductive Concept Learning
●
Model
– Given
●
A language for describing examples
●
A language for describing concepts
–Intensional class descriptions
●
A set of positive and negative examples of the concept to learn
●
A function that checks if an example belongs to a concept (matching or coverage)
●
A background knowledge (that includes a preference criterion for hypotheses, known constraints among variables, generalization and specialization rules)
– Determine
●
A hypothesis in the hypotheses space that is consistent and complete with respect to the examples and fulfils the constraints set on the background knowledge
Matching
●
Inductive Inference requires the use of a matching procedure that may be based on a covering relation between a generalized description φ and an instance f, of the kind
φ covers f iff φ(f) = true
●
Computationally feasible problem at the propositional level
●
May be an NP-complete problem in first-order logic
Terminology
●
A hypothesis h covers a positive instance x if it correctly classifies that example as
●
(h(x) = c(x)) = True
●
An example satisfies a hypothesis if it is compliant with the conditions
●
Independently of its being labeled as a positive or negative example for the target concept
●
A hypothesis is consistent with
●
an example <x ,c(x)>, if h(x)=c(x)
●
the training set D, if it is consistent with each example x ∈ D
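The covering, satisfaction and consistency relations above can be written down directly for conjunctive attribute-value hypotheses. A minimal sketch under assumptions of ours: hypotheses are tuples where `'?'` means "don't care" (the notation used later in these slides for EnjoySport), and the helper names are hypothetical.

```python
def satisfies(example, hypothesis):
    """True if the example meets every constraint of the conjunctive
    hypothesis ('?' = don't care), regardless of its label."""
    return all(c == '?' or c == v for c, v in zip(hypothesis, example))

def consistent(hypothesis, labeled_examples):
    """h is consistent with D if h(x) = c(x) for every <x, c(x)> in D:
    it satisfies every positive example and no negative one."""
    return all(satisfies(x, hypothesis) == label
               for x, label in labeled_examples)

h = ('Sunny', '?', '?', 'Strong', '?', '?')
D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
     (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False)]
assert consistent(h, D)  # covers the positive, rejects the negative
```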
Supervised Learning Requirements/Properties
●
Correctness
●
Completeness
– All positive examples of a class are covered
●
I.e., they must satisfy the class description
●
Consistency
– No negative example of a class is covered
●
I.e., none satisfies the class description
Supervised Learning
●
Need for a trade-off between simplicity and predictiveness
●
Disjunctive concepts
– Polymorphism or overfitting?
(Figure: two separate clusters of positive examples among negatives, illustrating a disjunctive concept)
Learning Process Fundamental Elements
●
How to represent
●
Instances-facts (input)
●
Hypotheses-concepts (output)
●
Choose a hypothesis
●
Define a criterion to order them
●
Define good search procedures in the knowledge/hypotheses space
●
Define effective and efficient matching procedures
Representation
●
The role of the language used to describe facts and hypotheses is relevant
●
Questions to devise a suitable description
●
Is a set of object attributes (feature vectors,
attribute-value pairs) sufficient for the learning task?
Are there relationships among objects?
●
How many and what attributes and/or relationships is it necessary and sufficient to include in the representation?
●
Can the inductive assertion (concept) be expressed as the value of an attribute, as a predicate or as the argument of a predicate?
Representation
●
Input
●
0-order logic languages
– Feature vectors
●
(v1, v2, ..., vn), e.g. [shape, size, color, area] → [triangle, medium, red, 0.75]
– (Attribute-Value) pairs
●
(shape = triangle) ∧ (size = medium) ∧ (color = red) ∧ (area = 0.75)
●
Some problems require more sophisticated formalisms
– Not just attributes, but also relationships between objects
– It is convenient to use more powerful notations that make it possible to describe objects with their properties and to express relationships between objects
●
Sets of relational literals
Representation
●
Output
●
To represent the acquired knowledge, many
“symbolic” learning systems, often defined
“conceptual”, express output as conjunctive concepts
– These systems usually make categorizations and, consequently, predictions
●
Concept =
Intensional definition of a class (category)
– Represented extensionally by all instances (observations)
in that class
Description Languages
●
Unstructured domains
●
Input
– Instance vector or <attribute,value> pairs
●
Example
– e1 = <color,red>, <shape,square>, <size,small>, <side_length,2>, <texture,diagonal>
– NO NEED TO EXPRESS RELATIONSHIPS BETWEEN OBJECTS AND CONCEPTS
Description Languages
●
Unstructured domains
●
Output
– Hypotheses/concepts
●
Propositional calculus sufficient to represent descriptions that are combinations of values of the attributes
– Example
●
The concept underlying many geometrical shapes in red color can be expressed as
●
φ = [color = red] ∧ [shape = square ∨ triangle]
●
In conceptual learning, some representations for the outputs can be reduced to these ones
– Decision trees
– Production rules
Description Languages
●
Structured domains
●
Input
– Instance parts, attributes, relationships
– Example
●
e1 = square(a) ∧ triangle(b) ∧ large(a) ∧ small(b) ∧ white(a) ∧ red(b) ∧ ontable(a) ∧ on(b,a)
– NEED TO EXPRESS RELATIONSHIPS AMONG OBJECTS
Description Languages
●
Structured domains
●
Output
– Hypotheses
●
1st-order calculus needed to express descriptions that are combinations of attribute values and relationships among parts
– Example
●
φ = red(Y) ∧ triangle(Y) ∧ square(X) ∧ on(Y,X)
●
A first example of logic language used in learning is VS [Michalski, 1980]
– A first-order language with multi-valued logics, in variable number
Number of Hypotheses Consistent with a Given Number of Examples
●
Consider
●
A universe including m objects
●
A training set of m' ≤ m objects
●
k classes
– Concepts identify classes
●
The number of concepts consistent with the training set is k^(m–m')
●
Examples
– Chess end-game
●
1.8 million positions
●
2 classes (win, lose) in 2 moves
●
2^900K ≈ 10^270,000 possible hypotheses
– Using reduced languages
●
39 binary attributes per task
●
551 distinct descriptions
●
2 classes
●
Again 2^280 ≈ 10^84 consistent hypotheses...
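The count k^(m–m') can be verified by brute force on a tiny universe: enumerate every possible labeling of m objects into k classes and keep those that agree with the training labels. The helper name and the toy numbers are our own; the chess figures above are far beyond enumeration, but the formula is the same.

```python
from itertools import product

def consistent_concepts(k, m, training):
    """Count labelings of an m-object universe into k classes that agree
    with the training labels (a dict: object index -> class)."""
    count = 0
    for labeling in product(range(k), repeat=m):   # all k**m concepts
        if all(labeling[i] == c for i, c in training.items()):
            count += 1
    return count

k, m = 2, 5
training = {0: 1, 1: 0, 2: 1}          # m' = 3 labeled objects
# brute-force count matches the closed form k^(m - m') = 2^2 = 4
assert consistent_concepts(k, m, training) == k ** (m - len(training))
```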
Inductive Learning as Search
●
Learning = goal-driven process to improve the knowledge of the learning system, by exploring the experience and a-priori knowledge of the system
– May be cast as a search in the space of knowledge and inductive hypotheses
– May exploit any kind of inference
– Must take into account the system’s background knowledge and terminates after several “learning loops” if the new knowledge satisfies the learning goal
●
Learning = Inference + Storage
(Diagram: Input and Goals feed an Inferential mechanism that, using Background Knowledge, produces New knowledge stored in Memory)
Inductive Learning as Search
●
Learning can be cast as a problem of Heuristic search in the knowledge space
●
Search space = Hypotheses space
●
Operators = Generalization rules + abduction + abstraction + ...
●
Final state = A hypothesis that is consistent and complete with respect to
the available observations
Inductive Learning as Search
●
Ordering hypotheses
●
“More general than...” criterion
– Extensional case
●
Given hypotheses φ and ψ,
ψ is more general than φ (φ < ψ) iff φ implies ψ
– Example
●
<color = red> is more general than <color = red> ∧ <shape = square>
– Intensional case
(Diagram: hypotheses ordered from General to Specific)
– Example
●
G2 = { (?, ?, circle), (large, ?, ?) }
●
G1 = { (large, red, circle), (large, ?, ?) }
●
G3 = { (?, ?, circle), (large, blue, ?) }
(Diagram: generality ordering among G1, G2, G3)
Ordering Hypotheses
●
Intensional case with first-order languages
●
Implication
– φ < ψ
– ψ more general than φ if and only if φ implies ψ
●
θ-subsumption (Plotkin)
– φ = red(a) ∧ small(a) ∧ on(b,a)
– ψ = red(X) ∧ on(Y,X)
– θ = {a/X, b/Y}
– ψθ ⊆ φ
●
Generalized subsumption (Buntine)
●
θOI-subsumption
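The θ-subsumption test in the Plotkin example above can be checked mechanically for ground clauses: apply the substitution to ψ and test set inclusion in φ. A sketch under assumptions of ours: literals are represented as tuples `(predicate, arg, ...)`, the substitution maps variable names to constants, and the helper names are hypothetical.

```python
def apply_subst(clause, theta):
    """Apply substitution theta (variable -> term) to a set of literals,
    each represented as (predicate, arg1, arg2, ...)."""
    return {(lit[0],) + tuple(theta.get(a, a) for a in lit[1:])
            for lit in clause}

def theta_subsumes(psi, phi, theta):
    """psi theta-subsumes phi under theta iff the literals of psi·theta
    are a subset of the literals of phi."""
    return apply_subst(psi, theta) <= phi

# The example from the slide: psi·theta = {red(a), on(b,a)} ⊆ phi
phi = {('red', 'a'), ('small', 'a'), ('on', 'b', 'a')}
psi = {('red', 'X'), ('on', 'Y', 'X')}
theta = {'X': 'a', 'Y': 'b'}
assert theta_subsumes(psi, phi, theta)
```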
Inductive Learning as Search
●
Search for the “best” hypothesis in the space of hypotheses
●
Several exploration strategies
– Bottom-up or observation-driven
●
Initially select one or more examples, formulate a hypothesis covering them and then generalize (or sometimes specialize) it in order to explain the remaining examples
– Top-down or hypotheses-driven
●
Initially formulate a very general hypothesis and then specialize (or sometimes generalize) it in order to adapt it as much as possible to the observed data
– Bi-directional
Inductive Learning as Search
●
...but what is the “best” hypothesis?
●
Since induction is a search in the space of descriptions, the goal of the search must be specified
●
In addition to completeness and consistency, descriptions of a class of objects in the context of a fixed set of other classes of objects are required to be
– Discriminative
●
A discriminative description defines all properties of the objects in that class necessary to distinguish them from objects in other classes
– Characterizing
●
A characterizing description defines facts and properties that are true for all objects in a class
Inductive Learning as Search
●
Discrimination
●
Characterization
●
MSC (Maximally Specific Characterization)
– The most detailed description that is true for all known objects belonging to the (conjunctive) class
(Figure: the MSC is the tightest boundary around all the positive examples, leaving all the negative examples outside)
Learning Logic Formulas
●
Class description =
Logic formulas + class they belong to
●
<Fi, class>
– (Fi often ground)
– Predicates in Fi = attributes of classes
– Fi consistent with examples
●
True on examples
●
False on counterexamples
– Extension of Fi: set of instances for which Fi is true
Learning Logic Formulas
●
Example
●
Class “wait at restaurant”
– ∀X wait(X) ↔
– noClients(X,some) ∨
– (noClients(X,many) ∧ ¬hungry(X) ∧ type(X,fastfood)) ∨
– (noClients(X,many) ∧ ¬hungry(X) ∧ type(X,thai) ∧ fri-sat(X)) ∨
– (noClients(X,many) ∧ ¬hungry(X) ∧ type(X,burger))
●
Instance
– alternative(x1) ∧ ¬bar(x1) ∧ ¬fri-sat(x1) ∧ hungry(x1) ∧ … classified as wait(x1)
Learning Logic Formulas
●
Generality ordering
●
Given hypotheses H and H’
with extensions e(H) and e(H’)
– H is more specific than H’ (H’ is more general than H)
●
H is a specialization of H’ (H’ is a generalization of H) if and only if e(H) ⊆ e(H’)
●
Example
– cube(X) ∧ red(X) more specific than cube(X)
– cube(X) more specific than cube(X) ∨ pyramid(X)
Learning Logic Formulas
●
Generalization and Specialization to modify formulas
●
Generalization operators
– Dropping condition
●
From cube(X) ∧ red(X) to cube(X)
– Replacing with more general predicates
●
From cube(X) to geometricShape(X)
– If cube(X) is_a geometricShape(X)
●
Specialization operators (dual)
– Adding conditions
– Replacing with more specific predicates
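The operators above act on conjunctions of literals. A minimal sketch of dropping-condition, its dual adding-condition, and predicate climbing along an is_a link; the set-of-strings representation and the tiny taxonomy are illustrative assumptions, not the deck's formalism.

```python
def drop_condition(hypothesis, literal):
    """Generalization: remove one conjunct (dropping-condition rule)."""
    return hypothesis - {literal}

def add_condition(hypothesis, literal):
    """Specialization (dual operator): add one conjunct."""
    return hypothesis | {literal}

is_a = {'cube(X)': 'geometricShape(X)'}   # toy taxonomy (assumed)

def climb_taxonomy(hypothesis, literal):
    """Generalization: replace a predicate with a more general one."""
    return (hypothesis - {literal}) | {is_a[literal]}

h = frozenset({'cube(X)', 'red(X)'})
assert drop_condition(h, 'red(X)') == {'cube(X)'}              # generalize
assert add_condition(frozenset({'cube(X)'}), 'red(X)') == h    # specialize
assert climb_taxonomy(frozenset({'cube(X)'}), 'cube(X)') == {'geometricShape(X)'}
```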
Inductive Learning as Search
●
Generalization
●
Positive examples
– Ex+1: “a white triangle”
●
(shape = triangle) ∧ (color = white)
– Ex+2: “a white square”
●
(shape = square) ∧ (color = white)
●
To produce a hypothesis one may generalize
– “object X is white”
●
H : (color = white)
or, if the system knows that triangle and square are both specializations of polygon, another hypothesis can be formulated
●
H : (color = white) ∧ (shape = polygon)
Inductive Learning as Search
●
Specialization
●
If negative examples (or counterexamples) are available, generalization on positive examples may yield too general concept descriptions, that cover negative examples (over-generalization)
– In such a case, one may specialize
●
Negative example
●
H : (color = white) ∧ (shape = polygon)
– Ex–1: “a white pentagon”
●
(shape = pentagon) ∧ (color = white)
– A more specific hypothesis must be produced, e.g.
●
(shape = polygon) ∧ (color = white) ∧ (no_sides(X) < 5)
Inductive Learning as Search
●
How to reduce the number of consistent hypotheses?
●
Bias
– Any reason to prefer one among different hypotheses that are consistent with the training set
●
[Mitchell, Utgoff, Haussler, Rendell]
– Examples
●
Considering only hypotheses that can be expressed in a given language
●
Preferring hypotheses that have a shorter, simpler description
–Occam’s razor
●
Preferring the first consistent hypothesis found
●
Assuming that the relative frequency of objects of class C in the training set is approximately the same as that of the objects of class C in the universe
Concept Learning
●
Inferring a boolean function from examples
●
Given < P, N, C, L >
– P set of positive examples (instances) of the target concept
– N set of negative examples of the target concept
– C set of concepts (depending on conceptual bias)
– L available language (depending on logic-linguistic bias)
●
Define a relationship r acceptable iff definable in terms of concepts C in language L
●
r characteristic iff satisfied by positive examples
●
r discriminant iff not satisfied by any negative example
●
r admissible iff characteristic and discriminant
●
Inductive bias: the kind of assumptions needed about the nature of the target function (e.g., Occam’s Razor)
Concept Learning
●
Positive and negative examples (instances) of the target concept “enjoy sport”
●
Task:
●
Learn to predict the value of “enjoy sport” for an arbitrary day
Sky | AirTemp | Humid | Wind | Water | Forecst | EnjoySport
Sunny | Warm | Normal | Strong | Warm | Same | Yes
Sunny | Warm | High | Strong | Warm | Same | Yes
Rainy | Cold | High | Strong | Warm | Change | No
Sunny | Warm | High | Strong | Cool | Change | Yes
Concept Learning
●
Representing the hypotheses
●
Many possible representations
●
A hypothesis h is a conjunction of constraints on the attributes of instances
– Each constraint may be:
●
A specific value (“Water=Warm”)
●
Don’t care (“Water=?”)
●
No allowed value (“Water=Ø”)
Sky | Temp | Humid | Wind | Water | Forecst
? | Cold | High | ? | ? | ?
Formalization
●
Given
●
Instances X:
– Possible days described by attributes
●
Sky, Temp, Humidity, Wind, Water, Forecast
●
Training examples D: <xi, c(xi)>
●
Target concept c:
– EnjoySport : X → {0,1}
●
c(x)=1 if EnjoySport=yes
●
c(x)=0 if EnjoySport=no
●
Find
●
Hypothesis H:
– Expressed as a conjunction of constraints on attributes
●
Instances X
●
x1 = <Sunny, Warm, High, Strong, Cool, Same>
●
x2 = <Sunny, Warm, High, Light, Warm, Same>
●
Hypotheses H
●
h1 = <Sunny, ?, ?, Strong, ?, ?>
●
h2 = <Sunny, ?, ?, ?, ?, ?>
●
h3 = <Sunny, ?, ?, ?, Cool, ?>
(Diagram: h2 is the most general hypothesis; x1 satisfies h1, h2 and h3, while x2 satisfies only h2)
●
One may take advantage of the general-to-specific ordering:
●
Starting with the most specific hypothesis possible and generalizing it each time it fails in covering an observed training instance
– Bottom-up search
●
Finding a maximally specific hypothesis
Trace on instances X and hypotheses H, from Specific to General:
h0 = <Ø, Ø, Ø, Ø, Ø, Ø>
x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, + → h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
x2 = <Sunny, Warm, High, Strong, Warm, Same>, + → h2 = <Sunny, Warm, ?, Strong, Warm, Same>
x3 = <Rainy, Cold, High, Strong, Warm, Change>, − → h3 = <Sunny, Warm, ?, Strong, Warm, Same>
x4 = <Sunny, Warm, High, Strong, Cool, Change>, + → h4 = <Sunny, Warm, ?, Strong, ?, ?>
●
Let h = <Ø, Ø, Ø, Ø, Ø, Ø> be the initial hypothesis
– Consider the first example in D:
●
<Sunny, Warm, Normal, Strong, Warm, Same>, +
– The maximally specific hypothesis is
●
h = <Sunny, Warm, Normal, Strong, Warm, Same>
– Consider the second example in D:
●
<Sunny, Warm, High, Strong, Warm, Same>, +
– The maximally specific hypothesis is
●
h = <Sunny, Warm, ?, Strong, Warm, Same>
– The third example is ignored because it is negative
●
<Rainy, Cold, High, Strong, Warm, Change>, -
– Consider the fourth example in D:
●
<Sunny, Warm, High, Strong, Cool, Change>, +
– The maximally specific hypothesis is
●
h = <Sunny, Warm, ?, Strong, ?, ?>
Find-S Algorithm
●
Initialize h to the most specific hypothesis in H
●
For each positive example x
●
For each attribute constraint a_i in h
– If constraint a_i is satisfied by x then
●
Do nothing
– else
●
Replace a_i in h with the next more general constraint satisfied by x
●
Return hypothesis h as the result
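The algorithm above can be sketched in Python for conjunctive attribute-value hypotheses, using the slides' training set D. The function name `find_s` and the symbols 'Ø' (empty constraint) and '?' (any value) are our illustrative choices.

```python
# A minimal sketch of Find-S for conjunctive hypotheses over
# attribute-value instances. 'Ø' marks the empty (most specific)
# constraint, '?' the most general one.

def find_s(examples, n_attributes):
    h = ['Ø'] * n_attributes           # most specific hypothesis
    for x, label in examples:
        if not label:                  # negative examples are ignored
            continue
        for i, a in enumerate(x):
            if h[i] == 'Ø':            # first positive: adopt its values
                h[i] = a
            elif h[i] != a:            # mismatch: generalize to '?'
                h[i] = '?'
    return h

D = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]

print(find_s(D, 6))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```

The output matches the trace above: h4 = ⟨Sunny, Warm, ?, Strong, ?, ?⟩.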
Negative Examples
●
Why does no revision occur on negative examples?
●
Basic assumptions
– The concept to be learned (target) c is in H
– There are no errors in the training data
●
h is the most specific hypothesis in H
– Thus, c ≥_g h (c is more general than, or equal to, h)
●
But
– c will never be satisfied by a negative example
– so neither will h
Negative Examples
●
Use
●
No valid reason to prefer the most specific consistent hypothesis
– How to find the set of all hypotheses consistent with the training set? (Version Space)
– For the sake of simplicity, assume the Version Space has a hypothesis that is the most specific of all
●
I.e., the one returned by Find-S
– A hypothesis in the Version Space is more general than or equivalent to the one returned by Find-S; moreover, it must be consistent with all negative examples
Candidate-Elimination Algorithm
●
Intuition
●
Start with an initial candidate Version Space equivalent to the whole Hypotheses Space
– No example presented yet
●
Then
●
Use positive examples to remove too specific hypotheses,
●
and negative examples to remove too general hypotheses
Candidate-Elimination Algorithm
●
Example
●
[r,s] denotes a card
– r value: 1..13
●
1..10: numbers
●
J, Q, K: figures
– s value: suit
●
Red
– Hearts H
– Diamonds D
●
Black
– Flowers F
– Clubs C
Example
●
Cards described by unary relationships
– num, figure, red, black
●
G
– ∀n ∀z (n ≤ 10 → num([n,z]))
– ∀n ∀z (n > 10 → figure([n,z]))
– ∀n ∀z ((z=C ∨ z=F) → black([n,z]))
– ∀n ∀z ((z=D ∨ z=H) → red([n,z]))
●
D: “winning” and “non-winning” cards
– win([4,F])
– win([7,F])
– win([2,C])
– non-win([5,H])
– non-win([J,C])
Simple inductive conclusion:
“All black, non-figure cards win”
∀x (num(x) ∧ black(x) → win(x))
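The inductive conclusion above can be checked mechanically: encode the background predicates num, figure, black, red and the induced rule, then verify it against the training cards. The function names below are our illustrative encoding; ranks 11..13 stand for J, Q, K.

```python
# Sanity check of the slides' conclusion "all black, non-figure cards win".
# Cards are (rank, suit) pairs; 11, 12, 13 encode J, Q, K.

def num(card):    return card[0] <= 10          # ∀n∀z (n ≤ 10 → num)
def figure(card): return card[0] > 10           # ∀n∀z (n > 10 → figure)
def black(card):  return card[1] in ('F', 'C')  # Flowers, Clubs
def red(card):    return card[1] in ('D', 'H')  # Diamonds, Hearts

def wins(card):
    # induced rule: num(x) ∧ black(x) → win(x)
    return num(card) and black(card)

D = [((4, 'F'), True), ((7, 'F'), True), ((2, 'C'), True),
     ((5, 'H'), False), ((11, 'D'), False)]     # 11 = J

assert all(wins(c) == label for c, label in D)
```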
Example
●
Start building a graph to represent the space of hypotheses that are consistent with the examples
– For the first positive instance (4,F) the graph is
[Graph, most general at top:
(-,-)
(Number,-) (-,Black)
(-,F) (Number,Black)
(Number,F) (4,-) (4,Black)
(4,F)]
– Additional examples help to reduce the space
Example
– Next positive example (7,F); the updated graph is
[Graph, most general at top:
(-,-)
(Number,-) (-,Black)
(-,F) (Number,Black)
(Number,F)]
– Next negative instance (5,H); the graph becomes
[Graph, most general at top:
(-,Black)
(-,F) (Number,Black)
(Number,F)]
Example
– Next positive example (2,C) reduces the graph to
[Graph: (-,Black) above (Number,Black)]
– Finally, (J,C) as a negative example yields the concept node
[Graph: (Number,Black)]
Version Space
●
Version space for a concept formation problem
●
The set of all admissible (characteristic and discriminant) relationships for the problem
– Example
●
Let (4,F) be the only positive instance
●
There are no negative instances
●
It is possible to build a graph of the versions
●
Directed Acyclic Graph
●
Nodes = elements in the version space
●
Arc from p to q IFF
– p is less general than q
– There is no node r more general than p and less general than q
Version Space
●
A version space is well-structured if and only if each chain of relationships has a minimum and a maximum element
●
A relationship is maximally specific and minimum element in the version space if and only if no other less general relationship exists in the space
●
Similarly, a maximum element denotes a maximally general relationship
●
Given a well-structured space V
●
Specific Boundary Set S: the set of minimum elements
●
General Boundary Set G: the set of maximum elements
Version Space
●
A representation in which:
●
Each node is associated with a model
– Write operations are available to connect a node to a model
– Read operations are available to produce a model of a node
●
Links between nodes denote specialization and generalization relationships between the corresponding models
●
There is a specialization tree and a generalization tree
– A node in the generalization tree is connected to a model which covers everything
– A node in the specialization tree is connected to a model which covers a single case
Candidate Elimination Algorithm
●
[Mitchell, 1982]
●
During search, considers only a subset of the space of hypotheses
– The one consisting of all hypotheses that are complete and consistent with respect to the observed examples (Version Space)
●
Only the top and bottom “borders” of such a space are stored
– S = set of most_specific consistent and complete hypotheses
– G = set of most_general consistent and complete hypotheses
●
A generalization H is in the version space (S,G) IFF
– H is more_specific than (or equal to) some element of G
– H is more_general than (or equal to) some element of S
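The membership condition above can be sketched for conjunctive hypotheses, with '?' as the "any value" constraint. The function names and the example boundary sets below are illustrative (in the style of the running example), not taken from the slides.

```python
# Version-space membership test: h belongs to (S, G) iff some g in G is
# at least as general as h, and h is at least as general as some s in S.

def more_general_or_equal(g, h):
    """True iff hypothesis g is more general than or equal to h."""
    return all(gi == '?' or gi == hi for gi, hi in zip(g, h))

def in_version_space(h, S, G):
    return (any(more_general_or_equal(g, h) for g in G) and
            any(more_general_or_equal(h, s) for s in S))

# Illustrative boundary sets
S = [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
G = [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]

print(in_version_space(('Sunny', 'Warm', '?', '?', '?', '?'), S, G))  # True
```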
Candidate Elimination Algorithm
●
Boundary sets are updated when a new positive or negative instance is available
●
Positive example x
– pg(x,S,G) = { g ∈ G | g(x) }
●
Update the general boundary set by removing the elements that do not cover the new example:
– ps(x,S,G) = { r | pup(x,S,G,r) }
●
Positive updating (pup): update the specific boundary set by including a relationship if and only if:
– It is an element of the old specific boundary set or a generalization thereof
– It is a specialization of an element of the new general boundary set
– It covers the new instance
– There is no more specific relationship with the above properties
Candidate Elimination Algorithm
– Analogously/symmetrically,
●
Negative example x
– ns(x,S,G) = { s ∈ S | ¬s(x) }
●
Update the specific boundary set by pruning from the old S the elements that cover the negative instance
– ng(x,S,G) = { r | nup(x,S,G,r) }
●
Negative updating (nup): include a relationship in the general boundary set ng(x,S,G) if and only if:
– It is in the old general boundary set or a specialization thereof
– It is a generalization of an element of the new specific boundary set
– It does not cover the new instance
– There is no more general relationship with the above properties
Candidate Elimination Algorithm
●
Basic theorems
●
Given a concept formation problem <P,N,C,L> with a well- structured version space V and Boundary Sets (S,G)
– Boundary Set Theorem
●
A relationship r is in V if and only if it is connected downwards to an element in S and upwards to an element in G
– Candidate Elimination Theorem
●
ps(x,S,G) and pg(x,S,G) are the Boundary Sets for the updated problem <P∪{x},N,C,L>
●
ns(x,S,G) and ng(x,S,G) are the Boundary Sets for the new problem <P,N∪{x},C,L>
Candidate Elimination Algorithm
●
Summary
– Initialize G as the set of maximally general hypotheses and S as the set of the maximally specific hypotheses
– While there are examples
●
Consider the next example
●
If e is negative (e-)
– Keep in S the generalizations that do not cover e-
– Specialize the hypotheses in G that cover e-, so as to exclude e- while keeping them more general than other hypotheses in S
– Remove from G the hypotheses that are more specific than others in G
●
If e is positive (e+)
– Keep in G the generalizations that cover e+
– Generalize the hypotheses in S that do not cover e+, so as to cover e+ while keeping them more specific than other hypotheses in G
– Remove from S the hypotheses that are more general than others in S
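The summary above can be sketched as a small Python program for conjunctive attribute-value hypotheses ('?' = any value, 'Ø' = empty constraint). This is an illustrative version, not Mitchell's original code; specialization of G is guided by the attribute values appearing in S.

```python
# A minimal sketch of Candidate Elimination for conjunctive hypotheses.

def covers(h, x):
    """True iff hypothesis h covers instance x."""
    return all(hi == '?' or hi == xi for hi, xi in zip(h, x))

def more_general_or_equal(g, h):
    return all(gi == '?' or gi == hi for gi, hi in zip(g, h))

def min_generalize(s, x):
    """Minimal generalization of s that covers x."""
    return tuple(xi if si == 'Ø' else (si if si == xi else '?')
                 for si, xi in zip(s, x))

def candidate_elimination(examples, n):
    S = [('Ø',) * n]                   # maximally specific boundary
    G = [('?',) * n]                   # maximally general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]          # keep g covering e+
            S = [s if covers(s, x) else min_generalize(s, x) for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:
            S = [s for s in S if not covers(s, x)]      # keep s excluding e-
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # minimally specialize g, guided by the values in S
                for i in range(n):
                    if g[i] != '?':
                        continue
                    for s in S:
                        if s[i] not in ('?', 'Ø') and s[i] != x[i]:
                            cand = g[:i] + (s[i],) + g[i + 1:]
                            if cand not in new_G:
                                new_G.append(cand)
            G = [g for g in new_G                       # keep maximal elements
                 if not any(h != g and more_general_or_equal(h, g)
                            for h in new_G)]
    return S, G

D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
     (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
     (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True)]

S, G = candidate_elimination(D, 6)
print(S)   # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(G)   # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```

On the Find-S training set D, the final boundaries match the classical result: S contains the Find-S hypothesis and G the two maximally general consistent hypotheses.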
Candidate Elimination Algorithm
●
Advantages
●
Incremental
●
Bi-directional search
●
Storage optimization
●
Disadvantages
●
Needs a complete set of examples
●
Does not produce disjunctive concepts
References
●
Carnap R., Jeffrey R.C.
– “Studies in Inductive Logic and Probability”, Berkeley, University of California Press, 1971
●
Mitchell, T.M.
– “Version spaces: an approach to concept learning”, 5th IJCAI, MIT, Cambridge, MA, 305-310, 1977
– “Generalization as Search”, Artificial Intelligence, 1982
●
Hirsh H.
– “Incremental version space merging: a general framework for concept learning”, Kluwer, 1990
– “Generalizing version spaces”, Machine Learning, 1994
●
Vere, S.A.
– “Induction of concepts in predicate calculus”, 4th IJCAI, Tbilisi, USSR, 281-287, 1975
References
●
T.M. Mitchell
●
"Machine Learning", McGraw Hill, 1997
●
J.C. Carbonell, R.S. Michalski, T.M. Mitchell
●