Machine Learning / 1
Techniques for Automatic Knowledge Acquisition – Concept formation: ordering the hypotheses space for search – Candidate elimination algorithm and Version Space
Laurea in INFORMATICA MAGISTRALE
Corso di ARTIFICIAL INTELLIGENCE
Stefano Ferilli
These slides were prepared for teaching purposes. They contain original material owned by the Università degli Studi di Bari and/or figures owned by other authors, companies and organizations, whose references are reported. All or part of the material may be photocopied for personal or teaching use, but may not be distributed for commercial use. Any other use requires specific authorization from the Università degli Studi di Bari and the other authors involved.
Knowledge Acquisition
●
Case 1
●
Expert interacts with knowledge engineer
KB System Expert Knowledge
Engineer
Inference Engine
Knowledge Base
Knowledge Acquisition
●
Case 2
●
Expert interacts directly with system
– Use of an editing program with dialogue capabilities and knowledge about the structure of the knowledge base
KB System Expert “intelligent” editing
program
Inference Engine
Knowledge Base
Knowledge Acquisition Bottleneck
●
Problem in Decision Support Systems
●
Experts unwilling to share their knowledge
●
Experts unable to formalize their knowledge
●
Experts unaware of details of their own knowledge
●
Solution: Machine Learning (ML)
●
Learning
– Knowledge Acquisition
●
Machine
– Carried out by computers
Knowledge Acquisition
●
Case 3
●
Expert replaced by a program that induces knowledge from large quantities of data
– Expert builds knowledge by analyzing world facts and typical “cases” in the literature
KB System Data Induction
program
Inference Engine
Knowledge Base
Knowledge Acquisition
●
Case 3
●
Expert
– Builds knowledge by analyzing world facts and typical
“cases” in the literature
replaced by a program that induces from big quantities of data
– Developed in Machine Learning research areas
– May find, using statistical routines and conceptual analysis
●
relationships among data,
●
regularities between data
●
rules
– Classification
– Prediction
– ...
Knowledge Acquisition
●
Case 4
●
Knowledge acquisition directly from textbooks
– Use of programs for understanding texts in natural language
KB System Textbooks Text understanding
program
Inference Engine
Knowledge Base
Machine Learning
●
Motivations
●
Knowledge Acquisition Bottleneck
– More comfortable, cheaper (and sometimes more reliable) to acquire knowledge automatically than to encode it manually
●
Advances in the area of algorithms and theoretical and experimental models
– More powerful machines
●
Relevant industrial applications
●
Increasing availability of on-line data
– Wikipedia, Wiktionary, Flickr, Twitter, &c.
– Blogs
– Social Networks
– ...
Learning
●
Extraction of deep knowledge
●
Not superficial
from a kind of situation that has occurred many times
●
Experience may modify the status of an organism in such a way that the new status makes the system capable of working better in a next situation
●
One of the features of intelligent systems
●
Adaptivity
Learning
●
Classical view
●
A system
– Natural or artificial
learns if, using its experience, changes something in its behavior/operation so as to improve its own performance in
– Solving a problem
– Attaining a goal
– Carrying out a task
Learning
●
Reference cognitive theories
●
Behaviorism
– Mental processes reduced to reactions to external stimuli
– Impossible to build a theory of them
●
Cognitivism
– Interaction of mind and brain
– Study of psychological aspects of learning mechanisms
●
Constructivism
– Dynamic mental processes which create new knowledge
●
Explain learning mechanisms
Machine Learning
●
Reproduction on a computer of the intelligent learning capabilities
●
Emulation
●
Simulation
●
“That may be exactly what’s needed for anybody who wants to go into this field [AI], namely, blind optimism with no reasonable basis for it.”
●
Raj Reddy (Turing Award 1994)
Machine Learning
●
A program
●
Learns from experience E
●
With respect to a given set of tasks T
●
And with a performance measure P if
●
Its performance on tasks T
●
Measured through P
●
Improves with experience E
●
Any learning program must identify and define
●
The class of tasks
●
The performance measure to improve
●
The source of experience
Machine Learning
●
Successful applications
●
Word recognition
– Lee, 1989; Waibel, 1989
●
Autonomous drive
– Pomerlau, 1989
●
Classification of new astronomic structures
– Fayyad et al., 1995
●
World-class Backgammon
– Tesauro, 1992; 1995
●
...
●
Go game
– Deepmind, 2016
Learning Agent
●
Experience gained through the agent’s
perceptions used to improve the system’s ability to act in the future
●
Experience → new status → performance
●
Learning takes place
●
As a result of the interaction between agent and world
●
From the agent’s self-observation of the way in which it takes decisions
Learning Agent
●
Components
●
Learning Element
– Learns and improves behavior
●
Performance Element
– The agent itself knows what to do and can evaluate and improve what it does
●
Problem Generator
– Suggests alternative actions to explore and carry out
●
Critical evaluation element
– Provides feedback about how the agent is behaving
Model of Machine Learning System in AI
●
[Simon, 1981]
– Environment provides information to the Learning element, that uses it to improve the base of explicit knowledge
– Knowledge base expresses explicit knowledge in declarative form
●
Indeed, it is separate from the learning module
– Performance element uses the knowledge to carry out at best its task
●
Additionally, the information acquired in the various attempts to carry out the task is used as feedback to the learning element
(Diagram: Environment → Learning element → Knowledge Base → Element for improving performance, with feedback back to the learning element)
Environment
●
Typical assumption:
The system
– Is directly immersed in the environment, from which it receives inputs through its sensors, in the most appropriate form
●
Representation language, grain size
or
– Interacts through an expert that plays the role of the system’s trainer and chooses what and how to represent inputs
●
Flow
Environment → Feature Extraction → Space of Features → Learning System → Rules or classification models, Theories, Relationships, Analogies, etc.
Knowledge Base
●
AI considers explicit forms of knowledge
●
Production rules, logic clauses, etc.
●
The various forms of knowledge representation must be evaluated based on
●
Expressiveness
– Power, understandability, etc.
●
Possibility of making inferences
●
Modifiability
– Update of knowledge guaranteeing conflicts are avoided
●
Extensibility
– Possibility of evaluating and changing a representation by adding new terms and operators
Performance Evaluation
●
Parameters
●
Complexity
– Related to task
●
Classification or prediction
●
Single concept vs multi-concept
●
Conjunctive or disjunctive concepts
–Polymorphism
●
...
– Related to integration
●
Possibility of adding / integrating new rules in a set of existing rules
– Related to incrementality
●
Possibility of adding new observations or new classes / concepts during learning
Performance Evaluation
●
Parameters
●
Feedback
– A learning system must be able to evaluate the hypotheses proposed by the learning element
●
E.g., in classification systems using the percentage of correct classification by exploiting a set of training examples and a set of testing examples
●
In other cases the utility / goodness / significance of the produced classifications can be evaluated
●
Transparency
– Possibility of the learning element to access internal actions of the performance element
●
Useful to
– Understand/evaluate/weigh differently the rules/theories/models in the knowledge base
– Compare alternative paths that might in the long term turn out to be better
Learning Programs
●
Tasks
●
Categorization
●
Modeling
●
Clustering
●
Classification
●
Prediction
Aims of Learning
●
In most cases the task is classification/prediction
●
Supervised learning problem
– Given from the trainer a set of training data as positive and negative examples of a given concept/class/category (belonging or not belonging)
– Find a classifier to predict whether future data belong to a given concept/class/category or not
●
Unsupervised learning problem
– Given, from the environment or a trainer, a set of observations (training data) and a number K of concepts/classes/categories to identify
– Find a classifier that defines how observations best group into the K concepts/classes/categories and formulates a model to predict whether future observations belong to a given concept/class/category or not
Machine Learning
●
Tasks, Goals, Systems
●
Classification and Prediction
– Goal: improve predictive accuracy of the system
●
Performance = recognition rate
●
One-step systems
●
Problem solving
– Goal: increase efficiency of the system that learns to solve problems
●
Multi-step inference systems
●
Knowledge discovery
(knowledge acquisition and concept formation)
– Goal: increase performance in understanding, controlling
and monitoring the environment
Machine Learning
●
What do machines learn?
●
For Classification and Prediction tasks:
– Mathematical models
●
Neural networks
●
Genetic algorithms
●
Bayesian learning
– Logic models
●
Production rules
●
Decision trees
●
Logic theories
–
For classification, prediction, etc.
Machine Learning
●
Classic model
●
Machine Learning algorithms
– Autonomously discover relationships among variables of the system
●
Input, output, hidden
starting from observations of the system
– Inspired by several disciplines
●
Inferential statistics
●
Mathematics
●
Physics
●
Neurosciences
●
Theoretical computer science
●
...
Machine Learning
●
Classic model
●
Task: classification
(Diagram: a System with
Input: x = (x1, x2, ..., xN)
Hidden variables: h = (h1, h2, ..., hK)
Output: y = (y1, y2, ..., yM)
and a Performance evaluator providing Feedback)
Trainable Pattern Classifiers
●
Traditional view of learning from examples
●
Basic model : a device with d input lines and an output line representing the class to which the pattern belongs (a signal that may assume R distinct values)
– Performance measure provides feedback on the learning function by adapting its parameters to improve its performance in time
●
Modeled as a black box that computes a classification function
– The trainer trains the system by providing pre-classified data (examples)
– The system can infer the model (mathematical function) and make predictions on new data
Trainable Pattern Classifiers
●
Some examples
Task | Input data | Output
Weather forecast | Weather measures | Forecasts
Recognition of handwritten characters | Optical signals | Name of character
Medical diagnosis | Symptoms | Disease
Speech recognition | Acoustic waveforms | Name of the word
Speaker recognition | Acoustic waveforms | Name of the speaker
Image classification | Optical signals | Image category
(Diagram: pattern to classify (x1, x2, ..., xd) → classifier → outcome i0 = 1 or 2 or 3 or ... R)
[N.J. Nilsson, “Mathematical Foundations of Learning Machines”, (1st ed. 1965), new ed. 1990, Morgan Kaufmann]
Machine Learning of Mathematical Models from Observations
(UNsupervised)
●
Black box model also applicable to systems that learn in an unsupervised way
●
To infer the classification model
●
The system starts from unclassified data
●
The trainer must provide the number of classes and define utility
●
The system can identify the best partitioning of the
space of data, determine a mathematical function
for classification and subsequently classify new
observations
Machine Learning
Some methods and techniques
●
Neural Modelling (1955-1965) (1980-...)
– Gradual change of probabilities that neuron-type elements transmit a signal
– Based on mathematical-statistical models
●
Techniques from Decision Theory (1955-1965)
– Estimation and updating of parameters in a classification and prediction model, starting from a set of training examples
●
Conceptual learning (1962-1983)
– Use of logics and more complex forms of knowledge representation to obtain high-level descriptions of concepts
●
Intensive knowledge learning (1976-...)
– Use of “deep” knowledge. Exploration of different learning strategies
– Learning with integration of different strategies
Machine Learning in Artificial Intelligence
●
Learning aims at
●
Acquiring knowledge in declarative form
●
Developing “abilities” through education and practice
●
Organizing knowledge in more effective and efficient representations
●
Discovering new facts/theories through observation and experimentation
Lessons Learned thanks to Artificial Intelligence
●
[Charniak E., McDermott D. “Artificial Intelligence”, Addison Wesley, 1984]
●
An organism may learn only if it already possesses much knowledge
●
Learning starts with organized knowledge that grows and becomes more and more organized
●
Nothing can be learned unless strong hints about what one is to learn are already available
Induction
●
The most powerful inference to automatically acquire knowledge
●
Goal: formulate general laws of the world starting from partial observations of the world itself
●
What do machines learn?
– Typical situation
●
Learning concept descriptions from examples
●
Learning classification functions from examples
– Or
●
Generating rules that make accurate predictions on future observations
Inductive Method
●
Given a set F of facts, and, possibly, a set of inference rules having general validity
●
Find an inductive hypothesis that is true on all facts F and is true in the future on all and only those facts that are similar to those in F
●
Example
– Facts
●
Socrates, Aristotle and Plato were philosophers
●
Socrates, Aristotle and Plato all died
– Inference rules
●
Philosophers are men
– Possible hypotheses
●
All philosophers will die
●
All men will die
Inductive Method
●
Used both in numerical/statistical methods and in symbolic/conceptual methods
– Stages
●
Self-organizing systems
– Change to adapt to the environment (Rosenblatt, 1957; Friedberg, 1958; Fogel, Owens & Walsh, 1966)
– A famous paper by Minsky & Papert (1969) highlighted their limits; research on adaptive systems became a branch of research on linear systems theory
●
Knowledge-based systems
– To improve their performance, they must possess specific knowledge of the domain in which they operate (Winston, 1970; Buchanan & Mitchell, 1978; Lenat, 1976)
●
Systems for automatic knowledge acquisition
– Research focuses on how to use several forms of learning to build, refine and maintain knowledge bases in expert systems
●
(see cognitive theories)
Deduction vs Induction
●
Deductive Arguments
●
If all premises are true, the conclusion is necessarily true
●
All information that is implicit in the
conclusion is already included in the premises
●
Inductive Arguments
●
If all premises are true, the conclusion is probably true
●
The conclusion includes information that is not included, not even implicitly, in the premises
(Diagram: Inductive Inference goes from Facts/Observations/Measures/Symptoms to Theories/Hypotheses/Rules/Algorithms; Deductive Inference goes in the opposite direction)
Stages of Learning
●
Training
●
Initial learning process
– Use of a large (or significant) number of typical instances, whose classification is known
●
Training set
●
Testing
●
Validation of the learned classification functions
– Use of additional instances whose classification is known
●
Test set
●
Tuning
●
Further adaptation (in incremental approaches)
– Use of further instances whose classification is known, that become available after a model was learned
Partitioning of Examples
●
Examples must be suitably split in different sets
●
Training set
– on which learning takes place
●
Test set
– including examples not in the training set, on which what was learned is checked
●
Validation set
– Similar role as the test set, but aimed at searching the best parameters for the learning algorithm
– Used when learning depends on empirical parameters
●
Tuning set
– Used in incremental approaches
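The partitioning above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the split fractions, the `partition` helper name and the fixed random seed are our own assumptions; the slides only require that the sets be disjoint and that test examples not appear in the training set.

```python
import random

def partition(examples, train_frac=0.6, valid_frac=0.2, seed=0):
    """Split labeled examples into disjoint training, validation and
    test sets (fractions are illustrative assumptions)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test

data = [(i, i % 2) for i in range(10)]   # toy labeled examples
tr, va, te = partition(data)
assert len(tr) == 6 and len(va) == 2 and len(te) == 2
assert set(tr).isdisjoint(te)            # test examples not in training
```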
Example-driven
Inductive Concept Learning
●
Model
– Given
●
A language for describing examples
●
A language for describing concepts
–Intensional class descriptions
●
A set of positive and negative examples of the concept to learn
●
A function that checks if an example belongs to a concept (matching or coverage)
●
A background knowledge (that includes a preference criterion for hypotheses, known constraints among variables, generalization and specialization rules)
– Determine
●
A hypothesis in the hypotheses space that is consistent and complete with respect to the examples and fulfils the constraints set on the background knowledge
Matching
●
Inductive Inference requires the use of a matching procedure that may be based on a covering relation between a generalized description φ and an instance f, of the kind
φ covers f iff φ(f) = true
●
Computationally feasible problem at the propositional level
●
May be an NP-complete problem in first-order logic
Terminology
●
A hypothesis h covers a positive instance x if it correctly classifies that example as
●
(h(x) = c(x)) = True
●
An example satisfies a hypothesis if it is compliant with the conditions
●
Independently of its being labeled as a positive or negative example for the target concept
●
A hypothesis is consistent with
●
an example <x ,c(x)>, if h(x)=c(x)
●
the training set D, if it is consistent with each example x ∈ D
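The covering, satisfaction and consistency relations above can be written down directly for conjunctive attribute-value hypotheses. A minimal sketch under assumptions of ours: hypotheses are tuples where `'?'` means "don't care" (the notation used later in these slides for EnjoySport), and the helper names are hypothetical.

```python
def satisfies(example, hypothesis):
    """True if the example meets every constraint of the conjunctive
    hypothesis ('?' = don't care), regardless of its label."""
    return all(c == '?' or c == v for c, v in zip(hypothesis, example))

def consistent(hypothesis, labeled_examples):
    """h is consistent with D if h(x) = c(x) for every <x, c(x)> in D:
    it satisfies every positive example and no negative one."""
    return all(satisfies(x, hypothesis) == label
               for x, label in labeled_examples)

h = ('Sunny', '?', '?', 'Strong', '?', '?')
D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
     (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False)]
assert consistent(h, D)  # covers the positive, rejects the negative
```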
Supervised Learning Requirements/Properties
●
Correctness
●
Completeness
– All positive examples of a class are covered
●
I.e., they must satisfy the class description
●
Consistency
– No negative example of a class is covered
●
I.e., none satisfies the class description
Supervised Learning
●
Need for a trade-off between simplicity and predictiveness
●
Disjunctive concepts
– Polymorphism or overfitting?
(Figure: two separate clusters of positive examples among negatives, illustrating a disjunctive concept)
Learning Process Fundamental Elements
●
How to represent
●
Instances-facts (input)
●
Hypotheses-concepts (output)
●
Choose a hypothesis
●
Define a criterion to order them
●
Define good search procedures in the knowledge/hypotheses space
●
Define effective and efficient matching procedures
Representation
●
The role of the language used to describe facts and hypotheses is relevant
●
Questions to devise a suitable description
●
Is a set of object attributes (feature vectors,
attribute-value pairs) sufficient for the learning task?
Are there relationships among objects?
●
How many and what attributes and/or relationships is it necessary and sufficient to include in the representation?
●
Can the inductive assertion (concept) be expressed as the value of an attribute, as a predicate or as the argument of a predicate?
Representation
●
Input
●
0-order logic languages
– Feature vectors
●
(v1, v2, ..., vn), e.g. [shape, size, color, area] → [triangle, medium, red, 0.75]
– (Attribute-Value) pairs
●
(shape = triangle) ∧ (size = medium) ∧ (color = red) ∧ (area = 0.75)
●
Some problems require more sophisticated formalisms
– Not just attributes, but also relationships between objects
– It is convenient to use more powerful notations that make it possible to describe objects with their properties and to express relationships between objects
●
Sets of relational literals
Representation
●
Output
●
To represent the acquired knowledge, many
“symbolic” learning systems, often defined
“conceptual”, express output as conjunctive concepts
– These systems usually make categorizations and, consequently, predictions
●
Concept =
Intensional definition of a class (category)
– Represented extensionally by all instances (observations)
in that class
Description Languages
●
Unstructured domains
●
Input
– Instance vector or <attribute,value> pairs
●
Example
– e1 = <color,red>, <shape,square>, <size,small>, <side_length,2>, <texture,diagonal>
– NO NEED TO EXPRESS RELATIONSHIPS BETWEEN OBJECTS AND CONCEPTS
Description Languages
●
Unstructured domains
●
Output
– Hypotheses/concepts
●
Propositional calculus sufficient to represent descriptions that are combinations of values of the attributes
– Example
●
The concept underlying many geometrical shapes in red color can be expressed as
●
φ = [color = red] ∧ [shape = square ∨ triangle]
●
In conceptual learning, some representations for the outputs can be reduced to these ones
– Decision trees
– Production rules
Description Languages
●
Structured domains
●
Input
– Instance parts, attributes, relationships
– Example
●
e1 = square(a) ∧ triangle(b) ∧ large(a) ∧ small(b) ∧ white(a) ∧ red(b) ∧ ontable(a) ∧ on(b,a)
– NEED TO EXPRESS RELATIONSHIPS AMONG OBJECTS
Description Languages
●
Structured domains
●
Output
– Hypotheses
●
1st-order calculus needed to express descriptions that are combinations of attribute values and relationships among parts
– Example
●
φ = red(Y) ∧ triangle(Y) ∧ square(X) ∧ on(Y,X)
●
A first example of logic language used in learning is VS [Michalski, 1980]
– A first-order language with multi-valued logics, in variable number
Number of Hypotheses Consistent with a Given Number of Examples
●
Consider
●
A universe including m objects
●
A training set of m' ≤ m objects
●
k classes
– Concepts identify classes
●
The number of concepts consistent with the training set is k^(m–m')
●
Examples
– Chess end-game
●
1.8 million positions
●
2 classes (win, lose) in 2 moves
●
2^900K ≈ 10^270,000 possible hypotheses
– Using reduced languages
●
39 binary attributes per task
●
551 distinct descriptions
●
2 classes
●
Again 2^280 ≈ 10^84 consistent hypotheses...
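The count k^(m–m') can be verified by brute force on a tiny universe: enumerate every possible labeling of m objects into k classes and keep those that agree with the training labels. The helper name and the toy numbers are our own; the chess figures above are far beyond enumeration, but the formula is the same.

```python
from itertools import product

def consistent_concepts(k, m, training):
    """Count labelings of an m-object universe into k classes that agree
    with the training labels (a dict: object index -> class)."""
    count = 0
    for labeling in product(range(k), repeat=m):   # all k**m concepts
        if all(labeling[i] == c for i, c in training.items()):
            count += 1
    return count

k, m = 2, 5
training = {0: 1, 1: 0, 2: 1}          # m' = 3 labeled objects
# brute-force count matches the closed form k^(m - m') = 2^2 = 4
assert consistent_concepts(k, m, training) == k ** (m - len(training))
```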
Inductive Learning as Search
●
Learning = goal-driven process to improve the knowledge of the learning system, by exploring the experience and a-priori knowledge of the system
– May be cast as a search in the space of knowledge and inductive hypotheses
– May exploit any kind of inference
– Must take into account the system’s background knowledge and terminates after several “learning loops” if the new knowledge satisfies the learning goal
●
Learning = Inference + Storage
(Diagram: Input and Goals feed an Inferential mechanism that, using Background Knowledge, produces New knowledge stored in Memory)
Inductive Learning as Search
●
Learning can be cast as a problem of Heuristic search in the knowledge space
●
Search space = Hypotheses space
●
Operators = Generalization rules + abduction + abstraction + ...
●
Final state = A hypothesis that is consistent and complete with respect to
the available observations
Inductive Learning as Search
●
Ordering hypotheses
●
“More general than...” criterion
– Extensional case
●
Given hypotheses φ and ψ,
ψ is more general than φ (φ < ψ) iff φ implies ψ
– Example
●
<color = red> is more general than <color = red> ∧ <shape = square>
– Intensional case
(Diagram: hypotheses ordered from General to Specific)
– Example
●
G2 = { (?, ?, circle), (large, ?, ?) }
●
G1 = { (large, red, circle), (large, ?, ?) }
●
G3 = { (?, ?, circle), (large, blue, ?) }
(Diagram: generality ordering among G1, G2, G3)
Ordering Hypotheses
●
Intensional case with first-order languages
●
Implication
– φ < ψ
– ψ more general than φ if and only if φ implies ψ
●
θ-subsumption (Plotkin)
– φ = red(a) ∧ small(a) ∧ on(b,a)
– ψ = red(X) ∧ on(Y,X)
– θ = {a/X, b/Y}
– ψθ ⊆ φ
●
Generalized subsumption (Buntine)
●
θOI-subsumption
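The θ-subsumption test in the Plotkin example above can be checked mechanically for ground clauses: apply the substitution to ψ and test set inclusion in φ. A sketch under assumptions of ours: literals are represented as tuples `(predicate, arg, ...)`, the substitution maps variable names to constants, and the helper names are hypothetical.

```python
def apply_subst(clause, theta):
    """Apply substitution theta (variable -> term) to a set of literals,
    each represented as (predicate, arg1, arg2, ...)."""
    return {(lit[0],) + tuple(theta.get(a, a) for a in lit[1:])
            for lit in clause}

def theta_subsumes(psi, phi, theta):
    """psi theta-subsumes phi under theta iff the literals of psi·theta
    are a subset of the literals of phi."""
    return apply_subst(psi, theta) <= phi

# The example from the slide: psi·theta = {red(a), on(b,a)} ⊆ phi
phi = {('red', 'a'), ('small', 'a'), ('on', 'b', 'a')}
psi = {('red', 'X'), ('on', 'Y', 'X')}
theta = {'X': 'a', 'Y': 'b'}
assert theta_subsumes(psi, phi, theta)
```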
Inductive Learning as Search
●
Search for the “best” hypothesis in the space of hypotheses
●
Several exploration strategies
– Bottom-up or observation-driven
●
Initially select one or more examples, formulate a hypothesis covering them and then generalize (or sometimes specialize) it in order to explain the remaining examples
– Top-down or hypotheses-driven
●
Initially formulate a very general hypothesis and then specialize (or sometimes generalize) it in order to adapt it as much as possible to the observed data
– Bi-directional
Inductive Learning as Search
●
...but what is the “best” hypothesis?
●
Since induction is a search in the space of descriptions, the goal of the search must be specified
●
In addition to completeness and consistency, descriptions of a class of objects in the context of a fixed set of other classes of objects are required to be
– Discriminative
●
A discriminative description defines all properties of the objects in that class necessary to distinguish them from objects in other classes
– Characterizing
●
A characterizing description defines facts and properties that are true for all objects in a class
Inductive Learning as Search
●
Discrimination
●
Characterization
●
MSC (Maximally Specific Characterization)
– The most detailed description that is true for all known objects belonging to the (conjunctive) class
(Figure: the MSC is the tightest boundary around all the positive examples, leaving all the negative examples outside)
Learning Logic Formulas
●
Class description =
Logic formulas + class they belong to
●
<Fi, class>
– (Fi often ground)
– Predicates in Fi = attributes of classes
– Fi consistent with examples
●
True on examples
●
False on counterexamples
– Extension of Fi: set of instances for which Fi is true
Learning Logic Formulas
●
Example
●
Class “wait at restaurant”
– ∀X wait(X) ↔
– noClients(X,some) ∨
– (noClients(X,many) ∧ ¬hungry(X) ∧ type(X,fastfood)) ∨
– (noClients(X,many) ∧ ¬hungry(X) ∧ type(X,thai) ∧ fri-sat(X)) ∨
– (noClients(X,many) ∧ ¬hungry(X) ∧ type(X,burger))
●
Instance
– alternative(x1) ∧ ¬bar(x1) ∧ ¬fri-sat(x1) ∧ hungry(x1) ∧ … classified as wait(x1)
Learning Logic Formulas
●
Generality ordering
●
Given hypotheses H and H’
with extensions e(H) and e(H’)
– H is more specific than H’ (H’ is more general than H)
●
H is a specialization of H’ (H’ is a generalization of H) if and only if e(H) ⊆ e(H’)
●
Example
– cube(X) ∧ red(X) more specific than cube(X)
– cube(X) more specific than cube(X) ∨ pyramid(X)
Learning Logic Formulas
●
Generalization and Specialization to modify formulas
●
Generalization operators
– Dropping condition
●
From cube(X) ∧ red(X) to cube(X)
– Replacing with more general predicates
●
From cube(X) to geometricShape(X)
– If cube(X) is_a geometricShape(X)
●
Specialization operators (dual)
– Adding conditions
– Replacing with more specific predicates
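The operators above act on conjunctions of literals. A minimal sketch of dropping-condition, its dual adding-condition, and predicate climbing along an is_a link; the set-of-strings representation and the tiny taxonomy are illustrative assumptions, not the deck's formalism.

```python
def drop_condition(hypothesis, literal):
    """Generalization: remove one conjunct (dropping-condition rule)."""
    return hypothesis - {literal}

def add_condition(hypothesis, literal):
    """Specialization (dual operator): add one conjunct."""
    return hypothesis | {literal}

is_a = {'cube(X)': 'geometricShape(X)'}   # toy taxonomy (assumed)

def climb_taxonomy(hypothesis, literal):
    """Generalization: replace a predicate with a more general one."""
    return (hypothesis - {literal}) | {is_a[literal]}

h = frozenset({'cube(X)', 'red(X)'})
assert drop_condition(h, 'red(X)') == {'cube(X)'}              # generalize
assert add_condition(frozenset({'cube(X)'}), 'red(X)') == h    # specialize
assert climb_taxonomy(frozenset({'cube(X)'}), 'cube(X)') == {'geometricShape(X)'}
```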
Inductive Learning as Search
●
Generalization
●
Positive examples
– Ex+1: “a white triangle”
●
(shape = triangle) ∧ (color = white)
– Ex+2: “a white square”
●
(shape = square) ∧ (color = white)
●
To produce a hypothesis one may generalize
– “object X is white”
●
H : (color = white)
or, if the system knows that triangle and square are both specializations of polygon, another hypothesis can be formulated
●
H : (color = white) ∧ (shape = polygon)
Inductive Learning as Search
●
Specialization
●
If negative examples (or counterexamples) are available, generalization on positive examples may yield too general concept descriptions, that cover negative examples (over-generalization)
– In such a case, one may specialize
●
Negative example
●
H : (color = white) ∧ (shape = polygon)
– Ex–1: “a white pentagon”
●
(shape = pentagon) ∧ (color = white)
– A more specific hypothesis must be produced, e.g.
●
(shape = polygon) ∧ (color = white) ∧ (no_sides(X) < 5)
Inductive Learning as Search
●
How to reduce the number of consistent hypotheses?
●
Bias
– Any reason to prefer one among different hypotheses that are consistent with the training set
●
[Mitchell, Utgoff, Haussler, Rendell]
– Examples
●
Considering only hypotheses that can be expressed in a given language
●
Preferring hypotheses that have a shorter, simpler description
–Occam’s razor
●
Preferring the first consistent hypothesis found
●
Assuming that the relative frequency of objects of class C in the training set is approximately the same as that of the objects of class C in the universe
Concept Learning
●
Inferring a boolean function from examples
●
Given < P, N, C, L >
– P set of positive examples (instances) of the target concept
– N set of negative examples of the target concept
– C set of concepts (depending on conceptual bias)
– L available language (depending on logic-linguistic bias)
●
Define a relationship r acceptable iff definable in terms of concepts C in language L
●
r characteristic iff satisfied by positive examples
●
r discriminant iff not satisfied by any negative example
●
r admissible iff characteristic and discriminant
●
Inductive bias: the kind of assumptions needed about the nature of the target function (e.g., Occam’s Razor)
Concept Learning
●
Positive and negative examples (instances) of the target concept “enjoy sport”
●
Task:
●
Learn to predict the value of “enjoy sport” for an arbitrary day
Sky | AirTemp | Humid | Wind | Water | Forecst | EnjoySport
Sunny | Warm | Normal | Strong | Warm | Same | Yes
Sunny | Warm | High | Strong | Warm | Same | Yes
Rainy | Cold | High | Strong | Warm | Change | No
Sunny | Warm | High | Strong | Cool | Change | Yes
Concept Learning
●
Representing the hypotheses
●
Many possible representations
●
A hypothesis h is a conjunction of constraints on the attributes of instances
– Each constraint may be:
●
A specific value (“Water=Warm”)
●
Don’t care (“Water=?”)
●
No allowed value (“Water=Ø”)
Sky | Temp | Humid | Wind | Water | Forecst
? | Cold | High | ? | ? | ?
Formalization
●
Given
●
Instances X:
– Possible days described by attributes
●
Sky, Temp, Humidity, Wind, Water, Forecast
●
Training examples D: <xi, c(xi)>
●
Target concept c:
– EnjoySport : X → {0,1}
●
c(x)=1 if EnjoySport=yes
●
c(x)=0 if EnjoySport=no
●
Find
●
Hypothesis H:
– Expressed as a conjunction of constraints on attributes
●
Instances X
●
x1 = <Sunny, Warm, High, Strong, Cool, Same>
●
x2 = <Sunny, Warm, High, Light, Warm, Same>
●
Hypotheses H
●
h1 = <Sunny, ?, ?, Strong, ?, ?>
●
h2 = <Sunny, ?, ?, ?, ?, ?>
●
h3 = <Sunny, ?, ?, ?, Cool, ?>
(Diagram: h2 is the most general hypothesis; x1 satisfies h1, h2 and h3, while x2 satisfies only h2)
●
One may take advantage of the general-to-specific ordering:
●
Starting with the most specific hypothesis possible and generalizing it each time it fails in covering an observed training instance
– Bottom-up search
●
Finding a maximally specific hypothesis
Trace on instances X and hypotheses H, from Specific to General:
h0 = <Ø, Ø, Ø, Ø, Ø, Ø>
x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, + → h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
x2 = <Sunny, Warm, High, Strong, Warm, Same>, + → h2 = <Sunny, Warm, ?, Strong, Warm, Same>
x3 = <Rainy, Cold, High, Strong, Warm, Change>, − → h3 = <Sunny, Warm, ?, Strong, Warm, Same>
x4 = <Sunny, Warm, High, Strong, Cool, Change>, + → h4 = <Sunny, Warm, ?, Strong, ?, ?>
●
Let h = <Ø, Ø, Ø, Ø, Ø, Ø> be the initial hypothesis
– Consider the first example in D:
●
<Sunny, Warm, Normal, Strong, Warm, Same>, +
– The maximally specific hypothesis is
●
h = <Sunny, Warm, Normal, Strong, Warm, Same>
– Consider the second example in D:
●
<Sunny, Warm, High, Strong, Warm, Same>, +
– The maximally specific hypothesis is
●
h = <Sunny, Warm, ?, Strong, Warm, Same>
– The third example is ignored because it is negative
●
<Rainy, Cold, High, Strong, Warm, Change>, -
– Consider the fourth example in D:
●
<Sunny, Warm, High, Strong, Cool, Change>, +
– The maximally specific hypothesis is
●
h = <Sunny, Warm, ?, Strong, ?, ?>
Find-S Algorithm
●
Initialize h to the most specific hypothesis in H
●
For each positive example x
●
For each attribute constraint a_i in h
– If constraint a_i is satisfied by x then
●
Do nothing
– else
●
Replace a_i in h with the next more general constraint satisfied by x
●
Return hypothesis h as the result
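The algorithm above can be sketched in Python for conjunctive attribute-value hypotheses, using the slides' training set D. The function name `find_s` and the symbols 'Ø' (empty constraint) and '?' (any value) are our illustrative choices.

```python
# A minimal sketch of Find-S for conjunctive hypotheses over
# attribute-value instances. 'Ø' marks the empty (most specific)
# constraint, '?' the most general one.

def find_s(examples, n_attributes):
    h = ['Ø'] * n_attributes           # most specific hypothesis
    for x, label in examples:
        if not label:                  # negative examples are ignored
            continue
        for i, a in enumerate(x):
            if h[i] == 'Ø':            # first positive: adopt its values
                h[i] = a
            elif h[i] != a:            # mismatch: generalize to '?'
                h[i] = '?'
    return h

D = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True),
]

print(find_s(D, 6))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```

The output matches the trace above: h4 = ⟨Sunny, Warm, ?, Strong, ?, ?⟩.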
Negative Examples
●
Why does no revision occur on negative examples?
●
Basic assumptions
– The concept to be learned (target) c is in H
– There are no errors in the training data
●
h is the most specific hypothesis in H
– Thus, c ≥_g h (c is more general than, or equal to, h)
●
But
– c will never be satisfied by a negative example
– so neither will h
Negative Examples
●
Use
●
No valid reason to prefer the most specific consistent hypothesis
– How to find the set of all hypotheses consistent with the training set? (Version Space)
– For the sake of simplicity, assume the Version Space has a hypothesis that is the most specific of all
●
I.e., the one returned by Find-S
– A hypothesis in the Version Space is more general than or equivalent to the one returned by Find-S; moreover, it must be consistent with all negative examples
Candidate-Elimination Algorithm
●
Intuition
●
Start with an initial candidate Version Space equivalent to the whole Hypotheses Space
– No example presented yet
●
Then
●
Use positive examples to remove too specific hypotheses,
●
and negative examples to remove too general hypotheses
Candidate-Elimination Algorithm
●
Example
●
[r,s] denotes a card
– r value: 1..13
●
1..10: numbers
●
J, Q, K: figures
– s value: suit
●
Red
– Hearts H
– Diamonds D
●
Black
– Flowers F
– Clubs C
Example
●
Cards described by unary relationships
– num, figure, red, black
●
G
– ∀n ∀z (n ≤ 10 → num([n,z]))
– ∀n ∀z (n > 10 → figure([n,z]))
– ∀n ∀z ((z=C ∨ z=F) → black([n,z]))
– ∀n ∀z ((z=D ∨ z=H) → red([n,z]))
●
D: “winning” and “non-winning” cards
– win([4,F])
– win([7,F])
– win([2,C])
– non-win([5,H])
– non-win([J,C])
Simple inductive conclusion:
“All black, non-figure cards win”
∀x (num(x) ∧ black(x) → win(x))
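The inductive conclusion above can be checked mechanically: encode the background predicates num, figure, black, red and the induced rule, then verify it against the training cards. The function names below are our illustrative encoding; ranks 11..13 stand for J, Q, K.

```python
# Sanity check of the slides' conclusion "all black, non-figure cards win".
# Cards are (rank, suit) pairs; 11, 12, 13 encode J, Q, K.

def num(card):    return card[0] <= 10          # ∀n∀z (n ≤ 10 → num)
def figure(card): return card[0] > 10           # ∀n∀z (n > 10 → figure)
def black(card):  return card[1] in ('F', 'C')  # Flowers, Clubs
def red(card):    return card[1] in ('D', 'H')  # Diamonds, Hearts

def wins(card):
    # induced rule: num(x) ∧ black(x) → win(x)
    return num(card) and black(card)

D = [((4, 'F'), True), ((7, 'F'), True), ((2, 'C'), True),
     ((5, 'H'), False), ((11, 'D'), False)]     # 11 = J

assert all(wins(c) == label for c, label in D)
```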
Example
●
Start building a graph to represent the space of hypotheses that are consistent with the examples
– For the first positive instance (4,F) the graph is
[Graph, most general at top:
(-,-)
(Number,-) (-,Black)
(-,F) (Number,Black)
(Number,F) (4,-) (4,Black)
(4,F)]
– Additional examples help to reduce the space
Example
– Next positive example (7,F); the updated graph is
[Graph, most general at top:
(-,-)
(Number,-) (-,Black)
(-,F) (Number,Black)
(Number,F)]
– Next negative instance (5,H); the graph becomes
[Graph, most general at top:
(-,Black)
(-,F) (Number,Black)
(Number,F)]
Example
– Next positive example (2,C) reduces the graph to
[Graph: (-,Black) above (Number,Black)]
– Finally, (J,C) as a negative example yields the concept node
[Graph: (Number,Black)]
Version Space
●
Version space for a concept formation problem
●
The set of all admissible (characteristic and discriminant) relationships for the problem
– Example
●
Let (4,F) be the only positive instance
●
There are no negative instances
●
It is possible to build a graph of the versions
●
Directed Acyclic Graph
●
Nodes = elements in the version space
●
Arc from p to q IFF
– p is less general than q
– There is no node r more general than p and less general than q
Version Space
●
A version space is well-structured if and only if each chain of relationships has a minimum and a maximum element
●
A relationship is maximally specific and minimum element in the version space if and only if no other less general relationship exists in the space
●
Similarly, a maximum element denotes a maximally general relationship
●
Given a well-structured space V
●
Specific Boundary Set S: the set of minimum elements
●
General Boundary Set G: the set of maximum elements
Version Space
●
A representation in which:
●
Each node is associated with a model
– Write operations are available to connect a node to a model
– Read operations are available to produce a model of a node
●
Links between nodes denote specialization and generalization relationships between the corresponding models
●
There is a specialization tree and a generalization tree
– A node in the generalization tree is connected to a model which covers everything
– A node in the specialization tree is connected to a model which covers a single case
Candidate Elimination Algorithm
●
[Mitchell, 1982]
●
During search, considers only a subset of the space of hypotheses
– The one consisting of all hypotheses that are complete and consistent with respect to the observed examples (Version Space)
●
Only the top and bottom “borders” of such a space are stored
– S = set of most_specific consistent and complete hypotheses
– G = set of most_general consistent and complete hypotheses
●
A generalization H is in the version space (S,G) IFF
– H is more_specific than (or equal to) some element of G
– H is more_general than (or equal to) some element of S
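The membership condition above can be sketched for conjunctive hypotheses, with '?' as the "any value" constraint. The function names and the example boundary sets below are illustrative (in the style of the running example), not taken from the slides.

```python
# Version-space membership test: h belongs to (S, G) iff some g in G is
# at least as general as h, and h is at least as general as some s in S.

def more_general_or_equal(g, h):
    """True iff hypothesis g is more general than or equal to h."""
    return all(gi == '?' or gi == hi for gi, hi in zip(g, h))

def in_version_space(h, S, G):
    return (any(more_general_or_equal(g, h) for g in G) and
            any(more_general_or_equal(h, s) for s in S))

# Illustrative boundary sets
S = [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
G = [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]

print(in_version_space(('Sunny', 'Warm', '?', '?', '?', '?'), S, G))  # True
```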
Candidate Elimination Algorithm
●
Boundary sets are updated when a new positive or negative instance is available
●
Positive example x
– pg(x,S,G) = { g ∈ G | g(x) }
●
Update the general boundary set by removing the elements that do not cover the new example:
– ps(x,S,G) = { r | pup(x,S,G,r) }
●
Positive updating (pup): update the specific boundary set by including a relationship if and only if:
– It is an element of the old specific boundary set or a generalization thereof
– It is a specialization of an element of the new general boundary set
– It covers the new instance
– There is no more specific relationship with the above properties
Candidate Elimination Algorithm
– Analogously/symmetrically,
●
Negative example x
– ns(x,S,G) = { s ∈ S | ¬s(x) }
●
Update the specific boundary set by pruning from the old S the elements that cover the negative instance
– ng(x,S,G) = { r | nup(x,S,G,r) }
●
Negative updating (nup): include a relationship in the general boundary set ng(x,S,G) if and only if:
– It is in the old general boundary set or a specialization thereof
– It is a generalization of an element of the new specific boundary set
– It does not cover the new instance
– There is no more general relationship with the above properties
Candidate Elimination Algorithm
●
Basic theorems
●
Given a concept formation problem <P,N,C,L> with a well- structured version space V and Boundary Sets (S,G)
– Boundary Set Theorem
●
A relationship r is in V if and only if it is connected downwards to an element in S and upwards to an element in G
– Candidate Elimination Theorem
●
ps(x,S,G) and pg(x,S,G) are the Boundary Sets for the updated problem <P∪{x},N,C,L>
●
ns(x,S,G) and ng(x,S,G) are the Boundary Sets for the new problem <P,N∪{x},C,L>
Candidate Elimination Algorithm
●
Summary
– Initialize G as the set of maximally general hypotheses and S as the set of the maximally specific hypotheses
– While there are examples
●
Consider the next example
●
If e is negative (e-)
– Keep in S the generalizations that do not cover e-
– Specialize the hypotheses in G that cover e-, so as to exclude e- while keeping them more general than other hypotheses in S
– Remove from G the hypotheses that are more specific than others in G
●
If e is positive (e+)
– Keep in G the generalizations that cover e+
– Generalize the hypotheses in S that do not cover e+, so as to cover e+ while keeping them more specific than other hypotheses in G
– Remove from S the hypotheses that are more general than others in S
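The summary above can be sketched as a small Python program for conjunctive attribute-value hypotheses ('?' = any value, 'Ø' = empty constraint). This is an illustrative version, not Mitchell's original code; specialization of G is guided by the attribute values appearing in S.

```python
# A minimal sketch of Candidate Elimination for conjunctive hypotheses.

def covers(h, x):
    """True iff hypothesis h covers instance x."""
    return all(hi == '?' or hi == xi for hi, xi in zip(h, x))

def more_general_or_equal(g, h):
    return all(gi == '?' or gi == hi for gi, hi in zip(g, h))

def min_generalize(s, x):
    """Minimal generalization of s that covers x."""
    return tuple(xi if si == 'Ø' else (si if si == xi else '?')
                 for si, xi in zip(s, x))

def candidate_elimination(examples, n):
    S = [('Ø',) * n]                   # maximally specific boundary
    G = [('?',) * n]                   # maximally general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]          # keep g covering e+
            S = [s if covers(s, x) else min_generalize(s, x) for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:
            S = [s for s in S if not covers(s, x)]      # keep s excluding e-
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # minimally specialize g, guided by the values in S
                for i in range(n):
                    if g[i] != '?':
                        continue
                    for s in S:
                        if s[i] not in ('?', 'Ø') and s[i] != x[i]:
                            cand = g[:i] + (s[i],) + g[i + 1:]
                            if cand not in new_G:
                                new_G.append(cand)
            G = [g for g in new_G                       # keep maximal elements
                 if not any(h != g and more_general_or_equal(h, g)
                            for h in new_G)]
    return S, G

D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   True),
     (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   True),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), False),
     (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), True)]

S, G = candidate_elimination(D, 6)
print(S)   # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(G)   # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```

On the Find-S training set D, the final boundaries match the classical result: S contains the Find-S hypothesis and G the two maximally general consistent hypotheses.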
Candidate Elimination Algorithm
●
Advantages
●
Incremental
●
Bi-directional search
●
Storage optimization
●
Disadvantages
●
Needs a complete set of examples
●
Does not produce disjunctive concepts
References
●
Carnap R., Jeffrey R.C.
– “Studies in Inductive Logic and Probability”, Berkeley, University of California Press, 1971
●
Mitchell, T.M.
– “Version spaces: an approach to concept learning”, 5th IJCAI, MIT, Cambridge, MA, 305-310, 1977
– “Generalization as Search”, Artificial Intelligence, 1982
●
Hirsh H.
– “Incremental version space merging: a general framework for concept learning”, Kluwer, 1990
– “Generalizing version spaces”, Machine Learning, 1994
●
Vere, S.A.
– “Induction of concepts in predicate calculus”, 4th IJCAI, Tbilisi, USSR, 281-287, 1975
References
●
T.M. Mitchell
●
"Machine Learning", McGraw Hill, 1997
●
J.C. Carbonell, R.S. Michalski, T.M. Mitchell
●