Bayesian Learning
[Read Ch. 6] [Suggested exercises: 6.1, 6.2, 6.6]

- Bayes Theorem
- MAP, ML hypotheses
- MAP learners
- Minimum description length principle
- Bayes optimal classifier
- Naive Bayes learner
- Example: Learning over text data
- Bayesian belief networks
- Expectation Maximization (EM)
Two Roles for Bayesian Methods
Provides practical learning algorithms:
- Naive Bayes learning
- Bayesian belief network learning
- Combine prior knowledge (prior probabilities) with observed data
- Requires prior probabilities

Provides useful conceptual framework:
- Provides "gold standard" for evaluating other learning algorithms
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)

where
- P(h) = prior probability of hypothesis h
- P(D) = prior probability of training data D
- P(h|D) = probability of h given D
- P(D|h) = probability of D given h
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)

Generally want the most probable hypothesis given the training data.

Maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h) P(h) / P(D)
      = argmax_{h∈H} P(D|h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, then we can simplify further and choose the maximum likelihood (ML) hypothesis:

h_ML = argmax_{h_i∈H} P(D|h_i)
Bayes Theorem
Does patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

P(cancer) =              P(¬cancer) =
P(+|cancer) =            P(-|cancer) =
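As a check on the blanks above, a minimal sketch that plugs the stated numbers into Bayes theorem (the variable names are ours; the values come straight from the problem statement):

```python
# Bayes theorem on the lab-test example: compare P(+|h)P(h) for both hypotheses.
p_cancer = 0.008                 # P(cancer), from the problem statement
p_not_cancer = 1 - p_cancer      # P(~cancer) = 0.992
p_pos_given_cancer = 0.98        # P(+|cancer)
p_pos_given_not = 1 - 0.97       # P(+|~cancer) = 0.03

post_cancer = p_pos_given_cancer * p_cancer   # P(+|cancer)P(cancer) = 0.0078
post_not = p_pos_given_not * p_not_cancer     # P(+|~cancer)P(~cancer) = 0.0298

# Normalizing by P(+) gives the posterior probability of cancer
print(post_cancer / (post_cancer + post_not))   # ~0.21
```

Since .0078 < .0298, h_MAP = ¬cancer even after a positive test.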
Basic Formulas for Probabilities

- Product rule: probability P(A ∧ B) of a conjunction of two events A and B:

  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

- Sum rule: probability of a disjunction of two events A and B:

  P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

- Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^n P(A_i) = 1, then

  P(B) = Σ_{i=1}^n P(B|A_i) P(A_i)
Brute Force MAP Hypothesis Learner
1. For each hypothesis h in H, calculate the posterior probability

   P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis h_MAP with the highest posterior probability

   h_MAP = argmax_{h∈H} P(h|D)
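A minimal sketch of this learner, assuming each hypothesis is encoded as a likelihood function plus a prior (a hypothetical representation); since P(D) is the same for every h, the argmax can ignore it:

```python
# Brute-force MAP: score every hypothesis by P(D|h) P(h), return the argmax.
def brute_force_map(hypotheses, priors, likelihood, data):
    """hypotheses: candidate hypotheses; priors: P(h) for each;
    likelihood(data, h): returns P(data | h)."""
    scores = [likelihood(data, h) * p for h, p in zip(hypotheses, priors)]
    return hypotheses[scores.index(max(scores))]

# Toy usage: two hypotheses about a coin's bias; data is a list of flips.
def coin_likelihood(flips, bias):
    p = 1.0
    for f in flips:
        p *= bias if f == 1 else 1 - bias
    return p

print(brute_force_map([0.5, 0.9], [0.5, 0.5], coin_likelihood, [1, 1, 1, 0]))  # 0.9
```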
Relation to Concept Learning
Consider our usual concept learning task:
- instance space X, hypothesis space H, training examples D
- consider the FindS learning algorithm (outputs most specific hypothesis from the version space VS_{H,D})

What would Bayes rule produce as the MAP hypothesis?
Relation to Concept Learning

Assume fixed set of instances ⟨x_1, ..., x_m⟩.
Assume D is the set of classifications: D = ⟨c(x_1), ..., c(x_m)⟩.

Choose P(D|h):
- P(D|h) = 1 if h consistent with D
- P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
- P(h) = 1/|H| for all h in H

Then:

P(h|D) = 1/|VS_{H,D}|   if h is consistent with D
         0               otherwise

Evolution of Posterior Probabilities
[Figure: three plots over the hypothesis space, showing the posterior evolving from the uniform prior P(h), to P(h|D1), to P(h|D1, D2), concentrating on the hypotheses consistent with the data]
Characterizing Learning Algorithms by Equivalent MAP Learners

[Figure: an inductive system (the Candidate Elimination Algorithm, taking training examples D and hypothesis space H and producing output hypotheses) is equivalent to a Bayesian inference system (the brute-force MAP learner with the prior assumptions made explicit: P(h) uniform, P(D|h) = 1 if consistent, = 0 if inconsistent)]
Learning A Real Valued Function
[Figure: noisy training points (x, d) scattered around target f; h_ML is the hypothesis fit through them, with e the noise on each value]

Consider any real-valued target function f.

Training examples ⟨x_i, d_i⟩, where d_i is a noisy training value:
- d_i = f(x_i) + e_i
- e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean = 0

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h∈H} Σ_{i=1}^m (d_i - h(x_i))²

Learning A Real Valued Function
h_ML = argmax_{h∈H} p(D|h)
     = argmax_{h∈H} Π_{i=1}^m p(d_i|h)
     = argmax_{h∈H} Π_{i=1}^m (1/√(2πσ²)) e^(-(1/2)((d_i - h(x_i))/σ)²)

Maximize natural log of this instead...

h_ML = argmax_{h∈H} Σ_{i=1}^m ln(1/√(2πσ²)) - (1/2)((d_i - h(x_i))/σ)²
     = argmax_{h∈H} Σ_{i=1}^m -(1/2)((d_i - h(x_i))/σ)²
     = argmax_{h∈H} Σ_{i=1}^m -(d_i - h(x_i))²
     = argmin_{h∈H} Σ_{i=1}^m (d_i - h(x_i))²
i)) 2Learning to Predict Probabilities
Consider predicting survival probability from patient data
Training examples h
x
i;d
ii, whered
i is 1 or 0Want to train neural network to output a
probability given
x
i (not a 0 or 1)In this case can show
h
ML = argmaxh 2H m X i=1d
i lnh
(x
i) + (1 ?d
i)ln(1 ?h
(x
i))Weight update rule for a sigmoid unit:
w
jkw
jk +w
jkwhere
w
jk = Xmi=1
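A minimal sketch of one step of this update for a single sigmoid unit h(x) = σ(w·x); the data and learning rate η are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, 1.0], [1.5, -0.5], [2.0, 0.3]])  # rows are instances x_i
d = np.array([1.0, 0.0, 1.0])                        # targets d_i (0 or 1)
w = np.zeros(2)
eta = 0.1

h = sigmoid(X @ w)       # unit outputs h(x_i)
w += eta * (d - h) @ X   # w_jk <- w_jk + eta * sum_i (d_i - h(x_i)) x_ijk
print(w)
```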
Minimum Description Length Principle
Occam's razor: prefer the shortest hypothesis.

MDL: prefer the hypothesis h that minimizes

h_MDL = argmin_{h∈H} L_C1(h) + L_C2(D|h)

where L_C(x) is the description length of x under encoding C.

Example: H = decision trees, D = training data labels
- L_C1(h) is # bits to describe tree h
- L_C2(D|h) is # bits to describe D given h
- Note L_C2(D|h) = 0 if examples classified perfectly by h. Need only describe exceptions.

Hence h_MDL trades off tree size for training errors.

Minimum Description Length Principle
h_MAP = argmax_{h∈H} P(D|h) P(h)
      = argmax_{h∈H} log₂ P(D|h) + log₂ P(h)
      = argmin_{h∈H} - log₂ P(D|h) - log₂ P(h)     (1)

Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p is - log₂ p bits.

So interpret (1):
- - log₂ P(h) is length of h under optimal code
- - log₂ P(D|h) is length of D given h under optimal code

→ prefer the hypothesis that minimizes length(h) + length(misclassifications)
Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., h_MAP).

Given new instance x, what is its most probable classification?
- h_MAP(x) is not the most probable classification!

Consider three possible hypotheses:

P(h1|D) = .4,  P(h2|D) = .3,  P(h3|D) = .3

Given new instance x:

h1(x) = +,  h2(x) = -,  h3(x) = -

What is the most probable classification of x?

Bayes Optimal Classifier
Bayes optimal classification:

argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D)

Example:

P(h1|D) = .4,  P(-|h1) = 0,  P(+|h1) = 1
P(h2|D) = .3,  P(-|h2) = 1,  P(+|h2) = 0
P(h3|D) = .3,  P(-|h3) = 1,  P(+|h3) = 0

therefore

Σ_{h_i∈H} P(+|h_i) P(h_i|D) = .4
Σ_{h_i∈H} P(-|h_i) P(h_i|D) = .6

and

argmax_{v_j∈V} Σ_{h_i∈H} P(v_j|h_i) P(h_i|D) = -
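The same example, computed directly; a minimal sketch where posteriors holds P(h_i|D) and votes holds P(v|h_i):

```python
posteriors = [0.4, 0.3, 0.3]               # P(h_i|D)
votes = [{'+': 1.0, '-': 0.0},             # h1 predicts +
         {'+': 0.0, '-': 1.0},             # h2 predicts -
         {'+': 0.0, '-': 1.0}]             # h3 predicts -

def bayes_optimal(values, posteriors, votes):
    # argmax_v sum_i P(v|h_i) P(h_i|D)
    return max(values,
               key=lambda v: sum(p * vt[v] for p, vt in zip(posteriors, votes)))

print(bayes_optimal(['+', '-'], posteriors, votes))   # '-' (0.6 vs 0.4)
```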
Gibbs Classifier

Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.

Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this to classify new instance

Surprising fact: assume target concepts are drawn at random from H according to priors on H. Then:

E[error_Gibbs] ≤ 2 E[error_BayesOptimal]

Suppose correct, uniform prior distribution over H. Then:
- Pick any hypothesis from VS, with uniform probability
- Its expected error is no worse than twice Bayes optimal
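A minimal sketch of the Gibbs algorithm on the same three-hypothesis example (posterior values as before):

```python
import random

posteriors = [0.4, 0.3, 0.3]     # P(h_i|D)
predictions = ['+', '-', '-']    # h_i(x)

def gibbs_classify(posteriors, predictions):
    # 1. choose one hypothesis at random, according to P(h|D)
    i = random.choices(range(len(posteriors)), weights=posteriors, k=1)[0]
    # 2. use it to classify the new instance
    return predictions[i]

print(gibbs_classify(posteriors, predictions))  # '+' with prob .4, '-' with prob .6
```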
Naive Bayes Classifier

Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.

When to use:
- Moderate or large training set available
- Attributes that describe instances are conditionally independent given classification

Successful applications:
- Diagnosis
- Classifying text documents
Naive Bayes Classifier

Assume target function f: X → V, where each instance x is described by attributes ⟨a_1, a_2 ... a_n⟩.

Most probable value of f(x) is:

v_MAP = argmax_{v_j∈V} P(v_j | a_1, a_2 ... a_n)
      = argmax_{v_j∈V} P(a_1, a_2 ... a_n | v_j) P(v_j) / P(a_1, a_2 ... a_n)
      = argmax_{v_j∈V} P(a_1, a_2 ... a_n | v_j) P(v_j)

Naive Bayes assumption:

P(a_1, a_2 ... a_n | v_j) = Π_i P(a_i|v_j)

which gives the Naive Bayes classifier:

v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j)

Naive Bayes Algorithm
Naive_Bayes_Learn(examples)
  For each target value v_j
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a
      P̂(a_i|v_j) ← estimate P(a_i|v_j)

Classify_New_Instance(x)

  v_NB = argmax_{v_j∈V} P̂(v_j) Π_{a_i∈x} P̂(a_i|v_j)
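A minimal sketch of these two procedures using raw frequency estimates (no smoothing; a later slide's m-estimate addresses zero counts). Instances are dicts mapping attribute to value, and the toy data is hypothetical:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (instance, target); instance maps attribute -> value."""
    n = len(examples)
    class_count = Counter(target for _, target in examples)
    prior = {v: c / n for v, c in class_count.items()}          # estimates P(v_j)
    cond = defaultdict(Counter)
    for inst, target in examples:
        for attr, val in inst.items():
            cond[target][(attr, val)] += 1
    likelihood = {v: {av: c / class_count[v] for av, c in counts.items()}
                  for v, counts in cond.items()}                # estimates P(a_i|v_j)
    return prior, likelihood

def classify_new_instance(prior, likelihood, x):
    # v_NB = argmax_v P(v) * prod_i P(a_i|v); unseen pairs get probability 0
    def score(v):
        p = prior[v]
        for attr, val in x.items():
            p *= likelihood[v].get((attr, val), 0.0)
        return p
    return max(prior, key=score)

data = [({'Wind': 'strong', 'Humid': 'high'}, 'n'),
        ({'Wind': 'weak', 'Humid': 'normal'}, 'y'),
        ({'Wind': 'weak', 'Humid': 'high'}, 'y')]
prior, likelihood = naive_bayes_learn(data)
print(classify_new_instance(prior, likelihood, {'Wind': 'weak', 'Humid': 'high'}))  # 'y'
```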
Naive Bayes: Example

Consider PlayTennis again, and new instance

⟨Outlk = sun, Temp = cool, Humid = high, Wind = strong⟩

Want to compute:

v_NB = argmax_{v_j∈V} P(v_j) Π_i P(a_i|v_j)

P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y) = .005
P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n) = .021

→ v_NB = n
Naive Bayes: Subtleties
1. Conditional independence assumption is often violated:

P(a_1, a_2 ... a_n | v_j) = Π_i P(a_i|v_j)

...but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(v_j|x) to be correct; we need only that

argmax_{v_j∈V} P̂(v_j) Π_i P̂(a_i|v_j) = argmax_{v_j∈V} P(v_j) P(a_1, ..., a_n | v_j)

- see [Domingos & Pazzani, 1996] for analysis
- Naive Bayes posteriors are often unrealistically close to 1 or 0
Naive Bayes: Subtleties
2. What if none of the training instances with target value v_j have attribute value a_i? Then

P̂(a_i|v_j) = 0, and...

P̂(v_j) Π_i P̂(a_i|v_j) = 0

Typical solution is a Bayesian estimate for P̂(a_i|v_j):

P̂(a_i|v_j) ← (n_c + m p) / (n + m)

where
- n is the number of training examples for which v = v_j
- n_c is the number of examples for which v = v_j and a = a_i
- p is a prior estimate for P̂(a_i|v_j)
- m is the weight given to the prior (i.e., the number of "virtual" examples)
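A one-function sketch of this estimate:

```python
def m_estimate(n_c, n, p, m):
    """n_c: count with v=v_j and a=a_i; n: count with v=v_j;
    p: prior estimate of P(a_i|v_j); m: equivalent sample size."""
    return (n_c + m * p) / (n + m)

# With no observations (n_c = 0) the estimate falls back toward p instead
# of 0, so a single unseen attribute value no longer zeroes the product.
print(m_estimate(0, 10, p=0.25, m=4))   # 0.0714..., not 0.0
```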
Learning to Classify Text

Why?
- Learn which news articles are of interest
- Learn to classify web pages by topic

Naive Bayes is among the most effective algorithms.

What attributes shall we use to represent text documents?
Learning to Classify Text
Target concept Interesting?: Document → {+, -}

1. Represent each document by vector of words
   - one attribute per word position in document
2. Learning: use training examples to estimate
   - P(+), P(-)
   - P(doc|+), P(doc|-)

Naive Bayes conditional independence assumption:

P(doc|v_j) = Π_{i=1}^{length(doc)} P(a_i = w_k | v_j)

where P(a_i = w_k | v_j) is the probability that the word in position i is w_k, given v_j.

One more assumption: P(a_i = w_k | v_j) = P(a_m = w_k | v_j), ∀ i, m
Learn_naive_Bayes_text(Examples, V)

1. Collect all words and other tokens that occur in Examples
   - Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(v_j) and P(w_k|v_j) probability terms.
   For each target value v_j in V do
   - docs_j ← subset of Examples for which the target value is v_j
   - P(v_j) ← |docs_j| / |Examples|
   - Text_j ← a single document created by concatenating all members of docs_j
   - n ← total number of words in Text_j (counting duplicate words multiple times)
   - for each word w_k in Vocabulary
     - n_k ← number of times word w_k occurs in Text_j
     - P(w_k|v_j) ← (n_k + 1) / (n + |Vocabulary|)
Classify_naive_Bayes_text(Doc)
- positions ← all word positions in Doc that contain tokens found in Vocabulary
- Return v_NB, where

  v_NB = argmax_{v_j∈V} P(v_j) Π_{i∈positions} P(a_i|v_j)
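A minimal sketch of these two procedures with the (n_k + 1)/(n + |Vocabulary|) estimates; log-probabilities are used to avoid underflow on long documents, and the two-document corpus is hypothetical:

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (list_of_words, target_value) pairs."""
    vocab = {w for doc, _ in examples for w in doc}
    log_prior, log_word = {}, {}
    for v in {t for _, t in examples}:
        docs_v = [doc for doc, t in examples if t == v]
        log_prior[v] = math.log(len(docs_v) / len(examples))    # P(v_j)
        counts = Counter(w for doc in docs_v for w in doc)
        n = sum(counts.values())            # total words in Text_j
        log_word[v] = {w: math.log((counts[w] + 1) / (n + len(vocab)))
                       for w in vocab}      # P(w_k|v_j)
    return log_prior, log_word, vocab

def classify_naive_bayes_text(log_prior, log_word, vocab, doc):
    words = [w for w in doc if w in vocab]  # positions with known tokens
    return max(log_prior, key=lambda v: log_prior[v] +
               sum(log_word[v][w] for w in words))

lp, lw, vocab = learn_naive_bayes_text([(['puck', 'goal', 'ice'], 'hockey'),
                                        (['orbit', 'nasa', 'launch'], 'space')])
print(classify_naive_bayes_text(lp, lw, vocab, ['nasa', 'goal', 'orbit']))  # 'space'
```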
Twenty NewsGroups

Given 1000 training documents from each group, learn to classify new documents according to which newsgroup it came from:

comp.graphics             misc.forsale
comp.os.ms-windows.misc   rec.autos
comp.sys.ibm.pc.hardware  rec.motorcycles
comp.sys.mac.hardware     rec.sport.baseball
comp.windows.x            rec.sport.hockey
alt.atheism               sci.space
soc.religion.christian    sci.crypt
talk.religion.misc        sci.electronics
talk.politics.mideast     sci.med
talk.politics.misc
talk.politics.guns
Article from rec.sport.hockey
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year's biggest and worst (opinion)... Date: 5 Apr 93 09:53:39 GMT
I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive
defenseman, but he's clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive
liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he's only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided
Learning Curve for 20 Newsgroups
[Figure: classification accuracy vs. number of training examples (100 to 10000, log scale) on 20 Newsgroups, for Bayes, TFIDF, and PRTFIDF]

Bayesian Belief Networks
Interesting because:
- Naive Bayes assumption of conditional independence is too restrictive
- But it's intractable without some such assumptions...
- Bayesian belief networks describe conditional independence among subsets of variables

→ allows combining prior knowledge about (in)dependencies among variables with observed training data
Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ x_i, y_j, z_k) P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

More compactly, we write

P(X | Y, Z) = P(X | Z)

Example: Thunder is conditionally independent of Rain, given Lightning:

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses conditional independence to justify

P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z)

Bayesian Belief Network
[Figure: Bayesian belief network over Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire, with the conditional probability table for Campfire given Storm (S) and BusTourGroup (B):]

         S,B    S,¬B   ¬S,B   ¬S,¬B
   C     0.4    0.1    0.8    0.2
  ¬C     0.6    0.9    0.2    0.8

Network represents a set of conditional independence assertions:
- Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
Bayesian Belief Network
[Figure: same network and Campfire CPT as on the previous slide]

Represents joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire).

In general,

P(y_1, ..., y_n) = Π_{i=1}^n P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph.

So, the joint distribution is fully defined by the graph, plus the P(y_i | Parents(Y_i)).
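A minimal sketch of this factorization on just the Storm/BusTourGroup/Campfire fragment of the network; the Campfire table is the one from the slide, while the priors for Storm and BusTourGroup are made-up values for illustration:

```python
p_storm = {True: 0.1, False: 0.9}    # assumed P(Storm), not from the slide
p_bus = {True: 0.3, False: 0.7}      # assumed P(BusTourGroup), not from the slide
p_campfire = {                       # P(Campfire=T | S, B), from the slide's CPT
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(s, b, c):
    # P(s, b, c) = P(s) P(b) P(c | s, b): each factor conditions only on parents
    pc = p_campfire[(s, b)]
    return p_storm[s] * p_bus[b] * (pc if c else 1 - pc)

print(joint(True, True, True))   # 0.1 * 0.3 * 0.4 = 0.012
print(sum(joint(s, b, c) for s in (True, False)
          for b in (True, False) for c in (True, False)))   # 1.0
```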
Inference in Bayesian Networks
How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
- Bayes net contains all information needed for this inference
- If only one variable with unknown value, easy to infer it
- In the general case, the problem is NP-hard

In practice, can succeed in many cases:
- Exact inference methods work well for some network structures
- Monte Carlo methods "simulate" the network randomly to calculate approximate solutions
Learning of Bayesian Networks
Several variants of this learning task:
- Network structure might be known or unknown
- Training examples might provide values of all network variables, or just some

If structure known and we observe all variables: then it's as easy as training a Naive Bayes classifier.
Learning Bayes Nets
Suppose structure known, variables partially observable.

e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire...
- Similar to training a neural network with hidden units
- In fact, can learn network conditional probability tables using gradient ascent!
- Converge to network h that (locally) maximizes P(D|h)

Gradient Ascent for Bayes Nets
Let w_ijk denote one entry in the conditional probability table for variable Y_i in the network:

w_ijk = P(Y_i = y_ij | Parents(Y_i) = the list u_ik of values)

e.g., if Y_i = Campfire, then u_ik might be ⟨Storm = T, BusTourGroup = F⟩.

Perform gradient ascent by repeatedly
1. updating all w_ijk using training data D:

   w_ijk ← w_ijk + η Σ_{d∈D} P_h(y_ij, u_ik | d) / w_ijk

   (η is a small constant, the learning rate)

2. then renormalizing the w_ijk to assure
   - Σ_j w_ijk = 1

More on Learning Bayes Nets
EM algorithm can also be used. Repeatedly:
1. Calculate probabilities of unobserved variables, assuming h
2. Calculate new w_ijk to maximize E[ln P(D|h)], where D now includes both observed and (calculated probabilities of) unobserved variables

When structure unknown...
- Algorithms use greedy search to add/subtract edges and nodes
Summary: Bayesian Belief Networks
- Combine prior knowledge with observed data
- Impact of prior knowledge (when correct!) is to lower the sample complexity
- Active research area
  - Extend from boolean to real-valued variables
  - Parameterized distributions instead of tables
  - Extend to first-order instead of propositional systems
  - More effective inference methods

Expectation Maximization (EM)
When to use:
- Data is only partially observable
- Unsupervised clustering (target value unobservable)
- Supervised learning (some instance attributes unobservable)

Some uses:
- Train Bayesian belief networks
- Unsupervised clustering (AUTOCLASS)
Generating Data from Mixture of k Gaussians

[Figure: probability density p(x) over x for a mixture of Gaussians]

Each instance x is generated by
1. choosing one of the k Gaussians with uniform probability
2. generating an instance at random according to that Gaussian
EM for Estimating k Means

Given:
- Instances from X generated by mixture of k Gaussian distributions
- Unknown means ⟨μ_1, ..., μ_k⟩ of the k Gaussians
- Don't know which instance x_i was generated by which Gaussian

Determine:
- Maximum likelihood estimates of ⟨μ_1, ..., μ_k⟩

Think of the full description of each instance as y_i = ⟨x_i, z_i1, z_i2⟩, where
- z_ij is 1 if x_i was generated by the jth Gaussian
- x_i is observable
- z_ij is unobservable

EM for Estimating k Means
EM Algorithm: Pick random initial h = ⟨μ_1, μ_2⟩, then iterate:

E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = ⟨μ_1, μ_2⟩ holds:

E[z_ij] = p(x = x_i | μ = μ_j) / Σ_{n=1}^2 p(x = x_i | μ = μ_n)
        = e^(-(x_i - μ_j)² / 2σ²) / Σ_{n=1}^2 e^(-(x_i - μ_n)² / 2σ²)

M step: Calculate a new maximum likelihood hypothesis h' = ⟨μ'_1, μ'_2⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated above. Then replace h = ⟨μ_1, μ_2⟩ by h' = ⟨μ'_1, μ'_2⟩:

μ_j ← Σ_{i=1}^m E[z_ij] x_i / Σ_{i=1}^m E[z_ij]
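A minimal sketch of this two-mean EM loop, assuming a known shared σ = 1 and a hypothetical mixture sample:

```python
import math, random

random.seed(0)
sigma = 1.0
xs = [random.gauss(0, sigma) for _ in range(100)] + \
     [random.gauss(4, sigma) for _ in range(100)]   # mixture with true means 0, 4

mu = [random.random(), random.random() + 1]   # random initial h = <mu1, mu2>
for _ in range(50):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
    resp = []
    for x in xs:
        w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
        resp.append([wi / sum(w) for wi in w])
    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
          for j in range(2)]

print(mu)   # should settle near the true means (about 0 and 4)
```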
EM Algorithm

Converges to local maximum likelihood h and provides estimates of hidden variables z_ij.

In fact, local maximum in E[ln P(Y|h)]:
- Y is complete (observable plus unobservable variables) data
- Expected value is taken over possible values of the unobserved variables in Y
General EM Problem
Given:
- Observed data X = {x_1, ..., x_m}
- Unobserved data Z = {z_1, ..., z_m}
- Parameterized probability distribution P(Y|h), where
  - Y = {y_1, ..., y_m} is the full data, with y_i = x_i ∪ z_i
  - h are the parameters

Determine: h that (locally) maximizes E[ln P(Y|h)]

Many uses:
- Train Bayesian belief networks
- Unsupervised clustering (e.g., k means)

General EM Method
Define likelihood function Q(h'|h) which calculates Y = X ∪ Z, using observed X and current parameters h to estimate Z:

Q(h'|h) ← E[ln P(Y|h') | h, X]

EM Algorithm:

Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

Q(h'|h) ← E[ln P(Y|h') | h, X]

Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function:

h ← argmax_{h'} Q(h'|h)