Fundamentals of Artificial Intelligence
Chapter 14: Probabilistic Reasoning
Roberto Sebastiani


DISI, Università di Trento, Italy – roberto.sebastiani@unitn.it http://disi.unitn.it/rseba/DIDATTICA/fai_2020/

Teaching assistant: Mauro Dragoni – dragoni@fbk.eu http://www.maurodragoni.com/teaching/fai/

M.S. Course “Artificial Intelligence Systems”, academic year 2020-2021

Last update: Friday 18th December, 2020, 16:39

Copyright notice: Most examples and images displayed in the slides of this course are taken from [Russell & Norvig, "Artificial Intelligence, a Modern Approach", 3rd ed., Pearson], including, explicitly, figures from the above-mentioned book, so that their copyright is held by the authors.

Some other material (text, figures, examples) is authored by (in alphabetical order): Pieter Abbeel, Bonnie J. Dorr, Anca Dragan, Dan Klein, Nikita Kitaev, Tom Lenaerts, Michela Milano, Dana Nau, Maria Simi, who hold its copyright.

These slides cannot be displayed in public without the permission of the author.

Outline
1. Bayesian Networks
2. Constructing Bayesian Networks
3. Exact Inference with Bayesian Networks


Bayesian Networks

Bayesian Networks (aka Belief Networks):
- allow for compact specification of full joint distributions
- represent explicit conditional dependencies among variables: an arc from X to Y means that X has a direct influence on Y

Syntax: a directed acyclic graph (DAG):
- each node represents a random variable (discrete or continuous)
- directed arcs connect pairs of nodes: X → Y (X is a parent of Y)
- a conditional distribution P(X_i | Parents(X_i)) for each node X_i

The conditional distribution is represented as a conditional probability table (CPT): a distribution over X_i for each combination of parent values.

Topology encodes conditional independence assertions:
- Toothache and Catch are conditionally independent given Cavity
- Toothache and Catch depend on Cavity
- Weather is independent from the others
- no arc ⇐⇒ independence

(Figure: © S. Russell & P. Norvig, AIMA)


Example (from Judea Pearl, UCLA)

"The burglary alarm goes off very likely on burglary and occasionally on earthquakes. John and Mary are neighbors who agreed to call when the alarm goes off. Their reliability is different ..."

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects "causal" knowledge:
- a burglar can set the alarm off
- an earthquake can set the alarm off
- the alarm can cause Mary to call
- the alarm can cause John to call

CPTs:
- the alarm is set off by a burglary in 94% of cases
- the alarm is set off by an earthquake in 29% of cases
- a false alarm is set off in 0.1% of cases

(Figure: © S. Russell & P. Norvig, AIMA)
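As a concrete sketch (mine, not from the slides), the burglary network can be encoded as two plain Python dictionaries: one mapping each variable to its parent list, one mapping each variable to its CPT, keyed by the tuple of parent values. The numbers are the standard AIMA CPT values for this network.

# Burglary network: structure + CPTs (standard AIMA values).
# Each CPT maps a tuple of parent truth values to P(variable = true).
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}
cpt = {
    "Burglary":   {(): 0.001},
    "Earthquake": {(): 0.002},
    "Alarm":      {(True, True): 0.95, (True, False): 0.94,
                   (False, True): 0.29, (False, False): 0.001},
    "JohnCalls":  {(True,): 0.90, (False,): 0.05},
    "MaryCalls":  {(True,): 0.70, (False,): 0.01},
}

def p(var, value, assignment):
    # P(var = value | parent values read from assignment)
    pt = cpt[var][tuple(assignment[u] for u in parents[var])]
    return pt if value else 1.0 - pt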


A CPT for a Boolean X_i with k Boolean parents has 2^k rows, one for each combination of parent values; each row requires one number p for P(X_i = true) (since P(X_i = false) = 1 − P(X_i = true)).

⇒ If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers, whereas a full joint distribution requires 2^n − 1 numbers: linear vs. exponential!

Ex: for the burglary example: 1 + 1 + 4 + 2 + 2 = 10 numbers vs. 2^5 − 1 = 31
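A quick numeric check of these counts (a sketch; the parents dict is my encoding from above): each Boolean node with k Boolean parents contributes 2^k numbers.

parents = {"Burglary": [], "Earthquake": [],
           "Alarm": ["Burglary", "Earthquake"],
           "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"]}
print(sum(2 ** len(ps) for ps in parents.values()))  # 1+1+4+2+2 = 10
print(2 ** len(parents) - 1)                         # full joint: 2**5 - 1 = 31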


Global Semantics

The full joint distribution is defined as the product of the local conditional distributions:
P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | parents(X_i))
(if X_i has no parent, P(X_i | parents(X_i)) is the prior probability P(X_i))

Intuition: order X_1, ..., X_n s.t. parents(X_i) ≺ X_i for each i; then:
P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | X_1, ..., X_{i−1})   // chain rule
                 = ∏_{i=1}^{n} P(X_i | parents(X_i))         // conditional independence

⇒ A Bayesian network is a distributed representation of the full joint distribution


Global Semantics: Example

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | parents(X_i))

Ex: "probability that both John and Mary call, the alarm goes off, but there is neither a burglary nor an earthquake":
P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j | m∧a∧¬b∧¬e) · P(m | a∧¬b∧¬e) · P(a | ¬b∧¬e) · P(¬b | ¬e) · P(¬e)
= P(j|a) · P(m|a) · P(a|¬b∧¬e) · P(¬b) · P(¬e)
= 0.9 · 0.7 · 0.001 · 0.999 · 0.998
≈ 0.00063

(Figure: © S. Russell & P. Norvig, AIMA)
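A one-loop check of this value (a sketch; it reuses the parents/cpt/p encoding from the earlier sketch, which is my assumption, plus an exhaustive sanity check that the joint sums to 1):

from itertools import product

order = ["Burglary", "Earthquake", "Alarm", "JohnCalls", "MaryCalls"]

def joint(assignment):
    # global semantics: product of the local conditional probabilities
    result = 1.0
    for var in order:
        result *= p(var, assignment[var], assignment)
    return result

event = {"Burglary": False, "Earthquake": False, "Alarm": True,
         "JohnCalls": True, "MaryCalls": True}
print(joint(event))  # ~0.000628, i.e. the 0.00063 above

# the 2**5 joint entries must sum to (approximately) 1
print(sum(joint(dict(zip(order, vals)))
          for vals in product([True, False], repeat=5)))  # ~1.0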


Exercise

Compute:
- the probability that John calls and Mary does not, and the alarm is not set off, while a burglar enters during an earthquake
- the probability that John calls and Mary does not, given a burglar entering the house
- the probability of an earthquake, given the fact that John has called
- ...
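For instance, the first query is a single entry of the full joint, so the global semantics yields it directly. A sketch using the AIMA CPT values (P(a|b,e) = 0.95, P(j|¬a) = 0.05, P(m|¬a) = 0.01; these numbers are not printed on this slide):

# P(j, ~m, ~a, b, e) = P(b) P(e) P(~a|b,e) P(j|~a) P(~m|~a)
print(0.001 * 0.002 * (1 - 0.95) * 0.05 * (1 - 0.01))  # ~4.95e-09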

Local Semantics

Local semantics: each node is conditionally independent of its non-descendants, given its parents.

Theorem: local semantics holds iff global semantics holds:
P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | parents(X_i))

(Figure: © S. Russell & P. Norvig, AIMA)


Markov Blanket

Markov blanket of a node X: its parents U_1, ..., U_m, its children Y_1, ..., Y_n, and its children's other parents Z_1j, ..., Z_nj. Each node is conditionally independent of all the other nodes, given its Markov blanket:
P(X | U_1, ..., U_m, Y_1, ..., Y_n, Z_1j, ..., Z_nj), for each X

(Figure: © S. Russell & P. Norvig, AIMA)
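A small helper (my sketch, reusing the parents dict from before) that assembles the Markov blanket of a node directly from this definition:

def markov_blanket(x, parents):
    # parents of x, children of x, and the children's other parents
    children = [v for v, ps in parents.items() if x in ps]
    blanket = set(parents[x]) | set(children)
    for c in children:
        blanket |= set(parents[c])
    blanket.discard(x)
    return blanket

parents = {"Burglary": [], "Earthquake": [],
           "Alarm": ["Burglary", "Earthquake"],
           "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"]}
print(markov_blanket("Alarm", parents))
# {'Burglary', 'Earthquake', 'JohnCalls', 'MaryCalls'}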


Exercise

Verify numerically the two previous examples:
- Local Semantics
- Markov Blanket

Outline
1. Bayesian Networks
2. Constructing Bayesian Networks
3. Exact Inference with Bayesian Networks

Constructing Bayesian Networks

Given a set of random variables:
1. Choose an ordering {X_1, ..., X_n}
   - in principle, any ordering will work (but some may cause blowups)
   - general rule: follow causality, X ≺ Y if X ∈ causes(Y)
2. For i = 1 to n do:
   2.1 add X_i to the network
   2.2 as Parents(X_i), choose a subset of {X_1, ..., X_{i−1}} s.t.
       P(X_i | Parents(X_i)) = P(X_i | X_1, ..., X_{i−1})

⇒ Guarantees the global semantics by construction:
P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | parents(X_i))


Ex: with the ordering {M, J, A, B, E}, 1 + 2 + 4 + 2 + 4 = 13 numbers are needed (rather than 10).
It can be much worse: ex: try {M, J, E, B, A} (see AIMA).
Much better with causal orderings: ex: try either
{B, E, A, J, M}, {E, B, A, J, M}, {B, E, A, M, J}, or {E, B, A, M, J},
i.e. {B, E} ≺ A ≺ {J, M}
(both B and E cause A; A causes both M and J)

(Figure: © S. Russell & P. Norvig, AIMA)
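A sketch comparing the parameter counts of two orderings. The parent sets below are my reading of the corresponding AIMA figure (an assumption); each Boolean node with k parents costs 2^k numbers:

# parent sets induced by the ordering {M, J, A, B, E}
ordered_badly = {"M": [], "J": ["M"], "A": ["M", "J"], "B": ["A"], "E": ["A", "B"]}
# parent sets induced by the causal ordering {B, E, A, J, M}
causal = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

count = lambda net: sum(2 ** len(ps) for ps in net.values())
print(count(ordered_badly), count(causal))  # 13 10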


Compact Conditional Distributions: Noisy-OR

Add an independent failure probability q_i for each cause U_i:
⇒ P(¬X | U_1, ..., U_j, ¬U_{j+1}, ..., ¬U_k) = ∏_{i=1}^{j} q_i
⇒ the number of parameters is linear in the number of parents!

Ex: q_Cold = 0.6, q_Flu = 0.2, q_Malaria = 0.1:

(Figure: © S. Russell & P. Norvig, AIMA)
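This rule generates the whole CPT from just the q values. A sketch expanding the slide's example into the Fever table (Fever is the effect variable in the corresponding AIMA example; my naming):

from itertools import product

q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}
causes = list(q)

for values in product([False, True], repeat=len(causes)):
    # P(~fever | causes) = product of q_i over the causes that are active
    p_not_fever = 1.0
    for cause, active in zip(causes, values):
        if active:
            p_not_fever *= q[cause]
    print(dict(zip(causes, values)), "P(fever) =", round(1 - p_not_fever, 3))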


Exercise

1. Consider the probabilistic Wumpus World of the previous chapter:
(a) Describe it as a Bayesian network

Outline
1. Bayesian Networks
2. Constructing Bayesian Networks
3. Exact Inference with Bayesian Networks

Inference Tasks

Notation:
- X: the query variable
- E/e: the set of evidence variables {E_1, ..., E_m} and of evidence values {e_1, ..., e_m}
- Y/y: the set of unknown variables (aka hidden variables) {Y_1, ..., Y_l} and of unknown values {y_1, ..., y_l}
⇒ X = {X} ∪ E ∪ Y, X being the set of all the variables

A typical query asks for the posterior probability distribution:
P(X | E = e) (also written P(X | e))

Ex: P(Burglary | JohnCalls = true, MaryCalls = true)
- query: Burglary
- evidence variables: E = {JohnCalls, MaryCalls}
- hidden variables: Y = {Earthquake, Alarm}


Inference by Enumeration

We defined a procedure for the task as:
P(X | e) = α P(X, e) = α Σ_y P(X, e, y)
⇒ P(X, e, y) can be rewritten as a product of prior and conditional probabilities according to the Bayesian network; then apply factorization and simplify algebraically when possible.

Ex:
P(B | j, m) = α Σ_e Σ_a P(B, e, a, j, m)
            = α Σ_e Σ_a P(B) P(e) P(a|B, e) P(j|a) P(m|a)
            = α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) P(m|a)
⇒ P(b | j, m) = α P(b) Σ_e P(e) Σ_a P(a|b, e) P(j|a) P(m|a)

Recursive depth-first enumeration: O(n) space, O(2^n) time with n propositional variables.
Enumeration is inefficient: repeated computation.
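A self-contained sketch of this recursive depth-first enumeration on the burglary network (my rendition of the textbook's algorithm, with abbreviated variable names; the CPT values are the standard AIMA ones):

parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
cpt = {"B": {(): 0.001}, "E": {(): 0.002},
       "A": {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001},
       "J": {(True,): 0.90, (False,): 0.05},
       "M": {(True,): 0.70, (False,): 0.01}}
order = ["B", "E", "A", "J", "M"]  # parents always precede their children

def p(var, val, ev):
    pt = cpt[var][tuple(ev[u] for u in parents[var])]
    return pt if val else 1.0 - pt

def enumerate_all(variables, ev):
    # sum the product of CPT entries over all unassigned variables
    if not variables:
        return 1.0
    y, rest = variables[0], variables[1:]
    if y in ev:
        return p(y, ev[y], ev) * enumerate_all(rest, ev)
    return sum(p(y, v, ev) * enumerate_all(rest, {**ev, y: v})
               for v in (True, False))

def enumeration_ask(x, ev):
    # unnormalized P(x, ev) for both values of x, then normalization by alpha
    dist = {v: enumerate_all(order, {**ev, x: v}) for v in (True, False)}
    alpha = 1.0 / sum(dist.values())
    return {v: alpha * w for v, w in dist.items()}

print(enumeration_ask("B", {"J": True, "M": True}))
# {True: ~0.284, False: ~0.716}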


(Figure: the enumeration tree for P(b | j, m); © S. Russell & P. Norvig, AIMA)

Repeated computation: P(j|a) P(m|a) and P(j|¬a) P(m|¬a) are recomputed for each value of e.


Exercise

1. Consider the probabilistic Wumpus World of the previous chapter:
(a) Describe it as a Bayesian network
(b) Compute the query P(P_1,3 | b, p) via enumeration
(c) Compare the result with that of the example in Ch. 13

Inference by Variable Elimination

Variable elimination:
- carry out the summations right-to-left (i.e., bottom-up in the tree)
- store intermediate results (factors) to avoid recomputation

Ex: P(B | j, m)
= α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) P(m|a)      (one factor per variable: B, E, A, J, M)
= α P(B) Σ_e P(e) Σ_a P(a|B, e) P(j|a) × f_M(A)
= α P(B) Σ_e P(e) Σ_a P(a|B, e) × f_J(A) × f_M(A)
= α P(B) Σ_e P(e) Σ_a f_A(A, B, E) × f_J(A) × f_M(A)
= α P(B) Σ_e P(e) × f_AJM(B, E)                    (sum out A)
= α P(B) Σ_e f_E(E) × f_AJM(B, E)
= α P(B) × f_EAJM(B)                               (sum out E)
= α × f_B(B) × f_EAJM(B)

"×" is the pointwise product (see later);
f_M(A) =def ⟨P(m|a), P(m|¬a)⟩, f_J(A) =def ⟨P(j|a), P(j|¬a)⟩, ...;
f_AJM, f_EAJM, ...: factors obtained by summing over the values of A, E, ...

(Figure: © S. Russell & P. Norvig, AIMA)


Operations on Factors

Sum of factors: element-wise,
( a_11 ... ; ... ; a_n1 ... ) + ( b_11 ... ; ... ; b_n1 ... ) = ( a_11+b_11 ... ; ... ; a_n1+b_n1 ... );
the two factors must have the same argument variables.

Pointwise product: multiply the array elements for the same variable values.
Ex:
f_J(A) × f_M(A) = ⟨P(j|a), P(j|¬a)⟩ × ⟨P(m|a), P(m|¬a)⟩ = ⟨P(j|a)·P(m|a), P(j|¬a)·P(m|¬a)⟩

General case:
f_3(X_1, ..., X_j, Y_1, ..., Y_k, Z_1, ..., Z_l) = f_1(X_1, ..., X_j, Y_1, ..., Y_k) × f_2(Y_1, ..., Y_k, Z_1, ..., Z_l)
- arguments: the union of the arguments
- values: f_3(x, y, z) = f_1(x, y) · f_2(y, z)
- matrix sizes: f_1: 2^(j+k), f_2: 2^(k+l), f_3: 2^(j+k+l)
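A sketch of the pointwise product for factors stored as dicts mapping argument-value tuples to numbers (my representation choice, not the slides'):

from itertools import product

def pointwise_product(f1, vars1, f2, vars2):
    # f3 is defined over the union of the variables: f3(x,y,z) = f1(x,y) * f2(y,z)
    vars3 = vars1 + [v for v in vars2 if v not in vars1]
    f3 = {}
    for vals in product([True, False], repeat=len(vars3)):
        row = dict(zip(vars3, vals))
        f3[vals] = (f1[tuple(row[v] for v in vars1)]
                    * f2[tuple(row[v] for v in vars2)])
    return f3, vars3

# the slide's example, f_J(A) x f_M(A), with the AIMA numbers:
fJ = {(True,): 0.90, (False,): 0.05}
fM = {(True,): 0.70, (False,): 0.01}
f3, _ = pointwise_product(fJ, ["A"], fM, ["A"])
print(f3)  # {(True,): 0.63, (False,): 0.0005}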


Summing Out: Example

f(B, C) = Σ_a f_3(A, B, C) = f_3(a, B, C) + f_3(¬a, B, C)
= ( 0.06 0.24 ; 0.42 0.28 ) + ( 0.18 0.72 ; 0.06 0.04 )
= ( 0.24 0.96 ; 0.48 0.32 )

(Figure: © S. Russell & P. Norvig, AIMA)
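A matching sketch of summing out in the same dict representation; the f_3 entries below reproduce the example (rows = B, columns = C, which is my reading of the matrices):

from itertools import product

def sum_out(var, f, variables):
    # sum the factor over both values of var
    rest = [v for v in variables if v != var]
    g = {}
    for vals in product([True, False], repeat=len(rest)):
        row = dict(zip(rest, vals))
        g[vals] = sum(f[tuple({**row, var: b}[v] for v in variables)]
                      for b in (True, False))
    return g, rest

f3 = {key: x for key, x in zip(
    product([True, False], repeat=3),
    [0.06, 0.24, 0.42, 0.28, 0.18, 0.72, 0.06, 0.04])}
g, _ = sum_out("A", f3, ["A", "B", "C"])
print(g)
# {(True, True): 0.24, (True, False): 0.96, (False, True): 0.48, (False, False): 0.32}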


The Variable Elimination Algorithm

(Figure: pseudo-code of the variable elimination algorithm; © S. Russell & P. Norvig, AIMA)

Efficiency depends on the variable ordering Order(...).
Efficiency improvements:
- factor out of the summations the factors that do not depend on the sum variable
- remove irrelevant variables


Factor Out Constant Factors

If f_1, ..., f_i do not depend on X, then move them out of Σ_x (...):
Σ_x f_1 × ··· × f_k = f_1 × ··· × f_i × Σ_x (f_{i+1} × ··· × f_k) = f_1 × ··· × f_i × f_X
(f_X being the factor obtained by summing X out of f_{i+1} × ··· × f_k)

Ex: Σ_a f_1(A, B) × f_2(B, C) = f_2(B, C) × Σ_a f_1(A, B)

Ex: P(JohnCalls | Burglary = true):
P(J|b) = α Σ_e Σ_a Σ_m P(J, m, b, e, a)
       = α Σ_e Σ_a Σ_m P(b) P(e) P(a|b, e) P(J|a) P(m|a)
       = α P(b) Σ_e P(e) Σ_a P(a|b, e) P(J|a) Σ_m P(m|a)
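A quick numeric check of the Σ_a f_1(A, B) × f_2(B, C) identity, using numpy broadcasting (my sketch; axes 0/1/2 stand for A/B/C):

import numpy as np

rng = np.random.default_rng(0)
f1 = rng.random((2, 2))  # f1(A, B)
f2 = rng.random((2, 2))  # f2(B, C)

# left: sum over A of the pointwise product f1(A,B) x f2(B,C)
left = (f1[:, :, None] * f2[None, :, :]).sum(axis=0)
# right: f2(B,C) x (sum over A of f1(A,B)), since f2 does not depend on A
right = f2 * f1.sum(axis=0)[:, None]
print(np.allclose(left, right))  # True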


Remove Irrelevant Variables

Sometimes we have summations like Σ_y P(y | ...): since Σ_y P(y | ...) = 1, they can be dropped.

Ex: P(JohnCalls | Burglary = true):
P(J|b) = ... = α P(b) Σ_e P(e) Σ_a P(a|b, e) P(J|a) Σ_m P(m|a)     (Σ_m P(m|a) = 1)
             = α P(b) Σ_e P(e) Σ_a P(a|b, e) P(J|a)

Theorem: for query X and evidence E, a variable Y is irrelevant unless Y ∈ Ancestors({X} ∪ E).
Ex: X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}
⇒ MaryCalls is irrelevant.

Related to backward-chaining with Horn clauses.
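A sketch of this relevance test (my helper; the parents dict as before): a variable is kept only if it is the query, an evidence variable, or an ancestor of one of them.

def ancestors(nodes, parents):
    # all proper ancestors of the given set of nodes
    seen, stack = set(), list(nodes)
    while stack:
        for u in parents[stack.pop()]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

parents = {"Burglary": [], "Earthquake": [],
           "Alarm": ["Burglary", "Earthquake"],
           "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"]}
query, evidence = "JohnCalls", {"Burglary"}
relevant = {query} | evidence | ancestors({query} | evidence, parents)
print(set(parents) - relevant)  # {'MaryCalls'}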

