Davide Bacciu
Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it
Machine Learning: Fundamentals (Apprendimento Automatico: Fondamenti) - A.Y. 2016/2017
1 Learning with Hidden Variables
    The Expectation Maximization (EM) Algorithm
    Learning Gaussian Mixture Models
2 Concluding Remarks
3 Applications of Bayesian Learning
    Semantic Analysis of Text
    Image Understanding
    Structure Learning
Part I
Learning with Hidden Variables in Bayesian Networks
Learning with Bayesian Networks

Fixed structure (e.g. X → Y, estimate P(Y|X)): Parameter Learning
    Complete data: Naive Bayes; calculate frequencies (ML)
    Incomplete data: latent variables; EM algorithm (ML); MCMC, VBEM (Bayesian)
Fixed variables only (e.g. X, Y, estimate P(X, Y)): Structure Learning
    Complete data: discover dependencies from the data; structure search; independence tests
    Incomplete data: difficult problem; Structural EM
What if...
Sample classification is no longer observable
(Plate diagram: the Naive Bayes model over D samples with attributes a_l; the class variable c is no longer observed.)
c is now a Hidden Variable within a (multinomial) Mixture Model
Mixture Model
No classification ⇒ seek a natural grouping
What is a natural grouping?
(Example: the same set of people admits several natural groupings, e.g. by family, by school, or into females and males.)
Credit goes to Eamonn Keogh @ UCR
Dealing with Incomplete Data
Incomplete data
Missing observations (imputation)
Unobserved random variables

Key idea
Complete the data by making hypotheses on the unobserved variables
Review the hypotheses
Iterate
Expectation-Maximization Algorithm
Expectation: compute the current estimate of the hidden variables (expected counts)
Maximization: use the estimate to update the model parameters θ
Expectation-Maximization (EM) Algorithm
Likelihood maximization procedure

θ_ML = argmax_θ L(θ|X) = argmax_θ P(X|θ)
L(θ|X) is the incomplete likelihood and X is the incomplete data containing only the observed random variables
Assume complete data d = (X, Z) exists, where z ∈ Z is a hidden or latent variable (e.g. c in the multinomial mixture example)
Compute the complete likelihood

L_c(θ|d) = P(d|θ) = P(X, Z|θ) = P(Z|X, θ) P(X|θ)
Key Idea
Maximize the complete likelihood L_c instead of L
EM at a Glance
1 Initialize the parameters θ^(1)
2 Expectation Step (iteration k):
    Q(θ|θ^(k)) = E_{Z|X,θ^(k)}[log L_c(θ|X, Z)]
3 Maximization Step (iteration k):
    θ^(k+1) = argmax_θ Q(θ|θ^(k))
4 Iterate steps 2 and 3 with k = k + 1 until log L_c stops increasing

Theorem (Dempster, Laird and Rubin (1977))
An EM process ensures L(θ^(k+1)|X) ≥ L(θ^(k)|X)
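To make the loop structure concrete, here is a minimal generic EM skeleton in Python (a sketch, not part of the original slides: the e_step, m_step and log_likelihood callables are hypothetical placeholders that a specific model, such as the GMM below, must supply):

```python
import numpy as np

def em(X, theta, e_step, m_step, log_likelihood, max_iter=100, tol=1e-6):
    """Generic EM: alternate E and M steps until log L stops increasing."""
    prev_ll = -np.inf
    for k in range(max_iter):
        posteriors = e_step(X, theta)      # E-step: P(Z|X, theta^(k))
        theta = m_step(X, posteriors)      # M-step: argmax_theta Q(theta|theta^(k))
        ll = log_likelihood(X, theta)      # non-decreasing by Dempster et al. (1977)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return theta
```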
EM Graphically
(Figure: the auxiliary function Q(θ|θ^(k)) is a lower bound of L(θ|X) touching it at θ^(k); maximizing Q yields θ^(k+1) with L(θ^(k+1)|X) ≥ L(θ^(k)|X).)
Learning Gaussian Mixture Models
(Plate diagrams: a single Gaussian with parameters μ, σ generating observations x_j, j = 1, ..., N; and a mixture of M Gaussians where each x_j is paired with a hidden selector z_j, with parameters π, μ, σ.)
Previously: ML estimation for a single univariate Gaussian
Now: EM estimation for M univariate Gaussians

Gaussian Mixture Model (GMM)
z_jm is the hidden mixture selection variable with prior P(z_jm) = π_m
Each observation x_j is generated by a single Gaussian, P(x_j|z_jm) ∼ N(μ_m, σ_m)
Model parameters θ = (π, μ, σ)
GMM Likelihood
A set of i.i.d. observations X = {x_1, ..., x_N} has incomplete likelihood
L(θ|X) = P(X|θ) = ∏_{j=1}^{N} P(x_j|θ)
Use marginalization to introduce the hidden variable z_jm
L(θ|X) = ∏_{j=1}^{N} ∑_{m=1}^{M} P(x_j, z_jm|θ)
Following the independence relationships in the graphical model, this rewrites as
L(θ|X) = ∏_{j=1}^{N} ∑_{m=1}^{M} P(z_jm|π) P(x_j|z_jm, μ, σ)
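This incomplete log-likelihood is direct to evaluate in code; a minimal numpy/scipy sketch, assuming univariate components and parameter arrays pi, mu, sigma of length M (the names are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, pi, mu, sigma):
    """log L(theta|X) = sum_j log sum_m pi_m N(x_j | mu_m, sigma_m)."""
    densities = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)  # shape (N, M)
    return np.log(densities.sum(axis=1)).sum()
```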
GMM Complete Likelihood
Assume we know the hidden assignment z_jm = 1 for each sample j; then we have the following complete likelihood
L_c(θ|X, Z) = ∏_{j=1}^{N} ∑_{m=1}^{M} z_jm π_m P(x_j|μ_m, σ_m)

Can be better handled by taking the log
log L_c(θ|X, Z) = ∑_{j=1}^{N} ∑_{m=1}^{M} z_jm log(π_m P(x_j|μ_m, σ_m))

Now we can compute
Q(θ|θ^(k)) = E_{Z|X,θ^(k)}[log L_c(θ|X, Z)]
           = ∑_{j=1}^{N} ∑_{m=1}^{M} P(z_jm|x_j, θ^(k)) log(π_m P(x_j|μ_m, σ_m))
Expectation Step
To compute the auxiliary function

Q(θ|θ^(k)) = ∑_{j=1}^{N} ∑_{m=1}^{M} P(z_jm|x_j, θ^(k)) log(π_m P(x_j|μ_m, σ_m))

we only need to estimate the posterior P(z_jm|x_j, θ^(k)) based on the current θ^(k) values
P(z_jm|x_j, θ^(k)) = π_m^(k) P(x_j|μ_m^(k), σ_m^(k)) / ∑_{m'} π_{m'}^(k) P(x_j|μ_{m'}^(k), σ_{m'}^(k))
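These posteriors (the responsibilities) are the whole E-step; a sketch in the same illustrative conventions as the likelihood snippet above:

```python
import numpy as np
from scipy.stats import norm

def e_step(x, pi, mu, sigma):
    """Responsibilities P(z_jm | x_j, theta^(k)) as an (N, M) matrix."""
    joint = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)  # pi_m N(x_j|mu_m, sigma_m)
    return joint / joint.sum(axis=1, keepdims=True)         # normalize over components
```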
Maximization Step
Maximization of the auxiliary function with respect to the parameters θ = (π, μ, σ)
Q(θ|θ^(k)) = ∑_{j=1}^{N} ∑_{m=1}^{M} P(z_jm|x_j, θ^(k)) log π_m
           + ∑_{j=1}^{N} ∑_{m=1}^{M} P(z_jm|x_j, θ^(k)) log P(x_j|μ_m, σ_m)
As usual this is performed by solving

∂Q(θ|θ^(k)) / ∂θ = 0

while taking into account the sum-to-one constraint ∑_{m=1}^{M} π_m = 1
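For the mixing weights, the constraint is handled with a Lagrange multiplier; the derivation, not spelled out on the slide, is one line of calculus:

```latex
\mathcal{L}(\pi,\lambda)
  = \sum_{j=1}^{N}\sum_{m=1}^{M} P(z_{jm}\mid x_j,\theta^{(k)})\log\pi_m
  + \lambda\Big(\sum_{m=1}^{M}\pi_m - 1\Big),
\qquad
\frac{\partial\mathcal{L}}{\partial\pi_m}
  = \frac{1}{\pi_m}\sum_{j=1}^{N} P(z_{jm}\mid x_j,\theta^{(k)}) + \lambda = 0 .
```

Summing the stationarity condition over m and using the constraint gives λ = −N, which yields the π update below; the μ and σ updates follow from unconstrained stationarity of the second term of Q.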
M-Step Learning Equations
Solving the independent maximization problems with respect to π_m, μ_m and σ_m yields
π_m^(k+1) = (1/N) ∑_{j=1}^{N} P(z_jm|x_j, θ^(k))

μ_m^(k+1) = ∑_{j=1}^{N} x_j P(z_jm|x_j, θ^(k)) / ∑_{j=1}^{N} P(z_jm|x_j, θ^(k))

σ_m^(k+1) = sqrt( ∑_{j=1}^{N} P(z_jm|x_j, θ^(k)) (x_j − μ_m^(k+1))² / ∑_{j=1}^{N} P(z_jm|x_j, θ^(k)) )
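A sketch of these updates in numpy (same illustrative conventions as before; resp is the (N, M) responsibility matrix returned by e_step above):

```python
import numpy as np

def m_step(x, resp):
    """Update theta = (pi, mu, sigma) from the responsibilities."""
    Nm = resp.sum(axis=0)                       # expected count of each component
    pi = Nm / len(x)                            # pi_m^(k+1)
    mu = (resp * x[:, None]).sum(axis=0) / Nm   # responsibility-weighted means
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nm
    return pi, mu, np.sqrt(var)                 # sigma_m^(k+1)
```

Plugging e_step, m_step and gmm_log_likelihood into the generic em skeleton above gives a complete univariate GMM learner.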
2D Gaussian Mixture Example
Training data sampled from a number of Gaussian functions (Guess It!!) having common unit variance and different means
1500 test samples drawn from the same distribution for computing training-independent likelihood values (Overfitting!!)
Task is to estimate a probability density for the data using a Gaussian Mixture Model
How many mixture components?
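One practical way to answer the question is to compare held-out likelihoods across candidate component counts; a sketch with scikit-learn's GaussianMixture on made-up data (the means, sample sizes and seed are illustrative, not the slide's actual dataset):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
means = np.array([[-3.0, -3.0], [0.0, 3.0], [3.0, -1.0]])  # hypothetical true means
X_train = np.vstack([rng.normal(m, 1.0, size=(200, 2)) for m in means])
X_test = np.vstack([rng.normal(m, 1.0, size=(500, 2)) for m in means])  # 1500 samples

# Fit mixtures of increasing size and pick the best training-independent
# score, i.e. the mean per-sample log-likelihood on the test data.
for M in range(1, 7):
    gmm = GaussianMixture(n_components=M, random_state=0).fit(X_train)
    print(M, gmm.score(X_test))
```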
3 Gaussian Mixtures

(Figure: left, the estimated mixture density over the 2D samples; right, the log-likelihood over 50 EM iterations, annotated with the actual number of Gaussians generating the training data.)
Credit goes to Mark Girolami @ Glasgow University
The EM Algorithm in a Nut-Shell
The EM algorithm can be summarized by the optimization problem
θ^(k+1) = argmax_θ ∑_z P(Z = z|X, θ^(k)) log L_c(θ|X, Z = z)
The posterior P(Z = z|X, θ^(k)) provides an expected count for the event z, which is the EM counterpart of basic ML counting
The expectation-maximization algorithm is an example of Unsupervised Learning
Take Home Messages on Parameter Learning
Maximum Likelihood estimates provide straightforward solutions to both supervised and unsupervised learning tasks
Maximum Likelihood is completely data-driven (handle with care!!)
Point estimates can degenerate if data is sparse
Priors help regularize the estimate (smoothing)
Bayesian learning might be the solution when dealing with little training data
Expectation Maximization
Use it for dealing with unobserved data
Complete observations with hidden variables
Estimate unobserved counts
Compute observed terms using such pseudo-counts
Bayesian Learning in a Single Slide
Bayesian learning is a powerful all-in-one model for
    Modeling your knowledge about the world
    Inference
    Learning
Bayesian learning uses conditional independence to
    Break down the joint probability into smaller pieces
    Describe the world
Bayesian learning is generative
    Describes how data is generated
    Learns the joint probability
Bayesian learning is (often) parametric
    Need to make hypotheses on the data-generating distribution
    Need to choose suitable priors
Discriminative vs. Generative Learning
Two paradigms for learning statistical classifiers
Generative classifiers learn a model of the joint probability P(c, x) of the features and the corresponding class label
    Provide a model of how data is generated
    Use Bayes rule to predict through the class posterior P(c|x)
    Bayesian Learning
Discriminative classifiers directly model the class posterior P(c|x) or use a discriminant function
    Directly target optimizing classification accuracy
    Best to solve a specific estimation problem rather than trying to estimate the general joint density (Vapnik)
    Linear threshold units, neural networks, support vector machines, decision trees, ...
Why (not) Using Bayesian Learning
Cons
Discriminative approaches often lead to better classification/regression accuracies
Inference may require a significant computational cost
Need to correctly guess the data-generating and prior distributions
Pro
Elegant and solid mathematical basis: lots of tools for inference, probabilistic reasoning, learning, and assessing the reliability of the acquired knowledge
Straightforward to incorporate prior knowledge: data distributions and structure
Powerful and sound techniques for dealing with complex unsupervised learning tasks: vision, natural language processing, ...
Things we haven’t seen
Loads of them!!
Fully Bayesian generative models
    Treating maximum likelihood probabilities as random variables
    Choosing appropriate priors
    Variational and Monte Carlo methods
Learning the structure of a Bayesian Network
    PC algorithm
    Searching in the space of graph structures
Simultaneous learning of parameters and structure
    Structural EM
Advanced applications
    Image and video understanding
    Multimodal document analysis
Part II
Applications of Bayesian Learning
Document Classification
(Plate diagram: Naive Bayes over D documents of L words a_l, with class variable c, class prior π and per-class word distributions β_j.)
Naive Bayes for classification of textual documents
Needs large annotated corpora
What if we don't have such supervised information?
Generative Topic Models
Unsupervised Naive Bayes
Document classification becomes an unknown latent topic z
Discovering the natural grouping of documents
Expectation Maximization learning
Unsupervised Naive Bayes allows choosing only one topic per document
Generative Topic Models
Probabilistic Latent Semantic Analysis
Choosing one topic per word
Documents are mixtures of topics
Not a fully Bayesian approach, though
    Generative model only for documents in the training set
Generative Topic Models
Latent Dirichlet Allocation
Document distribution becomes a random variable
    Can generate documents outside the training set
    Distributed as a Dirichlet with hyperparameter α
Parameter learning
    Variational EM
    Gibbs Sampling
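For a hands-on impression, a minimal sketch of fitting LDA with scikit-learn, whose implementation uses variational inference, one of the two learning schemes above (the toy corpus and topic count are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",            # toy corpus, purely illustrative
        "dogs and cats make good pets",
        "stocks fell as markets closed",
        "investors traded shares all day"]

counts = CountVectorizer().fit_transform(docs)   # bag-of-words document-term matrix
lda = LatentDirichletAllocation(n_components=2,  # number of latent topics
                                random_state=0).fit(counts)
doc_topics = lda.transform(counts)               # per-document topic mixtures
```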
Can Make it as Complex as You Wish!
Dependent Pachinko Allocation
Hierarchical topic model
Fixed hierarchy
Fixed number of topics
Topic Models for Images
Caltech-4 Dataset: cars, faces, motorbikes and airplanes on a cluttered background
Fit a probabilistic Latent Semantic Analysis model with 7 topics
Determine the most likely topic of each visual word in image d_i
P(z_k|w_j, d_i) = P(z_k|d_i) P(w_j|z_k) / ∑_{k'=1}^{K} P(z_{k'}|d_i) P(w_j|z_{k'})
Sivic et al (2005) Discovering objects and their location in images. Proc. of the 10th IEEE Int. Conf. on Computer Vision (ICCV’05)
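This posterior is a one-liner given the fitted pLSA parameters; a sketch with hypothetical matrices topic_given_doc (D × K, rows P(z|d_i)) and word_given_topic (K × W, rows P(w|z_k)):

```python
import numpy as np

def topic_posterior(topic_given_doc, word_given_topic, d, w):
    """P(z_k | w_j, d_i) for document index d and visual-word index w."""
    joint = topic_given_doc[d] * word_given_topic[:, w]  # P(z_k|d_i) P(w_j|z_k)
    return joint / joint.sum()                           # normalize over the K topics

# Its argmax over k gives the most likely topic for that visual word.
```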
Learning to Segment Image Parts
Latent Topics Network
Yuan et al, A novel topic-level random walk framework for scene image co-segmentation, ECCV 2014
Discovering Gene Interaction Networks
Gene Expression Data
Assessing Influenza Virulence
Inferred Bayesian graphical model of gene interactions for H5N1 virulence
http://hiv.bio.ed.ac.uk/samLycett/research.html