Higher-Order Probabilistic Programming
A Tutorial at POPL 2019
Part II
Ugo Dal Lago
(Based on joint work with Flavien Breuvart, Raphaëlle Crubillé, Charles Grellois, Davide Sangiorgi, ...)
POPL 2019, Lisbon, January 14th
A Foundation for Higher-Order Probabilistic Programming

- We are interested in a better understanding of some crucial questions about higher-order probabilistic programs, e.g.:
  - How can we formalise and prove that programs have certain desirable properties, like being terminating or consuming a bounded amount of resources?
  - How can we prove programs to be equivalent, or more generally to be in a certain relation?
- We could in principle answer these questions directly in a programming language like OCaml.
- It is methodologically much better to distill some paradigmatic calculi which expose all the essential features, but which are agnostic to many unimportant details.
- We will introduce and study two such calculi:
  - PCF⊕, a calculus for randomized higher-order programming.
  - PCFsample,score, a calculus for Bayesian programming.
PCF⊕: Types, Terms, Values

Types    τ, ρ ::= Unit | Num | τ → ρ

Terms    M, N ::= V | V W | let M = x in N | M ⊕ N
                | if V then M else N | fn(V1, . . . , Vn)

Values   V, W ::= ⋆ | x | r | λx.M | fix x.V
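For concreteness, the grammar above can be transcribed as OCaml datatypes. This is only an illustrative sketch: the constructor names and the use of float for the numerals r are our own choices, not part of the calculus.

(* A possible OCaml transcription of the PCF⊕ syntax above.
   Constructor names and the use of float for numerals are
   illustrative choices, not part of the calculus itself. *)

type ty =
  | Unit
  | Num
  | Arrow of ty * ty                       (* τ → ρ *)

type value =
  | Star                                   (* ⋆ : Unit *)
  | Var of string                          (* x *)
  | Real of float                          (* r : Num *)
  | Lam of string * term                   (* λx.M *)
  | Fix of string * value                  (* fix x.V *)

and term =
  | Val of value                           (* V *)
  | App of value * value                   (* V W *)
  | Let of term * string * term            (* let M = x in N *)
  | Choice of term * term                  (* M ⊕ N *)
  | If of value * term * term              (* if V then M else N *)
  | Prim of string * value list            (* fn(V1, ..., Vn) *)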
PCF⊕: Type Assignment Rules

Value Typing Rules

  ──────────────── S        ────────────────── V        ─────────────── R
   Γ ⊢ ⋆ : Unit              Γ, x : τ ⊢ x : τ             Γ ⊢ r : Num

   Γ, x : τ ⊢ M : ρ                Γ, x : τ → ρ ⊢ M : τ → ρ
  ──────────────────── λ          ──────────────────────────── X
   Γ ⊢ λx.M : τ → ρ                Γ ⊢ fix x.M : τ → ρ

Term Typing Rules

   Γ ⊢ V : τ → ρ    Γ ⊢ W : τ              Γ ⊢ M : τ    Γ, x : τ ⊢ N : ρ
  ──────────────────────────── @          ───────────────────────────── L
   Γ ⊢ V W : ρ                             Γ ⊢ let M = x in N : ρ

   Γ ⊢ M : τ    Γ ⊢ N : τ          Γ ⊢ V : Num    Γ ⊢ M : τ    Γ ⊢ N : τ
  ────────────────────────── ⊕    ─────────────────────────────────────── I
   Γ ⊢ M ⊕ N : τ                   Γ ⊢ if V then M else N : τ

   Γ ⊢ V1 : Num   · · ·   Γ ⊢ Vn : Num
  ────────────────────────────────────── F
   Γ ⊢ fn(V1, . . . , Vn) : Num

- The closed terms of type τ form a set CTτ.
- Similarly for values, whose set is written CVτ.
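As a small sanity check on the rules (our own example, not from the slides), here is a derivation typing the term λx. x ⊕ x, which returns either copy of its numeric argument with probability 1/2:

\[
\dfrac{\dfrac{}{x : \mathsf{Num} \vdash x : \mathsf{Num}}\;V
       \qquad
       \dfrac{}{x : \mathsf{Num} \vdash x : \mathsf{Num}}\;V}
      {x : \mathsf{Num} \vdash x \oplus x : \mathsf{Num}}\;\oplus
\qquad\qquad
\dfrac{x : \mathsf{Num} \vdash x \oplus x : \mathsf{Num}}
      {\vdash \lambda x.\, x \oplus x : \mathsf{Num} \to \mathsf{Num}}\;\lambda
\]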
Distributions

- Given any set X, a distribution on X is a function D : X → [0, 1] such that D(x) > 0 only for denumerably many elements of X and ∑x∈X D(x) ≤ 1.
- The support of a distribution D on X is the subset SUPP(D) of X defined as SUPP(D) := {x ∈ X | D(x) > 0}.
- The set of all distributions over X is indicated as D(X).
- We indicate the distribution assigning probability 1 to the element x ∈ X and 0 to any other element of X as δ(x).
- Given a distribution D on X, its sum ∑D is simply ∑x∈X D(x) (a small OCaml sketch of these notions follows below).
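The notions above can be sketched directly in OCaml, restricting to finitely-supported (sub)distributions represented as association lists; the module name and representation are illustrative choices of ours, not from the slides.

(* Finitely-supported subdistributions as association lists pairing
   an outcome with its probability. Illustrative only: the slides
   allow denumerable supports, which this sketch does not. *)

module Dist = struct
  type 'a t = ('a * float) list

  (* δ(x): all the mass on a single element. *)
  let dirac (x : 'a) : 'a t = [ (x, 1.0) ]

  (* SUPP(D): the elements carrying positive probability. *)
  let support (d : 'a t) : 'a list =
    List.filter_map (fun (x, p) -> if p > 0.0 then Some x else None) d

  (* ∑ D: the total mass, at most 1 for a subdistribution. *)
  let sum (d : 'a t) : float =
    List.fold_left (fun acc (_, p) -> acc +. p) 0.0 d

  (* Pointwise scaling, handy for building p·D + q·E later on. *)
  let scale (p : float) (d : 'a t) : 'a t =
    List.map (fun (x, q) -> (x, p *. q)) d
end

(* Example: a fair coin as a distribution over booleans. *)
let coin : bool Dist.t = [ (true, 0.5); (false, 0.5) ]
let () = assert (Dist.sum coin = 1.0)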
PCF⊕: Reduction

One-Step Reduction

  (λx.M)V → δ(M[V/x])                     let V = x in M → δ(M[V/x])

  if 0 then M else N → δ(M)               if r then M else N → δ(N)   (if r ≠ 0)

  M ⊕ N → {M : 1/2, N : 1/2}              fn(r1, . . . , rn) → δ(f∗(r1, . . . , rn))

   M → {Li : pi}i∈I
  ──────────────────────────────────────────────
   let M = x in N → {let Li = x in N : pi}i∈I
PCF⊕: Reduction

Step-Indexed Reduction

  ──────────          ──────────────          ───────────────
   M ⇒0 ∅              V ⇒1 δ(V)               V ⇒n+2 ∅

   M → D        ∀N ∈ SUPP(D). N ⇒n EN
  ──────────────────────────────────────────
   M ⇒n+1  ∑N∈SUPP(D) D(N) · EN
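As a small worked example (ours, not from the slides), take the closed term 0 ⊕ 1 of type Num: it makes one probabilistic step, and both reducts are already values, so

\[
0 \oplus 1 \to \{\,0 : \tfrac{1}{2},\; 1 : \tfrac{1}{2}\,\},
\qquad
0 \Rightarrow_1 \delta(0),
\qquad
1 \Rightarrow_1 \delta(1),
\]
\[
\text{hence}\qquad
0 \oplus 1 \;\Rightarrow_2\; \tfrac{1}{2}\,\delta(0) + \tfrac{1}{2}\,\delta(1),
\qquad\text{while}\qquad
0 \oplus 1 \;\Rightarrow_n\; \emptyset \quad\text{for every } n \neq 2 .
\]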
PCF⊕: Some Easy Meta-Theorems

Lemma
If M ⇒n D, then SUPP(D) is a finite set.

Proposition (Progress)
For every M ∈ CTτ, either M is a value or there is D with M → D.

Proposition (Subject Reduction)
For every M ∈ CTτ and for every n ∈ N, if M → D and M ⇒n E, then D ∈ D(CTτ) and E ∈ D(CVτ).

Corollary
For every M ∈ CTτ and for every n ∈ N, there is exactly one distribution Dn such that M ⇒n Dn. We will write ⟨M⟩n for such a distribution.
PCF⊕: The Operational Semantics of a Term

- Given two distributions D, E ∈ D(X), we write D ≤ E iff D(x) ≤ E(x) for every x ∈ X. This relation endows D(X) with the structure of a partial order, which is actually an ωCPO.
- Given a closed term M ∈ CTτ, the operational semantics of M is defined to be the distribution ⟨M⟩ ∈ D(CVτ) defined as ∑n∈N ⟨M⟩n.

Term Contexts    CT, DT ::= CV | [·] | CV V | V CV | let CT = x in N | let M = x in CT
                          | CT ⊕ DT | if V then CT else DT

Value Contexts   CV, DV ::= λx.CT | fix x.CV

- Given two terms M, N such that Γ ⊢ M : τ and Γ ⊢ N : τ, we say that M and N are (Γ, τ)-equivalent, and we write M ≡Γ,τ N, iff whenever ∅ ⊢ C[Γ ⊢ · : τ] : Unit, it holds that ∑⟨C[M]⟩ = ∑⟨C[N]⟩.
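For instance (our illustration, not from the slides), one expects contextual equivalence to validate the basic laws of fair choice,

\[
M \oplus N \;\equiv_{\Gamma,\tau}\; N \oplus M,
\qquad\qquad
M \oplus M \;\equiv_{\Gamma,\tau}\; M,
\]

since in any closing context the two sides give rise to the same distribution over values, and hence to the same probability of convergence.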
PCF⊕: From Equivalences to Metrics?

- The definition of contextual equivalence asks that ∑⟨C[M]⟩ = ∑⟨C[N]⟩ for every context C.
- But what if ∑⟨C[M]⟩ and ∑⟨C[N]⟩ are very close, without being really equal to each other?
- It makes sense to generalize contextual equivalence to a notion of distance:

      δΓ,τ(M, N) = sup over all C with ∅ ⊢ C[Γ ⊢ · : τ] : Unit of | ∑⟨C[M]⟩ − ∑⟨C[N]⟩ |

- For every Γ, τ, the function δΓ,τ is indeed a pseudo-metric:

      δΓ,τ(M, M) = 0
      δΓ,τ(M, N) = δΓ,τ(N, M)
      δΓ,τ(M, L) ≤ δΓ,τ(M, N) + δΓ,τ(N, L)
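As a simple illustration (ours, not from the slides): if Ω is any diverging closed term of type Unit, the empty context C = [·] already separates ⋆ from ⋆ ⊕ Ω, since

\[
\Big|\, \textstyle\sum \langle \star \rangle - \sum \langle \star \oplus \Omega \rangle \,\Big|
\;=\; \Big|\, 1 - \tfrac{1}{2} \,\Big| \;=\; \tfrac{1}{2},
\qquad\text{so}\qquad
\delta_{\emptyset,\mathsf{Unit}}(\star,\; \star \oplus \Omega) \;\geq\; \tfrac{1}{2} .
\]

Other contexts may of course push the distance higher; any single context only yields a lower bound on the supremum.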
Termination in a Probabilistic Setting

- Let M be any closed term. We say that M is almost-surely terminating if ∑⟨M⟩ = 1, namely if its probability of convergence is 1.
- Example:

      GEO := (fix f.λx. x ⊕ (let succ1(x) = y in f y)) 0

- The expected evaluation length of any closed term M is defined as follows (see the worked computation below):

      ExLen(M) := ∑m≥0 ( 1 − ∑n≤m ∑⟨M⟩n )

  Let M be any closed term. We say that M is positively almost-surely terminating if ExLen(M) < +∞.

Lemma
Every positively almost-surely terminating term is almost-surely terminating.
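A quick worked computation (ours, not from the slides): counting the number of recursive calls GEO makes, rather than raw reduction steps, the program behaves like a geometric random variable,

\[
\Pr[\text{GEO stops after exactly } k \text{ calls}] = 2^{-k},
\qquad
\sum_{k \geq 1} 2^{-k} = 1,
\qquad
\sum_{k \geq 1} k \cdot 2^{-k} = 2 .
\]

So GEO converges with probability 1, and since every call contributes only a bounded number of reduction steps, ExLen(GEO) is finite: GEO is both almost-surely and positively almost-surely terminating.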
Variations on PCF⊕

- Pure and untyped, rather than applied and typed:
  - Terms:  M ::= x | λx.M | M M | M ⊕ M
  - Values: V ::= λx.M
  - One-Step Reduction (call-by-value):

      (λx.M)V → δ(M[V/x])              M ⊕ N → {M : 1/2, N : 1/2}

       M → {Li : pi}i∈I                  M → {Li : pi}i∈I
      ───────────────────────           ───────────────────────
       M N → {Li N : pi}i∈I              V M → {V Li : pi}i∈I

  - The calculus thus obtained will be referred to as Λ⊕ (an evaluator sketch follows below).
- CBN rather than CBV:
  - One-Step Reduction:

      (λx.M)N → δ(M[N/x])              M ⊕ N → {M : 1/2, N : 1/2}

       M → {Li : pi}i∈I
      ───────────────────────
       M N → {Li N : pi}i∈I
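A minimal OCaml sketch (ours) of the CBV variant Λ⊕: a term is evaluated with a step budget n, yielding the subdistribution of values it can reach within that budget, in the spirit of the step-indexed relation seen earlier. Representation choices (strings for variables, association lists for distributions, naive substitution for closed programs) are ours.

(* Λ⊕, call-by-value, evaluated to a subdistribution of values.
   Illustrative sketch: variables are strings, distributions are
   association lists of (term, probability) pairs. *)

type term =
  | Var of string
  | Lam of string * term
  | App of term * term
  | Choice of term * term                      (* M ⊕ N *)

let is_value = function Lam _ -> true | _ -> false

(* Naive substitution: adequate only for closed arguments, as in CBV
   evaluation of closed programs. *)
let rec subst x v = function
  | Var y -> if y = x then v else Var y
  | Lam (y, m) -> if y = x then Lam (y, m) else Lam (y, subst x v m)
  | App (m, n) -> App (subst x v m, subst x v n)
  | Choice (m, n) -> Choice (subst x v m, subst x v n)

(* One-step reduction: a non-value term reduces to a distribution over
   terms, here a list of (term, probability) pairs. *)
let rec step = function
  | App (Lam (x, m), v) when is_value v -> [ (subst x v m, 1.0) ]
  | Choice (m, n) -> [ (m, 0.5); (n, 0.5) ]
  | App (m, n) when not (is_value m) ->
      List.map (fun (l, p) -> (App (l, n), p)) (step m)
  | App (v, m) (* v is a value, m is not *) ->
      List.map (fun (l, p) -> (App (v, l), p)) (step m)
  | _ -> []                                    (* values or stuck terms *)

(* eval n m: the subdistribution of values reachable from m within n
   steps; mass that has not converged yet is simply dropped. *)
let rec eval n m =
  if is_value m then [ (m, 1.0) ]
  else if n = 0 then []
  else
    List.concat_map
      (fun (l, p) -> List.map (fun (v, q) -> (v, p *. q)) (eval (n - 1) l))
      (step m)

(* Example: (λx.x) ⊕ (λx.λy.x), evaluated with a budget of 10 steps,
   yields each abstraction with probability 1/2. *)
let example = Choice (Lam ("x", Var "x"), Lam ("x", Lam ("y", Var "x")))
let _ = eval 10 example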
From Randomised Algorithms to Bayesian Programming

- Binary probabilistic choice is perfectly adequate to model randomised computation.

Theorem
The class of computable probabilistic functions coincides with the class of probabilistic functions computable by PCFN⊕.

- In recent years, starting from the pioneering work on languages like Church, Anglican, or Hansei, functional programs have also been employed as a means to represent probabilistic models rather than algorithms.
- The languages above can be modeled [Staton2017] as λ-calculi endowed with two new operators:
  - sample, modeling sampling from the uniform distribution on [0, 1].
  - score, which takes a positive real number r as a parameter and modifies the weight of the current probabilistic branch by multiplying it by r.
PCFsample,score: Terms, Typing Rules, and Reduction

Terms   M, N ::= · · · | sample | score(V)

Typing Rules

  ──────────────────── A
   Γ ⊢ sample : Num

   Γ ⊢ V : Num
  ──────────────────── C
   Γ ⊢ score(V) : Unit

- One needs to switch from distributions to measures, and to assume that the underlying set, namely R, has the structure of a measurable space.
- Adapting the rule for let-terms naturally leads to

       M → µ
  ─────────────────────────────────────
   let M = x in N → let µ = x in N

  where let µ = x in N should itself be a measure.
- The key rule in step-indexed reduction needs to be adapted:

   M → µ        ∀N ∈ SUPP(µ). N ⇒n σN
  ─────────────────────────────────────
   M ⇒n+1  A ↦ ∫ σN(A) µ(dN)
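To give a feel for how sample and score are used together, here is a tiny model written in the let-notation of the calculus (our own illustrative example; pdf stands for some primitive function on Num computing the density of an observed datum under the parameter x):

\[
\mathtt{let}\ \mathtt{sample} = x\ \mathtt{in}\
\mathtt{let}\ \mathit{pdf}(x) = w\ \mathtt{in}\
\mathtt{let}\ \mathtt{score}(w) = y\ \mathtt{in}\ x
\]

Operationally, x is drawn uniformly from [0, 1], the current branch is reweighted by pdf(x), and x is returned; values of x that explain the observation better thus carry more weight in the resulting measure.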
The Operational Meaning of Bayesian Terms

- It can well be that ⟨M⟩ = µ, where µ sums to something strictly greater than 1.
- How should we interpret µ(A) for a measurable set of terms A?
- We need to normalize µ!
- Explicitly building a normalized version of µ (if it exists) is the goal of so-called inference algorithms.
- The operational semantics we have just introduced, called distribution-based, is thus just an idealized form of semantics.
- An executable semantics can be given in the form of a sampling-based semantics.
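Concretely (our gloss, not from the slides): if µ is the measure over values computed for a closed term M of type τ, and its total mass Z = µ(CVτ) is finite and nonzero, the intended meaning of M is the normalized measure

\[
\widehat{\mu}(A) \;=\; \frac{\mu(A)}{\mu(\mathrm{CV}_\tau)}
\qquad \text{for every measurable } A \subseteq \mathrm{CV}_\tau ,
\]

and inference algorithms approximate this normalized measure without ever computing the normalizing constant Z exactly.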
Sampling-Based Semantics (1)

A configuration ⟨M, s⟩ pairs a term with a trace s of pre-drawn random numbers; each step carries a weight (the superscript on the arrow), which is 1 except for score.

  ⟨(λx.M)V, s⟩ →^1 ⟨M[V/x], s⟩                ⟨let V = x in M, s⟩ →^1 ⟨M[V/x], s⟩

  ⟨if 0 then M else N, s⟩ →^1 ⟨M, s⟩          ⟨if r then M else N, s⟩ →^1 ⟨N, s⟩   (if r ≠ 0)

  ⟨sample, r :: s⟩ →^1 ⟨r, s⟩                  ⟨score(r), s⟩ →^r ⟨⋆, s⟩

  ⟨f(r1, . . . , rn), s⟩ →^1 ⟨f∗(r1, . . . , rn), s⟩

   ⟨M, s⟩ →^r ⟨L, t⟩
  ─────────────────────────────────────────────
   ⟨let M = x in N, s⟩ →^r ⟨let L = x in N, t⟩
Sampling-Based Semantics (2)

Multi-step evaluation to a value, written ⇒V, multiplies the weights of the individual steps:

  ⟨V, s⟩ ⇒V^1 ⟨V, s⟩

   ⟨M, s⟩ →^r ⟨N, t⟩        ⟨N, t⟩ ⇒V^q ⟨L, u⟩
  ───────────────────────────────────────────────
   ⟨M, s⟩ ⇒V^{r·q} ⟨L, u⟩
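A minimal OCaml sketch (ours) of this sampling-based view, for a tiny first-order fragment: a run consumes a trace of pre-drawn uniform samples, returns a value together with the accumulated score, and repeating such weighted runs is the basis of importance-sampling inference. All names and the fragment itself are illustrative, not part of the calculus.

(* A weighted, trace-consuming evaluator for a tiny expression language
   with sample and score, in the spirit of the sampling-based semantics.
   Illustrative sketch only: the fragment is first-order and finite. *)

type expr =
  | Const of float
  | Sample                                   (* uniform draw from [0,1] *)
  | Score of expr                            (* multiply the run's weight *)
  | Let of string * expr * expr              (* let x = e1 in e2 *)
  | Var of string
  | Prim of (float -> float) * expr          (* unary primitive f*(r) *)

(* eval env trace e = (value, remaining trace, weight of this run) *)
let rec eval env trace e =
  match e with
  | Const r -> (r, trace, 1.0)
  | Var x -> (List.assoc x env, trace, 1.0)
  | Sample ->
      (match trace with
       | r :: rest -> (r, rest, 1.0)
       | [] -> failwith "trace exhausted")
  | Score e1 ->
      let r, rest, w = eval env trace e1 in
      (* ⟨score(r), s⟩ →^r ⟨⋆, s⟩; we return r itself as a dummy value. *)
      (r, rest, w *. r)
  | Prim (f, e1) ->
      let r, rest, w = eval env trace e1 in
      (f r, rest, w)
  | Let (x, e1, e2) ->
      let v1, rest1, w1 = eval env trace e1 in
      let v2, rest2, w2 = eval ((x, v1) :: env) rest1 e2 in
      (v2, rest2, w1 *. w2)

(* One weighted run with fresh uniform randomness, plus a crude
   self-normalized importance-sampling estimate of the posterior mean. *)
let run e =
  let trace = List.init 16 (fun _ -> Random.float 1.0) in
  let v, _, w = eval [] trace e in
  (v, w)

let posterior_mean n e =
  let samples = List.init n (fun _ -> run e) in
  let num = List.fold_left (fun a (v, w) -> a +. (v *. w)) 0.0 samples in
  let den = List.fold_left (fun a (_, w) -> a +. w) 0.0 samples in
  num /. den

(* Example: draw x uniformly, weight the run by x itself, return x.
   The self-normalized mean should be close to 2/3. *)
let model = Let ("x", Sample, Let ("_", Score (Var "x"), Var "x"))
let _ = posterior_mean 10_000 model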