Higher-Order Probabilistic Programming
A Tutorial at POPL 2019
Part II
Ugo Dal Lago
(Based on joint work with Flavien Breuvart, Raphaëlle Crubillé, Charles Grellois, Davide Sangiorgi, ...)
POPL 2019, Lisbon, January 14th
A Foundation for Higher-Order Probabilistic Programming

- We are interested in a better understanding of some crucial questions about higher-order probabilistic programs, e.g.:
  - How can we formalise and prove that programs have certain desirable properties, like being terminating or consuming a bounded amount of resources?
  - How can we prove programs to be equivalent, or more generally to be in a certain relation?
- We could in principle answer these questions directly in a programming language like OCaml.
- It is methodologically much better to distill some paradigmatic calculi which expose all the essential features, but which are agnostic to many unimportant details.
- We will introduce and study two such calculi:
  - PCF⊕, a calculus for randomized higher-order programming.
  - PCFsample,score, a calculus for Bayesian programming.
PCF⊕: Types, Terms, Values

Types    τ, ρ ::= Unit | Num | τ → ρ

Terms    M, N ::= V | V W | let M = x in N | M ⊕ N
                | if V then M else N | fn(V1, . . . , Vn)

Values   V, W ::= ⋆ | x | r | λx.M | fix x.V
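For concreteness, the grammar above can be transcribed as OCaml datatypes. This is only an illustrative sketch: the constructor names and the use of float for the numerals r are our own choices, not part of the calculus.

(* A possible OCaml transcription of the PCF⊕ syntax above.
   Constructor names and the use of float for numerals are
   illustrative choices, not part of the calculus itself. *)

type ty =
  | Unit
  | Num
  | Arrow of ty * ty                       (* τ → ρ *)

type value =
  | Star                                   (* ⋆ : Unit *)
  | Var of string                          (* x *)
  | Real of float                          (* r : Num *)
  | Lam of string * term                   (* λx.M *)
  | Fix of string * value                  (* fix x.V *)

and term =
  | Val of value                           (* V *)
  | App of value * value                   (* V W *)
  | Let of term * string * term            (* let M = x in N *)
  | Choice of term * term                  (* M ⊕ N *)
  | If of value * term * term              (* if V then M else N *)
  | Prim of string * value list            (* fn(V1, ..., Vn) *)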
PCF⊕: Type Assignment Rules

Value Typing Rules

  ──────────────── S        ────────────────── V        ─────────────── R
   Γ ⊢ ⋆ : Unit              Γ, x : τ ⊢ x : τ             Γ ⊢ r : Num

   Γ, x : τ ⊢ M : ρ                Γ, x : τ → ρ ⊢ M : τ → ρ
  ──────────────────── λ          ──────────────────────────── X
   Γ ⊢ λx.M : τ → ρ                Γ ⊢ fix x.M : τ → ρ

Term Typing Rules

   Γ ⊢ V : τ → ρ    Γ ⊢ W : τ              Γ ⊢ M : τ    Γ, x : τ ⊢ N : ρ
  ──────────────────────────── @          ───────────────────────────── L
   Γ ⊢ V W : ρ                             Γ ⊢ let M = x in N : ρ

   Γ ⊢ M : τ    Γ ⊢ N : τ          Γ ⊢ V : Num    Γ ⊢ M : τ    Γ ⊢ N : τ
  ────────────────────────── ⊕    ─────────────────────────────────────── I
   Γ ⊢ M ⊕ N : τ                   Γ ⊢ if V then M else N : τ

   Γ ⊢ V1 : Num   · · ·   Γ ⊢ Vn : Num
  ────────────────────────────────────── F
   Γ ⊢ fn(V1, . . . , Vn) : Num

- The closed terms of type τ form a set CTτ.
- Similarly for values, whose set is written CVτ.
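As a small sanity check on the rules (our own example, not from the slides), here is a derivation typing the term λx. x ⊕ x, which returns either copy of its numeric argument with probability 1/2:

\[
\dfrac{\dfrac{}{x : \mathsf{Num} \vdash x : \mathsf{Num}}\;V
       \qquad
       \dfrac{}{x : \mathsf{Num} \vdash x : \mathsf{Num}}\;V}
      {x : \mathsf{Num} \vdash x \oplus x : \mathsf{Num}}\;\oplus
\qquad\qquad
\dfrac{x : \mathsf{Num} \vdash x \oplus x : \mathsf{Num}}
      {\vdash \lambda x.\, x \oplus x : \mathsf{Num} \to \mathsf{Num}}\;\lambda
\]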
Distributions

- Given any set X, a distribution on X is a function D : X → [0, 1] such that D(x) > 0 only for denumerably many elements of X and ∑x∈X D(x) ≤ 1.
- The support of a distribution D on X is the subset SUPP(D) of X defined as SUPP(D) := {x ∈ X | D(x) > 0}.
- The set of all distributions over X is indicated as D(X).
- We indicate the distribution assigning probability 1 to the element x ∈ X and 0 to any other element of X as δ(x).
- Given a distribution D on X, its sum ∑D is simply ∑x∈X D(x) (a small OCaml sketch of these notions follows below).
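The notions above can be sketched directly in OCaml, restricting to finitely-supported (sub)distributions represented as association lists; the module name and representation are illustrative choices of ours, not from the slides.

(* Finitely-supported subdistributions as association lists pairing
   an outcome with its probability. Illustrative only: the slides
   allow denumerable supports, which this sketch does not. *)

module Dist = struct
  type 'a t = ('a * float) list

  (* δ(x): all the mass on a single element. *)
  let dirac (x : 'a) : 'a t = [ (x, 1.0) ]

  (* SUPP(D): the elements carrying positive probability. *)
  let support (d : 'a t) : 'a list =
    List.filter_map (fun (x, p) -> if p > 0.0 then Some x else None) d

  (* ∑ D: the total mass, at most 1 for a subdistribution. *)
  let sum (d : 'a t) : float =
    List.fold_left (fun acc (_, p) -> acc +. p) 0.0 d

  (* Pointwise scaling, handy for building p·D + q·E later on. *)
  let scale (p : float) (d : 'a t) : 'a t =
    List.map (fun (x, q) -> (x, p *. q)) d
end

(* Example: a fair coin as a distribution over booleans. *)
let coin : bool Dist.t = [ (true, 0.5); (false, 0.5) ]
let () = assert (Dist.sum coin = 1.0)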
PCF⊕: Reduction

One-Step Reduction

  (λx.M)V → δ(M[V/x])                     let V = x in M → δ(M[V/x])

  if 0 then M else N → δ(M)               if r then M else N → δ(N)   (if r ≠ 0)

  M ⊕ N → {M : 1/2, N : 1/2}              fn(r1, . . . , rn) → δ(f∗(r1, . . . , rn))

   M → {Li : pi}i∈I
  ──────────────────────────────────────────────
   let M = x in N → {let Li = x in N : pi}i∈I
PCF⊕: Reduction

Step-Indexed Reduction

  ──────────          ──────────────          ───────────────
   M ⇒0 ∅              V ⇒1 δ(V)               V ⇒n+2 ∅

   M → D        ∀N ∈ SUPP(D). N ⇒n EN
  ──────────────────────────────────────────
   M ⇒n+1  ∑N∈SUPP(D) D(N) · EN
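As a small worked example (ours, not from the slides), take the closed term 0 ⊕ 1 of type Num: it makes one probabilistic step, and both reducts are already values, so

\[
0 \oplus 1 \to \{\,0 : \tfrac{1}{2},\; 1 : \tfrac{1}{2}\,\},
\qquad
0 \Rightarrow_1 \delta(0),
\qquad
1 \Rightarrow_1 \delta(1),
\]
\[
\text{hence}\qquad
0 \oplus 1 \;\Rightarrow_2\; \tfrac{1}{2}\,\delta(0) + \tfrac{1}{2}\,\delta(1),
\qquad\text{while}\qquad
0 \oplus 1 \;\Rightarrow_n\; \emptyset \quad\text{for every } n \neq 2 .
\]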
PCF⊕: Some Easy Meta-Theorems

Lemma
If M ⇒n D, then SUPP(D) is a finite set.

Proposition (Progress)
For every M ∈ CTτ, either M is a value or there is D with M → D.

Proposition (Subject Reduction)
For every M ∈ CTτ and for every n ∈ N, if M → D and M ⇒n E, then D ∈ D(CTτ) and E ∈ D(CVτ).

Corollary
For every M ∈ CTτ and for every n ∈ N, there is exactly one distribution Dn such that M ⇒n Dn. We will write ⟨M⟩n for such a distribution.
PCF⊕: The Operational Semantics of a Term

- Given two distributions D, E ∈ D(X), we write D ≤ E iff D(x) ≤ E(x) for every x ∈ X. This relation endows D(X) with the structure of a partial order, which is actually an ωCPO.
- Given a closed term M ∈ CTτ, the operational semantics of M is defined to be the distribution ⟨M⟩ ∈ D(CVτ) defined as ∑n∈N ⟨M⟩n.

Term Contexts    CT, DT ::= CV | [·] | CV V | V CV | let CT = x in N | let M = x in CT
                          | CT ⊕ DT | if V then CT else DT

Value Contexts   CV, DV ::= λx.CT | fix x.CV

- Given two terms M, N such that Γ ⊢ M : τ and Γ ⊢ N : τ, we say that M and N are (Γ, τ)-equivalent, and we write M ≡Γ,τ N, iff whenever ∅ ⊢ C[Γ ⊢ · : τ] : Unit, it holds that ∑⟨C[M]⟩ = ∑⟨C[N]⟩.
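For instance (our illustration, not from the slides), one expects contextual equivalence to validate the basic laws of fair choice,

\[
M \oplus N \;\equiv_{\Gamma,\tau}\; N \oplus M,
\qquad\qquad
M \oplus M \;\equiv_{\Gamma,\tau}\; M,
\]

since in any closing context the two sides give rise to the same distribution over values, and hence to the same probability of convergence.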
PCF⊕: From Equivalences to Metrics?

- The definition of contextual equivalence asks that ∑⟨C[M]⟩ = ∑⟨C[N]⟩ for every context C.
- But what if ∑⟨C[M]⟩ and ∑⟨C[N]⟩ are very close, without being really equal to each other?
- It makes sense to generalize contextual equivalence to a notion of distance:

      δΓ,τ(M, N) = sup over all C with ∅ ⊢ C[Γ ⊢ · : τ] : Unit of | ∑⟨C[M]⟩ − ∑⟨C[N]⟩ |

- For every Γ, τ, the function δΓ,τ is indeed a pseudo-metric:

      δΓ,τ(M, M) = 0
      δΓ,τ(M, N) = δΓ,τ(N, M)
      δΓ,τ(M, L) ≤ δΓ,τ(M, N) + δΓ,τ(N, L)
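As a simple illustration (ours, not from the slides): if Ω is any diverging closed term of type Unit, the empty context C = [·] already separates ⋆ from ⋆ ⊕ Ω, since

\[
\Big|\, \textstyle\sum \langle \star \rangle - \sum \langle \star \oplus \Omega \rangle \,\Big|
\;=\; \Big|\, 1 - \tfrac{1}{2} \,\Big| \;=\; \tfrac{1}{2},
\qquad\text{so}\qquad
\delta_{\emptyset,\mathsf{Unit}}(\star,\; \star \oplus \Omega) \;\geq\; \tfrac{1}{2} .
\]

Other contexts may of course push the distance higher; any single context only yields a lower bound on the supremum.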
Termination in a Probabilistic Setting

- Let M be any closed term. We say that M is almost-surely terminating if ∑⟨M⟩ = 1, namely if its probability of convergence is 1.
- Example:

      GEO := (fix f.λx. x ⊕ (let succ1(x) = y in f y)) 0

- The expected evaluation length of any closed term M is defined as follows (see the worked computation below):

      ExLen(M) := ∑m≥0 ( 1 − ∑n≤m ∑⟨M⟩n )

  Let M be any closed term. We say that M is positively almost-surely terminating if ExLen(M) < +∞.

Lemma
Every positively almost-surely terminating term is almost-surely terminating.
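A quick worked computation (ours, not from the slides): counting the number of recursive calls GEO makes, rather than raw reduction steps, the program behaves like a geometric random variable,

\[
\Pr[\text{GEO stops after exactly } k \text{ calls}] = 2^{-k},
\qquad
\sum_{k \geq 1} 2^{-k} = 1,
\qquad
\sum_{k \geq 1} k \cdot 2^{-k} = 2 .
\]

So GEO converges with probability 1, and since every call contributes only a bounded number of reduction steps, ExLen(GEO) is finite: GEO is both almost-surely and positively almost-surely terminating.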
Variations on PCF⊕

- Pure and untyped, rather than applied and typed:
  - Terms:  M ::= x | λx.M | M M | M ⊕ M
  - Values: V ::= λx.M
  - One-Step Reduction (call-by-value):

      (λx.M)V → δ(M[V/x])              M ⊕ N → {M : 1/2, N : 1/2}

       M → {Li : pi}i∈I                  M → {Li : pi}i∈I
      ───────────────────────           ───────────────────────
       M N → {Li N : pi}i∈I              V M → {V Li : pi}i∈I

  - The calculus thus obtained will be referred to as Λ⊕ (an evaluator sketch follows below).
- CBN rather than CBV:
  - One-Step Reduction:

      (λx.M)N → δ(M[N/x])              M ⊕ N → {M : 1/2, N : 1/2}

       M → {Li : pi}i∈I
      ───────────────────────
       M N → {Li N : pi}i∈I
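A minimal OCaml sketch (ours) of the CBV variant Λ⊕: a term is evaluated with a step budget n, yielding the subdistribution of values it can reach within that budget, in the spirit of the step-indexed relation seen earlier. Representation choices (strings for variables, association lists for distributions, naive substitution for closed programs) are ours.

(* Λ⊕, call-by-value, evaluated to a subdistribution of values.
   Illustrative sketch: variables are strings, distributions are
   association lists of (term, probability) pairs. *)

type term =
  | Var of string
  | Lam of string * term
  | App of term * term
  | Choice of term * term                      (* M ⊕ N *)

let is_value = function Lam _ -> true | _ -> false

(* Naive substitution: adequate only for closed arguments, as in CBV
   evaluation of closed programs. *)
let rec subst x v = function
  | Var y -> if y = x then v else Var y
  | Lam (y, m) -> if y = x then Lam (y, m) else Lam (y, subst x v m)
  | App (m, n) -> App (subst x v m, subst x v n)
  | Choice (m, n) -> Choice (subst x v m, subst x v n)

(* One-step reduction: a non-value term reduces to a distribution over
   terms, here a list of (term, probability) pairs. *)
let rec step = function
  | App (Lam (x, m), v) when is_value v -> [ (subst x v m, 1.0) ]
  | Choice (m, n) -> [ (m, 0.5); (n, 0.5) ]
  | App (m, n) when not (is_value m) ->
      List.map (fun (l, p) -> (App (l, n), p)) (step m)
  | App (v, m) (* v is a value, m is not *) ->
      List.map (fun (l, p) -> (App (v, l), p)) (step m)
  | _ -> []                                    (* values or stuck terms *)

(* eval n m: the subdistribution of values reachable from m within n
   steps; mass that has not converged yet is simply dropped. *)
let rec eval n m =
  if is_value m then [ (m, 1.0) ]
  else if n = 0 then []
  else
    List.concat_map
      (fun (l, p) -> List.map (fun (v, q) -> (v, p *. q)) (eval (n - 1) l))
      (step m)

(* Example: (λx.x) ⊕ (λx.λy.x), evaluated with a budget of 10 steps,
   yields each abstraction with probability 1/2. *)
let example = Choice (Lam ("x", Var "x"), Lam ("x", Lam ("y", Var "x")))
let _ = eval 10 example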
From Randomised Algorithms to Bayesian Programming

- Binary probabilistic choice is perfectly adequate to model randomised computation.

Theorem
The class of computable probabilistic functions coincides with the class of probabilistic functions computable by PCFN⊕.

- In recent years, starting from the pioneering work on languages like Church, Anglican, or Hansei, functional programs have also been employed as a means to represent probabilistic models rather than algorithms.
- The languages above can be modeled [Staton2017] as λ-calculi endowed with two new operators:
  - sample, modeling sampling from the uniform distribution on [0, 1].
  - score, which takes a positive real number r as a parameter and modifies the weight of the current probabilistic branch by multiplying it by r.
PCFsample,score: Terms, Typing Rules, and Reduction

Terms   M, N ::= · · · | sample | score(V)

Typing Rules

  ──────────────────── A
   Γ ⊢ sample : Num

   Γ ⊢ V : Num
  ──────────────────── C
   Γ ⊢ score(V) : Unit

- One needs to switch from distributions to measures, and to assume that the underlying set, namely R, has the structure of a measurable space.
- Adapting the rule for let-terms naturally leads to

       M → µ
  ─────────────────────────────────────
   let M = x in N → let µ = x in N

  where let µ = x in N should itself be a measure.
- The key rule in step-indexed reduction needs to be adapted:

   M → µ        ∀N ∈ SUPP(µ). N ⇒n σN
  ─────────────────────────────────────
   M ⇒n+1  A ↦ ∫ σN(A) µ(dN)
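To give a feel for how sample and score are used together, here is a tiny model written in the let-notation of the calculus (our own illustrative example; pdf stands for some primitive function on Num computing the density of an observed datum under the parameter x):

\[
\mathtt{let}\ \mathtt{sample} = x\ \mathtt{in}\
\mathtt{let}\ \mathit{pdf}(x) = w\ \mathtt{in}\
\mathtt{let}\ \mathtt{score}(w) = y\ \mathtt{in}\ x
\]

Operationally, x is drawn uniformly from [0, 1], the current branch is reweighted by pdf(x), and x is returned; values of x that explain the observation better thus carry more weight in the resulting measure.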
The Operational Meaning of Bayesian Terms

- It can well be that ⟨M⟩ = µ, where µ sums to something strictly greater than 1.
- How should we interpret µ(A) for a measurable set of terms A?
- We need to normalize µ!
- Explicitly building a normalized version of µ (if it exists) is the goal of so-called inference algorithms.
- The operational semantics we have just introduced, called distribution-based, is thus just an idealized form of semantics.
- An executable semantics can be given in the form of a sampling-based semantics.
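Concretely (our gloss, not from the slides): if µ is the measure over values computed for a closed term M of type τ, and its total mass Z = µ(CVτ) is finite and nonzero, the intended meaning of M is the normalized measure

\[
\widehat{\mu}(A) \;=\; \frac{\mu(A)}{\mu(\mathrm{CV}_\tau)}
\qquad \text{for every measurable } A \subseteq \mathrm{CV}_\tau ,
\]

and inference algorithms approximate this normalized measure without ever computing the normalizing constant Z exactly.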
Sampling-Based Semantics (1)

A configuration ⟨M, s⟩ pairs a term with a trace s of pre-drawn random numbers; each step carries a weight (the superscript on the arrow), which is 1 except for score.

  ⟨(λx.M)V, s⟩ →^1 ⟨M[V/x], s⟩                ⟨let V = x in M, s⟩ →^1 ⟨M[V/x], s⟩

  ⟨if 0 then M else N, s⟩ →^1 ⟨M, s⟩          ⟨if r then M else N, s⟩ →^1 ⟨N, s⟩   (if r ≠ 0)

  ⟨sample, r :: s⟩ →^1 ⟨r, s⟩                  ⟨score(r), s⟩ →^r ⟨⋆, s⟩

  ⟨f(r1, . . . , rn), s⟩ →^1 ⟨f∗(r1, . . . , rn), s⟩

   ⟨M, s⟩ →^r ⟨L, t⟩
  ─────────────────────────────────────────────
   ⟨let M = x in N, s⟩ →^r ⟨let L = x in N, t⟩
Sampling-Based Semantics (2)

Multi-step evaluation to a value, written ⇒V, multiplies the weights of the individual steps:

  ⟨V, s⟩ ⇒V^1 ⟨V, s⟩

   ⟨M, s⟩ →^r ⟨N, t⟩        ⟨N, t⟩ ⇒V^q ⟨L, u⟩
  ───────────────────────────────────────────────
   ⟨M, s⟩ ⇒V^{r·q} ⟨L, u⟩
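A minimal OCaml sketch (ours) of this sampling-based view, for a tiny first-order fragment: a run consumes a trace of pre-drawn uniform samples, returns a value together with the accumulated score, and repeating such weighted runs is the basis of importance-sampling inference. All names and the fragment itself are illustrative, not part of the calculus.

(* A weighted, trace-consuming evaluator for a tiny expression language
   with sample and score, in the spirit of the sampling-based semantics.
   Illustrative sketch only: the fragment is first-order and finite. *)

type expr =
  | Const of float
  | Sample                                   (* uniform draw from [0,1] *)
  | Score of expr                            (* multiply the run's weight *)
  | Let of string * expr * expr              (* let x = e1 in e2 *)
  | Var of string
  | Prim of (float -> float) * expr          (* unary primitive f*(r) *)

(* eval env trace e = (value, remaining trace, weight of this run) *)
let rec eval env trace e =
  match e with
  | Const r -> (r, trace, 1.0)
  | Var x -> (List.assoc x env, trace, 1.0)
  | Sample ->
      (match trace with
       | r :: rest -> (r, rest, 1.0)
       | [] -> failwith "trace exhausted")
  | Score e1 ->
      let r, rest, w = eval env trace e1 in
      (* ⟨score(r), s⟩ →^r ⟨⋆, s⟩; we return r itself as a dummy value. *)
      (r, rest, w *. r)
  | Prim (f, e1) ->
      let r, rest, w = eval env trace e1 in
      (f r, rest, w)
  | Let (x, e1, e2) ->
      let v1, rest1, w1 = eval env trace e1 in
      let v2, rest2, w2 = eval ((x, v1) :: env) rest1 e2 in
      (v2, rest2, w1 *. w2)

(* One weighted run with fresh uniform randomness, plus a crude
   self-normalized importance-sampling estimate of the posterior mean. *)
let run e =
  let trace = List.init 16 (fun _ -> Random.float 1.0) in
  let v, _, w = eval [] trace e in
  (v, w)

let posterior_mean n e =
  let samples = List.init n (fun _ -> run e) in
  let num = List.fold_left (fun a (v, w) -> a +. (v *. w)) 0.0 samples in
  let den = List.fold_left (fun a (_, w) -> a +. w) 0.0 samples in
  num /. den

(* Example: draw x uniformly, weight the run by x itself, return x.
   The self-normalized mean should be close to 2/3. *)
let model = Let ("x", Sample, Let ("_", Score (Var "x"), Var "x"))
let _ = posterior_mean 10_000 model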