Higher-Order Probabilistic Programming

A Tutorial at POPL 2019

Part II

Ugo Dal Lago

(Based on joint work with Flavien Breuvart, Raphaëlle Crubillé, Charles Grellois, Davide Sangiorgi, ...)

POPL 2019, Lisbon, January 14th

A Foundation for Higher-Order Probabilistic Programming

• We are interested in a better understanding of some crucial questions about higher-order probabilistic programs, e.g.:
  • How could we formalise and prove that programs have certain desirable properties, like being terminating or consuming a bounded amount of resources?
  • How could we prove programs to be equivalent, or more generally to be in a certain relation?
• We could in principle answer these questions directly in a programming language like OCaml (a tiny sketch follows this slide).
• It is methodologically much better to distill some paradigmatic calculi which expose all the essential features, but which are somehow agnostic to many unimportant details.
• We will introduce and study two such calculi:
  • PCF⊕, a calculus for randomized higher-order programming;
  • PCFsample,score, a calculus for Bayesian programming.
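To make the OCaml remark concrete, here is a minimal sketch (mine, not from the slides) of a randomized higher-order program; the operator ( <+> ) and the function geo are hypothetical names:

  (* Fair binary choice in the spirit of M ⊕ N, taking thunks so that
     only the chosen branch is evaluated. *)
  let ( <+> ) (m : unit -> 'a) (n : unit -> 'a) : 'a =
    if Random.bool () then m () else n ()

  (* geo x returns x + n with probability 2^-(n+1): it terminates with
     probability 1, yet admits no finite bound on its running time. *)
  let rec geo (x : int) : int =
    (fun () -> x) <+> (fun () -> geo (x + 1))

  let () =
    Random.self_init ();
    Printf.printf "%d\n" (geo 0)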


PCF⊕: Types, Terms, Values

Types    τ, ρ ::= Unit | Num | τ → ρ
Terms    M, N ::= V | V W | let M = x in N | M ⊕ N
                | if V then M else N | fⁿ(V₁, …, Vₙ)
Values   V, W ::= ⋆ | x | r | λx.M | fix x.V

PCF⊕: Type Assignment Rules

Value Typing Rules

  ──────────────── S     ──────────────────── V     ─────────────── R
  Γ ⊢ ⋆ : Unit           Γ, x : τ ⊢ x : τ           Γ ⊢ r : Num

  Γ, x : τ ⊢ M : ρ            Γ, x : τ → ρ ⊢ M : τ → ρ
  ────────────────── λ        ────────────────────────── X
  Γ ⊢ λx.M : τ → ρ            Γ ⊢ fix x.M : τ → ρ

Term Typing Rules

  Γ ⊢ V : τ → ρ    Γ ⊢ W : τ        Γ ⊢ M : τ    Γ, x : τ ⊢ N : ρ
  ──────────────────────────── @    ────────────────────────────── L
  Γ ⊢ V W : ρ                       Γ ⊢ let M = x in N : ρ

  Γ ⊢ M : τ    Γ ⊢ N : τ        Γ ⊢ V : Num    Γ ⊢ M : τ    Γ ⊢ N : τ
  ──────────────────────── ⊕    ──────────────────────────────────── I
  Γ ⊢ M ⊕ N : τ                 Γ ⊢ if V then M else N : τ

  Γ ⊢ V₁ : Num  ···  Γ ⊢ Vₙ : Num
  ──────────────────────────────── F
  Γ ⊢ fⁿ(V₁, …, Vₙ) : Num

• The closed terms of type τ form a set CT_τ.
• Similarly for closed values and CV_τ.


Distributions

• Given any set X, a distribution on X is a function D : X → [0, 1] such that D(x) > 0 only for denumerably many elements of X and Σ_{x∈X} D(x) ≤ 1.
• The support of a distribution D on X is the subset SUPP(D) of X defined as SUPP(D) := {x ∈ X | D(x) > 0}.
• The set of all distributions over X is indicated as D(X).
• We indicate the distribution assigning probability 1 to the element x ∈ X and 0 to any other element of X as δ(x).
• Given a distribution D on X, its sum Σ D is simply Σ_{x∈X} D(x).
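These definitions have a direct finite-support rendering in OCaml; the following sketch, with hypothetical names dirac, support and mass, is mine and not part of the slides:

  (* (Sub)distributions with finite support, as association lists
     pairing each element of the support with its probability. *)
  type 'a dist = ('a * float) list

  let dirac (x : 'a) : 'a dist = [ (x, 1.0) ]            (* δ(x) *)

  let support (d : 'a dist) : 'a list = List.map fst d   (* SUPP(D) *)

  let mass (d : 'a dist) : float =                       (* the sum Σ D *)
    List.fold_left (fun acc (_, p) -> acc +. p) 0.0 d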


PCF⊕: Reduction

One-Step Reduction

  (λx.M)V → δ(M[V/x])
  let V = x in M → δ(M[V/x])
  if 0 then M else N → δ(M)
  if r then M else N → δ(N)    (if r ≠ 0)
  M ⊕ N → {M : 1/2, N : 1/2}
  fⁿ(r₁, …, rₙ) → δ(f(r₁, …, rₙ))    (where f interprets the symbol fⁿ)

  M → {Lᵢ : pᵢ}_{i∈I}
  ──────────────────────────────────────────────
  let M = x in N → {let Lᵢ = x in N : pᵢ}_{i∈I}
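These rules can be read as a partial function from terms to distributions of terms. Here is an OCaml sketch for a fragment of PCF⊕ (β-reduction, ⊕ and let); the constructors and the naive substitution are my assumptions, not the slides':

  type term =
    | Var of string
    | Lam of string * term
    | App of term * term            (* V W *)
    | Choice of term * term         (* M ⊕ N *)
    | Let of term * string * term   (* let M = x in N *)

  type dist = (term * float) list

  (* Naive substitution M[V/x]; not capture-avoiding, enough for a sketch. *)
  let rec subst m x v =
    match m with
    | Var y -> if y = x then v else m
    | Lam (y, b) -> if y = x then m else Lam (y, subst b x v)
    | App (a, b) -> App (subst a x v, subst b x v)
    | Choice (a, b) -> Choice (subst a x v, subst b x v)
    | Let (a, y, b) -> Let (subst a x v, y, if y = x then b else subst b x v)

  let is_value = function Var _ | Lam _ -> true | _ -> false

  (* step m = Some d when m → d; None when m does not reduce (a value). *)
  let rec step (m : term) : dist option =
    match m with
    | Choice (a, b) -> Some [ (a, 0.5); (b, 0.5) ]     (* M ⊕ N *)
    | App (Lam (x, b), v) when is_value v ->
        Some [ (subst b x v, 1.0) ]                    (* (λx.M)V *)
    | Let (v, x, n) when is_value v ->
        Some [ (subst n x v, 1.0) ]                    (* let V = x in M *)
    | Let (a, x, n) ->                                 (* context rule *)
        (match step a with
         | Some d -> Some (List.map (fun (l, p) -> (Let (l, x, n), p)) d)
         | None -> None)
    | _ -> None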

Step-Indexed Reduction

  M ⇒₀ ∅        V ⇒₁ δ(V)

  M → D    ∀N ∈ SUPP(D). N ⇒ₙ E_N
  ──────────────────────────────────────
  M ⇒ₙ₊₁ Σ_{N∈SUPP(D)} D(N) · E_N
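Under this reading, ⟨M⟩ₙ collects the values reached in exactly n steps, and step-indexed reduction becomes a recursive function. This sketch assumes the term, dist and step definitions from the previous sketch:

  (* eval_n n m computes the distribution D with m ⇒ₙ D. *)
  let rec eval_n (n : int) (m : term) : dist =
    if n = 0 then []                                   (* M ⇒₀ ∅ *)
    else
      match step m with
      | None -> if n = 1 then [ (m, 1.0) ] else []     (* V ⇒₁ δ(V) *)
      | Some d ->
          (* M ⇒ₙ₊₁ Σ_N D(N) · E_N, with each E_N obtained recursively *)
          List.concat_map
            (fun (l, p) ->
               List.map (fun (v, q) -> (v, p *. q)) (eval_n (n - 1) l))
            d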

PCF⊕: Some Easy Meta-Theorems

Lemma
If M ⇒ₙ D, then SUPP(D) is a finite set.

Proposition (Progress)
For every M ∈ CT_τ, either M is a value or there is D with M → D.

Proposition (Subject Reduction)
For every M ∈ CT_τ and for every n ∈ ℕ, if M → D and M ⇒ₙ E, then D ∈ D(CT_τ) and E ∈ D(CV_τ).

Corollary
For every M ∈ CT_τ and for every n ∈ ℕ, there is exactly one distribution Dₙ such that M ⇒ₙ Dₙ. We will write ⟨M⟩ₙ for such a distribution.


PCF⊕: The Operational Semantics of a Term

• Given two distributions D, E ∈ D(X), we write D ≤ E iff D(x) ≤ E(x) for every x ∈ X. This relation endows D(X) with the structure of a partial order, which is actually an ωCPO.
• Given a closed term M ∈ CT_τ, the operational semantics of M is defined to be the distribution ⟨M⟩ ∈ D(CV_τ) defined as Σ_{n∈ℕ} ⟨M⟩ₙ, the lub of its partial sums in the ωCPO above (see the approximation sketch after this slide).

Term Contexts   C_T, D_T ::= C_V | [·] | C_V V | V C_V | let C_T = x in N
                           | let M = x in C_T | C_T ⊕ D_T | if V then C_T else D_T
Value Contexts  C_V, D_V ::= λx.C_T | fix x.C_V

• Given two terms M, N such that Γ ⊢ M : τ and Γ ⊢ N : τ, we say that M and N are (Γ, τ)-equivalent, and we write M ≡_Γ^τ N, iff whenever ∅ ⊢ C[Γ ⊢ · : τ] : Unit, it holds that Σ⟨C[M]⟩ = Σ⟨C[N]⟩.
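Continuing the sketches above, the sum Σ_{n∈ℕ} ⟨M⟩ₙ can be approximated by truncating it at a fuel bound; eval_upto is a hypothetical name of mine, and eval_n is the function from the previous sketch:

  (* Approximates ⟨M⟩ by Σ_{n ≤ fuel} ⟨M⟩ₙ; its mass grows towards Σ⟨M⟩
     as fuel increases. *)
  let eval_upto (fuel : int) (m : term) : dist =
    List.concat (List.init (fuel + 1) (fun n -> eval_n n m))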


PCF⊕: From Equivalences to Metrics?

• The definition of contextual equivalence asks that Σ⟨C[M]⟩ = Σ⟨C[N]⟩ for every context C.
• But what if Σ⟨C[M]⟩ and Σ⟨C[N]⟩ are very close, without being really equal to each other?
• It makes sense to generalize contextual equivalence to a notion of distance:

  δ_{Γ,τ}(M, N) = sup_{∅ ⊢ C[Γ ⊢ · : τ] : Unit} | Σ⟨C[M]⟩ − Σ⟨C[N]⟩ |

• For every Γ, τ, the function δ_{Γ,τ} is indeed a pseudometric:

  δ_{Γ,τ}(M, M) = 0
  δ_{Γ,τ}(M, N) = δ_{Γ,τ}(N, M)
  δ_{Γ,τ}(M, L) ≤ δ_{Γ,τ}(M, N) + δ_{Γ,τ}(N, L)
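A quick worked instance (mine, not on the slides): take M = 0 and N = 0 ⊕ Ω, where Ω is a diverging term of type Num, e.g. Ω = (fix f.λx.f x) 0. For the context C = let [·] = x in ⋆ we get Σ⟨C[M]⟩ = 1 while Σ⟨C[N]⟩ = 1/2, so δ_{Γ,Num}(M, N) ≥ |1 − 1/2| = 1/2: the metric quantifies how far the two terms are from being contextually equivalent.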

Termination in a Probabilistic Setting

• Let M be any closed term. We say that M is almost-surely terminating if Σ⟨M⟩ = 1, namely if its probability of convergence is 1.
• Example:

  GEO := (fix f.λx. x ⊕ (let succ¹(x) = y in f y)) 0

• The expected evaluation length of any closed term M is defined as follows:

  ExLen(M) := Σ_{m=0}^{∞} ( 1 − Σ_{n=0}^{m} Σ⟨M⟩ₙ )

  Let M be any closed term. We say that M is positively almost-surely terminating if ExLen(M) < +∞.

Lemma
Every positively almost-surely terminating term is almost-surely terminating.
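To see both notions at work on GEO (a calculation of mine, not on the slides): GEO evaluates to the numeral n with probability 2^{−(n+1)}, so Σ⟨GEO⟩ = Σ_{n≥0} 2^{−(n+1)} = 1 and GEO is almost-surely terminating. Moreover, each recursive call costs a constant number of reduction steps, so ExLen(GEO) is proportional to the expected number of coin flips, Σ_{n≥0} (n+1) · 2^{−(n+1)} = 2; being finite, GEO is in fact positively almost-surely terminating.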


Variations on PCF⊕

• Pure and untyped, rather than applied and typed:
  • Terms:  M ::= x | λx.M | M M | M ⊕ M
  • Values: V ::= λx.M
  • One-Step Reduction:

    (λx.M)V → δ(M[V/x])        M ⊕ N → {M : 1/2, N : 1/2}

    M → {Lᵢ : pᵢ}_{i∈I}                M → {Lᵢ : pᵢ}_{i∈I}
    ──────────────────────────         ──────────────────────────
    M N → {Lᵢ N : pᵢ}_{i∈I}            V M → {V Lᵢ : pᵢ}_{i∈I}

  • The obtained calculus will be referred to as Λ⊕.
• CBN rather than CBV (the contrast is illustrated in the sketch after this slide):
  • One-Step Reduction:

    (λx.M)N → δ(M[N/x])        M ⊕ N → {M : 1/2, N : 1/2}

    M → {Lᵢ : pᵢ}_{i∈I}
    ──────────────────────────
    M N → {Lᵢ N : pᵢ}_{i∈I}
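The CBV/CBN distinction matters more in a probabilistic calculus than in a deterministic one, because it decides whether a probabilistic argument is sampled once or at every use. A small OCaml sketch of the contrast (mine, not from the slides; thunks simulate CBN):

  (* A probabilistic argument: the term 0 ⊕ 1. *)
  let coin () = if Random.bool () then 0 else 1

  (* CBV: the argument is evaluated once, so both uses agree. *)
  let cbv_pair () : int * int =
    let x = coin () in
    (x, x)                  (* always (0, 0) or (1, 1) *)

  (* CBN: the (thunked) argument is re-evaluated at each use. *)
  let cbn_pair () : int * int =
    let x = coin in
    (x (), x ())            (* the two components may differ *)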


From Randomised Algorithms to Bayesian Programming

• Binary probabilistic choice is perfectly adequate to model randomised computation.

Theorem
The class of computable probabilistic functions coincides with the class of probabilistic functions computable by PCF⊕.

• In recent years, starting from the pioneering works on languages like CHURCH, ANGLICAN, or HANSEI, functional programs have also been employed as a means to represent probabilistic models rather than algorithms.
• The languages above can be modeled [Staton2017] as λ-calculi endowed with two new operators:
  • sample, modeling sampling from the uniform distribution on [0, 1];
  • score, which takes a positive real number r as a parameter, and modifies the weight of the current probabilistic branch by multiplying it by r.


PCFsample,score: Terms, Typing Rules, and Reduction

Terms   M, N ::= · · · | sample | score(V)

Typing Rules

  ──────────────────── A
  Γ ⊢ sample : Num

  Γ ⊢ V : Num
  ───────────────────── C
  Γ ⊢ score(V) : Unit

• One needs to switch from distributions to measures, and to assume the underlying set, namely ℝ, to have the structure of a measurable space.
• Adapting the rule for let-terms naturally leads to

  M → µ
  ──────────────────────────────────────
  let M = x in N → let µ = x in N

  where let µ = x in N should itself be a measure.
• The key rule in step-indexed reduction needs to be adapted (a worked example follows this slide):

  M → µ    ∀N ∈ SUPP(µ). N ⇒ₙ σ_N
  ─────────────────────────────────
  M ⇒ₙ₊₁ A ↦ ∫ σ_N(A) µ(dN)
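As a small worked example (mine, not from the slides), consider the term SQ := let sample = x in (let score(x) = y in x): it draws x uniformly from [0, 1] and then multiplies the weight of the current branch by x itself. Its distribution-based semantics is the measure ⟨SQ⟩(A) = ∫_A r dr on the numerals in [0, 1].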


The Operational Meaning of Bayesian Terms

• It can well be that ⟨M⟩ = µ, where µ sums to something strictly greater than 1.
• How should we interpret µ(A) for a measurable set of terms A?
• We need to normalize µ!
• Explicitly building a normalized version of µ (if it exists) is the goal of so-called inference algorithms.
• The operational semantics we have just introduced, called distribution-based, is thus just an idealized form of semantics.
• An executable semantics can be given in the form of a sampling-based semantics.
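Continuing the SQ example from above (again mine, not from the slides): its total mass is ∫₀¹ r dr = 1/2 ≠ 1, so normalization rescales the measure by 2, yielding the probability density 2r on [0, 1]. Producing, or approximating, exactly this normalized object is what an inference algorithm is asked to do.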


Sampling-Based Semantics (1)

Configurations are pairs ⟨M, s⟩ of a term and a trace s of prerecorded samples; each step ⟨M, s⟩ ⇝ʷ ⟨N, t⟩ carries a weight w:

  ⟨(λx.M)V, s⟩ ⇝¹ ⟨M[V/x], s⟩
  ⟨let V = x in M, s⟩ ⇝¹ ⟨M[V/x], s⟩
  ⟨if 0 then M else N, s⟩ ⇝¹ ⟨M, s⟩
  ⟨if r then M else N, s⟩ ⇝¹ ⟨N, s⟩    (if r ≠ 0)
  ⟨sample, r :: s⟩ ⇝¹ ⟨r, s⟩
  ⟨score(r), s⟩ ⇝ʳ ⟨⋆, s⟩
  ⟨fⁿ(r₁, …, rₙ), s⟩ ⇝¹ ⟨f(r₁, …, rₙ), s⟩

  ⟨M, s⟩ ⇝ʳ ⟨L, t⟩
  ─────────────────────────────────────────────
  ⟨let M = x in N, s⟩ ⇝ʳ ⟨let L = x in N, t⟩

Sampling-Based Semantics (2)

Multistep evaluation to a value, written ⇝_V, accumulates weights by multiplication:

  ⟨V, s⟩ ⇝_V¹ ⟨V, s⟩

  ⟨M, s⟩ ⇝ʳ ⟨N, t⟩    ⟨N, t⟩ ⇝_V^q ⟨L, u⟩
  ──────────────────────────────────────────
  ⟨M, s⟩ ⇝_V^{r·q} ⟨L, u⟩
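The two sampling-based judgments have a direct functional reading: evaluation consumes a prerecorded trace and accumulates a weight. A self-contained OCaml sketch (mine, not from the slides; the constructors cover only sample, score and let):

  type term =
    | Real of float
    | Unit_                          (* the value ⋆ *)
    | Var of string
    | Sample                         (* sample *)
    | Score of term                  (* score(V) *)
    | Let of term * string * term    (* let M = x in N *)

  (* Naive substitution, enough for this fragment. *)
  let rec subst m x v =
    match m with
    | Var y -> if y = x then v else m
    | Score a -> Score (subst a x v)
    | Let (a, y, b) -> Let (subst a x v, y, if y = x then b else subst b x v)
    | _ -> m

  (* eval m s = (value, remaining trace, weight), as in ⟨M, s⟩ ⇝_V^w ⟨V, t⟩. *)
  let rec eval (m : term) (s : float list) : term * float list * float =
    match m with
    | Real _ | Unit_ | Var _ -> (m, s, 1.0)      (* ⟨V, s⟩ ⇝_V¹ ⟨V, s⟩ *)
    | Sample ->
        (match s with
         | r :: t -> (Real r, t, 1.0)            (* ⟨sample, r::s⟩ ⇝¹ ⟨r, s⟩ *)
         | [] -> failwith "trace exhausted")
    | Score (Real r) -> (Unit_, s, r)            (* ⟨score(r), s⟩ ⇝ʳ ⟨⋆, s⟩ *)
    | Score _ -> failwith "score expects a numeral"
    | Let (a, x, n) ->
        let (v, t, w1) = eval a s in
        let (u, t', w2) = eval (subst n x v) t in
        (u, t', w1 *. w2)                        (* weights multiply *)

For instance, on the SQ term from earlier, encoded as Let (Sample, "x", Let (Score (Var "x"), "y", Var "x")), the call eval with trace [0.7] returns (Real 0.7, [], 0.7): the result 0.7, carrying weight 0.7.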


Thank You!

Questions?
