EXISTENCE OF PROPER REGULAR CONDITIONAL DISTRIBUTIONS

(1)

EXISTENCE OF PROPER REGULAR CONDITIONAL DISTRIBUTIONS

Pietro Rigo

University of Pavia

Bologna, may 16, 2018

(2)

Classical (Kolmogorovian) conditional probabilities Let (Ω, A, P ) be a probability space and G ⊂ A a sub-σ-field.

A regular conditional distribution (rcd) is a map Q on Ω × A such that

(i) Q(ω, ·) is a probability on A for ω ∈ Ω (ii) Q(·, A) is G-measurable for A ∈ A

(iii) P (A ∩ B) =

^R_B

Q(ω, A) P (dω) for A ∈ A and B ∈ G

An rcd can fail to exist. However, it exists under mild conditions and

is a.s. unique if A is countably generated.

(3)

In the standard framework, thus, conditioning is with respect to a σ-field G and not with respect to an event H.

What does it mean ?

According to the usual interpretation, it means: For each B ∈ G, we now whether B is true or false. This naive interpretation is very dangerous.

Example 1 Let X = {X

_t

: t ≥ 0} be a process adapted to a filtration F = {F

_t

: t ≥ 0}. Suppose P (X = x) = 0 for each path x and

{A ∈ A : P (A) = 0} ⊂ F

₀

. In this case,

{X = x} ∈ F

₀

for each path x.

But then we can stop. We already know the X-path at time 0 !

(4)

Example 2 (Borel-Kolmogorov paradox) Suppose {X = x} = {Y = y}

for some random variables X and Y . Let Q

_X

and Q

_Y

be rcd’s given σ(X) and σ(Y ). Then,

P (· |X = x) = Q

_X

(ω, ·) and P (· |Y = y) = Q

_Y

(ω, ·)

where ω ∈ Ω meets X(ω) = x and Y (ω) = y. Hence it may be that P (· |X = x) 6= P (· |Y = y) even if {X = x} = {Y = y}.

Example 3 For the naive interpretation to make sense, Q should be proper, i.e.

Q(ω, ·) = δ

_ω

on G for almost all ω .

But Q needs not be proper. In fact, properness of Q essentially

(5)

Conditional 0-1 laws An rcd Q is 0-1 on G if

Q(ω, ·) ∈ {0, 1} on G for almost all ω Why to focus on such a 0-1 law ?

• It is a (natural) consequence of properness

• It is equivalent to

A independent G, under Q(ω, ·), for almost all ω

• It is basic for integral representation of invariant measures

• It is not granted. It typically fails if {A ∈ A : P (A) = 0} ⊂ G

(6)

Theorem 1

Let G

_n

⊂ A be a sub-σ-field and Q

_n

an rcd given G

_n

. The rcd Q is 0-1 on G if

• The ”big” σ-field A is countably generated

• Q

_n

is 0-1 on G

_n

for each n and G ⊂ lim sup

_n

G

_n

• E(1

_A

|G

_n

) → E(1

_A

|G) a.s. for each A ∈ A

Note that, by martingale convergence, the last condition is automat-

ically true if the sequence G

_n

is monotonic

(7)

Examples

Let S be a Polish space and Ω = S

^∞

. Theorem 1 applies to Tail σ-field: G = ∩

_n

σ(X

_n

, X

_n+1

, . . .)

where X

_n

is a sequence of real random variables Symmetric σ-field:

G = {B ∈ A : B = f

⁻¹

(B) for each finite permutation f } Thus,

Theorem 1 ⇒ de Finetti’s theorem

Open problem: Theorem 1 does not apply to the shift-invariant σ-field:

G = {B ∈ A : B = s

⁻¹

(B)}

where s(x

₁

, x

₂

, . . .) = (x

₂

, x

₃

, . . .) is the shift

(8)

Disintegrability

Let Π ⊂ A be a partition of Ω. P is disintegrable on Π if P (A) =

^R_Π

P (A|H) P

^∗

(dH)

for each A ∈ A, where

• P (·|H) is a probability on A such that P (H|H) = 1

• P

^∗

is a probability on a suitable σ-field of subsets of Π

(9)

Theorem 2 Given a partition Π of Ω, let

G = {(x, y) ∈ Ω × Ω : x ∼ y}.

Then, P is disintegrable on Π whenever

• (Ω, A) is nice (e.g. a standard space)

• G is a Borel subset of Ω × Ω

Remark: G is actually a Borel set if Π is the partition in the atoms of the tail, or the symmetric, or the shift invariant σ-fields

Remark: The condition on G can be relaxed (e.g., G coanalytic)

(10)

Coherent (de Finettian) conditional probabilities A different notion, introduced by de Finetti, is as follows.

Let

P (·|·) : A × G → R.

For all n ≥ 1, c

₁

, . . . , c

_n

∈ R, A

₁

, . . . , A

_n

∈ A and B

₁

, . . . , B

_n

∈ G \ ∅, define

G(ω) =

^Pⁿ_i=1

c

_i

1

_B

i

(ω) {1

_A

i

(ω) − P (A

_i

|B

_i

)}.

Then, P (·|·) is coherent if

sup

_ω∈B

G(ω) ≥ 0 where B = ∪

ⁿ_i=1

B

_i

.

(11)

Such a definition has both merits and drawbacks. In particular, con- trary to the classical case:

• The conditioning is now with respect to events,

• P (B|B) = 1,

• For fixed B, P (·|B) is ”only” a finitely additive probability,

• Disintegrability on Π is not granted, where Π is the partition of

Ω in the atoms of G

(12)

Bayesian inference (X , E) sample space, (Θ, F ) parameter space, {P

_θ

: θ ∈ Θ} statistical model,

A prior is a probability π on F . A posterior for π is any collection Q = {Q

_x

: x ∈ X } such that

• Q

_x

is a probability on F for each x ∈ X

•

^R_A

Q

_x

(B) m(dx) =

^R_B

P

_θ

(A) π(dθ)

for all A ∈ E B ∈ F and for some (possibly finitely additive)

probability m on subsets of X

(13)

Theorem 3

Fix a measurable function T on X (a statistic) such that P

_θ

(T = t) = 0 for all θ and t.

Under mild conditions, for any prior π, there is a posterior Q for π such that

T (x) = T (y) ⇒ Q

_x

= Q

_y

Interpretation:

The above condition means that T is sufficient for Q. Suppose you

start with a prior π, describing your feelings on θ, and a statistic T ,

describing how different samples affect your inference on θ. Theorem

3 states that, whatever π and T (with P

_θ

(T = t) = 0) there is a

posterior Q for π which makes T sufficient.

(14)

Point estimation

The ideas underlying Theorem 3 yield further results. Suppose Θ ⊂ R and d : X → Θ is an estimate of θ.

Theorem 4

Under mild conditions, if the prior π is null on compacta, there is a posterior Q for π such that

^R

θ

²

Q

_x

(dθ) < ∞ and

E

_Q

(θ|x) =

^R

θ Q

_x

(dθ) = d(x) Interpretation:

The above condition means that d is optimal under square error loss.

Suppose you start with a measurable map d : X → Θ, to be regarded

as your estimate of θ. Theorem 4 states that, if the prior π vanishes

(15)

Compatibility

Let X = (X

₁

, . . . , X

_k

) be a k-dimensional random vector and X

_−i

= (X

₁

, . . . , X

_i−1

, X

_i+1

, . . . , X

_k

)

To assess the distribution of X, assign the kernels Q

₁

, . . . , Q

_k

, where each Q

_i

is only requested to satisfy

Q

_i

(x, ·) is a probability for fixed x and

the map x 7→ Q

_i

(x, A) is measurable for fixed A

The kernels Q

₁

, . . . , Q

_k

are compatible if there is a Borel probability µ on R

^k

such that

P

_µ

(X

_i

∈ · |X

_−i

= x) = Q

_i

(x, ·) for all i and µ-almost all x.

Such a µ, if exists, should be regarded as the distribution of X

(16)

Example: Let k = 2 and Q

₁

(x, ·) = Q

₂

(x, ·) = N (x, 1)

This looks reasonable in a number of problems. Nevertheless, Q

₁

and Q

₂

are not compatible, i.e., no Borel probability on R

²

admits Q

₁

and Q

₂

as conditional distributions

Compatibility issues arise in: spatial statistics, statistical me- chanics, Bayesian image analysis, multiple data imputation and Gibbs sampling

Another example are improper priors. Given the statistical model

{P

_θ

: θ ∈ Θ}, let Q = {Q

_x

: x ∈ X } be the ”formal posterior” of an

improper prior γ (i.e., γ(Θ) = ∞). Strictly speaking, Q makes sense

only if compatible with the statistical model. In that case, Q agrees

with the posterior of some (proper) prior

(17)

For x ∈ R

^k

and f ∈ C

_b

(R

^k

), let

E(f |X

_−i

= x

_−i

) =

^R

f (x

₁

, . . . , x

_i−1

, t, x

_i+1

, . . . , x

_k

) Q

_i

(x

_−i

, dt) Theorem 5

Suppose there is a compact set A

_i

such that Q

_i

(x, A

_i

) = 1 for all x ∈ R

^k−1

.

Letting A = A

₁

× . . . × A

_k

, suppose also that

x 7→ E(f |X

_−i

= x

_−i

) is continuous on A for each f ∈ C(A) Then, Q

₁

, . . . , Q

_k

are compatible if and only if

sup

_x∈A ^P^k−1_i=1

{E(f

_i

|X

_−i

= x

_−i

) − E(f

_i

|X

_−k

= x

_−k

)} ≥ 0

for all f

₁

, . . . , f

_k−1

∈ C(A)

(18)

For each i, fix a (σ-finite) measure λ

_i

and suppose that Q

_i

(x, dy) = f

_i

(x, y) λ

_i

(dy) for all x ∈ R

^k−1

Let λ = λ

₁

× . . . × λ

_k

be the product measure Theorem 6

Suppose f

_i

> 0 for all i. Then, Q

₁

, . . . , Q

_k

are compatible if and only if there are positive Borel functions u

₁

, . . . , u

_k

on R

^k−1

such that

f

_i

(x

_i

|x

_−i

) = f

_k

(x

_k

|x

_−k

) u

_i

(x

_−i

) u

_k

(x

_−k

),

for all i < k and λ-almost all x ∈ R

^k

, and

R

u

_k

dλ

_−k

= 1

Remark: The assumption f

_i

> 0 can be dropped at the price of a

(19)

An asymptotic result

Let S be a Polish space, (X

_n

) an exchangeable sequence of S-valued random variables, and

µ

_n

= (1/n)

^Pⁿ_i=1

δ

_X

i

empirical measure

a

_n

( ·) = P (X

_n+1

∈ · |X

₁

, . . . , X

_n

) predictive measure

Often, a

_n

can not be evaluated in closed form and µ

_n

is a reasonable

”estimate” of a

_n

. Here, we focus on the error d(µ

_n

, a

_n

)

where d is a distance between probability measures. For instance, if d(µ

_n

, a

_n

) → 0 in some sense

then µ

_n

is a consistent estimate of a

_n

(20)