
(1)

Multi-armed bandits

under uncertainty aversion

Tanut Treetanthiploet

Mathematical Institute, University of Oxford

Based on joint work with Prof. Samuel N. Cohen

12th European Summer School in Financial Mathematics, Padova, September 2019


(4)

Toy Example

Gambling is a way of buying hope on credit. – Alan Wykes

Suppose there are two biased coins.

You will gain £1 for each head.

1st coin: 3 tosses with 2 heads.

2nd coin: 3000 tosses with 2000 heads.

We are biased toward a less uncertain choice.

What shall we do if we need to repeat for a million tosses?
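To make the asymmetry concrete, here is a minimal sketch (assuming a uniform Beta(1,1) prior, which is not stated on the slide) comparing the posterior mean and posterior standard deviation of the two coins.

```python
from math import sqrt

def posterior_summary(heads, tosses, a0=1.0, b0=1.0):
    """Posterior mean and s.d. of the head probability under a Beta(a0, b0) prior."""
    a, b = a0 + heads, b0 + (tosses - heads)
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

for name, heads, tosses in [("1st coin", 2, 3), ("2nd coin", 2000, 3000)]:
    mean, sd = posterior_summary(heads, tosses)
    print(f"{name}: posterior mean {mean:.3f}, posterior s.d. {sd:.3f}")
# The point estimates are similar, but the 1st coin's estimate is far more uncertain.
```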


(6)

Multi-armed bandits problem

Gambling is a way of buying hope on credit. – Alan Wykes

● Suppose there are M slot machines.

● One machine can be played at a time.

● Each machine may have its own state.

● The machines are independent.

● Playing a machine generates a cost and its state may evolve.

                        Distribution    Evolving state
Risky bandit            Known           Yes
Stationary bandit       Unknown         No
Non-stationary bandit   Unknown         Yes

(7)

Gittins’ index theorem as a risk model

An optimist is a guy that has never had much experience. –Don Marquis

Consider a risky bandit problem with known distribution.

▸ Costs $(h^{(m)}(t))$ are not IID.

▸ Objective: Minimise $\mathbb{E}\big(\sum_{n\ge 0}\beta^n h^{(\rho_n)}(t_n^{\rho})\big)$.

There exist indices associated to each machine, which can be evaluated independently, such that the optimal policy is to play at each epoch a machine of the lowest index $\gamma^{(m)}$. (Gittins, 1979)

By modeling costs $h^{(m)}(t)$ under a Bayesian perspective, we can show that
$$\gamma^{(m)} \approx \bar{x}^{(m)} - \sigma^{(m)}\,\psi\Big(\frac{1}{(n^{(m)}+1)(1-\beta)}\Big) \qquad \text{(Brezzi and Lai, 2002)}$$
where $\bar{x}^{(m)}$ and $\sigma^{(m)}$ are the posterior mean and standard deviation, and $\psi$ is positive and nondecreasing.
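As a rough illustration of how such an index trades the posterior mean against remaining uncertainty, here is a minimal sketch; the function `psi` below is a hypothetical placeholder for a positive nondecreasing function, not the actual approximation tabulated by Brezzi and Lai.

```python
import math

def psi(s):
    # Hypothetical placeholder: any positive, nondecreasing function serves for
    # illustration; Brezzi and Lai (2002) give a specific approximation instead.
    return math.sqrt(s)

def approximate_index(x_bar, sigma, n, beta):
    """Approximate cost index: posterior mean minus a learning premium."""
    return x_bar - sigma * psi(1.0 / ((n + 1) * (1.0 - beta)))

# Same posterior mean cost, but very different amounts of data:
print(approximate_index(x_bar=0.4, sigma=0.30, n=3, beta=0.95))     # poorly explored arm
print(approximate_index(x_bar=0.4, sigma=0.01, n=3000, beta=0.95))  # well explored arm
# The poorly explored arm receives the lower index, so it is played first when
# minimising cost: exploration is encouraged through the learning premium.
```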


(9)

Optimistic Analogy for Reinforcement learning

An optimist is a guy that has never had much experience. –Don Marquis

For the stationary bandit problem,

▸ The $m$-th machine generates IID costs $(h^{(m)}(t))_{t\in\mathbb{N}}$ with mean $\mu^{(m)}$.

$$\rho = \arg\min_m \Big( \text{Est. of } \mu^{(m)} - \underbrace{\text{Learning Premium}}_{\text{decreases in } n^{(m)}} \Big).$$

i.e. We have less learning reward when we are more certain about our estimator.

(10)

Uncertainty Aversion

If you expect the worst, you’ll never be disappointed – Sarah Dessen

Suppose we will only play once.

This is equivalent to setting β = 0.

Gittins’ objective: Minimise $\mathbb{E}\big(h^{(\rho_0)}(t_0^{\rho})\big) = \bar{x}^{(\rho_0)}$.

No accounting for uncertainty!

Including uncertainty aversion, we expect to have
$$\gamma^{(m)} = \text{Est. of } \mu^{(m)} + \underbrace{\big( -\text{Learning premium} + \text{Uncertainty aversion} \big)}_{\text{Uncertainty valuation}}.$$


(13)

Time-consistent Nonlinear Expectation

If you expect the worst, you’ll never be disappointed. – Sarah Dessen

Classical control problem under uncertainty aversion:
$$\inf_{\rho}\,\sup_{Q}\, \mathbb{E}_Q\Big(\sum_{n=1}^{L} \beta^n h^{(\rho_n)}(t_n^{\rho})\Big) =: \inf_{\rho}\, \mathcal{E}\Big(\sum_{n=1}^{L} \beta^n h^{(\rho_n)}(t_n^{\rho})\Big).$$

We say a system of operators $\mathcal{E}(\,\cdot\mid\mathcal{F}_t): L(P,\mathcal{F}_T) \to L(P,\mathcal{F}_t)$, $t = 0, 1, \ldots, T$, is an $(\mathcal{F}_t)$-consistent coherent nonlinear expectation if it satisfies strict monotonicity, positive homogeneity, subadditivity and the Lebesgue property (lower semi-continuity), and

● $(\mathcal{F}_t)$-consistency: for $t \le t' \le T$, $\mathcal{E}(\mathcal{E}(X\mid\mathcal{F}_{t'})\mid\mathcal{F}_t) = \mathcal{E}(X\mid\mathcal{F}_t)$.

The filtration $(\mathcal{F}_t)$ must be identified in advance.


(16)

Our information structures

If you expect the worst, you’ll never be disappointed. – Sarah Dessen

We have $M$ bandits, each with a filtered space $(\Omega^{(m)}, (\mathcal{F}_t^{(m)}), P^{(m)})$ and a consistent coherent nonlinear expectation
$$\mathcal{E}^{(m)}(X \mid \mathcal{F}_t^{(m)}) = \operatorname*{ess\,sup}_{Q \in \mathcal{Q}^{(m)}} \mathbb{E}_Q(X \mid \mathcal{F}_t^{(m)})$$
(see Föllmer & Schied, 2016).

▸ Define the orthant space by $\bar{\Omega} = \bigotimes_m \Omega^{(m)}$, similarly $\bar{P}$, and $\bar{\mathcal{F}}(s) = \bigotimes_m \mathcal{F}^{(m)}(s^{(m)})$ for $s = (s^{(1)}, s^{(2)}, \ldots, s^{(M)})$.

▸ Define the orthant nonlinear expectation by
$$\mathcal{E}_s(Y) = \operatorname*{ess\,sup}_{Q \in \bar{\mathcal{Q}}} \mathbb{E}_Q(Y \mid \bar{\mathcal{F}}(s)), \qquad \bar{\mathcal{Q}} = \Big\{Q = \bigotimes_m Q^{(m)} : Q^{(m)} \in \mathcal{Q}^{(m)}\Big\}.$$
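As a toy illustration (two machines, two priors each, made-up numbers, and a one-shot payoff rather than the paper's construction), the worst case over the rectangular product set $\bar{\mathcal{Q}}$ can be computed by enumerating one prior per machine:

```python
from itertools import product

# Hypothetical finite prior sets for two machines: each prior is a distribution
# over that machine's next cost, encoded as {cost: probability}.
Q1 = [{0.0: 0.5, 1.0: 0.5}, {0.0: 0.3, 1.0: 0.7}]
Q2 = [{0.0: 0.6, 1.0: 0.4}, {0.0: 0.4, 1.0: 0.6}]

def expectation(q1, q2, payoff):
    """E_Q of payoff(h1, h2) under the product measure q1 x q2."""
    return sum(p1 * p2 * payoff(h1, h2)
               for h1, p1 in q1.items() for h2, p2 in q2.items())

def orthant_expectation(payoff):
    """Worst case over the rectangular set of product measures Q1 x Q2."""
    return max(expectation(q1, q2, payoff) for q1, q2 in product(Q1, Q2))

# Worst-case expected total cost of playing machine 1 and then machine 2:
print(orthant_expectation(lambda h1, h2: h1 + h2))
```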

(17)

Time consistency

When the Facts Change, I Change My Mind. What Do You Do, Sir?–Keynes

One may want to minimise
$$\mathcal{E}_0\Big(\sum_{n=1}^{L} \beta^n h^{(\rho_n)}(t_n^{\rho})\Big) =: \mathcal{E}_0(H^{\rho}),$$
but $\mathcal{E}_s$ does not satisfy time-consistency.

No DPP ⇒ Curse of dimensionality.

Inconsistency in decision making. i.e. ‘Optimal’ strategies may not be followed in the future.

Using an indifference valuation perspective can be helpful.
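To see why $\mathcal{E}_s$ can fail the tower property, here is a minimal two-period sketch with made-up numbers, unrelated to the bandit model: the one-shot worst case over two joint priors differs from the composition of stepwise worst cases, which effectively enlarges the prior set to its rectangular hull.

```python
from itertools import product

# Two joint priors on (X1, X2) in {0,1}^2, each given by P(X1 = 1) and P(X2 = 1 | X1).
priors = [
    {"p_x1": 0.5, "p_x2_given": {0: 0.9, 1: 0.1}},
    {"p_x1": 0.5, "p_x2_given": {0: 0.1, 1: 0.9}},
]

def payoff(x1, x2):
    return x2  # the cost to be evaluated

def joint_expectation(q):
    """E_Q[payoff(X1, X2)] under a single joint prior q."""
    total = 0.0
    for x1, x2 in product((0, 1), repeat=2):
        p1 = q["p_x1"] if x1 else 1 - q["p_x1"]
        p2 = q["p_x2_given"][x1] if x2 else 1 - q["p_x2_given"][x1]
        total += p1 * p2 * payoff(x1, x2)
    return total

# One-shot worst case: sup_Q E_Q[X2] over the two joint priors.
one_shot = max(joint_expectation(q) for q in priors)

# Stepwise (composed) worst case: worst conditional expectation given X1,
# then the worst expectation of that over the X1 marginal.
cond_worst = {x1: max(q["p_x2_given"][x1] for q in priors) for x1 in (0, 1)}
composed = max(q["p_x1"] * cond_worst[1] + (1 - q["p_x1"]) * cond_worst[0]
               for q in priors)

print(one_shot, composed)  # 0.5 vs 0.9: the tower property fails here
```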


(21)

‘Gittins’ optimality’

When the Facts Change, I Change My Mind. What Do You Do, Sir?–Keynes

▸ $\mathcal{E}_0(H^{\rho} - \mathcal{E}_0(H^{\rho})) = 0$.

▸ $\min_{\rho} \mathcal{E}_0(H^{\rho}) =: C^{\rho^*} \le C^{\rho}$, where $C^{\rho} \in \mathbb{R}$ and $\mathcal{E}_0(H^{\rho} - C^{\rho}) = 0$.

Definition

We say $Y^{\rho}$ is a compensator of a cost $H^{\rho}$ (under strategy $\rho$) if $\mathcal{E}_0(H^{\rho} - Y^{\rho}) = 0$.

We say $\rho^*$ is a Gittins' optimum if there exists a compensator family $\{Y^{\rho}\}_{\rho}$ such that $Y^{\rho^*} \le Y^{\rho}$.

▸ In a dynamic setting, we require $Y^{\rho}$ to be predictable and to supercompensate at all times.


(23)

The main theorem

There is no more miserable human being than one in whom nothing is habitual but indecision. —W. James

For a machine $m$, we have $(\mathcal{F}_t^{(m)})_{t \ge 0}$-adapted random costs $h^{(m)}(t)$.

Theorem

A Gittins' optimal strategy for the cost $H^{\rho} := \sum_{n=1}^{L} \beta^n h^{(\rho_n)}(t_n^{\rho})$ can be given by always playing a machine with the minimal index $\gamma^{(m)}(s)$, where
$$\gamma^{(m)}(s) := \operatorname*{ess\,inf}\Big\{\gamma : \operatorname*{ess\,inf}_{\tau \in \mathcal{T}^{(m)}(s)} \mathcal{E}^{(m)}\Big(\sum_{t=1}^{\tau} \beta^t \big(h^{(m)}(s+t) - \gamma\big) \,\Big|\, \mathcal{F}_s^{(m)}\Big) \le 0\Big\}.$$

▸ $\gamma^{(m)}$ can be found by solving a nonlinear optimal stopping problem independently for each machine.

▸ Playing based on indices yields a consistent decision (i.e. an optimal decision is followed through).
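Since the index is characterised by a single-machine robust stopping problem, it can in principle be computed by bisection over $\gamma$, solving a truncated stopping problem by backward induction at each candidate $\gamma$. The sketch below does this for a hypothetical two-state Markov cost with a finite set of transition kernels; all numbers, states and the recursive worst-case evaluation are an illustrative toy model, not the construction used in the paper.

```python
BETA = 0.95
COST = {"low": 0.2, "high": 0.8}                 # per-play cost in each state
# Ambiguity set: two candidate kernels, each giving P(next state = "high" | current state).
KERNELS = [{"low": 0.1, "high": 0.7}, {"low": 0.3, "high": 0.9}]
HORIZON = 200                                    # truncation of the stopping problem

def robust_stopping_value(state, gamma):
    """Worst-case discounted compensated cost, optimally stopped, starting from `state`."""
    value = {s: 0.0 for s in COST}               # value with 0 plays remaining
    for _ in range(HORIZON):
        new_value = {}
        for s in COST:
            # Worst case over kernels of: pay one more step, then stop or continue.
            worst = max(
                p_high * (COST["high"] - gamma + min(0.0, value["high"]))
                + (1 - p_high) * (COST["low"] - gamma + min(0.0, value["low"]))
                for p_high in (k[s] for k in KERNELS)
            )
            new_value[s] = BETA * worst
        value = new_value
    return value[state]

def index(state, tol=1e-6):
    """Smallest gamma for which the robust stopping value is <= 0 (bisection)."""
    lo, hi = min(COST.values()), max(COST.values())
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if robust_stopping_value(state, mid) <= 0.0 else (mid, hi)
    return hi

print(index("low"), index("high"))
```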


(26)

Bernoulli Bandit

But to us, probability is the very guide of life. —Joseph Butler

▸ Bernoulli bandit: $h(t) \sim B(1, \theta)$, where $\theta$ is unknown.

▸ We model the uncertainty by
$$\mathcal{E}^{(t)}_k(\cdot) := \operatorname*{ess\,sup}_{\theta \in \Theta^k_t} \mathbb{E}_{\theta}(\cdot),$$
where $\Theta^k_t$ is a posterior credible set of size $k$ (under an improper prior).

▸ We extend it by $\mathcal{E}_k(\,\cdot \mid \mathcal{F}_t) := \mathcal{E}^{(t)}_k\big(\ldots \mathcal{E}^{(T-1)}_k(\cdot)\ldots\big)$.

▸ It can be shown that
$$\gamma(t) = \gamma_{k,\beta}\Big(p_t, \tfrac{1}{\sqrt{n_t}}\Big) =: p_t + \text{Uncertainty valuation}.$$
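For intuition only, the static part of this uncertainty valuation can be sketched as the worst-case mean cost over a central credible set of coverage $k$; the snippet below uses a normal approximation to the posterior as a stand-in for the slide's improper-prior construction and omits the dynamic (stopping) part of the index.

```python
from math import sqrt
from statistics import NormalDist

def worst_case_mean(p_t, n_t, k):
    """Upper endpoint of a central credible set of coverage k for the mean cost,
    using a normal approximation to the posterior of a Bernoulli parameter."""
    sd = sqrt(p_t * (1 - p_t) / n_t)
    return min(1.0, p_t + NormalDist().inv_cdf(0.5 + k / 2) * sd)

for n_t in (10, 100, 10000):
    print(n_t, round(worst_case_mean(p_t=2 / 3, n_t=n_t, k=0.8), 4))
# The penalty above the point estimate 2/3 shrinks like 1/sqrt(n_t), matching the
# gamma_{k,beta}(p_t, 1/sqrt(n_t)) parametrisation on the slide.
```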


(28)

Difference between γ and p (Uncertainty valuation)

But to us, probability is the very guide of life. —Joseph Butler

[Surface plots of the uncertainty valuation γ − p as a function of p and 1/n, for β = 0.9999 and β = 0.95 with k = 0.01, 0.5 and 0.8.]

(29)

Monte-Carlo Simulation

Facts are stubborn things, but statistics are pliable.—Mark Twain

Randomly choose $a$ and $b$ independently from $\Gamma(1, 1/100)$.

Take 50 samples from $\mathrm{Beta}(a, b)$ and use them as the true probabilities of the 50 Bernoulli bandits.

Evaluate algorithms over 10000 trials per simulation.

Start with initial information of size 10 from each bandit.

We assess the performance of each algorithm by considering

Expected-expected regret: $R(L) := \sum_{n=1}^{L} \big(\theta^{(\rho_n)} - \theta^{*}\big)$.

Suboptimal plays: $N(L) := \sum_{n=1}^{L} \mathbb{1}\big(\theta^{(\rho_n)} \neq \theta^{*}\big)$.
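A minimal sketch of this experimental setup, with several assumptions flagged: $\Gamma(1, 1/100)$ is read here as a Gamma draw with mean 100, arms are treated as cost probabilities (lower $\theta$ is better, matching the talk's minimisation framing), and only a greedy baseline is run; the index, UCB and Thompson policies compared on the next slide would replace the arm-selection line.

```python
import random

M, L, INIT = 50, 10_000, 10          # arms, plays per simulation, warm-up size per arm
random.seed(0)

# True arm probabilities: a, b ~ Gamma(shape 1, scale 100) (one reading of Gamma(1, 1/100)),
# then theta_m ~ Beta(a, b) for each of the M Bernoulli bandits.
a, b = random.gammavariate(1, 100), random.gammavariate(1, 100)
theta = [random.betavariate(a, b) for _ in range(M)]
best = min(theta)                    # lowest cost probability is optimal

# Warm-up: INIT observations from every arm.
ones = [sum(random.random() < theta[m] for _ in range(INIT)) for m in range(M)]
plays = [INIT] * M

regret, suboptimal = 0.0, 0
for _ in range(L):
    # Greedy baseline: play the arm with the lowest empirical mean cost
    # (an index policy, UCB or Thompson sampling would replace this line).
    m = min(range(M), key=lambda i: ones[i] / plays[i])
    ones[m] += random.random() < theta[m]
    plays[m] += 1
    regret += theta[m] - best
    suboptimal += theta[m] != best

print(f"expected-expected regret: {regret:.1f}, suboptimal plays: {suboptimal}")
```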

(30)

Performance

Facts are stubborn things, but statistics are pliable.—Mark Twain

[Bar charts of expected-expected regret and suboptimal plays over 10000 trials, comparing the proposed (DR) index policy for β ∈ {0.9999, 0.95} and k ∈ {0.01, 0.5, 0.8, 0.98}, UCB (two parameter settings), Thompson sampling (a0, b0 ∈ {0.0001, 1, 50}) and the greedy algorithm.]

(31)

Conclusion

The finest studies are like the finest of anything else: They cost big bucks. —Charles Wheelan

Contribution

▸ We propose an alternative optimality criterion to address consistent decision making under uncertainty over multiple filtrations.

▸ We derive an index which only involves a one-dimensional (time-consistent) robust problem, which is computationally tractable.

▸ Our model takes into account the desire to learn and uncertainty aversion.

Reference

▸ S.N. Cohen and T. Treetanthiploet, Gittins’ theorem under uncertainty, arXiv:1907.05689.

(32)

Q & A

(33)

Sketch idea of the proof

Everything should be made as simple as possible, but not simpler—Albert Einstein

▸ Observe that $\gamma^{(m)}(t)$ is the minimum compensated reward which could encourage us to pay the cost from time $t$ until the optimal stopping time $\tau$.

▸ By the minimality of $\gamma^{(m)}(t)$, we must have zero total return under the optimal stopping, i.e.
$$\mathcal{E}^{(m)}\Big(\sum_{s=t+1}^{\tau} \beta^s \big(h^{(m)}(s) - \gamma^{(m)}(t)\big) \,\Big|\, \mathcal{F}_t^{(m)}\Big) = 0.$$

▸ In particular, at time τ, we need to increase the compensated reward to encourage further play.


(35)

Sketch idea of the proof

Everything should be made as simple as possible, but not simpler—Albert Einstein

▸ We increase the reward, preserving minimality, by considering the reward process $\Gamma^{(m)}(t) = \max_{0 \le \theta \le t-1} \gamma^{(m)}(\theta)$.

▸ This minimal reward encourages us to pay the random cost $h^{(m)}(t)$ until the horizon $T^{(m)}$ with return 0.

▸ Since this reward encourages us to continue at any point in time, for every $t \ge 0$,
$$\mathcal{E}^{(m)}\Big(\sum_{s=t+1}^{T^{(m)}} \beta^s \big(h^{(m)}(s) - \Gamma^{(m)}(s)\big) \,\Big|\, \mathcal{F}_t^{(m)}\Big) \le 0,$$
with equality at $t = 0$.


(37)

Sketch idea of the proof

Everything should be made as simple as possible, but not simpler—Albert Einstein

▸ Now imagine that we take a break from the $m$-th arm to play another arm. Once we return to the $m$-th arm, we face a further rescaling of the discount factor: we have the discount factor $\alpha(s)\beta^s$ instead of $\beta^s$, where $\alpha$ is decreasing.

▸ By the robust representation theorem, for any $\epsilon > 0$, one can show that there exists $Q \in \mathcal{Q}^{(m)}$ such that
$$\mathbb{E}_Q\Big[\sum_{s=1}^{T^{(m)}} \alpha(s)\beta^s \big(h^{(m)}(s) - \Gamma^{(m)}(s)\big)\Big] \ge -\epsilon$$
for every predictable decreasing process $\alpha$ in $[0, 1]$.

▸ In particular,
$$\mathcal{E}\Big(\sum_{n} \beta^n \big(h^{(\rho_n)}(t_n^{\rho}) - \Gamma^{(\rho_n)}(t_n^{\rho})\big)\Big) \ge 0.$$


(40)

Sketch idea of the proof

Everything should be made as simple as possible, but not simpler—Albert Einstein

▸ If we take a break (i.e. leave the arm) only when
$$\mathcal{E}^{(m)}\Big(\sum_{s=t+1}^{T^{(m)}} \beta^s \big(h^{(m)}(s) - \Gamma^{(m)}(s)\big) \,\Big|\, \mathcal{F}_t^{(m)}\Big) = 0,$$
i.e. when $\gamma^{(m)}$ reaches a new maximum, the total cost must be zero.

▸ In particular,
$$\mathcal{E}\Big(\sum_{n} \beta^n \big(h^{(\rho_n)}(t_n^{\rho}) - \Gamma^{(\rho_n)}(t_n^{\rho})\big)\Big) = 0.$$

▸ Finally, as $t \mapsto \Gamma^{(m)}(t)$ is increasing, $\rho$ minimises $\sum_{n=1}^{N} \beta^n \Gamma^{(\rho_n)}(t_n^{\rho})$ for all $N$.
