Reinforcement Learning
Davide Bacciu
Computational Intelligence & Machine Learning Group, Dipartimento di Informatica
Università di Pisa
bacciu@di.unipi.it
Introduzione all’Intelligenza Artificiale - A.A. 2012/2013
Lecture Outline
1 Reinforcement Learning: The Problem, Definitions, Challenges
2 Q-Learning: The Problem, Algorithm, Example
3 Issues and Related Models: Q-Learning Issues, SARSA Learning, Summary
Motivation
In some applications it is difficult or impossible to provide the learner with a golden standard (x_n, y_n) as in supervised learning:
- Game playing
  - What is the right (target) move given the current situation?
  - Easier to provide information about wins and losses
- Optimal control
  - Robot control
  - Planning
- Resource allocation
  - Job scheduling
  - Channel allocation
Reinforcement Learning (RL)
Learning to choose the best action based on rewards or punishments from the interacting environment
Actions need to maximize the amount of received reward
Data Characterization
Observations s_t
- The environment state s_t at time t as perceived by the learner
- Similar to the input x_n in supervised/unsupervised learning
- Assume finite states here
Actions a_t
- An action a_t that, when performed, can alter the current state s_t
- The target action a*_t is unknown (no supervision)
- Assume finite actions here
Rewards r(s, a)
- A reward function r(s, a) that assigns a numerical value to an action a performed in state s
- Actual rewards may not be available at every time t
The Reinforcement Learning Task
A learning agent can perceive a set of environment states S and has a set of actions A it can perform
At each time instant t the agent is in a state s_t, chooses an action a_t and performs it
The environment provides a reward r_t = r(s_t, a_t) and produces a new state s_{t+1} = δ(s_t, a_t)
r and δ belong to the environment and can be unknown to the agent
Markov Decision Process (MDP)
- r(·) and δ(·) depend only on the current state s_t and action a_t
Learning Task
Find a policy π : S → A that will maximize the cumulative reward
What is a Cumulative Reward?
The total reward accumulated over time by following the actions generated by a policy π, starting from an initial state
Several ways of computing the cumulative reward:
- Finite horizon reward: Σ_{i=0}^{h} r_{t+i}
- Average reward: lim_{h→∞} (1/h) Σ_{i=0}^{h−1} r_{t+i}
- Infinite discounted reward: Σ_{i=0}^{∞} γ^i r_{t+i}, with 0 ≤ γ < 1
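As a concrete illustration, here is a minimal sketch (in Python, not from the slides) computing the three quantities on a short, made-up reward sequence; the infinite sums are truncated to the available rewards:

```python
# Hypothetical reward sequence r_t, r_{t+1}, ... (invented for illustration)
rewards = [1.0, 0.0, 0.0, 2.0, 1.0]
gamma = 0.9          # discount factor, 0 <= gamma < 1
h = len(rewards)     # horizon

finite_horizon = sum(rewards)                                  # finite horizon: plain sum of the observed rewards
average = sum(rewards) / h                                     # average reward over the horizon
discounted = sum(gamma**i * r for i, r in enumerate(rewards))  # discounted sum of gamma^i * r_{t+i}

print(finite_horizon, average, discounted)
```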
RL Task Formalization
Assume the infinite discounted reward generated by policy π starting from state s_t:
V^π(s_t) = Σ_{i=0}^{∞} γ^i r_{t+i} = Σ_{i=0}^{∞} γ^i r(s_{t+i}, a_{t+i})
where γ determines the contribution of delayed rewards vs. immediate rewards
The RL task is to find
π* = arg max_π V^π(s), ∀ s ∈ S
Components of an RL Agent
Any RL agent includes one or several of these components:
- Policy π: determines the agent's behavior, i.e., which actions are selected in a given state
- Value function V_·(·): measures how rewarding each state and/or action is
- Model: the agent's internal representation of the environment
Maze Example
- Environment: a grid with inaccessible points, a start location and a goal location
- States: accessible locations in the grid
- Actions: move N, S, W, E
- Rewards: −1 for each move (time step)
Maze - Policy
Agent-defined moves in each maze location
Maze - Value Function
A state-dependent value function V*(s) measuring the time required to reach the goal from each location
Maze - Model
The agent's internal representation of the environment:
- State change function s' = δ(s, a)
- Rewards r(s, a) for each state
- Can be based on a very partial and approximate view of the world
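As an illustration, a minimal sketch of how such a model could be represented, assuming a hypothetical 2x2 grid; the function names (delta, reward) and the grid layout are invented for the example:

```python
# Hypothetical internal model of a tiny grid world (invented for illustration).
# States are (row, col) cells; actions are compass moves.
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}

def delta(state, action):
    """State-change function s' = delta(s, a): move if the target cell is
    inside the 2x2 grid, otherwise stay in place."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < 2 and 0 <= nc < 2:
        return (nr, nc)
    return state

def reward(state, action):
    """Reward r(s, a): -1 per move, as in the maze example."""
    return -1.0

print(delta((0, 0), "E"), reward((0, 0), "E"))   # -> (0, 1) -1.0
```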
Exploration vs. Exploitation
How does the agent choose the action?
Should we always follow only the policy π?
- Exploitation: choose the action that maximizes the reward according to your learned knowledge
- Exploration: choose an action that can help in improving your learned knowledge
In general, it is important to explore as well as to exploit
Exploration strategies
How to explore?
- Choose a random action
- Choose an action that has never been selected
- Choose an action that leads to unexplored states
When to explore?
- Until time T_explore
- ε-greedy (see the sketch below)
- Stop exploring when enough reward has been gathered
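A minimal sketch of ε-greedy action selection, assuming a tabular Q stored as a dictionary keyed by (state, action); names and values are illustrative:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest Q(state, action) (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example: Q favours "E" in state (0, 0), but exploration may still pick others
Q = {((0, 0), "E"): 1.0, ((0, 0), "N"): 0.2}
print(epsilon_greedy(Q, (0, 0), ["N", "S", "W", "E"], epsilon=0.2))
```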
Model-based vs. Model-free
Model-based
- Maximal exploitation of experience
- Optimal provided enough experience
- Often computationally intractable
Model-free
- Appropriate when the model is too complex to store, learn and compute
- E.g., learn the rewards associated to state-action pairs (the Q function)
The Q-Learning Problem
The RL task is to find the optimal policy π* s.t.
π* = arg max_π V^π(s), ∀ s ∈ S
By exploiting the definition of reward and the state-dependent value function, we obtain a reformulated problem
π*(s) = arg max_a [ r(s, a) + γ V*(s') ], ∀ s ∈ S, where s' = δ(s, a)
The Problem
Learn to choose the optimal action in a state s as the action a that maximizes the sum of the immediate reward r(s, a) plus the discounted cumulative reward from the successor state
The Optimal Q-function
Define
Q*(s, a) = r(s, a) + γ V*(δ(s, a))
The RL problem for state s becomes
π*(s) = arg max_a Q*(s, a)
Q*(s, a) denotes the maximum discounted reward that can be achieved starting in state s and choosing action a, assuming that we continue to choose actions optimally afterwards
Grid Example (γ = 0.9)
[Figure: grids showing r(s, a), V*(s), Q*(s, a) and π* for the example]
Q-Learning
What if we don't know the optimal V*(s)?
We don't need to know it, if we assume to know Q*(s, a), because
V*(s') = max_{a'} Q*(s', a')
Therefore we can rewrite
Q*(s, a) = r(s, a) + γ V*(δ(s, a)) = r(s, a) + γ max_{a'} Q*(δ(s, a), a')
A recursive formulation based on Q* only!
Learning Q
What if we don't know the optimal Q-function?
Key Idea
Use approximated Q(s, a) values that are updated with experience
Suppose the RL agent has an experience ⟨s, a, r, s'⟩
The experience provides a piece of data ΔQ computed using the current Q estimate:
ΔQ = r + γ max_{a'} Q(s', a')
The current Q estimate can be updated using the new piece of data ΔQ
Q Update
A moving average computed each time t a new experience ΔQ arrives:
Q_{t+1} = (1 − α) Q_t + α ΔQ_t
The terms are:
- Q_t → current estimate at time t
- Q_{t+1} → updated estimate
- ΔQ_t → experience at time t
- α ∈ [0, 1] is the learning rate
  - A smaller α gives a longer-term average: slower convergence but fewer fluctuations
  - Can be a function of time t
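A minimal numeric sketch of this moving-average update (the ΔQ values are made up), showing that a large α tracks recent experiences while a small α averages over a longer history:

```python
def q_update(q, delta_q, alpha):
    """Moving average: Q_{t+1} = (1 - alpha) * Q_t + alpha * DeltaQ_t."""
    return (1 - alpha) * q + alpha * delta_q

# Made-up noisy experiences DeltaQ_t fluctuating around 10
experiences = [12.0, 8.0, 11.0, 9.0, 10.5, 9.5]
for alpha in (0.9, 0.1):
    q = 0.0
    for dq in experiences:
        q = q_update(q, dq, alpha)
    print(f"alpha={alpha}: Q={q:.2f}")  # large alpha tracks recent samples, small alpha lags behind
```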
Q-Learning Algorithm
1 Initialize Q(S, A) for all states S and actions A
2 Observe the current state s
3 Repeat forever
  A Select an action a (how?) and execute it
  B Receive a reward r(s, a)
  C Observe the new state s' ← δ(s, a)
  D Update the estimate
    Q(s, a) ← (1 − α) Q(s, a) + α [ r(s, a) + γ max_{a'} Q(s', a') ]
  E Update the state s ← s'
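A minimal tabular Q-learning sketch following steps A-E above; it assumes a hypothetical episodic environment object exposing reset(), is_terminal(s), reward(s, a) and delta(s, a) (these names are illustrative, not part of the slides), with "repeat forever" rendered as a loop over episodes:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; Q is a dictionary mapping (state, action) to a value."""
    Q = defaultdict(float)                       # 1. initialize Q(s, a) to 0
    for _ in range(episodes):
        s = env.reset()                          # 2. observe the current state
        while not env.is_terminal(s):            # 3. repeat
            # A. select an action (epsilon-greedy here) and execute it
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r = env.reward(s, a)                 # B. receive the reward r(s, a)
            s_next = env.delta(s, a)             # C. observe the new state s'
            # D. update the estimate using the max over the next actions
            target = r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next                           # E. update the state
    return Q
```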
Are Q-Estimates Sufficient?
Theorem (Watkins & Dayan, 1992): Q → Q*
Q-learning will eventually converge to the optimal policy for any deterministic Markov Decision Process
Grid Example
[Figure: r(S, A) and Q_t(S, A) grids]
Agent A is in state s_1
Grid Example
[Figure: r(S, A) and Q_t(S, A) grids]
Agent A selects the action "move right" a_right, leading to the new state s_2 = δ(s_1, a_right)
Grid Example
[Figure: r(S, A) and Q_t(S, A) grids]
The reward is r(s_1, a_right) = 0. Assuming γ = 0.9 and α = 1 yields
Q_t(s_1, a_right) ← 0 + 1 · [ r(s_1, a_right) + 0.9 · max_{a'} Q_t(s_2, a') ]
                  = 0 + 0.9 · max{66, 81, 100}
                  = 90
Grid Example
[Figure: r(S, A) and Q_{t+1}(S, A) grids]
The Q-estimate for state s_1 and action a_right is updated
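The same single-step update, written out numerically as a check using the values from the example above:

```python
# Single Q-learning update with alpha = 1, gamma = 0.9, reward 0,
# and successor Q-values {66, 81, 100} as in the grid example.
alpha, gamma = 1.0, 0.9
r = 0.0
q_old = 0.0                      # previous estimate (discarded anyway since alpha = 1)
next_qs = [66.0, 81.0, 100.0]
q_new = (1 - alpha) * q_old + alpha * (r + gamma * max(next_qs))
print(q_new)   # 90.0
```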
Q-Learning Issues (I)
Delayed feedback
- Rewards for an action may not be received until several time steps later
- Credit assignment
Finite search space
- Assume states and actions are discrete and finite
- What about generalization?
  - The Q-function is basically a lookup table
  - An unrealistic assumption in large or infinite spaces
  - Need to generalize from experiences to unseen states
Q-Learning Issues (II)
Exploration strategies
- Convergence to the optimal policy is ensured only if we explore enough
- Initialize Q(S, A) with values that encourage exploration
- ε-greedy strategy: choose a random action with probability ε and the best action with probability 1 − ε
Off-policy learning
- Q-learning learns the value of the optimal policy, no matter what it does
- This can lead to dangerous policies
- Looking ahead in time for the next action can be a solution
On-policy Learning
On-policy procedures learn the value of the policy being followed
- The SARSA algorithm does on-policy learning
- The experience is now ⟨s, a, r, s', a'⟩ (SARSA)
- It looks ahead at the next action a' being performed
- The next action is chosen based on the current policy
SARSA Learning Algorithm
1 Initialize Q(S, A) for all states S and actions A
2 Observe the current state s
3 Select action a using a policy based on the current Q(S, A)
4 Repeat forever
  A Execute action a
  B Receive a reward r(s, a)
  C Observe the new state s' ← δ(s, a)
  D Select action a' using a policy based on the current Q(S, A)
  E Update the estimate
    Q(s, a) ← (1 − α) Q(s, a) + α ( r(s, a) + γ Q(s', a') )
  F Update the current state s ← s' and action a ← a'
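A minimal SARSA sketch under the same assumptions as the Q-learning sketch above (hypothetical environment with reset(), is_terminal(s), reward(s, a), delta(s, a)); the only change is that the target uses the next action a' actually chosen by the ε-greedy policy instead of the max:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular SARSA (on-policy): the target uses Q(s', a') for the action a'
    chosen by the current epsilon-greedy policy, not the max over actions."""
    def select(Q, s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = select(Q, s)                              # choose a from the current policy
        while not env.is_terminal(s):
            r = env.reward(s, a)
            s_next = env.delta(s, a)
            a_next = select(Q, s_next)                # choose a' from the current policy
            target = r + gamma * Q[(s_next, a_next)]  # on-policy target
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s, a = s_next, a_next
    return Q
```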
Take Home Messages
Reinforcement Learning allows learning when the golden standard is not available
- An agent interacting with the environment
- The environment can provide rewards
- The environment is (partially) observable
An RL agent includes at least one of
- Policy
- Value function
- Environment model
Q-Learning
- A practical and simple algorithm
- Learns the maximum discounted reward that can be achieved starting in a state s and choosing action a
- Theoretical guarantees of optimality if the problem is an MDP
Take Home Issues
Exploration vs. Exploitation
- Choose actions to maximize reward
- Choose actions to maximize knowledge acquisition
More open questions...
- Credit assignment
- Continuous state and action spaces
- Generalization
- Learning the value function (TD learning)
- Model-based reinforcement learning