Reinforcement Learning
Davide Bacciu
Computational Intelligence & Machine Learning Group, Dipartimento di Informatica
Università di Pisa
bacciu@di.unipi.it
Introduzione all’Intelligenza Artificiale - A.A. 2012/2013
Lecture Outline
1 Reinforcement Learning: The Problem, Definitions, Challenges
2 Q-Learning: The Problem, Algorithm, Example
3 Issues and Related Models: Q-Learning Issues, SARSA Learning, Summary
Motivation
In some applications it is difficult or impossible to provide the learner with a golden standard (x_n, y_n) as in supervised learning:
- Game playing
  - What is the right (target) move given the current situation?
  - Easier to provide information about wins and losses
- Optimal control
  - Robot control
  - Planning
- Resource allocation
  - Job scheduling
  - Channel allocation
Reinforcement Learning (RL)
Learning to choose the best action based on rewards or punishments from the interacting environment
Actions need to maximize the amount of received reward
Data Characterization
Observations s_t
- The environment state s_t at time t as perceived by the learner
- Similar to the input x_n in supervised/unsupervised learning
- Assume finite states here
Actions a_t
- An action a_t that, when performed, can alter the current state s_t
- The target action a*_t is unknown (no supervision)
- Assume finite actions here
Rewards r(s, a)
- A reward function r(s, a) that assigns a numerical value to an action a performed in state s
- Actual rewards may not be available at every time t
The Reinforcement Learning Task
A learning agent can perceive a set of environment states S and has a set of actions A it can perform
At each time instant t the agent is in a state s_t, chooses an action a_t and performs it
The environment provides a reward r_t = r(s_t, a_t) and produces a new state s_{t+1} = δ(s_t, a_t)
r and δ belong to the environment and can be unknown to the agent
Markov Decision Process (MDP)
- r(·) and δ(·) depend only on the current state s_t and action a_t
Learning Task
Find a policy π : S → A that will maximize the cumulative reward
What is a Cumulative Reward?
The total reward accumulated over time by following the actions generated by a policy π, starting from an initial state
Several ways of computing the cumulative reward:
- Finite horizon reward: Σ_{i=0}^{h} r_{t+i}
- Average reward: lim_{h→∞} (1/h) Σ_{i=0}^{h−1} r_{t+i}
- Infinite discounted reward: Σ_{i=0}^{∞} γ^i r_{t+i}, with 0 ≤ γ < 1
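As a concrete illustration, here is a minimal sketch (in Python, not from the slides) computing the three quantities on a short, made-up reward sequence; the infinite sums are truncated to the available rewards:

```python
# Hypothetical reward sequence r_t, r_{t+1}, ... (invented for illustration)
rewards = [1.0, 0.0, 0.0, 2.0, 1.0]
gamma = 0.9          # discount factor, 0 <= gamma < 1
h = len(rewards)     # horizon

finite_horizon = sum(rewards)                                  # finite horizon: plain sum of the observed rewards
average = sum(rewards) / h                                     # average reward over the horizon
discounted = sum(gamma**i * r for i, r in enumerate(rewards))  # discounted sum of gamma^i * r_{t+i}

print(finite_horizon, average, discounted)
```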
RL Task Formalization
Assume the infinite discounted reward generated by policy π starting from state s_t:
V^π(s_t) = Σ_{i=0}^{∞} γ^i r_{t+i} = Σ_{i=0}^{∞} γ^i r(s_{t+i}, a_{t+i})
where γ determines the contribution of delayed rewards vs. immediate rewards
The RL task is to find
π* = arg max_π V^π(s), ∀ s ∈ S
Components of an RL Agent
Any RL agent includes one or several of these components:
- Policy π: determines the agent's behavior, i.e., which actions are selected in a given state
- Value function V_·(·): measures how rewarding each state and/or action is
- Model: the agent's internal representation of the environment
Maze Example
- Environment: a grid with inaccessible points, a start location and a goal location
- States: accessible locations in the grid
- Actions: move N, S, W, E
- Rewards: −1 for each move (time step)
Maze - Policy
Agent-defined moves in each maze location
Maze - Value Function
A state-dependent value function V*(s) measuring the time required to reach the goal from each location
Maze - Model
The agent's internal representation of the environment:
- State change function s' = δ(s, a)
- Rewards r(s, a) for each state
- Can be based on a very partial and approximate view of the world
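As an illustration, a minimal sketch of how such a model could be represented, assuming a hypothetical 2x2 grid; the function names (delta, reward) and the grid layout are invented for the example:

```python
# Hypothetical internal model of a tiny grid world (invented for illustration).
# States are (row, col) cells; actions are compass moves.
MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}

def delta(state, action):
    """State-change function s' = delta(s, a): move if the target cell is
    inside the 2x2 grid, otherwise stay in place."""
    r, c = state
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < 2 and 0 <= nc < 2:
        return (nr, nc)
    return state

def reward(state, action):
    """Reward r(s, a): -1 per move, as in the maze example."""
    return -1.0

print(delta((0, 0), "E"), reward((0, 0), "E"))   # -> (0, 1) -1.0
```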
Exploration vs. Exploitation
How does the agent choose the action?
Should we always follow only the policy π?
- Exploitation: choose the action that maximizes the reward according to your learned knowledge
- Exploration: choose an action that can help in improving your learned knowledge
In general, it is important to explore as well as to exploit
Exploration strategies
How to explore?
- Choose a random action
- Choose an action that has never been selected
- Choose an action that leads to unexplored states
When to explore?
- Until time T_explore
- ε-greedy (see the sketch below)
- Stop exploring when enough reward has been gathered
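A minimal sketch of ε-greedy action selection, assuming a tabular Q stored as a dictionary keyed by (state, action); names and values are illustrative:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon (exploration),
    otherwise the action with the highest Q(state, action) (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Example: Q favours "E" in state (0, 0), but exploration may still pick others
Q = {((0, 0), "E"): 1.0, ((0, 0), "N"): 0.2}
print(epsilon_greedy(Q, (0, 0), ["N", "S", "W", "E"], epsilon=0.2))
```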
Model-based vs. Model-free
Model-based
- Maximal exploitation of experience
- Optimal provided enough experience
- Often computationally intractable
Model-free
- Appropriate when the model is too complex to store, learn and compute
- E.g., learn the rewards associated to state-action pairs (the Q function)
The Q-Learning Problem
The RL task is to find the optimal policy π* s.t.
π* = arg max_π V^π(s), ∀ s ∈ S
By exploiting the definition of reward and the state-dependent value function, we obtain a reformulated problem
π*(s) = arg max_a [ r(s, a) + γ V*(s') ], ∀ s ∈ S, where s' = δ(s, a)
The Problem
Learn to choose the optimal action in a state s as the action a that maximizes the sum of the immediate reward r(s, a) plus the discounted cumulative reward from the successor state
The Optimal Q-function
Define
Q*(s, a) = r(s, a) + γ V*(δ(s, a))
The RL problem for state s becomes
π*(s) = arg max_a Q*(s, a)
Q*(s, a) denotes the maximum discounted reward that can be achieved starting in state s and choosing action a, assuming that we continue to choose actions optimally afterwards
Grid Example (γ = 0.9)
[Figure: grids showing r(s, a), V*(s), Q*(s, a) and π* for the example]
Q-Learning
What if we don't know the optimal V*(s)?
We don't need to know it, if we assume to know Q*(s, a), because
V*(s') = max_{a'} Q*(s', a')
Therefore we can rewrite
Q*(s, a) = r(s, a) + γ V*(δ(s, a)) = r(s, a) + γ max_{a'} Q*(δ(s, a), a')
A recursive formulation based on Q* only!
Learning Q
What if we don't know the optimal Q-function?
Key Idea
Use approximated Q(s, a) values that are updated with experience
Suppose the RL agent has an experience ⟨s, a, r, s'⟩
The experience provides a piece of data ΔQ computed using the current Q estimate:
ΔQ = r + γ max_{a'} Q(s', a')
The current Q estimate can be updated using the new piece of data ΔQ
Q Update
A moving average computed each time t a new experience ΔQ arrives:
Q_{t+1} = (1 − α) Q_t + α ΔQ_t
The terms are:
- Q_t → current estimate at time t
- Q_{t+1} → updated estimate
- ΔQ_t → experience at time t
- α ∈ [0, 1] is the learning rate
  - A smaller α gives a longer-term average: slower convergence but fewer fluctuations
  - Can be a function of time t
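A minimal numeric sketch of this moving-average update (the ΔQ values are made up), showing that a large α tracks recent experiences while a small α averages over a longer history:

```python
def q_update(q, delta_q, alpha):
    """Moving average: Q_{t+1} = (1 - alpha) * Q_t + alpha * DeltaQ_t."""
    return (1 - alpha) * q + alpha * delta_q

# Made-up noisy experiences DeltaQ_t fluctuating around 10
experiences = [12.0, 8.0, 11.0, 9.0, 10.5, 9.5]
for alpha in (0.9, 0.1):
    q = 0.0
    for dq in experiences:
        q = q_update(q, dq, alpha)
    print(f"alpha={alpha}: Q={q:.2f}")  # large alpha tracks recent samples, small alpha lags behind
```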
Q-Learning Algorithm
1 Initialize Q(S, A) for all states S and actions A
2 Observe the current state s
3 Repeat forever
  A Select an action a (how?) and execute it
  B Receive a reward r(s, a)
  C Observe the new state s' ← δ(s, a)
  D Update the estimate
    Q(s, a) ← (1 − α) Q(s, a) + α [ r(s, a) + γ max_{a'} Q(s', a') ]
  E Update the state s ← s'
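A minimal tabular Q-learning sketch following steps A-E above; it assumes a hypothetical episodic environment object exposing reset(), is_terminal(s), reward(s, a) and delta(s, a) (these names are illustrative, not part of the slides), with "repeat forever" rendered as a loop over episodes:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; Q is a dictionary mapping (state, action) to a value."""
    Q = defaultdict(float)                       # 1. initialize Q(s, a) to 0
    for _ in range(episodes):
        s = env.reset()                          # 2. observe the current state
        while not env.is_terminal(s):            # 3. repeat
            # A. select an action (epsilon-greedy here) and execute it
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r = env.reward(s, a)                 # B. receive the reward r(s, a)
            s_next = env.delta(s, a)             # C. observe the new state s'
            # D. update the estimate using the max over the next actions
            target = r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next                           # E. update the state
    return Q
```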
Are Q-Estimates Sufficient?
Theorem (Watkins & Dayan, 1992): Q → Q*
Q-learning will eventually converge to the optimal policy for any deterministic Markov Decision Process
Grid Example
[Figure: r(S, A) and Q_t(S, A) grids]
Agent A is in state s_1
Grid Example
[Figure: r(S, A) and Q_t(S, A) grids]
Agent A selects the action "move right" a_right, leading to the new state s_2 = δ(s_1, a_right)
Grid Example
[Figure: r(S, A) and Q_t(S, A) grids]
The reward is r(s_1, a_right) = 0. Assuming γ = 0.9 and α = 1 yields
Q_t(s_1, a_right) ← 0 + 1 · [ r(s_1, a_right) + 0.9 · max_{a'} Q_t(s_2, a') ]
                  = 0 + 0.9 · max{66, 81, 100}
                  = 90
Grid Example
[Figure: r(S, A) and Q_{t+1}(S, A) grids]
The Q-estimate for state s_1 and action a_right is updated
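The same single-step update, written out numerically as a check using the values from the example above:

```python
# Single Q-learning update with alpha = 1, gamma = 0.9, reward 0,
# and successor Q-values {66, 81, 100} as in the grid example.
alpha, gamma = 1.0, 0.9
r = 0.0
q_old = 0.0                      # previous estimate (discarded anyway since alpha = 1)
next_qs = [66.0, 81.0, 100.0]
q_new = (1 - alpha) * q_old + alpha * (r + gamma * max(next_qs))
print(q_new)   # 90.0
```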
Q-Learning Issues (I)
Delayed feedback
- Rewards for an action may not be received until several time steps later
- Credit assignment
Finite search space
- Assume states and actions are discrete and finite
- What about generalization?
  - The Q-function is basically a lookup table
  - An unrealistic assumption in large or infinite spaces
  - Need to generalize from experiences to unseen states
Q-Learning Issues (II)
Exploration strategies
- Convergence to the optimal policy is ensured only if we explore enough
- Initialize Q(S, A) with values that encourage exploration
- ε-greedy strategy: choose a random action with probability ε and the best action with probability 1 − ε
Off-policy learning
- Q-learning learns the value of the optimal policy, no matter what it does
- This can lead to dangerous policies
- Looking ahead in time for the next action can be a solution
On-policy Learning
On-policy procedures learn the value of the policy being followed
- The SARSA algorithm does on-policy learning
- The experience is now ⟨s, a, r, s', a'⟩ (SARSA)
- It looks ahead at the next action a' being performed
- The next action is chosen based on the current policy
SARSA Learning Algorithm
1 Initialize Q(S, A) for all states S and actions A
2 Observe the current state s
3 Select action a using a policy based on the current Q(S, A)
4 Repeat forever
  A Execute action a
  B Receive a reward r(s, a)
  C Observe the new state s' ← δ(s, a)
  D Select action a' using a policy based on the current Q(S, A)
  E Update the estimate
    Q(s, a) ← (1 − α) Q(s, a) + α ( r(s, a) + γ Q(s', a') )
  F Update the current state s ← s' and action a ← a'
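A minimal SARSA sketch under the same assumptions as the Q-learning sketch above (hypothetical environment with reset(), is_terminal(s), reward(s, a), delta(s, a)); the only change is that the target uses the next action a' actually chosen by the ε-greedy policy instead of the max:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular SARSA (on-policy): the target uses Q(s', a') for the action a'
    chosen by the current epsilon-greedy policy, not the max over actions."""
    def select(Q, s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = select(Q, s)                              # choose a from the current policy
        while not env.is_terminal(s):
            r = env.reward(s, a)
            s_next = env.delta(s, a)
            a_next = select(Q, s_next)                # choose a' from the current policy
            target = r + gamma * Q[(s_next, a_next)]  # on-policy target
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s, a = s_next, a_next
    return Q
```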
Take Home Messages
Reinforcement Learning allows learning when the golden standard is not available
- An agent interacting with the environment
- The environment can provide rewards
- The environment is (partially) observable
An RL agent includes at least one of
- Policy
- Value function
- Environment model
Q-Learning
- A practical and simple algorithm
- Learns the maximum discounted reward that can be achieved starting in a state s and choosing action a
- Theoretical guarantees of optimality if the problem is an MDP
Take Home Issues
Exploration vs. Exploitation
- Choose actions to maximize reward
- Choose actions to maximize knowledge acquisition
More open questions...
- Credit assignment
- Continuous state and action spaces
- Generalization
- Learning the value function (TD learning)
- Model-based reinforcement learning