
(1)

Reinforcement Learning

Davide Bacciu

Computational Intelligence & Machine Learning Group
Dipartimento di Informatica, Università di Pisa
bacciu@di.unipi.it

Introduction to Artificial Intelligence - A.A. 2012/2013

(2)

Lecture Outline

1 Reinforcement Learning
   The Problem
   Definitions
   Challenges

2 Q-Learning
   The Problem
   Algorithm
   Example

3 Issues and Related Models
   Q-Learning Issues
   SARSA Learning
   Summary

(3)

Motivation

In some applications it is difficult or impossible to provide the learner with a golden standard (x_n, y_n) (supervised learning)

Game playing
   What is the right (target) move given the current situation?
   Easier to provide information about wins and losses

Optimal control
   Robot control
   Planning

Resource allocation
   Job scheduling
   Channel allocation

(4)

Reinforcement Learning (RL)

Learning to choose the best action based on rewards or punishments from the interacting environment

Actions need to maximize the amount of received rewards

(5)

Data Characterization

Observations s_t
   The environment state s_t at time t as perceived by the learner
   Similar to the input x_n in supervised/unsupervised learning
   Assume finite states here

Actions a_t
   An action a_t that, when performed, can alter the current state s_t
   The target action a_t is unknown (no supervision)
   Assume finite actions here

Rewards r(s, a)
   A reward function r(s, a) that assigns a numerical value to an action a performed in state s
   Actual rewards may not be available at every time t

(6)

The Reinforcement Learning Task

A learning agent can perceive a set of environment states S and has a set of actions A it can perform

At each time instant t the agent is in a state s_t, chooses an action a_t and performs it

The environment provides a reward r_t = r(s_t, a_t) and produces a new state s_{t+1} = δ(s_t, a_t)

r and δ belong to the environment and can be unknown to the agent

Markov Decision Process (MDP)
   r(·) and δ(·) depend only on the current state s_t and action a_t

Learning Task
   Find a policy π : S → A that will maximize the cumulative reward
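To make these ingredients concrete, here is a minimal Python sketch of a deterministic finite MDP in the spirit of this formulation. The class name, the toy states and the reward values are illustrative assumptions, not part of the original lecture.

# Minimal sketch of a deterministic finite MDP: states S, actions A,
# a reward function r(s, a) and a state-transition function delta(s, a).
class ToyMDP:
    def __init__(self):
        self.states = ["s1", "s2", "goal"]
        self.actions = ["left", "right"]
        # delta(s, a): deterministic next state
        self.delta = {("s1", "right"): "s2", ("s1", "left"): "s1",
                      ("s2", "right"): "goal", ("s2", "left"): "s1",
                      ("goal", "right"): "goal", ("goal", "left"): "goal"}
        # r(s, a): numerical reward for performing a in s
        self.reward = {(s, a): 0.0 for s in self.states for a in self.actions}
        self.reward[("s2", "right")] = 100.0  # entering the goal is rewarded

    def step(self, s, a):
        """Return the reward r(s, a) and the new state delta(s, a)."""
        return self.reward[(s, a)], self.delta[(s, a)]

A policy is then just a mapping π : S → A, e.g. policy = {"s1": "right", "s2": "right", "goal": "right"}.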

(7)

What is a Cumulative Reward?

The total reward accumulated over time by following the actions generated by a policy π, starting from an initial state

Several ways of computing the cumulative reward:

Finite horizon reward
   Σ_{i=0}^{h} r_{t+i}

Average reward
   lim_{h→∞} (1/h) Σ_{i=0}^{h−1} r_{t+i}

Infinite discounted reward
   Σ_{i=0}^{∞} γ^i r_{t+i}   with 0 ≤ γ < 1
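The three notions of cumulative reward translate directly into code. The sketch below assumes the rewards r_t, r_{t+1}, ... are given as a plain Python list; the infinite sums are necessarily truncated to the available data.

def finite_horizon_reward(rewards, h):
    """Sum of the first h+1 rewards: sum_{i=0}^{h} r_{t+i}."""
    return sum(rewards[:h + 1])

def average_reward(rewards):
    """Average reward (1/h) * sum_{i=0}^{h-1} r_{t+i} over the available horizon h."""
    return sum(rewards) / len(rewards)

def discounted_reward(rewards, gamma=0.9):
    """Discounted reward sum_i gamma^i * r_{t+i}, with 0 <= gamma < 1."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))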

(8)

RL Task Formalization

Assume the infinite discounted reward generated by policy π starting from state s_t

   V^π(s_t) = Σ_{i=0}^{∞} γ^i r_{t+i} = Σ_{i=0}^{∞} γ^i r(s_{t+i}, a_{t+i})

where γ determines the contribution of delayed rewards vs immediate rewards

The RL task is to find

   π* = arg max_π V^π(s)   ∀s ∈ S

(9)

Components of a RL Agent

Any RL agent includes one or more of these components:

Policy π
   Determines the agent's behavior
   What actions are selected in a given state

Value function V(·)
   Measures how rewarding each state and/or action is

Model
   The agent's internal representation of the environment

(10)

Maze Example

Environment - a grid with inaccessible points, a start and a goal location
States - accessible locations in the grid
Actions - move N, S, W, E
Rewards - −1 for each move (time step)

(11)

Maze - Policy

Agent-defined moves in each maze location

(12)

Maze - Value Function

A state-dependent value function V(s) measuring the time required to reach the goal from each location

(13)

Maze - Model

The agent's internal representation of the environment

State change function s' = δ(s, a)
Rewards r(s, a) for each state
Can be based on a very partial and approximate view of the world
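One simple way to picture such a model is as two lookup tables estimating δ and r from experience. The sketch below uses a hypothetical two-cell corridor rather than the maze on the slide; all names are illustrative.

# A (possibly partial and approximate) internal model of the environment:
# estimates of the state-change function delta(s, a) and of the rewards r(s, a).
model_delta = {("A1", "E"): "A2", ("A2", "E"): "goal"}   # s' = delta(s, a)
model_reward = {("A1", "E"): -1.0, ("A2", "E"): -1.0}    # -1 per move, as in the maze

def predict(s, a):
    """Predicted (reward, next state) for action a in state s, None if never seen."""
    return model_reward.get((s, a)), model_delta.get((s, a))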

(14)

Exploration Vs Exploitation

How does the agent choose the action?

Should we always follow only the policy π?

Exploitation - choose the action that maximizes the reward according to your learned knowledge

Exploration - choose an action that can help in improving your learned knowledge

In general, it is important to explore as well as to exploit

(15)

Exploration strategies

How to explore?
   Choose a random action
   Choose an action that has never been selected
   Choose an action that leads to unexplored states

When to explore?
   Until time T_explore
   ε-greedy (see the sketch below)
   Stop exploring when enough reward has been gathered
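A standard way to combine the two regimes is ε-greedy selection: explore with probability ε, exploit otherwise. The sketch below assumes the Q estimates are stored in a dictionary keyed by (state, action); the function name is mine.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the action with the highest current Q estimate (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])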

(16)

Model-based Vs Model Free

Model-based
   Maximal exploitation of experience
   Optimal provided enough experience
   Often computationally intractable

Model-free
   Appropriate when the model is too complex to store, learn and compute
   E.g. learn the rewards associated with state-action pairs (the Q function)

(17)

The Q-Learning Problem

The RL task is to find the optimal policy π* s.t.

   π* = arg max_π V^π(s)   ∀s ∈ S

By exploiting the definition of reward and the state-dependent value function, we obtain a reformulated problem

   π*(s) = arg max_a [ r(s, a) + γ V*(s') ]   ∀s ∈ S, where s' = δ(s, a)

The Problem
   Learn to choose the optimal action in state s as the action a that maximizes the sum of the immediate reward r(s, a) plus the discounted cumulative reward from the successor state

(18)

The Optimal Q-function

Define

   Q(s, a) = r(s, a) + γ V*(δ(s, a))

The RL problem for state s becomes

   π*(s) = arg max_a Q(s, a)

Q(s, a) denotes the maximum discounted reward that can be achieved starting in state s and choosing action a, assuming that we continue to choose actions optimally
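Given a Q table, both the optimal policy and the optimal value function are obtained with a maximization over actions. A minimal sketch, again assuming Q is a dictionary keyed by (state, action):

def greedy_policy(Q, state, actions):
    """pi*(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

def state_value(Q, state, actions):
    """V*(s) = max_a Q(s, a): best discounted reward achievable from s."""
    return max(Q[(state, a)] for a in actions)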

(19)

Grid Example (γ = 0.9)

[Figure: grid diagrams showing r(S, A), V(s), Q(s, a) and the policy π]

(20)

Q-Learning

What if we don't know the optimal V*(s)?

We don't need to know it, if we assume to know Q(s, a), because

   V*(s') = max_{a'} Q(s', a')

Therefore we can rewrite

   Q(s, a) = r(s, a) + γ V*(δ(s, a))
           = r(s, a) + γ max_{a'} Q(δ(s, a), a')

A recursive formulation based on Q only!

(21)

Learning Q

What if we don't know the optimal Q-function?

Key Idea
   Use approximated Q(s, a) values that are updated with experience

Suppose the RL agent has an experience ⟨s, a, r, s'⟩

The experience provides a piece of data ∆Q computed using the current Q estimate

   ∆Q = r + γ max_{a'} Q(s', a')

The current Q estimate can be updated using the new piece of data ∆Q

(22)

Q Update

A moving average computed each time t a new experience ∆Q arrives

   Q_{t+1} = (1 − α) Q_t + α ∆Q_t

The terms are
   Q_t → current estimate at time t
   Q_{t+1} → updated estimate
   ∆Q_t → experience at time t
   α ∈ [0, 1] is the learning rate
      A smaller α determines a longer-term average
      Slower convergence but fewer fluctuations
      Can be a function of time t
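The update itself is one line of code. A small illustrative sketch (function name and numbers are mine) showing how α trades responsiveness for stability:

def q_update(q_t, delta_q, alpha):
    """Q_{t+1} = (1 - alpha) * Q_t + alpha * DeltaQ_t."""
    return (1 - alpha) * q_t + alpha * delta_q

# alpha = 1 replaces the old estimate entirely; a smaller alpha averages over time.
print(q_update(50.0, 90.0, alpha=1.0))   # 90.0
print(q_update(50.0, 90.0, alpha=0.1))   # 54.0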

(23)

Q-Learning Algorithm

1 Initialize Q(S, A) for all states S and actions A

2 Observe the current state s

3 Repeat forever
   A Select an action a (How?) and execute it
   B Receive a reward r(s, a)
   C Observe the new state s' ← δ(s, a)
   D Update the estimate
      Q(s, a) ← (1 − α) Q(s, a) + α ( r(s, a) + γ max_{a'} Q(s', a') )
   E Update the state s ← s'
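Putting the steps together, here is a minimal tabular Q-learning sketch in Python. The episodic loop, the env interface (reset(), step(s, a) returning r(s, a) and δ(s, a), is_terminal(s)) and the ε-greedy selection are assumptions made for the example; the update rule itself is the one on the slide.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) <- (1-alpha)Q(s,a) + alpha(r + gamma max_a' Q(s',a'))."""
    Q = defaultdict(float)                    # 1. initialize Q(S, A) (all zeros here)
    for _ in range(episodes):
        s = env.reset()                       # 2. observe the current state
        while not env.is_terminal(s):         # 3. repeat
            # A. select an action (epsilon-greedy) and execute it
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            r, s_next = env.step(s, a)        # B/C. receive r(s, a), observe s' = delta(s, a)
            # D. update the estimate
            target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next                        # E. update the state
    return Q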

(24)

Are Q-Estimates Sufficient?

Theorem (Watkins & Dayan, 1992): Q̂ −→ Q*  (the Q estimates converge to the optimal Q-function)

Q-learning will eventually converge to the optimal policy, for any deterministic Markov Decision Process

(25)

Grid Example

[Figure: grids showing r(S, A) and the current estimates Q_t(S, A)]

Agent A is in state s_1

(26)

Grid Example

[Figure: grids showing r(S, A) and Q_t(S, A)]

Agent A selects the action "move right" a_right, leading to the new state s_2 = δ(s_1, a_right)

(27)

Grid Example

[Figure: grids showing r(S, A) and Q_t(S, A)]

The reward is r(s_1, a_right) = 0. Assuming γ = 0.9 and α = 1 yields

   Q(s_1, a_right) ← 0 + 1 · ( r(s_1, a_right) + 0.9 · max_{a'} Q_t(s_2, a') )
                   ← 0 + 0.9 · max{66, 81, 100}
                   ← 90
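A quick numeric check of the update above, using the Q values {66, 81, 100} read off the slide (a small illustrative computation):

alpha, gamma, r = 1.0, 0.9, 0.0       # as assumed on the slide
q_old = 0.0                           # old estimate; its value is irrelevant since alpha = 1
q_next = [66.0, 81.0, 100.0]          # Q_t(s2, a') for the actions available in s2
q_new = (1 - alpha) * q_old + alpha * (r + gamma * max(q_next))
print(q_new)                          # 90.0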

(28)

Grid Example

[Figure: grids showing r(S, A) and the updated estimates Q_{t+1}(S, A)]

The Q-estimate for state s_1 and action a_right is updated

(29)

Q-Learning Issues (I)

Delayed feedback
   Rewards for an action may not be received until several time steps later
   Credit assignment

Finite search space
   Assumes states and actions are discrete and finite

What about generalization?
   The Q-function is basically a lookup table
   Unrealistic assumption in large or infinite spaces
   Generalize from experiences to unseen states

(30)

Q-Learning Issues (II)

Exploration Strategies
   Convergence to the optimal policy is ensured only if we explore enough
   Initialize Q(S, A) with values that encourage exploration
   ε-greedy strategy: choose a random action with probability ε and the best action with probability 1 − ε

Off-policy learning
   Q-learning learns the value of the optimal policy, no matter what it does
   Can lead to dangerous policies
   Looking ahead in time for the next action can be a solution

(31)

On-policy Learning

On-policy procedures learn the value of the policy being followed

The SARSA algorithm does on-policy learning
Experience is now ⟨s, a, r, s', a'⟩ (SARSA)
Looking ahead at the next action a' being performed
The next action is chosen according to the current policy

(32)

SARSA Learning Algorithm

1 Initialize Q(S, A) for all states S and actions A

2 Observe the current state s

3 Select action a using a policy based on the current Q(S, A)

4 Repeat forever
   A Execute action a
   B Receive a reward r(s, a)
   C Observe the new state s' ← δ(s, a)
   D Select action a' using a policy based on the current Q(S, A)
   E Update the estimate
      Q(s, a) ← (1 − α) Q(s, a) + α ( r(s, a) + γ Q(s', a') )
   F Update the current state s ← s' and action a ← a'
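For comparison with Q-learning, a minimal on-policy SARSA sketch under the same assumed environment interface; the only substantive change is step E, which uses the Q value of the action a' actually selected instead of the max over actions.

import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1):
    """On-policy SARSA: Q(s,a) <- (1-alpha)Q(s,a) + alpha(r + gamma Q(s',a'))."""
    Q = defaultdict(float)                        # 1. initialize Q(S, A)

    def select(s):
        """Policy based on the current Q(S, A) (epsilon-greedy)."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a_: Q[(s, a_)])

    for _ in range(episodes):
        s = env.reset()                           # 2. observe the current state
        a = select(s)                             # 3. select action a from the current policy
        while not env.is_terminal(s):             # 4. repeat
            r, s_next = env.step(s, a)            # A/B/C. execute a, receive r, observe s'
            a_next = select(s_next)               # D. select a' from the current policy
            # E. update using the action actually chosen (on-policy)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s_next, a_next)])
            s, a = s_next, a_next                 # F. update current state and action
    return Q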

(33)

Take Home Messages

Reinforcement Learning allows learning when the golden standard is not available
   An agent interacting with the environment
   The environment can provide rewards
   The environment is (partially) observable

A RL agent includes at least one among
   Policy
   Value function
   Environment model

Q-Learning
   Practical and simple algorithm
   Learns the maximum discounted reward that can be achieved starting in a state s and choosing action a
   Theoretical guarantees of optimality if the problem is an MDP

(34)

Take Home Issues

Exploration Vs Exploitation
   Choose actions to maximize reward
   Choose actions to maximize knowledge acquisition

More open questions...
   Credit assignment
   Continuous state and action spaces
   Generalization
   Learning the value function (TD learning)
   Model-based reinforcement learning
