(1)

Reinforcement learning

Success stories, ideas and open challenges

Roberto Capobianco

(2)
(3)
(4)
(5)

Reinforcement Learning

All of this is just reinforcement learning!

(6)

Reinforcement Learning

All of this is just reinforcement learning!

• Sparse and time-delayed “labels”

• State-action sequences: s_0, a_0, s_1, a_1, …, s_{n-1}, a_{n-1}, s_n

• Interaction with the environment

• Rewards

• Q-function and discounted future reward (!)
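
The last bullet refers to the discounted return and the action-value (Q) function. Written out in a standard formulation (γ is the discount factor and α the learning rate; neither is named explicitly on the slide):

R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \qquad Q(s, a) = \mathbb{E}\left[ R_t \mid s_t = s, a_t = a \right]

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)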

(7)

Do you see any problems?

(8)

Do you see any problems?

• State modeling

• Q-function as a table

• Generalization capabilities

(9)

• State modeling (e.g., Atari game)

• Q-function as a table

• Generalization capabilities

We can use observations, e.g. image pixels

(10)

• State modeling

• Q-function as a table

• Generalization capabilities

(11)

• State modeling

• Q-function as a table

• Generalization capabilities

More configurations than the number of atoms in the universe!

(12)

• State modeling

• Q-function as a table

• Generalization capabilities

More configurations than the number of atoms in the universe!

We can use function approximators

(13)

• State modeling

• Q-function as a table

• Generalization capabilities

More configurations than the number of atoms in the universe!

We can use function approximators

this introduces errors, but also helps with generalization

(14)

Q-networks

• Neural networks are universal function approximators

• Q-networks

(15)

DQN architecture
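
The architecture itself appears only as a figure on this slide. Below is a minimal PyTorch sketch of the standard DQN convolutional network (Mnih et al., 2015), which is presumably what the figure shows; the class name and the frame/action counts are illustrative:

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional torso over stacked frames, fully connected head with one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4x84x84 input -> 32x20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                               # one Q-value per action
        )

    def forward(self, x):
        # x: batch of 4 stacked 84x84 grayscale frames, scaled to [0, 1]
        return self.net(x)

# Usage: q_values = DQN(n_actions=6)(torch.zeros(1, 4, 84, 84))  # shape (1, 6)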

(16)

DQN tricks

Q-learning, like most TD algorithms, does not work in practice without tricks (a code sketch of these tricks follows the list below)

• Sampled mini-batches for learning the parameters

stabilizes the network and reduces correlations in the data

• Exploration with an ε-greedy policy

with probability ε select a random action, otherwise act greedily w.r.t. Q

• Update the target network only every once in a while
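
A minimal Python sketch of the three tricks above: experience replay with sampled mini-batches, ε-greedy action selection, and a periodically updated target network. Names such as q_net and target_net and the hyperparameter values are illustrative, not from the slides:

import random
from collections import deque

import torch
import torch.nn.functional as F

buffer = deque(maxlen=100_000)   # experience replay memory of (s, a, r, s', done) tensors
epsilon, gamma = 0.1, 0.99       # exploration rate and discount factor

def select_action(q_net, state, n_actions):
    # epsilon-greedy: random action with probability epsilon, otherwise greedy w.r.t. Q
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1))

def train_step(q_net, target_net, optimizer, batch_size=32):
    # sampling a mini-batch breaks temporal correlations and stabilizes learning
    s, a, r, s2, done = map(torch.stack, zip(*random.sample(buffer, batch_size)))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # the bootstrap target uses the slowly-changing target network
        target = r + gamma * target_net(s2).max(dim=1).values * (1 - done.float())
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# every C steps: target_net.load_state_dict(q_net.state_dict())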

(17)
(18)
(19)

Nice, but…

Why? Any ideas?

(20)
(21)

Exploration, exploration, exploration

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling

(22)

Exploration, exploration, exploration

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling

(23)

AlphaGo

Uses Monte-Carlo Tree Search (plus optimism)!

Upper confidence bound applied to trees (UCT)
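
For reference, UCT selects at each tree node the action maximizing an optimistic upper confidence bound (a standard formulation; c is the exploration constant, N(s) and N(s, a) are visit counts):

a^{*} = \arg\max_{a} \left( Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right)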

(24)

AlphaGo

• “supervised” policy

• RL policy

• simulation policy (follows UCT)

• value function estimate

(25)

AlphaGo

• policy used to execute full and truncated rollouts

• full rollouts to update estimate of value function

• value function and rollout to propagate information in the tree search, as well as to update the policy
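
A compact way to write how these pieces interact during search, following the 2016 AlphaGo paper (P is the prior from the policy network, v_{\theta} the value-network estimate, z_L the rollout outcome, and λ mixes the two leaf evaluations):

a_t = \arg\max_{a} \left( Q(s_t, a) + u(s_t, a) \right), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}

V(s_L) = (1 - \lambda)\, v_{\theta}(s_L) + \lambda\, z_L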

(26)

Tricks

• Sampled mini-batches for learning

• Use each version of the network against a pool of previous versions

self-improvement

• Update the networks only every once in a while

(27)

You know what?

These principles can also be applied in robotics!

(28)

Reinforcement Learning

These principles can also be applied in robotics!

(29)

Deep Reinforcement Learning

These principles can also be applied in robotics!

(30)

Self-improvement & MCTS

These principles can also be applied in robotics!

Robot Soccer Players at Sapienza:

(31)

Self-improvement & MCTS

These principles can also be applied in robotics!

Humanoid robots at Sapienza:

(32)

Ok, back to exploration…

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling

(33)

Exploiting Uncertainty

IDEA: Measure uncertainty with density models and use it as an exploration bonus

(34)

Exploiting Uncertainty

IDEA: Measure uncertainty with density models and use it as an exploration bonus
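
A concrete instance of this idea is the pseudo-count method (Bellemare et al., 2016), which this slide presumably refers to: a density model ρ over states yields a pseudo-count N̂(s), which is turned into a bonus added to the reward (ρ'(s) is the density after observing s one more time, β a scaling constant):

\hat{N}(s) = \frac{\rho(s)\,\bigl(1 - \rho'(s)\bigr)}{\rho'(s) - \rho(s)}, \qquad r^{+}(s) = \frac{\beta}{\sqrt{\hat{N}(s) + 0.01}}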

(35)

Still exploration (yes, it is a big pain)

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling

(36)

Posterior Sampling

Bootstrapping: train K different heads (networks), each on a different subset of data

(37)

Posterior Sampling

Bootstrapping: train K different heads (networks), each on a different subset of data

(38)

Posterior Sampling

Bootstrapping: train K different heads (networks), each on a different subset of data

(39)

Posterior Sampling

How do we select the final policy? E.g., by majority voting
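
A minimal Python sketch of the bootstrapped-heads idea with majority voting at evaluation time (shared torso, K independent Q-heads, each trained only on its own bootstrap subset of the data; all names and sizes are illustrative):

import torch
import torch.nn as nn

class BootstrappedQNet(nn.Module):
    """Shared feature torso with K independent Q-value heads."""
    def __init__(self, obs_dim: int, n_actions: int, k: int = 10):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(128, n_actions) for _ in range(k))

    def forward(self, obs):
        z = self.torso(obs)
        return torch.stack([head(z) for head in self.heads], dim=1)  # (batch, K, n_actions)

def vote_action(net, obs):
    # evaluation: every head proposes its greedy action, the most voted action wins
    q = net(obs.unsqueeze(0)).squeeze(0)         # (K, n_actions)
    greedy = q.argmax(dim=1)                     # one greedy action per head
    return int(torch.bincount(greedy).argmax())  # majority vote across heads

# During training, each transition carries a bootstrap mask in {0,1}^K and only
# the heads whose mask bit is 1 receive a gradient from that transition.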

(40)
