Reinforcement learning
Success stories, ideas and open challenges
Roberto Capobianco
Reinforcement Learning
All of this is just reinforcement learning!
• Sparse and time delayed “labels”
• State-action sequences: s0, a0, s1, a1, …, sn-1, an-1, sn
• Interaction with the environment
• Rewards
• Q-function and discounted future reward (!)
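To make the discounted future reward concrete, here is a minimal tabular Q-learning sketch (the toy environment with reset()/step() returning (next_state, reward, done) is an assumption, not a fixed interface): the discount shows up in the bootstrapped target r + γ·max_a' Q(s', a').

```python
from collections import defaultdict
import random

# Minimal tabular Q-learning sketch; `env` is a hypothetical toy environment
# whose step(a) returns (next_state, reward, done).
def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:           # explore
                a = random.randrange(n_actions)
            else:                               # exploit the current Q estimates
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # the discounted future reward enters through the bootstrapped target
            best_next = max(Q[(s_next, x)] for x in range(n_actions))
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```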
Do you see any problems?
• State modeling (e.g., an Atari game)
  We can use observations, e.g., image pixels
• Q-function as a table
  More configurations than the number of atoms in the universe!
  We can use function approximators: this introduces errors, but also helps with generalization
• Generalization capabilities
Q-networks
• Neural networks are universal function approximators
• Q-networks: approximate Q(s, a) with a neural network
DQN architecture
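A minimal PyTorch sketch of a DQN-style Q-network; the layer sizes follow the commonly cited Atari setup (a stack of 4 preprocessed 84x84 frames in, one Q-value per action out) and should be read as an assumption, not a prescription.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action,
    so a single forward pass scores every action at once."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):       # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(x)      # (batch, n_actions) Q-values
```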
DQN tricks
Like most TD algorithms, Q-learning does not work well in practice without a few tricks
• Sampled mini-batches for learning the parameters (experience replay)
stabilizes the network and reduces correlation among the data
• Exploration with ε-greedy policy
with probability ε select a random action, otherwise act greedily with respect to Q
• Update the target network only every once in a while
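A sketch of how the three tricks fit together, assuming a QNetwork like the one above and a replay_buffer object with a sample() method (both the buffer interface and the hyperparameters are assumptions):

```python
import random
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, optimizer, replay_buffer, batch_size=32, gamma=0.99):
    # trick 1: learn from a randomly sampled mini-batch of past transitions,
    # which reduces correlation among the data and stabilizes the network
    s, a, r, s_next, done = replay_buffer.sample(batch_size)     # a: int64, done: float
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q(s, a) for the taken actions
    with torch.no_grad():
        # trick 3: bootstrap from a separate, slowly updated target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def select_action(q_net, state, n_actions, eps):
    # trick 2: epsilon-greedy exploration
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1))

def maybe_sync_target(q_net, target_net, step, sync_every=10_000):
    # copy the online weights into the target network only every once in a while
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```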
Nice, but…
Why? Any ideas?
Exploration, exploration, exploration
How do we deal with that?
• Monte-Carlo Tree Search
• Intrinsic Motivation
• Posterior Sampling
AlphaGo
Uses Monte-Carlo Tree Search (plus optimism)!
Upper confidence bound applied to trees (UCT)
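A sketch of the UCT selection rule used inside the tree policy (node/child fields such as visits and total_value are hypothetical names): exploit the current value estimate, but add an optimism bonus that shrinks as an action is tried more often.

```python
import math

def uct_select(node, c=1.41):
    """Pick the child maximizing value estimate + exploration bonus."""
    def score(child):
        if child.visits == 0:
            return float("inf")                       # always try unvisited actions first
        exploit = child.total_value / child.visits    # average return observed so far
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=score)
```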
AlphaGo
• “supervised” policy
• RL policy
• simulation policy (follows UCT)
• value function estimate
AlphaGo
• policy used to execute full and truncated rollouts
• full rollouts to update estimate of value function
• value function and rollout to propagate information in the tree search, as well as to update the policy
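As a rough sketch of the last point, the value-network estimate and the rollout outcome can be mixed at a leaf before being backed up along the visited path (value_net, rollout_policy and play_rollout are hypothetical names; the constant mixing weight is an assumption in the spirit of AlphaGo's leaf evaluation).

```python
def evaluate_leaf(value_net, rollout_policy, leaf_state, lam=0.5):
    v = value_net(leaf_state)                        # truncated rollout: learned value estimate
    z = play_rollout(rollout_policy, leaf_state)     # full rollout to the end of the game
    return (1 - lam) * v + lam * z                   # mixed evaluation, backed up in the tree
```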
Tricks
• Sampled mini-batches for learning
• Use each version of the network against a pool of previous versions
self-improvement
• Update the networks only every once in a while
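A sketch of the self-improvement loop implied by the second trick (play_game and the snapshot logic are assumptions): each new version of the network is evaluated against randomly sampled previous versions, not only the latest one.

```python
import copy
import random

def self_play_iteration(current_net, opponent_pool, n_games=100):
    results = []
    for _ in range(n_games):
        opponent = random.choice(opponent_pool)       # sample a past version of the network
        results.append(play_game(current_net, opponent))
    return results

def snapshot(current_net, opponent_pool):
    opponent_pool.append(copy.deepcopy(current_net))  # freeze the current version into the pool
```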
You know what?
These principles can also be applied in robotics!
• Reinforcement Learning
• Deep Reinforcement Learning
• Self-improvement & MCTS: Robot Soccer Players at Sapienza
• Self-improvement & MCTS: Humanoid robots at Sapienza
Ok, back to exploration…
How do we deal with that?
• Monte-Carlo Tree Search
• Intrinsic Motivation
• Posterior Sampling
Exploiting Uncertainty
IDEA: Measure uncertainty with density models and use that as exploration bonuses
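One way to turn a density model into an exploration bonus is via pseudo-counts (following Bellemare et al., 2016); the density_model interface with prob() and update() is a hypothetical sketch, and the bonus scale beta is an assumption. Rarely seen states get a large bonus, familiar ones almost none.

```python
import math

def exploration_bonus(density_model, state, beta=0.05):
    p = density_model.prob(state)        # model density before observing the state
    density_model.update(state)
    p_next = density_model.prob(state)   # density after observing it once more
    pseudo_count = p * (1 - p_next) / max(p_next - p, 1e-8)
    return beta / math.sqrt(pseudo_count + 1e-8)
```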
Still exploration (yes, it is a big pain)
How do we deal with that?
• Monte-Carlo Tree Search
• Intrinsic Motivation
• Posterior Sampling
Posterior Sampling
Bootstrapping: train K different heads (networks), each on a different subset of data
Posterior Sampling
How do we select the final policy? E.g., by majority voting across the heads
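A minimal PyTorch sketch of the bootstrapped setup (sizes and the number of heads are assumptions): K heads share a torso, each head is trained only on its own bootstrapped subset of the data, one randomly chosen head drives exploration in each episode, and at evaluation time the heads can vote on the greedy action.

```python
import random
import torch.nn as nn

class BootstrappedQNetwork(nn.Module):
    def __init__(self, obs_dim, n_actions, n_heads=10):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(256, n_actions) for _ in range(n_heads)])

    def forward(self, x):
        h = self.torso(x)
        return [head(h) for head in self.heads]      # one (batch, n_actions) tensor per head

def pick_episode_head(n_heads):
    return random.randrange(n_heads)                 # a single head acts for the whole episode

def majority_vote_action(net, state):
    votes = [int(q.argmax(dim=1)) for q in net(state.unsqueeze(0))]
    return max(set(votes), key=votes.count)          # most-voted greedy action across heads
```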