Reinforcement learning


Academic year: 2021

Success stories, ideas and open challenges

Roberto Capobianco


Reinforcement Learning

• Sparse and time delayed “labels”

• State-action sequences: s_0, a_0, s_1, a_1, …, s_{n-1}, a_{n-1}, s_n

• Interaction with the environment

• Rewards

• Q-function and discounted future reward (!)
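To make the last bullet concrete, here is a minimal sketch of the discounted future reward and of a tabular Q-learning update. The function names, the discount gamma = 0.9 and the learning rate alpha = 0.1 are illustrative assumptions, not values from the slides.

```python
from collections import defaultdict


def discounted_return(rewards, gamma=0.9):
    """Return sum_t gamma**t * r_t, computed backwards over the sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.9, done=False):
    """One tabular Q-learning step: move Q(s, a) towards the TD target."""
    if done:
        target = r  # no future reward after a terminal state
    else:
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Q-function stored as a table: unseen (state, action) pairs default to 0.
Q = defaultdict(float)
```

A sparse, delayed reward such as [0, 0, 1] is still credited to the first step, only discounted by gamma squared.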


Do you see any problems?


• State modeling

• Q-function as a table

• Generalization capabilities


• State modeling (e.g., Atari game)

• Q-function as a table

We can use observations, e.g. image pixels


• State modeling

• Q-function as a table

More configurations than the number of atoms in the universe!
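A back-of-the-envelope check of this claim, as a sketch. The input format (4 stacked 84x84 grayscale frames with 256 intensity levels) and the figure of about 10^80 atoms are assumptions brought in for illustration; the slide does not state them.

```python
import math

# Assumed input format (not given on the slide).
WIDTH, HEIGHT, FRAMES, LEVELS = 84, 84, 4, 256

# Commonly quoted order of magnitude for atoms in the observable universe.
ATOMS_LOG10 = 80

pixels = WIDTH * HEIGHT * FRAMES
# Number of distinct inputs is LEVELS ** pixels; work in log10 to keep it readable.
log10_states = pixels * math.log10(LEVELS)

print(f"pixels per state: {pixels}")
print(f"possible states: about 10^{int(log10_states)}")
print(f"larger than the atom count: {log10_states > ATOMS_LOG10}")
```

A table with one row per configuration is therefore out of the question, whatever the exact preprocessing.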


• State modeling

• Q-function as a table

We can use function approximators


• State modeling

• Q-function as a table

• Generalization capabilities

• Neural networks are universal function approximators

• Q-networks
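A Q-network replaces the table with a parametric function that maps a state to one Q-value per action. The sketch below is a tiny fully connected network in plain Python, forward pass only; the class name, layer sizes and initialization are illustrative assumptions, not the architecture used in the lecture.

```python
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward_layer(weights, biases, x, relu):
    """Compute W x + b, optionally followed by a ReLU."""
    out = []
    for w_row, b in zip(weights, biases):
        z = sum(w * xi for w, xi in zip(w_row, x)) + b
        out.append(max(0.0, z) if relu else z)
    return out


class QNetwork:
    """Maps a state vector to a list of Q-values, one per action."""

    def __init__(self, state_dim, n_actions, hidden=16, seed=0):
        rng = random.Random(seed)
        self.l1 = init_layer(state_dim, hidden, rng)
        self.l2 = init_layer(hidden, n_actions, rng)

    def q_values(self, state):
        h = forward_layer(*self.l1, state, relu=True)
        return forward_layer(*self.l2, h, relu=False)

    def greedy_action(self, state):
        q = self.q_values(state)
        return max(range(len(q)), key=q.__getitem__)
```

Because nearby states share parameters, an update for one state also changes the estimates for similar states, which is where the generalization comes from.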


DQN architecture


DQN tricks

In practice, Q-learning does not work without a few tricks (as is the case for most TD algorithms)

• Sampled mini-batches for learning the parameters

stabilizes training and reduces correlation among samples

• Exploration with ε-greedy policy

with probability ε select a random action, otherwise use Q

• Update the network only every once in a while
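The three tricks can be sketched as small, independent helpers. Everything here is a hypothetical illustration: the names ReplayBuffer, epsilon_greedy and maybe_sync_target are made up, and the last bullet is read as the periodic copy of the online parameters into a separate target network, which is an assumption about what the slide means.

```python
import random
from collections import deque


class ReplayBuffer:
    """Stores transitions and returns uniformly sampled mini-batches."""

    def __init__(self, capacity=10000, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out
        self.rng = random.Random(seed)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation of consecutive steps.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the argmax of Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def maybe_sync_target(step, online_params, target_params, every=1000):
    """Copy the online parameters into the target dict every `every` steps."""
    if step % every == 0:
        target_params.clear()
        target_params.update(online_params)
        return True
    return False
```

In a training loop, each step would add one transition to the buffer, sample a mini-batch for the gradient update, and call maybe_sync_target with the current step counter.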


Nice, but…

Why? Any ideas?


Exploration, exploration, exploration

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling


Uses Monte-Carlo Tree Search (plus optimism)!

Upper confidence bound applied to trees (UCT)
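UCT scores each child of a tree node by its average value plus an optimism bonus that shrinks as the child is visited more often. A minimal sketch of the selection rule follows; the exploration constant c = sqrt(2) and the data layout (a dict mapping each child to its value sum and visit count) are assumptions made for illustration.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score of a child: mean value plus exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = value_sum / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus


def select_child(children, parent_visits, c=math.sqrt(2)):
    """Pick the child with the highest UCT score.

    `children` maps a child name to a (value_sum, visits) pair.
    """
    return max(children,
               key=lambda name: uct_score(*children[name], parent_visits, c))


children = {"a": (9.0, 10), "b": (0.5, 1)}
print(select_child(children, parent_visits=11, c=0.0))  # pure exploitation
print(select_child(children, parent_visits=11))         # with optimism
```

With c = 0 the well-explored child "a" wins on its mean value; with the bonus switched on, the rarely visited child "b" is selected instead.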



• “supervised” policy

• RL policy

• simulation policy (follows UCT)

• value function estimate



• policy used to execute full and truncated rollouts

• full rollouts to update estimate of value function

• value function and rollout to propagate information in the tree search, as well as to update the policy



• Sampled mini-batches for learning

• Use each version of the network against a pool of previous versions

self improvement

• Update the networks only every once in a while
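Playing against a pool of previous versions can be sketched as a store of frozen parameter snapshots from which an opponent is drawn at random. The class OpponentPool and the uniform sampling are hypothetical choices for illustration; the slides do not specify how opponents are picked.

```python
import copy
import random


class OpponentPool:
    """Keeps frozen copies of earlier network parameters for self-play."""

    def __init__(self, seed=0):
        self.snapshots = []
        self.rng = random.Random(seed)

    def add_snapshot(self, params):
        # Deep copy so later training does not change the stored version.
        self.snapshots.append(copy.deepcopy(params))

    def sample_opponent(self):
        """Return one stored version, chosen uniformly at random."""
        return self.rng.choice(self.snapshots)
```

Sampling from the whole pool, rather than always facing the latest version, is one way to keep the current network from overfitting to a single opponent.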


You know what?

These principles can also be applied in robotics!


Reinforcement Learning

Deep Reinforcement Learning

Self-improvement & MCTS

Robot Soccer Players at Sapienza:


Self-improvement & MCTS

Humanoid robots at Sapienza:


Ok, back to exploration…

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling


Exploiting Uncertainty

IDEA: Measure uncertainty with density models and use that as exploration bonuses
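The simplest density model is a visit count: states seen rarely have low density and get a large bonus. The sketch below uses exact counts over hashable states and a bonus of beta / sqrt(N(s)); both the class name CountBonus and this particular bonus form are assumptions, and a learned density model would replace the counter for large state spaces.

```python
import math
from collections import Counter


class CountBonus:
    """Exploration bonus that decays with the number of visits to a state."""

    def __init__(self, beta=0.1):
        self.counts = Counter()
        self.beta = beta

    def bonus(self, state):
        # Count this visit, then reward novelty: beta / sqrt(N(s)).
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

    def augmented_reward(self, state, reward):
        """Environment reward plus the intrinsic exploration bonus."""
        return reward + self.bonus(state)
```

The agent is then trained on the augmented reward, so it is drawn towards states it is still uncertain about.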


Still exploration (yes, it is a big pain)

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling


Posterior Sampling

Bootstrapping: train K different heads (networks), each on a different subset of data
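One way to give each head a different subset of the data is to attach a random binary mask to every transition, and to follow a single randomly chosen head for a whole episode. This is a sketch only: the function names, the Bernoulli masks and the inclusion probability p = 0.5 are assumptions, not details from the slides.

```python
import random


def bootstrap_masks(n_transitions, k_heads, p=0.5, seed=0):
    """For each transition, decide independently which heads may train on it."""
    rng = random.Random(seed)
    return [[rng.random() < p for _ in range(k_heads)]
            for _ in range(n_transitions)]


def data_for_head(transitions, masks, head):
    """Subset of the data that a given head is trained on."""
    return [t for t, m in zip(transitions, masks) if m[head]]


def sample_head(k_heads, rng):
    """Pick the head to act with for the next episode (one posterior sample)."""
    return rng.randrange(k_heads)
```

Because the heads see different data, they disagree where data is scarce, and acting with one sampled head per episode turns that disagreement into exploration.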


Posterior Sampling

How to select final policy? E.g., majority voting
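Majority voting over the heads can be sketched as follows: each head proposes its greedy action and the most frequent proposal wins. Breaking ties in favour of the lowest action index is an arbitrary assumption made here to keep the result deterministic.

```python
from collections import Counter


def greedy(q_values):
    """Index of the largest Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)


def majority_vote(per_head_q_values):
    """Action chosen by most heads; ties go to the lowest action index."""
    votes = Counter(greedy(q) for q in per_head_q_values)
    best_count = max(votes.values())
    return min(a for a, c in votes.items() if c == best_count)


# Three heads, two actions: two heads prefer action 1.
print(majority_vote([[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]))
```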



