## Reinforcement learning

Success stories, ideas and open challenges

Roberto Capobianco

### Reinforcement Learning

*All of this is just reinforcement learning!*

• Sparse and time-delayed “labels”

• State-action sequences: $s_0, a_0, s_1, a_1, \ldots, s_{n-1}, a_{n-1}, s_n$

• Interaction with the environment

• Rewards

• Q-function and discounted future reward (!), spelled out below
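
In standard RL notation (added here for reference, not verbatim from the slide), the discounted future reward from time $t$ and the tabular Q-learning update are:

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \qquad 0 \le \gamma < 1$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$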

### Do you see any problems?

• State modeling (e.g., in an Atari game): we can use observations, e.g., image pixels

• Q-function as a table: more configurations than the number of atoms in the universe! We can use function approximators

• Generalization capabilities: function approximation introduces errors, but it also helps with generalization

### Q-networks

• Neural networks are universal function approximators

• Q-networks: approximate the Q-function Q(s, a) with a neural network

### DQN architecture
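
Since the architecture figure does not survive in text form, here is a minimal sketch of the convolutional Q-network described in the Nature DQN paper (Mnih et al., 2015); the layer sizes follow the paper, but the code is an illustrative PyTorch reconstruction, not the original implementation:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one forward pass yields all Q-values
        )
    def forward(self, x):
        return self.head(self.features(x))

q = DQN(n_actions=4)
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```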

### DQN tricks

Q-learning does not work in practice without tricks (as is the case for most TD algorithms); see the combined sketch after the list

• Sampled mini-batches for learning the parameters (experience replay)

*stabilizes the network and reduces correlation among the data*

• Exploration with an ε-greedy policy

with probability ε select a random action, otherwise use Q

• Update the (target) network only every once in a while
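
Putting the three tricks together, a minimal sketch assuming the `DQN` class above (or any Q-network); the hyperparameters and the `act`/`train_step` names are illustrative:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Replay memory holds (state, action, reward, next_state, done) tuples of tensors.
buffer = deque(maxlen=100_000)

q_net = DQN(n_actions=4)          # the Q-network being trained
target_net = DQN(n_actions=4)     # frozen copy used for bootstrap targets
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
gamma, eps = 0.99, 0.1

def act(state):
    # ε-greedy exploration: random action with probability ε, otherwise greedy.
    if random.random() < eps:
        return random.randrange(4)
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax().item()

def train_step(step, batch_size=32, target_update=10_000):
    # Sampled mini-batch from the replay memory (decorrelates the data).
    batch = random.sample(list(buffer), batch_size)
    s, a, r, s2, done = map(torch.stack, zip(*batch))
    with torch.no_grad():  # bootstrap target from the *frozen* network
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if step % target_update == 0:  # refresh the target net every once in a while
        target_net.load_state_dict(q_net.state_dict())
```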

### Nice, but…

Why? Any ideas?

### Exploration, exploration, exploration

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling


### AlphaGo

Uses Monte-Carlo Tree Search (plus optimism)!

*Upper confidence bound applied to trees (UCT)*
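A minimal sketch of the UCT selection rule (UCB1 applied to tree nodes); the `Node` structure and the constant `c` are illustrative assumptions, not AlphaGo's actual data structures:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    visits: int = 0          # N(s): times this node was visited
    value_sum: float = 0.0   # sum of returns backed up through this node
    children: dict = field(default_factory=dict)  # action -> Node

def uct_select(node, c=1.414):
    """Pick the action maximizing UCB1: average value + optimism bonus."""
    parent_visits = max(node.visits, 1)
    best_action, best_score = None, float("-inf")
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # always try unvisited actions first
        exploit = child.value_sum / child.visits                    # Q(s, a)
        explore = c * math.sqrt(math.log(parent_visits) / child.visits)
        if exploit + explore > best_score:
            best_action, best_score = action, exploit + explore
    return best_action
```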

### AlphaGo

• “supervised” policy (trained to imitate human expert moves)

• RL policy (improved via self-play)

• simulation policy (follows UCT)

• value function estimate

### AlphaGo

• policy used to execute full and truncated rollouts

• full rollouts to update estimate of value function

• value function and rollout to propagate information in the tree search (mixed as in the formula below), as well as to update the policy
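
For reference, the AlphaGo paper (Silver et al., 2016) mixes the value network estimate with the rollout outcome when evaluating a leaf $s_L$:

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$

where $v_\theta(s_L)$ is the value network estimate, $z_L$ is the outcome of the fast rollout from $s_L$, and $\lambda$ is a mixing constant.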

### Tricks

• Sampled mini-batches for learning

• Use each version of the network against a pool of previous versions (sketched below)

*self-improvement*

• Update the networks only every once in a while
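
A minimal sketch of the self-improvement loop; `Policy` and `play_game` are hypothetical stand-ins, not AlphaGo's actual code:

```python
import random

class Policy:
    """Hypothetical stand-in for a policy network."""
    def __init__(self, version=0):
        self.version = version
    def clone(self):
        return Policy(self.version)

def play_game(policy, opponent):
    # Hypothetical stand-in for an actual game; returns a random outcome.
    return random.choice([+1, -1])

current, pool = Policy(), []
for iteration in range(10):
    pool.append(current.clone())        # freeze the current version
    opponent = random.choice(pool)      # play against any previous version
    outcomes = [play_game(current, opponent) for _ in range(8)]
    current.version += 1                # stand-in for the actual learning update
```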

### You know what?

These principles (reinforcement learning, deep reinforcement learning, self-improvement & MCTS) can also be applied in robotics!

### Self-improvement & MCTS

• Robot Soccer Players at Sapienza

• Humanoid robots at Sapienza

### Ok, back to exploration…

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling

### Exploiting Uncertainty

IDEA: Measure uncertainty with density models and use that as exploration bonuses

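
As a concrete illustration, one way to turn a density model into an exploration bonus is via pseudo-counts in the style of Bellemare et al. (2016); the toy `DensityModel` below is a stand-in, and the bonus constants are illustrative:

```python
import math

class DensityModel:
    """Toy stand-in: empirical state frequencies as the density estimate."""
    def __init__(self):
        self.counts, self.total = {}, 0
    def prob(self, s):
        return self.counts.get(s, 0) / self.total if self.total else 0.0
    def update(self, s):
        self.counts[s] = self.counts.get(s, 0) + 1
        self.total += 1

def exploration_bonus(model, s, beta=0.05):
    rho = model.prob(s)          # density before observing s
    model.update(s)
    rho_next = model.prob(s)     # density after observing s ("recoding probability")
    # Pseudo-count implied by the density model's change (Bellemare et al., 2016).
    pseudo_count = rho * (1 - rho_next) / max(rho_next - rho, 1e-8)
    return beta / math.sqrt(pseudo_count + 0.01)
```

Rarely seen states barely move the model, so the pseudo-count stays small and the bonus stays large; familiar states get a vanishing bonus.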

### Still exploration (yes, it is a big pain)

How do we deal with that?

• Monte-Carlo Tree Search

• Intrinsic Motivation

• Posterior Sampling

### Posterior Sampling

Bootstrapping: train K different heads (networks), each on a different subset of the data

How do we select the final policy? E.g., by majority voting across the heads (sketched below)
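
A minimal sketch in the spirit of Bootstrapped DQN (Osband et al., 2016): K heads share a torso, one head is sampled per episode during training, and the deployed policy votes across heads; module names and sizes are illustrative:

```python
import torch
import torch.nn as nn

K = 10  # number of bootstrap heads

class BootstrappedQNet(nn.Module):
    """K Q-value heads on a shared torso; sizes here are illustrative."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=64, k=K):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, n_actions) for _ in range(k))
    def forward(self, obs):
        z = self.torso(obs)
        return torch.stack([head(z) for head in self.heads])  # (K, B, A)

net = BootstrappedQNet()
obs = torch.randn(1, 4)

# Training: sample one head per episode and act greedily with it.
# In full Bootstrapped DQN each head also trains on its own bootstrap mask of the data.
head = torch.randint(K, ()).item()
train_action = net(obs)[head].argmax(dim=-1)

# Deployment: majority voting across the K heads' greedy actions.
greedy = net(obs).argmax(dim=-1)          # (K, B): greedy action per head
final_action = greedy.mode(dim=0).values  # most-voted action per state
```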