Reinforcement learning
Success stories, ideas and open challenges
Roberto Capobianco
Reinforcement Learning
All of this is just reinforcement learning!
• Sparse and time delayed “labels”
• State-action sequences: s0, a0, s1, a1, …, sn-1, an-1, sn
• Interaction with the environment
• Rewards
• Q-function and discounted future reward (!)
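To make the discounted future reward concrete, here is a minimal tabular Q-learning sketch (the toy environment with reset()/step() returning (next_state, reward, done) is an assumption, not a fixed interface): the discount shows up in the bootstrapped target r + γ·max_a' Q(s', a').

```python
from collections import defaultdict
import random

# Minimal tabular Q-learning sketch; `env` is a hypothetical toy environment
# whose step(a) returns (next_state, reward, done).
def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:           # explore
                a = random.randrange(n_actions)
            else:                               # exploit the current Q estimates
                a = max(range(n_actions), key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # the discounted future reward enters through the bootstrapped target
            best_next = max(Q[(s_next, x)] for x in range(n_actions))
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```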
Do you see any problems?
• State modeling (e.g., an Atari game)
  We can use observations, e.g., image pixels
• Q-function as a table
  More configurations than the number of atoms in the universe!
  We can use function approximators: this introduces errors, but also helps with generalization
• Generalization capabilities
Q-networks
• Neural networks are universal function approximators
• Q-networks: approximate Q(s, a) with a neural network
DQN architecture
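A minimal PyTorch sketch of a DQN-style Q-network; the layer sizes follow the commonly cited Atari setup (a stack of 4 preprocessed 84x84 frames in, one Q-value per action out) and should be read as an assumption, not a prescription.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action,
    so a single forward pass scores every action at once."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):       # x: (batch, 4, 84, 84), pixel values scaled to [0, 1]
        return self.net(x)      # (batch, n_actions) Q-values
```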
DQN tricks
Like most TD algorithms, Q-learning does not work well in practice without a few tricks
• Sampled mini-batches for learning the parameters (experience replay)
stabilizes the network and reduces correlation among the data
• Exploration with ε-greedy policy
with probability ε select a random action, otherwise act greedily with respect to Q
• Update the target network only every once in a while
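A sketch of how the three tricks fit together, assuming a QNetwork like the one above and a replay_buffer object with a sample() method (both the buffer interface and the hyperparameters are assumptions):

```python
import random
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, optimizer, replay_buffer, batch_size=32, gamma=0.99):
    # trick 1: learn from a randomly sampled mini-batch of past transitions,
    # which reduces correlation among the data and stabilizes the network
    s, a, r, s_next, done = replay_buffer.sample(batch_size)     # a: int64, done: float
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q(s, a) for the taken actions
    with torch.no_grad():
        # trick 3: bootstrap from a separate, slowly updated target network
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = F.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def select_action(q_net, state, n_actions, eps):
    # trick 2: epsilon-greedy exploration
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1))

def maybe_sync_target(q_net, target_net, step, sync_every=10_000):
    # copy the online weights into the target network only every once in a while
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```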
Nice, but…
Why? Any ideas?
Exploration, exploration, exploration
How do we deal with that?
• Monte-Carlo Tree Search
• Intrinsic Motivation
• Posterior Sampling
AlphaGo
Uses Monte-Carlo Tree Search (plus optimism)!
Upper confidence bound applied to trees (UCT)
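A sketch of the UCT selection rule used inside the tree policy (node/child fields such as visits and total_value are hypothetical names): exploit the current value estimate, but add an optimism bonus that shrinks as an action is tried more often.

```python
import math

def uct_select(node, c=1.41):
    """Pick the child maximizing value estimate + exploration bonus."""
    def score(child):
        if child.visits == 0:
            return float("inf")                       # always try unvisited actions first
        exploit = child.total_value / child.visits    # average return observed so far
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=score)
```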
AlphaGo
• “supervised” policy
• RL policy
• simulation policy (follows UCT)
• value function estimate
AlphaGo
• policy used to execute full and truncated rollouts
• full rollouts to update estimate of value function
• value function and rollout to propagate information in the tree search, as well as to update the policy
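As a rough sketch of the last point, the value-network estimate and the rollout outcome can be mixed at a leaf before being backed up along the visited path (value_net, rollout_policy and play_rollout are hypothetical names; the constant mixing weight is an assumption in the spirit of AlphaGo's leaf evaluation).

```python
def evaluate_leaf(value_net, rollout_policy, leaf_state, lam=0.5):
    v = value_net(leaf_state)                        # truncated rollout: learned value estimate
    z = play_rollout(rollout_policy, leaf_state)     # full rollout to the end of the game
    return (1 - lam) * v + lam * z                   # mixed evaluation, backed up in the tree
```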
Tricks
• Sampled mini-batches for learning
• Use each version of the network against a pool of previous versions
self-improvement
• Update the networks only every once in a while
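A sketch of the self-improvement loop implied by the second trick (play_game and the snapshot logic are assumptions): each new version of the network is evaluated against randomly sampled previous versions, not only the latest one.

```python
import copy
import random

def self_play_iteration(current_net, opponent_pool, n_games=100):
    results = []
    for _ in range(n_games):
        opponent = random.choice(opponent_pool)       # sample a past version of the network
        results.append(play_game(current_net, opponent))
    return results

def snapshot(current_net, opponent_pool):
    opponent_pool.append(copy.deepcopy(current_net))  # freeze the current version into the pool
```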
You know what?
These principles can also be applied in robotics!
• Reinforcement Learning
• Deep Reinforcement Learning
• Self-improvement & MCTS: Robot Soccer Players at Sapienza
• Self-improvement & MCTS: Humanoid robots at Sapienza
Ok, back to exploration…
How do we deal with that?
• Monte-Carlo Tree Search
• Intrinsic Motivation
• Posterior Sampling
Exploiting Uncertainty
IDEA: Measure uncertainty with density models and use that as exploration bonuses
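One way to turn a density model into an exploration bonus is via pseudo-counts (following Bellemare et al., 2016); the density_model interface with prob() and update() is a hypothetical sketch, and the bonus scale beta is an assumption. Rarely seen states get a large bonus, familiar ones almost none.

```python
import math

def exploration_bonus(density_model, state, beta=0.05):
    p = density_model.prob(state)        # model density before observing the state
    density_model.update(state)
    p_next = density_model.prob(state)   # density after observing it once more
    pseudo_count = p * (1 - p_next) / max(p_next - p, 1e-8)
    return beta / math.sqrt(pseudo_count + 1e-8)
```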
Still exploration (yes, it is a big pain)
How do we deal with that?
• Monte-Carlo Tree Search
• Intrinsic Motivation
• Posterior Sampling
Posterior Sampling
Bootstrapping: train K different heads (networks), each on a different subset of data
Posterior Sampling
How do we select the final policy? E.g., by majority voting across the heads
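A minimal PyTorch sketch of the bootstrapped setup (sizes and the number of heads are assumptions): K heads share a torso, each head is trained only on its own bootstrapped subset of the data, one randomly chosen head drives exploration in each episode, and at evaluation time the heads can vote on the greedy action.

```python
import random
import torch.nn as nn

class BootstrappedQNetwork(nn.Module):
    def __init__(self, obs_dim, n_actions, n_heads=10):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(256, n_actions) for _ in range(n_heads)])

    def forward(self, x):
        h = self.torso(x)
        return [head(h) for head in self.heads]      # one (batch, n_actions) tensor per head

def pick_episode_head(n_heads):
    return random.randrange(n_heads)                 # a single head acts for the whole episode

def majority_vote_action(net, state):
    votes = [int(q.argmax(dim=1)) for q in net(state.unsqueeze(0))]
    return max(set(votes), key=votes.count)          # most-voted greedy action across heads
```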