
Up to now, we have seen RL methods belonging to the model-free family: these methods use the information gathered from the interaction with the environment only to update their critic, which can be an exact tabular representation of the value function or just some deep approximator. However, if on one hand solutions like Q-Learning were created to overcome the problem of not knowing the real transition probabilities, another idea could be to start with an approximate representation of these distributions and try to refine it online as more information comes from the environment. This is the main feature of model-based RL algorithms.

We can think of methods belonging to these two families as two possible approaches to the problem of visiting a foreign city: since we don't know the city well, one way to get around without getting lost (say, from point A to point B) could be to use some reference points (like a big square with a fountain) and remember what route to take from there. This is equivalent to employing a state (or state-action) value function, assigning a certain value to certain positions.

Another method could be to use a map: we could buy one (a really accurate but expensive solution, maybe unfeasible) or draw one. If we go with the latter choice, at first our map will be very approximate; however, as more information comes in, we can refine it and use it to simulate the real world. That is, if we get stuck at some point, we can trace back our steps or even look ahead to find the right trajectory.

In practice, what we actually want to represent is how the environment changes its state in response to the agent's actions. Given a state and an action, a model produces a prediction of the resulting next state (or a distribution over the possible next states) and possibly the reward: the part dealing with the states is the transition function, while the part dealing with the rewards is the reward function.
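To make this concrete, the following minimal sketch represents a model as exactly these two components, a transition function and a reward function, queried for a prediction given a state-action pair; the class, its tabular arrays and the method name predict are illustrative assumptions, not something prescribed by the text.

```python
import numpy as np

class EnvironmentModel:
    """A minimal sketch of a model: a transition function plus a reward
    function, both tabular and assumed already estimated elsewhere."""

    def __init__(self, n_states, n_actions):
        # transition function: P_hat[s, a, s'] approximates Pr(s' | s, a)
        self.P_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)
        # reward function: R_hat[s, a] approximates E[r | s, a]
        self.R_hat = np.zeros((n_states, n_actions))

    def predict(self, s, a, rng):
        """Sample a next state from the predicted distribution and return it
        together with the predicted expected reward."""
        s_next = rng.choice(self.P_hat.shape[2], p=self.P_hat[s, a])
        return s_next, self.R_hat[s, a]
```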

Why should a model of the environment be useful? To begin with, a model can be used to simulate experience. Given a starting state and an action, a model generates all possible transitions weighted by their probabilities of occurring; given a starting state and a policy, a model can generate all possible episodes and their probabilities. So, in the classic model-based learning cycle the interaction with the real world is used to learn the model; in turn, the model is used to plan, producing a value function/policy which is then used to interact with the real world, gaining new experience, so that we can learn a better model.
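As a rough illustration of simulated experience, the sketch below rolls out an episode entirely inside a learned model under a given deterministic policy, without touching the real environment; the arrays P_hat and R_hat and the function name simulate_episode are assumptions made for the example.

```python
import numpy as np

def simulate_episode(P_hat, R_hat, pi, s0, horizon, rng):
    """Roll out a simulated episode from state s0 using only the model.
    P_hat[s, a, s'] is the learned transition distribution, R_hat[s, a]
    the learned expected reward, and pi[s] the action taken in state s."""
    episode, s = [], s0
    for _ in range(horizon):
        a = pi[s]
        s_next = rng.choice(P_hat.shape[2], p=P_hat[s, a])
        episode.append((s, a, R_hat[s, a], s_next))
        s = s_next
    return episode
```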

An example which conveys the idea of the usefulness of model-based methods is the game of chess: in this case, the state space has a size of approximately $10^{48}$! Moreover, moving one piece from its square to an adjacent one can completely change the position from winning to losing; this means that the optimal value function is really sharp, so learning it (or directly the policy) is really hard. On the contrary, the dynamics is quite easy to represent, since the rules of chess are simple and the transitions are deterministic.

So, if we can use the model to look ahead, we are able to estimate the value function by clever tree-search strategies (planning). One advantage of model-based RL, therefore, is that the model can be a more useful representation of the information about the environment than the value function. Another advantage is that the model can be learned by supervised learning; moreover, the model is also useful if we want to reason about model uncertainty, i.e., what we don't know about the environment and what we don't know we don't know; in other words, we want to understand the world better. With the model, we can choose actions that reach regions of the state space we don't know well.
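The simplest instance of such planning is a one-step lookahead, which combines the learned model with a state-value estimate to obtain action values; deeper tree searches apply the same backup recursively. The sketch below is purely illustrative and assumes a tabular model stored in the arrays P_hat and R_hat.

```python
import numpy as np

def one_step_lookahead(P_hat, R_hat, V, s, gamma=0.99):
    """Q(s, a) = R_hat(s, a) + gamma * sum_{s'} P_hat(s' | s, a) * V(s'),
    computed for every action in state s; greedy planning then takes argmax."""
    n_actions = P_hat.shape[1]
    q = np.zeros(n_actions)
    for a in range(n_actions):
        q[a] = R_hat[s, a] + gamma * (P_hat[s, a] @ V)
    return q
```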

One disadvantage is that we learn an approximate model and then use it to learn an approximate value function, so there are now two sources of error.

Model-free RL:

• No model

• Learn value function (and/or policy) from experience

Model-based RL:

• Learn a model from experience

• Plan a value function (and/or policy) from the model (lookahead using the model)

In the end, what we want to learn is a transition function $\hat{P}_\Psi[\cdot \mid s, a]$ which approximates the one of the real environment, $P[\cdot \mid s, a]$. In this way we have two ways of generating experience (a sketch of such a learned model follows the list below):

• real experience: sampled from the environment (true MDP), $s' \sim P[\cdot \mid s, a]$

• simulated experience: sampled from the model (approximate MDP), $s' \sim \hat{P}_\Psi[\cdot \mid s, a]$
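As a concrete (and purely illustrative) example of learning such a model by supervised learning, the sketch below estimates a tabular $\hat{P}_\Psi$ by maximum likelihood, i.e. by counting observed transitions; the function name and the uniform fallback for unvisited state-action pairs are assumptions of the sketch.

```python
import numpy as np

def estimate_transition_model(transitions, n_states, n_actions):
    """transitions: list of (s, a, s') triples observed in the true MDP."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # maximum-likelihood estimate; unvisited (s, a) pairs fall back to uniform
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    return P_hat

# simulated experience then samples from the model instead of the environment:
#   rng = np.random.default_rng()
#   s_next = rng.choice(n_states, p=P_hat[s, a])
```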

2.7.1 DYNA: model-free + model-based

Dyna is an algorithmic framework developed by Richard Sutton (see [6]) which is structured in this way:

• learn model from real experience

• learn and plan value function (and/or policy) from real and simulated experience

Real experience can be exploited by a planning agent in two different ways: it can be used to update and improve the model, or to directly improve the value function/policy via model-free approaches (direct RL). This is summarized in Fig. 2.7: each arrow shows a relationship of influence and presumed improvement. The algorithm can be summarized as follows (see Fig. 2.8): once the q-function and the transition model are initialized, we enter the main loop, in which, given a state s, an action is selected by an ϵ-greedy policy. Subsequently, the action is executed and the environment transitions to the next state s' with a reward r; we end this step by updating our q-function. So far, this looks like plain Q-Learning. However, at this point we enter an inner loop which exploits the learned model: at every inner step, we pick a previously observed state s, choose a random action among those that were taken when s was encountered and, after that, we obtain a next state s' and a reward r from the learned model. Then, we update the q-function again and the inner step ends.
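A minimal tabular sketch of this Dyna-Q loop is given below; the environment interface (reset()/step(a) returning the next state, the reward and a termination flag), the deterministic stored model and all hyperparameter values are assumptions of the sketch rather than prescriptions from the text.

```python
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=100, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    model = {}  # learned model: (s, a) -> (r, s'), deterministic dynamics assumed

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # direct RL: plain Q-Learning update from real experience
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            # model learning from the same real transition
            model[(s, a)] = (r, s_next)
            # planning: N inner steps of simulated experience from the model
            for _ in range(planning_steps):
                s_p, a_p = list(model.keys())[rng.integers(len(model))]
                r_p, s_next_p = model[(s_p, a_p)]
                Q[s_p, a_p] += alpha * (r_p + gamma * np.max(Q[s_next_p]) - Q[s_p, a_p])
            s = s_next
    return Q
```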

Model-free and model-based approaches both have advantages and disadvantages.

While the former are much simpler in their design and are not affected by biases due to model learning, model-based methods usually make fuller use of a limited amount of experience and thus achieve a better policy with fewer interactions with the environment, as we can see in Fig. 2.9: as the agent is allowed more "thinking time" (the number N of planning steps), the number of steps required to reach the end of an episode decreases sharply and reaches its minimum after a few training steps, in contrast to the case where the agent is not allowed to "think" at all (a pure model-free method).

Figure 2.7: Algorithm flow of Dyna

Figure 2.8: Dyna algorithm
