
caused by wrong choices or unexpected events, in order to avoid the negative reward that would be received in case of an unhappy ending. Policies are no longer a direct mapping between single states and actions, but take into account the long-term effects of the decisions made, so as to also maximize rewards that lie far in the future.
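
In the standard reinforcement-learning notation (not specific to the works cited here), this long-term view is captured by the discounted return: given rewards $r_t$ and a discount factor $\gamma \in [0, 1)$, the policy $\pi$ is chosen to maximize the expected return

\[ G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad \pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ G_t \right], \]

so that a maneuver is evaluated not only by its immediate reward, but also by the discounted rewards of everything that follows it.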

However, teaching a self-driving car to drive in a reinforcement learning fashion directly on public roads is not a good idea. Indeed, accidents and dangerous maneuvers would have to happen in order to learn the negative effects they produce. This is clearly not possible, both for safety reasons and because a car cannot be wrecked at every mistake. There are some works, such as [76] and [77], in which agents learned simple tasks directly in the real world with few exploration trajectories; however, the environments were limited to simple cases that did not consider the presence of other agents. For these reasons, in order to successfully learn complex behaviors in multi-agent scenarios, a simulator is often essential.

Some works, such as [78], use parametric simulators in which agents learn to perform maneuvers by observing coordinates, velocities and other numerical variables, such as the lane occupied by the other agents, and output high-level commands that have to be translated into driving actions. These graphics-less simulators have the advantage of providing a very low-dimensional state space, which greatly simplifies the learning process. However, a different description of the scene has to be adopted for every specific case, making them inadequate for generalizing to unseen scenarios.
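
As a rough illustration (the field names below are hypothetical, not taken from [78]), the observation fed to an agent in such a parametric simulator might be a flat vector of hand-picked numerical features:

from dataclasses import dataclass

import numpy as np


@dataclass
class VehicleState:
    """Hand-crafted, appearance-free description of one vehicle."""
    x: float        # longitudinal position along the road [m]
    y: float        # lateral offset from the lane center [m]
    speed: float    # scalar velocity [m/s]
    lane: int       # index of the occupied lane


def build_observation(ego: VehicleState, others: list) -> np.ndarray:
    """Concatenate ego and surrounding-vehicle features into one
    low-dimensional vector, with positions relative to the ego car."""
    features = [ego.x, ego.y, ego.speed, float(ego.lane)]
    for v in others:
        features += [v.x - ego.x, v.y - ego.y, v.speed - ego.speed, float(v.lane)]
    return np.asarray(features, dtype=np.float32)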

Instead, more flexibility can be obtained by exploiting recent developments in Convolutional Neural Network architectures (described in Section 2.3.2), which make it possible to train agents directly from raw images and thus to build agents better suited to generalization; depending on the desired approach, a different type of visual simulator has to be adopted.
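
For concreteness, a minimal PyTorch sketch of such a visual policy is shown below; the architecture (the classic three-convolution encoder popularized by Atari agents) and the action count are illustrative choices, not taken from the cited works:

import torch
import torch.nn as nn


class VisualPolicy(nn.Module):
    """Toy convolutional policy: raw RGB frame in, action logits out."""

    def __init__(self, num_actions: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(           # expects 3x84x84 input
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(frame))


# Example: a batch of one 84x84 RGB frame -> logits over 5 discrete actions.
logits = VisualPolicy()(torch.rand(1, 3, 84, 84))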

3.2.1 Simulators with realistic graphics

If the goal is to build an agent able to navigate using raw camera images, the graphics of the simulator have to get as close as possible to real camera footage. Several attempts have been made using the Grand Theft Auto V (GTA) video game, thanks to its very high-definition graphics and realistic physics engine ([79]). For example, in [80] it has been demonstrated that a large number of images captured from GTA is more effective for object detection purposes than a smaller set of real images.

However, entertainment programs are not made for algorithm development purposes, and thus do not facilitate benchmarking and environment customization: they do not permit extensive sensor simulation (e.g. several cameras mounted on the same vehicle), and they do not provide detailed feedback on dangerous driving behaviors and violations of traffic rules.

In order to fulfill those requirements, the authors of [81] developed a simulator called CARLA (Car Learning to Act), built upon the Unreal game engine ([82]). This simulator is built from the ground up to support autonomous driving models, and for this reason it is endowed with a flexible setup of simulated sensor suites and with the possibility of providing feedback signals useful for training autonomous agents, such as GPS coordinates, motion details such as speed and acceleration, and information about collisions and infractions committed. Moreover, it makes it possible to design urban scenarios at will and to specify the weather conditions and the time of day.
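
As a rough sketch of this flexibility (API names follow the CARLA 0.9.x Python client and may differ between versions), one could spawn a vehicle, attach part of a sensor suite and read back feedback signals as follows:

import carla

# Connect to a CARLA server assumed to be running on localhost.
client = carla.Client('localhost', 2000)
client.set_timeout(10.0)
world = client.get_world()

# Customize the scenario: weather and time of day in one call.
world.set_weather(carla.WeatherParameters.WetCloudySunset)

# Spawn an ego vehicle at a predefined spawn point.
blueprints = world.get_blueprint_library()
vehicle = world.spawn_actor(blueprints.filter('vehicle.*')[0],
                            world.get_map().get_spawn_points()[0])

# Attach part of a sensor suite: an RGB camera and a collision sensor.
camera = world.spawn_actor(
    blueprints.find('sensor.camera.rgb'),
    carla.Transform(carla.Location(x=1.5, z=2.4)), attach_to=vehicle)
collisions = world.spawn_actor(
    blueprints.find('sensor.other.collision'),
    carla.Transform(), attach_to=vehicle)

# Feedback signals usable as training inputs or reward terms.
camera.listen(lambda image: image.save_to_disk('frames/%06d.png' % image.frame))
collisions.listen(lambda event: print('collision with', event.other_actor.type_id))
speed = vehicle.get_velocity()  # carla.Vector3D, m/s per axis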

Some works, such as [83] and [84], evaluated the possibility of transferring policies learned in the CARLA and TORCS ([85]) simulators, respectively, to the real world, using the segmentation of the scene obtained from the camera as an intermediate representation in order to reduce the domain gap.

However, these experiments involve simple tasks that do not consider the participation of other agents in the process. It remains an open question whether such solutions could be extended to more complex multi-agent tasks.

3.2.2 Simulators with simplified graphics

A different approach is taken by simulators that do not try to mimic what we see with our eyes, but instead build a semantic representation of the scene, thus reducing its sample complexity. Indeed, the visual representation of a car no longer depends on its brand or color, but is fixed a priori: only the crucial appearance-independent information is maintained, such as its position on the road, the space it occupies and its motion dynamics. At the same time, an agent learning in such an environment does not have to learn that a road may be paved or cobblestoned, or even covered with snow, since its representation in all those cases would be the same.

Often, a top-view representation of the scene perceivable by the agent is used instead of first-person observations, as in the case of the SUMO ([86], Figure 3.1a) and Waymo's ChauffeurNet ([75], Figure 3.1b) simulators: this way, a car is drawn as an oriented two-dimensional box, meaning that three-dimensional perspective is also removed from the interpretative burden placed on the agent. The idea, therefore, is to simplify the representation of the environment as much as possible, so that the agent can focus mainly on learning a good-enough policy for its task, instead of also having to understand the semantics of what it sees around it.
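
A minimal sketch of how such a top-view observation could be rasterized (the image size, scale and the use of OpenCV are arbitrary choices, not those of SUMO or ChauffeurNet):

import cv2
import numpy as np


def draw_top_view(vehicles, size: int = 200, scale: float = 2.0) -> np.ndarray:
    """Rasterize vehicles as oriented 2D boxes on a grayscale top view.

    Each vehicle is (x, y, heading, length, width) in meters/radians,
    expressed in the ego-centered frame; `scale` is pixels per meter.
    """
    view = np.zeros((size, size), dtype=np.uint8)
    for x, y, heading, length, width in vehicles:
        # Corners of an axis-aligned box, then rotated by the heading.
        corners = np.array([[-length / 2, -width / 2], [length / 2, -width / 2],
                            [length / 2, width / 2], [-length / 2, width / 2]])
        c, s = np.cos(heading), np.sin(heading)
        rotated = corners @ np.array([[c, -s], [s, c]]).T
        pixels = ((rotated + [x, y]) * scale + size / 2).astype(np.int32)
        cv2.fillConvexPoly(view, pixels, 255)
    return view


# Ego vehicle at the center plus one car ahead, slightly rotated.
frame = draw_top_view([(0, 0, 0.0, 4.5, 1.8), (12, 1.5, 0.1, 4.5, 1.8)])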

The drawback of this mid-level representation comes from the fact that the real world is complex and incredibly varied, and thus very far from the representation used. For this reason, in order to transfer a policy learned in simulation to real driving, it is necessary to use detection systems able to reconstruct a representation similar to the one seen in simulation. This means that the self-driving car has to be equipped with object detectors able to recognize cars, pedestrians and other relevant information from the input coming from its sensors; at the same time, it would need high-level maps and localization systems in order to correctly reconstruct the scenario.
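
Conceptually, the on-board reconstruction mirrors the simulator's renderer: in the sketch below, hypothetical detector and localization outputs (the variable names and values are illustrative) are mapped into the ego-centered frame and then rasterized with the same routine used in simulation (draw_top_view from the previous sketch):

import numpy as np

# Hypothetical outputs of the perception and localization stacks:
# detections in world coordinates and the ego pose from localization.
detections = [(105.2, 48.7, 0.15, 4.4, 1.8)]   # (x, y, heading, length, width)
ego_pose = (100.0, 47.0, 0.10)                 # (x, y, heading)


def to_ego_frame(det, ego):
    """Express one world-frame detection in the ego-centered frame."""
    ex, ey, eh = ego
    x, y, heading, length, width = det
    dx, dy = x - ex, y - ey
    c, s = np.cos(-eh), np.sin(-eh)
    return (c * dx - s * dy, s * dx + c * dy, heading - eh, length, width)


ego_view = [to_ego_frame(d, ego_pose) for d in detections]
# Rasterizing with the same function used in simulation keeps the two
# representations aligned, e.g.: frame = draw_top_view(ego_view)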

An agent trained in a simulator of this kind no longer features the full end-to-end characteristic in which everything is handled by a single large neural network that solves all the fundamental tasks using raw sensor data, as explained in Section 1.4.2; instead, it requires a suite of systems, each in charge of a limited task.

On the other hand, the domain gap between the simulation and the reconstructed scenes can be much smaller than the gap obtained with realistic-graphics simulators, since there is no longer any need to match the visual appearance of the real scene.

3.3 Why simulating with multi-agent deep reinforcement learning