
process, which is sequential by nature.

Road traffic vehicles can generally be seen as homogeneous agents: they all obey the road laws (more or less), have similar goals (connecting point A with point B) and similar means to achieve them (they do not fly or have exotic motion dynamics). For this reason the parameter sharing technique has been adopted, that is, a single central policy drives every agent and, as a consequence, can be updated from all the agents at the same time (see Section 2.6).
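As a purely illustrative sketch of parameter sharing (class and function names are hypothetical and not taken from the actual implementation), a single network instance can serve every agent:

import torch
import torch.nn as nn

# Minimal sketch: one actor-critic network shared by all agents.
class SharedPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.pi = nn.Linear(64, n_actions)   # actor head
        self.v = nn.Linear(64, 1)            # critic head

    def forward(self, obs):
        h = self.body(obs)
        return torch.softmax(self.pi(h), dim=-1), self.v(h)

# A single instance is shared by all agents of all environments:
policy = SharedPolicy(obs_dim=80, n_actions=5)

def act(obs):
    # every agent queries the same network, so experience gathered by any
    # agent contributes to the same set of parameters when updating
    probs, _ = policy(torch.as_tensor(obs, dtype=torch.float32))
    return torch.multinomial(probs, 1).item()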

However, looking more closely, in real-world driving each driver has their own peculiar driving style, which changes over their lifetime and depends on external factors such as being in a hurry. For this reason, even though all agents share a single policy, the system has been designed so that the personal behavior of each agent can be shaped by modifying some of its dedicated input, as explained in the next section.
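One possible way to obtain this, shown here only as an assumed sketch (the actual input encoding is described in the next section), is to append the behavioral parameters of each agent to the observation fed to the shared policy:

import numpy as np

# Hypothetical per-agent behavioral parameters (e.g. aggressiveness, target speed),
# drawn once at agent creation and kept fixed for the episode.
def sample_behavior(rng):
    return {"aggressiveness": rng.uniform(0.0, 1.0),
            "target_speed":   rng.uniform(8.0, 15.0)}   # m/s, illustrative range

def build_input(observation, behavior):
    # the shared policy receives observation + behavioral parameters, so a
    # single set of weights can express many different driving styles
    extra = np.array([behavior["aggressiveness"], behavior["target_speed"]],
                     dtype=np.float32)
    return np.concatenate([observation, extra])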

In the proposed setting, an agent, initialized with its behavioral parameters, explores and learns in an environment which is no longer aseptic as in the classical single-agent version, but is populated by several other similar agents, each one with its unique driving style. Therefore, agents have to learn how to reach their goal in a situation in which cooperation is needed; since they are not endowed with direct communication capabilities, the only way to reach the goal is to negotiate implicitly through driving actions, as happens in real-world driving. Our belief is that agents, if given proper reaction times and motion dynamics, will discover human-like techniques to work out the best way to proceed: for example, a timid driver will likely give way to a more aggressive one at an intersection, after having tried to merge with scarce results.

In order to achieve these behaviors, proper rewards have to be shaped, both to make agents respect traffic laws, such as the right of way, and to penalize agents that do not follow their assigned driving style. Every agent is given a path to follow leading to its assigned goal position, and it is rewarded positively if the goal is reached without incurring in accidents; the explicit rewards used for the specific case of the roundabout insertion are explained in Section 3.7.3.
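The following is only a schematic example of such reward shaping, with made-up terms and coefficients; the actual rewards are those defined in Section 3.7.3:

# Illustrative reward structure only; terms and coefficients are placeholders.
def reward(reached_goal, crashed, violated_right_of_way, speed, target_speed):
    if crashed:
        return -1.0                      # accidents are always penalized
    if violated_right_of_way:
        return -0.5                      # traffic-rule violations are penalized
    if reached_goal:
        return +1.0                      # positive terminal reward at the goal
    # dense term discouraging deviations from the assigned driving style
    return -0.01 * abs(speed - target_speed)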

Agents take their actions at common time instants, so that race conditions are avoided and their neural network inference steps are executed together.
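A minimal sketch of this lock-step scheme, assuming a hypothetical environment API with observe_all and apply methods:

# Lock-step execution: all agents choose an action on the same state snapshot,
# then the environment advances once, so no agent observes a half-updated scene.
def simulation_step(env, agents):
    snapshot = env.observe_all()                     # hypothetical API
    actions = {a.id: a.act(snapshot[a.id]) for a in agents}
    env.apply(actions)                               # single synchronized transition
    return env.observe_all()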

Depending on the scenario, it might happen that the number of agents learning simultaneously in it is not sufficient to guarantee enough differentiation in the updates of the policy. For example, in the case of intersections or roundabouts, the number of learning agents is limited by the number of incoming branches; since the updates coming from different but interacting agents are no longer independent, the overall correlation between updates may be excessive.

In order to cope with this problem, several independent environments have been simulated simultaneously as in A3C, multiplying the number of parallel learners while each environment remains shared among its agents, so that inter-agent interaction can still be learned; the effectiveness of this multi-environment configuration is analyzed in the ablation studies of Section 3.7.7.
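A sketch of how such a multi-environment configuration could be assembled (make_env and make_agent are hypothetical factory functions, not part of the described simulator):

# n_env independent simulations, each populated by n_ag agents, all referring
# to the same global networks.
def build_learners(n_env, n_ag, global_policy, make_env, make_agent):
    environments = []
    for e in range(n_env):
        env = make_env(seed=e)                 # independent instance
        agents = [make_agent(env, global_policy) for _ in range(n_ag)]
        environments.append((env, agents))
    # n_env * n_ag parallel learners, but interaction is still learned because
    # the n_ag agents of each environment share the same scene
    return environments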

Another novelty of the proposed multi-agent algorithm with respect to single-agent A3C is the addition of a delay in the exchange of updates between agents and the global collector. Indeed, even if agents compute n-step updates as before, they do not send them immediately but wait for a longer time during which they accumulate them locally, after which they send a single, bigger update; in particular, in the proposed approach, updates are sent at the end of each agent's episode, as sketched below.
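A possible way to implement this delayed exchange, sketched here with a hypothetical apply_gradients method on the global networks:

# Delayed update: n-step gradients are accumulated locally and pushed to the
# global collector only once, at the end of the episode.
class DelayedUpdater:
    def __init__(self, global_net):
        self.global_net = global_net
        self.pending = None

    def accumulate(self, grads):
        # called after every n-step segment instead of sending immediately
        if self.pending is None:
            self.pending = [g.clone() for g in grads]
        else:
            self.pending = [p + g for p, g in zip(self.pending, grads)]

    def flush(self):
        # single, larger exchange at episode end
        self.global_net.apply_gradients(self.pending)   # hypothetical method
        self.pending = None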

The reason for this choice is twofold. On the one hand, it reduces the synchronization burden, because every exchange of parameters requires time for the transfer; since agents compute updates asynchronously but share a common time scale, it is sufficient that only one of them is computing its backpropagation to get all the others stuck, slowing down the whole process; reducing the number of exchanges by bundling several updates together mitigates this overhead. On the other hand, the presence of a delay in the local policy updates increases the diversity of agent behaviors: each agent owns a copy of the global networks taken at a more distant point in time, which also evolves independently until the next exchange of updates; this added noise is potentially beneficial for exploration, letting agents investigate slightly different policies. The pseudo-code for multi-agent delayed A3C in the case of a discrete action space is given in Algorithm 6.



Algorithm 6 Multi-agent delayed A3C for discrete action spaces

1:  get n_env as the number of environment instances
2:  get n_ag as the number of agents for each environment
3:  initialize global parameters θ_π^g of π(a|s, θ_π)
4:  initialize global parameters θ_v^g of V(s, θ_v)
5:  run n_env environment instances
6:  for e = 1, .., n_env do
7:      for i = 1, .., n_ag do
8:          create agent A_ei
9:      end for
10: end for
11: for every agent A_ei, e = 1, .., n_env, i = 1, .., n_ag do   {executed concurrently until termination}
12:     t ← 1
13:     copy parameters θ_π^A = θ_π^g
14:     copy parameters θ_v^A = θ_v^g
15:     ∆A_π = 0, ∆A_v = 0
16:     while s_t is not terminal do
17:         get s_t   {dependent on the (n_ag − 1) other agents of the same environment}
18:         t_last_up = t
19:         while t − t_last_up < n and s_t is not terminal do
20:             execute action a_t drawn from π(a_t|s_t, θ_π^A)
21:             wait for the other agents to finish their actions
22:             get reward r_t and state s_{t+1}
23:             t ← t + 1
24:         end while
25:         R = 0 if s_t is terminal, V(s_t, θ_v^A) otherwise
26:         for all i ∈ {t − 1, .., t_last_up} do
27:             R = r_i + γR
28:             ∆A_π = ∆A_π + ∇_{θ_π^A} [log π(a_i|s_i, θ_π^A)] · [R − V(s_i, θ_v^A)]
29:             ∆A_v = ∆A_v + ∇_{θ_v^A} [R − V(s_i, θ_v^A)]²
30:         end for
31:     end while   {episode finished}
32:     update θ_π^g using ∆A_π
33:     update θ_v^g using ∆A_v
34: end for
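As a clarifying sketch, the bootstrapped return and the gradient accumulation of lines 25-29 can be written as follows, assuming PyTorch tensors for the values and log-probabilities produced by the local copies θ_π^A and θ_v^A:

import torch

# Sketch of lines 25-29 of Algorithm 6: n-step bootstrapped return and the
# corresponding actor/critic loss terms accumulated over the segment.
def segment_losses(rewards, log_probs, values, bootstrap_value, gamma=0.99):
    # rewards, log_probs, values: lists collected during the n-step segment
    # bootstrap_value: 0 for a terminal state, otherwise V(s_t, θ_v^A) detached
    R = bootstrap_value
    policy_loss, value_loss = 0.0, 0.0
    for r, logp, v in zip(reversed(rewards), reversed(log_probs), reversed(values)):
        R = r + gamma * R
        advantage = R - v
        policy_loss = policy_loss - logp * advantage.detach()   # maximize logπ · A
        value_loss = value_loss + advantage.pow(2)               # (R − V)^2
    return policy_loss, value_loss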


In some cases it could be advantageous to use a technique called action repeat, which means repeating the same action for several subsequent frames without computing the updates on them, as done in [56]: besides increasing the simulation speed (backpropagation is computed only on a fraction of the frames), it has been demonstrated that in some cases the use of action repeat improves learning ([96]). This is due to an increased capability of learning associations between temporally distant state-action pairs, resulting from the increased power given to each action to affect the state.

The effects of this technique on the proposed simulator are analyzed in the ablation studies of Section 3.7.7.
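A sketch of how action repeat can be inserted in the agent stepping loop (the env.step signature is hypothetical; the repeat factor is a tunable hyper-parameter):

# Action repeat: the chosen action is applied for `repeat` consecutive frames,
# and only the aggregated transition is used for learning.
def step_with_action_repeat(env, agent, state, repeat=4):
    action = agent.act(state)            # single inference for `repeat` frames
    total_reward, done = 0.0, False
    for _ in range(repeat):
        state, r, done = env.step(agent.id, action)   # hypothetical env API
        total_reward += r
        if done:
            break
    return state, total_reward, done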