
in the experiments.

Thus, from this description, the following set of experiment configurations is defined:

• i ∈ C = {C2, C3, C4, C5}, with 2^|C| − 1 = 15 configurations (to exclude the baseline with all detached domain classifiers), λi ∈ {0, 0.5}

• i ∈ P = {P2, P3, P4, P5, P6}, with 2^|P| − 1 = 31 configurations (to exclude the baseline with all detached domain classifiers), λi ∈ {0, 0.5/|CFG|}, in which |CFG| is the cardinality of the subset of domain classifiers attached to the DA-PanopticFPN

• i ∈ C = {C2, C3, C4, C5}, with 2^|C| − 1 = 15 configurations (to exclude the baseline with all detached domain classifiers), λi ∈ {0, 0.5/(|CFG| + 1)}, in which |CFG| is the cardinality of the subset of domain classifiers attached to the DA-PanopticFPN

Thus, a total of 61 experiments on the possible domain adaptive model configurations have been run, with 3 additional separate configurations which represent the aforementioned baselines.
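As a concrete illustration, the 61 configurations and their λ values can be enumerated as in the following minimal sketch (the helper names are illustrative; only the subsets and the λ formulas come from the description above):

```python
from itertools import combinations

C = ["C2", "C3", "C4", "C5"]
P = ["P2", "P3", "P4", "P5", "P6"]

def configurations(embeddings, lam_rule):
    """Yield every non-empty subset of attached domain classifiers
    together with the maximum lambda assigned to each classifier."""
    for k in range(1, len(embeddings) + 1):
        for cfg in combinations(embeddings, k):
            yield cfg, lam_rule(len(cfg))

# The three experiment families described above.
fixed_half = list(configurations(C, lambda n: 0.5))            # 15 configurations
scaled_p   = list(configurations(P, lambda n: 0.5 / n))        # 31 configurations
scaled_c   = list(configurations(C, lambda n: 0.5 / (n + 1)))  # 15 configurations

print(len(fixed_half) + len(scaled_p) + len(scaled_c))  # 61
```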

All configurations are validated on a subset of Synthetic-CARLA, Cityscapes and BDD100k.

Since each batch of batch_size RGB images can contain images of different heights H, widths W and aspect ratios, each image is resized so that its shortest side has the common length min(H, W) = 800; if the resized longest side would exceed 1333 pixels, the image is instead downscaled so that max(H, W) = 1333. The images are then batched together based on their aspect ratio.
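A minimal sketch of this shortest-edge resizing rule (assuming the common Detectron2-style policy; the helper name and example input are illustrative):

```python
def resize_shortest_edge(h, w, short=800, max_size=1333):
    """Return the target (height, width): scale so the shortest side
    becomes `short`, then cap the longest side at `max_size`."""
    scale = short / min(h, w)
    if max(h, w) * scale > max_size:
        scale = max_size / max(h, w)
    return round(h * scale), round(w * scale)

print(resize_shortest_edge(1080, 1920))  # (750, 1333) for a Full HD frame
```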

The images are preprocessed by mean subtraction and standardization of the RGB channels, using the ImageNet[52] mean channel values [103.530, 116.280, 123.675] and standard deviations [57.375, 57.120, 58.395].

As data augmentation, input images are randomly flipped horizontally with probability p = 0.5.
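The normalization and flip augmentation can be sketched as follows (a NumPy-based, illustrative version; the actual pipeline may operate directly on tensors):

```python
import numpy as np

PIXEL_MEAN = np.array([103.530, 116.280, 123.675])
PIXEL_STD = np.array([57.375, 57.120, 58.395])

def preprocess(image, rng=np.random.default_rng()):
    """Mean-subtract and standardize the channels of an HxWx3 image,
    then flip it horizontally with probability 0.5."""
    image = (image.astype(np.float32) - PIXEL_MEAN) / PIXEL_STD
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # flip along the width axis
    return image
```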

Regarding the model parameter optimization, the ADAM[158] algorithm for stochastic optimization has been utilized.

The optimization procedure is run for N_iter = 10000 iterations.

ADAM is configured as follows:

• momentum = 0.9

• γ = 0.1

• weight decay = 1e − 4

A polynomial decay schedule, whose behavior is shown in figure 5.1, is used to modify the learning rate during the N_iter iterations of optimization; the schedule is configured as follows (a sketch of the resulting schedule is given after the list):

• initial learning rate lr0 = 0.001

• final learning rate lr10000 = 0.0

• linear warmup period of 1000 iterations

• warmup factor set to 0.001

• power = 0.9
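A minimal sketch of this polynomial ("poly") decay with linear warmup, assuming the common Detectron2-style formulation (the exact implementation used is not reproduced here and this helper is illustrative):

```python
def lr_at(step, lr0=0.001, total=10000, power=0.9,
          warmup_iters=1000, warmup_factor=0.001):
    """Learning rate at a given iteration: linear warmup followed by
    polynomial decay towards a final learning rate of 0."""
    warmup = 1.0
    if step < warmup_iters:
        alpha = step / warmup_iters
        warmup = warmup_factor * (1 - alpha) + alpha
    decay = (1 - step / total) ** power
    return lr0 * warmup * decay

print(lr_at(0), lr_at(1000), lr_at(5000), lr_at(10000))
```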

Figure 5.1: Polynomial decay schedule, obtained with the provided configuration

The model is trained in a distributed data parallel setup on two NVIDIA RTX 3090 GPUs, filling approximately 80–95% of the memory of each.

Each configuration requires ≈ 2 hours to complete training and inference steps.

Inference and sampling of the validation loss is carried out every 1000 training iterations.

5.5 Experiment results

Based on the described experiment setup, the training results have been gathered.

While the total training and validation losses have been considered, their interpretation is not trivial due to the effect of the domain classifiers on the backpropagated reversed gradients[140], which, although they allow the model to generalize to the target domain, also affect the neural network parameter optimization process.
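For reference, a minimal sketch of the gradient reversal layer of Ganin et al.[140] that produces such reversed gradients (PyTorch, illustrative; the layer used in the DA-PanopticFPN may differ in details such as how λ is scheduled):

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient
    by -lambda in the backward pass (Ganin et al. [140])."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed, scaled gradient flows back into the feature extractor.
        return -ctx.lam * grad_output, None

def grad_reverse(features, lam=0.17):
    """Insert between a backbone embedding and its domain classifier."""
    return GradientReversal.apply(features, lam)
```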

Aside from this preliminary note, for each model configuration and baseline the following metrics have been collected at the best validation iterate of the 10000 training iterations (the first 1000 iterates are not considered, as they correspond to the warmup period; the relation between the three quality metrics is recalled after the list):

• panoptic quality on Cityscapes validation and BDD100k out of sample sets

• recognition quality on Cityscapes validation and BDD100k out of sample sets

• segmentation quality on Cityscapes validation and BDD100k out of sample sets

• instance segmentation loss on Cityscapes validation

• semantic segmentation loss on Cityscapes validation

• training iterate
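For reference, these three quality metrics are related by the definition of Kirillov et al.[20]:

PQ = ( Σ_(p,g)∈TP IoU(p, g) ) / ( |TP| + ½|FP| + ½|FN| ) = SQ × RQ,

where SQ = ( Σ_(p,g)∈TP IoU(p, g) ) / |TP| is the segmentation quality, RQ = |TP| / ( |TP| + ½|FP| + ½|FN| ) is the recognition quality, and TP, FP and FN denote the matched segment pairs, the unmatched predicted segments and the unmatched ground truth segments respectively.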

The baseline metrics are reported below; for brevity, the prefixes c and b signify Cityscapes validation and BDD100k validation respectively (L stands for loss):

baseline name     cPQ   cRQ   cSQ   bPQ   bRQ   bSQ   cL seg  cL mask  best iter
DA baseline 1)    26.6  19.6  33.9  26.0  52.6  68.0  0.61    0.48     6000
cityscapes 2)     50.8  62.7  76.6  32.7  40.4  78.3  0.12    0.34     10000
bdd100k 3)        39.9  49.7  73.0  36.3  44.3  81.2  0.15    0.28     10000

The best 16 domain adaptation experiment results are reported below, sorted by decreasing Cityscapes validation panoptic quality.

For each configuration, the experiment name follows the convention EmbeddingName_λembedding, indicating the FPN backbone embedding name {C2, C3, C4, C5, P2, P3, P4, P5, P6} and the maximum λ coefficient assigned to the respective attached domain classifier; multiple attached classifiers are concatenated with a double underscore.

The λ of all domain classifiers associated with an embedding that does not appear in the experiment name is set to λ = 0; such classifiers are thus detached from the network and do not provide any contribution to the loss.
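The naming convention can be parsed back into (embedding, λ) pairs as in the following sketch (an illustrative helper, not part of the experiment code):

```python
def parse_experiment_name(name):
    """Split e.g. 'C2_0.17__C5_0.17' into [('C2', 0.17), ('C5', 0.17)];
    embeddings not listed keep lambda = 0, i.e. a detached classifier."""
    pairs = []
    for part in name.split("__"):
        embedding, lam = part.rsplit("_", 1)
        pairs.append((embedding, float(lam)))
    return pairs

print(parse_experiment_name("C2_0.17__C5_0.17"))
```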

experiment name                       cPQ   bPQ   cRQ   bRQ   cSQ   bSQ   cL seg  cL mask  best iter
C2_0.17__C5_0.17                      30.1  22.7  38.3  29.0  63.6  67.7  0.22    0.42     1000
C2_0.25                               29.3  21.7  37.2  28.4  58.8  62.6  0.5     0.49     1000
P5_0.25__P6_0.25                      28.8  23.3  36.6  29.3  63.8  68.4  0.23    0.44     1000
P3_0.17__P5_0.17__P6_0.17             28.7  23.0  36.7  29.0  59.1  69.1  0.31    0.33     1000
P2_0.17__P3_0.17__P5_0.17             28.7  22.1  36.2  28.1  61.8  68.9  0.25    0.48     2000
C4_0.17__C5_0.17                      28.7  22.2  36.6  28.5  64.1  72.4  0.43    0.41     7000
P2_0.17__P4_0.17__P5_0.17             28.7  22.7  36.4  28.5  57.4  66.0  0.27    0.43     2000
P2_0.25__P3_0.25                      28.6  23.2  36.6  29.4  63.8  71.3  0.38    0.42     1000
P4_0.25__P5_0.25                      28.5  21.9  36.5  28.1  65.0  64.0  0.39    0.38     1000
P2_0.12__P3_0.12__P4_0.12__P6_0.12    28.5  21.0  36.1  27.6  62.8  67.5  0.39    0.42     10000
P2_0.5                                28.5  22.6  36.2  28.3  59.1  60.5  0.58    0.55     1000
P2_0.25__P4_0.25                      28.5  21.6  36.0  27.8  59.3  72.6  0.45    0.45     7000
C2_0.5                                28.4  20.9  35.8  27.1  59.6  68.8  0.38    0.42     5000
C4_0.5                                28.4  21.4  36.1  27.6  59.3  68.9  0.29    0.4      5000
P4_0.25__P6_0.25                      28.3  21.9  36.3  28.2  62.9  68.3  0.3     0.47     6000
P2_0.12__P3_0.12__P5_0.12__P6_0.12    28.3  21.5  36.1  28.1  63.6  67.8  0.35    0.45     1000

Table 5.1: Domain adaptation experiment metrics, sorted by decreasing panoptic quality on the Cityscapes validation set (cPQ)

experiment name                       cPQ    bPQ    cRQ    bRQ    cSQ    bSQ     cL seg  cL mask
C2_0.17__C5_0.17                      13.2%  15.8%  13.0%  11.5%  20.9%  -0.4%   -63.9%  -12.5%
C2_0.25                               10.2%  10.7%  9.7%   9.2%   11.8%  -7.9%   -18.0%  2.1%
P5_0.25__P6_0.25                      8.3%   18.9%  8.0%   12.7%  21.3%  0.6%    -62.3%  -8.3%
P3_0.17__P5_0.17__P6_0.17             7.9%   17.3%  8.3%   11.5%  12.4%  1.6%    -49.2%  -31.2%
P2_0.17__P3_0.17__P5_0.17             7.9%   12.8%  6.8%   8.1%   17.5%  1.3%    -59.0%  0.0%
C4_0.17__C5_0.17                      7.9%   13.3%  8.0%   9.6%   21.9%  6.5%    -29.5%  -14.6%
P2_0.17__P4_0.17__P5_0.17             7.9%   15.8%  7.4%   9.6%   9.1%   -2.9%   -55.7%  -10.4%
P2_0.25__P3_0.25                      7.5%   18.4%  8.0%   13.1%  21.3%  4.9%    -37.7%  -12.5%
P4_0.25__P5_0.25                      7.1%   11.7%  7.7%   8.1%   23.6%  -5.9%   -36.1%  -20.8%
P2_0.12__P3_0.12__P4_0.12__P6_0.12    7.1%   7.1%   6.5%   6.2%   19.4%  -0.7%   -36.1%  -12.5%
P2_0.5                                7.1%   15.3%  6.8%   8.8%   12.4%  -11.0%  -4.9%   14.6%
P2_0.25__P4_0.25                      7.1%   10.2%  6.2%   6.9%   12.7%  6.8%    -26.2%  -6.2%
C2_0.5                                6.8%   6.6%   5.6%   4.2%   13.3%  1.2%    -37.7%  -12.5%
C4_0.5                                6.8%   9.2%   6.5%   6.2%   12.7%  1.3%    -52.5%  -16.7%
P4_0.25__P6_0.25                      6.4%   11.7%  7.1%   8.5%   19.6%  0.4%    -50.8%  -2.1%
P2_0.12__P3_0.12__P5_0.12__P6_0.12    6.4%   9.7%   6.5%   8.1%   20.9%  -0.3%   -42.6%  -6.2%

Table 5.2: Domain adaptation experiments reported as the percentage change of the metrics with respect to the domain adaptation baseline (baseline 1), sorted by decreasing panoptic quality on the Cityscapes validation set (cPQ)

experiment name                       cPQ     bPQ     cRQ     bRQ     cSQ     bSQ     cL seg  cL mask
C2_0.17__C5_0.17                      -40.7%  -30.6%  -38.9%  -28.2%  -17.0%  -13.5%  83.3%   23.5%
C2_0.25                               -42.3%  -33.6%  -40.7%  -29.7%  -23.2%  -20.1%  316.7%  44.1%
P5_0.25__P6_0.25                      -43.3%  -28.7%  -41.6%  -27.5%  -16.7%  -12.6%  91.7%   29.4%
P3_0.17__P5_0.17__P6_0.17             -43.5%  -29.7%  -41.5%  -28.2%  -22.8%  -11.7%  158.3%  -2.9%
P2_0.17__P3_0.17__P5_0.17             -43.5%  -32.4%  -42.3%  -30.4%  -19.3%  -12.0%  108.3%  41.2%
C4_0.17__C5_0.17                      -43.5%  -32.1%  -41.6%  -29.5%  -16.3%  -7.5%   258.3%  20.6%
P2_0.17__P4_0.17__P5_0.17             -43.5%  -30.6%  -41.9%  -29.5%  -25.1%  -15.7%  125.0%  26.5%
P2_0.25__P3_0.25                      -43.7%  -29.1%  -41.6%  -27.2%  -16.7%  -8.9%   216.7%  23.5%
P4_0.25__P5_0.25                      -43.9%  -33.0%  -41.8%  -30.4%  -15.1%  -18.3%  225.0%  11.8%
P2_0.12__P3_0.12__P4_0.12__P6_0.12    -43.9%  -35.8%  -42.4%  -31.7%  -18.0%  -13.8%  225.0%  23.5%
P2_0.5                                -43.9%  -30.9%  -42.3%  -30.0%  -22.8%  -22.7%  383.3%  61.8%
P2_0.25__P4_0.25                      -43.9%  -33.9%  -42.6%  -31.2%  -22.6%  -7.3%   275.0%  32.4%
C2_0.5                                -44.1%  -36.1%  -42.9%  -32.9%  -22.2%  -12.1%  216.7%  23.5%
C4_0.5                                -44.1%  -34.6%  -42.4%  -31.7%  -22.6%  -12.0%  141.7%  17.6%
P4_0.25__P6_0.25                      -44.3%  -33.0%  -42.1%  -30.2%  -17.9%  -12.8%  150.0%  38.2%
P2_0.12__P3_0.12__P5_0.12__P6_0.12    -44.3%  -34.3%  -42.4%  -30.4%  -17.0%  -13.4%  191.7%  32.4%

Table 5.3: Domain adaptation experiments reported as the percentage change of the metrics with respect to the Cityscapes baseline (baseline 2), sorted by decreasing panoptic quality on the Cityscapes validation set (cPQ)
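The percentage changes in Tables 5.2 and 5.3 follow directly from the raw metrics; a minimal sketch (the example values are taken from the top row of Table 5.1 and from baseline 1):

```python
def pct_change(value, baseline):
    """Relative change used in Tables 5.2 and 5.3, in percent."""
    return 100.0 * (value - baseline) / baseline

# Top configuration (C2_0.17__C5_0.17) vs. baseline 1, cPQ column:
print(round(pct_change(30.1, 26.6), 1))  # 13.2
```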

5.5.1 Result summary

As can be seen from Table 5.2, the top domain adaptive configuration, in the top row, attains a 13.2% and 15.8% panoptic quality (PQ) improvement on Cityscapes and BDD100k validation respectively over the baseline trained solely on synthetic data and tested on real data.

Overall, all metrics improve with respect to baseline 1), except for the segmentation quality on BDD100k validation, which changes only marginally with respect to the baseline value.

The losses are all much lower than the baseline ones, with decreases of about 63% and 12.5% in the semantic and instance segmentation loss respectively for the top performing model.

Nevertheless, there is a lot of room for improvement.

As Table 5.3 shows, the top performing domain adaptive model presents a panoptic quality metric which is 40.7% and 30.6% worse than that of the "oracle" model (baseline 2), computed respectively on predictions made on the Cityscapes and BDD100k validation sets.

Thus, the final model can be defined as the DA-PanopticFPN trained with the domain classifiers attached to C2 and C5 (each implemented as described in the model design section), with maximum λC2 = 0.17 and λC5 = 0.17 respectively. The architecture is shown in figure 5.2, with the considered domain adaptive components highlighted in yellow; the ones that must remain detached are greyed out.

Figure 5.2: DA PanopticFPN architecture with the optimal domain classifiers attached respectively to C2 and C5

The PQ metrics obtained on the validation sets by the best configuration at its best training iteration are shown for Synthetic-CARLA, Cityscapes and BDD100K in figure 5.3.

Figure 5.3: PQ metrics of the best model on validation Cityscapes (left) and BDD100k (right)

For comparison, all PQ metrics are plotted in a single graph in figure 5.4.

The grey configuration at the top is the baseline 2) model.

Figure 5.4: PQ metrics of the best model on validation Cityscapes (top) and BDD100k (bottom)

The training and validation total losses (sum of all contributions) of the best configuration on all datasets are plotted in figure 5.5.

Figure 5.5: Train and validation losses: the train total loss on Synthetic-CARLA (bottom), and the validation total loss on Synthetic-CARLA (top-left), Cityscapes (top-center) and BDD100k (top-right)

The panoptic predictions obtained on the validation sets by the best configuration at its best training iteration are shown for Cityscapes in figure 5.6, BDD100K in figure 5.7 and Synthetic-CARLA in figure 5.8.

Figure 5.6: validation Cityscapes panoptic predictions of the best configuration

Figure 5.7: validation BDD100k panoptic predictions of the best configuration

Figure 5.8: validation Synthetic-CARLA panoptic predictions of the best configuration

Conclusions and further work

The development of SAE level 5 autonomous vehicles entails a set of complex problems, each of which has to be solved with the most robust algorithms, refined to the utmost perfection.

As the authors of [2] prove, to demonstrate the safety and reliability of autonomous vehicles with respect to the robustness requirements of the SAE[3] metrics, 100 to 600 years of cumulative driving are needed to collect the necessary data for each separate issue, e.g. fatality rate, failure rate, crash rate.

Hence, the authors make the case for the aided development of autonomous vehicles by means of simulation.

However, up to now simulation can only provide approximate models of the real world.

As the discrepancy between simulation and real world data distributions, defined as domain shift, is the major obstacle that hinders the widespread use of simulation for autonomous vehicles, this dissertation focuses its attention on such issue.

The study has been developed from the perspective of the implementation of scene understanding models for autonomous driving, as it presents one of the best use-cases both for deep learning models and domain adaptation techniques that directly address the domain shift problem.

This issue has been addressed with regard to the novel panoptic segmentation task, which allows capturing fine details of the scene as well as detecting all actors within it.

As deep models are known to be data hungry, the theoretical complete removal of the domain shift would allow them to be trained on simulation, which by design automatically annotates data, and to achieve the same performance as a model trained on costly annotated real world data.

As such, this work developed the context of such use-case, by employing the state of the art CARLA[6] simulator as a means to generate auto-annotated synthetic data. Then, real world data has been obtained by means of the Cityscapes[60] and BDD100k[62] datasets.

With the creation of a label-matched dataset for panoptic segmentation, the PanopticFPN model has been trained with proper modifications so as to be usable in a domain adaptation framework.

To this end, the self-supervised adversarial domain adaptation technique described by Ganin et al.[140] has been utilized to adapt the model for supervised and self-supervised training on synthetic and real data respectively.

As self-supervision does not require a human annotator in order to generate ground truths for the data, it enables the use of real world images, although in a restricted form (panoptic annotations on real world data would still provide much better signals for training deep models).

From the experiments that have been run, it has been observed that it is indeed possible to improve models trained on labeled simulated data and self-supervised real world data by means of domain adaptation. Hence, the optimal self-supervised adversarial domain adaptation setup has been determined and the developed DA-PanopticFPN has been implemented according to such configuration, shown in Figure 5.2.

Domain adaptation, employed as described above, allowed the panoptic quality metric of the chosen model to improve by 13.2% and 15.8% over the same baseline PanopticFPN architecture, named baseline 1, trained solely on synthetic data and tested on the real world Cityscapes and BDD100k datasets.

Nevertheless, the margin for further improvement is large, as the ideal PanopticFPN model, named baseline 2, trained with label supervision directly on the Cityscapes real world dataset, attains a much higher panoptic quality.

The latter should however be considered only as an ideal condition, as it implies that all real world data would be provided with annotations.

Nevertheless, a theoretical optimal domain adaptive model should be able to mitigate or completely remove the domain shift, so as to attain the same performance as the above "oracle" model.

For comparison, the optimal DA-PanopticFPN architecture attains 40.7% worse panoptic quality compared to the aforementioned oracle model.

Therefore, the avenues for further research on domain adaptation for the panoptic segmentation task are many.

Firstly, as this dissertation focused on the PanopticFPN architecture, the wide variety of panoptic segmentation models available in the literature should be benchmarked against the developed DA-PanopticFPN, each with its own custom domain adaptive implementation.

Many such works have been mentioned in the related works chapter, such as the Panoptic-DeepLab[21], EfficientPS[22] and CVRN[24] convolutional architectures.

Novel research on the panoptic task should also expand to the novel Vision Transformer deep learning models, which approach computer vision problems from the perspective of architectures originally developed to work on sequence-based data.

Another variation stems from the use of the SYNTHIA[25] or GTA5[5] synthetic datasets, paired with either BDD100k or Cityscapes, as a means to enrich the literature on domain adaptation for the panoptic segmentation task with easily accessible benchmarks.

Further benchmarks can also be developed by testing the mentioned variety of panoptic segmentation models, properly modified in order to use different domain adaptation techniques, such as discrepancy based approaches (e.g. maximum mean discrepancy), pretext-task based methods (which are similar in nature to the self-supervised task employed in this work) or other adversarial domain adaptation techniques (e.g. GAN based synthetic-to-real image translation methods).

One final development which stays on the same research track as this work is the enrichment of the CARLA semantic labels so as to provide a set of stuff and thing classes which directly match those of the renowned driving datasets in the literature: Cityscapes, BDD100k, Mapillary Vistas, SYNTHIA and GTA5, to name a few.

Bibliography

[1] Aidan Fuller, Zhong Fan, Charles Day, and Chris Barlow. «Digital Twin:

Enabling Technologies, Challenges and Open Research». In: IEEE Access 8 (2020). Conference Name: IEEE Access, pp. 108952–108971. issn: 2169-3536.

doi: 10.1109/ACCESS.2020.2998358 (cit. on pp. 1, 3).

[2] Nidhi Kalra and Susan M. Paddock. «Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?» In:

Transportation Research Part A: Policy and Practice 94.C (2016). Publisher:

Elsevier, pp. 182–193. issn: 0965-8564. url: https://econpapers.repec.org/article/eeetransa/v_3a94_3ay_3a2016_3ai_3ac_3ap_3a182-193.htm (visited on 03/26/2022) (cit. on pp. 1, 2, 47, 116).

[3] J3016C: Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles - SAE International. url: https://www.sae.org/standards/content/j3016_202104/ (visited on 03/28/2022) (cit. on pp. 2, 116).

[4] Deepdrive. en. url: https://deepdrive.io/ (visited on 02/17/2022) (cit.

on p. 3).

[5] GTA V + Universe. Jan. 2017. url: http://web.archive.org/web/20170111195314/https://openai.com/blog/GTA-V-plus-Universe/ (visited on 02/17/2022) (cit. on pp. 3, 118).

[6] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and

Vladlen Koltun. «CARLA: An Open Urban Driving Simulator». In: arXiv:1711.03938 [cs] (Nov. 2017). arXiv: 1711.03938. url: http://arxiv.org/abs/1711.

03938 (visited on 02/17/2022) (cit. on pp. 3, 13, 50, 54, 99, 117).

[7] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. «AirSim:

High-Fidelity Visual and Physical Simulation for Autonomous Vehicles». In:

arXiv:1705.05065 [cs] (July 2017). arXiv: 1705.05065. url: http://arxiv.

org/abs/1705.05065 (visited on 02/17/2022) (cit. on p. 3).

[8] Gazebo : Blog : Vehicle and city simulation. url: https://gazebosim.org/

blog/car_sim(visited on 02/17/2022) (cit. on p. 3).

[9] NVIDIA. NVIDIA Automotive Simulation. Jan. 2018. url: https://www.

youtube.com/watch?v=booEg6iGNyo (visited on 02/17/2022) (cit. on p. 3).

[10] NVIDIA. NVIDIA DRIVE Sim. Jan. 2019. url: https://www.youtube.

com/watch?v=DXsLDyiONV4 (visited on 02/17/2022) (cit. on p. 3).

[11] Gerard Andrews. What is synthetic data? en-US. June 2021. url: https://blogs.nvidia.com/blog/2021/06/08/what-is-synthetic-data/ (visited on 02/17/2022) (cit. on p. 3).

[12] Artificial Intelligence & Autopilot. en-us. url: https://www.tesla.com/AI (visited on 02/07/2022) (cit. on p. 3).

[13] Tesla. Tesla AI Day. Aug. 2021. url: https://www.youtube.com/watch?v=j0z4FweCy4M&t=5715s (visited on 02/17/2022) (cit. on p. 3).

[14] VI-grade provides DiM150 dynamic driving simulator to NIO as key solution for the development of new electric cars. en. url: https://www.vi-grade.com/en/about/news/vi-grade-provides-dim150-dynamic-driving-simulator-to-nio-as-key-solution-for-the-development-of-new-electric-cars_37/ (visited on 02/17/2022) (cit. on p. 3).

[15] Alexis C. Madrigal. Waymo Built a Secret World for Self-Driving Cars. en.

Section: Technology. Aug. 2017. url: https://www.theatlantic.com/technology/archive/2017/08/inside-waymos-secret-testing-and-simulation-facilities/537648/ (visited on 02/17/2022) (cit. on p. 3).

[16] Waypoint - The official Waymo blog: Off road, but not offline: How simulation helps advance our Waymo Driver. url: https://blog.waymo.com/2020/04/off-road-but-not-offline--simulation27.html (visited on 02/17/2022) (cit. on p. 3).

[17] Waypoint - The official Waymo blog: Simulation City: Introducing Waymo’s most advanced simulation system yet for autonomous driving. url: https://

blog.waymo.com/2021/06/SimulationCity.html(visited on 02/17/2022) (cit. on p. 3).

[18] Self-Driving Simulation | Uber ATG. en. url: https://www.uber.com/us/en/atg/research-and-development/simulation/ (visited on 02/17/2022) (cit. on p. 3).

[19] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. «Panoptic Feature Pyramid Networks». In: arXiv:1901.02446 [cs] (Apr. 2019).

arXiv: 1901.02446. url: http://arxiv.org/abs/1901.02446 (visited on 01/28/2022) (cit. on pp. 3, 100, 101).

[20] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. «Panoptic Segmentation». In: arXiv:1801.00868 [cs] (Apr. 2019).

arXiv: 1801.00868. url: http://arxiv.org/abs/1801.00868 (visited on 01/28/2022) (cit. on pp. 4, 9–11, 15, 45, 59, 86).

[21] Bowen Cheng, Maxwell D. Collins, Yukun Zhu, Ting Liu, Thomas S. Huang, Hartwig Adam, and Liang-Chieh Chen. «Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation». In:

arXiv:1911.10194 [cs] (Mar. 2020). arXiv: 1911.10194. url: http://arxiv.

org/abs/1911.10194 (visited on 02/22/2022) (cit. on pp. 4, 16, 40, 43, 45, 46, 69, 118).

[22] Rohit Mohan and Abhinav Valada. «EfficientPS: Efficient Panoptic Segmentation». In: arXiv:2004.02307 [cs] (Feb. 2021). arXiv: 2004.02307. url:

http://arxiv.org/abs/2004.02307(visited on 02/22/2022) (cit. on pp. 4, 16, 40, 45, 69, 118).

[23] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. «UPSNet: A Unified Panoptic Segmentation Network». en. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, June 2019, pp. 8810–8818. isbn: 978-1-72813-293-8. doi: 10.1109/CVPR.2019.00902.

url: https : / / ieeexplore . ieee . org / document / 8953750/ (visited on 01/28/2022) (cit. on pp. 4, 45).

[24] Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. «Cross-View Regularization for Domain Adaptive Panoptic Segmentation». In: arXiv:2103.02584 [cs] (Mar. 2021). arXiv: 2103.02584. url: http://arxiv.org/abs/2103.

02584 (visited on 03/23/2022) (cit. on pp. 4, 46, 118).

[25] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. «The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes». In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). ISSN: 1063-6919.

June 2016, pp. 3234–3243. doi: 10.1109/CVPR.2016.352 (cit. on pp. 5, 12, 13, 118).

[26] Qi Wang, Junyu Gao, and Xuelong Li. «Weakly Supervised Adversarial Domain Adaptation for Semantic Segmentation in Urban Scenes». In: IEEE Transactions on Image Processing 28.9 (Sept. 2019). arXiv: 1904.09092,

pp. 4376–4386. issn: 1057-7149, 1941-0042. doi: 10.1109/TIP.2019.2910667. url: http://arxiv.org/abs/1904.09092 (visited on 03/12/2022) (cit. on p. 5).

[27] Matteo Biasetton, Umberto Michieli, Gianluca Agresti, and Pietro Zanuttigh.

«Unsupervised Domain Adaptation for Semantic Segmentation of Urban Scenes». In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). ISSN: 2160-7516. June 2019, pp. 1211–

1220. doi: 10.1109/CVPRW.2019.00160 (cit. on p. 5).

[28] Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. «Semi-Supervised Domain Adaptation via Adaptive and Progressive Feature Alignment». In:

arXiv:2106.02845 [cs] (June 2021). arXiv: 2106.02845. url: http://arxiv.

org/abs/2106.02845 (visited on 03/11/2022) (cit. on p. 5).

[29] Hui Zhang, Yonglin Tian, Kunfeng Wang, Haibo He, and Fei-Yue Wang.

«Synthetic-to-Real Domain Adaptation for Object Instance Segmentation».

In: 2019 International Joint Conference on Neural Networks (IJCNN). ISSN:

2161-4407. July 2019, pp. 1–7. doi: 10.1109/IJCNN.2019.8851791 (cit. on p. 5).

[30] Manuel Diaz-Zapata, Özgür Erkent, and Christian Laugier. «Instance Segmentation with Unsupervised Adaptation to Different Domains for Autonomous Vehicles». In: 2020 16th International Conference on Control, Automation, Robotics and Vision (ICARCV). Dec. 2020, pp. 421–427. doi:

10.1109/ICARCV50220.2020.9305452 (cit. on p. 5).

[31] Zhenwei He and Lei Zhang. «Multi-adversarial Faster-RCNN for Unrestricted Object Detection». In: arXiv:1907.10343 [cs] (Sept. 2019). arXiv: 1907.10343.

url: http://arxiv.org/abs/1907.10343 (visited on 03/12/2022) (cit. on p. 5).

[32] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool.

«Domain Adaptive Faster R-CNN for Object Detection in the Wild». In:

arXiv:1803.03243 [cs] (Mar. 2018). arXiv: 1803.03243. url: http://arxiv.

org/abs/1803.03243 (visited on 03/12/2022) (cit. on p. 5).

[33] Dongnan Liu, Donghao Zhang, Yang Song, Fan Zhang, Lauren O’Donnell, Heng Huang, Mei Chen, and Weidong Cai. «Unsupervised Instance Segmentation in Microscopy Images via Panoptic Domain Adaptation and Task Re-weighting». In: arXiv:2005.02066 [cs] (May 2020). arXiv: 2005.02066.

url: http://arxiv.org/abs/2005.02066 (visited on 03/11/2022) (cit. on p. 5).

[34] Borna Bešić, Nikhil Gosala, Daniele Cattaneo, and Abhinav Valada. «Unsupervised Domain Adaptation for LiDAR Panoptic Segmentation». In:

IEEE Robotics and Automation Letters 7.2 (Apr. 2022). arXiv: 2109.15286, pp. 3404–3411. issn: 2377-3766, 2377-3774. doi: 10.1109/LRA.2022.3147326. url: http://arxiv.org/abs/2109.15286 (visited on 03/11/2022) (cit. on p. 5).

[35] Lingdong Kong, Niamul Quader, and Venice Erin Liong. «ConDA: Unsupervised Domain Adaptation for LiDAR Segmentation via Regularized Domain Concatenation». In: arXiv:2111.15242 [cs] (Nov. 2021). arXiv: 2111.15242.

url: http://arxiv.org/abs/2111.15242 (visited on 03/12/2022) (cit. on p. 5).

[36] Qiangeng Xu, Yin Zhou, Weiyue Wang, Charles R. Qi, and Dragomir Anguelov. «SPG: Unsupervised Domain Adaptation for 3D Object Detection via Semantic Point Generation». In: arXiv:2108.06709 [cs] (Aug. 2021).

arXiv: 2108.06709. url: http://arxiv.org/abs/2108.06709 (visited on 03/12/2022) (cit. on p. 5).

[37] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Jürgen Gall, and Cyrill Stachniss. «Towards 3D LiDAR-based semantic scene understanding of 3D point cloud sequences: The SemanticKITTI Dataset».

en. In: The International Journal of Robotics Research 40.8-9 (Aug. 2021).

Publisher: SAGE Publications Ltd STM, pp. 959–967. issn: 0278-3649.

doi: 10.1177/02783649211006735. url: https://doi.org/10.1177/02783649211006735 (visited on 03/01/2022) (cit. on p. 5).

[38] Ze Wang, Weiqiang Ren, and Qiang Qiu. «LaneNet: Real-Time Lane Detection Networks for Autonomous Driving». In: arXiv:1807.01726 [cs] (July 2018). arXiv: 1807.01726. url: http://arxiv.org/abs/1807.01726 (visited on 02/18/2022) (cit. on p. 8).

[39] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. «MOTS:

Multi-Object Tracking and Segmentation». en. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE, June 2019, pp. 7934–7943. isbn: 978-1-72813-293-8. doi:

10.1109/CVPR.2019.00813. url: https://ieeexplore.ieee.org/document/8953401/ (visited on 02/18/2022) (cit. on p. 8).

[40] Tien-Wen Yeh, Huei-Yung Lin, and Chin-Chen Chang. «Traffic Light and Arrow Signal Recognition Based on a Unified Network». en. In: Applied Sciences 11.17 (Jan. 2021). Number: 17 Publisher: Multidisciplinary Digital Publishing Institute, p. 8066. issn: 2076-3417. doi: 10.3390/app11178066. url: https://www.mdpi.com/2076-3417/11/17/8066 (visited on 02/18/2022) (cit. on p. 8).

[41] Julian Müller and Klaus Dietmayer. «Detecting Traffic Lights by Single Shot Detection». In: arXiv:1805.02523 [cs] (Oct. 2018). arXiv: 1805.02523. url:

http://arxiv.org/abs/1805.02523 (visited on 02/18/2022) (cit. on p. 8).

[42] Phuc Manh Nguyen, Vu Cong Nguyen, Son Ngoc Nguyen, Linh My Thi Dang, Ha Xuan Nguyen, and Vinh Dinh Nguyen. «Robust Traffic Light Detection and Classification Under Day and Night Conditions». In: 2020 20th International Conference on Control, Automation and Systems (ICCAS).

ISSN: 2642-3901. Oct. 2020, pp. 565–570. doi: 10.23919/ICCAS50221.2020.9268343 (cit. on p. 8).

[43] Pavly Salah Zaki, Marco Magdy William, Bolis Karam Soliman, Kerolos Gamal Alexsan, Keroles Khalil, and Magdy El-Moursy. «Traffic Signs Detection and Recognition System using Deep Learning». In: arXiv:2003.03256 [cs, eess] (Mar. 2020). arXiv: 2003.03256. url: http://arxiv.org/abs/

2003.03256 (visited on 02/19/2022) (cit. on p. 8).

[44] Aashrith Vennelakanti, Smriti Shreya, Resmi Rajendran, Debasis Sarkar, Deepak Muddegowda, and Phanish Hanagal. «Traffic Sign Detection and Recognition using a CNN Ensemble». In: 2019 IEEE International Conference on Consumer Electronics (ICCE). ISSN: 2158-4001. Jan. 2019, pp. 1–4.

doi: 10.1109/ICCE.2019.8662019 (cit. on p. 8).

[45] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. «SSD: Single Shot MultiBox Detector». In: arXiv:1512.02325 [cs] (Dec. 2016). arXiv: 1512.02325. doi:

10.1007/978-3-319-46448-0_2. url: http://arxiv.org/abs/1512.

02325 (visited on 03/21/2022) (cit. on pp. 8, 35, 81).

[46] Qingquan Li, Long Chen, Quanwen Zhu, Ming Li, Qun Zhang, and Shuzhi Sam Ge. «Intersection detection and recognition for autonomous urban driving using a virtual cylindrical scanner». en. In: IET Intelligent Transport Systems 8.3 (2014). _eprint:

https://onlinelibrary.wiley.com/doi/pdf/10.1049/iet-its.2012.0202, pp. 244–254. issn: 1751-9578. doi: 10.1049/iet-its.2012.

0202. url: https://onlinelibrary.wiley.com/doi/abs/10.1049/iet-its.2012.0202 (visited on 02/19/2022) (cit. on p. 8).

[47] Dhaivat Bhatt, Danish Sodhi, Arghya Pal, Vineeth Balasubramanian, and Madhava Krishna. Have i reached the intersection: A deep learning-based approach for intersection detection from monocular cameras. Pages: 4500.

2017. doi: 10.1109/IROS.2017.8206317 (cit. on p. 8).

[48] DRIVE Labs: Pursuing Perfection for Intersection Detection - NVIDIA Blog.

en-US. May 2019. url: https://blogs.nvidia.com/blog/2019/05/10/

drive-labs-intersection-detection/ (visited on 02/19/2022) (cit. on p. 8).

[49] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. «COCO-Stuff: Thing and Stuff Classes in Context». In: arXiv:1612.03716 [cs] (Mar. 2018).

arXiv: 1612.03716. url: http://arxiv.org/abs/1612.03716 (visited on 03/22/2022) (cit. on pp. 8–10, 39, 40, 99).

[50] Omar Elharrouss, Somaya Al-Maadeed, Nandhini Subramanian, and Najmath Ottakath. «Panoptic Segmentation: A Review». en. In: (), p. 29 (cit. on p. 11).

[51] Tsung-Yi Lin et al. «Microsoft COCO: Common Objects in Context». In:

arXiv:1405.0312 [cs] (Feb. 2015). arXiv: 1405.0312. url: http://arxiv.

org/abs/1405.0312(visited on 03/01/2022) (cit. on p. 11).

[52] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. «ImageNet: A large-scale hierarchical image database». In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. ISSN: 1063-6919. June 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848 (cit. on pp. 11, 27, 104, 105).

[53] COCO - Common Objects in Context. url: https://cocodataset.org/

#format-data (visited on 03/28/2022) (cit. on pp. 12, 59).

[54] Edward H. Adelson. «On seeing stuff: the perception of materials by humans and machines». In: ed. by Bernice E. Rogowitz and Thrasyvoulos N. Pappas.

San Jose, CA, June 2001, pp. 1–12. doi: 10.1117/12.429489. url: http://

proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=

903694 (visited on 03/12/2022) (cit. on p. 12).

[55] Stephan R. Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. «Playing for Data: Ground Truth from Computer Games». In: arXiv:1608.02192 [cs] (Aug. 2016). arXiv: 1608.02192. url: http://arxiv.org/abs/1608.

02192 (visited on 03/22/2022) (cit. on pp. 12, 13).

[56] Daniel Hernandez-Juarez, Lukas Schneider, Antonio Espinosa, David Vázquez, Antonio M. López, Uwe Franke, Marc Pollefeys, and Juan C. Moure. «Slanted Stixels: Representing San Francisco’s Steepest Streets». In: arXiv:1707.05397 [cs] (July 2017). arXiv: 1707.05397. url: http://arxiv.org/abs/1707.

05397 (visited on 03/22/2022) (cit. on p. 13).

[57] Emanuele Alberti, Antonio Tavera, Carlo Masone, and Barbara Caputo.

«IDDA: a large-scale multi-domain dataset for autonomous driving». In:

IEEE Robotics and Automation Letters 5.4 (Oct. 2020). arXiv: 2004.08298, pp. 5526–5533. issn: 2377-3766, 2377-3774. doi: 10.1109/LRA.2020.3009075. url: http://arxiv.org/abs/2004.08298 (visited on 01/28/2022) (cit. on p. 13).

[58] Jean-Emmanuel Deschaud. «KITTI-CARLA: a KITTI-like dataset generated by CARLA Simulator». In: arXiv:2109.00892 [cs] (Aug. 2021). arXiv:

2109.00892. url: http://arxiv.org/abs/2109.00892 (visited on 03/22/2022) (cit. on p. 13).

[59] A Geiger, P Lenz, C Stiller, and R Urtasun. «Vision meets robotics: The KITTI dataset». en. In: The International Journal of Robotics Research 32.11 (Sept. 2013). Publisher: SAGE Publications Ltd STM, pp. 1231–

1237. issn: 0278-3649. doi: 10.1177/0278364913491297. url: https://doi.org/10.1177/0278364913491297 (visited on 03/01/2022) (cit. on pp. 13, 55).

[60] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele.

«The Cityscapes Dataset for Semantic Urban Scene Understanding». In:

arXiv:1604.01685 [cs] (Apr. 2016). arXiv: 1604.01685. url: http://arxiv.

org/abs/1604.01685 (visited on 02/05/2022) (cit. on pp. 14, 50, 58, 117).

[61] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder.

«The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes».

In: 2017 IEEE International Conference on Computer Vision (ICCV). ISSN:

2380-7504. Oct. 2017, pp. 5000–5009. doi: 10.1109/ICCV.2017.534 (cit. on p. 14).

[62] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. «BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning». In: arXiv:1805.04687 [cs]

(Apr. 2020). arXiv: 1805.04687. url: http://arxiv.org/abs/1805.04687 (visited on 01/28/2022) (cit. on pp. 14, 50, 58, 99, 117).

[63] Ming-Fang Chang et al. «Argoverse: 3D Tracking and Forecasting with Rich Maps». In: arXiv:1911.02620 [cs] (Nov. 2019). arXiv: 1911.02620. url:

http : / / arxiv . org / abs / 1911 . 02620 (visited on 01/28/2022) (cit. on p. 15).

[64] Holger Caesar et al. «nuScenes: A multimodal dataset for autonomous driving». In: arXiv:1903.11027 [cs, stat] (May 2020). arXiv: 1903.11027.

url: http://arxiv.org/abs/1903.11027 (visited on 01/28/2022) (cit. on p. 15).

[65] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. «Panoptic Feature Pyramid Networks». In: arXiv:1901.02446 [cs] (Apr. 2019).

arXiv: 1901.02446. url: http://arxiv.org/abs/1901.02446 (visited on 01/28/2022) (cit. on pp. 15, 16, 40, 45, 69, 76, 85, 98, 100).

[66] Sukjun Hwang, Seoung Wug Oh, and Seon Joo Kim. «Single-shot Path Integrated Panoptic Segmentation». In: arXiv:2012.01632 [cs] (Dec. 2020).

arXiv: 2012.01632. url: http://arxiv.org/abs/2012.01632 (visited on 02/22/2022) (cit. on p. 16).

[67] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. «Fully Convolutional Networks for Panoptic Segmentation».

en. In: (Dec. 2020). url: https://arxiv.org/abs/2012.00720v2 (visited on 02/22/2022) (cit. on p. 16).

[68] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. «Mask R-CNN». In: arXiv:1703.06870 [cs] (Jan. 2018). arXiv: 1703.06870. url:

http : / / arxiv . org / abs / 1703 . 06870 (visited on 01/28/2022) (cit. on pp. 16, 41, 45, 69, 77, 93).

[69] Liang-Chieh Chen, Huiyu Wang, and Siyuan Qiao. «Scaling Wide Residual Networks for Panoptic Segmentation». In: arXiv:2011.11675 [cs] (Feb. 2021).

arXiv: 2011.11675. url: http://arxiv.org/abs/2011.11675 (visited on 02/22/2022) (cit. on p. 16).

[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. «Attention Is All You Need». en. In: (June 2017). url: https://arxiv.org/abs/1706.03762v5 (visited on 02/22/2022) (cit. on pp. 16, 27).

[71] Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. «Vision Transformer with Deformable Attention». en. In: (Jan. 2022). url: https:

//arxiv.org/abs/2201.00520v2 (visited on 02/22/2022) (cit. on p. 16).

[72] Alexey Dosovitskiy et al. «An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale». en. In: (Oct. 2020). url: https://arxiv.

org/abs/2010.11929v2(visited on 02/22/2022) (cit. on p. 16).

[73] Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M.

Alvarez, Tong Lu, and Ping Luo. «Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers». en. In: (Sept. 2021). url:

https://arxiv.org/abs/2109.03814v3 (visited on 02/22/2022) (cit. on p. 16).

[74] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. «Learning representations by back-propagating errors». en. In: Nature 323.6088 (Oct.

1986). Number: 6088 Publisher: Nature Publishing Group, pp. 533–536.

issn: 1476-4687. doi: 10.1038/323533a0. url: https://www.nature.com/

articles/323533a0(visited on 03/14/2022) (cit. on pp. 17, 27).

[75] Xavier Glorot and Yoshua Bengio. «Understanding the difficulty of training deep feedforward neural networks». en. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. ISSN: 1938-7228. JMLR Workshop and Conference Proceedings, Mar. 2010, pp. 249–256.

url: https://proceedings.mlr.press/v9/glorot10a.html (visited on 03/16/2022) (cit. on p. 18).

[76] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. «Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classi-fication». In: arXiv:1502.01852 [cs] (Feb. 2015). arXiv: 1502.01852. url:

http : / / arxiv . org / abs / 1502 . 01852 (visited on 03/16/2022) (cit. on p. 18).

[77] Wadii Boulila, Maha Driss, Eman Alshanqiti, Mohamed Al-Sarem, Faisal Saeed, and Moez Krichen. «Weight Initialization Techniques for Deep Learning Algorithms in Remote Sensing: Recent Trends and Future Perspectives».

en. In: Advances on Smart and Soft Computing. Ed. by Faisal Saeed, Tawfik Al-Hadhrami, Errais Mohammed, and Mohammed Al-Sarem. Vol. 1399.

Series Title: Advances in Intelligent Systems and Computing. Singapore:

Springer Singapore, 2022, pp. 477–484. isbn: 9789811655586 9789811655593.

doi: 10.1007/978-981-16-5559-3_39. url: https://link.springer.com/10.1007/978-981-16-5559-3_39 (visited on 03/16/2022) (cit. on p. 18).

[78] Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. «Data-dependent Initializations of Convolutional Neural Networks». In: arXiv:1511.06856 [cs] (Sept. 2016). arXiv: 1511.06856. url: http://arxiv.org/abs/1511.

06856 (visited on 03/16/2022) (cit. on p. 18).

[79] Vincent Dumoulin and Francesco Visin. «A guide to convolution arithmetic for deep learning». In: arXiv:1603.07285 [cs, stat] (Mar. 2016). arXiv:

1603.07285 version: 1. url: http://arxiv.org/abs/1603.07285 (visited on 02/28/2022) (cit. on pp. 20, 21, 24).

[80] Fisher Yu and Vladlen Koltun. «Multi-Scale Context Aggregation by Dilated Convolutions». In: arXiv:1511.07122 [cs] (Apr. 2016). arXiv: 1511.07122.

url: http://arxiv.org/abs/1511.07122 (visited on 02/28/2022) (cit. on p. 21).

[81] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. «Dropout: A Simple Way to Prevent Neural Networks from Overfitting». In: Journal of Machine Learning Research 15.56 (2014), pp. 1929–1958. issn: 1533-7928. url: http://jmlr.org/papers/v15/

srivastava14a.html (visited on 03/19/2022) (cit. on p. 22).