3.3 Synthetic-to-Real dataset
3.3.1 Multi dataset label matching
In the following, the conversion from dataset-specific labels to the label-matched version is presented for each of the datasets considered in this dissertation.
Before proceeding, the label-matched subset is presented. The thing classes that are considered are:
• car
• pedestrian
The stuff classes that are considered are:
• road
• sidewalk
• building
• wall
• fence
• pole
• traffic light
• traffic sign
• vegetation
• terrain
• sky
Synthetic CARLA
The mapping in Table 3.1 is utilized to move from the dataset-specific labels, shown in the column Original, to the label-matched version (column Label matched). Note that these labels are the ones used before each is converted to the PanopticFPN-specific format.
Class           Original   Label matched
road            7          0
roadline        6          0
sidewalk        8          1
building        1          2
wall            11         3
fence           2          4
pole            5          5
traffic light   18         6
traffic sign    12         7
vegetation      9          8
terrain         22         9
sky             13         10
person          4          11
car             10         13
unlabeled       0          255
other           3          255
ground          14         255
bridge          15         255
rail track      16         255
guard rail      17         255
static          19         255
dynamic         20         255
water           21         255
Table 3.1: Synthetic dataset mapping from default labels to label-matched version
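The conversion described by Table 3.1 can be sketched as a simple per-pixel lookup. The snippet below is an illustrative implementation, not taken from the thesis code: the mapping dictionary reproduces Table 3.1, and any label absent from the matched subset is sent to the conventional ignore value 255.

```python
# Sketch of the label-matching step for the synthetic CARLA dataset.
# The mapping reproduces Table 3.1; classes outside the label-matched
# subset are assigned the ignore label 255.
CARLA_TO_MATCHED = {
    7: 0, 6: 0,            # road, roadline -> road
    8: 1,                  # sidewalk
    1: 2, 11: 3, 2: 4,     # building, wall, fence
    5: 5, 18: 6, 12: 7,    # pole, traffic light, traffic sign
    9: 8, 22: 9, 13: 10,   # vegetation, terrain, sky
    4: 11,                 # person -> pedestrian
    10: 13,                # car
}
IGNORE = 255

def remap_mask(mask):
    """Convert a 2D mask of dataset-specific ids to label-matched ids."""
    return [[CARLA_TO_MATCHED.get(px, IGNORE) for px in row]
            for row in mask]
```

The same pattern applies to the Cityscapes and BDD100k mappings of Tables 3.2 and 3.3, with only the dictionary changing per dataset.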
Figure 3.5: Synthetic dataset sample with its label-matched panoptic mask
Cityscapes
The mapping in Table 3.2 is utilized to move from the dataset-specific labels, shown in the column Original, to the label-matched version (column Label matched). Note that these labels are the ones used before each is converted to the PanopticFPN-specific format.
Class                  Original   Label matched
license plate          -1         -1
road                   7          0
sidewalk               8          1
building               11         2
wall                   12         3
fence                  13         4
pole                   17         5
traffic light          19         6
traffic sign           20         7
vegetation             21         8
terrain                22         9
sky                    23         10
person                 24         11
rider                  25         11
car                    26         13
truck                  27         13
bus                    28         13
caravan                29         13
train                  31         13
motorcycle             32         13
bicycle                33         13
unlabeled              0          255
ego vehicle            1          255
rectification border   2          255
out of roi             3          255
static                 4          255
dynamic                5          255
ground                 6          255
parking                9          255
rail track             10         255
guard rail             14         255
bridge                 15         255
tunnel                 16         255
polegroup              18         255
trailer                30         255
Table 3.2: Cityscapes mapping from default labels to label-matched version
Figure 3.6: Cityscapes sample with its label-matched panoptic mask
BDD100k
The mapping in Table 3.3 is utilized to move from the dataset-specific labels, shown in the column Original, to the label-matched version (column Label matched). Note that these labels are the ones used before each is converted to the PanopticFPN-specific format.
Class                Original   Label matched
road                 7          0
sidewalk             8          1
building             10         2
wall                 15         3
fence                11         4
pole                 20         5
traffic light        25         6
traffic sign         26         7
vegetation           29         8
terrain              28         9
sky                  30         10
person               31         11
rider                32         11
car                  35         13
bicycle              33         13
bus                  34         13
caravan              36         13
motorcycle           37         13
train                39         13
truck                40         13
unlabeled            0          255
dynamic              1          255
ego vehicle          2          255
ground               3          255
static               4          255
parking              5          255
rail track           6          255
bridge               9          255
garage               12         255
guard rail           13         255
tunnel               14         255
banner               16         255
billboard            17         255
lane divider         18         255
parking sign         19         255
polegroup            21         255
street light         22         255
traffic cone         23         255
traffic device       24         255
traffic sign frame   27         255
trailer              38         255
Table 3.3: BDD100k mapping from default labels to label-matched version
Figure 3.7: BDD100k sample with its label-matched panoptic mask
Model design
4.1 Introduction
This introductory section provides a formal description of the design of the Domain Adaptive PanopticFPN neural network architecture, a modification built in this work from the template PanopticFPN model [65].
The objective of this work is to evaluate the improvement in panoptic segmentation performance that a domain adaptive model obtains compared to a baseline trained on synthetic driving data and tested on real-world driving data.
Hence, taking the PanopticFPN [65] architecture as the baseline, a modified version named DA-PanopticFPN, which employs an adversarial self-supervised domain adaptation module, has been developed in this dissertation.
PanopticFPN has been selected because it adopts the minimal set of changes needed to obtain a panoptic segmentation model, favoring a simple and minimal design, as noted by its authors [65].
It adopts the Mask R-CNN [68] instance segmentation architecture, which consists of an FPN [107] backbone created from a ResNet50 [93], with a parallel semantic segmentation branch attached to the output of the FPN backbone.
As such, the effectiveness of domain adaptation can be evaluated easily, avoiding the possibly unintended complex interactions that can appear in more convoluted architectures such as PanopticDeepLab [21] or EfficientPS [22].
Extending such architectures to domain adaptive versions is of interest for future work.
The details specific to the proposed model will be presented after the following formalization.
At its core, a neural network can be understood as a set of functions F_i(·), i ∈ {1, ..., N}, each parameterized by zero or more optimizable parameters, which if present are denoted as weights W_i ∈ R^(d_(i-1) × d_i) and biases b_i ∈ R^(d_i), and by non-optimizable ones, the hyperparameters.
The ordering with which such functions are applied to the inputs X ∈ R^(d_0) implicitly defines a composite function
F(·) = F_N ∘ F_(N-1) ∘ ... ∘ F_1(·)
which represents the network.
Hence the composite function F_θ, which depends on the parameter set θ = {W_1, ..., W_N, b_1, ..., b_N}, maps inputs X to outputs Y as:
F_θ : R^(d_0) → R^(d_N)
The objective of such a complexly built non-linear composite function is to approximate, by means of its optimizable parameters, the mapping between the inputs X and the ground truths Y, which are sampled as M tuples (x_m, y_m), m ∈ {1, ..., M},
from the non-directly observable data distribution P(X, Y).
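The composite-function view above can be made concrete with a toy example. The snippet below is purely illustrative (the weight values, dimensions, and choice of tanh nonlinearity are made up): each layer is an affine map followed by a nonlinearity, and the network F is just their ordered composition F_2 ∘ F_1.

```python
import math

# Each F_i(x) = tanh(W x + b) is one layer of the composite function.
def layer(W, b):
    def f(x):
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
                for row, bi in zip(W, b)]
    return f

F1 = layer([[1.0, -1.0]], [0.0])   # F_1 : R^2 -> R^1
F2 = layer([[2.0]], [0.5])         # F_2 : R^1 -> R^1

def F(x):
    # The network is the ordered composition F = F_2 o F_1.
    return F2(F1(x))
```

For instance, F([1.0, 1.0]) first collapses the input to tanh(1 − 1) = 0 in F_1, then F_2 yields tanh(0.5), illustrating how the intermediate dimensions d_i chain through the composition.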
The design of neural networks follows the common guiding principle of employing one or more backbone feature extractors (more than one in the case of multi-modal data or of an ensemble of models) to automatically learn relevant patterns from the input data, and one or more task-specific branches which are attached to the backbone outputs.
In the setting of multi-task learning, it is common practice to define a common backbone feature extractor, which is tasked with automatically learning by backpropagation the representations that are desirable for all the selected tasks, while the task-specific network heads (also called branches) learn specialized representations that are only useful for their own task.
In such a case, the previously mentioned composite function representation of the neural network can be updated to accommodate this general setting.
Let the following definitions be given:
• θ_F : the learnable parameters of the backbone
• θ_(G_i) : the learnable parameters of the i-th task-specific branch
Furthermore let:
• F(·; θ_F) be the backbone model
• G_i(·; θ_(G_i)), i ∈ {1, ..., n_tasks}, be the task-specific heads
The neural network outputs can then be defined, making explicit the dependencies between computations:
Z = F(X; θF)
Yi = Gi(F(X; θF); θGi) i ∈ {1, ..., n_tasks}
In which Z is the intermediate output of the shared backbone model, which is fed to each of the Gi task-specific heads to generate the related predictions.
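The shared-backbone pattern above can be sketched in a few lines. The functions below are placeholders for F and the G_i (their bodies are toy transforms, not the actual PanopticFPN components); the point is the dataflow: Z is computed once and reused by every head.

```python
# Minimal sketch of the shared-backbone / task-heads dataflow.
# The computations inside each function are illustrative only.
def backbone(x):
    # F(X; theta_F): shared feature extractor.
    return [v * 2.0 for v in x]

def head_semantic(z):
    # G_1(Z; theta_G1): first task-specific head.
    return sum(z)

def head_instance(z):
    # G_2(Z; theta_G2): second task-specific head.
    return max(z)

def forward(x):
    z = backbone(x)            # Z = F(X; theta_F), computed once
    return head_semantic(z), head_instance(z)   # Y_i = G_i(Z; theta_Gi)
```

During backpropagation, the gradients of every head's loss flow through the shared Z into θ_F, which is precisely what pushes the backbone toward representations useful for all tasks at once.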