3.3 Synthetic-to-Real dataset
3.3.1 Multi dataset label matching
In the following, the conversion from dataset-specific labels to the label-matched version is presented for each of the datasets considered in this dissertation.
Before proceeding, the label-matched subset is presented. The thing classes that are considered are:
• car
• pedestrian
The stuff classes that are considered are:
• road
• sidewalk
• building
• wall
• fence
• pole
• traffic light
• traffic sign
• vegetation
• terrain
• sky
Synthetic CARLA
The mapping in Table 3.1 is utilized to move from the dataset-specific labels, shown in the column Original, to the label-matched version (column Label matched). Note that these labels are the ones used before each is converted to the PanopticFPN-specific format.
Class           Original   Label matched
road            7          0
roadline        6          0
sidewalk        8          1
building        1          2
wall            11         3
fence           2          4
pole            5          5
traffic light   18         6
traffic sign    12         7
vegetation      9          8
terrain         22         9
sky             13         10
person          4          11
car             10         13
unlabeled       0          255
other           3          255
ground          14         255
bridge          15         255
rail track      16         255
guard rail      17         255
static          19         255
dynamic         20         255
water           21         255
Table 3.1: Synthetic dataset mapping from default labels to label-matched version
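The conversion described by Table 3.1 can be sketched as a simple per-pixel lookup. The snippet below is an illustrative implementation, not taken from the thesis code: the mapping dictionary reproduces Table 3.1, and any label absent from the matched subset is sent to the conventional ignore value 255.

```python
# Sketch of the label-matching step for the synthetic CARLA dataset.
# The mapping reproduces Table 3.1; classes outside the label-matched
# subset are assigned the ignore label 255.
CARLA_TO_MATCHED = {
    7: 0, 6: 0,            # road, roadline -> road
    8: 1,                  # sidewalk
    1: 2, 11: 3, 2: 4,     # building, wall, fence
    5: 5, 18: 6, 12: 7,    # pole, traffic light, traffic sign
    9: 8, 22: 9, 13: 10,   # vegetation, terrain, sky
    4: 11,                 # person -> pedestrian
    10: 13,                # car
}
IGNORE = 255

def remap_mask(mask):
    """Convert a 2D mask of dataset-specific ids to label-matched ids."""
    return [[CARLA_TO_MATCHED.get(px, IGNORE) for px in row]
            for row in mask]
```

The same pattern applies to the Cityscapes and BDD100k mappings of Tables 3.2 and 3.3, with only the dictionary changing per dataset.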
Figure 3.5: Synthetic dataset sample with its label-matched panoptic mask
Cityscapes
The mapping in Table 3.2 is utilized to move from the dataset-specific labels, shown in the column Original, to the label-matched version (column Label matched). Note that these labels are the ones used before each is converted to the PanopticFPN-specific format.
Class                  Original   Label matched
license plate          -1         -1
road                   7          0
sidewalk               8          1
building               11         2
wall                   12         3
fence                  13         4
pole                   17         5
traffic light          19         6
traffic sign           20         7
vegetation             21         8
terrain                22         9
sky                    23         10
person                 24         11
rider                  25         11
car                    26         13
truck                  27         13
bus                    28         13
caravan                29         13
train                  31         13
motorcycle             32         13
bicycle                33         13
unlabeled              0          255
ego vehicle            1          255
rectification border   2          255
out of roi             3          255
static                 4          255
dynamic                5          255
ground                 6          255
parking                9          255
rail track             10         255
guard rail             14         255
bridge                 15         255
tunnel                 16         255
polegroup              18         255
trailer                30         255
Table 3.2: Cityscapes mapping from default labels to label-matched version
Figure 3.6: Cityscapes sample with its label-matched panoptic mask
BDD100k
The mapping in Table 3.3 is utilized to move from the dataset-specific labels, shown in the column Original, to the label-matched version (column Label matched). Note that these labels are the ones used before each is converted to the PanopticFPN-specific format.
Class                Original   Label matched
road                 7          0
sidewalk             8          1
building             10         2
wall                 15         3
fence                11         4
pole                 20         5
traffic light        25         6
traffic sign         26         7
vegetation           29         8
terrain              28         9
sky                  30         10
person               31         11
rider                32         11
car                  35         13
bicycle              33         13
bus                  34         13
caravan              36         13
motorcycle           37         13
train                39         13
truck                40         13
unlabeled            0          255
dynamic              1          255
ego vehicle          2          255
ground               3          255
static               4          255
parking              5          255
rail track           6          255
bridge               9          255
garage               12         255
guard rail           13         255
tunnel               14         255
banner               16         255
billboard            17         255
lane divider         18         255
parking sign         19         255
polegroup            21         255
street light         22         255
traffic cone         23         255
traffic device       24         255
traffic sign frame   27         255
trailer              38         255
Table 3.3: BDD100k mapping from default labels to label-matched version
Figure 3.7: BDD100k sample with its label-matched panoptic mask
Model design
4.1 Introduction
This introductory section provides a formal description of the design of the Domain Adaptive PanopticFPN neural network architecture, a modification built in this work from the template PanopticFPN model [65].
The objective of this work is to evaluate the improvement in panoptic segmentation performance that a domain adaptive model obtains compared to a baseline trained on synthetic driving data and tested on real-world driving data.
Hence, taking the PanopticFPN [65] architecture as the baseline, a modified version named DA-PanopticFPN, which employs an adversarial self-supervised domain adaptation module, has been developed in this dissertation.
PanopticFPN has been selected because it adopts the minimal set of changes needed to obtain a panoptic segmentation model, favoring a simple and minimal design, as noted by its authors [65].
It adopts the Mask R-CNN [68] instance segmentation architecture, which consists of an FPN [107] backbone created from a ResNet50 [93], with a parallel semantic segmentation branch attached to the output of the FPN backbone.
As such, the effectiveness of domain adaptation can be evaluated easily, avoiding the possibly unintended complex interactions that can appear in more convoluted architectures such as PanopticDeepLab [21] or EfficientPS [22].
Extending such architectures to domain adaptive versions is of interest for future work.
The details specific to the proposed model will be presented after the following formalization.
At its core, a neural network can be understood as a set of functions F_i(·), i ∈ {1, ..., N}, each parameterized by zero or more optimizable parameters, which if present are denoted as weights W_i ∈ R^(d_(i-1) × d_i) and biases b_i ∈ R^(d_i), and by non-optimizable ones, the hyperparameters.
The ordering with which such functions are applied to the inputs X ∈ R^(d_0) implicitly defines a composite function
F(·) = F_N ∘ F_(N-1) ∘ ... ∘ F_1(·)
which represents the network.
Hence the composite function F_θ, which depends on the parameter set θ = {W_1, ..., W_N, b_1, ..., b_N}, maps inputs X to outputs Y as:
F_θ : R^(d_0) → R^(d_N)
The objective of such a complexly built non-linear composite function is to approximate, by means of its optimizable parameters, the mapping between the inputs X and the ground truths Y, which are sampled as M tuples (x_m, y_m), m ∈ {1, ..., M},
from the non-directly observable data distribution P(X, Y).
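The composite-function view above can be made concrete with a toy example. The snippet below is purely illustrative (the weight values, dimensions, and choice of tanh nonlinearity are made up): each layer is an affine map followed by a nonlinearity, and the network F is just their ordered composition F_2 ∘ F_1.

```python
import math

# Each F_i(x) = tanh(W x + b) is one layer of the composite function.
def layer(W, b):
    def f(x):
        return [math.tanh(sum(wij * xj for wij, xj in zip(row, x)) + bi)
                for row, bi in zip(W, b)]
    return f

F1 = layer([[1.0, -1.0]], [0.0])   # F_1 : R^2 -> R^1
F2 = layer([[2.0]], [0.5])         # F_2 : R^1 -> R^1

def F(x):
    # The network is the ordered composition F = F_2 o F_1.
    return F2(F1(x))
```

For instance, F([1.0, 1.0]) first collapses the input to tanh(1 − 1) = 0 in F_1, then F_2 yields tanh(0.5), illustrating how the intermediate dimensions d_i chain through the composition.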
The design of neural networks follows the common guiding principle of employing one or more backbone feature extractors (more than one in the case of multi-modal data or of an ensemble of models) to automatically learn relevant patterns from the input data, and one or more task-specific branches which are attached to the backbone outputs.
In the setting of multi-task learning, it is common practice to define a common backbone feature extractor, which is tasked with automatically learning by backpropagation the representations that are desirable for all the selected tasks, while the task-specific network heads (also called branches) learn specialized representations that are only useful for their own task.
In such a case, the previously mentioned composite function representation of the neural network can be updated to accommodate this general setting.
Let the following definitions be given:
• θ_F : the learnable parameters of the backbone
• θ_(G_i) : the learnable parameters of the i-th task-specific branch
Furthermore let:
• F(·; θ_F) be the backbone model
• G_i(·; θ_(G_i)), i ∈ {1, ..., n_tasks}, be the task-specific heads
The neural network outputs can then be defined, making explicit the dependencies between computations:
Z = F(X; θF)
Yi = Gi(F(X; θF); θGi) i ∈ {1, ..., n_tasks}
In which Z is the intermediate output of the shared backbone model, which is fed to each of the Gi task-specific heads to generate the related predictions.
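The shared-backbone pattern above can be sketched in a few lines. The functions below are placeholders for F and the G_i (their bodies are toy transforms, not the actual PanopticFPN components); the point is the dataflow: Z is computed once and reused by every head.

```python
# Minimal sketch of the shared-backbone / task-heads dataflow.
# The computations inside each function are illustrative only.
def backbone(x):
    # F(X; theta_F): shared feature extractor.
    return [v * 2.0 for v in x]

def head_semantic(z):
    # G_1(Z; theta_G1): first task-specific head.
    return sum(z)

def head_instance(z):
    # G_2(Z; theta_G2): second task-specific head.
    return max(z)

def forward(x):
    z = backbone(x)            # Z = F(X; theta_F), computed once
    return head_semantic(z), head_instance(z)   # Y_i = G_i(Z; theta_Gi)
```

During backpropagation, the gradients of every head's loss flow through the shared Z into θ_F, which is precisely what pushes the backbone toward representations useful for all tasks at once.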