
Figure 1.18: Mask Scoring R-CNN architecture.

Figure 1.19: The supervised dataset Ds and the unsupervised dataset Du.

Semi-Supervised Learning

Differently from the previous learning method, in the case of Semi-Supervised Learning (SSL), in addition to the validation dataset, we have two datasets for training: Du and Ds. Both are shown in Figure 1.19.

$D_u = \{x_i^u\}_{i=1}^{N_u}$ (1.13)

where Ds is a dataset provided with ground truth, typically smaller, and the second dataset Du contains Nu input examples $x^u$ without any labels. The loss is composed of two distinct terms:

$L = \alpha L_{sup} + \beta L_{unsup}$ (1.14)

where $L_{sup}$ is the loss coming from training on the supervised dataset Ds, $L_{unsup}$ is the loss coming from training on the unsupervised dataset Du, and α and β are constant values that balance the training.

This type of learning is used when collecting input examples x is an easy task, but producing the associated ground truth is costly in terms of time and effort. This is usually the case when we need to train a model for the tasks of Object Detection and Instance Segmentation. We refer to Semi-Supervised Object Detection (SSOD) when the final task is Object Detection.
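As a minimal sketch of how the two loss terms in Formula 1.14 can be combined in a training step, assuming a generic classifier, a pseudo-labeled unsupervised batch, and illustrative values for α and β (none of these details come from the original text):

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, sup_batch, unsup_batch, alpha=1.0, beta=0.5):
    """L = alpha * L_sup + beta * L_unsup (Formula 1.14), sketched for a classifier."""
    xs, ys = sup_batch      # labeled images from Ds with their ground truth
    xu, yu = unsup_batch    # unlabeled images from Du with (pseudo-)labels

    loss_sup = F.cross_entropy(model(xs), ys)
    loss_unsup = F.cross_entropy(model(xu), yu)
    return alpha * loss_sup + beta * loss_unsup
```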

Self-Supervised Learning

There are some special cases where the ground truth can be generated automatically, either offline, before the training starts, or online, while the training is ongoing. A typical example is the jigsaw puzzle, where the input image is divided into squares and rearranged in a random way, and the network learns to solve the puzzle. The creation of the supervised signal is entrusted to a function that generates a new input at every training iteration.
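As an illustration only, such a self-supervised generator could look like the sketch below: the input image is split into tiles, the tiles are shuffled, and the permutation itself serves as the automatically generated label (the 3x3 grid and the CxHxW tensor layout are assumptions, not details from the original text):

```python
import torch

def make_jigsaw_example(image, grid=3):
    """Split a CxHxW image into grid*grid tiles, shuffle them, and return
    the shuffled image together with the permutation used as supervision."""
    c, h, w = image.shape
    th, tw = h // grid, w // grid
    tiles = [image[:, i * th:(i + 1) * th, j * tw:(j + 1) * tw]
             for i in range(grid) for j in range(grid)]
    perm = torch.randperm(grid * grid)                # automatically generated label
    shuffled = [tiles[k] for k in perm.tolist()]
    rows = [torch.cat(shuffled[r * grid:(r + 1) * grid], dim=2) for r in range(grid)]
    return torch.cat(rows, dim=1), perm               # new input and its pseudo ground truth
```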

Teacher-Student Learning

Figure 1.20: Illustrations of KD methods with S-T frameworks from [5]. (a) for model compression and for knowledge transfer, e.g., (b) semi-supervised learning and (c) self-supervised learning.

In a Teacher-Student learning setting [5], we have two independent models, to which a transfer learning approach is usually applied. Broadly used in model compression and knowledge transfer, where the Teacher is a model bigger than the Student, the paradigm has also extended its borders to other cases. Contrary to what one might expect, the authors of [33] demonstrated that it can also be useful in the case of a Defective Teacher, using a poorly-trained teacher, or of Reverse Knowledge Distillation, where the Teacher and Student roles are inverted. In early implementations [34], the Teacher was trained first and the Student started training later. Nowadays, Teacher and Student training can happen at the same time. In [7], the Teacher and the Student are the same model: the Student trains on Ds and Du at the same time, using the ground truth when available or the pseudo-labels generated by the Teacher otherwise, and transfers knowledge back to the Teacher at each iteration. Some Teacher-Student methods are shown in Figure 1.20.

Model ensembling techniques

Figure 1.21: The Mean Teacher method from [6].

In the literature, researchers have explored multiple ways to use predictions coming from multiple models. In the first type of ensembling, multiple predictions are made on the same input and merged together to form the final result. One of the most recent examples is the Temporal Ensembling [35] method, which extends the Π-model by taking into account multiple predictions of the same network, each made in a different epoch of the training. The predictions are merged together through the exponential moving average of the label predictions, with penalization techniques for the ones inconsistent with the target.

Recently, the authors of [6] proposed the Mean Teacher model, an alternative to Temporal Ensembling [35]. In this case, the method averages the weights of the network instead of the label predictions, again through the exponential moving average (EMA). In Figure 1.21, the training on a labeled example is shown. Both Teacher and Student evaluate the input, each applying a different level of noise. The softmax outputs of the models are compared using a classification cost, and a consistency cost function is evaluated between Teacher and Student predictions. Then, the Student weights are updated with gradient descent and the Teacher weights are updated with the exponential moving average of the Student weights.
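A minimal sketch of one Mean Teacher update on a labeled example, assuming simple Gaussian input noise and a classification model; the noise level, equal cost weighting, and model interface are illustrative assumptions, not details taken from [6]:

```python
import torch
import torch.nn.functional as F

def mean_teacher_step(student, teacher, x, y, optimizer):
    """One Mean Teacher update on a labeled example (sketch)."""
    # Both models see the same input with a different level of noise.
    logits_s = student(x + 0.1 * torch.randn_like(x))
    with torch.no_grad():
        logits_t = teacher(x + 0.1 * torch.randn_like(x))

    # Classification cost on the ground truth plus a consistency cost
    # between the Teacher and Student softmax outputs.
    classification_cost = F.cross_entropy(logits_s, y)
    consistency_cost = F.mse_loss(F.softmax(logits_s, dim=1),
                                  F.softmax(logits_t, dim=1))
    loss = classification_cost + consistency_cost

    # Student weights are updated by gradient descent; the Teacher weights
    # are then updated with an EMA of the Student weights (see Formula 1.15).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```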

Later, in [7], the authors used the Mean Teacher technique to transfer knowledge from the Student to the Teacher with the following EMA formula:

$\theta_t \leftarrow \alpha \theta_t + (1 - \alpha) \theta_s$ (1.15)

where θt are the Teacher weights, θs are the Student weights, and α is a smoothing coefficient hyperparameter.
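A minimal sketch of the update in Formula 1.15, assuming Teacher and Student are two PyTorch modules with identical architectures (the function name and the default α are illustrative):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """theta_t <- alpha * theta_t + (1 - alpha) * theta_s (Formula 1.15)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```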

Pseudo-labeling

When we talk about Semi-Supervised Learning, we usually define a way to automatically generate the labels for the unsupervised dataset Du. One example of these techniques is Pseudo-labeling [36], where we train a model to generate the best labels it can. This new dataset can be modeled as follows:

$\tilde{D}_u = \{x_i^u, \tilde{y}_i^u\}_{i=1}^{N_u}, \qquad \tilde{y}_i^u = y_i^u + n_i$ (1.16)

where $\tilde{D}_u$ is the new unsupervised dataset, $\tilde{y}_i^u$ is the label predicted by the network for the i-th example, $y_i^u$ is the real ground truth that we do not know, and $n_i$ is additional noise introduced by the network itself. If the model performance is high, then the noise term is small and $\tilde{y}_i^u$ converges to $y_i^u$. Otherwise, the noise predominates and the probability of an erroneous prediction is high.
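To keep the noise term $n_i$ small, pseudo-labels are commonly kept only when the model is confident enough. The sketch below illustrates this for a generic classifier; the threshold value and the interface are assumptions, not part of [36]:

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(model, images, threshold=0.7):
    """Keep only the examples whose predicted confidence exceeds the threshold."""
    probs = torch.softmax(model(images), dim=1)
    confidence, labels = probs.max(dim=1)
    keep = confidence >= threshold
    return images[keep], labels[keep]
```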

Unbiased Teacher

Figure 1.22: The Unbiased Teacher method from [7].

In [7], the authors collected the best-performing ideas around Semi-Supervised Learning and built the Unbiased Teacher model. It consists of two stages: a Burn-In Stage and a Teacher-Student Mutual Learning Stage. The first consists of a number of iterations of Supervised Learning on the supervised dataset Ds, where the Student receives as input both weakly-augmented and strongly-augmented versions of the images. This stage is useful to warm up the Student model and usually consists of two thousand iterations. Since the Student model is a Faster R-CNN model, its loss for the Burn-In Stage is the following:

$L_{sup} = \sum_i \left[ L_{cls}^{rpn}(x_i^s, y_i^s) + L_{reg}^{rpn}(x_i^s, y_i^s) + L_{cls}^{RoI}(x_i^s, y_i^s) + L_{reg}^{RoI}(x_i^s, y_i^s) \right]$ (1.17)

where $L_{cls}^{rpn}$ and $L_{reg}^{rpn}$ are the classification and regression losses, respectively, for the Region Proposal Network (RPN) stage, and $L_{cls}^{RoI}$ and $L_{reg}^{RoI}$ are the classification and regression losses, respectively, for the RoI head of Faster R-CNN.

Then, the Teacher model is initialized as a clone of the Student weights and the second stage starts, continuing until the end of the training. At each iteration, a batch of images from Du is weakly augmented and passed as input to the Teacher in evaluation mode, which generates pseudo-labels filtered by a classification score threshold defined as a hyperparameter. The Student is first trained on the weakly- and strongly-augmented images coming from Ds with their ground truth. Then, it is trained on the strongly-augmented images from Du with the pseudo-labels generated by the Teacher.

The Student weights are trained with back-propagation on Ds as in Formula 1.17. For the dataset Du, only the classification losses are applied:

$\theta_s \leftarrow \theta_s + \gamma \frac{\partial (L_{sup} + \lambda_u L_{unsup})}{\partial \theta_s}, \qquad L_{unsup} = \sum_i \left[ L_{cls}^{rpn}(x_i^u, \hat{y}_i^u) + L_{cls}^{RoI}(x_i^u, \hat{y}_i^u) \right]$ (1.18)

The unsupervised losses for the bounding box regression are discarded, since the naive confidence thresholding does not guarantee that the wrong pseudo bounding boxes are filtered out. Finally, the knowledge is partially passed back from the Student to the Teacher with EMA (see Formula 1.15).
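A condensed sketch of the two-stage procedure described above. The detection model interface (supervised_loss, classification_loss, predict), the augmentation callables, and all hyperparameter values are hypothetical placeholders standing in for the actual Faster R-CNN pipeline of [7]:

```python
import copy
import torch

def train_unbiased_teacher(student, optimizer, sup_loader, unsup_loader,
                           weak_aug, strong_aug,
                           burn_in_iters=2000, total_iters=20000,
                           tau=0.7, lambda_u=4.0, ema_alpha=0.999):
    # Burn-In Stage: supervised training of the Student only (Formula 1.17),
    # on weakly- and strongly-augmented views of the labeled batch.
    for _, (xs, ys) in zip(range(burn_in_iters), sup_loader):
        loss = (student.supervised_loss(weak_aug(xs), ys)      # assumed helper
                + student.supervised_loss(strong_aug(xs), ys))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # The Teacher is initialized as a clone of the Student weights.
    teacher = copy.deepcopy(student).eval()

    # Teacher-Student Mutual Learning Stage.
    for _, ((xs, ys), xu) in zip(range(total_iters), zip(sup_loader, unsup_loader)):
        # Pseudo-labels from the Teacher on weakly-augmented unlabeled images,
        # filtered by the confidence threshold tau.
        with torch.no_grad():
            pseudo = teacher.predict(weak_aug(xu), score_thresh=tau)  # assumed helper

        # Student: supervised loss on Ds plus classification-only loss on Du (Formula 1.18).
        loss_sup = student.supervised_loss(strong_aug(xs), ys)
        loss_unsup = student.classification_loss(strong_aug(xu), pseudo)  # assumed helper
        loss = loss_sup + lambda_u * loss_unsup
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Knowledge flows back to the Teacher through EMA (Formula 1.15).
        with torch.no_grad():
            for p_t, p_s in zip(teacher.parameters(), student.parameters()):
                p_t.mul_(ema_alpha).add_(p_s, alpha=1.0 - ema_alpha)
```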

Research findings

All we have to decide is what to do with the time that is given us.

J.R.R. Tolkien

2.1 A Novel Region of Interest Extraction Layer