

2.3.6 Instance segmentation models

While semantic segmentation is a comprehensive approach to holistic scene understanding, one of its major limitations to this end is the inherent lack of a way to differentiate between entities that are more appropriately represented as separate and distinct objects.

As such, semantic segmentation is effective at segmenting the background or other uncountable entities, known as stuff, which are well represented by amorphous regions, e.g. the sky, the road and the vegetation.

On the other hand, objects with well-defined shapes, categorized in the literature as things [49], benefit from object detection and recognition.

However, the latter provides only coarse annotations, without the fine granularity offered by segmentation-based algorithms.

For these reasons, the line of work on instance segmentation has garnered much attention in scene understanding research.

As the task requires discriminating between instances belonging to the same semantic category, a variety of methods exist in the literature to address the complexity of instance segmentation. The main subdivision concerns the methodology employed to generate instance proposals and the associated semantic class: two-stage and one-stage methods.

Two-stage models are the most widely researched and have provided the highest performance overall.

Two-stage methods can be further categorized based on the segmentation pipeline:

• object-detection based: a detector determines object bounding boxes and the class of the object within each, followed by instance segmentation of the area of the input image within the associated bounding box;

• segmentation based: a model generates instance segmentation proposals, and per-instance category labels are assigned based on pixel-wise classification (semantic segmentation) of the whole scene.

Object-detection-based methods employ a first stage that generates candidate box proposals, most commonly by means of the region proposal network (RPN) [112] of the Faster R-CNN architecture, which are then used to create image crops that are fed to a downstream instance segmentation network.

This component is responsible for generating object proposals for each pixel of its input feature maps: each pixel is assigned a set of rectangular anchors (in essence, location-independent bounding boxes) of multiple scales and sizes. After this assignment, the anchors become localized with respect to the assigned feature map pixel.

Then, for each of these preliminary assignments, the RPN predicts an objectness probability and bounding box coordinates. The former determines whether an object is present within the proposed region, while the box coordinates are used to refine the anchor location with respect to the associated ground truth.
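To illustrate the mechanism described above, the following minimal sketch shows an RPN-style prediction head that outputs, for every anchor at every feature map location, an objectness score and four box-regression deltas. The framework (PyTorch), channel count and anchor count are illustrative assumptions, not the original configuration of [112].

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_anchors: int = 9):
        super().__init__()
        # Shared 3x3 convolution applied to the incoming feature map.
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # One objectness score per anchor at each spatial location.
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)
        # Four regression deltas (dx, dy, dw, dh) per anchor at each location,
        # used to refine the anchor towards the associated ground-truth box.
        self.box_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, features: torch.Tensor):
        x = torch.relu(self.conv(features))
        return torch.sigmoid(self.objectness(x)), self.box_deltas(x)

# A 256-channel, 32x32 feature map yields 9 objectness maps and 36 delta maps.
scores, deltas = RPNHead()(torch.randn(1, 256, 32, 32))
```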

The RPN-based approach has been followed to build the fully convolutional instance segmentation network FCIS [120], which utilizes the position-sensitive score map approach first seen in the instance proposal model InstanceFCN [121] to assemble instance segmentations starting from the object proposals obtained by the RPN.

Following the same philosophy, Mask R-CNN [68] has become a top-performing instance segmentation architecture even by modern standards, by adopting the default Faster R-CNN [112]. Its RoI pooling layer is replaced by a novel RoIAlign layer to finely localize the predicted instance masks on the input image, and an instance segmentation branch is added in parallel to the Faster R-CNN box regression and classification branches.

As mentioned, Mask R-CNN employs the same upstream RPN module as Faster R-CNN to generate the candidate object proposals, which are then used to jointly perform instance segmentation, bounding box regression and classification.
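As a usage-level illustration, the sketch below runs an off-the-shelf Mask R-CNN, assuming the torchvision implementation (version 0.13 or later) is available; the RPN, the RoIAlign layer and the parallel box, class and mask heads are all bundled inside the returned model, and the random input image is purely illustrative.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Load a pretrained Mask R-CNN (Faster R-CNN detector + parallel mask branch).
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

# The detector expects a list of 3xHxW tensors with values in [0, 1].
image = torch.rand(3, 480, 640)
with torch.no_grad():
    (prediction,) = model([image])

# Per-instance outputs produced jointly from the shared RPN proposals.
boxes = prediction["boxes"]    # (N, 4) box coordinates
labels = prediction["labels"]  # (N,) predicted categories
scores = prediction["scores"]  # (N,) confidence scores
masks = prediction["masks"]    # (N, 1, H, W) soft instance masks
```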

Along the same line of work, another notable method is Hypercolumn [122], which employs object and instance proposals generated by multiscale combinatorial grouping [123] that are then passed to R-CNN [105]. Its bounding box predictions, filtered by means of non-maximum suppression, become the hypercolumn inputs required to predict the instance mask.

Both the instance proposals and the respective bounding boxes are considered to generate the image crop that is fed to a downstream convolutional architecture for segmentation.

A feature vector, called in fact a hypercolumn, is created from the convolutional feature maps of the chosen architecture, all rescaled to a common size so as to obtain the required hypercolumn vector.

This data structure provides multi-scale feature representations that are fed to multiple classifiers, whose outputs are aggregated to generate the final instance segmentation. However, this method assumes that the upstream detector can also provide the category associated with the segmented instance.
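The hypercolumn construction itself can be summarized with the short sketch below: feature maps taken from several depths of a convolutional backbone are rescaled to a common spatial size and concatenated channel-wise, so that each pixel ends up with one multi-scale feature vector. The framework (PyTorch) and the mock feature map shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def hypercolumns(feature_maps, out_size):
    """Rescale feature maps from several network depths to a common spatial
    size and stack them channel-wise: one multi-scale vector per pixel."""
    resized = [
        F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
        for f in feature_maps
    ]
    return torch.cat(resized, dim=1)  # (B, sum of channels, H, W)

# Three mock feature maps of decreasing resolution from a hypothetical backbone.
maps = [
    torch.randn(1, 64, 56, 56),
    torch.randn(1, 128, 28, 28),
    torch.randn(1, 256, 14, 14),
]
hc = hypercolumns(maps, out_size=(56, 56))  # (1, 448, 56, 56)
```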

As Mask R-CNN has become a staple of instance segmentation, research efforts have also been dedicated to alternative methods with the premise of obtaining comparable results. The deep watershed transform (DWT) [124] is one such method.

All object instances are considered as energy basins, which the two-stage DWT model separates by means of a per-basin predicted direction of descent of the energy, computed by its direction network (DN). Once the heatmap of the energy distribution is obtained through the second stage, the watershed transform network (WTN), a thresholding procedure based on the watershed algorithm is applied to finally separate all found instances.

A category is assigned to each instance by means of a preliminarily computed semantic segmentation mask, which allows filtering out of the DWT input all image regions unrelated to the categories of interest.
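A strongly simplified sketch of the final cutting step is given below: the predicted energy map is thresholded at a single energy level, and each connected component of the resulting binary map is treated as one instance. This is only a toy approximation of the watershed-based procedure of [124]; the NumPy/SciPy implementation and the synthetic energy map are assumptions.

```python
import numpy as np
from scipy import ndimage

def instances_from_energy(energy: np.ndarray, level: float):
    """Pixels whose energy exceeds `level` form the basins; every connected
    component of the binary map becomes one instance id (0 = background)."""
    basins = energy > level
    labels, num_instances = ndimage.label(basins)
    return labels, num_instances

# Two well-separated blobs in a toy energy map yield two instances.
energy = np.zeros((64, 64))
energy[10:20, 10:20] = 1.0
energy[40:55, 40:55] = 1.0
labels, n = instances_from_energy(energy, level=0.5)  # n == 2
```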

SSAP [125] adopts a similar procedure, but the semantic segmentation branch and the affinity pyramid used to compute preliminary per-pixel affinities share the same upstream convolutional feature maps.

Both the semantic output and the per-pixel affinities are merged by a cascaded graph partition module, which formulates instance mask generation as a graph partition optimization problem.
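As a rough, simplified stand-in for such a graph partition step, the sketch below builds a pixel graph whose edges are the horizontal and vertical affinities above a threshold and extracts instances as connected components. The actual SSAP module solves a hierarchical optimization, so this SciPy-based simplification is an illustrative assumption only.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def group_pixels(aff_right: np.ndarray, aff_down: np.ndarray, thr: float = 0.5):
    """aff_right has shape (H, W-1): affinity between (y, x) and (y, x+1).
    aff_down has shape (H-1, W): affinity between (y, x) and (y+1, x).
    Edges whose affinity exceeds `thr` are kept; connected components of the
    resulting pixel graph are returned as instance labels."""
    h, w = aff_right.shape[0], aff_right.shape[1] + 1
    idx = np.arange(h * w).reshape(h, w)
    rows, cols = [], []
    ys, xs = np.nonzero(aff_right > thr)   # horizontal edges
    rows += list(idx[ys, xs])
    cols += list(idx[ys, xs + 1])
    ys, xs = np.nonzero(aff_down > thr)    # vertical edges
    rows += list(idx[ys, xs])
    cols += list(idx[ys + 1, xs])
    adj = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(h * w, h * w))
    n_instances, labels = connected_components(adj, directed=False)
    return labels.reshape(h, w), n_instances
```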

Another novel line of work is that of dense sliding-window instance segmentation models.

First initiated by the DeepMask [126] instance proposal model, TensorMask [127] is the first instance segmentation model that simultaneously predicts instance masks and the associated category without any additional semantic segmentation pipeline or upstream object detector.

It achieves this by means of a 4D tensor representation of shape V × U × H × W, which models both the spatial location of instances on the H × W image plane and, simultaneously, the relative instance mask positions on the V × U planes.
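The representation can be visualized with the toy example below: a 4D array of shape V × U × H × W stores, at every spatial location of the H × W image plane, a V × U window encoding the mask of the instance anchored there. The shapes and the NumPy usage are arbitrary illustrative assumptions.

```python
import numpy as np

V, U, H, W = 15, 15, 64, 96                      # arbitrary illustrative sizes
mask_tensor = np.zeros((V, U, H, W), dtype=np.float32)

# Reading out the mask window predicted for the instance anchored at (y, x)
# on the H x W image plane: a V x U patch of per-pixel mask values.
y, x = 20, 48
instance_mask_window = mask_tensor[:, :, y, x]   # shape (V, U)
```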

This model represents a notable innovation, as it achieves results comparable to Mask R-CNN with a novel approach to instance segmentation, thus opening possible avenues for improvement over the current state-of-the-art detection-based models.