The rationale behind the above 55-5 subdivision is to keep the training set as large and diverse as possible in order to reduce the risk of overfitting, considering that only 60% of the whole dataset (approximately 6 hours) is publicly annotated and that the 60 sequences do not come from 60 different situations: many sequences are simply different clips of the same shots.
For this reason, the clips extracted for the validation set only contain shots that have direct siblings in the training set, and no unique shot was extracted. This carries a risk, since having no unseen setting in the validation set may bias the evaluation results, but keeping the training set as diverse as possible was preferred.
4.3.2 Preparing the samples
The annotated sequences are composed of a set of frames and an XML file containing the annotations for the frames of the sequence; these have to be parsed before they can be used by the TensorFlow estimator (the API object responsible for the training). A UA-DETRAC annotation file follows this structure:
<sequence name="MVI_20011">
    <sequence_attribute camera_state="unstable" sence_weather="sunny"/>
    <ignored_region>
        <box left="778.75" top="24.75" width="181.75" height="63.5"/>
        <!-- Possibly other regions... -->
    </ignored_region>
    <frame density="7" num="1">
        <target_list>
            <target id="1">
                <box left="592.75" top="378.8" width="160.05" height="162.2"/>
                <attribute orientation="18.488" speed="6.859" trajectory_length="5"
                           truncation_ratio="0.1" vehicle_type="car"/>
            </target>
            <!-- Other targets in the frame... -->
        </target_list>
    </frame>
    <!-- Other frames of the sequence... -->
</sequence>
The XML root is a sequence object which has the following children:
a sequence_attribute child that specifies the camera state and the weather, an ignored_region child which contains the boxes of regions that have not been annotated (e.g. far segments of a road, parking areas), and N frame children that contain the annotations for each of the N images in the sequence. Each frame object has a target_list with a series of target children, each containing a box with the coordinates of the rectangle that surrounds the object and an attribute object with data such as the category (car, bus, van or other) or the speed of the vehicle. The objects in the annotations are tracked between frames, so the target id can be seen as a track id that identifies the object during its passage through the scene.
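As a concrete illustration, the following sketch shows how such a file could be parsed with Python's standard ElementTree module. The element and attribute names come from the sample above, while the function name and the returned structure are purely illustrative and not the exact code used in this work.

import xml.etree.ElementTree as ET

def parse_sequence(xml_path):
    """Return the ignored regions and, for each frame, the annotated targets."""
    root = ET.parse(xml_path).getroot()            # <sequence>

    ignored = [box.attrib for box in root.findall("ignored_region/box")]

    frames = {}
    for frame in root.findall("frame"):
        targets = []
        for target in frame.findall("target_list/target"):
            box = target.find("box").attrib         # left, top, width, height
            attr = target.find("attribute").attrib  # vehicle_type, speed, ...
            targets.append({
                "track_id": target.get("id"),
                "bbox": [float(box["left"]), float(box["top"]),
                         float(box["width"]), float(box["height"])],
                "label": attr["vehicle_type"],
            })
        frames[int(frame.get("num"))] = targets
    return ignored, frames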
The annotated data that has to be fed to the training process in TensorFlow needs to be parsed into n TFRecord files, where n is configurable and corresponds to the number of shards. Each TFRecord file contains a series of TensorFlow Example objects, one per frame sample, holding not only the annotations but also the image data encoded as JPEG. Setting the number of shards to 1 produces a single large TFRecord file containing the samples of the whole dataset, while setting it to a value greater than 1 produces multiple files, which provides speed benefits by enabling concurrent readers and facilitates the shuffling operations while reading the samples. A very simple form of shuffling is already performed during the write operations, by iterating through the record files in a round-robin fashion while saving the frame samples. In this work, the number of shards was set to 20.
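A minimal sketch of such a sharded write is shown below, assuming one tf.train.Example has already been built per frame sample; the round-robin assignment of consecutive samples to different writers is what provides the simple shuffling mentioned above. The file naming scheme and function name are assumptions, not the exact code used in this work.

import tensorflow as tf

NUM_SHARDS = 20

def write_sharded(examples, output_base):
    """examples: iterable of tf.train.Example objects, one per frame sample."""
    writers = [
        tf.io.TFRecordWriter(
            f"{output_base}-{i:05d}-of-{NUM_SHARDS:05d}.tfrecord")
        for i in range(NUM_SHARDS)
    ]
    for index, example in enumerate(examples):
        # Cycling through the writers spreads consecutive frames of the same
        # sequence across different shards, giving a coarse shuffle at write time.
        writers[index % NUM_SHARDS].write(example.SerializeToString())
    for writer in writers:
        writer.close()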
Dealing with the ignored regions
A problem that has to be addressed is how to manage the ignored regions, since the TensorFlow Object Detection API does not have a built-in way to handle them. Leaving them untouched is not an option, since the loss function would penalize the network for any detection found in those non-annotated regions (they would be treated as false positives), so the simple solution adopted was to mask them with a uniform black overlay, as shown in Figure 4.4.
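A minimal sketch of this masking step follows, assuming the image has been decoded to an H x W x 3 NumPy array and that the regions carry the left/top/width/height fields of the XML annotation; it illustrates the idea rather than the exact code used in this work.

import numpy as np

def mask_ignored_regions(image, ignored_boxes):
    """Paint every ignored region of the frame with a uniform black overlay."""
    masked = image.copy()
    for box in ignored_boxes:
        left, top = int(float(box["left"])), int(float(box["top"]))
        width, height = int(float(box["width"])), int(float(box["height"]))
        masked[top:top + height, left:left + width, :] = 0  # black pixels
    return masked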
A limitation of this approach is that some context information is lost due to the masking, but since random crops are performed in the data augmentation step the loss should not be too noticeable. It would be interesting to analyze the performance differences, if any, between masking the image and handling the ignored regions directly in the training process, but this would require modifications to the algorithm itself, which is why this comparison was discarded. The masking solution was thus chosen since it works out of the box while still yielding accurate detection performance.
Figure 4.4. Cutout of ignored regions in input samples (ignored regions in the XML annotation vs. the resulting masked input sample)
4.3.3 Data augmentation
The smaller the training dataset, the higher the risk of overfitting it. A way of dealing with the overfitting problem is to virtually enlarge the dataset by creating new samples from the original ones; this process is typically referred to as data augmentation. In this work data augmentation is done online at training time by a preprocessor offered by the Object Detection API.
In this way, no additional input data has to be created and stored. The preprocessor has been configured to mimic the augmentations described in the Faster R-CNN and SSD papers.
For the Faster R-CNN architectures the input samples are only randomly flipped horizontally, while for the SSD architectures the inputs are randomly flipped and also randomly cropped. The random crop method follows the one defined in the SSD paper, where each input sample is randomly processed in one of the following ways (a sketch of this sampling logic is given after the list):
1. Left as is
2. Cropped with an IoU threshold of 0.1, 0.3, 0.5, 0.7, or 0.9 with at least one ground-truth object
3. Cropped without regard to the above constraint (implemented by setting the IoU threshold to 0.0)
The image crop is then performed by sampling a random size between 0.1 and 1 of the original image size and an aspect ratio between 0.5 and 2.
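The following sketch illustrates this sampling logic in plain Python, under the assumption that the size constraint refers to the fraction of the original image area and that ground-truth boxes are given as [ymin, xmin, ymax, xmax] in pixels. The function names and the rejection-sampling loop are illustrative and do not reproduce the Object Detection API implementation.

import math
import random

MIN_IOU_OPTIONS = [None, 0.1, 0.3, 0.5, 0.7, 0.9, 0.0]  # None = keep image as is

def iou(box_a, box_b):
    """IoU of two [ymin, xmin, ymax, xmax] boxes."""
    ymin, xmin = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ymax, xmax = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ymax - ymin) * max(0, xmax - xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def sample_crop(image_h, image_w, gt_boxes, max_trials=50):
    """Return a crop window [ymin, xmin, ymax, xmax], or None to keep the image."""
    min_iou = random.choice(MIN_IOU_OPTIONS)
    if min_iou is None:
        return None
    for _ in range(max_trials):
        area_fraction = random.uniform(0.1, 1.0)   # fraction of the original area
        aspect = random.uniform(0.5, 2.0)
        crop_h = int(image_h * math.sqrt(area_fraction / aspect))
        crop_w = int(image_w * math.sqrt(area_fraction * aspect))
        if crop_h == 0 or crop_w == 0 or crop_h > image_h or crop_w > image_w:
            continue
        ymin = random.randint(0, image_h - crop_h)
        xmin = random.randint(0, image_w - crop_w)
        window = [ymin, xmin, ymin + crop_h, xmin + crop_w]
        # Accept the crop only if it overlaps at least one ground-truth box with
        # IoU >= min_iou (a threshold of 0.0 accepts any overlapping window).
        if any(iou(window, gt) >= min_iou for gt in gt_boxes):
            return window
    return None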