NanoSciTracker: an object tracking library for microbiology and an industrial collimation algorithm optimisation

Academic year: 2021



NanoSciTracker: an object tracking library for microbiology and an industrial collimation algorithm optimisation

Supervisor(s): Dr Stefano Cozzini, Dr Giuseppe Piero Brandino

Candidate: Luis Gerardo León Vega

6th edition

2019–2020


the literature related to the topic and introducing my knowledge and experimental results. In the cases where I have made use of external references, I have indicated the sources and their references. Consequently, I assume the entire responsibility for this work and its content.

Luis Gerardo León Vega
Trieste, Italy, October 30th, 2020

This work is licensed under a Creative Commons "Attribution-NonCommercial-ShareAlike 4.0 International" licence.


Acknowledgements

This work was funded by the Istituto Officina dei Materiali, part of the Consiglio Nazionale delle Ricerche (CNR-IOM), and supervised by eXact Lab S.R.L.

The scientific material used during this project (video sequences) and technical information were also provided by the Scuola Internazionale Superiore di Studi Avanzati (SISSA) di Trieste.

During the development process, Dr Giuseppe Piero Brandino (eXact Lab) supervised the industrial application part of this project, while Dr Stefano Cozzini (CNR-IOM, eXact Lab) supervised the scientific part. Dr Vincent Torre (SISSA) provided the data required for the GSC tracking application.

Luis Gerardo León Vega


Abstract

Automating the study of cell motility opens the opportunity to accelerate the analysis of treatments for several illnesses caused by cancer cells, evaluating the performance of new drugs and repositioning existing drugs for customised medication.

Differently from other brain cells, glioblastoma stem cells have greater motility, and some drugs can boost their expansion. Tracking and measuring their motility is a formidable challenge for most computer vision techniques, given the low-quality images, segmentation issues, instance splitting, and shape variance.

This work explores the utilisation of a two-stage pipeline composed of a feedbacked Otsu's-threshold-based detector and Kernelised Correlation Filters (KCF) object tracking to detect and track the cells. Additionally, the labelling process utilises a weighted multi-feature approach, which takes several features from the objects and matches them for tracking recovery with the aim of label preservation. By utilising this approach, the detection error obtained was around 17%. In contrast, the tracking valid ratio (healthy trackers over detections) achieved 100% for a single-scene approach, and 77.11% for a multi-scene approach, for a non-chaotic and well-behaved video sequence.

Additionally, this work presents a performance optimisation for a collimation algorithm. The optimisation consists of data-layout transformations, AVX-512 vectorisation, and multithreading applied to a perforated loop. The optimisation achieves a 4.65× speed-up by using six threads and AVX-512 intrinsics.

Keywords: object detection, subtraction techniques, image sequence analysis, feature detection, feature extraction, high-performance computing, vector processor, microscopy, brain cancer, glioblastoma stem cells, multi-scene tracking


Contents

List of Figures

List of Tables

1 Introduction
1.1 Collimation algorithm optimisation
1.1.1 Context
1.1.2 Problem
1.1.3 Technical challenges
1.1.4 Contribution
1.2 Glioblastoma stem cells tracker
1.2.1 Context
1.2.2 Problem
1.2.3 Technical challenges
1.2.4 Contribution

2 Collimation algorithm optimisation
2.1 Background and Related Work
2.1.1 Collimation algorithm
2.1.2 CPU-level parallelism techniques
2.2 Optimisation
2.2.1 Data representation
2.2.2 Vectorised roto-translation
2.2.3 Proposed improvement
2.2.4 Multithreaded roto-translation
2.3 Results

3 Glioblastoma stem cells tracker
3.1 Background and Related Work
3.1.1 Related research
3.1.2 Detection algorithms
3.1.3 Tracking algorithms
3.1.4 Object features
3.2 Architecture proposal
3.2.1 Application stack proposal
3.3 Scene block
3.3.1 Requirements
3.3.2 Proposed architecture
3.3.3 Object detector
3.3.4 Object tracker
3.3.5 Cross matcher
3.4 Multi-scene tracker
3.4.1 Requirements
3.4.2 Proposed architecture
3.4.3 Global block class
3.4.4 Features
3.4.5 Global matcher
3.4.6 Records dumping
3.5 Results
3.5.1 Metrics
3.5.2 Real use case test
3.5.3 Statistics
3.5.4 Critical case

4 Conclusions
4.1 Future work
4.2 Repository

Bibliography


List of Figures

1.1 Point cloud collimation algorithm functionality. The collimation (also known as registration) tries to align the point clouds to reduce the difference between their points

1.2 Collimation algorithm flow. The process starts by performing a rough collimation, and continues with the iterative point cloud collimation until converging

1.3 Raw capture example. The image has been scaled by a factor of 0.0622 to convert it from 12 bits to 8 bits

1.4 Refined video sequence. An example of how the final frame looks after applying the processing and stitching the captures. Each frame is composed of four captures distributed horizontally and vertically

1.5 Computer vision problems observed on the provided video sequence and how they potentially impact the detection and tracking processes

2.1 The original average roto-translation algorithm processes one point at a time per loop iteration

2.2 OpenMP fork-join execution model. When the master thread (main) enters a parallel region, it forks several workers. When the workers finish their execution, they join into the master thread. Retrieved from [6]

2.3 OpenMP scheduling model for loop unrolling. In static scheduling, the iterations execute in an ordered fashion. In dynamic scheduling, the threads take the pending indices

2.4 SIMD operation. The instruction performs an element-wise addition on vectors A and B

2.5 The incoming point cloud is sampled and packaged into a vector of single-precision floats. Then, it is split into two vectors of double-precision floats

2.6 Vectorised average calculation algorithm. The proposed algorithm uses AVX instructions, replacing the branches by mask and data filtering. To preserve accuracy, the computation of the angle is still serial

2.7 Operational frequency while enabling the vector units for several numbers of cores. Based on [28]

3.1 Cell detection by using image segmentation based on deep learning [57]

3.2 Fundamental issues within image segmentation


3.3 Types of multi-tracking. Non-overlapping cameras consider a monocular case, whereas overlapping cases consider high overlaps between scenes

3.4 Usiigaci architecture. The segmentation module uses an R-CNN model, whereas the tracking uses k-d tree tracking. Image retrieved from [57]

3.5 TrackMate functionality example. It is possible to reconstruct the tracklets on the left and observe the lineage on the right. Retrieved from [56]

3.6 Comparison of the several thresholding methods provided by the OpenCV library. The comparison employs several pixel-level distributions to show the effectiveness of each method. Retrieved from [8]

3.7 Background subtraction algorithm example. Based on the movements in the original image, the algorithm is capable of detecting the moving objects and determining the foreground [8]

3.8 Mask R-CNN architecture for instance segmentation. Retrieved from [24]

3.9 HOG features applied to a glioblastoma cell image, with 8x8 blocks, one cell each. Low and high pixel values are represented by blue and yellow colours, respectively

3.10 HOG processing pipeline. The gradients are computed through the Sobel filter, according to equations (3.31) and (3.32)

3.11 Haar-like features used in face detection and recognition. Images retrieved from [18]

3.12 Proposed stack architecture for the glioblastoma stem cell tracking application. The fundamental idea is to create a multi-scene tracker, focusing on autonomous local tracking scenes and matching for labelling

3.13 Proposed architecture for the scene blocks. The frame is used as-is to detect new objects and update the current trackers. A cross-matching process filters the new detections from the currently tracked objects. Each scene block has four lists of trackers with their features

3.14 Luma histogram for a single-cell window. Most of the pixels have low levels with some variance due to noise. It is also possible to observe that true negative and true positive samples are likely to have a Gaussian distribution

3.15 Detection preprocessing pipeline. The input image is binarised by using Otsu's thresholding. Afterwards, the preprocessing applies a morphological dilation and an opening (with a 17x17 kernel) on the full image (2560x1920). The algorithm finally draws the bounding boxes after the morphological operations

3.16 Detector pipeline used for each scene block. Each scene block is 1280x960 and is resized for faster detection

3.17 Tracker wrapper class. This class contains the ID, the features, and the KCF tracker

3.18 Tracking statistics

3.19 Criteria used by the cross matcher to evaluate when a detection matches a currently tracked instance


3.20 Proposed architecture for the global block. The frames from the several scenes feed the scene blocks. The results are concatenated, matched at a global level by their features, and finally stored

3.21 Global block class. This class wraps the scenes and the tracker lists, and contains the matcher for the tracking resolution

3.22 Histogram example from a tracking instance

3.23 HOG features example from a tracking instance

3.24 MOSSE filter demonstration: extracting the original F0, generating the template filter Ĥ, and getting the impulse response G̃i

3.25 Global matcher architecture. The lists are filtered to remove redundancies coming from the scenes; then the feature comparison takes place. The matcher thresholds the score and decides whether the new tracker should be labelled, leading to the removal of the best candidate from the list

3.26 Tracking statistics: the evolution of the tracking population as the frames advance in the video sequence. There is a convergence between the detections and the number of trackers, which shows the robustness of the detection-tracking pair

3.27 Tracking statistics for multi-scene: compared to the single-scene approach, the multi-scene approach does not manage to track all the detections, leading to a higher number of dead trackers

3.28 Playground world with four scenes. The tracker colours are: valid in-scene (white), out-of-scene (red), dead (purple), new detections (blue)

3.29 Tracking statistics for multi-scene in a playground: the playground corresponds to an ideal case with semi-linear displacements. The valid trackers are around 50% of the total detections. Nevertheless, the out-of-scene trackers complete the number of detections, suggesting that the detections also include those trackers. The dead trackers are generated by object occlusion, and their rate is relatively low compared to the total trackers

3.30 Chaotic video example where the number of cells multiplies by a factor greater than two. This application cannot address this type of case


List of Tables

2.1 Initial profiling results of the collimation algorithm. Results reported by Microsoft Visual Studio 2017. The profiling skips some parts since they do not consume more than 10 microseconds. The functions in bold represent parent functions, and the non-bold represent subfunctions of the parent

2.2 Optimisation of the average computation with AVX. Runtime results before and after applying the AVX vectorisation to the average computation

3.1 Application statistics in terms of detections and tracking instances


Chapter 1

Introduction

This work addresses two different problems, condensed in this document. For each problem, refer to the sections called:

• Collimation algorithm optimisation: for the high-performance computing optimisation of the industrial collimation algorithm.

• Glioblastoma stem cells tracker: for the computer vision proposal for characterising the cell motility by using the two-stage feedbacked pipeline with multi-feature support.

1.1 Collimation algorithm optimisation

1.1.1 Context

One of the customers of eXact Lab asked for the optimisation of an iterative point cloud collimation algorithm. The point cloud contains about 6000 points, which characterise a transversal view of several metallic profiles, such as H-shaped profiles, railway profiles, and others. For the capturing system, the customer uses laser-based cameras to capture the edges of the profiles, which are decomposed into X and Y coordinates.

The iterative point cloud collimation algorithm aims to reduce the difference between two point cloud sets (see Figure 1.1). For this purpose, the algorithm computes the difference between the points and determines the minimum. Then, it averages the differences to determine a displacement vector, used for rotating and translating the point cloud to collimate. The process repeats until the difference is below a fixed threshold, as illustrated in Figure 1.2. In the customer's use case, the algorithm takes around 4–8 iterations to complete a collimation according to their fixed threshold.
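The iteration described above can be sketched in plain C++. This is a minimal, hypothetical illustration: translation only (the rotation correction is omitted), with a brute-force nearest-point search in place of the customer's GPU-assisted point lookup, and invented names such as `collimateStep`.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

struct Point { double x, y; };

// One naive collimation iteration: for every point, find its nearest
// neighbour in the reference cloud, average the differences, and apply
// the resulting displacement. Returns the magnitude of the displacement.
double collimateStep(std::vector<Point>& cloud, const std::vector<Point>& ref) {
    double dx = 0.0, dy = 0.0;
    for (const Point& p : cloud) {
        double best = std::numeric_limits<double>::max();
        Point diff{0.0, 0.0};
        for (const Point& r : ref) {          // brute-force nearest point
            const double ddx = r.x - p.x, ddy = r.y - p.y;
            const double d2 = ddx * ddx + ddy * ddy;
            if (d2 < best) { best = d2; diff = {ddx, ddy}; }
        }
        dx += diff.x;                         // accumulate the differences
        dy += diff.y;
    }
    dx /= static_cast<double>(cloud.size()); // average displacement vector
    dy /= static_cast<double>(cloud.size());
    for (Point& p : cloud) { p.x += dx; p.y += dy; }
    return std::hypot(dx, dy);
}
```

The caller would repeat `collimateStep` until the returned displacement falls below the fixed threshold or the iteration counter runs out, mirroring the loop in Figure 1.2.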

Moreover, the collimation algorithm is part of a point cloud processing pipeline composed of three stages:


Figure 1.1: Point cloud collimation algorithm functionality. The collimation (also known as registration) tries to align the point clouds to reduce the difference between their points

• Elaboration: initialises data structures and standardises the data representation.

• Collimation: collimates the incoming profile with respect to a reference.

• Analysis: Performs some analysis to look for anomalies.

Since it is a pipeline, the slowest stage determines the execution time, and the stages execute in parallel. Besides, pointer deques connect the several pipeline stages. Currently, the collimation stage is the slowest of the whole pipeline.

1.1.2 Problem

The customer hired eXact Lab to optimise the algorithm and make it run faster, since it does not meet the target runtime. A high runtime impacts the production line negatively, making the analysis process slower. The final users of the customer's solution are asking for the algorithm to be sped up to increase productivity.

According to the customer’s requirements:

• I001: the pipeline shall run in less than 250 microseconds. This involves all the stages, especially the slowest one. It also represents a framerate of 4000 frames per second (fps).

• I002: the pipeline shall keep its length of 3 stages. It is possible to move some functions from one stage to another as long as it does not impact the performance negatively.

• I003: the optimisation can make use of GPU resources and CPU threads as long as other parts are not affected.


[Flowchart: rough collimation by aligning the centre of the bounding box; then, iteratively: calculate the average rotation and translation, perform the rotation and translation, check the threshold; an exit check and iteration counter check decide whether to continue; on convergence, queue the result and exit.]

Figure 1.2: Collimation algorithm flow. The process starts by performing a rough collimation, and continues with the iterative point cloud collimation until converging

• I005: the whole algorithm must run on Windows and be compiled using the built-in compiler from Visual Studio 2017. This is mandatory to meet the requirements of other legacy code.

The original algorithm, prior to any optimisation, runs within a range between 0.95 milliseconds and 1.05 milliseconds, which exceeds the target runtime by roughly four times. Thus, the goal is to optimise the algorithm to speed it up by at least four times to meet I001 while taking into consideration the other requirements listed above.
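As a quick consistency check on these numbers, the 250-microsecond budget of I001 and the ~1 ms baseline can be related with two trivial helpers (hypothetical names, for illustration only):

```cpp
#include <cassert>

// Converts a per-frame latency budget in microseconds to a framerate.
constexpr double framesPerSecond(double budgetMicroseconds) {
    return 1.0e6 / budgetMicroseconds;
}

// Speed-up factor required to bring a measured runtime down to the budget.
constexpr double requiredSpeedup(double runtimeMicroseconds, double budgetMicroseconds) {
    return runtimeMicroseconds / budgetMicroseconds;
}
```

`framesPerSecond(250.0)` gives the 4000 fps stated in I001, and `requiredSpeedup(1000.0, 250.0)` gives the 4× target for the ~0.95–1.05 ms baseline.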

1.1.3 Technical challenges

There are some critical considerations before addressing the problem:

• Thread parallelism can be harmful to the other threads, since the system can fall into resource starvation at the processing level.


• There is no possibility of splitting the problem into new pipeline stages. According to the customer, this would increase the latency, which is not convenient for the production line, especially when detecting anomalies (later detection).

• The system also runs other applications, which shall not be affected by the optimisation process.

It is also possible to exploit GPU acceleration during the process. Nevertheless, this increases the system cost in terms of adding new hardware to the setup and correctly handling the cooling. The system will be installed in a rack with little space available. Multi-GPU approaches may increase the cooling costs, which is not desired by the customer.

Moreover, GPU acceleration is reserved for the next optimisation sprint, which involves accelerating the Analysis stage.

1.1.4 Contribution

This part of the work aims to speed up the customer's application to meet the required runtime. The project's impact relies on the final users' requirements, enabling faster production lines around the globe.

1.2 Glioblastoma stem cells tracker

1.2.1 Context

Within CNR-IOM, a team is devoted to researching the molecular mechanisms of disease onset, with the aim of designing new drugs and repositioning existing ones. The ARES project is one of the projects dedicated to brain cancer research, presenting innovative approaches towards personalised treatments of glioblastoma, one of the most common malignant primitive brain tumours [15].

One of the activities within the ARES project is to study the behaviour of the glioblastoma stem cells (GSC) in response to some drugs to create a customised treatment for the patients, increasing the effectiveness of the therapy.

Some of the experiments done by the research team consist of recording a sample of GSCs for 72 hours, dividing the process into two main stages:

• Control (6 hours): the GSCs are in a controlled environment.

• Inhibited (64 hours): the researchers add a drug to the sample.


The control stage allows characterising the behaviour of the GSCs in a stable environment, whereas the inhibited stage allows researchers to measure the effect of some drugs on the GSCs.

For the measurements, four different devices record frames for 72 hours at a resolution of 1280x960 with 12 bits of depth in grayscale (see Figure 1.3). The captures are naïvely stitched to make a video sequence with 810 frames. The sampling time for the frame capture is approximately 5 minutes.

Figure 1.3: Raw capture example. The image has been scaled by a factor of 0.0622 to convert it from 12 bits to 8 bits

The provided refined video sequences are the stitching of the four captures plus a preprocessing step to correct the pixel levels, since most of the pixels in the raw image do not exceed 32 out of 255 in the 8-bit image representation. Figure 1.4 shows an example of a preprocessed frame.
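The 0.0622 factor cited in Figure 1.3 is approximately the ratio between the 8-bit and 12-bit full-scale values (255/4095 ≈ 0.0623). A minimal sketch of such a conversion, with a hypothetical helper name (this is not the project's actual preprocessing code):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Scales 12-bit pixel values (0–4095) down to 8-bit values (0–255).
std::vector<std::uint8_t> to8Bit(const std::vector<std::uint16_t>& raw12) {
    const double scale = 255.0 / 4095.0;   // ≈ 0.0623, the factor cited for Figure 1.3
    std::vector<std::uint8_t> out;
    out.reserve(raw12.size());
    for (std::uint16_t v : raw12)
        out.push_back(static_cast<std::uint8_t>(std::lround(v * scale)));
    return out;
}
```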

1.2.2 Problem

Most of the analysis involves a researcher looking at the video sequence to observe the phenomenon. The process repeats for each inhibitor or experiment performed on a set of cells. When comparing the behaviour of the GSCs under an inhibitor, the researcher has to compare the video sequences visually. The difficulty of these comparisons scales as the number of experiments increases.

Therefore, the problem to address is to facilitate the comparison process by recording the tracks and providing a proper representation for the researchers without exhausting their sight.


Figure 1.4: Refined video sequence. An example of how the final frame looks after applying the processing and stitching the captures. Each frame is composed of four captures distributed horizontally and vertically.

1.2.3 Technical challenges

The problem can be addressed by utilising Computer Vision (CV) techniques such as object detection and tracking. Nevertheless, there are some challenges involved in the solution development, since there are non-classical elements and phenomena. Figure 1.5 illustrates some of the issues, which are listed below:

Object detection

1. Object merging: the objects may be merged into one single instance in partial occlusion cases (Figure 1.5a).

2. Object elasticity: the objects may be segmented into one single instance given their elasticity after a separation (Figure 1.5b).

3. Auto-focus readjustment: the capture devices have auto-focus enabled, blurring the images for more than three frames in a row and hindering the edge detection (Figure 1.5c).

4. Intensity: some instances may be confused with the background because of their pixel levels (Figure 1.5d).


Object tracking

1. Object solidness: the cells are amorphous. Trackers are usually deployed just after detecting a cell, and most of the available trackers are intended for solid objects, so they might be improper for non-solid objects (Figure 1.5b).

2. Object occlusion: the objects may be merged into one single instance in partial and full occlusion cases (Figure 1.5a).

3. Object splitting: the cells may perform mitosis, leading to an instance splitting phenomenon. Most trackers do not support instance division (Figure 1.5e).

4. Object similarity: since the cells have fuzzy visual features, the tracker may confuse one instance with another (Figure 1.5e).

5. Multi-scene tracking: the recording comes from four different sources, which are naïvely stitched into a single frame. This requires adding extrinsic parameters to set the coordinate system and controlling when a cell goes from one source to another to assign the proper label (Figure 1.5f).

Besides the CV issues mentioned above, there is no labelled data to work with, making it harder to use cutting-edge techniques based on deep learning.

1.2.4 Contribution

This work proposes a two-stage feedbacked detection-tracking pipeline for cell motility characterisation, to automate the analysis of drug performance and the comparison of the effectiveness of new and existing drugs.


(a) Cell occlusion: one cell is on top of another. This shows how cells can potentially merge and occlude others.

(b) Cells change their shape, stick to others, and present an amorphous phenomenon.

(c) The capture gets blurry because of the source autofocus.

(d) The pixel levels of the cells decay with time because of the consumption of the fluorescent material.

(e) The cells can perform the mitosis process, splitting themselves to reproduce.

(f) The image is composed of four captures from four different sources.

Figure 1.5: Computer vision problems observed on the provided video sequence and how they potentially impact the detection and tracking processes


Chapter 2

Collimation algorithm optimisation


2.1 Background and Related Work

This section compiles the information needed to tackle the optimisation of the collimation algorithm. It starts with a global profiling of the collimation algorithm to identify the hotspots, then presents the information required to choose the proper optimisation techniques to address the problem, and concludes with a summary of the optimisations needed based on the information presented during the section.

2.1.1 Collimation algorithm

Figure 1.2 presents a flow diagram of the algorithm to optimise. Some parts are missing, such as the CPU-GPU transfers, since the average computation uses the GPU to parallelise the point lookup. Moreover, the whole stage runs in a single thread.

The sections of this algorithm are:

• Profile upload: uploads the profile to the GPU.

• Rough collimation: computes the difference between the centre of the bounding box and the centre of the profile. It allows centring the profile in case of an offset.

• Iterative collimation: performs the collimation using Iterative Closest Point (ICP) registration. The process repeats until the profile has an absolute average difference below a fixed threshold (0.01 for the cartesian coordinates and 0.001 radians for the rotation angle).

  – Average calculation: computes the average translation of the profile in cartesian coordinates, and the average rotation extracted from an orientation vector.

  – Roto-translation: performs the correction of the angle and the displacement for every point of the profile.
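The roto-translation step amounts to a 2-D rotation plus a shift applied to every point. A serial sketch with hypothetical names (this is the naive form, before the vectorisation discussed in Section 2.2):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Point { double x, y; };

// Applies the correction computed by the average calculation: rotates every
// point by `angle` (radians) around the origin, then shifts it by (dx, dy).
void rotoTranslate(std::vector<Point>& profile, double angle, double dx, double dy) {
    const double c = std::cos(angle), s = std::sin(angle);
    for (Point& p : profile) {
        const double xr = c * p.x - s * p.y;   // 2-D rotation
        const double yr = s * p.x + c * p.y;
        p.x = xr + dx;                         // translation
        p.y = yr + dy;
    }
}
```

Each loop iteration is independent of the others, which is exactly what makes this loop a candidate for the SIMD and multithreading techniques described next.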

Among the sections of the algorithm, the iterative collimation can have an excessive runtime depending on the number of iterations required to meet the stop condition.

Profiling

Table 2.1 summarises the initial collimation algorithm performance as a baseline for the optimisation analysis. Some parts are missing since the profiler does not show them because of their low runtime consumption.

From Table 2.1, Average Computation and Rotate and Translate are the most time-consuming functions of the algorithm, taking 875 microseconds in aggregate, apart from the profile upload, which consumes time because of serialisation. In the case of five iterations, the algorithm consumes 1250 microseconds.


Table 2.1: Initial profiling results of the collimation algorithm. Results reported by Microsoft Visual Studio 2017. The profiling skips some parts since they do not consume more than 10 microseconds. The top-level entries represent parent functions, and the indented entries represent subfunctions of the parent.

Function/Subfunction            Time (us)   Iterations   Average (us)
Upload                          590         1            590
  cudaSetDevice                 30          1            30
  cudaMemcpyAsync               50          1            50
Average Computation             180         2 – 5        630
  cudaStreamSynchronize         70          2 – 5        245
  findPointsAverageRotTransl    30          2 – 5        105
  cudaSetDevice                 20          2 – 5        70
  cudaMemcpyAsync               10          2 – 5        35
  cudaMemcpyAsync               100         1            100
Rotate and Translate            70          2 – 5        245

Within the hotspots, the CUDA stream synchronisation and the transfers spend a critical amount of time in the algorithm. To address the transfer time, the algorithm should run on a single device (including the other stages), or the upload should move to the Elaboration stage, which is the fastest stage without runtime issues, according to the customer. On the other hand, the synchronisation costs are unavoidable when using a GPU, since they guarantee that the job has finished.

None of the functions of the rough collimation appeared in the profile since their consumption is not relevant. Hence, the focus of the optimisation will be the iterative collimation. For future reference, Figure 2.1 depicts the dataflow of the algorithm.

2.1.2 CPU-level parallelism techniques

This section explores several techniques for CPU-level optimisation by exploiting modern processor features, such as multiple cores, vector instructions, branch prediction, and others.

The analysis will respect what is stated in I003, and I004 as a non-explored approach within the algorithm. There are other models of parallelism, such as process-oriented parallelism, not covered in this section given the pre-existing structure of the project and the chosen execution model.


[Flowchart: for each point, find the best map to the reference profile; if the map is valid and the translation is greater than 10, accumulate the translation; compute the difference to the barycenter; if the difference is not zero, compute the distance and its inverse; if the angle is not NaN, compute and accumulate the angle; then grab a new point and repeat.]

Figure 2.1: The original average roto-translation algorithm processes one point at a time per loop iteration


Thread-level parallelism

Thread-level parallelism is an exploitable capability of multi-core processors. The background of multi-core processors relies on Moore's law for microprocessors, which predicts that the number of transistors integrated within a microprocessor die will double every 24 months, according to the plot from the original article written by Gordon E. Moore [43]. The number of transistors is often a parameter to specify how complex a microprocessor is. In the early stages of the microprocessor, the speed of the processor was highly dependent on the clock speed. However, Moore's law started to enter a crisis because of the physical limits of scaling the transistor [33][34]. Some of the issues are:

• Miniaturisation: miniaturising the transistor carries considerable investments in terms of R&D for the fabrication labs.

• Voltage-frequency trade-off: to compensate for the power dissipation of the processors, an increase in frequency involves a decrease in voltage.

• Leakage: because of quantum effects (such as the tunnel effect), miniaturising the transistor implies a smaller barrier to the electrons, allowing them to pass through the channel without polarising the transistor 2.

• Voltage threshold: usually, the transistors require a minimum voltage to activate their saturation region. If the voltage is not enough, the transistor will fall into the resistive zone, where the digital state is undefined (it is neither 0 nor 1).

Superscalar and multi-core microprocessors emerged as an alternative to avoid dealing with the voltage-frequency knobs [50]. This carried some changes because of the possibility of race conditions3 and divergence in processes or threads. However, multi-core microprocessors allow users to exploit parallelism and concurrent execution, speeding algorithms up linearly with respect to the number of cores.

In terms of frequency, the manufacturers preferred to fix low frequencies, which hardly reach 3 GHz, to keep the power dissipation under control.

It is possible to have multiple threads within a single core through Simultaneous Multithreading (SMT) [44], which aims to maximise the utilisation of the execution units by executing more than one thread within each CPU core. However, SMT is not suitable for applications which exploit the same execution units at the same time. A mathematically intensive application will reach its peak when using thread affinity: one thread per core.

Since the algorithm is written in C++, some possible ways to implement thread-level parallelism are:

2This is known as leakage current, which is also explained by the reduction of the resistance of the transistor channel.

3A race condition occurs when two processes try to write to the same memory address, corrupting the data.


• WinAPI threads and POSIX threads

• std::thread

• OpenMP

Windows API threads are limited to Windows-based operating systems, and they are not portable to UNIX systems, since the model used by UNIX is POSIX [41][22].

However, there are other Application Programming Interfaces (APIs) which offer portability amongst platforms without changing the interface. std::thread is one of these interfaces, simplifying access to multithreading within the C++ programming language. The std::thread API can run on several machine architectures and operating systems, such as Windows, macOS, and Linux-based systems. Moreover, the std::thread API resembles the POSIX API [40].
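A minimal example of the portable multithreading that std::thread enables (illustrative only; the helper name is invented):

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums a vector in two halves: the first half on a worker thread,
// the second half on the calling thread.
long long parallelSum(const std::vector<int>& data) {
    long long partial[2] = {0, 0};
    const std::size_t half = data.size() / 2;
    std::thread worker([&] {                  // first half on the worker
        partial[0] = std::accumulate(data.begin(), data.begin() + half, 0LL);
    });
    partial[1] = std::accumulate(data.begin() + half, data.end(), 0LL);
    worker.join();                            // wait for the worker to finish
    return partial[0] + partial[1];
}
```

The same source compiles unchanged on Windows, macOS, and Linux, which is the portability argument made above.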

Another API is OpenMP. Alex Vrenios presents in [60] the programming model of MPI and OpenMP in the C language, which can be ported to C++ straightforwardly. He summarises several kinds of parallelism and the OpenMP basics.

OpenMP introduces the concept of parallel regions, which are scope-based and define regions executed by multiple threads. To define a parallel region, OpenMP presents the pragma in Listing 2.1.

Listing 2.1: OpenMP parallel region pragma [19]

const int numberOfThreads = 4;

#pragma omp parallel num_threads(numberOfThreads)
{
    int threadNum = omp_get_thread_num();
}

The parallel region defined by OpenMP also accepts the number of threads needed within the region through the num_threads clause. Also, within the parallel region, OpenMP makes available the function omp_get_thread_num, which returns the thread index.

Compared to the other multithreading APIs, OpenMP is the least invasive thanks to its scoping approach, which enables the user to implement thread-level parallelism without significant changes to the code.

OpenMP uses the fork-join model when the execution enters and exits a parallel region (see Figure 2.2). This model allows OpenMP to launch workers when the execution requires it, and to synchronise them when the workers finish their job [6]. This process is entirely transparent to the user, who only defines the scopes where the parallel region begins and ends, as Listing 2.1 shows.

Moreover, OpenMP provides a pragma for loop work-sharing. It allows the programmer to concentrate on the algorithm instead of taking care of the parallelism and index handling.


Figure 2.2: OpenMP fork-join execution model. When the master thread (main) enters a parallel region, it forks several workers. When the workers finish their execution, they join back into the master thread. Retrieved from [6]

Figure 2.3: OpenMP scheduling model for loop work-sharing: (a) static scheduling, where the iterations execute in an ordered fashion; (b) dynamic scheduling, where the free threads take the pending indices.

The index assignation occurs in two ways: statically, which means that the indices are assigned sequentially and in order (Figure 2.3a), and dynamically, where the free workers take the indices which are still pending execution (Figure 2.3b). Listing 2.2 shows a code snippet to illustrate how the work-sharing pragma works. To define the index assignation mode (a.k.a. scheduling), the pragma provides an option called schedule, which receives the scheduling mode: static for static scheduling and dynamic for dynamic scheduling. Listing 2.3 presents the output for the code snippet provided by Listing 2.2.

Listing 2.2: OpenMP loop work-sharing example with a for loop

const int numberOfThreads = 4;

#pragma omp parallel num_threads(numberOfThreads)
{
    #pragma omp for schedule(static)
    for (int i{0}; i < N; ++i) {
        std::cout << "Hi from: " << omp_get_thread_num() << "\n";
    }
}
std::cout << std::flush;

Listing 2.3: OpenMP loop work-sharing example result

Hi from: 0
Hi from: 1
Hi from: 2
Hi from: 3
Hi from: 0
Hi from: 1
...

One of the aspects to consider from the fork-join execution model is that the thread spawning can add some administrative overhead to the execution, especially considering that microseconds matter for the application to optimise [21].
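To get a feeling for this overhead, a single spawn-join cycle can be timed with std::chrono. This is a hedged sketch with an illustrative function name, not the profiling method used in the project:

```cpp
#include <chrono>
#include <thread>

// Time one thread spawn-join cycle with an empty body, so only the
// administrative overhead of the fork-join model is measured.
long long spawn_join_overhead_us() {
    auto t0 = std::chrono::steady_clock::now();
    std::thread worker([] {});  // empty worker: no useful work
    worker.join();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
        .count();
}
```

On a runtime budget of a few hundred microseconds, even a handful of microseconds per spawn is not negligible, which motivates reusing worker pools as OpenMP implementations typically do.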

Instruction-level parallelism

A program which runs the same operation on a series of data is an example of an application which is parallelisable by using data-level parallelism. Depending on the dependencies amongst the data, the degree of parallelism decreases [5]. Data parallelism is applicable in several areas, such as instruction-level parallelism, multithreading, and distributed systems.

The instruction-level parallelism includes the capability of a processor to execute the same instruction on multiple data. This feature is also known as Single-Instruction Multiple Data (SIMD) [47]. Figure 2.4 illustrates how a SIMD addition works. The instruction performs an element-wise addition between the vector A (composed of a0, a1, ..., an) and the vector B (composed of b0, b1, ..., bn), producing (a0 + b0, a1 + b1, ..., an + bn). This operation often takes a few clock cycles, depending on the type of the operands and on the processor.

Figure 2.4: SIMD operation. The instruction performs an element-wise addition on vectors A and B


The most common technologies are MMX4, Streaming SIMD Extensions (SSE), and Advanced Vector Extensions (AVX) [39].

According to the requirements presented in Problem, the target machine has an AVX-512-capable processor, which features the latest generation of intrinsics on x86-based processors. Intel proposes AVX-512 for advanced analytics, compute-intensive applications, cryptography, and data compression [27]. AVX-512 promises to have twice the capability of its predecessor, AVX2.

AVX2 expands the register length to 256 bits, allowing processing of up to eight single-precision floating-point operands in a single run, and up to four double-precision floating-point operands [9]. Additionally, AVX2 also supports packed 8-bit integers, making it possible to run 32 operations in a single run. AVX-512, instead, doubles the quantities stated before: it can process up to eight double-precision floating-point operands at a time. There are two ways to use the vector instructions:

1. Instructing the compiler about the machine architecture

2. Using intrinsics
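The operand counts mentioned above follow directly from the register widths; a one-line helper (illustrative, not from the original code) makes the arithmetic explicit:

```cpp
// Operands per vector register = register width (bits) / operand width (bits).
constexpr int operandsPerRegister(int registerBits, int operandBits) {
    return registerBits / operandBits;
}

// AVX2 (256-bit): operandsPerRegister(256, 32) == 8 single-precision floats,
// operandsPerRegister(256, 64) == 4 doubles, operandsPerRegister(256, 8) == 32
// 8-bit integers. AVX-512 (512-bit) doubles each of these counts.
```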

For Microsoft Visual Studio 2019 or later, the way to instruct the compiler to use the vector instructions is the compiler switch /arch:AVX512 [51]. However, Microsoft Visual Studio 2017 only supports AVX-512 at the intrinsics level. According to I005, the chosen approach is to integrate the vector support through intrinsics.

Intrinsics are special instructions which work as an interface between the programming language and the assembly language. In the case of C++, they can be used by including the header file immintrin.h [29]. Listing 2.4 presents the code using AVX-512 intrinsics to perform an addition similar to the one in Figure 2.4.

Listing 2.4: AVX-512 intrinsics example

#include <immintrin.h>

// a, b, c are arrays with 8 elements each
void vector_addition(const double *a, const double *b, double *c) {
    if (!a || !b || !c) return;
    __m512d a_v = _mm512_load_pd(a);
    __m512d b_v = _mm512_load_pd(b);
    __m512d c_v = _mm512_add_pd(a_v, b_v);
    _mm512_store_pd(c, c_v);
}

For AVX2, the instructions are almost the same, except that the prefixes __m512d and _mm512_ change to __m256d and _mm256_, respectively. However, some instructions available in AVX-512 are not available in AVX2. It is recommended to check the Intrinsics Guide [29] for compatibility.

One of the drawbacks of placing intrinsics in the code is the negative impact on portability and maintainability. Moving the code to an older machine might imply rewriting the intrinsic-based functions to meet the processor architecture. Moreover, only programmers familiar with intrinsics might be able to maintain the code.

Another drawback of using intrinsics is the requirement of properly aligned memory, since there are functions which assume alignment and contiguity of the memory blocks, such as _mm512_store_pd and _mm512_load_pd. They load/store the memory block to/from the register, and might throw a segmentation fault in case the memory block is not aligned correctly [29].
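One way to satisfy this requirement is to allocate the buffers with std::aligned_alloc (C++17); on MSVC the equivalent is _aligned_malloc/_aligned_free. The sketch below (the helper name is illustrative) only checks the alignment, without executing any AVX-512 instruction:

```cpp
#include <cstdint>
#include <cstdlib>

// Allocate a 64-byte-aligned block suitable for _mm512_load_pd and
// _mm512_store_pd, and verify the alignment. The requested size must be
// a multiple of the alignment (8 doubles * 8 bytes = 64 bytes here).
bool allocate_aligned_block(std::size_t doubles) {
    void *p = std::aligned_alloc(64, doubles * sizeof(double));
    bool ok = p != nullptr &&
              reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
    std::free(p);
    return ok;
}
```

Alternatively, the unaligned variants _mm512_loadu_pd/_mm512_storeu_pd tolerate any address, at a potential performance cost.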


2.2 Optimisation

2.2.1 Data representation

Original state

In the current data layout, the profile points are encapsulated within an array of structs, where each struct has x and y as float members (see Listing 2.5).

Listing 2.5: Current point cloud representation

typedef struct {
    float x;
    float y;
} ProfilePoint;

ProfilePoint profile[N];  // N: number of profile points

Moreover, the points are accessed with a uniform stride because of a loop perforation technique [54], which samples the profile to determine the average angle of rotation and the displacement.

Proposed improvement

The implementation proposal uses two vector registers: one to hold the x coordinates and the other one for the y coordinates. To couple the original point cloud to the proposal, a memory copy is required at the very beginning of each loop iteration.

Following the flow diagram illustrated in Figure 2.1, a conversion from single-precision floating-point numbers to double-precision is required to keep a high resolution of the differences when averaging. Each vector loaded during the memory copy is split into two vectors holding numbers with twice the resolution. Besides, this causes a split of the loop into two iterations per point sample. Figure 2.5 illustrates the data adaptation process required to change the data layout.
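The adaptation described above can be sketched as a plain scalar copy from the array-of-structs layout to two contiguous double-precision arrays (the helper name repack is illustrative; the actual implementation fills AVX registers instead of std::vector):

```cpp
#include <vector>

struct ProfilePoint { float x; float y; };

// Repack an array-of-structs point cloud into two contiguous arrays
// (structure-of-arrays), upcasting each coordinate from float to double
// to keep a high resolution when averaging the differences.
void repack(const std::vector<ProfilePoint> &points,
            std::vector<double> &xs, std::vector<double> &ys) {
    xs.clear();
    ys.clear();
    for (const ProfilePoint &p : points) {
        xs.push_back(static_cast<double>(p.x));
        ys.push_back(static_cast<double>(p.y));
    }
}
```

With this layout, each coordinate array can be loaded directly into a vector register without gathering strided struct members.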

2.2.2 Vectorised roto-translation

Original state

After making the data accesses sequential, it is possible to refactor the original algorithm into its vectorised version. Since the vector instructions are low-level instructions, the best way to explain them is through data transformations, linking their relation with the original algorithm shown in Figure 2.1.

The provided algorithm for Average Rotation and Translation Calculation has branches to decide whether it should discard the Point-Under-Analysis (PUA) or not, depending


Figure 2.5: The incoming point cloud is sampled and packaged into a vector of single-precision floats, ordered by occurrence. Then, it is split into two vectors of double-precision floats, feeding the two inner loop iterations.

on the distance with respect to the geometrical centre, and on the invalid values which can arise from divisions by zero.

A for loop provides the sample points used for computing the average. It is also an opportunity to apply thread-level parallelism with OpenMP threads if needed.

2.2.3 Proposed improvement

Since the code has branches to make decisions about neglecting points, vector instructions are not directly applicable to the algorithm. An alternative to discarding points from the analysis is to mask them and flatten the execution as a continuous pipeline.

Figure 2.6 shows the proposed vectorised implementation of the average computation. Comparing the implementations presented in Figure 2.1 and Figure 2.6, the two principal differences are the data retrieval and the mask-based data filtering. The latter replaces the branching, which may lead to mispredictions and performance degradation, compared to using the Floating-Point Unit (FPU) for comparison and logical operations (when applying the mask to discard invalid data).

The process starts by filling a single-precision floating-point AVX register with 8 points (16 points in the case of AVX-512) uniformly strided. It performs a memory copy to:

• Make the sequence of points (by coordinates) contiguous in memory.

• Adapt the data structure.


Figure 2.6: Vectorised average calculation algorithm - The proposed algorithm uses AVX instructions, replacing the branches with masking and data filtering. The eight grabbed points are uniformly strided by the so-called passoCampionamento, the sector look-up becomes a separate block (previously part of Find best map to reference profile), and the process splits into two concurrent bunches of four points each. To preserve the accuracy of the angle computation, that part remains serial.


This process allows having the conversion already presented in Figure 2.5. The process splits into two steps which represent the upcast from single precision to double precision; the following operations require this precision to minimise the quantisation errors produced by the numeric representation.

The process Find best map to reference profile performs almost in the same way as in the original version. The main difference is that the sector calculation moves out from this computing block to take advantage of the already prepared AVX register. It makes available the index of the point which minimises the distance between the PUA from the incoming profile and the reference profile, together with the difference.

Nevertheless, it is also possible that there is no optimal point, or that the difference is greater than 10 (a threshold given by the original algorithm). The original implementation solves it by taking the absolute value of the difference and comparing it in an if-statement. The proposed implementation, instead, compares the absolute value using the FPU, generating a mask for filtering the data later, thus avoiding execution breaks caused by branching. After the filtering, the translations are reduced into the global accumulation, which is later used to compute the average translation.
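The mask-and-filter idea can be illustrated in scalar form (the threshold 10 comes from the original algorithm; the function name is illustrative): instead of branching on each difference, a 0/1 mask zeroes out the discarded values before the accumulation, which is what the AVX comparison and AND instructions do lane by lane:

```cpp
#include <cmath>

// Branchless accumulation sketch: differences whose absolute value
// exceeds the threshold contribute zero instead of being skipped
// by an if-statement.
double accumulate_translations(const double *diffs, int n,
                               double threshold = 10.0) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i) {
        double mask = (std::fabs(diffs[i]) <= threshold) ? 1.0 : 0.0;
        acc += diffs[i] * mask;  // masked-out points add 0
    }
    return acc;
}
```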

The rotation angle computation is similar to the translation one, basing the decisions on masking and filtering the data. The angle is computed by an arctangent operation (atan2(y, x)).

Given the nature of the computation, an approximation causes significant differences in the final results. By using an approximation (with a precision of approximately 10^-9), the collimated profile results vary by around 10^-4. To overcome this difference, the algorithm makes use of the serial arctangent and iterates over the elements of the vector, leading to an error not detectable at 10^-9.
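The serial fallback can be sketched as follows (illustrative function name): std::atan2 is applied element by element over the vector lanes to preserve full double-precision accuracy, instead of using a vectorised polynomial approximation:

```cpp
#include <cmath>

// Compute the rotation angle for each (y, x) pair serially with the
// library arctangent, avoiding the ~1e-4 deviation observed in the
// collimated profile with the approximated vector version.
void compute_angles(const double *y, const double *x, double *out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = std::atan2(y[i], x[i]);
}
```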

2.2.4 Multithreaded roto-translation

Original state

The average calculation algorithm performs the sampling by using a for loop over the profile points as presented in Listing 2.6.

Listing 2.6: Original loop over the samples in average computation

const int sample_stride = 10;

// Where N is the number of points of the profile
for (int i{0}; i < N; i += sample_stride) {
    ProfilePoint current_element = profile[i];
}


The sample stride is a parameter specified by a configuration profile provided by the final user.

Proposed improvement

In case the vectorisation is not enough to meet I001, it is possible to exploit CPU multithreading by using OpenMP threads. The number of threads must consider I003. OpenMP, as an option, offers pragmas for semi-perfect5 loops. However, since the closest sample points shall be as close as possible for the vectorisation and data representation, the pragma cannot be directly applied (see Listing 2.6). Instead, the proposed approach is to split the sampling loop into two loops: the outer one steps in bunches of sixteen sample elements (if AVX-512 is present) and the inner one steps over the sixteen sample points, as presented in Listing 2.7.

Listing 2.7: Parallelised loop over the samples in average computation

const int sample_stride = 10;
const int vector_size = 16;
const int jump_outer = sample_stride * vector_size;

// Where N is the number of points of the profile
#pragma omp parallel for schedule(static)
for (int bunch{0}; bunch < N; bunch += jump_outer) {
    for (int sample{0}; sample < jump_outer; sample += sample_stride) {
        // Apply data representation transform to profile[bunch + sample]
    }
}

A critical remark about Listing 2.7 is that the size of the profile array shall be a multiple of jump_outer, or be controlled within the inner loop, to avoid possible out-of-bounds accesses to the point array.

5A semi-perfect loop is a loop whose number of iterations is known through an arithmetic operation on the loop bounds and the step.


2.3 Results

The proposed optimisations have been applied and tested on a development machine capable of using both AVX2 and AVX-512. The resources of this machine are listed below:

• CPU: Intel Xeon Skylake (16 cores - 2.50 GHz)

• RAM: 128 GB - DDR4

• Windows 10 operating system - Visual Studio 2017 (I005)

This machine is similar to the machine used by the customer’s application (mentioned in I004), delivering a production-like scenario for testing.

Table 2.2 presents the results acquired after running the entire project on this machine by using a test profile.

Table 2.2: Optimisation with AVX of the average computation - Runtime results before and after applying the AVX vectorisation to the average computation

Threads  Vectorisation  Min (ms)  Max (ms)  SpeedUp  Status
1        No AVX         0.95      1.05      -        Not passed
1        AVX2           0.88      0.97      1.08     Not passed
1        AVX-512        0.98      1.02      1        Not passed
2        No AVX         0.6       0.65      1.6      Not passed
2        AVX2           0.5       0.57      1.86     Not passed
2        AVX-512        0.52      0.57      1.83     Not passed
4        No AVX         0.49      0.53      1.96     Not passed
4        AVX2           0.33      0.35      2.94     Not passed
4        AVX-512        0.27      0.28      3.84     Close
6        AVX-512        0.21      0.22      4.65     Passed

Table 2.2 shows several configurations for the optimisation, allowing the user to tune the level of optimisation of the final algorithm. Scaling the number of threads alone is not enough to accelerate the algorithm to meet the required runtime (250 microseconds). The thread-level scalability efficiency of the algorithm is around 50% when using four threads.
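The efficiency figure quoted above follows from the usual definition of parallel efficiency (a trivial helper, not from the project code):

```cpp
// Parallel efficiency: achieved speed-up divided by the number of threads.
// Four threads with a 1.96x speed-up give an efficiency of 0.49 (about 50%).
double parallelEfficiency(double speedUp, int threads) {
    return speedUp / threads;
}
```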

Focusing on the speed-up achieved by vectorisation, the gain is not significant compared to the baseline6. Although AVX2 has half of the register width available compared to AVX-512, it manages to accelerate more than AVX-512 in the single-thread case. In the other cases, the scalability does not represent the expected performance for


the number of operands available in AVX-512 registers. The expected gap between AVX2 and AVX-512 is a factor of two in performance gain.

From the observations, the best factor achieved between AVX-512 and AVX2 is 1.30 times out of the expected 2. It represents an efficiency of 65% when scaling with AVX-512 compared to AVX2. Comparing AVX-512 to a non-vectorised version, the best efficiency achieved (using six threads) is 30%. Possible causes of the degradation are:

1. Memory copies: the data structure is adapted to meet the memory contiguity requirement, and the strided access to the points plays a critical role in slowing the optimisation down.

2. Serial parts: the current implementation performs the angle computation with a serial arctangent function.

3. The vector units have a slower clock compared to the CPU base clock. It may cause a degradation of about 25-40%.

Regarding the third cause, the user should expect a clock slowdown when using vector instructions. Intel specifies in [26] that the scaling between SSE (four 32-bit floating-point numbers) and AVX (eight 32-bit floating-point numbers) is 1.8 times. For AVX2, Intel claims they improved the vector units, allowing a 2.8 times scaling by doubling the number of single-precision floating-point operations (SP-FLOP) per CPU clock.

According to the data provided by Intel about the Second Generation of Xeon processors [28], for the SKU 6256 processor7, there are clock slowdowns when using the vector units, depending on the number of threads. The processor manages to reach the following clock speeds using all the cores (twelve in total):

• No AVX: 4.3 GHz

• AVX2: 3.8 GHz

• AVX-512: 3.3 GHz

From these clock speeds, AVX-512 can scale up to 6.14 times while using double-precision floating-point numbers. Equation (2.1) computes the scalability of the vector units.

S = (f_vector / f_base) × n_registers    (2.1)

where S is the scalability, f_vector is the maximum frequency of the vector units, f_base is the maximum frequency without using vector units, and n_registers is the number of operands which can fit within a vector register.
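As a sanity check of Equation (2.1), the 6.14 figure can be reproduced directly (illustrative helper name):

```cpp
// Equation (2.1): S = (f_vector / f_base) * n_registers.
// For the SKU 6256 with AVX-512 and double precision:
// f_vector = 3.3 GHz, f_base = 4.3 GHz, 8 doubles per 512-bit register,
// giving S of roughly 6.14.
double scalability(double fVector, double fBase, int nRegisters) {
    return fVector / fBase * nRegisters;
}
```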


Figure 2.7: Operational CPU frequency achieved by vectorisation (No AVX, AVX2, AVX-512) as a function of the number of cores. Based on [28]

Since the scalability of AVX-512 is 6.14 times, it represents an efficiency of 76.75%. For AVX2, the efficiency is higher, since the degradation caused by the clock speed decay is lower. Nevertheless, the operational speed reachable while the vector units are active also depends on the number of cores which are using those units. Figure 2.7 shows how the clock speed changes according to the number of CPU cores and the vector technology.

In [11], the authors report a study about the scalability of the intrinsics in Intel processors, achieving a speedup of 3.5 times in a single-thread analysis. The authors also discuss the energy consumption and other factors which can give a better understanding of the bottlenecks in the evaluated code.


Chapter 3

Glioblastoma stem cells tracker


3.1 Background and Related Work

3.1.1 Related research

Cell tracking is a field which has been studied for several years, addressing several issues related to non-conventional computer vision challenges. There are attempts to address the problem ranging from classical mathematical methods to deep learning methods.

In the case of classical mathematical models, the common approach is to tailor the algorithms to work under certain conditions, such as the camera pixel levels, background, cell population size, and cell shape. A similar phenomenon also happens with algorithms based on deep learning: they often require large amounts of labelled data to achieve good object detection.

This section reviews several problems found within the community and the attempts to address them.

Common issues

The research community has been studying some of the issues mentioned in Technical challenges, and there are some proposals to address them.

Mosig et al. mention in [3] the utilisation of segmentation techniques to detect and differentiate the several cells within a sample. The idea behind using image segmentation is its capability to partition an image into regions representing meaningful areas of the image [55]. These meaningful areas usually correspond to several objects or parts of an object. For Mosig et al., those meaningful areas represent different cells. Tsai et al. followed the same approach in [57], whose results are illustrated in Figure 3.1.

Within the issues found by using image segmentation for detecting cells, there are three primary cases (see Figure 3.2):

• Over-partitioning: when one cell has a single colour in the ground-truth but receives two colours (Figure 3.2b).

• Under-partitioning: when two cells have two colours in the ground-truth but receive one colour (Figure 3.2c).

• Mis-partitioning: one colour overlaps within two cells (Figure 3.2d).

Finding a balance to get close to the ground truth is even more difficult for cell segmentation, since the images are grayscale and often noisy, as Figure 3.1a shows. This discards any attempt to use colours as an input to get a better image segmentation quality. The authors in [2] also mention this issue.


Figure 3.1: Cell detection by using image segmentation based on deep learning [57]: (a) input image, (b) cell segmentation used in Usiigaci.

Figure 3.2: Fundamental issues within image segmentation: (a) ground truth, (b) over-partitioning, (c) under-partitioning, (d) mis-partitioning.

In addition to the segmentation issues, Mosig et al. indicate that cells can double their population during a video sequence analysis, making the cells harder to track and their several offsprings harder to differentiate. Also, cells change their shape and movement patterns abruptly, making features based on movement and shape almost useless for this particular use case. Tsai et al. also mention these problems in their study of Usiigaci. Carpenter et al. offer open-source software to profile cells, allowing users to segment images, monitor several parameters related to the cell population, and track [10]. The authors mention that object identification is one of the most challenging steps of any analysis pipeline for the cells use case, given that the cells can touch each other. This causes simple and popular algorithms to fail.

Measuring the cell size is a challenging part as well, given the difficulty of the edge detection [10]. Carpenter et al. used watershed to address the edge detection, but it did not give good results.


According to Carpenter, the robustness of the cell analysis implies adding as many features as possible to get characteristics from all possible perspectives. This slows the analysis process down to the point of needing clusters to run the analysis over thousands of images, taking around two minutes to analyse a single image using a single CPU. Considering that the dataset for this project contains 810 frames composed of four subframes (3240 images), it might take more than a hundred hours.
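The estimate follows from simple arithmetic (a sketch; the two-minutes-per-image figure is from [10]):

```cpp
// 810 frames x 4 subframes = 3240 images; at roughly 2 minutes per image
// on a single CPU, the full dataset takes about 108 hours.
double estimatedHours(int frames, int subframes, double minutesPerImage) {
    return frames * subframes * minutesPerImage / 60.0;
}
```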

Labelling the instances and preserving the identifiers from one frame to the next involves a pattern recognition task. Mosig et al. claim that the typical object tracking methods based on one-to-one matching do not handle cell division, cell fusion, or over-segmentation properly [3].

Perner in [49] highlights the issues of the single-shot tracking mechanism proposed by Li et al. in [38], which detects the instances in the first frames and then continues tracking the detected instances. She claims that there is a loss of data because of the mitosis phenomenon, which tends to increase the number of living instances within the experiment. Moreover, she highlights the issue of label swapping, where two instances exchange their labels because of their feature similarities.

Perner also discusses the use of either the Euclidean distance or the cell size, as Padfield et al. [46] and Kan et al. [32] implemented in their respective works. This suggests that a robust tracking algorithm needs several features to strengthen the tracking process while keeping the labels fixed to the instances. Evaluating the use case addressed by this project, using the Euclidean distance may lead to possible label swapping when cell fusion events occur. Besides, the cell size is not reliable, since the glioblastoma cells change their shapes frequently and abruptly, as Figure 1.5 presents.

In most of the research studied before the project, one of the common issues was the pixel intensity of the fluorescence used to identify the cells [3] [2] [10]. This problem also exists within the provided video sequence, where the fluorescence tends to decay as the video evolves.

Since the addressed use case also involves several scenes composing a single frame of the video sequence, multi-scene tracking also becomes a challenge to take into consideration. This challenge is present in several use cases, such as video surveillance.

Turolla, Marchesotti, and Regazzoni propose a multi-scene tracking method which reconstructs the space by using extrinsic parameters1, and takes other features, such as the Euclidean distance between objects, the speed difference, and the Bhattacharyya coefficient, to evaluate the similarity between the colour histograms [58]. The algorithm links the labels by thresholding a decision vector with the weights after the correlation amongst the several features of the tracked instances. Given the nature of the problem, the speed can give erratic results because of the randomness of the cell movements. Besides, the colour histogram is limited to grayscale.
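For reference, the Bhattacharyya coefficient between two normalised histograms p and q is BC(p, q) = sum over bins of sqrt(p_i * q_i): 1 for identical distributions and 0 for non-overlapping ones. A minimal sketch (not the implementation from [58]):

```cpp
#include <cmath>
#include <vector>

// Bhattacharyya coefficient between two normalised histograms of equal
// length: sum over bins of sqrt(p_i * q_i).
double bhattacharyya(const std::vector<double> &p,
                     const std::vector<double> &q) {
    double bc = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        bc += std::sqrt(p[i] * q[i]);
    return bc;
}
```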

Other research also uses the colour histogram as a feature to make decisions about the


label linkage [64][61][20]. Most of these use cases were related to tracking people in colour video sequences, where the histogram provides more information about the object composition and works for identification.

Figure 3.3: Types of multi-tracking: (a) overlapping multi-scene, (b) non-overlapping multi-scene. Non-overlapping cameras consider a monocular case, whereas overlapping cases consider high overlaps between scenes.

According to the types of multi-scene cases mentioned by Fleuret et al. in [20], the GSC use case is a monocular tracking case, where a camera covers a scene with minimal overlap with another camera (see Figure 3.3b). For the authors, this poses a problem for 3D reconstruction, which may lead to precision loss and mispredictions.

The GSC case is more similar to the case presented by Wang et al. in [64]. The authors address a problem related to non-overlapping camera tracking. They use multi-feature tracking to match the instances locally tracked by each scene. Nevertheless, the reported results only analyse the case of tracking two people.

In [13], the authors emphasise that using colour features for matching objects across non-overlapping cameras is not enough and can be affected by several factors, such as brightness differences, pose, and viewpoint. Instead, they propose using the direction to link the non-overlapping scenes. On the other hand, in [4], the authors propose using several features to make the tracking labelling more robust, rather than relying on a single feature for linking the detections.

Current solutions

Mosig et al. propose a topological method for cell detection and tracking based on topological alignment [3]. It combines flux tensors for the detection of the moving instances and multiple features for labelling. The authors report that the performance of the flux tensors depends on the contrast and the signal-to-noise ratio (SNR).


Figure 3.4: Usiigaci architecture. The segmentation module uses an R-CNN model, whereas the tracking uses k-d tree tracking. Image retrieved from [57]

may occur during the first iterations by aligning two consecutive frames, aiming to maximise the overlap. Nevertheless, the authors had issues while distinguishing between pseudo-divisions and mitosis. They addressed this issue by treating the alignment problem as a generalised assignment problem, where the overlaps are weighted. Thus, the segmentation iterates over the overlaps amongst the consecutive frames.

They concluded that the topological alignment was not able to address the problem of under-segmentation in some cases, but they managed to improve the overall segmentation. Usiigaci is one of the proposals which utilises deep learning to perform the image segmentation on cells, and it is one of the current state-of-the-art tools in cell image segmentation (published in 2019) [57]. The motivation behind Usiigaci is similar to the problem addressed by this thesis project, since the authors intend to characterise the cell migration to understand the behaviour of the cell under various physiological conditions. It aims to study phenomena such as tissue maintenance, immunity, and metastasis2.

Usiigaci proposes a stain-free, instance-aware cell tracking software for Zernike's phase contrast microscopy (PCM) by using deep learning approaches. The authors use a Mask-RCNN3 to segment the cells, pre-trained with the Microsoft COCO dataset and with 50 manually annotated PCM images. The image segmentation performed by Mask-RCNN makes it possible to detect every single instance.

For labelling the objects, they assign unique IDs which carry information about the detected instance, such as location, diameter, perimeter, eccentricity, orientation, and solidity. After labelling, the algorithm transmits those features to the TrackPy library [1]. Figure 3.4 presents the overall architecture of Usiigaci.
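TrackPy's linker is considerably more sophisticated (k-d tree acceleration, a search range, and memory for temporarily lost particles); the stdlib-only sketch below, on hypothetical data, only illustrates the underlying idea of frame-to-frame linking of detected centroids by nearest neighbour:

```python
import math

def link_frames(prev, curr, search_range):
    """Greedy nearest-neighbour linking between two frames.

    prev, curr: lists of (x, y) centroids. Returns (prev_index,
    curr_index) links whose distance is within search_range.
    A simplified stand-in for k-d tree based particle linking.
    """
    links, taken = [], set()
    for i, (px, py) in enumerate(prev):
        best, best_d = None, search_range
        for j, (cx, cy) in enumerate(curr):
            if j in taken:
                continue
            d = math.hypot(cx - px, cy - py)
            if d <= best_d:
                best, best_d = j, d
        if best is not None:
            links.append((i, best))
            taken.add(best)
    return links

# Two cells swap list order between frames but move only slightly.
prev = [(10.0, 10.0), (40.0, 40.0)]
curr = [(41.0, 39.5), (11.0, 10.5)]
links = link_frames(prev, curr, search_range=5.0)
```

The per-instance features listed above (diameter, eccentricity, and so on) can then be attached to each link to disambiguate cells that lie within the same search range.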

The results achieved by Usiigaci are quite acceptable for the segmentation stage, managing cell sticking and proximity in most of the cases. Figure 3.1 shows the results after the tracking stage. The segmentation similarity between a human-labelled image and Usiigaci oscillates around 72.23%. The precision and recall of the neural network are above 83%, whereas the accuracy is around 96%. This is remarkable given the nature of the problem and its complexity.

2 Metastasis refers to the spread of cancer cells within the body

3 Mask-RCNN is a Convolutional Neural Network capable of detecting and segmenting instances within an image [24]

Regarding the tracking quality, Usiigaci scores 92% of valid tracks when there is human intervention to filter some spurious tracks. Without filtering, the quality decreases to 19.5%.

Carpenter et al. propose the CellProfiler suite for characterising cells within video sequences [10]. It provides several techniques to perform image segmentation, even on images with low SNR and contrast. They mention in [10] the experiments performed with classical computer vision approaches and their effectiveness in addressing the problem of cell tracking. Moreover, the suite also integrates algorithms developed in-house, tailored for the use case.

CellProfiler also has extra modules to monitor protein levels and perform morphological assays such as DNA study and shape profiling. The authors also provide a methodology to deploy the tool on a cluster to exploit supercomputer centres and speed the analysis up from several hours to minutes. Potentially, with the dataset provided for this work, the analysis might take roughly an hour compared to the one hundred hours mentioned in the last section. The experiment cited by the authors compared a single-CPU analysis against a 100-CPU one, suggesting that the suite scales linearly. Similar to Usiigaci, CellProfiler is also open source and available on GitHub, open for contributions and modifications.

Perner proposes an automatic binary thresholding of the incoming frames to remove the noise together with the background and create cell regions of interest. The cell detection occurs within a window, which is often smaller than the actual frame size. This helps to discard spurious detections on the borders.
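Perner's exact thresholding procedure is not detailed here; Otsu's method is a common choice for automatic binary thresholding and is sketched below in plain Python on a toy grey-level image (an illustration of the general technique, not Perner's implementation):

```python
def otsu_threshold(pixels, levels=256):
    """Automatic threshold by Otsu's method: pick the grey level
    that maximises the between-class variance of the histogram."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0.0
    for t in range(levels):
        w_bg += hist[t]          # background pixel count so far
        if w_bg == 0:
            continue
        w_fg = total - w_bg      # remaining foreground pixels
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mu_bg = sum_bg / w_bg
        mu_fg = (total_sum - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mu_bg - mu_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal toy image: dark background (~10) and bright cells (~200).
pixels = [10] * 90 + [12] * 10 + [198] * 5 + [200] * 15
t = otsu_threshold(pixels)
mask = [1 if p > t else 0 for p in pixels]
```

Thresholding the frame this way separates the candidate cell regions from the background in a single pass, after which the detection window can be applied.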

Then, the author computes some characteristics of the detected instances, such as the centre of mass and a mitosis/normal classification, and analyses the previous frame with a window-based similarity approach, where the algorithm looks for the closest match within a given window expanded from the object region of interest (ROI).

The features used by Perner rely on the position and pixel displacement as an anti-feature4. The process finalises by saving the trajectory into a database and updating the reference for the next frame [49].

Within the CNR-IOM, the researchers currently use the ImageJ suite for analysing the images. It has an extension (TrackMate) able to perform single-particle tracking, allowing the user to reconstruct the tracklets and their possible bifurcations. The TrackMate code is licensed under the GNU Public Licence version 3, making it accessible for research purposes. Differently from other solutions, which aim at a global (or multi-particle) analysis, TrackMate focuses on analysing a single instance and its cell lineage, building a tree based on the frames and the cell evolution [56].

Figure 3.5: TrackMate functionality example. It is possible to reconstruct the tracklets on the left and observe the lineage on the right. Retrieved from [56]

For tracking, TrackMate uses particle-tracking approaches. The first method relies on the LAP framework proposed in [30], based on computing costs from the squared distance between particles, ideal for Brownian motion5. The other particle-tracking method available within the extension is based on Kalman filters [31] to tackle linear movement, performing the particle linking based on a nearest-neighbour search.
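The LAP cost construction can be sketched as follows. For clarity the assignment is solved here by brute force over permutations on hypothetical particle positions; production trackers use efficient solvers such as the Hungarian or Jonker-Volgenant algorithms:

```python
from itertools import permutations

def squared_distance_costs(prev, curr):
    """Cost matrix of squared Euclidean distances between particles,
    as used by LAP-style linking for Brownian motion."""
    return [[(px - cx) ** 2 + (py - cy) ** 2 for (cx, cy) in curr]
            for (px, py) in prev]

def solve_lap_bruteforce(cost):
    """Minimise the total cost over all assignments.

    Only viable for tiny inputs; real implementations rely on
    Hungarian/Jonker-Volgenant solvers instead of enumeration.
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(enumerate(best))

# Three particles that each drift slightly between frames, with the
# detection order shuffled in the second frame.
prev = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
curr = [(9.5, 0.5), (0.5, 9.5), (0.5, 0.5)]
cost = squared_distance_costs(prev, curr)
assignment = solve_lap_bruteforce(cost)
```

Using the squared distance as the cost makes the globally optimal assignment coincide with the maximum-likelihood linking under a Brownian motion model, which is why the LAP framework favours it.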

The developers tested the extension with the dataset presented in [14]. They also encourage users to tailor their own algorithms to the extension to achieve higher performance.
