
Analysis of object recognition methodologies for situation awareness in airborne autonomous controlled missions


Academic year: 2021



Department of Information Engineering
Master’s Degree in Automation Engineering

Thesis

Analysis of object recognition methodologies

for situation awareness in airborne

autonomous controlled missions

Candidate

Davide Mameli ____________

Supervisors

Prof. Alberto Landi ____________
Prof. Emanuele Crisostomi ____________
Eng. Andrea Masini ____________

Prof. Lorenzo Pollini ____________


dated in virtual scenarios. A first method, based on object shapes, has shown faster evaluation time but lower precision and accuracy than the second one. The second method, based on object details, has shown complementary performances. Hence, this complementarity has suggested combining them in a third methodology, composed of the cascade of both methods, which has obtained better accuracy, precision and specificity than the original systems composing the cascade.

A UAV attention system is designed to automatically extract candidate objects from the scene, and a detection method that allows real-time performance is tested. The detector designed proved acceptable in virtual environments, but it needs further development and integration for practical applications.

The UAV object recognition system is designed with human situation awareness in mind, in order to completely automate the UAV's comprehension of scenes, reducing human supervision to the minimum possible and aiding UAV decision systems in autonomously taking correct actions for improved mission success probabilities.

The system performances have been compared systematically to commonly adopted approaches based on the percentage of matching features from a database image to an input image, and the system has been tested in simulated and real scenarios. In virtual environments, a database of four objects, three vehicles and an airplane, has been loaded and matched against images containing objects in varied and mimetic environments, in different light conditions and even partially occluded. The proposed recognition system has recognised almost a thousand object instances and rejected almost a thousand negative images with better precision, specificity and accuracy than other adopted approaches. While real-time object recognition performances of the entire system were successful in virtual tests, in real complex scenarios performances could degrade to one or two seconds per frame and to lower


accuracy, because of the lack of both effective and fast object extraction and shape estimation techniques. Future works will identify possible improvements to the proposed methodologies, test more objects, experiment with new complementary acquisition hardware in order to obtain similar extraction performances in real scenarios, and add a higher-level decision system which automatically defines the best action to be executed for the analysed situation. The conclusions resulting from this work are that shapes, which are commonly not considered when matching features, bring advantages in terms of performance and accuracy in object recognition problems, and that unmanned aerial systems with situation awareness capabilities improve mission success and remove the current necessity of human supervision, or allow a single human to simultaneously supervise a larger number of vehicles.


2.1.5 Morphological operators
2.1.6 Contours and bounding rectangles
2.2 Object classification
2.2.1 Invariant moments
2.2.2 Scale Invariant Feature Transform
2.2.3 Viola-Jones cascade classifier

3 Improvements in recognition methods
3.1 Dataset creation
3.2 Canonical shape recognition methods
3.3 Local Shape Pattern method
3.4 Canonical feature matching method
3.5 Homography-based Matching Consistency method
3.6 Shape-Details Cascade recognition system
3.7 SDC complementary systems
3.7.1 Attention system
3.7.2 Coding system
3.7.3 Memory management system

4 Experimental results
4.1 Performance indexes in Computer Vision
4.2 Simulation environment
4.2.1 LSP validation
4.2.2 HMC and SDC validation
4.2.3 Virtual scenario
4.3 Real environment


deployed for environment monitoring, security, surveillance and military applications, but also policing, fire-fighting and commercial applications are growing greatly in number. UAVs are often preferred for missions that are too dangerous or that require repetitive tasks. Nevertheless, many deployed unmanned vehicle systems are teleoperated, semi-autonomous or rely on very fragile autonomous capabilities [1]. A common solution is to equip UAVs with a system that provides operators in real time with the information necessary for control and decision making, and allows effective supervision. Consequently, mission success still requires the human to be an important system component, and human operators are essential to support UAVs in the cognitively demanding tasks associated with the missions. As in the majority of cases vehicles reside at remote locations, the operating conditions may limit the human understanding of the vehicle and scene, making it difficult to handle unpredictable events. In fact, the limited communication bandwidth does not allow telepresence to satisfactory levels.

UAVs potentially may gain advantages over humans with regard to perception. UAVs can provide dedicated vigilance for persistent surveillance and may have the capability to store all perceived information throughout the entire mission. Besides, UAV basic sensors may be accompanied by night vision, auditory perception outside the human perceptual range and zooming technologies which, in principle, allow them to surpass human sensory limitations.


Adaptive sensing incorporates the ability to determine which sensor to use at a particular time, to modify the sensor processing, or to combine sensors as required by the task. The same concept may be adopted in the selection of what is relevant and in applying the correct method for data extraction, similarly to human mental models of the task, which prioritise the importance and meaning of the information with regard to mission goals. Current UAVs either rely on the human operator to understand the situation or blindly execute the mission. A human limitation is that individuals new to a situation cannot easily acquire the mental models of those already experienced. Therefore, their performances are typically lower than those of experienced individuals. UAVs may overcome this problem by transferring comprehension capabilities between unmanned entities, thus eliminating or reducing the learning phase necessary to human beings.

Situation Awareness (SA) is an important concept introduced in the context of leadership and roles involving time-critical applications, and later extended to UAVs and human interactions. An interesting definition of SA has been proposed by Endsley [8]: “Situation Awareness is the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning and the projection of their status in the near future”. Humans are able to understand dynamic and complex environments thanks to their cognitive capabilities, one component of which is SA. Existing UAVs that incorporate intelligence capabilities are uncommon, and they do not necessarily possess situational comprehension. The hypothesis made during this work is that UAVs with human-like SA capabilities improve mission success and remove the current necessity of human supervision, or allow a single human to simultaneously supervise a larger number of vehicles. In order to achieve such a result, the UAV, according to Endsley's definition, must perceive the environment, understand it, analyse the situation and employ that comprehension to predict the near future and determine the best action to be executed.

The UAV functionalities simultaneously required to process the high-density information present within a single image are sensor data acquisition, selection and extraction of the task-relevant information from the input, allocation of attentional and memory resources, decision making, selection of appropriate actions and automatic command execution. Attention in this context is typically defined in relation to a particular mission that must be attained. Knowledge of the mission goals determines the information required by the system to successfully complete it; hence, the data processing method may change for different objectives. As the success of situation comprehension is directly dependent upon attaining the best possible result during the data extraction process, this work focuses on attention, data extraction


The main limitation in the use of Machine Learning methods is that classifiers are often to be considered as black boxes, with no clear meaning. For this reason, algorithms and decisions are logically separated in this work, and experience and learning do not directly affect well-founded algorithms, but only the weights that descriptors have in the decision process or the method parameters, which can also depend on the context. Besides, while the use of properly designed classifiers is recommended, in this logic machine learning classifiers can also be introduced without effort when they are demonstrated to be effective in practice (e.g. probabilistic line detection, or background subtraction). It is important to note that, while the techniques are shown for UAVs, they can be applied to various Computer Vision applications that aim to automate devices or systems which rely, for their decisions, interactions and behaviours, on information extracted from video sequences.

Chapter 2 introduces the main concepts and methodologies in Object Recognition, studying their advantages and disadvantages and evaluating their possible application in the system designed later. Chapter 3 introduces two criteria for object recognition developed during this work, which are then merged together into the overall object recognition system. Chapter 4 validates the proposed methods and recognition system; tests in simulated and real scenarios are discussed, and their performances are compared to commonly adopted methods. Chapter 5 contains the final considerations on this work and defines further possible developments.


Methodologies in Object Recognition

2.1 Object detection

Detectors are widely used in Computer Vision to identify regions of interest or distinctive features. In the context of this work, both purposes are necessary: to extract objects from the scene and for object classification. A wide range of problems can in principle make good use of segmented regions, when such segmentations have been reliably and efficiently computed. In the literature there are many different approaches and techniques which, at present, do not work in general conditions and depend on method parameters that need to be defined by the designer and that modify algorithm performances, with a resulting lack of generality. The aim of this chapter is to identify the most salient techniques for object extraction, which are demonstrated to work in a high number of cases, in order to apply them to real and simulated scenarios and to automatically pass to the object recognition systems image sub-windows containing candidate objects to be recognised, without human effort.

2.1.1 Canny-edge detector

The Canny-edge detector is an edge detection operator, proposed in [5], that uses a multi-stage algorithm to detect a wide range of edges in images. In object detection applications it may be useful as an object contour extractor. In fact, the edge detection process serves to simplify the analysis of images by drastically reducing the amount of data to be processed, while at the same time preserving useful structural information about object boundaries. The detector is the optimal solution for the following three criteria:


is commonly used in research [21]. The methodology can be summarised in three steps:

• noise filtering: smoothing of the image to remove noise;

• gradients calculation and non-maximum suppression: gradients within the intensity image are evaluated in order to determine the locations where they have a large magnitude. Only local maxima can be consid-ered as edges;

• double thresholding and edge tracking: in most cases it is impossible to specify a threshold at which a given intensity gradient starts or stops corresponding to an edge. The double thresholding technique allows this decision step to be handled better, by asserting that a pixel adjacent to an edge pixel has a high probability of being an edge itself, while a pixel adjacent to a non-edge pixel has a low probability of being an edge pixel. This corresponds to keeping as final edges all pixels connected to very likely edge pixels.

As acquired images inevitably contain some amount of noise and edge localisation requires the best possible estimate of the actual edge position, it is important to reduce every source of error, in particular to prevent noise from being mistaken for edges. Hence, noise must be reduced, for example by applying a Gaussian filter to the image.

Intensity gradients are evaluated locally at each pixel in the smoothed image. Canny applied the most common gradient estimation, using a kernel known as the Sobel operator. Gradients are evaluated in both the x and y directions


respectively by applying the kernels
$$\mathrm{Ker}(G_X) = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \quad (2.1)$$
$$\mathrm{Ker}(G_Y) = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}. \quad (2.2)$$

Then, if the kernels are applied to the input image A, the gradient images can be derived from the respective kernels as

$$G_X = \mathrm{Ker}(G_X) * A, \quad (2.3)$$
$$G_Y = \mathrm{Ker}(G_Y) * A. \quad (2.4)$$

The gradient magnitude image G and the gradient direction image Θ can then be determined easily from the two component images

$$G = \sqrt{G_X^2 + G_Y^2}, \quad (2.5)$$
$$\Theta = \operatorname{atan2}(G_Y, G_X), \quad (2.6)$$

where all the operations are performed pixel-wise. The direction matrix Θ is quantized into 8 possible directions, every 45°, and allows edges to be localised better. Basically this is done by preserving all local maxima in the gradient image and deleting everything else: if the edge magnitude of the current pixel is larger than the edge magnitude of the pixels in the positive and negative gradient directions, the pixel and its edge value are preserved, otherwise the current pixel is suppressed.

Edges computed using the non-maximum suppression technique may contain spurious edges caused by unfiltered noise or colour variations (e.g. rough surfaces). The simplest way to identify real edges is to retain only edges with strong evidence. Canny adopted the double thresholding technique: edge pixels with a magnitude larger than the high threshold are marked as strong, while edge pixels with magnitude between the high and the low threshold are marked as weak; all remaining pixels are suppressed. Finally, weak edges are included if and only if they are connected to strong edges. This last step is justified by the consideration that noise is probably distributed independently of edges over the entire image.
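As a concrete illustration of this pipeline, the following sketch applies Gaussian smoothing followed by OpenCV's Canny detector; the file name and the low/high hysteresis thresholds are placeholder values chosen for illustration, not the ones used in this work.

```python
import cv2

# Load a frame as a grayscale intensity image (placeholder file name).
frame = cv2.imread("input_frame.png", cv2.IMREAD_GRAYSCALE)

# Step 1: noise filtering with a Gaussian kernel.
smoothed = cv2.GaussianBlur(frame, (5, 5), sigmaX=1.4)

# Steps 2-3: Sobel gradients, non-maximum suppression and double
# thresholding with hysteresis are all performed inside cv2.Canny;
# the two values are the low and high thresholds.
edges = cv2.Canny(smoothed, threshold1=50, threshold2=150)

cv2.imwrite("canny_edges.png", edges)
```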

An example of the Canny-edge detector applied to the simulation environment in Figure 2.1 is shown in Figure 2.2; the same detector applied to a real image acquired during a UAV mission, shown in Figure 2.3, is depicted


Figure 2.1: An input frame obtained from a simulated environment.

Figure 2.2: The Canny-edge detector applied to a simulation frame.

in Figure 2.4. As can be seen, edges in the simulated environment are a good indication of object presence, as the simulation was not very detailed. In real pictures the Canny-edge detector identifies more lines, as it also detects morphological details, and some other pre-processing methods are necessary for real object extraction applications.

2.1.2 Saliency map

Saliency maps are an approach to the design of a visual attention system and may be associated with the human automatic focus on image subregions containing prominent features. Visual attention is the process of selecting information based on saliency (bottom-up process) as well as on prior knowledge about objects (top-down process). For the biological visual system it seems to be an important component; however, biologically inspired systems mixing bottom-up and top-down processes, as far as is known, have never performed better in terms of classification error than purely bottom-up approaches when dealing with complex recognition problems, as researched in [22]. There are various saliency descriptors in the literature, all with their drawbacks, for example the non-applicability to scenes containing more than one object, or the hypothesis that background and foreground are well identifiable within an image. In this work, the VOCUS system (Visual Object detection with a CompUtational attention System [13]) has been selected; but only a single component of the saliency map, called the conspicuity map and corresponding to the intensity image, has been considered, similarly to [4]. The system is shown in Figure 2.5. In fact, in aerial applications vehicles may be mimetic or partially assimilated to the background, making colour


Figure 2.3: An input frame acquired from a UAV in a real environment.

Figure 2.4: The result of the Canny-edge detector applied to a real frame.

and orientation information unreliable. The intensity map is calculated by converting the original colour image into a grayscale image. Then, a Gaussian image pyramid is obtained by applying a third-order Gaussian filter to the grayscale image and rescaling it down by a factor of two, four times; this yields five images. The VOCUS system for the intensity conspicuity map takes into account the information present in the smallest scales (images $i_2$, $i_3$ and $i_4$). Following [4], centre and surround are first defined as

$$\mathrm{surround}(x, y, s, \sigma) = \frac{\displaystyle\sum_{x'=-\sigma}^{\sigma} \sum_{y'=-\sigma}^{\sigma} i_s(x + x', y + y') \;-\; i_s(x, y)}{(2\sigma + 1)^2 - 1}, \quad (2.7)$$
$$\mathrm{center}(x, y, s) = i_s(x, y). \quad (2.8)$$

Then, every pixel of each intensity submap is calculated as
$$\mathrm{Int}_{On,s,\sigma}(x, y) = \max\{\mathrm{center}(x, y, s) - \mathrm{surround}(x, y, s, \sigma),\, 0\}, \quad (2.9)$$
$$\mathrm{Int}_{Off,s,\sigma}(x, y) = \max\{\mathrm{surround}(x, y, s, \sigma) - \mathrm{center}(x, y, s),\, 0\}, \quad (2.10)$$

where $s \in \{2, 3, 4\}$ represents the image scale, $\sigma \in \{3, 7\}$ the surround value, and $On$ and $Off$ the on-center and off-center differences respectively. In the last step, the on-center and off-center intensity maps are obtained by scaling the six on-center and the six off-center intensity submaps back to the largest scale ($i_2$), and then summing the respective images pixel-wise
$$\mathrm{Int}_{On} = \bigoplus_{s,\sigma} \mathrm{Int}_{On,s,\sigma}, \quad (2.11)$$


Figure 2.5: The VOCUS[13] visual attention system.

The intensity conspicuity map of the simulated environment in Figure 2.1 is shown in Figure 2.6. Objects without distinctive intensity with respect to the background do not appear clearly, and threshold techniques could not segment objects properly. In fact the mimetic tank has a very low saliency value, similar to the local background, while other textured objects may contain high saliency variability. The method applied to Figure 2.3 determines the conspicuity map in Figure 2.7. The car is not as salient as the street line and some background near it, and other processing techniques have to be applied to correctly extract the object.
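A simplified NumPy/OpenCV sketch of the on/off centre-surround computation of equations (2.7)-(2.10) is given below; it is not the full VOCUS implementation, and the pyramid depth and window handling are simplifications made for illustration.

```python
import cv2
import numpy as np

def intensity_conspicuity(gray, scales=(2, 3, 4), sigmas=(3, 7)):
    """On/off centre-surround intensity maps in the spirit of eqs. (2.7)-(2.11)."""
    gray = gray.astype(np.float32)
    pyramid = [gray]                      # i0 ... i4, each level halved
    for _ in range(4):
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    out_size = pyramid[2].shape[::-1]     # accumulate at the largest used scale (i2)
    on_map = np.zeros(pyramid[2].shape, np.float32)
    off_map = np.zeros(pyramid[2].shape, np.float32)

    for s in scales:
        center = pyramid[s]
        for sigma in sigmas:
            k = 2 * sigma + 1
            # mean of the (2*sigma+1)^2 window excluding the centre pixel (eq. 2.7)
            surround = (cv2.blur(center, (k, k)) * k * k - center) / (k * k - 1)
            on = np.maximum(center - surround, 0)    # eq. (2.9)
            off = np.maximum(surround - center, 0)   # eq. (2.10)
            on_map += cv2.resize(on, out_size)       # accumulation as in eq. (2.11)
            off_map += cv2.resize(off, out_size)
    return on_map, off_map
```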


Figure 2.6: The intensity conspicuity map of Figure 2.1.

Figure 2.7: The intensity conspicuity map of Figure 2.3.

2.1.3 Graph-based image segmentation

In this work, image segmentation is fundamental for the higher-level problem of automatic object recognition, and it may also address problems such as figure-ground separation and recognition by parts. In fact, other low-level techniques such as edge detection are combined with this technique in a wide range of Computer Vision tasks. As image segmentation and grouping remain great challenges for Computer Vision, Felzenszwalb [10] proposed an effective method to segment regions within colour images. Felzenszwalb's method seeks the following properties:

1. capturing perceptually important groupings or regions: those areas are important because they often reflect global aspects of the image. In general, two central issues are to provide precise characterizations of what is perceptually important, and to be able to specify what a given segmentation technique actually does;

2. efficiency: the methods should run in time nearly linear in the number of image pixels. In order to be of practical use, any segmentation method should run at speeds similar to edge detection or other low-level visual processing techniques.

Obviously, segmentation quality depends on the image: smooth areas with clear steps between different regions are ideal for segmentation algorithms. While current methods work well on simple examples, they all break down given cases with enough clutter and camouflage. For more complex images, automatic segmentation provides less reliable indicators of region boundaries, and its utility for object recognition becomes questionable. Humans probably use a combination of Object Recognition methods in conjunction with Image Segmentation methods, both for segmentation and recognition tasks.


adjusting the segmentation criteria.

The predicate D is adopted in the method for evaluating whether or not there is evidence for a boundary between two components (i.e. regions) in a segmentation. This predicate measures the dissimilarity between elements along the boundaries of both components, weighted against a measure of the dissimilarity among neighbouring elements within each of the two regions. The resulting predicate is thereby adaptive with respect to the local characteristics of the data, as it compares the differences between adjacent regions and between elements within both components. The internal difference of a component $C \subseteq V$ is defined as the largest weight in the minimum spanning tree $MST(C, E)$ of the component, where the MST is a subgraph, also a tree, that connects all the vertices. That is

$$\mathrm{Int}(C) = \max_{e \in MST(C, E)} w(e). \quad (2.13)$$

The external difference between two components $C_1, C_2 \subseteq V$ is defined as the minimum weighted edge connecting the two components. That is,

$$\mathrm{Dif}(C_1, C_2) = \min_{v_i \in C_1,\, v_j \in C_2,\, (v_i, v_j) \in E} w((v_i, v_j)). \quad (2.14)$$

The region comparison predicate evaluates whether there is enough evidence for a boundary between a pair of components by checking if the external difference $\mathrm{Dif}(C_1, C_2)$ is relatively large when compared to the internal difference of at least one of the components, $\mathrm{Int}(C_1)$ or $\mathrm{Int}(C_2)$. A threshold function is then used to control the degree to which the difference between components must be larger than the minimum internal difference. Hence, the predicate selected by Felzenszwalb in his method is

$$D(C_1, C_2) = \begin{cases} \text{true} & \text{if } \mathrm{Dif}(C_1, C_2) > M\mathrm{Int}(C_1, C_2) \\ \text{false} & \text{otherwise,} \end{cases} \quad (2.15)$$


where $M\mathrm{Int}(C_1, C_2)$ is the Minimal Internal Difference introduced above and defined as
$$M\mathrm{Int}(C_1, C_2) = \min(\mathrm{Int}(C_1) + \tau(C_1),\, \mathrm{Int}(C_2) + \tau(C_2)), \quad (2.16)$$

where τ is the threshold function introduced to control the degree to which the difference between two components must be greater than their internal differences, in order to detect a boundary between them (predicate D to be true). For small components, Int(C) is not a good estimate of the local characteristics of the data. In the extreme case, when |C| = 1, Int(C) = 0. Therefore, we use a threshold function based on the size of the component,

$$\tau(C) = k / |C|, \quad (2.17)$$

where |C| denotes the size of C and k is a design parameter, deriving from the consideration that for small regions stronger evidence is required for a boundary.

For colour images the method is applied three times, once for each colour channel, and the results are then intersected. A possible way to handle the problem is to put two neighbouring pixels in the same component when they appear in the same component in all three of the colour plane segmentations. Hence, it is possible to consider a grayscale image without loss of generality. The method implementation adopted (cf. Section 3.7.1) constructs the edge set E by connecting pairs of pixels that are neighbours in an 8-connected sense.

This yields a graph where m = O(n), so the running time of the segmentation algorithm is O(n log n) for n image pixels. The edge weight function has been based on the absolute intensity difference between the pixels connected by an edge

$$w((v_i, v_j)) = |I(p_i) - I(p_j)|, \quad (2.18)$$

where $I(p_i)$ is the intensity of the pixel $p_i$. Also, a Gaussian filter can be added to the process to smooth the image before the computation of edge weights, in order to filter out noise and other errors affecting the production of the digital image. Consider a graph $G = (V, E)$, with n vertices and m edges, given as input to the method; then the method operatively consists of the following steps:

1. Sort E into $\pi = (o_1, \ldots, o_m)$, by non-decreasing edge weight;

2. Start with a segmentation $S^0$, where each vertex $v_i$ is in its own component;

3. For every $q = 1, \ldots, m$ construct $S^q$ from $S^{q-1}$ as follows. Let $v_i$ and $v_j$ be the vertices connected by the q-th edge in the ordering, i.e. $o_q = (v_i, v_j)$. If $v_i$ and $v_j$ are in disjoint components of $S^{q-1}$ and $w(o_q)$ is small compared to the internal difference of both those components, then merge the two components, otherwise do nothing. More formally, let $C_i^{q-1}$ be the component of $S^{q-1}$ containing $v_i$ and $C_j^{q-1}$ the component containing $v_j$. If $C_i^{q-1} \neq C_j^{q-1}$ and $w(o_q) \leq M\mathrm{Int}(C_i^{q-1}, C_j^{q-1})$, then $S^q$ is obtained from $S^{q-1}$ by merging $C_i^{q-1}$ and $C_j^{q-1}$. Otherwise $S^q = S^{q-1}$;

4. Return $S = S^m$.

The output of the method is a segmentation of V into components $S = (C_1, \ldots, C_r)$. An example of the method applied to a simulated environment is shown in Figure 2.8, and to a real environment in Figure 2.9. In both cases the method results are very effective, and it allows even mimetic objects to be segmented correctly while keeping the object shape almost unvaried. The only drawback of the method is the computational time required, which has been measured at more than a second and a half per frame, too slow for real-time applications.
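For reference, the sketch below runs a publicly available implementation of Felzenszwalb's algorithm from scikit-image; the parameter values (scale, which plays the role of k in eq. (2.17), the pre-smoothing sigma and min_size) are illustrative, not the ones tuned in this work.

```python
import cv2
from skimage.segmentation import felzenszwalb

# Placeholder input frame; the method is normally run on the colour image.
image = cv2.cvtColor(cv2.imread("input_frame.png"), cv2.COLOR_BGR2RGB)

# scale ~ k in eq. (2.17): larger values favour larger components;
# sigma is the Gaussian pre-smoothing; min_size merges tiny components afterwards.
labels = felzenszwalb(image, scale=300, sigma=0.8, min_size=50)

print("number of segmented components:", labels.max() + 1)
```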

2.1.4 Hough lines

Suppose that contours have been detected in an intensity image, that the result has been stored in a binary contour image, and that it is necessary to filter out straight lines that may be associated with street lines or changes in the environment, in order to concentrate on the object recognition problem. Due to imperfections in either the image data or the edge detector, however, there may be missing pixels on the desired curves as well as spatial deviations between the ideal line and the noisy edge pixels obtained from the edge detector. For these reasons, it is often not trivial to group pixels


into appropriate sets of lines. There are methods available to address this problem by finding some basic two-dimensional shapes within noisy contour images. The method is called the Hough Transform and was first developed for the purpose of line detection, later extended to circles and ellipses, and then generalised to every two-dimensional shape [2]. The search for lines is performed in the parameter space, rather than in the image space, from which candidates are obtained as local maxima in an accumulator space that determines the evidence for a line from the results of a voting procedure. Since lines are described by two parameters, usually the slope m and the intercept with one axis, the evidence for lines can be described in an intensity grayscale image, where rows correspond to quantized values of the first parameter, columns correspond to quantized values of the second parameter, and the intensity represents the number of pixels for which the pixel and two or more of its neighbours are aligned with that pair of parameters. Hough proposed to use, instead of the usually adopted parameters, the polar coordinates r (the distance of the line from the origin) and θ (the angle of the vector orthogonal to the line and pointing toward the upper half plane). If the nearest point is located above the origin, θ is simply the angle of the vector from the origin to the line's closest point. The definition of the polar coordinates is represented in Figure 2.10. A line can be described in terms of the polar parameters as

$$y = -\left(\frac{\cos\theta}{\sin\theta}\right) x + \frac{r}{\sin\theta}. \quad (2.19)$$

If r is required to be positive, then θ must vary in $[0, 2\pi)$. All the lines that go through the pixel $(x_0, y_0)$ are determined by the following equation
$$r(\theta) = |x_0 \cos\theta + y_0 \sin\theta|. \quad (2.20)$$

This equation corresponds to a sinusoidal curve in the (r, θ) plane, and it is unique for a given point. An important property has been identified in [2]: a set of pixels that form a straight line in the image produces sinusoids in the accumulator which cross at the line parameters. Hence, the problem of detecting collinear points has been converted into the problem of finding concurrent curves.

Operatively, the linear Hough transform algorithm uses a two-dimensional array, or accumulator. The dimension of the accumulator equals the number of selected and quantised parameters. For each pixel at (x, y) and its neighbourhood, the Hough transform algorithm determines if there is evidence of a straight line and, if so, calculates the corresponding (r, θ) and increments the value in the accumulator bin assigned to that pair. By finding the bins with the local maxima in the accumulator space,


Figure 2.10: The polar coordinates used to describe lines in a parameter space, for the purpose of line detection.

Figure 2.11: The Hough transform, a grayscale image where rows correspond to quantized r values and columns to quantized θ values. Local maxima parameter couples determine the estimated line parameters.

the most likely lines can be extracted and approximately determined by their parameters. The simplest way of finding these peaks is then to apply a threshold to the Hough transform, an example of which is shown in Figure 2.11. The lines returned do not contain any length information, so it is necessary in a following step to find which parts of the image match up with those lines. The final result of the linear Hough transform is the accumulator matrix: one dimension of this matrix is the quantised angle θ and the other dimension is the quantised distance r.

In object recognition this method may be used to eliminate line blobs obtained from image segmentation techniques, such as the Graph-based Image Segmentation discussed in Section 2.1.3.
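A minimal sketch of line extraction with OpenCV's standard Hough transform on a Canny edge map follows; the accumulator resolution (1 pixel in r, 1 degree in θ) and the vote threshold are illustrative values.

```python
import cv2
import numpy as np

edges = cv2.Canny(cv2.imread("input_frame.png", cv2.IMREAD_GRAYSCALE), 50, 150)

# rho and theta set the accumulator quantisation; threshold is the minimum
# number of votes (aligned edge pixels) needed to accept a line.
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=100)

if lines is not None:
    for r, theta in lines[:, 0]:
        print(f"line candidate: r = {r:.1f} px, theta = {np.degrees(theta):.1f} deg")
```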

2.1.5 Morphological operators

Object detection techniques usually need further post-processing methods to correctly extract a candidate object for further analysis. In these terms, mathematical morphology provides a wide range of operators for image processing, which are particularly useful for the analysis of binary images. There also exist more complex, non-binary structuring elements that can be used for advanced post-processing purposes, such as thinning or grayscale morphological operators. Common applications include edge detection, noise removal, image segmentation and enhancement.

Binary morphological operators take two inputs, a binary image or mask A and a structuring element or kernel B, and give an output binary image. Binary images are useful to separate the foreground (the candidate object) from the background (unimportant information). The structuring element defines the operator and is designed for the application purposes. It usually consists of a square binary image containing a pattern, with the origin placed in the central element. When a morphological operation is carried out, the origin of the structuring element is typically translated to each pixel position in the image. For each pixel in the input image, the structuring element defines which adjacent image pixel values are to be considered and compared to obtain the result. There are two basic morphological operators, dilate and erode. In this context, they are both defined by a square structuring element containing all ones (i.e. $B = I_{k,k}$ where k is the kernel size), but they are complementary: the dilate operator sets the output pixel to 1 if and only if there is at least one foreground pixel in the input image selected by the structuring element; the erode operator sets the output pixel to 0 if and only if there is at least one background pixel in the input image selected by the structuring element. Formally, the dilation of a binary image is defined as

$$A \oplus B = \{ p \mid B_p \cap A \neq \emptyset \}, \quad (2.21)$$

where p represents a pixel of the output binary image. Similarly the erosion of a binary image is defined as

$$A \ominus B = \{ p \mid B_p \cap A \equiv B_p \} \quad (2.22)$$

with the same meaning of p. These two basic operators are important because, when combined, they result in two more useful and practical operators, called the opening and closing operators. These two operators can be considered less disruptive than dilation and erosion. The effect of opening is to maintain as foreground the pixels of the structuring elements completely contained in foreground regions of the input image, while labelling all other regions as background pixels; it is derived formally as

$$A \circ B = (A \ominus B) \oplus B. \quad (2.23)$$

It can be seen that opening an image corresponds to dilating the corresponding eroded image. While erosion can be used to eliminate small clusters of noise quite effectively, it has the disadvantage of affecting all foreground blobs indiscriminately. Opening gets around this problem by performing


Figure 2.12: The effects on the Canny-edge detector output in Figure 2.2 of the morphological close operator.

both an erosion and a dilation on the image. The dual operator of opening is the closing operator, as it only maintains as background the pixels for which the structuring element is completely contained in background regions of the original image. Closing is defined as

$$A \bullet B = (A \oplus B) \ominus B. \quad (2.24)$$

Closing is useful to fill in particular background regions of an image. In order to achieve this goal, a suitable structuring element has to be designed according to the regions to be preserved, while at the same time not fitting inside the regions that are to be removed.

Figure 2.12 shows the benefit that may be obtained by applying morphological operators to enhance certain processed images. In this case a kernel of size 6 has been used, but higher values may connect different objects together, or objects may become confused with the background. There is no general rule for when these techniques obtain enhancing results.
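The effect of Figure 2.12 can be reproduced with OpenCV as in the sketch below; the 6×6 all-ones kernel matches the size mentioned above, while the file names are placeholders.

```python
import cv2
import numpy as np

edges = cv2.imread("canny_edges.png", cv2.IMREAD_GRAYSCALE)

# 6x6 all-ones structuring element (B = I_{k,k} with k = 6).
kernel = np.ones((6, 6), np.uint8)

# Closing = dilation followed by erosion: fills small gaps between edge fragments
# so that object contours become connected blobs.
closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
cv2.imwrite("edges_closed.png", closed)
```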

2.1.6 Contours and bounding rectangles

The problem of determining the number of objects present within a binary image can be solved by counting the outermost borders of the blobs present. An effective algorithm to determine blob contours has been proposed by Suzuki [25], and it can be used in component counting, shrinking, and topological structural analysis of binary images. Although several works have been reported on the topological structural analysis of binary pictures using raster scan and labelling, Suzuki's border following algorithm remains attractive when it is also necessary to analyse the topological structure of an input image, as is the case for the object shape matching problem.

The bounding rectangle of a contour is defined as the minimum area rectangle with borders parallel to the image borders. It can be easily determined from a contour using the following algorithm:

1. take a random point belonging to the contour and save its coordinates as the starting pixel and as minrow, mincol, maxrow and maxcol;

2. follow the contour in a chosen direction and, at each point:

• if the current point row is smaller than minrow, update minrow;
• if the current point column is smaller than mincol, update mincol;
• if the current point row is greater than maxrow, update maxrow;
• if the current point column is greater than maxcol, update maxcol;

3. if the current contour pixel is different from the starting pixel, reiterate from step 2, otherwise stop.

The algorithm proposed results in the estimation of the candidate object location within the input image, and it can be stored efficiently by keeping only the values of minrow, mincol, maxrow and maxcol.
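In OpenCV, Suzuki's border-following algorithm is available as cv2.findContours, and the axis-aligned bounding rectangle of each contour as cv2.boundingRect; the following sketch (placeholder file name) extracts candidate sub-windows this way.

```python
import cv2

mask = cv2.imread("edges_closed.png", cv2.IMREAD_GRAYSCALE)  # binary image from the previous steps

# cv2.findContours implements Suzuki's border following;
# RETR_EXTERNAL keeps only the outermost border of each blob.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

candidates = []
for contour in contours:
    # Bounding rectangle parallel to the image borders:
    # (x, y) corresponds to (mincol, minrow); w and h follow from maxcol and maxrow.
    x, y, w, h = cv2.boundingRect(contour)
    candidates.append((x, y, w, h))        # stored compactly, as discussed above
```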


Figure 2.13: The bounding rectangle method applied to extrapolate objects from the image in Figure 2.12.

The method of combining borders with bounding rectangles is efficient for the purposes of object recognition as it has the following advantages:

• effective reduction of the amount of memory necessary to store detected object location and silhouette;

• efficient object extraction by easily evaluating the object contour bounding rectangle;

• conversion of the problem of iterating the object recognition process for every detected object, to the problem of iterating for the contour bounding rectangle.

As can be seen in Figure 2.13, which applies the bounding rectangle method to Figure 2.12, the method is effective at extracting portions of the image containing candidate objects, to which other object recognition methods can be applied in the following steps.


2.2 Object classification

Object classification is an important task in Computer Vision concerning the problem of identifying objects in an image or video sequence. Humans are able to recognize a huge multitude of objects in images with little effort, despite the fact that objects may be seen from different viewpoints, in many different sizes and scales or after translations or rotations, in many different light conditions and when objects are partially occluded or obstructed from view. Hence, the main objective is to design methods with the following properties:

• photometric invariance: objects may look different in different light conditions, or may not be clearly visible. An object recognition method should not be affected by these changes, such as changes in the luminance value;

• scale invariance: an object should be recognised when it is scaled larger or smaller by any factor;

• translation invariance: this is the simplest but most fundamental requirement; as an object looks exactly the same after a simple translation, the recognition system should identify it when it moves rigidly;

• rotation invariance: a rotated object may be challenging to recognise. If it is rotated about an axis parallel to the line of sight, it may still be simple to recognise. When the problem is generalised to every axis of rotation it becomes very complex, and it is still unsolved.

Scale, translation and rotation invariance may also be summarised by the term affine invariance, which is obtained when the object is recognised after being rotated, scaled, or translated. Many approaches have been implemented over multiple decades [7][19][15], but this task still remains a challenge for Computer Vision systems [14]. This chapter analyses the main object recognition methodologies developed, with a focus on non-learning methods (i.e. not requiring Machine Learning).

2.2.1 Invariant moments

An image moment is a weighted average of the intensity image, or a function of such moments. In general, the two-dimensional (i + j)th order moments of an intensity image f(x, y) are defined as
$$M_{ij} = \sum_x \sum_y x^i y^j f(x, y). \quad (2.25)$$
Image moments are useful in object


recognition because they can be associated uniquely to a single image and are useful to describe objects after the segmentation process. Furthermore, simple properties of the image can be derived easily from image moments, such as the image area (or total intensity), the centroid and the image orientation. For the purposes of object recognition, the moments defined in equation (2.25) are still not useful, being variant with translation, scale and rotation. In 1962 Hu [15] presented a theory of two-dimensional moment invariants for planar geometric figures, which can be demonstrated to be a complete system of moment invariants under translation, similitude and orthogonal transformations. Hu achieved maximum utility and flexibility using a method insensitive to variations in shape and providing improving performances with repeated trials [15]. First, Hu defined the central moments, demonstrating that they are insensitive to translations of coordinates [15], as

$$\mu_{ij} = \sum_x \sum_y (x - \bar{x})^i (y - \bar{y})^j f(x, y), \quad (2.26)$$
where
$$\bar{x} = \frac{M_{10}}{M_{00}}, \qquad \bar{y} = \frac{M_{01}}{M_{00}}. \quad (2.27)$$

To obtain invariance to similitude transformations (i.e. image scaling) it is sufficient to divide every translation invariant moment by the properly scaled first translation invariant moment, according to the following formula

$$\eta_{ij} = \frac{\mu_{ij}}{\mu_{00}^{\left(1 + \frac{i+j}{2}\right)}}. \quad (2.28)$$

Now it is possible to define the Hu moments [15], which are invariant under translation, changes in scale, and also rotation, as


$$\begin{aligned}
I_1 &= \eta_{20} + \eta_{02} \\
I_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
I_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
I_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
I_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2] \\
I_6 &= (\eta_{20} - \eta_{02})[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
I_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]
\end{aligned}$$

As can be seen, only Hu moments up to the third order are evaluated. In fact, for the purposes of object recognition, they prove to be descriptive enough. While the Hu moments do not have any clear physical meaning, $I_1$ is analogous to the moment of inertia around the image centroid, if pixel intensities are seen as physical densities. The last one, $I_7$, is skew invariant, which enables it, if considered with its sign, to distinguish mirror images of otherwise identical images (i.e. under a reflection it changes its sign only, while its absolute magnitude remains reflection invariant).
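OpenCV exposes both the raw/central/normalised moments and the seven invariants directly; the sketch below (placeholder file name) computes them for a binary object silhouette, with the commonly used signed log scaling that makes the very small values comparable.

```python
import cv2
import numpy as np

# Placeholder binary silhouette of a segmented candidate object.
silhouette = cv2.imread("object_mask.png", cv2.IMREAD_GRAYSCALE)

# cv2.moments returns the raw, central and normalised central moments
# (eqs. 2.25-2.28); cv2.HuMoments combines them into I1..I7.
moments = cv2.moments(silhouette, binaryImage=True)
hu = cv2.HuMoments(moments).flatten()

# Signed log scaling: keeps the sign of I7 (skew invariance) while
# bringing the widely different magnitudes onto a comparable range.
hu_log = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
print(hu_log)
```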

2.2.2 Scale Invariant Feature Transform

The Scale Invariant Feature Transform (SIFT) is an algorithm published by David Lowe in 1999 [19] and patented in the USA. In Computer Vision it is widely used to detect and describe local features in images. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition and video tracking. The method tries to solve one of the more difficult problems in the design of a visual pattern recognition system: the selection of a set of appropriate numerical descriptors or features to be extracted from the object of interest for classification purposes.

As the success of any practical system depends critically upon this decision [7], the method has received much attention from researchers. While there is no general theory to guide the selection of features for an arbitrary problem, it is possible to state some desirable attributes of features for the identification of objects. Among these are the following:

1. the features should be informative: the dimensionality of a vector of measurements (feature vector) should be as low as possible, consistent with acceptable recognition accuracy;

2. the features should be invariant with translation of the object normal to the camera optical axis and with rotation about this axis;


terns, but it is limited in its ability to recognise patterns independently of position, size and orientation.

Lowe's SIFT adopts a property-list approach where SIFT features are the properties. The advantage of SIFT is that it transforms an image into a large collection of local feature descriptors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection.

SIFT features are identified by using a staged filtering method. The first stage identifies locations in scale-space which are maxima or minima of a difference-of-Gaussian function of the input pixels. Each point is used to generate a feature vector, a descriptor, that describes the local image region relative to its scale-space coordinate frame. The features achieve partial invariance to local variations by blurring image gradient locations. Each image may generate more than 1000 keypoints. The descriptors derived from an image are used in a Nearest-Neighbour (NN) approach to identify the best candidate descriptor in object models. When at least 3 keypoints agree on the model parameters with low residual, there is strong evidence for the presence of the object. Keypoints are computed efficiently by building an image pyramid with resampling between each level; they are located at regions and scales of high variation, and hence are particularly stable for characterizing the image. The input image is first convolved in both the horizontal and vertical directions using the following 1D Gaussian function

$$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2 / 2\sigma^2}. \quad (2.29)$$

In Lowe's method, the input image is first convolved with the Gaussian function using $\sigma = \sqrt{2}$ to give an image A. This is then repeated a second time, with a further incremental smoothing of $\sigma = \sqrt{2}$, to give a new image, B,


which now has an effective smoothing of $\sigma = 2$. The difference-of-Gaussian function is obtained by subtracting image B from A; in other words, it is obtained as the difference of Gaussian blurrings of an image with two different $\sigma$. To generate the next pyramid level, Lowe resampled the smoothed image B using bilinear interpolation with a pixel spacing of 1.5 in each direction, as this was frequent enough to detect peaks. Maxima and minima of this scale-space function are determined by comparing each pixel to its neighbours in the pyramid, which are the 8 neighbours in the same level and 9 neighbours in each of the two adjacent levels of the pyramid.

After the detection of keypoints, each associated with a stable location, scale and orientation, Lowe's method associates a descriptor to each of them, in a way that is invariant to the same transformations and to small variations in the object which may arise in non-rigid objects or after some 3D transformations. They are computed by first taking the smoothed image A at each level of the pyramid, which is then processed to extract image gradients and orientations. At each pixel, $A_{ij}$, the image gradient magnitude, $M_{ij}$, and orientation, $R_{ij}$, are computed using pixel differences

$$M_{ij} = \sqrt{(A_{ij} - A_{i+1,j})^2 + (A_{ij} - A_{i,j+1})^2} \quad (2.30)$$
$$R_{ij} = \operatorname{atan2}(A_{ij} - A_{i+1,j},\, A_{i,j+1} - A_{ij}) \quad (2.31)$$

Robustness to illumination change is improved when gradient magnitudes are thresholded at a value of 0.1 times the maximum possible gradient value. Rotation invariance is achieved by assigning to each keypoint a canonical orientation, selected as the peak in a histogram of local image gradient orientations, created using a Gaussian-weighted window with σ of 3 times that of the current smoothing scale. These weights are multiplied by the thresholded gradient values and accumulated in the histogram at locations corresponding to the orientation, $R_{ij}$. The histogram has 36 bins covering the 360 degree range of rotations, and is smoothed prior to peak selection. Robustness to local geometric distortion, in particular, is obtained by representing the local image region with 8 orientation planes, that is, an image for each quantised gradient direction, with linear interpolation for intermediate orientations. Each orientation plane is blurred and resampled to allow for larger shifts in the positions of the gradients. The pixels that fall in a square of 16 × 16 pixels around the key location are inserted into the orientation planes, approximating a circle of radius 8 pixels. The orientation is measured relative to the keypoint orientation. To avoid boundary effects in which the descriptor abruptly changes at region limits, trilinear interpolation is used to reduce the contributions of bins which are more distant. This is obtained by multiplying by a weight of 1 − d for each dimension, where d is the normalised distance of the sample from the central value of the bin. The descriptor is formed with a 4 × 4 array of histograms, as this size showed the best results, with 8 orientation bins in each. Therefore, SIFT implementations use a 4 × 4 × 8 = 128 sized descriptor for each keypoint.

Finally, the feature vector is modified to reduce the effects of illumination change. As a change in image contrast multiplies the value of the gradient at each pixel by a constant factor, the vector is normalized. A brightness change adds a constant value to each image pixel; this does not affect the gradient values either, as they are computed from pixel differences. Therefore, the descriptor is invariant to affine changes in illumination. However, non-linear illumination changes can also occur due to camera saturation or other effects. These effects generally affect only gradient magnitudes, without changing their orientations. Lowe reduced the influence of large gradient magnitudes by thresholding the values in the unit feature vector to each be no larger than 0.2, and then renormalizing. In other words, the matching of large magnitudes is given less weight, and the distribution of orientations has greater emphasis. The value of 0.2 has been justified experimentally by analysing results on images containing differing illuminations of the same scene. An example of the features found by the SIFT algorithm in two simulated images is shown in Figure 2.14. In the figure, features in the first image are matched with their nearest neighbours in the second image. As can be seen, it is necessary to apply some other criterion for the purposes of object detection. A possible method to assess whether the image corresponds, even partially, to a second image is proposed in Chapter 3.
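A minimal matching sketch with OpenCV's SIFT implementation is shown below; the database/input file names are placeholders, and the nearest-neighbour ratio test used to prune ambiguous matches is a commonly adopted criterion rather than the matching-consistency method developed later in this work.

```python
import cv2

model = cv2.imread("model_pose.png", cv2.IMREAD_GRAYSCALE)   # database image (placeholder)
query = cv2.imread("input_frame.png", cv2.IMREAD_GRAYSCALE)  # input frame (placeholder)

sift = cv2.SIFT_create()
kp_m, des_m = sift.detectAndCompute(model, None)
kp_q, des_q = sift.detectAndCompute(query, None)

# Brute-force nearest-neighbour matching of the 128-dimensional descriptors,
# keeping a match only when it is clearly better than the second-best candidate.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des_m, des_q, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} descriptor matches passed the ratio test")
```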


2.2.3 Viola-Jones cascade classifier

The Viola-Jones cascade classifier [26] is a method for combining increasingly complex classifiers in a cascade. This method has the main advantage of quickly discarding background regions of the image, keeping more computation for possible object regions. The cascade may be seen as a focus mechanism which provides statistical guarantees that discarded regions are unlikely to contain objects. The method has been successfully applied in the domains of face [27] detection, where it is the de facto standard, and pedestrian [28] detection; the cascade yielded detection rates comparable to the best previously designed systems [26]. An example of the application of the method to face detection problems is shown in Figure 2.15.

Figure 2.15: An example of Viola-Jones[27] cascade-based face detection.

The power of Viola's classification lies in the use of a cascade of simple and highly specialised classifiers, each one able to reject a candidate window; it is also relevant because it takes advantage of two techniques: integral images, which speed up the computation, and AdaBoost [12], which is a general method for improving the accuracy of any given learning algorithm and which reduces the number of features that need to be tested, while usually not being affected by overfitting problems. The main drawbacks of the method proposed by Viola reside in the number of images to be used during the learning phase, which reduces its flexibility, as the entire classifier must be re-evaluated if new data has to be added to the dataset, and in the unclear meaning of the derived classifiers, as they are obtained from an automated Machine Learning method.
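OpenCV ships pre-trained Haar cascades for the face detection case cited above; a minimal sketch of their use follows (the image file name is a placeholder, and the scaleFactor/minNeighbors values are common defaults rather than tuned parameters).

```python
import cv2

# Pre-trained frontal-face cascade distributed with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

gray = cv2.imread("group_photo.png", cv2.IMREAD_GRAYSCALE)

# The cascade slides windows over the image at multiple scales and rejects
# most background windows in its first, cheapest stages.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    print(f"face candidate at x={x}, y={y}, size={w}x{h}")
```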


Shape-Details Cascade (SDC), built by selecting a cascade-like structure.

3.1 Dataset creation

In object recognition a database of the objects to be identified has to be built. The system tries to match the information present within the dataset of images and, according to some criteria, to identify known objects. There are two main approaches to managing such information:

• to label every object with a name. This method is more flexible than the following one, as new objects can be added even in real time and it does not require negative images (i.e. images that do not contain the object instance), but it requires a huge computational effort if the dataset becomes large, as an input image has to be matched against every dataset image. This method is flexible enough to potentially identify objects that are only similar to the objects present in the dataset;

• to let the system automatically identify the patterns present within the images, and build an object or dataset classifier able to recognise the same patterns (e.g. the Viola-Jones cascade classifier commonly adopted in face detection applications [26]). The main drawbacks of this method are that the learning phase, which usually requires a long time, has to be performed for each dataset (it has a fixed structure), and that the dataset has to be designed appropriately, containing non-object instances, in order to identify which patterns are discriminative of that object. It usually happens that selective patterns are present only for certain object poses (e.g. frontal faces), and hence the dataset and the classifiers will be different for different object poses. This method is very performant if the objects to be recognised are well defined and show small variability.

As the UAV visual system has to be designed to be as flexible as possible, recognising even objects not present in the dataset by associating some similarity measure, the first method has been adopted. To maximise the a priori knowledge of a given object, different poses of the same object have been ordered in a circular pattern, sampling at some angular interval while rotating around the object. As the UAV sees objects from high altitudes and the recognition methods can be selected to be rotation, scale and translation invariant, it is only required to sample the object over a portion of a hemisphere, to be defined according to the UAV flight specifications. An example of a similar approach is given by Hao [14], in which objects have been sampled from viewpoints located in circles of different radius and altitude with respect to the object, centred on a point of the z-axis orthogonal to the ground, and the decision process is then performed efficiently by canonical feature matching accompanied by a feature 3D location consistency check. The hypothesis made for the methods that will be introduced in Sections 3.3 and 3.5 is that the database is built maintaining the adjacency information between poses. For simplicity, poses will be considered as taken in a single circle around the object; nevertheless, this is not required. Two views of an object are called distinct if and only if they produce images which do not merely differ from each other by a rotation, a reflection, or a combination of rotation and reflection. The significant range of values of ρ, θ and φ (spherical coordinates defined in Figure 3.1) is the one which at least covers all distinct views of the object. Three-dimensional objects viewed from at least zero altitude have a significant range for the spherical coordinates as follows

$$\forall \rho \in \mathbb{R}, \quad (3.1)$$
$$-\pi \leq \theta \leq \pi, \quad (3.2)$$
$$0 \leq \phi \leq \frac{\pi}{2}. \quad (3.3)$$

In practice, the significant range of values of radius and angles is discretised; then, for all the possible combinations of these discretised values, the images


Figure 3.1: The spherical reference system defined to obtain the pose acquisitions for a given object centred in the origin O. The z axis is assumed to be orthogonal to the ground.

Figure 3.2: An object sampled in a circular manner from a higher position. Such a set of poses constitutes the object model.
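As an illustration of how such a pose set could be enumerated, the sketch below discretises the significant ranges (3.2)-(3.3) and converts each (ρ, θ, φ) triple to a camera position; the angular steps are arbitrary illustrative values, and φ is assumed here to be the elevation above the ground plane (the exact convention is fixed by Figure 3.1).

```python
import numpy as np

def sample_viewpoints(rho, theta_step_deg=30.0, phi_step_deg=30.0):
    """Enumerate camera positions over the discretised ranges (3.2)-(3.3)."""
    viewpoints = []
    for theta in np.deg2rad(np.arange(-180.0, 180.0, theta_step_deg)):
        for phi in np.deg2rad(np.arange(0.0, 90.0 + phi_step_deg, phi_step_deg)):
            # Spherical-to-Cartesian conversion, z orthogonal to the ground,
            # with phi taken as elevation (an assumption of this sketch).
            x = rho * np.cos(phi) * np.cos(theta)
            y = rho * np.cos(phi) * np.sin(theta)
            z = rho * np.sin(phi)
            viewpoints.append((x, y, z))
    return viewpoints

# One image of the object would be acquired from each returned position.
poses = sample_viewpoints(rho=50.0)
print(len(poses), "sampling poses")
```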

3.2 Canonical shape recognition methods

Canonical shape recognition methods consist of measuring the distance between input and model shape descriptors and applying non-probabilistic classification procedures, or empirically defining a fixed value for which the detection is considered successful. An example implementation of this method is described by Dudani [7] for aircraft recognition. Dudani assumed that an object to be recognized belongs to one of a set of known classes, and that all elements of a given class are essentially "cast from the same mold". A training sample set of the given object is obtained by sampling for a fixed ρ (as the selected descriptors are scale invariant) and various values of the spherical angles selected from the significant range defined in (3.2) and (3.3).

Dudani selected as informative descriptors for the shape of the aircraft to be recognised the Hu invariant moments described in Section 2.2.1. However, he doubled the number of descriptor components by evaluating the Hu moments for both the object contour and the silhouette. There may be some criticism of this choice, as for complex scenes and mimetic or occluded objects it is difficult to extract reliable object contours, and Hu moments relative to contours are noisy descriptors.

After selecting informative descriptors, a classification procedure has to be performed. An example is the Distance-Weighted k-Nearest Neighbour Rule [7], which assumes that similar observations have the same classification and that it is reasonable to weight the contribution of nearby neighbours more heavily than the contribution of neighbours at greater distances. In other words, the method results in a procedure that assigns a weighted distance. Suppose that an observation is described by some real vector ρ and that its k nearest neighbours in a dataset have been identified and ordered by ascending Euclidean distance d_i, i = 1, 2, . . . , k. The distance is then evaluated again adopting the following weights

w_j = (d_k − d_j) / (d_k − d_1),  if d_k ≠ d_1,
w_j = 1,                          if d_k = d_1.    (3.4)

The input image is assigned to the class for which it has the minimum weighted distance. It should be noted that this approach would assign any object to an aircraft class if a dataset containing only aircraft were considered, as it does not include a rule to exclude matchings. The main disadvantage of the method is that no definite result regarding the relationship between recognition accuracy and the number of objects to be recognised can be determined. As performances for the same object will differ for differently built datasets, the method outputs are uncertain in real applications, since the classification accuracy is heavily affected by object shape similarities or dissimilarities within the dataset.
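A minimal sketch of the distance-weighted vote is given below, under the assumption (common in formulations of Dudani's rule) that the class with the largest sum of neighbour weights wins; the array names and the Euclidean metric are illustrative.

import numpy as np

def distance_weighted_knn(x, descriptors, labels, k=5):
    """Distance-Weighted k-NN in the spirit of Dudani [7].
    descriptors: (N, d) array of training shape descriptors.
    labels: length-N array with the class of each row."""
    d = np.linalg.norm(descriptors - x, axis=1)      # Euclidean distances
    order = np.argsort(d)[:k]                        # k nearest neighbours
    d_sorted = d[order]
    d1, dk = d_sorted[0], d_sorted[-1]
    # Weights from equation (3.4): nearest neighbour weighs 1, farthest 0
    w = np.ones(len(order)) if dk == d1 else (dk - d_sorted) / (dk - d1)
    votes = {}
    for weight, lab in zip(w, labels[order]):
        votes[lab] = votes.get(lab, 0.0) + weight
    return max(votes, key=votes.get)                 # best-supported class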


where W is the window size that defines the interval, in object poses, over which the covariance matrices of the local shape descriptors are evaluated. The score measures whether an object with a certain shape is likely to fit in a certain category. In other words, the exponent of the measure contains the Mahalanobis distance [20] between two shapes, weighted by the local shape variation pattern contained within C. The Mahalanobis distance may be interpreted as a dissimilarity measure between two random vectors m and h_i of the same distribution with covariance matrix C_W(i). It differs from the Euclidean distance because it takes into account the correlations of the data set, and it is scale-invariant. The latter property gives more weight to shape moments which are more discriminant in the description of a certain class of objects (and less weight to those which are not), providing an appropriate rescaling factor for each moment so that all moments can be compared within a single distance measure.

As different shapes may yield different minimum values for which the association to a certain object category may be considered successful, a minimum required score is defined to accept the association with a certain category. Nevertheless, it is not possible to determine a global minimum score, common to every object, to state whether the matching has been successful or not. In order to determine the minimum score, a dataset validation process has to be performed beforehand; an example application of this approach is explained in Section 3.7.3. The category minimum required score is defined automatically from the dataset as

MinPassScore_LSP(W, O) = min_{i ∈ [1,N], j ∈ P_W(i)} { exp(−√((h_j − h_i)^T C_W(i)^{−1} (h_j − h_i))) } ∈ [0, 1],    (3.6)

where P_W(i) is the set of W adjacent poses to the pose i, also containing the pose i itself. When the whole dataset is treated as a single window, a global covariance matrix C is evaluated and the minimum passing score is defined in a similar manner as

MinPassScore_LSP(O) = min_{i,j ∈ [1,N]} { exp(−√((h_j − h_i)^T C^{−1} (h_j − h_i))) } ∈ [0, 1],    (3.7)

and C_W(i) in equation (3.5) is substituted by C. It is important to note that, using the proposed definition for the minimum passing score, the LSP method is able to recognise the database itself without false negative outcomes. W may be considered as the method parameter, and its effects will be discussed in Section 4.2.1. The shape matching will be considered positive if

Score_LSP(m, W, O) ≥ MinPassScore_LSP(W, O).    (3.8)

The minimum score is evaluated once, during the validation phase of the memory system initialization; two different criteria have been identified (a minimal sketch of the resulting scoring and validation procedure follows the list):

• dataset minimum local score: the dataset is considered locally at each pose, with a varying window width; the covariance matrix is also evaluated at each pose for that window size, and it is different for every pose. The purpose of this criterion is to consider the local shape variations under 3D rotations, capturing the local law of the change in aspect. In general this allows the dataset to recognise itself without false negative outcomes;

• dataset maximum non-object score: this criterion is meant to reduce the a priori false positive rate of the method, but in principle it may determine that the dataset would not be able to recognise itself entirely. As it appears to be more selective and less accurate than the first criterion, it has not been implemented in the SDC system described in Section 3.6.
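A minimal Python sketch of the LSP scoring and of the dataset-local validation criterion of equation (3.6) follows; it assumes that the pose descriptors h_i are stacked row-wise in an array H, that poses are ordered circularly, and that the window of W poses (W odd) is symmetric around pose i. The names and the covariance regularisation are illustrative assumptions, not the thesis code.

import numpy as np

def lsp_score(m, h_i, C_inv):
    """Shape score between an input descriptor m and a database pose h_i,
    given the inverse of the (windowed) covariance matrix."""
    diff = m - h_i
    return float(np.exp(-np.sqrt(diff @ C_inv @ diff)))

def min_pass_score(H, W):
    """Dataset minimum local score as in equation (3.6), with a symmetric
    window of W adjacent poses (circular pose ordering assumed)."""
    N, d = H.shape
    best = 1.0
    for i in range(N):
        window = [(i + k) % N for k in range(-(W // 2), W // 2 + 1)]
        # Windowed covariance of the local shape descriptors, lightly regularised
        C = np.cov(H[window].T) + 1e-9 * np.eye(d)
        C_inv = np.linalg.inv(C)
        for j in window:
            if j != i:
                best = min(best, lsp_score(H[j], H[i], C_inv))
    return best

An input descriptor m would then be accepted for the category if its score against some pose of H reaches min_pass_score(H, W), mirroring the acceptance test (3.8).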

3.4 Canonical feature matching method

A simple but commonly adopted method considers the relative number of reliable matched features to determine whether the decision process has been successful [18]. The relative number of matching features, from one database image containing f_i features of which f_i^M have been matched to an input image, is defined as

Score_RM = f_i^M / f_i .    (3.9)

This method is not useful if a definition of reliable match is not provided, as in principle all the matches may be incorrect. To filter out a considerable number of incorrect matches, the ratio between the distance of the closest neighbour and the distance of the second-closest neighbour can be considered, as proposed by Lowe [18]; this is the ratio appearing in equation (3.10) and in Figure 3.3.


Figure 3.3: The Lowe [18] PDF of correct and incorrect matches obtained using a database of 40,000 keypoints. The solid line shows the PDF of the ratio in equation (3.10) for correct matches, while the dotted line is for incorrect matches.

As can be seen, selecting a ratio of distances of about 0.8 would in principle retain 95% of the correct matches while discarding about 90% of the incorrect matches. This measure of reliability makes the score defined in equation (3.9) applicable to object recognition problems; its performances are compared with HMC in Chapter 4.
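The following sketch shows how the score in equation (3.9) could be computed with the ratio test applied to SIFT matches; the function name and the 0.8 threshold are illustrative, and SIFT is only one possible choice of feature.

import cv2

def score_rm(db_image, input_image, ratio=0.8):
    """Relative number of reliable matches (equation (3.9)) from a database
    image to an input image, keeping only matches passing the ratio test."""
    sift = cv2.SIFT_create()
    kp_db, des_db = sift.detectAndCompute(db_image, None)
    kp_in, des_in = sift.detectAndCompute(input_image, None)
    if not kp_db or des_in is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # Two nearest input descriptors per database descriptor, then ratio test
    pairs = matcher.knnMatch(des_db, des_in, k=2)
    reliable = [p[0] for p in pairs
                if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(reliable) / len(kp_db)   # f_i^M / f_i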


3.5 Homography-based Matching Consistency method

Homography-based Matching Consistency (HMC) is an improvement of canonical feature matching methods. If a number of features f_i is detected within a database image and the corresponding descriptor matrix D is evaluated, then the consistency of the transformation matching f_i^M keypoints of the database image to keypoints of the input image may be analysed. When considering similar poses, the exact transformation may be considered locally planar for a detailed dataset, so the transformation is described by the homography matrix

T = [ t_11  t_12  t_13 ;  t_21  t_22  t_23 ;  t_31  t_32  1 ] ∈ R^{3×3},    (3.11)

where the elements t_13 and t_23 define a planar translation, the elements t_31 and t_32 define a perspective transformation, and the square block of order two containing t_11, t_12, t_21 and t_22 defines a combination of planar scale, rotation and shear transformations. As the system is designed to be able to recognise similar objects, transformations such as rotations, proportional scales, changes in perspective and translations should not affect the recognition process. As the remaining operations affect only the square block A extracted from T, this block is analysed in terms of its QR decomposition [11]

A = QR = [ t_11  t_12 ;  t_21  t_22 ] ∈ R^{2×2},    (3.12)

where Q ∈ R^{2×2} is an orthogonal matrix describing a rotation, and R ∈ R^{2×2} is an upper triangular matrix. Q may be seen as the rotation contribution, while R carries the scale contributions (its diagonal elements) and the shear contribution (its upper off-diagonal element). For second-order matrices, defining u_1(A) = [a_11  a_21]^T and pr(A) = a_11 a_12 + a_21 a_22, the solution of the QR decomposition is

Q = (1/√(t_11² + t_21²)) [ t_11  −t_21 ;  t_21  t_11 ],    R = [ ‖u_1(A)‖_2   pr(A)/‖u_1(A)‖_2 ;  0   det(A)/‖u_1(A)‖_2 ].    (3.13)

A new criterion for object recognition is based on the elements r_ij of R. If a database object matches an input image through a transformation involving shear and different scale factors for the x-axis and the y-axis, i.e. if it matches the input image only after distorting the object, then the match is probably wrong. A measure of the global consistency of the matching process is defined as


MinPassScore_HMC(O, i) = min_{j ∈ adj(i)} { Score_HMC^{i→j} } ∈ [0, 1].    (3.16)
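A sketch of the consistency check is given below: it estimates the homography mapping the matched database keypoints to the input keypoints, extracts the upper-left 2×2 block A, factors it as A = QR and inspects R for shear and anisotropic scaling. The thresholds and the boolean acceptance are illustrative assumptions; the actual Score_HMC of equations (3.14)-(3.15) is not reproduced here.

import cv2
import numpy as np

def homography_consistent(pts_db, pts_in, max_shear=0.2, max_anisotropy=1.5):
    """pts_db, pts_in: (N, 2) float32 arrays of matched keypoint coordinates."""
    if len(pts_db) < 4:
        return False                      # a homography needs at least 4 matches
    T, _ = cv2.findHomography(pts_db, pts_in, cv2.RANSAC, 3.0)
    if T is None:
        return False
    A = T[:2, :2]                         # planar scale/rotation/shear block
    _, R = np.linalg.qr(A)                # Q carries the rotation, R the rest
    sx, sy = abs(R[0, 0]), abs(R[1, 1])   # scale contributions
    shear = abs(R[0, 1]) / max(sx, 1e-9)  # shear contribution
    anisotropy = max(sx, sy) / max(min(sx, sy), 1e-9)
    return shear <= max_shear and anisotropy <= max_anisotropy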

3.6 Shape-Details Cascade recognition system

The Shape-Details Cascade (SDC) recognition system is a new method for object recognition that combines the LSP and HMC methods in a cascade-like structure, maintaining the benefits of both. The method consists in determining first the category to which an input object belongs, and later its most similar object in that category (i.e. discrimination). In particular, categories are differentiated by their shape using the LSP method, and then further processed in terms of their details; the first stage determines whether the window is worth further analysis or should be rejected. It is possible to design the system in two ways:

1. only the best category is taken, and further analysis will be performed on objects belonging to the most likely category;

2. further analysis is performed on every object in the union of the categories for which the system could not reject the window.

It is important to note that, as will be shown in Section 4.2.1, shape alone is not a discriminant factor in object recognition, but the advantage is that categorisation can be performed more quickly than discrimination between objects. Discrimination is then performed by detail-matching consistency with the HMC method.
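The basic two-stage structure (design option 1) can be sketched as a simple skeleton; the stage interfaces below are hypothetical and only illustrate the control flow, not the thesis code.

def sdc_recognise(window, shape_stage, detail_stage):
    """Skeleton of the basic SDC cascade (option 1).
    shape_stage(window): most likely category via LSP, or None if rejected.
    detail_stage(window, category): recognised object via HMC, or None."""
    category = shape_stage(window)           # fast shape-based categorisation
    if category is None:
        return None                          # window rejected by the shape stage
    return detail_stage(window, category)    # slower detail-based discrimination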

The cascade (cf. Section 2.2.3) is computationally efficient, as the classification process discards many unlikely objects, and accurate recognition is performed later by applying a more selective criterion. Furthermore, the process may be generalised by inserting further blocks in the cascade. For example, objects may be guessed early if their dimensions can be estimated during image processing and if upper and lower area bounds for each object are provided during the dataset construction. A method to estimate the area of an object is discussed by Dudani [7] for aircraft recognition purposes. Also, a block to discard lines can be inserted to efficiently reject background windows and improve the performances of the overall recognition system; for example, lines may arise from street detections or in the presence of morphological variations. An implementation of this block may be obtained by detecting lines with the Hough Transform described in Section 2.1.4, as sketched below. The basic system consists of classification followed by discrimination only, as shown in Figure 3.4.
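A possible form of such a line-rejection block, assuming a grayscale image window and illustrative thresholds, is the following:

import cv2
import numpy as np

def contains_strong_lines(window, min_lines=5):
    """Reject a candidate window dominated by long straight lines
    (e.g. roads or field boundaries); thresholds are illustrative."""
    edges = cv2.Canny(window, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=60,
                            minLineLength=int(0.6 * min(window.shape[:2])),
                            maxLineGap=5)
    return lines is not None and len(lines) >= min_lines

A window for which contains_strong_lines returns True would be discarded before reaching the LSP stage.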

Figure 3.4: The new recognition system that merges shape-based (LSP) and details-based (HMC) recognition. It is interesting to compare these decision steps with the human decision steps of categorisation and discrimination, respectively.

From a practical point of view, the SDC system is not a complete object recognition system, as it needs candidate objects as input. For this reason, other complementary blocks have been designed to assist the decision process, similarly to the human visual system.
