SAPIENZA Università di Roma

Dottorato di Ricerca in Ingegneria Informatica

XXXI Ciclo

Building Vision Applications through Deep

Neural Networks Using Data Acquired by a

Robot Platform

Ali Youssef

Dipartimento di Informatica, Automatica e Gestionale, Sapienza Università di Roma

SAPIENZA Università di Roma

Dottorato di Ricerca in Ingegneria Informatica

XXXI Ciclo

Building Vision Applications through Deep

Neural Networks Using Data Set Acquired by a

Robot Platform

Ali Youssef

Thesis Advisor: Prof. Daniele Nardi
Reviewers: Prof. Primo Zingaretti


Copyright© 2019 by Ali Youssef

AUTHOR’S ADDRESS: Ali Youssef

Dipartimento di Informatica, Automatica e Gestionale “Antonio Ruberti”, Sapienza Università di Roma

Via Ariosto 25, 00185 Roma, Italy. Ro.Co.Co. Laboratory.

E-MAIL: youssef@diag.uniroma1.it


Contents

1 Introduction
  1.1 Problem
  1.2 Motivation
  1.3 Outline
  1.4 Contributions
  1.5 Publications
  1.6 Collaborations

2 Background and Related Work
  2.1 Rule-Based Representation
  2.2 End-to-End Pipeline
    2.2.1 Evolution-History of ANN
    2.2.2 Mathematical Background
    2.2.3 Related Work
    2.2.4 Software Sources
  2.3 Performance Metrics

3 Developments and Results
  3.1 Image Classification in the Maritime Domain
  3.2 Classification and Detection with Mobile Platform
    3.2.1 Traffic Sign Detection and Recognition
    3.2.2 Object Recognition with NAO Soccer Robots
    3.2.3 People Detection Using Multiple Sensors
  3.3 Pixel-Wise Semantic Segmentation
    3.3.1 Skin Lesion Segmentation on Medical Images
    3.3.2 Semantic Segmentation for Crop-Weed Classification

4 Conclusions and Future Directions
  4.1 Conclusions


Acknowledgements

Firstly, I would like to express my sincere gratitude to my advisor Prof. Daniele Nardi for the continuous support of my Ph.D. study and related research, and for his patience, motivation, and immense knowledge. He inspired me with his hard-working and passionate attitude. His guidance helped me throughout the research and the writing of this thesis.

My sincere thanks also go to Dr. Domenico Daniele Bloisi, who gave me the chance to face a challenging research topic and to apply my research to real-life scenarios. I appreciate his vast knowledge and skill in many areas. I am grateful for his support and patience in overcoming the numerous obstacles I faced throughout my research.

My sincere thanks also go to Dr. María T. Lázaro for sharing her knowledge and for her support. Another special thank-you goes to all the members of the Ro.Co.Co. Lab, the SPQR Team, and SPQReL, in particular Dario Albani, Vincenzo Suriani, and Ester Latini.

I would like to express my gratitude to my family for their support and encouragement. I would also like to thank a special person, my girlfriend Simona, for her love.

Moreover, I wish to thank the Ministry of Foreign Affairs and International Cooperation (MAECI) for the grants that supported my research.


Abstract

Deep learning has proven to be a successful approach in multiple computer vision applications. The significant achievements of deep learning systems have always been coupled with the availability of powerful computation and large quantities of data. Moreover, importing deep learning into robotics applications raises additional challenges that have not been widely addressed by the computer vision and machine learning communities.

On the other hand, rule-based representation approaches have provided robust and effective solutions for different computer vision tasks over the past years. Developing and designing solutions based on predefined models requires a strong understanding and knowledge of the problem structure. However, generalizing those solutions to handle complex real-world scenarios remains one of their main drawbacks.

This thesis tries to combine the knowledge obtained by the rule-based representation approaches with the generic solutions provided by the deep learning based end-to-end techniques. This work aims at giving a general review of both techniques, and proposes a pipeline, modular in its design, for image classification, object detection and recognition, and semantic image segmentation tasks to be used on data coming from mobile robotic platforms. The main contributions of the thesis are:

• A combined approach based on rule-based representation and end-to-end techniques;

• A pipeline for object detection and recognition with mobile platforms;

• A multi-sensor approach for people detection with a social robot;

• Addressing pixel-wise semantic image segmentation in challenging applications;

• An experimental evaluation of the proposed solutions.

In this thesis, the end-to-end architecture for computer vision is described, and different solutions for improving its performance and addressing its limitations in image classification, object detection, and semantic image segmentation are proposed. Qualitative and quantitative evaluations of the proposed solutions, through experimental results in real-world applications, are shown, and future directions are discussed.


Chapter 1

Introduction

Nowadays we are witnessing a revolution in artificial intelligence (AI). Many optimistic articles have been published recently about fully autonomous cars and robots that can share in and help with everyday human activities. Many startups have been launched, and research and development teams in companies have been expanded in order to keep up with the success of AI1. The objective of the Horizon 2020 Work Programme2 is to automate processes and improve productivity by using intelligent robots, taking into consideration the reduction of worker accident rates, and by employing robots for complex tasks where human presence is impossible or where replicating a job with accurate performance becomes tedious.

The initial thought in the AI community was that the perceptual capabilities of a robot require less effort and are easier to realize than the other cognitive components of intelligence (e.g., learning, reasoning, and planning) [1]. The facts demonstrate that enabling a robot to see in a human-like manner is still an open research challenge. Obtaining an accurate and generalized approach to handle various vision problems, where many challenges are present in unstructured and unconstrained environments, is extremely hard. Even with the renaissance of computer vision approaches, machine learning techniques, and the availability of data sets and processing power, and even when both data-driven [2] and model-driven [3] vision approaches are utilized, relying on only one of them is insufficient to address robot vision applications and to handle limited-resource platforms.

Computer vision, as one of the core components of machine understanding and strongly connected to AI, is a wide and complex field of research that overlaps and interchanges with many classical and modern fields (Fig. 1.1), such as mathematics and statistics, machine learning, signal processing, neuroscience, and physics. Moreover, the development of theoretical and algorithmic approaches in computer vision brings great benefits to other fields that inherit from it (i.e., robot vision and machine vision).

The vision system is one of the key ingredients of robot perception, due to the significant information it provides and its complementary role with respect to other sensor measurements and robot modules. Robot vision is slightly different from computer vision. In computer vision, the high-level understanding of the information transformed from image space to computer space is used to elicit actions that are focused more on specific tasks, like tracking an object in a sequence of images, scene reconstruction, and image restoration.

1. https://www.cbinsights.com/research/artificial-intelligence-top-startups/
2. https://ec.europa.eu/programmes/horizon2020/en/h2020-section/robotics

Figure 1.1: Computer vision and its relations with other fields of science.

The robot is represented as an active agent that has to interact with its surroundings, decide, and plan in order to achieve a target goal. Thus, a robot vision system can be seen more in terms of relationship-oriented tasks, in which it has to cooperate with the other system modules and components provided on the robot platform. Processing data (i.e., image data) acquired by a moving robot platform adds further challenges to be addressed by computer vision approaches. Temporal challenges arise both from the approaches used in image acquisition (e.g., synchronized data and temporally correlated images) and from the changes of the scene over time (e.g., lighting in indoor and outdoor environments, and changes in object appearance and pose).

On the other hand, moving in the environment can be seen as an opportunity for robot vision and data acquisition, since obtaining more viewpoints of an object helps in better distinguishing its semantic properties. Moreover, it enables the decomposition of cluttered environments, the separation of the object from its background, and the handling of occlusions. The camera sensor in a robot vision system can be controlled; this enriches the captured information, provides multiple viewpoints of the scene, and helps to overcome the ambiguities that exist in the complex real world (e.g., occlusion, reflection, and sudden changes in the scene). In addition, image data can be fused with other measurements provided on the mobile platform for a better representation of the scene geometry and for improving the robot's capability to interact with its environment (e.g., object manipulation).

Image classification, object detection, and semantic image segmentation have been addressed through two types of computer vision approaches. The first consists of rule-based representation approaches [4, 5], which involve two main processing steps, feature extraction and optimization, and which have been widely used in machine vision and robot vision applications. The second is the end-to-end pipeline [5], with which the computer vision field has experienced a great paradigm shift after its successes across visual object recognition tasks (e.g., the deep Convolutional Neural Network (CNN)) [2]. This success has been achieved by leveraging large amounts of data and computation (i.e., Graphics Processing Units (GPUs)). Recently, the robot vision community has sought to import that advantage to improve robot understanding of the environment where the robot is deployed, and to obtain robust vision systems that handle complex real-world challenges.

The trade-off between accurate performance and computational load is one of the challenges raised by applying deep learning in robot vision [6]. Moreover, the development of an automatic vision system based on CNNs on limited-resource platforms (e.g., robots) is a complex task, due to a series of inherent challenges. Obtaining the huge amount of annotated data required to train a deep model with a robot is costly, especially when the data set has to capture the interaction with the environment and to cover all possible scenarios of the complex real world. The other challenge is the computational load of CNNs combined with the requirement of real-time processing on the robot (e.g., with no GPU available).

The computational solution of computer vision methods can lie at different points of a programming-data spectrum [7]. On one side are specific hard-coded solutions, developed with strong knowledge and understanding of the problem, where no data are required (e.g., mean computation and normalization operations) [7, 8]. On the other side of the spectrum are the data-driven approaches (i.e., deep neural networks (DNNs) and decision trees), which depend on the size of the training data set and have the flexibility to fit complex functions. Model-driven approaches lie in between: a better understanding of the problem structure is required, and explicit representations and rules are followed (e.g., the Histogram of Oriented Gradients (HOG) [9]). The programming-data spectrum is illustrated in Fig. 1.2. The figure shows an example of people detection [9] that has been addressed using sub-solutions from different points of the spectrum. On the other hand, the solution for object detection [10] using a CNN lies at the extreme data side of the spectrum.

Vision approaches based on deep CNNs can be considered generic learning approaches [7]. Such generic solutions are able to solve many visual recognition tasks, but the knowledge of the problem embedded in a deep neural network is still difficult to extract and far from being algorithmic [11]. Nevertheless, those approaches are able to provide robust solutions to computer vision problems. Thus, combining solutions from different points of the programming-data spectrum can satisfy both the requirement of strong knowledge and understanding of the problem and the capability of solving complex problems. Additionally, the solutions provided by generic approaches can be an opportunity for a better understanding of the problem structure and its weak points [7].

Moreover, the methodology of combining deep neural networks with the prior knowledge obtained by other task-related approaches can help in handling the drawbacks of both families: the computational load and the lack of the required data sets for CNN-based approaches, and the limited robustness and generalization of the feature extraction obtained by rule-based representations. Consequently, a human-like vision system becomes practical and appropriate for a robot platform.


Figure 1.2: Programming-Data computational spectrum of computer vision approaches.

1.1 Problem

The problem of building a vision system can be seen in the drawbacks of the rule-based representation approaches used by computer vision to solve the tasks, and in the limited capability of modern end-to-end learned approaches to handle the requirements and the challenges of the task.

A computer vision system based on rule-based representation aims to emulate the performance of the most successful vision system (i.e., the biological vision system). In fact, the way information is processed in order to find the correlation between the visual input and the predefined models is far from how the biological system operates. Moreover, the computational structures designed for computer technologies, and the use of biological functions in technical systems, make this emulation harder. However, the rule-based pipeline addresses this limitation through a range of representations [12] with a flexible order, which tries to bridge the gap between the high-level and low-level processing of the human vision system.

Given the richness and complexity of the visual world, computer vision approaches resort to mathematical fields (probability and statistics) and geometrical information to quantify the various states of the world and to distinguish between the possible solutions [12, 13]. Machine learning techniques are used to provide a high-level understanding of the extracted information: they try to provide the probabilities that relate the extracted features to some labels or categories. Developing a statistical model that outputs a high probability fitting the observations is a complex process. It highly depends on the capability of the feature extraction to provide appropriate information. Moreover, the process embodies different assumptions regarding the distribution of the data and the requirement of sufficient data. Consequently, much effort has been made by computer vision researchers to generalize such models to various situations [12], and further effort has gone into combining different feature extraction approaches, which in turn increases the computational cost [14, 15].


The advantages of the end-to-end approaches are that they emulate the way humans perceive and learn the visual world, and that they are capable of modeling complex data. The use of an end-to-end pipeline, such as deep learning, in robot vision faces challenges in three contexts: learning, embodiment, and reasoning [7]. A further context that can be added is the computational load.

In order to develop effective and robust solutions for computer vision applications, or for fields where computer vision methods are exploited, such as machine vision and robot vision, it is important to understand the problem structure, the task requirements, and the available resources. As proposed in [7], a general process for understanding the whole problem structure and its algorithmic information can be achieved by decomposing the general problem into sub-problems and recomposing their optimal solutions. Consequently, an automated solution can be obtained to address the various task challenges.

Object detection, image classification, and semantic image segmentation have been successfully addressed using data-driven computer vision approaches [16, 17, 18, 19]. Analyzing and processing data sets acquired by limited-resource mobile platforms (i.e., robot platforms) raises additional challenges and requirements:

• Real-time and on-line processing: low-cost approaches and highly accurate performance are always required.

• Temporal changes: lighting changes and dynamic scenarios.

• Captured object properties: deformation, articulation, position and orientation in real-world coordinates, and multiple scales.

• Robot setup: sensor placement, calibration, and localization.

• Platform resources: low-resolution sensors and limited computational power.

• Data sets: a large number of samples is required to learn the model (e.g., a CNN model).

The embodiment of a deep neural network as a stand-alone learned approach can provide generic solutions for computer vision and the applications that inherit from it [7]. In spite of the success of those approaches, they cannot satisfy the previous challenges and requirements on their own. The trade-off between accurate performance and computational cost remains an open problem related to the use of CNNs in robot vision systems and real-time applications. Moreover, solving image classification, object detection, and semantic image segmentation using programming and model-driven approaches has the main drawback of not being general and robust enough to handle the challenges present in the real world and the complexity of the data sets acquired in robot applications.

1.2 Motivation

Object detection, image classification, and semantic image segmentation on data sets obtained by mobile platforms such as the Pepper3, Nao4, BoniRob5, and DIAGO robots (see Fig. 1.3) require solutions that belong to different points of the programming-data spectrum, in addition to exploiting other data sources (i.e., Laser Range Finders (LRF) and depth field information).

3. https://www.softbankrobotics.com/emea/en/pepper
4. https://www.softbankrobotics.com/emea/en/nao
5. https://www.bosch-presse.de/pressportal/de/en/agricultural-robot-bonirob-in-action-72082.

Figure 1.3: Pepper, DIAGO, BoniRob, and Nao robots.

Using CNNs as black boxes or as a memorization tool is expensive because of the computational load. A winning strategy consists in integrating different approaches developed for a specific problem with the ones learned from data, in order to exploit their advantages whilst minimizing their drawbacks. In the work of this thesis, we want to show that the prior knowledge and understanding of the problem provided by programming and model-driven solutions can be used to address the limitations of using CNNs on limited-resource platforms. As an example, by combining a fast color-based segmentation approach for background removal with a CNN classifier to address traffic sign recognition in autonomous cars, it is possible to reduce the search space of the CNN and consequently its computational cost. Moreover, the prior knowledge provided by the color segmentation can reduce the CNN's requirement for a huge data set that copes with all possible dynamic scenarios.
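To make the idea concrete, the sketch below shows how a cheap color-based segmentation can restrict the image regions on which a CNN classifier is run. It is a minimal illustration, not the exact pipeline of Chapter 3: the HSV thresholds, the file name, and the classify_roi function (standing in for any trained CNN classifier) are illustrative placeholders.

```python
import cv2
import numpy as np

def extract_color_rois(bgr_image, min_area=400):
    """Extract candidate regions whose color matches the target model (here: red traffic signs)."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Red wraps around the hue axis, so two ranges are combined (thresholds are illustrative).
    mask = cv2.inRange(hsv, (0, 70, 50), (10, 255, 255)) | cv2.inRange(hsv, (170, 70, 50), (180, 255, 255))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    rois = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h >= min_area:  # discard tiny blobs caused by noise
            rois.append((x, y, w, h))
    return rois

def detect_signs(bgr_image, classify_roi):
    """Run the (comparatively expensive) CNN classifier only on the color-based candidates."""
    detections = []
    for (x, y, w, h) in extract_color_rois(bgr_image):
        crop = bgr_image[y:y + h, x:x + w]
        label, score = classify_roi(crop)   # hypothetical wrapper around a trained CNN
        if label != "background":
            detections.append((label, score, (x, y, w, h)))
    return detections
```

The point of the design is that the color stage discards most of the image at negligible cost, so the CNN only validates a handful of candidate windows instead of scanning the whole frame.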

This thesis focuses on the use of CNNs to handle the challenges present in image classification, object detection, and semantic image segmentation on data sets acquired by moving platforms with limited resources. This work aims to utilize deep learning techniques on data sets that have been collected in very challenging real-world scenarios by moving and fixed-installation cameras. In addition, the work addresses the integration of different vision approaches, model-driven and data-driven ones, in order to create robust solutions that aid in transferring the revolution witnessed in computer vision approaches, represented by CNNs, into robot vision. Moreover, the integration handles the use of other data sources (e.g., LRF) that provide prior spatial information, in order to address the computational cost of CNNs and to improve the performance of approaches that use each source of data separately.


Figure 1.4: Proposed pipeline from which the solutions for object detection, image classification, and semantic image segmentation are adapted.

1.3 Outline

In this thesis, image classification, object detection, and semantic image segmentation solutions adapted from the pipeline proposed in Fig. 1.4 for automatic vision systems are presented, providing descriptions of the approaches that have been proposed in the literature and introducing the data sets used for the experimental evaluation. Moreover, a series of novel solutions related to data sets acquired by mobile robot platforms and to real-time applications is detailed. The manuscript is organized as follows. Background and related work are presented in Chapter 2, which describes the two pipelines used by vision systems. The rule-based representation is detailed in Section 2.1, which describes the rule-based representation approaches as feature extraction and optimization steps; the approaches in the rule-based representation are divided into stages based on their complexity and the information they provide. The second pipeline, the end-to-end pipeline used in the last decade, is presented in Section 2.2. The end-to-end pipeline represents the learning of image information, where all the stages present in the range of rule-based representation are integrated into one system that learns the low-, mid-, and high-level features from labeled data.

An evolution history of the Artificial Neural Network (ANN), as an example of the end-to-end pipeline, is presented in Subsection 2.2.1. The math behind the basic operations used in neural networks and the optimization approaches for solving a deep CNN model, in particular the ones used in this thesis, are described in Subsection 2.2.2. The software sources, including the deep learning frameworks and publicly available data sets, are presented in Subsection 2.2.4. Subsection 2.2.3 provides the recent related work and the computer vision approaches that handle image classification, object detection, and semantic image segmentation based on deep CNNs. The evaluation metrics for classification, detection, and semantic segmentation performance used in this work are detailed in Section 2.3.

Chapter 3 is organized to reflect the difficulty of image classification, object detection, and image segmentation from a computer vision perspective [20] (see Fig. 1.5). The chapter aims to describe the work developed in this thesis, in addition to the performance evaluation of the proposed solutions. In Section 3.1 we describe the use of a CNN as an off-the-shelf classifier to address sub-category image classification in the maritime environment, and we present a reduced CNN architecture to improve the performance on the fine-grained classification of a challenging vessel data set.

Figure 1.5: The difficulty order of computer vision tasks (images are adapted from [20]).

Specifying a tight bounding box around the main object in an image is what we call localization. Object detection extends the localization concept to handle multiple objects of multiple sizes present in an image. Section 3.2 proposes novel methods for robust and fast detection of specific objects in data sets acquired by moving mobile platforms, which can be adapted to other objects. A modular architecture combining color-based regions of interest (ROIs) with HOG and a CNN-based classifier for road sign detection and recognition is presented in Subsection 3.2.1. The environmental lighting challenges and the computational load of the CNN are tackled in the approach developed for the Nao robot in Subsection 3.2.2. The pedestrian detection performance based on two different sensors on the DIAGO platform (RGB camera and LRF) is improved through the integration approach described in Subsection 3.2.3.

Image segmentation goes beyond labeling pixels in an image to finding the contour area of an object. In Section 3.3, we address pixel-wise semantic image segmentation based on fully convolutional networks. The proposed solutions handle image data from both moving and fixed camera sensors in two challenging applications: skin lesion segmentation in Subsection 3.3.1 and crop-weed classification in Subsection 3.3.2, where the robust solutions provided by data-driven approaches face a lack of available data, and where generalization of the proposed solution is highly required. Finally, the conclusions and future directions are presented in Chapter 4.

1.4 Contributions

The main contribution of this thesis is the definition of robust solutions that address image classification, object detection, and semantic image segmentation on challenging data sets, including those acquired by robot platforms. For this purpose, we follow a general modular pipeline (Fig. 1.4) that aims to handle the previous tasks based on the combination of sub-solutions that lie at different points of the programming-data spectrum.

The pipeline is designed to be highly modular: the processing stages are not only sequential, the processed information is loosely ordered [12], and various approaches can be used in each processing unit based on the application requirements. The combination of the different processing stages in the pipeline takes into consideration the computational cost and limited resources of robotic mobile platforms, and handles the application of deep neural networks to challenging data sets. The provided solutions include:

• Deep CNN for boat sub-category classification [21, 22]: transfer learning is used to fine-tune a pretrained deep CNN model on a challenging data set acquired by a boat traffic monitoring system (ARGOS). Moreover, a deep CNN that has no downsampling operations is explored to quantify the gain obtained by learning only the convolutional features.

• A novel approach for speeding up road sign detection and recognition [23], which combines color-based ROI extraction with HOG detection and deep CNN classifiers. Concretely, it improves the detection and recognition performance of the model-driven approach, and it uses the spatial prior knowledge provided by the fast color segmentation to enable the use of a deep CNN on limited computational resources.

• A novel approach that addresses the use of CNNs on robot platforms [24], which combines fast color-based ROI extraction with a CNN validation step. Concretely, it provides robust detection performance to tackle unconstrained and unstructured real-world environments.

• Unsupervised color-based image segmentation [23, 24] that utilizes color information for ROI extraction and for localizing the target object in the image. The proposed solutions use a low-cost computer vision approach based on a predefined color model of the target object.

• A dynamic white balance and exposure regularization procedure [24] to handle lighting changes and to improve the color-based segmentation process.

• A fast multi-sensor approach for people detection on a mobile robot [25]: the work exploits the presence of another source of measurements on the robot platform (i.e., LRF). The proposed approach aims to improve the laser-based pedestrian detection performance and to enable the application of a deep CNN on the limited computational hardware of robot platforms.

• Fully convolutional networks for pixel-wise semantic image segmentation [26, 27]: the work demonstrates the segmentation performance and generalization of pixel-wise labeling based on deep convolutional auto-encoder-decoder neural networks. The developed approaches have been tested on two challenging data sets: medical images for skin lesion segmentation, and agricultural images acquired by a farming robot for crop/weed classification. The work on agricultural images deploys CNNs in two steps to overcome the lack of a data set.

• Annotated data sets have been created and made publicly available [23, 24].

The above contributions are described in detail in Chapter 3, where the sections are organized according to the difficulty order illustrated in Fig. 1.5. The references to the related papers published in international conferences and workshops are provided in Section 1.5.

1.5 Publications

Publications related to the contents of this thesis are listed in the following:

[1] Vision Based Vessel Classification for Green VTS. M. Fiorini, A. Youssef, D. Albani, D.D. Bloisi. In the 13th International VTS Symposium, 2016.

[2] Fine-grained Boat Classification using Convolutional Neural Networks. M. Fiorini, A. Youssef, D.D. Bloisi, A. Pennisi. In the International Journal of e-Navigation and Maritime Economy, 2017.

[3] Fast Traffic Sign Recognition Using Color Segmentation and Deep Convolutional Networks. A. Youssef, D. Albani, D. Nardi, D.D. Bloisi. In the International Conference on Advanced Concepts for Intelligent Vision Systems, 2016.

[4] A Deep Learning Approach for Object Recognition With NAO Soccer Robots. D. Albani, A. Youssef, V. Suriani, D. Nardi, D.D. Bloisi. In the 20th RoboCup International Symposium, 2016.

[5] Deep Convolutional Pixel-wise Labeling for Skin Lesion Image Segmentation. A. Youssef, D.D. Bloisi, M. Muscio, A. Pennisi, D. Nardi, A. Facchiano. In the International Symposium on Medical Measurements and Applications (MeMeA), 2018.

[6] Fast People Detection for Mobile Robot based on Multiple Sensors and Convolutional Neural Networks. A. Youssef, D. Nardi. In the Workshop on Robotic Co-workers 4.0: Human Safety and Comfort in Human-Robot Interactive Social Environments, IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018.

[7] Crop and Weeds Classification for Precision Agriculture Using Context-Independent Pixel-Wise Segmentation. M. Fawakherji, A. Youssef, D.D. Bloisi, A. Pretto, D. Nardi. In the 3rd IEEE International Conference on Robotic Computing (IRC), 2019.

1.6 Collaborations

The work presented in this thesis has been carried out in collaboration with other people and researchers. The fine-grained boat classification approach for the maritime environment has been developed together with Dr. Michele Fiorini, engineering manager at Leonardo, Rome, and Dr. Andrea Pennisi at Vrije Universiteit Brussel. The segmentation approach, the fast traffic sign recognition, and the DITS data set have been developed together with Dario Albani at the Department of Computer, Control, and Management Engineering (Dipartimento di Informatica, Automatica e Gestionale (DIAG)), Sapienza University of Rome. The development of the object recognition with the NAO soccer robot has been carried out in collaboration with Vincenzo Suriani and Dario Albani at DIAG, Sapienza University of Rome. The SPQR NAO data set has been collected during my participation in RoboCup 2016 in Leipzig as a member of the SPQR Robot Team. People detection based on multiple sensors has been developed in collaboration with Dr. Maria Teresa Lazaro (as a member of the SPQReL Team for service robot competitions) at DIAG, Sapienza University of Rome. The pixel-wise semantic skin lesion segmentation approach has been carried out in collaboration with Antonio Facchiano, molecular oncologist in the melanoma field at IDI-IRCCS, Rome. The crop-weed classification approach has been developed in collaboration with Dr. Alberto Pretto at DIAG, Sapienza University of Rome.

Chapter 2

Background and Related Work

In this chapter we give an overview of the common computer vision techniques that are used in vision systems. We provide a description of the two pipelines used to address image classification, object detection and recognition, and semantic image segmentation. The rule-based representation is introduced in Section 2.1, where we provide a description of the different range representations and more details on the approaches used in the work of this thesis. The second pipeline, called end-to-end, is the most used one in the last decade [6, 7]. Section 2.2 introduces the techniques used in the end-to-end approaches (represented by CNNs) to address the computer vision tasks. An overview of the evolution history of the ANN, which led to the current achievements of CNNs, is given in Subsection 2.2.1. Then, a mathematical background, including the basic operations of CNNs and the optimization methods, is introduced in Subsection 2.2.2; in this subsection, we also provide a description of the CNN architectures used in the work of this thesis, in particular for image classification and pixel-wise semantic segmentation. In addition, the recent related work that addresses the previous vision tasks is introduced in Subsection 2.2.3. Moreover, we mention the common software sources for deep learning, represented by frameworks and publicly available data for training and testing CNNs on image classification, object detection, and semantic image segmentation tasks. The common metrics used to evaluate computer vision approaches on classification, detection, and segmentation are introduced in Section 2.3.

2.1 Rule-Based Representation

Computer vision aims to emulate the human vision system in solving image classification, object detection and recognition, and semantic segmentation problems based on captured digital images. The rule-based representation used by computer vision approaches can be seen as two processing steps [28]: feature extraction and optimization (see Fig. 2.1). Feature extraction here aims to construct a higher-level understanding of the information in raw images. In other words, it tries to provide distinguishing properties of the captured objects (e.g., shape and color). Recently, those features have become known as hand-designed or feature-engineered patterns, such as HOG [9], the Scale-Invariant Feature Transform (SIFT) [29], and GIST [30].

Figure 2.1: The range of rule-based representation.

In most cases, the optimization process of this pipeline corresponds to the machine learning technique used to solve the task; in particular, it aims to find a classification model of the extracted features. Some of the techniques used in computer vision are Random Forests [31], Logistic Regression [32], and the Support Vector Machine (SVM) [33]. For instance, the HOG feature descriptor is used along with a binary SVM classifier to address pedestrian detection [9].
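As a concrete instance of this feature-extraction plus optimization scheme, the sketch below uses OpenCV's HOG descriptor together with its bundled pre-trained linear SVM for pedestrians, in the spirit of [9]. It is not the detector developed in this thesis; the input file name and the confidence threshold are illustrative.

```python
import cv2

# OpenCV ships a HOG descriptor together with a pre-trained linear SVM for pedestrians.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("frame.png")  # illustrative input image
# Sliding-window detection over an image pyramid; returns bounding boxes and SVM scores.
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8), padding=(8, 8), scale=1.05)

for (x, y, w, h), score in zip(boxes, weights):
    if float(score) > 0.5:  # illustrative confidence threshold
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```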

More in detail, the rule-based representation can be seen as a composition of three levels of processing: low-level, mid-level, and high-level processes. In the following, we go through the three levels of processing, providing more details on the techniques and approaches used at each level to address specific computer vision applications.

Low-level processing comprises the important early processing capabilities required by vision systems to represent the image and to start the process of image understanding and analysis [34]. Extracting visual information from an array of numbers is difficult at the human level. The processing at this stage aims to transform the image into another spatial [35, 34], frequency [3], or color [36] space, which is useful by itself or for higher levels of processing.

The efficiency of low-level processing can be seen in enhancing the image representation [37] in order to improve the description of the extracted information without defining any particular relations between adjacent pixels. Thus, it improves the perception of the object and its discrimination from the background [38, 39], and it enhances the derivation of spatially coherent clusters [36, 39]. Pixel-level and window-level (convolution with a weighted kernel) operations are used in this early processing. Mean computation and normalization [40], histograms [39], and edge detection [38] are used on scalar gray-level images to enhance image quality and to improve the description of the extracted information.
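A minimal sketch of such low-level operations (normalization, edge detection with a weighted kernel, and a gray-level histogram), assuming OpenCV and an illustrative input image:

```python
import cv2
import numpy as np

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # illustrative input image

# Pixel-level operation: zero-mean, unit-variance normalization of the intensities.
g = gray.astype(np.float32)
normalized = (g - g.mean()) / (g.std() + 1e-8)

# Window-level operations: convolution with weighted kernels (Sobel) to obtain an edge map.
gx = cv2.Sobel(g, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(g, cv2.CV_32F, 0, 1, ksize=3)
edge_magnitude = np.sqrt(gx ** 2 + gy ** 2)

# Gray-level histogram, often used for contrast analysis or equalization.
hist = cv2.calcHist([gray], [0], None, [256], [0, 256])
```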

In the frequency domain, the 2D Fourier transform can be used to define the spatial-to-frequency domain transformation and its inverse [34]. Low-pass, high-pass, and band-pass filters are used to improve image visualization and edge detection, and to generate or eliminate sharpness, blur, and contrast [41, 34]. In the context of colored images, vector-valued images (e.g., Red-Green-Blue (RGB) images) are considered. The vector-valued components of color images are treated as signals correlated with each other (although they could be handled independently), with the image formation (lighting, reflection, and shade) [39, 42], with the sensor design (exposure and shutter speed, noise, sensor chip, gamma and white balancing) [43, 44, 39], and with the human sensing of color. Color information is an important component for visualizing image data and for separating an object from the background or from its adjacent objects (i.e., image segmentation).

Color information is inhomogeneous (e.g., due to the effect of lighting on color perception) [45, 46], but color-to-color space transformations can be applied in the color image domain to handle the lighting challenges and to find a better representation of the color information (e.g., the Hue-Saturation-Value (HSV) color space) [36, 39].

The importance of color information processing can be seen in applications such as automated crop imaging [46], where vegetation indices are defined based on chromatic color information for a better visualization of the green color and a better separation of the crop from its background. Moreover, those indices are used in higher-level processes to achieve vegetation segmentation [47, 46] and to provide prior knowledge [48].

In particular, the visible light, represented by the RGB image, is used to compute the vegetation indices. The values of the captured light, which lie in the range [0, 255], are first normalized in order to reduce the effect of lighting changes. Then, the chromatic indices r, g, and b are calculated as:

r = \frac{R_n}{R_n + G_n + B_n}, \quad g = \frac{G_n}{R_n + G_n + B_n}, \quad b = \frac{B_n}{R_n + G_n + B_n}   (2.1)

where R_n is the normalized value of the red channel at pixel i of the image I, computed as R_n = I_R / 255, G_n is the normalized value of the green channel at pixel i, computed as G_n = I_G / 255, and B_n is the normalized value of the blue channel at pixel i, computed as B_n = I_B / 255.

Some of the common vegetation indices that use the color information and the chromatic indices are the Excess Green index (ExG) [47], the Normalized Difference Vegetation Index (NDVI) [49], and the Color Index of Vegetation Extraction (CIVE) [45]. Those indices can be computed at each pixel location i of the image I, based on the normalized values defined in Eq. 2.1, as:

ExG_i = 2g - r - b, \quad NDVI_i = \frac{g - r}{g + r}, \quad CIVE_i = 0.441\,r - 0.811\,g + 0.385\,b + 18.787   (2.2)
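As an illustration, the following sketch computes the chromatic indices of Eq. 2.1 and the index maps of Eq. 2.2 per pixel from an 8-bit BGR image loaded with OpenCV, and derives a simple vegetation mask by thresholding ExG; the file name and the Otsu thresholding step are illustrative choices, not the method used later in the thesis.

```python
import cv2
import numpy as np

bgr = cv2.imread("field.png").astype(np.float32)  # illustrative crop image
B_n, G_n, R_n = bgr[..., 0] / 255.0, bgr[..., 1] / 255.0, bgr[..., 2] / 255.0

# Chromatic indices (Eq. 2.1); a small epsilon avoids division by zero on black pixels.
total = R_n + G_n + B_n + 1e-8
r, g, b = R_n / total, G_n / total, B_n / total

# Vegetation indices (Eq. 2.2), computed per pixel.
exg = 2.0 * g - r - b
ndvi = (g - r) / (g + r + 1e-8)
cive = 0.441 * r - 0.811 * g + 0.385 * b + 18.787

# A simple vegetation mask can then be obtained by thresholding, e.g. with Otsu's method.
exg_8bit = cv2.normalize(exg, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
_, vegetation_mask = cv2.threshold(exg_8bit, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```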

Color-information-based approaches provide prior information and understanding of the problem (e.g., how to perceive and process the green color) at a low computational cost. On the other hand, the limited generalization of those approaches to various real-world scenarios (e.g., vegetation segmentation under different field conditions) can be seen as their main drawback [48].


Mid-level processing is concerned with bridging the gap between the image representation and predefined models of the objects and the world [12]. In other words, it is a range of representations that explains the image information and describes and abstracts various image features (e.g., geometrical features) in order to connect the visual input with a final decision or interpretation [50, 12]. Mid-level processing involves characterizing the components and the adjacency relations of image features rather than applying image-to-image transformations [9]. Connectedness, contours, and shape are extracted from binary images for further use in object recognition [51, 52].

Macroscopic features, such as circular arcs and lines [53, 52], and regions with homogeneous color are defined in order to be used for deriving properties in the scene understanding and image analysis context [54, 50, 36]. Non-linear morphological operations [55], such as erosion and dilation, are applied to binary and gray-level images to remove distortions caused by noise or by the absence of light components. The extraction of scale-invariant and geometric-transform-invariant keypoints and their combination into one descriptor (e.g., SIFT and HOG) are other common processes at this level. Multiple views and data fusion are also addressed by mid-level approaches in order to improve the feature representation, to overcome problems such as overlapping and occlusion, and to increase the field of view [56, 57, 58, 59].
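A short sketch of typical mid-level operations (morphological cleaning of a binary mask, contour extraction, and SIFT keypoint extraction), assuming OpenCV 4.4 or later where SIFT is part of the main module; file names are illustrative:

```python
import cv2
import numpy as np

binary = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)   # illustrative binary image
kernel = np.ones((3, 3), np.uint8)

# Morphological cleaning: erosion removes isolated noise pixels, dilation restores object size.
cleaned = cv2.dilate(cv2.erode(binary, kernel, iterations=1), kernel, iterations=1)

# Contours of the remaining blobs, typically passed on to recognition stages.
contours, _ = cv2.findContours(cleaned, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Scale- and rotation-invariant keypoints plus descriptors (SIFT).
gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
```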

The integration of different mid-level approaches is widely used in robotic applications, where a wealth of information is obtained by the calibrated sensors mounted on the robot (e.g., LRF and RGB camera [60], or RGB with depth field [61, 56]). For example, pedestrian detection on a robot platform has been addressed by fusing information from multiple sensors (e.g., LRF-based leg detection [62] and upper-body depth-based detection [63]). Moreover, the calibrated sensors provide intrinsic and extrinsic information that facilitates the data fusion and the computation of real-world coordinates. For example, the height of a target object (e.g., a pedestrian) can be calculated in image coordinates [60], based on the fused information provided by the LRF and camera sensors, as:

Height = \frac{Height_{RealWorld} \times Height_{CameraFrame} \times FocalLength}{Height_{Camera} \times Distance_{LRF}}   (2.3)

where Height is the height in pixels of the target object in the camera frame, Height_{RealWorld} is the real-world physical measurement of the target object in meters (e.g., the height of a person), FocalLength is the intrinsic parameter related to the lens of the camera, Height_{Camera} is an extrinsic parameter representing the height of the camera sensor with respect to the LRF sensor, and Distance_{LRF} is the distance of the target object centroid in the LRF frame.
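A minimal sketch of Eq. 2.3 as written above, assuming all calibration quantities are already known; every numeric value below is an illustrative placeholder, not a parameter of the platforms used in this thesis:

```python
def expected_pixel_height(height_real_world_m, height_camera_frame_px,
                          focal_length_px, height_camera_m, distance_lrf_m):
    """Expected pixel height of a target whose centroid the LRF places at distance_lrf_m (Eq. 2.3)."""
    return (height_real_world_m * height_camera_frame_px * focal_length_px) / \
           (height_camera_m * distance_lrf_m)

# Example: a 1.75 m tall pedestrian detected by the laser-based leg detector at 4.0 m.
h_px = expected_pixel_height(height_real_world_m=1.75, height_camera_frame_px=480,
                             focal_length_px=525.0, height_camera_m=1.2, distance_lrf_m=4.0)
# h_px can then be used to place a region of interest in the image around the laser detection.
```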

High-level processing aims to map the attributes obtained from the previous processing into recognition-level outputs (e.g., object detection [64] and classification [65]), the verification of specific assumptions (e.g., inspection of the object shape in industrial applications [42]), the estimation of pattern parameters (e.g., 3D position [66, 67]), and object tracking [68, 56]. This processing gives a high-level interpretation of the image that is understandable at the human level. It answers the questions of what objects a human can see in an image [69] (i.e., image classification [65]), which objects are present and where they are located in an image (object detection [64]), and which objects are present and which pixels belong to their contours (semantic image segmentation [70, 71]).


The rule-based representation approaches are widely used in machine vision and robot vision systems. The low-level approaches have a low computational cost, which makes them suitable for limited-resource platforms that require real-time processing. In machine vision applications, many assumptions and predefined conditions are required to apply such approaches efficiently. For example, visual inspection and monitoring require the predefinition of the lighting conditions and of the sensor installation [3]. This can be impractical in robot vision, where the real-world challenges require huge efforts to be predefined. For example, a robot vision system for the NAO cannot handle the challenges of the RoboCup competitions1 using only low-level approaches, since indoor and outdoor environments, objects of varying color and size, and cluttered scenes are present.

Mid-level approaches can provide robustness in many applications, but they are costly, which limits their use in real-time applications (e.g., HOG requires GPU processing [56]). Moreover, the success achieved by CNNs in many competitions on computer vision tasks has drawn the attention of researchers in the last decade to investigate the feature extraction capability of their layers [72, 73, 74, 75].

High-level approaches can vary based on the problem complexity, for instance using an SVM with a linear or non-linear kernel [76, 77], cascade detectors for accurate and fast detection [78], or tracking people with multiple sensors [56], but the improvement is restricted by the extracted features. The extracted features cannot always be adapted or generalized to handle the more complex situations that can be faced in real-life scenarios [79, 80, 50]. Consequently, the optimization process is unable to improve the performance when the feature extraction lacks the required representation.

2.2 End-to-End Pipeline

End-to-end processing aims to obtain the output of the vision system (e.g., a classification) directly from the raw input features through the learning of a mapping function. This can be achieved taking into consideration the requirement of large amounts of labeled data for optimal performance. Thus, we locate this pipeline at the extreme data side of the programming-data spectrum.

A deep CNN is an end-to-end process (see Fig. 2.2), in which the high-level interpretation is hierarchically learned through several layers that try to bridge the gap between the low-level and high-level representations of the image. CNNs handle this gap by using trainable feature extraction and by learning the internal representations between layers [81]. In other words, they substitute the effort of designing feature extraction with trainable feature transformations: the low-level layers carry features that are general across various categories (e.g., edges and colors), more invariant and global features are acquired at higher layers (e.g., shapes and patterns), and the prediction or target class is obtained by a trainable classifier or a prediction layer [82, 81].

Computer vision through rule-based representation faces difficulties in reinventing the basic unconscious capabilities of the biological vision system, which are taken for granted (e.g., color and lighting perception) [12]; moreover, the pipeline is different from how the human vision system explores the world (e.g., by learning to recognize objects and scenes). On the other hand, emulating the biological vision system can be seen as closer to the philosophy behind the ANN [7], where the feature representation is learned from labeled data. However, integrating all the processing levels in one system is computationally costly and requires large amounts of labeled data, which can be seen as a limitation when deploying deep learning on limited-resource platforms and in real-time applications.

Figure 2.2: The end-to-end pipeline.

Recently, deep learning, in particular deep CNNs, has dominated the competitions on image classification, object detection, and semantic image segmentation [83, 84, 85, 86, 87]. In the following we explore this end-to-end technique in more detail: its evolution history, the main operations and functions used in it, the current related work, and the available software sources that have been successfully used in different application domains.

2.2.1 Evolution-History of ANN

The state of the art in various traditional artificial intelligence (AI) tasks, such as object detection in computer vision [88, 89] and speech recognition [90, 91], has considerably improved thanks to deep learning. Moreover, deep models have been extended to modern AI domains such as face recognition [92], tumor identification [93], and autonomous driving [94]. One of the main reasons behind the success of ANNs is their biological relation with the neural networks of the human brain and their deep architecture. In the following, a brief review of the growth of ANNs and deep learning models is introduced, starting from the simple ideas that constructed the earlier models up to the current forms of deep models that dominate recent research.

The first piece of the biological inspiration was introduced in 1943 as a mathematical model of a neural network [95]. In 1950, Alan Turing declared that "what we want is a machine that can learn from experience" [96, 97], and after that, in 1952, the first program to make a machine learn was created by Arthur Samuel, who defined the term machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed" [98]. Then, in 1958, Frank Rosenblatt came up with an impressive electronic device called the perceptron, based on biological principles [99], and presented the first shallow neural network. The perceptron had a learning procedure that enabled it to converge to the right solution in letter and number recognition. The work of Rosenblatt was more hardware-oriented than algorithmic, and the fact that the perceptron was an implementation of a linear model led to failure in learning nonlinear functions. Neural networks then entered the first AI winter, and little attention was given to Rosenblatt's attempt. Moreover, the training procedure did not guarantee the achievement of the optimal results obtained in theory.

In 1965, Ivakhnenko took the first step toward the modern deep learning model [100]. He developed an algorithm that was able to learn the extraction of the best features at each layer of the neural network. The Group Method of Data Handling (GMDH), which addressed the problem of multidimensional model optimization, was applied by Ivakhnenko to neural networks, which in turn supported his first attempt to create a deep neural network in 1971. The first CNN, inspired by the organization of the visual cortex in animals, was introduced by Kunihiko Fukushima in 1980 [101]. He developed the Neocognitron, a stack of 2D neuron layers that recognizes visual patterns. Each layer was designed for a specific functionality (i.e., feature extraction, pooling, and averaging), and the weights were set through response expectation instead of an automatic training routine. The use of backpropagation in neural networks to compute the error in the training procedure was applied in 1985 by Geoffrey Hinton [102]. Backpropagation overcame the disadvantage of the perceptron represented by the XOR problem: it provided distributed representations of the error, consequently making it possible to add more layers and facilitating the training procedure. The possibility of increasing the number of layers in the network, adding a finite number of neurons per layer, using activation functions, and feeding forward through a multi-layer structure extended the capability of neural networks to learn nonlinear functions, and proved that ANNs can be universal approximators [103, 104].

Moreover, the success of neural networks was extended to various tasks (e.g., shape recognition and handwritten digit recognition [105]). In 1989, Yann LeCun demonstrated the practical application of backpropagation by combining it with a convolutional neural network [106]. Despite the second AI winter and the presence of the support vector machine [107], which limited the progression of ANNs on larger problems, Geoffrey Hinton continued his research on how the human brain could work. He stepped deeper into neural networks by providing an unsupervised learning approach for pretraining each new additional layer, using the final result to initialize the parameters of the whole network [108, 109]. This approach was the basic step that turned neural networks into deep neural networks, and proved that deep network structures move beyond shallow neural networks. A combination of stochastic gradient descent and backpropagation was applied to character recognition [110]. Since 1999, powerful computers and GPUs have been developed, and consequently data processing has become faster, which in turn has enabled ANNs to prove their advantages when more training data are involved: "Data drives learning" (Fei-Fei Li).

In 2009, the Stanford vision lab released ImageNet [111], a free source of more than 14 million labeled images. The significant increase in GPU speed and power, together with the freely available labeled databases, enabled the direct training of deep neural networks instead of pretraining each layer. Moreover, deeper neural networks started to win international competitions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [19].

It is worth noticing that, beyond the leverage of large data sets and powerful computers, deep learning went through evolutionary attempts, milestones, failures, and successes before its dominance over traditional machine learning was proved in different research fields [7].

2.2.2 Mathematical Background

Relying on representing real-world knowledge and making decisions only through hard-coded approaches makes the task of an AI system very complex. The capability of an AI system to extract knowledge from raw data is known as machine learning [81]. The data representation has a great influence on the performance of machine learning methods. Representation learning aims to use machine learning techniques in order to map the data to the desired outputs and to discover the representation itself [81, 112].

On the other hand, learning the feature extraction from the visual input faces the problem of variation in the observed data (e.g., objects at multiple scales and lighting changes in image data). CNNs aim to address this variation by learning features in terms of other features: higher-level representations and complex concepts (e.g., a pedestrian) are defined by the combination of simpler feature representations (e.g., contours), which in turn are obtained in terms of color information and edges. As can be seen, deep CNNs decompose the learning of a complex concept into simpler ones. This can be described as forming a mathematical function by composing simple functions [81].

In spite of the presence of high-level frameworks and libraries that provide good implementations of various deep neural networks, understanding the math behind a neural network helps in improving its performance on various tasks, by selecting and developing the best network architecture and optimization approaches and by tuning the hyperparameters. In the following, we go through some basic concepts and the math behind them for a better understanding of deep CNNs.

Multilayer Perceptrons. One of the fundamental concepts used in deep learning models is the multilayer perceptron (MLP), also called a deep feedforward neural network [81]. The multilayer perceptron can be seen as a mathematical block that aims to approximate a function f. In the classification of an input sample x into a predefined label y, this function can be defined as:

y = f(x)   (2.4)

and in the context of a feedforward neural network, the approximation of the function f is defined as:

y = f(x, θ)   (2.5)

where θ is a learned parameter that aims to find the best approximation of f. To this end, we have touched on three main concepts: neural, network, and deep. The concept of network is used to represent the acyclic graph composition of the functions that forms the mapping of the input into the desired output value. The mapping function f is decomposed into other functions f^{(1)}, f^{(2)}, ..., f^{(n)}, which construct a connected chain. This chain can be mathematically expressed (for n = 4) as:

math-ematically expressed (for n = 4) as:

f(x) = f^{(4)}(f^{(3)}(f^{(2)}(f^{(1)}(x))))   (2.6)

The communication between these functions aims to enable the computer to learn the relations in the data in a way inspired by neuroscience. The basic component of this communication is the neuron. Each decomposed function represents a layer, and each layer consists of multiple neurons. The number of neurons in a layer specifies its width, while the number of layers in the architecture defines the depth of the model. The last layer of the network is the output layer, in which the predicted label ŷ is assigned to the input sample. The core component of a neural network is the neuron. Each neuron takes a set of n values as its input, and it outputs a prediction ŷ based on a set of parameters called the weights w and the bias b. Each neuron unit updates its parameters (w and b) through the learning procedure in order to find the best prediction ŷ. So, the architecture of a neural network can be seen as a finite set of layers, where each layer has a finite number of neurons.

Inspired by the biological analogy, the layer-to-layer relation is represented as a vector-to-vector mapping, while the layer-to-neuron relation is represented as a vector-to-scalar mapping [81]. The mapping that produces the output prediction z for each neuron i in layer l, which accepts a vector of input samples X = {x_1, x_2, ..., x_n}, is:

$$z_i^l = W \cdot X + b_i^l = \sum_{j=1}^{n} w_j^T \cdot x_j + b_i^l \tag{2.7}$$
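To make the computation in Eq. 2.7 concrete, the following minimal NumPy sketch computes the pre-activation values of one fully connected layer; the array shapes, names, and toy values are illustrative assumptions and do not come from the software developed in this thesis.

```python
import numpy as np

def dense_pre_activation(X, W, b):
    """Pre-activation of a fully connected layer (Eq. 2.7).

    X : input vector, shape (n_inputs,)
    W : weight matrix, shape (n_neurons, n_inputs)
    b : bias vector, shape (n_neurons,)
    Returns z, shape (n_neurons,), one value per neuron of the layer.
    """
    return W @ X + b

# Toy usage: a layer with 3 neurons receiving 4 inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=4)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
z = dense_pre_activation(X, W, b)
print(z.shape)  # (3,)
```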

Activation Function Multilayer perceptrons have been developed to handle real and complicated problems, which are typically represented by non-linear and high dimensional data. The sum of products in Eq. 2.7 is a linear function that maps the input vector X into a predicted output value z. The limitation of the linear transformation is addressed by applying a non-linear transformation to the input of each neuron, called activation function A(·).

In other words, the linear transformation W^T · X + b remains the core of the neural network computation, but instead of being applied directly to the input of each neuron, it is applied to the non-linear transformation A(·) of that input (i.e., the output of the neurons in the previous layer). This can be written as:

$$z_i^l = W_i^T \cdot A^{l-1} + b_i^l \quad \text{and} \quad A^l = a^l(z_i^l) \tag{2.8}$$

A neural network with no activation function is a simple linear approximation (a polynomial function of degree one). Handling complex data, such as raw images, requires non-linear hidden layers in between (i.e., the non-linearity of the activation function). With non-linearities, the neural network can be considered a universal function approximator that can flexibly learn and represent complex functions.

A simple example that proves the need for non-linearity is solving the XOR operation with a neural network [81]. The mapping function can be represented as XOR = f^{(2)}(f^{(1)}(X)). The hidden layer accepts the input vector X, each neuron applies the transformation defined in Eq. 2.7, and then the non-linear element-wise transformation A(z) is applied on top of the linear one, as illustrated in the sketch below.
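A minimal sketch of such an XOR network follows, using hand-picked parameters of the kind reported in [81] for a two-layer network with ReLU hidden units (the values and variable names are illustrative, not learned):

```python
import numpy as np

# Hand-picked parameters for a 2-2-1 network solving XOR.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # hidden layer weights
b1 = np.array([0.0, -1.0])    # hidden layer biases
w2 = np.array([1.0, -2.0])    # output layer weights

def xor_net(x):
    h = np.maximum(0.0, W1 @ x + b1)  # linear map (Eq. 2.7) followed by ReLU
    return w2 @ h                      # linear output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # prints 0, 1, 1, 0
```

Without the ReLU in the hidden layer, the two linear maps collapse into a single linear function, which cannot represent XOR.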

As can be seen, applying the activation function gives the neural network the ability to represent complex mappings and the power to handle more complicated input data. Many types of activation functions exist. Besides being non-linear, the activation function used in a neural network should be differentiable, in order to enable the computation of its gradient.


Figure 2.3: Activation functions and their derivatives: (a) linear, (b) threshold, (c) tanh, (d) sigmoid, (e) ReLU, (f) LeakyReLU (images adapted from [113]).

In the following, we go through the most common activation functions used in representation learning with neural networks, where:

• a(x) is the neuron unit output

• a'(x) is the derivative of the activation function

Linear Function: it scales its input by a factor c. The function is defined as:

$$a(x) = cx, \qquad a'(x) = c \tag{2.9}$$

As can be seen, it is a simplistic mapping that lacks complexity. The output of the linear function is unbounded, and the scale parameter c needs to be tuned. When c = 1 the function becomes the identity, and many layers in the neural network act as a single mapping from input to output. A neural network based on linear activations always has a constant factor (derived from the activation function) that can be applied when tuning the weights during the learning procedure.

Threshold Function: is generally used in output layers. The function is defined as:

$$a(x) = \begin{cases} 0, & x < 0 \\ 1, & x > 0 \end{cases} \tag{2.10}$$

The binary threshold can be used in classification problems, when the output of the neural network should provide a label for its input (e.g., binary classification or one-versus-all classification). This function is generally not recommended for the neurons in hidden layers [81].

Sigmoid Function: also known as the logistic function [114]. It is defined as:

$$a(x) = \frac{1}{1+e^{-x}}, \qquad a'(x) = \frac{1}{1+e^{-x}}\left(1 - \frac{1}{1+e^{-x}}\right) \tag{2.11}$$

The sigmoid function squashes its input between two values (0 and 1). The derivative of the sigmoid function is hill-shaped, which facilitates the movement in one direction, making the function suitable for use in output layers (especially for classification tasks). It is generally used in shallow neural networks for binary classification and for modeling logistic regression [115, 116]. The main drawback of this function in deep neural network architectures is that the gradient (computed during the backward pass) decreases to almost zero at the early layers. This means that the error computed during the learning procedure has no effect on the weight updates of those layers (the layers are paralyzed), and it should be avoided when initializing the neural network from small random weights [116, 117].
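A minimal NumPy sketch of the sigmoid and of its derivative in Eq. 2.11 (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    """Logistic function, Eq. 2.11."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 5)
print(sigmoid(x))
print(sigmoid_derivative(x))  # peaks at x = 0, vanishes for large |x|
```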

Hyperbolic Tangent Function (tanh): is a smoother, zero-centred function with an output range between −1 and 1 [116, 112]. It is defined as:

$$a(x) = \frac{2}{1+e^{-2x}} - 1, \qquad a'(x) = 1 - a^2(x) \tag{2.12}$$

The tanh activation function mitigates the problem of the sigmoid function in training deep neural networks [118, 115]. This improvement comes from using negative inputs to the neuron when computing the gradient, instead of ignoring them (tanh maps them to negative values instead of 0). It has been used in speech recognition and natural language processing [119, 120]. However, the vanishing gradient problem is still present with tanh, and a gradient of value 1 is obtained only when the input is equal to 0, which can result in dead neurons [116].
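A corresponding NumPy sketch of tanh and of its derivative in Eq. 2.12 (the closed form below is equivalent to np.tanh):

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent, Eq. 2.12."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)**2."""
    t = tanh(x)
    return 1.0 - t * t

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), tanh_derivative(x))  # derivative equals 1 only at x = 0
```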

Rectified Linear Unit (ReLU): is widely used in deep learning applications [121, 120]. It has a simple formula that helps in learning faster, improving the performance, and generalizing the deep learning model [122, 123, 124]. The ReLU function is defined as:

$$a(x) = \max(0, x), \qquad a'(x) = \begin{cases} 0, & x < 0 \\ 1, & x > 0 \end{cases} \tag{2.13}$$

The ReLU function is a nearly linear mapping, which makes the optimization via gradient descent easier [112]. The rectification to zero, besides helping to overcome the vanishing gradient, and the low computational cost make ReLU suitable for hidden and output layers in object classification and speech recognition applications [125, 17, 120, 124]. One drawback of ReLU is the overfitting problem, but the presence of dropout regularization (discussed later in this section) allows an effective use of ReLU in deep neural network architectures. In addition to overfitting, the ReLU function in some cases produces zero gradients, preventing the weights of some neurons from being updated [126]. Modified versions of ReLU have been presented, such as the Leaky Rectified Linear Unit (LReLU) [120] and the Parametric Rectified Linear Unit (PReLU) [17], to address that problem and keep the layer weights updating during the backward pass of the learning procedure.
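A NumPy sketch of ReLU (Eq. 2.13) and of its leaky variant; the slope 0.01 used for LReLU is a common default, not a value prescribed in the text:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit, Eq. 2.13."""
    return np.maximum(0.0, x)

def relu_derivative(x):
    """Sub-gradient of ReLU: 0 for x < 0, 1 for x > 0."""
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: lets a small gradient flow for negative inputs."""
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 3.0])))  # [-0.02  3.  ]
```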

Softmax is an activation function used to compute a probability distribution from a vector of real numbers [116]. The Softmax maps each input value into the range between 0 and 1, with the probabilities summing to one [126]. The Softmax function is defined as:

$$a(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \tag{2.14}$$

The common use of the Softmax function is in the output layers of deep neural network architectures [125, 127, 128] that handle multiple classes. The main difference from the Sigmoid function is that the Sigmoid is used for binary classification tasks, while the Softmax can determine the class probabilities of the input sample (multi-class classification).
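A NumPy sketch of Eq. 2.14; subtracting the maximum score before exponentiating is a standard numerical stability trick and is not part of the equation itself:

```python
import numpy as np

def softmax(x):
    """Softmax over a 1-D vector of scores, Eq. 2.14."""
    shifted = x - np.max(x)     # avoids overflow in the exponential
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())        # probabilities summing to 1.0
```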

Forward propagation The forward pass of a neural network can be seen as the alternated application of linear and non-linear transformations in order to obtain the output value for the input x. In order to understand the process, we decompose the formula in Eq. 2.6 into steps, which we refer to as computation node functions.

The output of the neural network can be seen as the result of the computations accomplished at the neuron nodes. We refer to it by h_w(x), where, e.g., h_w(x) = a^4_1 means the output value of neuron number 1 in layer 4. In the term h_w(x), we simplify the representation of the function by assuming that no bias terms are considered; in case we want to include them, we refer to the term as h_θ(x). In fully connected neural networks, all the neurons of a layer are connected to all the neurons of the next one, and the output of each neuron is an input for the neurons in the next layer. This can be described as [126]:

$$a^1 = X \tag{2.15}$$

$$z^{j+1} = W_j^T \cdot a^j \tag{2.16}$$

$$a^{j+1} = \rho\big(z^{(j+1)}\big) \tag{2.17}$$

$$h_w(x) = a^L = \rho\big(z^{(L)}\big) \tag{2.18}$$

So we can write the predicted value for the input x, obtained by a 4-layer shallow network through neuron 0 (where the third hidden layer has 4 neurons), as:


$$h_w(x) = a_0^{(4)} = \rho\left(\sum_{i=0}^{3} w_i^T \cdot a_i^{(3)}\right) \tag{2.19}$$

The generalized computation function of a node in a neural network with multiple hidden layers and a finite number of neurons can be defined as [126]:

$$a_n^L = \left[\rho\left(\sum_m w_{nm}^{L\,T} \cdot \left[\ldots\left[\rho\left(\sum_j w_{kj}^{2\,T} \cdot x_i + b_j^1\right) + b_k^2\right]\ldots\right]_m + b_n^L\right)\right]_n \tag{2.20}$$
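The whole forward pass of Eqs. 2.15-2.18 can be sketched in a few lines of NumPy; the layer sizes and the choice of ReLU for ρ are illustrative assumptions:

```python
import numpy as np

def forward(x, weights, biases, rho=lambda z: np.maximum(0.0, z)):
    """Forward propagation (Eqs. 2.15-2.18).

    weights[j] has shape (n_out_j, n_in_j); biases[j] has shape (n_out_j,).
    rho is the activation function applied element-wise at every layer.
    """
    a = x                         # a^1 = X              (Eq. 2.15)
    for W, b in zip(weights, biases):
        z = W @ a + b             # z^{j+1} = W_j^T a^j  (Eq. 2.16)
        a = rho(z)                # a^{j+1} = rho(z)     (Eq. 2.17)
    return a                      # h_w(x) = a^L         (Eq. 2.18)

# Toy 4-2-1 network with random parameters.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(2, 4)), rng.normal(size=(1, 2))]
bs = [np.zeros(2), np.zeros(1)]
print(forward(rng.normal(size=4), Ws, bs))
```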

Loss Function The neural network aims at learning a model that correctly maps a set of input samples into a set of outputs. The mapping is carried out through a large number of parameters, represented by weights and biases. In the training process, this map is expressed by predicting the output h_w(x) of the samples in the training data. Then, an optimization algorithm takes place to find better values of the weights that make this prediction as close as possible to the target value y.

The measurement of how far the predicted output is from the target one is known as the objective function, i.e., the function to be minimized or maximized. When the objective is to minimize the function, it may also be called cost function, loss function, or error function [81].

The loss function carries the estimation error that results from a given set of weights in the learned model. It represents the designed goal, and a poorly chosen error function leads to unsatisfactory results [129]. The choice of the loss function should handle the presence of a large space of possible solutions, in which the optimization algorithm iteratively seeks the optimal one by updating the weight parameters.

The maximum likelihood framework aims at optimizing the parameters of a statistical model by maximizing a likelihood function, i.e., it tries to find the best estimate of the model parameters derived from the training data [129]. This is similar to what a neural network model is trying to do while learning: minimizing, through weight updates, the dissimilarity between the distribution of the model predictions and the distribution of the targets given in the training data [81]. Moreover, the learning procedure under the maximum likelihood framework uses the estimation of the loss function to measure this dissimilarity. The framework is consistent, in the sense that the learned parameters converge to the true values when the number of training samples becomes very large [81].

The cross-entropy is used under the maximum likelihood framework to measure the above dissimilarity. In learning the neural network model, we aim at defining the form of the cross-entropy function [81]; in practice, the cross-entropy is expressed by the loss function. The definition of the loss function is coupled with the representation of the output of the neural network (mainly the activation function used in the last layer). In addition, based on the type of learning task, we can classify loss functions into two categories: regression losses and classification losses.

The regression loss function (e.g., the Mean Squared Error (MSE)) is used to define the difference of a predicted real-valued quantity y' = h_w(x) from the target one y. In training the neural network, the MSE is used when a linear activation function is coupled with the output neuron. The MSE can be defined as [81]:


$$MSE_{loss}(h_w(x), y) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - h_w(x_i)\big)^2 \tag{2.21}$$

In the case of label prediction, we can differentiate between three types of prediction: binary classification, in which the neural network aims at inferring in the last layer (which involves one neuron) a value of 0 or 1; single-label multi-target classification, in which the neural network aims at inferring in the last layer (which involves one neuron for each target class) a value between 0 and 1; and multi-label multi-target classification, in which the last layer holds a neuron for each target class and infers a probability value between 0 and 1 for each class.
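A minimal NumPy sketch of Eq. 2.21 (names and values are illustrative):

```python
import numpy as np

def mse_loss(predictions, targets):
    """Mean Squared Error, Eq. 2.21."""
    return np.mean((targets - predictions) ** 2)

print(mse_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0])))  # 0.025
```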

In binary classification, the output neuron combines the sigmoid activation function defined in Eq. 2.11 with the cross-entropy (also referred to as Logarithmic loss) to compare the distribution of the model prediction {h_w(x), 1 − h_w(x)} with the target distribution {y, 1 − y}. The Logarithmic loss is defined as [129]:

$$J_{loss}(h_w(x), y) = -\big[\, y \log(h_w(x)) + (1 - y)\log(1 - h_w(x))\,\big] \tag{2.22}$$

The prediction of multiple labels among multiple classes uses the sigmoid function, and the predictions are inferred by the model for each label. The cross-entropy is used to quantify the difference, for each of the m classes, between the predicted and the true distributions. It can be defined as [129]:

$$J_{loss}(h_w(x), y) = -\sum_{i=1}^{m}\big[\, y_i \log(h_w^i(x)) + (1 - y_i)\log(1 - h_w^i(x))\,\big] \tag{2.23}$$

The prediction of a single label among multiple classes uses the SoftMax function (the probability distribution of the output sums to 1). The cross-entropy quantifies the difference between the model prediction distribution h_w^1(x), h_w^2(x), ..., h_w^m(x) and the true target distribution y_1, y_2, ..., y_m. The cross-entropy relation is defined as [81]:

$$J_{loss}(h_w(x), y) = -\sum_{i=1}^{m} y_i \log(h_w^i(x)) \tag{2.24}$$
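A minimal NumPy sketch of the three cross-entropy variants in Eqs. 2.22-2.24; the small epsilon added inside the logarithm is a common numerical safeguard and is not part of the equations:

```python
import numpy as np

EPS = 1e-12  # avoids log(0)

def binary_cross_entropy(p, y):
    """Logarithmic loss for a sigmoid output, Eq. 2.22."""
    return -(y * np.log(p + EPS) + (1 - y) * np.log(1 - p + EPS))

def multilabel_cross_entropy(p, y):
    """Sum of per-class binary cross-entropies, Eq. 2.23."""
    return np.sum(binary_cross_entropy(p, y))

def categorical_cross_entropy(p, y):
    """Cross-entropy between softmax output p and one-hot target y, Eq. 2.24."""
    return -np.sum(y * np.log(p + EPS))

y_true = np.array([0.0, 1.0, 0.0])   # one-hot target
y_pred = np.array([0.1, 0.8, 0.1])   # softmax output
print(categorical_cross_entropy(y_pred, y_true))  # approx. 0.223
```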

The loss measurement is used to diagnose and evaluate the optimization process of the neural network. It can be used together with other metrics (e.g., accuracy) to monitor the overall performance of the trained model in the task context and to know how well the model is learning. In practice, we expect to obtain low values of the training and validation loss, with a high training and validation accuracy (see Fig. 2.4). Moreover, the information obtained from those measurements is important to control the overfitting problem of the model.

Back-propagation The computation of the scalar cost through the function J(w) in Subsection 2.2.2 is accomplished after the forward propagation of the information through the layers of the network (i.e., the prediction h_w(x)). The term back-propagation refers to the methods that are used to compute the

