
PhD Course
BioRobotics

Academic Year
2016/2017

Human Activity Recognition based on Depth Camera for Assistive Robotics

Author: Alessandro Manzi


We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the light into the peace and safety of a new dark age.

H.P. Lovecraft

I like to think (it has to be!)
of a cybernetic ecology
where we are free of our labors
and joined back to nature,
returned to our mammal brothers and sisters,
and all watched over
by machines of loving grace.

Richard Brautigan


Scuola superiore di studi universitari e di perfezionamento Sant’Anna

Abstract

The BioRobotics Institute

Doctor of Philosophy

Human Activity Recognition based on Depth Camera for Assistive Robotics

by Alessandro MANZI

Nowadays the development of technologies strictly connected to humans has increased exponentially. Smartphones, smart bracelets and all sorts of Internet of Things (IoT) devices permeate our daily life. Realistically, in the near future, technologies related to smart homes will further increase, and all kinds of sensors, domestic robots, computers and wearable devices will share the home environment with us. Therefore, the importance of being aware of human behaviors is beyond doubt. Human activity recognition plays an important role especially in the context of ambient assisted living, providing useful tools to improve people's quality of life.

This dissertation proposes a human activity recognition system based on a depth camera. Furthermore, it investigates and discusses its application in an assistive context. The developed human activity recognition system is based on skeleton data and represents an activity using a small set of basic postures. Machine learning techniques are employed to extract key poses and to perform the classification. In particular, basic postures are extracted by means of unsupervised clustering methods (i.e. K-means, X-means). Informative activity features are generated from the dataset samples and supervised machine learning classifiers are trained (i.e. support vector machines, k-NN, random forest).

The thesis deals with different aspects of the main topic, ranging from the classification performance on activities of daily living, considering both single persons and two-person interactions. Additionally, to overcome some limitations of the employed sensor, the system is further expanded by integrating Cloud and robotics technologies. While the former introduces concepts such as knowledge sharing and computation offloading, the latter extends the limited field of view of the device by employing a platform that is able to autonomously move in an indoor environment.

Finally, a multi-modal approach, which extends the system with the introduction of features extracted from a wearable device, is proposed. Skeleton and inertial features are combined using a decision-level fusion scheme. The use of information provided by a wearable device further extends the applicability of the system, because the user does not need to be in the field of view of the depth camera. Furthermore, it also helps address the occlusion issue of the skeleton tracker.

The system is tested and evaluated on publicly available datasets as well as in realistic experiments. Depending on the specific dataset, the outcomes show improvements over, or results comparable with, the state of the art. A general improvement on the activities most relevant to the assistive context is observed.

Results from realistic experimental tests show that a system with the described features is feasible for real applications in assistive contexts. The use of skeleton trackers extracted from depth maps makes the system tolerant to illumination changes, meaning that it can work in poor light conditions. Moreover, it can provide much more user privacy than standard video camera systems.


Acknowledgements

First of all, I would like to thank my Supervisor, Dr. Filippo Cavallo, for his guidance and constant encouragement in boosting the activities. I want to thank Prof. Paolo Dario for providing me with the opportunity to complete my Ph.D. study at The BioRobotics Institute of Scuola Superiore Sant’Anna.

I would like to thank also Prof. Peter Sinčák and all the labmates for the friendliness and the warm welcome in Košice. I really appreciated every drop (oh well, I should say liter...) of pivo.

Mostly, I have to say thank you to the incredible group of Peccioli, the oldest and the newest members, for enduring my nazi-open-source philosophy. Old members know it and the new ones will know it soon.

Thanks to my friends, for sharing good and bad times, health food and junk food. Even if our paths go in different directions, let's not lose each other!

Special thanks to my mom, my dad, and my bro for being as they are, always present despite the distance. I would also like to thank my parents-in-law for their everyday support.

Last but not least, the biggest thank-you goes to my lovely wife for being a solid pillar of my life. Let's go ahead, always together!


Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Thesis Contribution
  1.2 Background of Human Activity Recognition in AAL
    1.2.1 Overview of Depth Camera Devices
    1.2.2 Skeleton Tracker
    1.2.3 Human Activity Datasets for AAL
      CAD-60
      TST Dataset
      ISR-UoL 3D Social Activity Dataset
      SBU Kinect Interaction Dataset

2 Human Posture Classification and Fall Detection using Skeleton Data from Depth Camera
  2.1 Background
  2.2 System Overview
    2.2.1 Samples
    2.2.2 NN Architecture
    2.2.3 Training, Validation and Testing Sets
  2.3 Real-Time Tests
    2.3.1 Experimental Setup
      Sit and Lie on a Sofa
      Falling Tests
  2.4 Results
  2.5 Fall Detector Application
  2.6 Conclusion

3 Human Activity Recognition using Skeleton Data from Depth Camera
  3.1 A 3D Human Posture Approach for Activity Recognition Based on Depth Camera
    3.1.1 Feature Extraction
      Skeleton Features
      Activity Feature Generation
      Training and Classification
    3.1.2 Experimental Results
      Experiments on CAD-60 Dataset
      Experiments on TST Dataset
  3.2 A Human Activity Recognition System Based on Dynamic Clustering of Skeleton Data
    3.2.1 Training Phase
    3.2.2 Classification Phase
    3.2.3 Experimental Results
      Experiments on CAD-60
      Experiments on TST Dataset
  3.3 Conclusion

4 Social Activity Recognition
  4.1 Background
  4.2 A Skeleton Features Approach for Social Activity Recognition
    4.2.1 Generation of the Classification Models
    4.2.2 Online Classification
  4.3 Experimental Results
    4.3.1 Classification Results
      Independent
      Decision Level
      Feature Level
    4.3.2 Discussion
  4.4 Conclusion

5 Use Case Evaluation of a Cloud Robotics Teleoperation System for AAL
  5.1 Background
  5.2 System Description
  5.3 Use Case Experiment
    5.3.1 Methodology
    5.3.2 Experimental results
      Video Streaming
      Velocity Commands
  5.4 Discussion and Conclusion

6 Toward a Remote Activity Recognition for a Mobile Robotic System
  6.1 Background
  6.2 System description
  6.3 Experiments
  6.4 Results and Discussion

7 Enhancing human activity recognition by robot fusing depth camera and wearable sensors
  7.1 Background
  7.2 System Description
    7.2.1 Mobile Robot
    7.2.2 SensHand
    7.2.3 Data Processing Module
  7.3 Experimental Protocol
  7.4 Activity Recognition
    7.4.1 Feature Extraction
      User Location
      Skeleton Activity Features
      Inertial Features
    7.4.2 Classification
      Classification with Skeleton and Localization Data
      Activity Classification with Inertial Data
      Fusion at Decision Level
  7.5 Results and Discussion
  7.6 Conclusion

8 Conclusion

A Marker-less Motion Capture System to Track Spontaneous Movements in Infants
  A.1 Introduction
  A.2 Material and Methods
    A.2.1 Measurement System
    A.2.2 Validation Study
    A.2.3 Data Analysis
  A.3 Results and Discussion
  A.4 Conclusion

B List of Publications During the Ph.D.
  B.1 Journals and Book Chapters
  B.2 Conference Proceedings
  B.3 Other


List of Figures

1 Popular consumer depth cameras.
2 Infrared pattern of the PrimeSense sensor (retrieved from https://structure.io by Occipital).
3 Skeleton model of the (a) NiTE tracker (retrieved from https://structure.io/openni) and (b) Kinect v1 SDK (retrieved from https://msdn.microsoft.com/en-us/library).
4 Skeleton model of the Kinect v2 NUI SDK (retrieved from https://msdn.microsoft.com/en-us/library).
5 Samples of the CAD-60 Dataset (retrieved from http://pr.cs.cornell.edu/humanactivities/data.php).
6 Samples of the TST Dataset (retrieved from http://www.tlc.dii.univpm.it/blog/databases4kinect).
7 Samples of the ISR-UoL 3D Social Activity Dataset (retrieved from https://lcas.lincoln.ac.uk/wp/research/data-sets-software/isr-uol-3d-social-activity-dataset).
8 Samples of the SBU Kinect Interaction Dataset (retrieved from http://www3.cs.stonybrook.edu/~kyun/research/kinect_interaction/).
9 Examples of worst skeleton detection. (a) The person is far from the sensor and at least two joints are missed. (b) The user is lying on a sofa and the skeleton is fused with the sofa. (c) The person falls down in front of the sensor and the output seems unusable.
10 Three examples of the samples extracted from the dataset: (a) standing, (b) sitting and (c) lying posture.
11 Difference between skeleton representations. The Kinect SDK (left) uses 20 joints, while the NiTE SDK (right) only 15. The missing joints are replaced with the closest.
12 The Fall Detector module reads the output of the neural network continuously. According to the posture and to appropriate thresholds, it is able to send warning or emergency signals.
13 The system is composed of 4 steps. The skeleton data are obtained from the sensor, and the posture features are processed. Informative postures are extracted from the sequence. Then, the activity features are generated from the basic postures. Finally, a classifier is applied.
14 Representation of the human skeleton: (a) original 15 joints; (b) subset of joints for the posture feature (white), with the reference joint (torso) in gray.
15 Example of activity feature instances using a sliding window of L = 5 elements. Duplicates increase the instance weight.
16 Subset example of activity feature instances using a window length equal to 5 and a skeleton of 7 joints (torso omitted).
17 Software architecture of the training phase. The skeleton data are gathered from the depth camera, and the skeleton features are selected and normalized. Then, the input is clustered several times to find the informative postures for the sequence. The activity features are generated from the obtained basic postures and, finally, a classifier is trained and the relative models are generated.
18 Software architecture of the testing phase. As for the training phase, the skeleton features are extracted from the depth camera. Then, the optimal number of clusters is calculated using the X-means algorithm. A classifier is applied using the previously trained model and the generated activity features.
19 The software architecture of the activity recognition system. It is composed of two phases: creation of the models (top) and online classification (bottom). The first phase trains a set of SVM classifiers that will be used during the second phase. These phases share some software blocks, i.e. extraction of the skeleton features and generation of the activity features.
20 Handshake instance example of an activity feature using a window length equal to 5 and a skeleton of 11 joints.
21 The fusion at decision level combines the models of the independent classification by training a k-NN classifier.
22 Architecture of the system. The teleoperation center accesses the robot using the web browser through the Azure Cloud. The robot opens SSH tunnels on the virtual machine, avoiding the setup of a dedicated VPN.
23 DoRo, the domestic robot used for the teleoperation use-case. Copyright Scuola Superiore Sant’Anna.
24 Software architecture of the system. The network communication between the robot and the recognition system relies on the WebSocket technology.
25 The experiments have been conducted in a living room involving 11 people performing 4 activities.
26 In real scenarios, several false detections of skeleton joints can occur. In the talking on phone activity, for example, the hand holding the phone is often detected along the body and rarely detected close to the head.
27 Overall software architecture of the system.
28 The DoRo robot. The depth camera is mounted on its head. Copyright Scuola Superiore Sant’Anna.
29 The SensHand device. The present work considers the data acquired from the wrist and index finger inertial sensors. Copyright Scuola Superiore Sant’Anna (photo by Hauke Seyfarth).
30 Synchronization of heterogeneous data: at time t_m, an array of data is provided for each timestamp t^n_(m-1), composed of IMU and skeleton data. Skeleton data are computed at t^n_(m-1) as a linear interpolation between t_(m-1) and t_m.
31 Representation of the experimental setting. It includes the DoRo robot and the SensHand. We considered only kitchen, bedroom, and living room.
32 Software architecture of the activity recognition system. Two independent classifiers are trained on location, skeleton, and inertial data. A decision-level fusion approach is used to exploit the heterogeneous information. The dotted arrow refers to the configurations where the location features are used with the inertial features in the independent classifier.
33 Schematic representation of the system setup (left). Software modules of the tracking algorithm used for identification and extraction of the five centroids (head, hands and feet) (right).
34 Box plot of the root-mean-square errors (cm) relative to the X, Y and Z coordinates measured with the RGB-D sensor and the optoelectronic system for the five clusters. Horizontal lines illustrate median values, inter-quartile ranges are illustrated by the upper and lower limits of the boxes, full ranges are illustrated by the upper and lower limits of the vertical lines, and the '+' signs illustrate outliers.
35 3D plot for the five clusters measured: head (blue), hands (yellow and


List of Tables

1 Technical specifications of consumer depth cameras.
2 Confusion matrix of the Sofa experiments (NiTE).
3 Confusion matrix for the falling tests (NiTE).
4 Accuracy (NiTE).
5 Overall Precision and Recall (%) values using 4 clusters and different activity window sizes on CAD-60.
6 Precision and Recall values for each activity, using 4 clusters and a window size of 11 elements on CAD-60.
7 The confusion matrix of the "new person" test case using K = 4 clusters and N = 11 window activity size on CAD-60.
8 Precision and Recall values for different joint configurations on CAD-60.
9 Overall Precision and Recall values using 4 clusters and different window activity sizes on the TST dataset.
10 Precision and Recall values for each activity, using 4 clusters and a window size of 5 elements on the TST dataset.
11 Number of clusters obtained with the X-means algorithm considering a different input split frame size for each activity of the CAD-60.
12 Overall precision, recall, and accuracy values using dynamic clustering and a sliding activity window of 11 elements on CAD-60.
13 The confusion matrix of the "new person" test case using a sliding window activity size of 11 elements on CAD-60 with a window split of 100 frames.
14 State of the art of precision and recall values (%) on the CAD-60 dataset.
15 Number of clusters obtained with the X-means algorithm considering a different input split frame size for each activity of the TST.
16 Overall precision, recall, and accuracy values using dynamic clustering and a sliding activity window of 5 elements on TST.
17 The confusion matrix of the TST Fall activity with 100 input frames.
18 ISR dataset: confusion matrices of the independent classification case (N = 11, L = 6).
19 ISR dataset: confusion matrices of the fusion at decision (a) and feature level (b) (N = 11, L = 6).
20 SBU dataset: confusion matrix of the fusion at feature level, using N = 11 skeleton joints and a window length of L = 6 elements. The rightmost column reports the total number of activity samples for the 5-fold evaluation. Precision 0.86, recall 0.87, accuracy 0.88.
21 Comparison between packet size, pps and throughput.
22 Coefficients of variation σ* evaluated on registrations. Data are reported to three decimal places. Marked values are characterized by a coefficient of variation two orders of magnitude greater.
23 Results of the developed classifier with different data used for the training phase. Performances are evaluated as True Positive rate, False Positive rate, Precision, and Recall.
24 The confusion matrix of the "leave-one-actor-out" test case (train 10 subjects, test 1 subject).
25 Activity description within locations (B: Bedroom, L: Living Room, K: Kitchen).
26 Description of the adopted configurations.
27 Classification results according to different features in terms of Accuracy, F-measure, Precision and Recall (S=Skeleton, W=Wrist, L=Location, I=Index).
28 The confusion matrix using only the robot (i.e. skeleton and location features). F = Front, S = Side.
29 The confusion matrix using the combination of all the features (skeleton, location, wrist, and index). F = Front, S = Side.
30 Extraction of quantitative parameters of spontaneous movements for both systems, the RGB-D sensor (ASUS) and the optoelectronic system (SMART).


To my sweetheart and to our little treasure preciously enshrined...


Chapter 1

Introduction

Nowadays the development of technologies strictly connected to humans is increasing exponentially. Smartphones, tablets, smart bracelets and Internet of Things (IoT) devices permeate our daily life. Realistically, in the near future, technologies related to smart homes will further expand [1–3]. Distributed environmental sensors, domestic robots, computers and wearable devices will share the home environment with us. The importance of being aware of human behaviors, especially in the context of ambient assisted living (AAL), is therefore beyond any doubt. For this reason, the capability to recognize human activities represents a fundamental feature.

Human activity recognition is one of the most important areas of computer vision research today. It can be described as the spatiotemporal evolution of different body postures, and its main goal is to automatically detect human activities by analyzing data from various types of devices (e.g. color cameras, range or wearable sensors). Although the recognition of human actions is very important for many real applications, it is still a challenging problem. In the past, research has mainly focused on recognizing activities from video sequences by means of color cameras [4,5]. These solutions are often constrained in terms of computational efficiency and robustness to illumination changes [6]. Other approaches focus on activity recognition via wearable sensors [7]. Even if this type of technology is more intrusive compared to standard video cameras, technological advancements will make it feasible in the near future to use miniaturized sensors that a person can wear with low obtrusiveness.

Besides the methods based on the analysis of 2D video sequences and wearable devices, there are several lines of research that focus on 3D data, employing different technologies. In the past, marker-based motion capture or stereo camera systems have been widely used. Nowadays, technological progress has made available inexpensive depth cameras that can provide 3D data at a suitable resolution and rate. In the last years, these kinds of sensors have become very popular, leading to new works on activity recognition from depth data [8,9]. Examples of these consumer devices are the Microsoft Kinect or the Asus Xtion, which are able to provide both color and depth information. Moreover, specific software trackers, which can efficiently detect the human skeleton directly from the depth maps [10], have been implemented. The availability of these trackers excited the research community and many algorithms have been


proposed to recognize activities using skeletal joint information [11–13]. Moreover, recent studies have demonstrated that the simultaneous use of inertial sensors and depth cameras could improve the recognition of daily activities. For instance, Cippitelli et al. [14] proposed a sensor data fusion approach for the assessment of mobility performance as a case study. Tao et al. [15] presented a comparative study on a dataset of common home actions acquired in a realistic setting using acceleration data from a mobile phone placed on the wrist and a depth camera mounted in a fixed position.

It is worth noting that most of the research in human activity recognition is primarily focused on a single user performing daily activities [16]. On the contrary, few works have been conducted on the human activities related to social interactions [17], even if they represent a fundamental aspect of human life. The psychological aspect of social interactions is complex: it depends on people's feelings and thoughts and is influenced by context, culture, and personal attitude. Certainly, a system which is able to detect and recognize social activities automatically can be applied in AAL contexts.

The employment of systems implementing human activity recognition has great potential, including surveillance, automatic monitoring, and a large range of applications involving human-machine interactions. In particular, supporting elderly people in daily life represents a concrete problem today and in the near future, because of several factors such as low birth rates, aging, and a declining working-age population [18]. Among the solutions that are proposed to address this issue, Information and Communications Technology (ICT) and Robotics are the most studied [19,20]. Nowadays, advances in the robotics field are spreading mobile robots into our daily lives. Compared to past years, recent technological progress has also made robotic applications economically feasible [21]. In particular, service robotics has received considerable attention, especially from academia and industry, in order to deal with the issues arising from demographic changes [22], to encourage social relationships and to improve the feeling of safety, delaying physical and mental decline [23]. These solutions have the potential to enhance the independence and the quality of life of the users and, at the same time, improve the health care system, reducing the overall costs [24]. A system with these requirements needs to be always available, accessible 24 hours a day and 7 days a week. For these reasons, one of the solutions that the research community is exploring is the integration of robotics and Cloud technology [25]. The Cloud Robotics paradigm has been recently defined as "Any robot or automation system that relies on either data or code from a network to support its operation, i.e., where not all sensing, computation, and memory is integrated into a single standalone system" [26]. Cloud technologies provide two main advantages in the field of robotics. Firstly, they allow remote communication and the sharing of knowledge; secondly, they can offload heavy computational tasks.

This dissertation concerns human activity recognition in the AAL context, encompassing all the aforementioned concepts. In particular, it focuses on the use of skeleton joints extracted by consumer depth cameras.


This type of data can be used to extract features to develop technologies and innovative services [27] in assisted living applications [28], and it simplifies the problem of human detection and segmentation. In particular, skeleton features are not affected by environmental light variations while guaranteeing user privacy much more than standard video cameras [29]. One drawback of depth cameras is the small field of view and the difficulty of covering large spaces. This limitation is overcome by proposing a system that uses a mobile robot able to autonomously move in an apartment. However, this approach introduces other challenges for the activity recognition task, such as different fields of view and data variability. These new issues are addressed by evaluating the performance of the system when fusing skeleton data with wearable sensors.

1.1 Thesis Contribution

The aim of the thesis is to investigate and discuss the use of a human activity recognition system based on a depth camera for assistive robotics. In particular, the dissertation focuses on skeleton data extracted from depth maps. The choice of this type of data is driven by the fact that skeleton joints are robust to environmental light variations. Moreover, they can guarantee much more user privacy than video cameras. As a consequence, these features have great importance in developing applications for AAL. Each chapter of this thesis deals with a different aspect of the main topic, ranging from the classification performance of the developed method on daily living activities, to its application to social interactions, its integration with robotics by means of Cloud technology, and a combined approach with a wearable sensor.

The dissertation investigates the benefits of using the skeleton trackers of depth cameras in assistive scenarios. The developed activity recognition system is evaluated both on public datasets and in real-time experiments. The main drawback of this type of sensor is represented by the small field of view. To address this particular limitation, a solution based on a mobile robot is presented. However, collecting data with a moving sensor raises additional problems, such as data variability and fewer skeleton extractions. Hence, solutions based on the introduction of wearable sensors are investigated. In detail, the remainder of this chapter presents the related work on activity recognition methods for AAL. In addition, it describes the adopted consumer depth cameras and the relative skeleton trackers. The section ends with details on the datasets used for the evaluation of the classifiers.

The skeleton trackers have been implemented mainly for gaming. For this reason, the provided data are not very reliable in particular circumstances, such as when the body of the user is partially in view or occluded by an object. Chapter 2 investigates these situations, employing a feed-forward neural network to detect three target postures. Real-time tests were conducted to reproduce these challenging situations.


Chapter 3 describes the developed human activity recognition system. It presents a novel method that represents an activity with a few basic informative postures. Differently from other works on this topic, the system models an activity with a smaller number of basic postures. During the classification step, the system calculates on-the-fly the optimal number of informative postures by means of the X-means clustering algorithm. This allows the system to be fed with a subset of the input sequence. As a consequence, we evaluate the minimum number of frames needed to obtain an acceptable classification rate. The system is tested on two publicly available datasets.

In Chapter 4, the method developed in the previous chapter is adapted to the case of social activities, i.e. two interacting persons. The system is tested on a public dataset and evaluated following three configurations. In the first, two classifiers are tested independently. Then, decision-level and feature-level fusion schemes are adopted to improve the classification results.

Chapter 5 proposes a Cloud robotics teleoperation system. It describes a basic tool for AAL, which can be extended with other functionalities using the "Software-as-a-Service" Cloud model. An example of such a service is given in Chapter 6, which describes the integration of the activity recognition system introduced in Chapter 3 with the current system. Realistic tests were conducted to evaluate the classification performance, and the implications of a moving robot on the data acquisition are discussed.

A multimodal approach for activity recognition is proposed in Chapter 7. It investigates the use of skeleton features, wearable sensors, and data coming from the navigation system of the robotic platform. Realistic and challenging activities were collected to evaluate the developed approach.

Finally, Chapter 8 summarizes this thesis and overviews open problems and next challenges in human activity recognition.

1.2 Background of Human Activity Recognition in AAL

Activity recognition is an important and active research area. Even though early studies using computer vision began in the 1980s, this field still presents challenges, and nowadays researchers explore solutions using various technologies, ranging from video cameras to wearable devices. For instance, a comprehensive survey of advances in activity recognition with sensors contained in modern smartphones (e.g. accelerometer, gyroscope, magnetometer, and light) is presented in [30,31], while the use of wearable sensors is detailed in [32–34]. Usually, these devices can provide accurate information, but they are too intrusive for most people.

In the past, standard video cameras were also used [5,6], and researchers developed methods based on human silhouettes [35,36] or on the detection of scale-invariant spatio-temporal features [37]. In general, these methods have drawbacks such as low robustness in the case of complex environments, light variations, and dynamic backgrounds.


Other approaches focus on the use of 3D data. Marker-based motion capture (Mocap) systems, for example, use data generated from optical markers placed on the body [38–40]. Other methods rely on stereo vision, which reconstructs three-dimensional shapes from two or more intensity images [41]. Examples of activity recognition that use stereo vision are based on body models [42], optical flow [43], and features from depth maps [44]. Reconstruction of 3D information from stereo images still remains a challenging task. Reflections, transparencies, depth discontinuities, lighting changes, and background clutter confound the matching process, and all these issues limit its application [8].

Recently, researchers have turned to consumer depth cameras [8], also known as RGB-D cameras, which, at a low cost and with smaller sizes, are able to provide both color and depth information simultaneously. Some of these new inexpensive sensors are based on structured light, like the Asus Xtion or the Microsoft Kinect v1, or on time-of-flight (ToF) technology, such as the Kinect v2. The first type of camera calculates the depth by projecting a structured light onto the scene and comparing the reflected pattern with the stored pattern. Conversely, ToF cameras compute depth by measuring the time-of-flight of a light signal between the camera and the subject for each point of the image. Details on these two types of technologies are given in [45]. In a depth image, the value of each pixel corresponds to the distance between the real world point and the sensor, which provides the 3D structural information of the scene. Therefore, this technology offers several advantages compared to standard video cameras. In addition to color and texture information, depth images provide three-dimensional data useful for segmentation and detection. Moreover, a system that uses only depth information is able to work in poor light conditions, providing privacy at the same time [29].

In the last years, various methods have been proposed to address the issue of activity recognition from depth images. Aggarwal [8] divides them into five categories according to the adopted features: 3D silhouettes [46,47], spatio-temporal features [48,49], local 3D occupancy features [50,51], 3D optical flow [52,53], and skeleton joints [10]. In particular, the methods belonging to this last category exploit specific software trackers [9,10] that are able to extract skeleton joints from depth maps. This feature excited the interest of the research community, leading to many algorithms that incorporate color and depth information with skeleton joints. For instance, Sung et al. [54] represent an activity as a set of subactivities, which is modeled using more than 700 features computed as Histograms of Oriented Gradients both on color images and on depth maps. A hierarchical maximum entropy Markov model is used to associate subactivities with a high-level activity. Wang et al. [55] introduce the concept of actionlet, which is a particular sequence of features that are termed local occupancy features. An activity is described as a combination of actionlets. Zhu et al. [56] employ several spatio-temporal interest point features extracted from depth maps in combination with skeleton joints to classify actions with an SVM. These methods can reach good results, but typically their performances depend on the complexity of the background scene and the noise present in the depth data.


FIGURE 1: Popular consumer depth cameras: (a) Asus Xtion Pro Live (retrieved from https://www.asus.com/us/3D-Sensor/Xtion_PRO_LIVE/, copyright ASUS), (b) Microsoft Kinect v1, and (c) Kinect v2 (retrieved from https://developer.microsoft.com/en-us/windows/kinect/develop, copyright Microsoft).

In [46], action recognition is performed by extracting human postures. The postures are represented as a bag of 3D points and the actions are modeled as a graph, whose nodes are the extracted postures. Space-Time Occupancy Patterns are proposed in [57], dividing the space and time axes into multiple segments in order to embed a depth map sequence in multiple 4D grids. In [58] the EigenJoints feature descriptor is proposed, combining static posture, motion property, and dynamics. The authors use motion energy to select the informative frames and a Naive-Bayes-Nearest-Neighbor classifier for classification. Koppula et al. [59] also use object affordances and a Markov Random Field to represent their relationship with sub-activities.

Other approaches focus exclusively on the use of human skeleton models to extract informative features for classification. Several joint representations have been proposed: Gan and Chen [60] propose the APJ3D representation, computing the relative positions and local spherical angles from the skeleton joints. The HOJ3D, presented in [61], associates each joint with a particular area using a Gaussian weight function. The temporal evolution of the postures is modeled with a discrete Hidden Markov Model. Gaglio et al. [62] estimate the postures using a multiclass SVM and create an activity model using a discrete HMM. Other works also consider the trajectories of joints [13]. Some researchers focus on the selection of the most informative joints to improve the classification results [12,63].

Considering the aforementioned literature overview, it is evident how some authors focus on the use of multimodal features [64] (i.e. combining color and depth information), while others exclusively rely on human skeleton data. In this dissertation, the activity recognition issue is addressed focusing on this last category. The implemented method, in fact, is based on informative postures known as "key poses". This concept has been introduced in [65] and extensively used in the literature [66,67]. Some authors identify key poses by calculating the kinetic energy [68,69] to segment an activity into static and dynamic poses. In [70], an online classification method with a variable-length maximal entropy Markov model is performed based on the likelihood probabilities for recognizing continuous human actions. In [71] a clustering method is applied to extract relevant features and a multiclass SVM is used for classification.


1.2.1 Overview of Depth Camera Devices

RGB-D sensors, or depth cameras, combine RGB color information with per-pixel depth information. Sensors that provide such data have existed for years, including the Swiss Ranger SR4000 and PMD Tech products. However, these sensors are really expensive (around 10 000 EUR). By contrast, new consumer RGB-D sensors cost less than 200 EUR. Depth cameras are mainly based on two distinct technologies, structured light and time-of-flight (ToF).

The structured light sensing technology that is used in consumer depth cameras was developed by PrimeSense (United States Patent US7433024, owned by Apple Inc.). The technology is licensed for use in commercially available devices like the Asus Xtion PRO Live (Figure 1a) and the Microsoft Kinect v1 (Figure 1b). The sensor projects an infrared speckle pattern (Figure 2) that is then captured by the infrared camera and compared part-by-part to reference patterns previously stored in the device. The sensor then estimates the per-pixel depth based on which reference pattern the projected pattern matches best. The depth data provided by the infrared sensor are then correlated with a calibrated RGB camera. This yields an RGB image with a depth associated with each pixel. A popular unified representation of this data is the point cloud: a collection of points in three-dimensional space, where each point can have additional features such as its color [72].
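To make the relation between a depth pixel and a point of the cloud concrete, the short sketch below back-projects a single pixel with the standard pinhole camera model; the intrinsic parameters used as defaults are illustrative assumptions, not the calibration of any specific device.

```cpp
// Minimal sketch of depth-to-3D back-projection (pinhole camera model).
struct Point3D {
    float x, y, z;
};

// (u, v) is the pixel position and z the depth in meters read from the depth map.
// fx, fy (focal lengths in pixels) and cx, cy (principal point) are assumed,
// illustrative intrinsics; a real device provides its own calibration values.
Point3D backProject(int u, int v, float z,
                    float fx = 525.0f, float fy = 525.0f,
                    float cx = 319.5f, float cy = 239.5f)
{
    Point3D p;
    p.z = z;                  // distance measured by the sensor
    p.x = (u - cx) * z / fx;  // horizontal offset from the optical axis
    p.y = (v - cy) * z / fy;  // vertical offset from the optical axis
    return p;
}
```

Applying this operation to every valid pixel of a depth image yields the point cloud representation mentioned above.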

In July 2014, Microsoft released the Kinect v2 (Figure 1c), based on ToF technology, which resolves distance from the known speed of light by measuring the time-of-flight of a light signal between the camera and the subject for each point of the image. Moreover, this new sensor improves the accuracy. See Table 1 for a technical comparison of the three devices.

The Kinect v1 and the Asus Xtion have comparable characteristics since they are based on the same PrimeSense sensor. On the other hand, the Kinect v2 introduces improvements in both the color and depth sources. However, it is worth noting that the depth images of these distinct technologies are not directly comparable; due to the use of time-of-flight as the core mechanism for depth retrieval, the new Kinect provides a real measured depth value with much higher precision.

TABLE 1: Technical specifications of consumer depth cameras.

                     Asus Xtion    Kinect v1     Kinect v2
FoV (Horizontal)     57°           57°           70°
FoV (Vertical)       43°           43°           60°
Frame Rate           30 fps        30 fps        30 fps
Resolution (depth)   320x240       320x240       512x424
Resolution (color)   640x480       640x480       1920x1080
Distance of Use      0.8m - 3.5m   0.8m - 3.5m   0.5m - 4.5m


FIGURE 2: Infrared pattern of the PrimeSense sensor (retrieved from https://structure.io by Occipital).

FIGURE 3: Skeleton model of the (a) NiTE tracker (retrieved from https://structure.io/openni) and (b) Kinect v1 SDK (retrieved from https://msdn.microsoft.com/en-us/library).

1.2.2 Skeleton Tracker

Related to the aforementioned devices (Figure 1), specific software packages for tracking the user in real time are available. These trackers can provide a skeleton representation of the user framed by the sensor. Among the most used trackers there is NiTE [73], used in conjunction with the OpenNI framework [74]; it is generic and runs with both the Kinect v1 and the Asus Xtion on Windows and Linux operating systems. Regarding the Kinect v2, only the tracker of the Microsoft NUI SDK [75] is available. Although these trackers are similar, the human model varies a lot. The resulting skeleton, in fact, is represented with a different number of joints. In detail, the NiTE tracker provides 15 joints (Figure 3a), while the Kinect v1 SDK represents the skeleton with 20 joints (Figure 3b). Conversely, the NUI SDK of the new Kinect is able to output 25 joints for a human body (Figure 4). Despite these differences, all the trackers provide for each joint the 3D spatial coordinates with respect to the sensor origin frame.


FIGURE 4: Skeleton model of the Kinect v2 NUI SDK (retrieved from https://msdn.microsoft.com/en-us/library).

FIGURE 5: Samples of the CAD-60 Dataset (retrieved from http://pr.cs.cornell.edu/humanactivities/data.php).

Nevertheless, all the trackers are affected by the same drawbacks. The skeleton tracking algorithms work well when the human subject is in an upright position facing the camera and there are no occlusions. If the human subject is partly in view, or the person is interacting with large objects, such as a sofa or a bed, the tracked skeleton is not very reliable.

1.2.3 Human Activity Datasets for AAL

The activity recognition system described in this dissertation is tested on four publicly available datasets. The remainder of this section provides further details.

CAD-60

The Cornell Activity Dataset (CAD-60) [54] focuses on realistic actions from daily life. It is collected using a depth camera and contains actions performed by four different human subjects: two males and two females. Three of the subjects use the right hand to perform actions, while one of them uses the left hand. It contains 12 types of actions: talking on the phone, writing on whiteboard, drinking water, rinsing mouth with water, brushing teeth, wearing contact lenses, talking on couch, relaxing on couch, cooking (chopping), cooking (stirring), opening pill container, and working on computer (see Figure 5).

FIGURE 6: Samples of the TST Dataset (retrieved from http://www.tlc.dii.univpm.it/blog/databases4kinect).

The dataset contains RGB, depth, and skeleton data, with 15 joints available, hence the model is the one reported in Figure 3a. Each subject performs each activity twice, so one sample contains two occurrences of the same activity. The dataset is used with two experimental procedures: the "have seen" and the "new person" settings. In the first case, the classification is done with the data of all four persons, splitting the data in half. The latter uses a leave-one-actor-out cross-validation approach for testing. This means that the classifier is trained on three of the four people and tested on the fourth.
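To illustrate the "new person" protocol, the following sketch splits a set of labelled samples by actor identifier; the types and field names are hypothetical, introduced only for this example.

```cpp
#include <vector>

// Hypothetical container for one dataset sample: the actor who performed it
// and the activity label (skeleton frames omitted for brevity).
struct Sample {
    int actorId;
    int activityLabel;
};

// Leave-one-actor-out split: all samples of 'testActor' form the test set,
// while the samples of the remaining actors form the training set.
void leaveOneActorOut(const std::vector<Sample>& dataset, int testActor,
                      std::vector<Sample>& train, std::vector<Sample>& test)
{
    train.clear();
    test.clear();
    for (const Sample& s : dataset) {
        if (s.actorId == testActor)
            test.push_back(s);
        else
            train.push_back(s);
    }
}
```

Repeating the split for each of the four actors and averaging the results gives the "new person" performance.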

TST Dataset

The TST dataset (version 2) [76] is collected using the Microsoft Kinect v2 device, hence the human model is the one reported in Figure 4. It contains ADL and Fall actions simulated by 11 volunteers, aged between 22 and 39, with different heights (1.62-1.97 m) and sizes. The actions performed by a single person are separated into two main groups, ADL and Fall, and each activity is repeated three times by each subject involved.

The dataset provides 8 actions and 264 sequences for a total of 46k skeleton samples with 25 joints each. Each person performs the following ADL movements: sit on chair, walk and grasp an object, walk back and forth, lie down, and the following Fall actions: frontal fall and lying, backward fall and lying, side fall and lying, fall backward and sit (see Figure 6 for an example).

ISR-UoL 3D Social Activity Dataset

The ISR-UoL 3D Social Activity Dataset [77] contains social interactions between two subjects. It consists of RGB and depth images, and tracked skeletons made of 15 joints acquired by an RGB-D sensor. It includes 8 social activities: handshake, greeting hug, help walk, help stand-up, fight, push, conversation, and call attention (see Figure 7). Each activity is recorded over a period of approximately 40 to 60 seconds of repetitions within the same session, at a frame rate of 30 frames per second. The only exceptions are help walk (at a short distance) and help stand-up, which are recorded 4 times within the same session, regardless of the time spent on them.


FIGURE 7: Samples of the ISR-UoL 3D Social Activity Dataset (retrieved from https://lcas.lincoln.ac.uk/wp/research/data-sets-software/isr-uol-3d-social-activity-dataset).

FIGURE 8: Samples of the SBU Kinect Interaction Dataset (retrieved from http://www3.cs.stonybrook.edu/~kyun/research/kinect_interaction/).

The activities are selected to address the assisted living scenario (e.g. happening in a health-care environment: help walk, help stand-up, and call attention), with potentially harmful situations, such as aggression (e.g. fight, push), and casual activities of social interaction (e.g. handshake, greeting hug, and conversation). The activities are performed by six persons, four males and two females, with an average age of 29.7 ± 4.2, from different nationalities. A total of 10 different combinations of individuals (or sessions) is presented, with variation of the roles (active or passive person) between the subjects. Each subject has participated in at least three combinations, acting each role at least once to increase the generalization of the study regarding individual behavior.

SBU Kinect Interaction Dataset

The SBU Kinect Interaction Dataset consists of RGB images, depth images, and tracked skeleton data acquired by an RGB-D sensor. It includes 8 activities: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands (see Figure 8). The dataset is composed of 21 sets, where each set contains data of different persons (7 participants) performing the actions at a frame rate of 15 frames per second.


The dataset is composed of manually segmented videos for each interaction (approximately 4 seconds), but each video roughly starts from a standing pose before acting and ends with a standing pose after acting. The evaluation is done by 5-fold cross-validation, i.e. 4 folds are used for training and 1 for testing. The partitioning of the dataset into folds is performed so that each two-actor set is guaranteed to appear only in training or only in testing [78].


Chapter 2

Human Posture Classification and Fall Detection using Skeleton Data from Depth Camera

This chapter presents a first step in the evaluation of skeleton data extracted from a depth camera to perform human posture classification. In particular, the main aim is to analyze the behavior of a neural network classifier even when the human skeleton is unreliable and confused. Real-time tests were carried out covering the whole operative range of the sensor (up to 3.5 meters). Finally, a fall detection application is presented.¹

2.1 Background

In the context of home care applications, especially if we consider elderly people, one of the most desirable features is the ability to detect a falling event. Older adult falls lead to reduced functions and premature loss of independence, and oftentimes a fall may indicate a more serious underlying health problem. For these reasons, the importance of an automatic fall detection mechanism able to react quickly is crucial. In the past, video surveillance systems have been proposed to address this issue, but some of their limitations include the light conditions and the lack of privacy (see Sect. 1.2). The advent of depth sensors (see Sect. 1.2.1) has made it feasible and economically sound to capture in real time not only color images, but also depth maps with appropriate resolution and accuracy. A depth sensor can provide the three-dimensional data structure as well as the 3D motion information of the subjects/objects in the scene, which has been shown to be advantageous for human detection [54]. Several works on human posture detection with RGB-D sensors exploit skeleton tracking algorithms for rapidly transforming a person's depth information into spatial joints that represent the human figure [79,80]. Unfortunately, when these methods are used for real world applications, the output is not always stable and reliable (see Figure 9).

¹ Adapted from Alessandro Manzi, Filippo Cavallo, and Paolo Dario. "A Neural Network Approach to Human Posture Classification and Fall Detection Using RGB-D Camera." Italian Forum of Ambient Assisted Living. Springer, Cham, 2016.


FIGURE 9: Examples of worst skeleton detection. (a) The person is far from the sensor and at least two joints are missed. (b) The user is lying on a sofa and the skeleton is fused with the sofa. (c) The person falls down in front of the sensor and the output seems unusable.

The reasons that reduce their performance depend on several factors. Among these, there are the distance between the person and the sensor, the occlusions that occur when people interact with environmental objects, and also sideways poses that hide some parts of the user from the sensor. According to Yu [81], a system for fall recognition must have three main properties: it has to be reliable, unobtrusive and privacy-preserving. Starting from these requirements, a system using skeleton data extracted from a depth camera for the classification of three postures is presented. Using depth maps, the system can work in poor light conditions and guarantees the privacy of the person. To deal with the aforementioned problems that afflict skeleton trackers (noise and missing data), an artificial Neural Network (NN) is adopted. Contrary to similar works in this field [82,83], real-time tests, conceived to reproduce realistic and challenging situations for the tracker and covering the whole operative range of the sensor, have been carried out. During these experiments, the NN has been continuously fed with all the available joints generated by the skeleton tracker in order to analyze its robustness to unreliable and uncertain joints. In Section 2.5, an application for smart homes and a scenario that includes a domestic robot integrating the trained NN for fall detection are presented.

2.2 System Overview

The proposed human posture detection relies on a skeleton tracker algorithm that is able to extract the joints of a person from the depth map. Among the most used skeleton trackers we find the Microsoft Kinect SDK [75], which works with its namesake device, and the NiTE SDK [73], used in conjunction with the OpenNI framework [74], which is generic and runs with the Kinect as well as the Asus Xtion or PrimeSense devices (see Section 1.2.2). These software tools are similar and provide the 3D position of the skeleton joints combined with an additional confidence value for each of them. This datum can assume three values: "tracked" when the algorithm is confident, "inferred" when it applies some heuristics to adjust the position, and "untracked" when there is uncertainty. Both SDKs are affected by the same drawbacks when used in real world applications. The undesirable conditions happen when the user is too close (< 1 m) or too far from the sensor (> 3 m), or when the person assumes sideways poses and occlusions are present.


FIGURE 10: Three examples of the samples extracted from the dataset: (a) standing, (b) sitting and (c) lying posture.

In all these cases, the positions of the calculated joints are not reliable and stable, so the associated confidence values are set as "untracked". Figure 9 shows three examples of the aforementioned cases. A distant person produces fewer joints than usual, a human lying on a sofa confuses the tracker, while a person that falls down abruptly generates a messy output that seems unusable. Nevertheless, the "untracked" joints are still part of the whole skeleton and they have been used to analyze the robustness of the NN against noisy and uncertain values.
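A minimal sketch of the per-joint information exposed by the trackers, as described above; the type names are illustrative and do not correspond to the actual NiTE or Kinect SDK declarations.

```cpp
#include <array>

// Confidence state reported by the skeleton trackers for each joint.
enum class JointConfidence { Tracked, Inferred, Untracked };

// One skeleton joint: 3D position (in meters, sensor reference frame) plus confidence.
struct SkeletonJoint {
    float x, y, z;
    JointConfidence confidence;
};

// A Kinect-SDK-style skeleton with 20 joints. In the experiments described here,
// every joint is passed to the neural network, regardless of its confidence.
using Skeleton = std::array<SkeletonJoint, 20>;
```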

2.2.1 Samples

The aim of the system is to classify human postures to be used in a fall detection program. Although there are numerous online RGB-D datasets for activity recognition, very few of them contain people in a lying position. Therefore, an ad-hoc dataset was built using samples taken from the MSRDailyActivity3D dataset [51], which contains skeletons made of 20 joints (see Figure 3b). Three classes were created, containing subjects in sitting, standing, and lying position, for a total of 120 samples (see Figure 10 for an example).

2.2.2 NN Architecture

The aim of the NN is to detect the three different postures using as input the skeleton joints extracted by the tracker algorithm. The structure of the NN has three layers, with 60 input neurons (3 coordinates for each of the 20 joints, as provided by the text files of the MSRDailyActivity3D dataset) and 3 output neurons, whose values range from 0 to 1 according to the posture. The number of neurons in the hidden layer needs to be minimized in order to keep the number of free variables, namely the associated weights, as small as possible [84], also decreasing the need for a large training set. The cross-validation technique is adopted to find the lowest validation error as a function of the number of hidden neurons. As a result, 42 hidden units have been found to be a sufficient value. The activation function for the hidden and the output layer is the sigmoid, defined as:

y = 1 / (1 + e^(-2sx))


where x is the input to the activation function, y is the output and s is the steepness (= 0.5). The selected learning algorithm is iRPROP-, described in [85], which is a heuristic supervised learning strategy and a variant of the standard resilient back-propagation (RPROP) training algorithm [86]. It is one of the fastest weight update mechanisms and it is adaptive, therefore it does not need an explicit learning rate value. The NN is developed in C++ using the Fast Artificial Neural Network Library [87].
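A minimal sketch of how such a network can be configured with the FANN C API, assuming the 60-42-3 topology, sigmoid activations with steepness 0.5 and FANN's resilient back-propagation training described above; the function name is a placeholder.

```cpp
#include <floatfann.h>

// Build the 60-42-3 posture network: 60 inputs (3 coordinates for each of the
// 20 joints), 42 hidden units and 3 output neurons (one per posture class).
struct fann *createPostureNetwork()
{
    struct fann *ann = fann_create_standard(3, 60, 42, 3);

    // Sigmoid activation with steepness 0.5 on both hidden and output layers.
    fann_set_activation_function_hidden(ann, FANN_SIGMOID);
    fann_set_activation_function_output(ann, FANN_SIGMOID);
    fann_set_activation_steepness_hidden(ann, 0.5);
    fann_set_activation_steepness_output(ann, 0.5);

    // FANN's resilient back-propagation (an iRPROP- variant); being adaptive,
    // it does not require an explicit learning rate.
    fann_set_training_algorithm(ann, FANN_TRAIN_RPROP);
    return ann;
}
```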

2.2.3 Training, Validation and Testing Sets

In order to estimate the generalization performance of the NN and to avoid over-fitting of the parameters, the dataset is randomly divided into a training set, to adjust the weights of the NN, a validation set, to minimize over-fitting, and a testing set, to confirm the predictive power of the network. There is no common splitting rule for the dataset. In the present work, we follow the procedure described in [88], in which it is stated that the fraction of patterns reserved for the validation set should be inversely proportional to the square root of the number of free adjustable parameters. In our case, these sets contain 63, 21 and 36 samples, respectively.

All the data are recorded at a distance of about 2 meters from the sensor. If the NN were trained with them as they are, the network would produce good results only around 2 meters. To overcome this issue, a preprocessing step has been introduced. The NN is trained with normalized joints to ensure depth-invariant features. Each joint vector is normalized with the Euclidean norm:

$$ \hat{j} = \frac{j}{\|j\|} $$

where j is the joint and ĵ is the normalized joint.
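As a sketch of this preprocessing step (the Joint structure and the function name are illustrative), each triplet of coordinates is divided by its Euclidean norm before being concatenated into the 60-value input vector:

```cpp
// Depth-invariance preprocessing: each joint vector is divided by its Euclidean norm.
#include <cmath>
#include <vector>

struct Joint { float x, y, z; };

std::vector<float> normalizeJoints(const std::vector<Joint>& joints)
{
    std::vector<float> input;                   // 60 values for a 20-joint skeleton
    for (const Joint& j : joints) {
        float norm = std::sqrt(j.x * j.x + j.y * j.y + j.z * j.z);
        if (norm == 0.0f) norm = 1.0f;          // guard against degenerate joints
        input.push_back(j.x / norm);
        input.push_back(j.y / norm);
        input.push_back(j.z / norm);
    }
    return input;
}
```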

During the learning phase, the Mean Squared Error (MSE) is computed separately for the training and for the validation set. To guarantee good generalization performance, the process is stopped when the validation error starts to increase, since this means that the NN is over-fitting the data [89]. The final error of the process is 2 × 10^-4 for the training set and 0.028 for the validation set. The training took 339 ms on an Intel Core 2 2 GHz 32-bit machine and produced a 100% recognition rate on all the 36 samples of the testing set.
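A possible implementation of this early-stopping scheme with FANN is sketched below. Stopping at the first increase of the validation MSE is a simplification of the criterion described above, and the data file names are placeholders.

```cpp
// Early-stopping sketch: train one epoch at a time and monitor the validation MSE,
// keeping the weights that achieved the lowest validation error so far.
#include <fann.h>

void trainWithEarlyStopping(struct fann *ann)
{
    struct fann_train_data *train = fann_read_train_from_file("postures_train.data");
    struct fann_train_data *valid = fann_read_train_from_file("postures_valid.data");

    float best_valid_mse = 1e9f;
    for (unsigned int epoch = 0; epoch < 1000; ++epoch) {
        fann_train_epoch(ann, train);                   // one pass over the training set
        float valid_mse = fann_test_data(ann, valid);   // MSE on the validation set

        if (valid_mse < best_valid_mse) {
            best_valid_mse = valid_mse;
            fann_save(ann, "best_posture_net.net");     // keep the best snapshot
        } else {
            break;  // validation error started to increase: stop to limit over-fitting
        }
    }
    fann_destroy_train(train);
    fann_destroy_train(valid);
}
```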

2.3

Real-Time Tests

As expected, testing the NN with the samples of the dataset gives a True Positive Rate (TPR) of 100%, since the data are well acquired and free of excessive noise. To understand the real performance of the network, real-time experiments have been set up. Since the original dataset was built only with the Kinect SDK tracker, in order to prove the generalization power of the NN, tests were carried out with both the Kinect SDK and the NiTE tracker of the OpenNI framework.



FIGURE 11: Difference between skeleton representations. The Kinect SDK (left) uses 20 joints, while the NiTE SDK (right) uses only 15. The missing joints are replaced with the closest available ones.

Although these two software programs behave in a similar way, they have a significant difference: the former represents the human skeleton with 20 joints, while the latter uses only 15. Therefore, to work with the trained NN, the input is preprocessed to fill the missing joints with the closest available ones, as depicted in Figure 11.
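The sketch below illustrates the idea of this filling step. The joint orderings and the specific closest-joint correspondence are assumptions made for the example; the exact mapping used in the experiments is not detailed in the text.

```cpp
// Hypothetical sketch of the joint-filling step for the NiTE tracker: the joints of the
// 20-joint Kinect-SDK model that NiTE does not provide (wrists, ankles, spine, ...) are
// copied from the closest available NiTE joint.
#include <array>
#include <cstddef>

struct Joint { float x, y, z; };

// For each of the 20 Kinect-SDK joints, the index of the NiTE joint used to fill it
// (assumed correspondence; values index the 15-joint NiTE array).
static const std::array<int, 20> kNiteToKinect = {
    2,  2,  1,  0,          // hip center <- torso, spine <- torso, shoulder center <- neck, head
    3,  4,  5,  5,          // L shoulder, L elbow, L wrist <- L hand, L hand
    6,  7,  8,  8,          // R shoulder, R elbow, R wrist <- R hand, R hand
    9, 10, 11, 11,          // L hip, L knee, L ankle <- L foot, L foot
   12, 13, 14, 14           // R hip, R knee, R ankle <- R foot, R foot
};

std::array<Joint, 20> fillMissingJoints(const std::array<Joint, 15>& nite)
{
    std::array<Joint, 20> kinectLike;
    for (std::size_t i = 0; i < kinectLike.size(); ++i)
        kinectLike[i] = nite[kNiteToKinect[i]];
    return kinectLike;
}
```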

2.3.1 Experimental Setup

Two different kinds of experiments were conducted. The first one concerns the detection of the three human postures in a daily-life environment with a sofa, while the second tests how the network behaves when a person falls down. The output of the NN is “standing”, “sitting”, or “lying” according to the output neuron whose value is closest to 1. To analyze the results, the outputs are compared with the actual posture of the person, but the intermediate poses between one posture and another are discarded, i.e. when the user is sitting down or standing up. All the tests run at 25 fps. Since the input of the NN is the skeleton data, which are extracted purely from depth maps, the light conditions do not influence the experiments.
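For the output decision, since the sigmoid outputs lie in [0, 1], picking the neuron closest to 1 amounts to taking the maximum activation, as in the following sketch (the label order is assumed for illustration):

```cpp
// Decision rule: the posture is the label of the output neuron with the largest activation.
#include <fann.h>

const char* classifyPosture(struct fann *ann, fann_type *input /* 60 normalized values */)
{
    static const char* kLabels[3] = { "standing", "sitting", "lying" };  // assumed order
    fann_type *out = fann_run(ann, input);   // pointer to the 3 output activations
    int best = 0;
    for (int i = 1; i < 3; ++i)
        if (out[i] > out[best]) best = i;
    return kLabels[best];
}
```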

Sit and Lie on a Sofa

This experiment was conducted in a real living room with a sofa. The sensors (Kinect and Xtion) have been placed at 1 meter from the ground, facing the sofa. A person, starting from the left, goes to the sofa, sits for a while, lies down on it, then gets up again and goes away. The experiments have been carried out with 6 people (3 male and 3 female) at 3 different distances (3.5, 2.5 and 1.5 meters). This setup is intended to address the human trackers’ problems with distance (Figure 9a) and the merging issue between the human and objects (Figure 9b).

Falling Tests

Given the lack of available datasets containing falling people, we want to understand the ability of the NN to recognize a fall as a lying posture. Therefore, we set up a series of tests in which a man falls down to the side and to the front of the sensor (Figure 9c). The device is placed at 1 meter from the ground.



TABLE 2: Confusion matrix of the Sofa experiments (NiTE)

             standing   sitting   lying
  standing       1
  sitting                   1
  lying                    .03      .97

The NN is fed with all the available joints, even if their confidence value is labelled as “inferred” or “untracked”. In this way, the robustness of the NN against data uncertainty is evaluated. The falling tests are divided into frontal and lateral falls to also take into account the self-occlusion of some parts of the body. They were conducted in a kitchen environment, with a person falling down abruptly while moving forward and sideways, and each test was repeated 5 times. The frontal fall distance from the sensor is about 2 meters, while the distance of the lateral fall is about 3 meters.

2.4

Results

The experiment with the sofa involved 6 persons at different distances and two types of sensors, Kinect and Xtion, with the Kinect SDK and NiTE trackers respectively. The total number of analysed frames is 5214.

The NN output with the Kinect SDK proves to be extremely robust and reliable, achieving 100% for all three postures. The output with the NiTE skeleton tracker is less reliable and is summarized in the confusion matrix of Table 2. As expected, lying is the most challenging posture to classify, since the skeleton tracker provides a clearer output for the other two postures. It is worth noting that, in all cases, the actual lying posture can be misclassified only as sitting, and that neither standing nor sitting is ever classified as lying.

Considering the falling tests, the Kinect tracker yields a TPR of 100% for all the postures, while the NiTE is less reliable but still satisfactory.

Table 3 contains the confusion matrices for these experiments, and the results are consistent with the previous tests. To be thorough, since the person falls down quickly, there are no actual sitting postures. Table 4 contains the accuracy calculated for the sofa and the falling experiments. In general, the real standing posture is always recognized, even when the user is sideways, given that the output of the tracker is cleaner and more reliable in this case. It should be pointed out that these human trackers have been developed for natural interaction and gaming, where the players must stand in front of the sensor.

As expected, the lying posture is the most challenging to detect, but the false positive rate is always zero and the actual lying posture is misclassified only as sitting and never as standing. Since the NN is trained with a dataset built on the Kinect tracker, using that tracker produces excellent results.



TABLE 3: Confusion matrices for the falling tests (NiTE)

(A) Frontal Fall
             standing   sitting   lying
  standing       1
  sitting                   1
  lying                    .07      .93

(B) Lateral Fall
             standing   sitting   lying
  standing      .99       .01
  sitting                   1
  lying                    .05      .95

However, the use of the NiTE tracker does not have detrimental effects. The most interesting result concerns the falling tests: the NN produces a TPR of 95.1% for the lateral test and 93.3% for the frontal one. In particular, if we consider only the lying posture, the frontal fall test has a False Discovery Rate (FDR) of 0% and a False Negative Rate (FNR) of 6.7%, while for the lateral fall test the FDR is 0% and the FNR is 4.9%. The overall accuracy is 98.4% and 98.3%, respectively. Another important aspect to underline is that, in most cases, misclassification happens during posture transitions. These results make the adopted NN feasible for the fall detection application described in the next section.
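For reference, the reported per-class rates follow the standard definitions, assuming they are computed per class from the frame-level counts of true/false positives and negatives:

$$ \mathrm{FDR} = \frac{FP}{FP + TP}, \qquad \mathrm{FNR} = \frac{FN}{TP + FN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

Since no frame of another posture is classified as lying, FP = 0 and the FDR of the lying class is zero, while the FNR accounts for the lying frames missed by the NN (6.7% frontal, 4.9% lateral).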

2.5

Fall Detector Application

Considering the results of the above experiments, a fall detector application, able to generate warning or emergency signals according to the NN output, is developed. Figure 12 outlines its flowchart. The event generator reads the outputs of the NN, storing them with an associated timestamp. When a lying posture is detected and the internal state is not equal to warning, it finds the last detected standing posture and computes the delta time. As already stated by Fu et al. [90], if this value is less than 2 seconds, the system considers it a falling event and generates a warning signal. The detector then continues to check the input and, if the person remains in the lying position for more than 10 seconds, it also sends an emergency signal. Additionally, preliminary tests were conducted integrating a domestic robot into the system. The robot retrieves the Fall Detector events and reacts accordingly. If it receives a warning signal, it moves to the area of interest and starts an interaction procedure with the person to ask him/her whether help is needed. If no answer is received, it warns a specific person (i.e. a caregiver or a relative) through a video-call mechanism.

TABLE 4: Accuracy (NiTE)

             Sofa    Frontal Fall   Lateral Fall
  standing   100%        100%           99.4%
  sitting    99.5%       97.6%          97.5%
  lying      99.5%       97.6%          98.1%



FIGURE 12: The Fall Detector module continuously reads the output of the neural network. According to the posture and to appropriate time thresholds, it is able to send warning or emergency signals.

If the robot receives an emergency signal, it immediately starts an automatic video-call and, at the same time, moves to the area of interest to provide as much information as possible to the caregiver.
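A compact sketch of this event-generation logic is given below. The class, method names, and signalling stubs are illustrative; only the 2-second and 10-second thresholds come from the description above.

```cpp
// Sketch of the event generator: posture labels arrive from the NN classifier with a
// timestamp; a warning is raised if the standing-to-lying transition takes less than 2 s,
// and an emergency if the person then remains lying for more than 10 s.
#include <chrono>
#include <deque>
#include <string>

using Clock = std::chrono::steady_clock;

class FallDetector {
public:
    void onPosture(const std::string& posture, Clock::time_point t)
    {
        history_.push_back({posture, t});

        if (posture == "lying") {
            if (state_ == State::Idle && secondsSinceLastStanding(t) < 2.0) {
                state_ = State::Warning;
                lyingSince_ = t;
                sendWarning();
            } else if (state_ == State::Warning &&
                       std::chrono::duration<double>(t - lyingSince_).count() > 10.0) {
                state_ = State::Emergency;
                sendEmergency();
            }
        } else {
            state_ = State::Idle;   // person is standing or sitting again
        }
    }

private:
    enum class State { Idle, Warning, Emergency };
    struct Sample { std::string posture; Clock::time_point t; };

    double secondsSinceLastStanding(Clock::time_point now) const
    {
        for (auto it = history_.rbegin(); it != history_.rend(); ++it)
            if (it->posture == "standing")
                return std::chrono::duration<double>(now - it->t).count();
        return 1e9;   // no standing posture observed yet
    }

    // Hypothetical placeholders for the signals sent to the robot.
    void sendWarning()   { /* robot approaches the user and asks whether help is needed */ }
    void sendEmergency() { /* robot starts an automatic video-call to the caregiver */ }

    State state_ = State::Idle;
    Clock::time_point lyingSince_{};
    std::deque<Sample> history_;
};
```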

2.6

Conclusion

In this chapter, a feed-forward artificial Neural Network to detect three target postures (i.e. standing, sitting and lying) by means of an RGB-D sensor is presented. The NN is trained with samples extracted from a public dataset recorded with the Kinect SDK, while the real-time tests are carried out both with the Kinect and with the Asus Xtion Pro Live device, using the Kinect proprietary skeleton tracker and the NiTE tracker, respectively. The input data are preprocessed and normalized in order to be depth invariant, improving the results of the NN along the whole field of view of the sensors. The output of these skeleton tracker algorithms in real-world applications is not always stable and accurate, especially when the user is not standing and parts of the human body are occluded by the person itself or by external objects.

A series of real-time experiments, conceived to analyze the behavior of the trained NN in challenging situations, were conducted. During these tests, the NN continuously processes the output of the skeleton tracker, also when the joints are labelled as unreliable. Our results demonstrate its high robustness against data uncertainty, achieving an accuracy of more than 98% for the falling tests. The NN, trained with a Kinect dataset, also demonstrates its power of generalization when it is fed with data produced by a different tracker (the NiTE software).



Following the results of the experiments, a fall detector application which integrates the NN is also presented. The proposed system runs in real-time and, since it is based only on depth maps and does not use color information, it guarantees the privacy of the person and is able to work also in poor light conditions.


Chapter 3

Human Activity Recognition using Skeleton Data from Depth Camera

This chapter focuses on human activity recognition using the skeleton data extracted from a depth camera. It consists of two main sections. The first describes a system that employs machine learning techniques to generate activity features and perform the classification; these features are computed from basic postures extracted by a clustering algorithm. The second presents a slightly modified implementation, which models each activity sample with a different number of clusters (dynamic clustering). This implementation also allows applying the classification to a subset of the input sequence, in order to evaluate the minimum number of frames needed for a correct classification. The two implementations are evaluated on publicly available datasets and compared. The results show how the use of dynamic clustering improves the overall performance.

3.1

A 3D Human Posture Approach for Activity Recognition Based on Depth Camera

This section describes a human activity recognition system based on skeleton data extracted from a depth camera1. The activity is represented with a few basic postures obtained with a clustering algorithm. The software architecture of the system is composed of four main steps (see Figure 13). At the beginning, the relevant skeleton features (i.e. spatial joints) are extracted from the depth device. Then, the basic and informative postures are selected using a clustering method. Afterwards, a new sequence of cluster centroids is built to obtain a temporal sequence of cluster transitions, and an activity window is applied to create the activity feature. Finally, a classifier is trained to perform the recognition of the activity. The remainder of this section describes the development of these phases, starting from the feature extraction process.

1Adapted from Alessandro Manzi, Filippo Cavallo, and Paolo Dario. "A 3D Human Posture Approach for Activity Recognition Based on Depth Camera." European Conference on Computer Vision. Springer International Publishing, 2016.



FIGURE 13: The system is composed of 4 steps. The skeleton data are obtained from the sensor, and the posture features are processed. Informative postures are extracted from the sequence. Then, the activity features are generated from the basic postures. Finally, a classifier is applied.

3.1.1 Feature Extraction

This section details the feature extraction process, which involves three steps. First, the relevant skeleton features (i.e. spatial joints) are extracted from the depth camera device using a skeleton tracker [10]. Then, the data are clustered to find the basic and informative postures that describe the activities. Afterwards, a temporal sequence is built for each set of clusters and a sliding window is applied to create the activity feature.

Skeleton Features

The coordinates of the human skeleton are extracted from the depth maps captured by the RGB-D sensor [10]. A human pose is represented by a number of joints that varies depending on the skeleton model of the software tracker (see Section 1.2.2). Figure 14a shows a skeleton made of 15 joints. Each joint is described by three-dimensional Euclidean coordinates with respect to the sensor.

These raw data cannot be used directly, because they depend on the position of the human relative to the sensor and also on the subject's dimensions, such as height and limb length. To manage these issues, the original reference frame is moved from the camera to the torso joint, and the joints are scaled with respect to the distance between the neck and the torso joint. This normalization step, adopted also in other works [62, 68, 71], yields data that are more independent of the person's specific size and of the relative position between the sensor and the user.
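A sketch of this normalization step is shown below; the structure and the joint indices are illustrative, since they depend on the skeleton model of the tracker.

```cpp
// Sketch of the skeleton normalization: joints are expressed in a torso-centered reference
// frame and scaled by the neck-torso distance, so that the feature does not depend on the
// sensor position or on the subject's size.
#include <cmath>
#include <cstddef>
#include <vector>

struct Joint { float x, y, z; };

static float distance(const Joint& a, const Joint& b)
{
    return std::sqrt((a.x - b.x) * (a.x - b.x) +
                     (a.y - b.y) * (a.y - b.y) +
                     (a.z - b.z) * (a.z - b.z));
}

// 'torsoIdx' and 'neckIdx' depend on the tracker's skeleton model (assumed here).
std::vector<Joint> normalizeSkeleton(const std::vector<Joint>& joints,
                                     std::size_t torsoIdx, std::size_t neckIdx)
{
    const Joint& torso = joints[torsoIdx];
    float scale = distance(joints[neckIdx], torso);
    if (scale == 0.0f) scale = 1.0f;              // guard against degenerate frames

    std::vector<Joint> normalized;
    normalized.reserve(joints.size());
    for (const Joint& j : joints)
        normalized.push_back({ (j.x - torso.x) / scale,
                               (j.y - torso.y) / scale,
                               (j.z - torso.z) / scale });
    return normalized;
}
```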

Formally, if we consider a skeleton with N joints, the skeleton feature vector f is defined as

$$ f = [\, j_1, j_2, \ldots, j_{N-1} \,] \tag{3.1} $$

where each j_i is the vector containing the 3-D normalized coordinates of the i-th joint J_i detected by the sensor. Therefore, j_i is defined as
