Dipartimento di / Department of Informatics, Systems and Communication
Dottorato di Ricerca in / PhD program: Computer Science Ciclo / Cycle: XXXIII
Deep Psychology Recognition Based on Automatic Analysis of Non-Verbal Behaviors
Cognome / Surname: Khalifa Nome / Name: Intissar Matricola / Registration number: 836548
Supervisor: Prof. Mourad Zaied Supervisor: Prof. Raimondo Schettini Co-Supervisor: Dr. Ridha Ejbali
Coordinatore / Coordinator: Prof. Leonardo Mariani
ANNO ACCADEMICO / ACADEMIC YEAR 2019-2020
I dedicate this milestone in my life to all those precious people.
To my dear parents Mohamed and Moufida who have been the symbol of love, tenderness, sacrifice, happiness, peace, and source of encouragement and inspiration to me throughout my life. Thank you for your endless patience, moral support and your permanent valuable pieces of advice that have always guided my steps towards success. I will always do my best to let you feel proud and never disappoint you.
May God preserve you and grant you health and happiness.
To my sisters Achwak and Amira, this work is a sign of my attachment and love, we shared the most pleasant and difficult moments. You have always been by my side to help me reach my objectives.
To my lovely aunt Noura who has been a source of motivation, love, and strength during moments of despair and discouragement. May God protect you from harm.
To my big family, I dedicate this work to you as an expression of gratitude for your encouragement and support during the difficult moments of my life.
To all those dear people I can never forget, I dedicate this work.
INTISSAR...
First of all, I want to thank my supervisor Prof. Mourad Zaied. I am very grateful to him for his tolerance in letting me pursue my own academic path, since without his counsel and early interest I would not have done any research at all. He provided me with many academic opportunities that would not otherwise have been possible.
His insight and feedback have been invaluable.
I also want to thank my supervisor Prof. Raimondo Schettini. I am honored to work with him. Despite all his commitments, he always found time to listen, discuss, and provide guidance. During my period abroad, he was always by my side to help and support me in very difficult moments.
I would like to express my gratitude to my co-supervisor Prof. Ridha Ejbali for his considerable time and availability of mind. He has been helpful in providing advice, precious suggestions, and insightful guidance.
I would like to express my deep gratitude to the committee members for accepting to evaluate my thesis and for their considerable time, energy, and availability.
I also want to extend my thanks to all my colleagues and the doctors at the RTIM research unit; the beautiful times we spent together, especially in meetings and workshops, will remain firmly in my mind.
I want to thank my friends in the Imaging and Vision Laboratory, with whom I had the chance to discuss and reason about anything. I shared many pleasant moments with them and never felt lonely despite being far from my country and my family.
Special thanks to any person in the Department of Informatics, Systems and Communication who helped me in very critical moments.
Dale Carnegie
"The only way to change someone’s mind is to connect with them from the heart."
Rasheed Ogunlaru
Un aspetto estremamente cruciale nel dominio dell'interazione uomo-uomo è la comunicazione delle emozioni. Essere in grado di dedurre gli stati emotivi attraverso comportamenti non-verbali consente agli esseri umani di comprendere e ragionare su obiettivi ed intenti altrui. L'Affective Computing è una branca dell'informatica che mira a trarre vantaggio dal potere delle emozioni per facilitare un'interazione uomo-macchina più efficiente. L'obiettivo è dare alle macchine la capacità di esprimere, riconoscere e regolare le emozioni. In questa tesi, esamineremo in dettaglio il ruolo delle espressioni visive ed uditive nel comunicare emozioni, e svilupperemo modelli computazionali per il riconoscimento automatico delle emozioni: un'area di ricerca molto attiva nell'ultimo decennio. In generale, la comunicazione delle emozioni attraverso i segnali del corpo è compresa in misura minore rispetto ad altre modalità. La psicologia sociale che ha ispirato molti approcci computazionali si è tradizionalmente concentrata sui segnali facciali. Tuttavia, la gestualità del corpo è una fonte significativa di informazioni, soprattutto quando altri canali sono nascosti o in presenza di sottili sfumature di espressioni. In questo contesto, proporremo diversi approcci per il riconoscimento di gesti con applicazione alle emozioni, utilizzando due modelli. Per il modello basato su parti, svilupperemo un approccio ibrido che incorpora due tecniche di stima del movimento e di normalizzazione temporale per la modellazione del movimento della mano. Passeremo poi a presentare il nostro approccio spazio-temporale profondo (deep) per modellare il movimento del corpo, ed infine ottenere lo stato emotivo della persona. In questa parte, dimostreremo che la nostra tecnica basata sul deep learning supera le tradizionali tecniche di machine learning. Per il modello basato sulla cinematica, combineremo la stima della posa del soggetto (con applicazione al rilevamento dello scheletro) e la classificazione delle emozioni per proporre una nuova architettura profonda a più stadi in grado di affrontare entrambi i compiti sfruttando i punti di forza dei modelli pre-addestrati.
Dimostreremo che le tecniche di transfer learning superano le tradizionali tecniche di apprendimento automatico. Come ulteriore modalità, il parlato è la forma più comune e veloce per comunicare tra esseri umani. Questa realtà ci ha spinti a riconoscere le condizioni emotive del soggetto parlante in maniera automatica tramite la sua voce. Proporremo una rappresentazione profonda di tipo temporale e basata sul cepstrum, fondata sulla concatenazione di caratteristiche spettrali e delle loro derivate temporali, insieme ad un classificatore di deep learning, per il riconoscimento delle emozioni del parlato.
I risultati ottenuti per entrambe le modalità utilizzando i nostri metodi sono molto promettenti e competitivi rispetto ai metodi esistenti nello stato dell'arte. Riteniamo che il nostro lavoro sia pertinente sia per il social computing che per la psicologia organizzativa. Prendendo come esempio i colloqui di lavoro, un ambito ben studiato dagli psicologi sociali, il nostro studio può fornire informazioni utili su come sfruttare i segnali non verbali per supportare le aziende nel processo di assunzione. Questa tesi descrive la fattibilità di usare indizi estratti automaticamente per analizzare gli stati psicologici, come interessante alternativa alle annotazioni manuali dei segnali comportamentali.
Parole chiave: Comportamenti non-verbali, emozione, gesti del corpo, linguaggio parlato, modello basato su parti, modello basato su cinematica.
One highly crucial aspect in the domain of human-human interaction is the communication of emotions. Being able to deduce emotional states through non-verbal behaviors allows humans to understand and reason about each other's underlying goals and intents. Affective Computing is the branch of computer science that aims to profit from the power of emotions to facilitate a more efficient human-machine interaction. The goal is to give machines the ability to express, recognize, and regulate emotions. In this dissertation, we look in detail at the role of visual and auditory expressions for communicating emotions, and we develop computational models for automatic emotion recognition, which has been an active research area over the last decade. In general, communication of emotions through body cues is less understood than other modalities. Social psychology, which has inspired many computational approaches, has traditionally focused on facial cues. However, body gestures are a significant source of information, especially when other channels are hidden or when there is only a subtle nuance of expression. In this context, we propose our approaches for emotional body gesture recognition using two different models. For the part-based model, we develop a hybrid approach that incorporates two techniques of motion estimation and temporal normalization for hand motion modeling; we then present our deep spatio-temporal approach for body motion modeling to finally obtain the person's emotional state. In this part, we demonstrate that our deep learning technique outperforms traditional machine learning techniques. For the kinematic-based model, we combine human pose estimation for skeleton detection and emotion classification to propose a new deep multi-stage architecture able to deal with both tasks by exploiting the strong points of pre-trained models. We demonstrate that transfer learning techniques outperform traditional machine learning techniques. As another modality, speech is the most common and fastest way to communicate among humans.
This reality motivates us to automatically identify the emotional conditions of the uttering person from his/her voice. We propose a deep temporal-cepstrum representation based on the concatenation of spectral features and their temporal derivatives, together with a deep learning classifier, for speech emotion recognition. The results obtained for both modalities using our suggested methods are very promising and competitive with existing methods in the state of the art. We believe that our work is relevant both to social computing and to organizational psychology. Taking job interviews, a setting well studied by social psychologists, as an example, our study may provide insights into how non-verbal cues could be used by companies in hiring decisions. In fact, our dissertation shows the feasibility of using automatically extracted cues to analyze psychological states as an attractive alternative to manual annotations of behavioral cues.
Keywords: Non-verbal behaviors, emotion, body gestures, speech, part-based model, kinematic-based model.
Dedications i
Acknowledgements ii
Sommario iv
Abstract vii
List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 Motivations and problem statement . . . . 1
1.2 Non-verbal behaviors . . . . 2
1.3 Social psychology and job interview . . . . 3
1.4 Thesis objective . . . . 5
1.5 Open Issues . . . . 6
1.6 Thesis plan . . . . 7
1.7 Contributions and publications . . . . 7
2 Emotion recognition system: Background and related work 9
2.1 Introduction . . . . 9
2.2 Notions about emotions . . . . 9
2.2.1 Emotion types. . . . 9
2.2.2 Emotion models . . . 10
2.2.2.1 Categorical model . . . 10
2.2.2.2 Dimensional model . . . 11
2.2.2.3 Componential model . . . 12
2.3 Emotional body gesture recognition: State of the art . . . 12
2.3.1 Body gesture cues . . . 12
2.3.2 Human body models . . . 14
2.3.2.1 Part-based model . . . 14
2.3.3.1 Hand gesture modeling . . . 17
2.3.3.2 Body gesture modeling . . . 21
2.4 Speech emotion recognition (SER): State of the art . . . 24
2.4.1 Vocalic cues . . . 24
2.4.2 Acoustic features for SER . . . 25
2.4.3 Existing techniques for SER . . . 25
2.5 Discussion . . . 28
2.6 Conclusion . . . 30
3 Part-based Model for Emotional Body Gesture Recognition 31
3.1 Introduction . . . 31
3.2 Hand motion modeling for psychology analysis . . . 32
3.2.1 Hand motion modeling using Kalman Filter method . . . 32
3.2.2 Overview of our proposed OF-HMI method . . . 33
3.2.3 Pre-treatment for hand detection . . . 34
3.2.4 Hand tracking using Optical Flow . . . 36
3.2.5 Feature extraction . . . 37
3.2.6 Hand gesture representation using OF-HMI . . . 38
3.3 Body gesture modeling for psychology analysis using deep spatio-temporal approach . . . 39
3.3.1 Overview of our method . . . 39
3.3.2 Apex frame extraction . . . 40
3.3.3 Temporal normalization for body gesture representation . . . 41
3.3.4 Emotional state classification based on deep learning architecture . . . 42
3.3.4.1 Principle of Sparse Auto-Encoder (SAE) . . . 43
3.3.4.2 Emotion classification using Stacked Auto-Encoder . . . 43
3.4 Experiments and results . . . 45
3.4.1 Dataset description . . . 45
3.4.2 Performance Evaluation . . . 47
3.4.2.1 Hand motion modeling result . . . 48
3.4.2.2 Body gesture modeling result . . . 52
3.5 Discussion . . . 53
3.6 Conclusion . . . 55
4 Kinematic-based model for emotional body gesture recognition 56
4.1 Introduction . . . 56
4.2 Overview of our deep multi-stage method . . . 58
4.3.1.1 General architecture of CNN . . . 59
4.3.1.2 Skeleton features extraction based on MobileNet model . . . 60
4.3.2 Detection-based approach . . . 62
4.4 Human pose representation . . . 63
4.5 Emotion classification based on VGG architecture . . . 65
4.6 Experimental results . . . 66
4.6.1 Dataset description . . . 66
4.6.2 Pose estimation result . . . 67
4.6.3 Emotion classification results . . . 69
4.7 Discussion . . . 72
4.8 Conclusion . . . 72
5 Speech emotion recognition 74
5.1 Introduction . . . 74
5.2 Proposed deep temporal-cepstrum representation approach . . . 74
5.2.1 Cepstral features extraction . . . 75
5.2.1.1 Static, dynamic, and acceleration features generation . . . 76
5.2.1.2 Global features . . . 77
5.2.2 Speech emotion recognition based on CNN classifier . . . 78
5.3 Experimental results . . . 78
5.3.1 Datasets description . . . 78
5.3.1.1 Ryerson Audio-Visual Emotional Speech and Song (RAVDESS) dataset . . . 79
5.3.1.2 The Berlin Database of Emotional Speech (EMODB) . . . 79
5.3.2 Performance evaluation . . . 79
5.3.2.1 Cepstral features extraction result . . . 80
5.3.2.2 Emotion recognition result . . . 81
5.4 Discussion . . . 86
5.5 Conclusion . . . 87
6 Conclusions 88
Bibliography 89
1.1 Elements of personal communication . . . . 3
1.2 Non-verbal behavior categories . . . . 4
1.3 From behavior to emotion classification . . . . 5
2.1 Dimensional representation of emotions: Russel’s model [36] . . . 11
2.2 Componential representation of emotions: Plutchik’s model [42] . . . 12
2.3 Gesture categories [95] . . . 13
2.4 Object representations [12]: . . . 15
2.5 Part-based model vs kinematic-based model [42] . . . 15
2.6 Glove systems examples [83]: . . . 18
2.7 Real-time hand detection and finger counting . . . 20
2.8 Acoustic features categories [91] . . . 25
3.1 Emotional body gesture recognition system using the first scenario: Part-based model . . . 31
3.2 Hand motion tracking using Kalman Filter . . . 34
3.3 Overview of our OF-HMI architecture . . . 35
3.4 Face detection process . . . 35
3.5 Pre-treatment for multiple hand detection . . . 37
3.6 Hand tracking using Optical Flow . . . 38
3.7 The result of application of OF-HMI in video sequence . . . 39
3.8 Overview of our deep spatio-temporal approach . . . 40
3.9 Pipeline of key frame extraction method . . . 41
3.10 Example of application of EBMI in video sequence of anxious person . . . 42
3.11 Basic of Auto-Encoder . . . 44
3.12 Architecture of SSAE proposed for emotion classification . . . 44
3.13 Examples from FABO dataset . . . 46
3.14 Sample images from "Fear", "Happiness", "Uncertainty", and "Boredom" expressions videos in FABO database recorded by body . . . 46
3.15 Example of hand trajectory using KF . . . 49
3.16 Example of miss tracking using KF . . . 50
3.17 Hand tracking using HSOF. . . 50
3.18 Hand motion modeling examples (gesturing and waving hand): . . . . 51
ture recognition . . . 54
4.1 Emotional body gesture recognition system using the second scenario: Kinematic-based model . . . 57
4.2 Proposed multi-stage approach . . . 57
4.3 General structure of CNN . . . 59
4.4 Feature extraction and heat map generation . . . 61
4.5 Keypoint detection and tracking . . . 62
4.6 2D pose reconstruction . . . 63
4.7 Hierarchical representation of human pose . . . 64
4.8 Fine-tuned VGG-16 for emotion classification . . . 65
4.9 Pre-treatment for emotion classification . . . 66
4.10 COCO dataset and skeleton with 18 keypoints . . . 67
4.11 Skeleton tracking for a person with fear expression: . . . 67
4.12 Illustration of the deep pose decoding method: . . . 68
4.13 Problem of overfitting in testing accuracy graph and loss graph. . . . 69
4.14 Model loss and accuracy graph for training and validation of 10 emotions . . . 70
4.15 Average classification for 5 emotions . . . 71
4.16 Classification rate (%) comparison of 10 emotions between our method and Chen et al. methods [135] using FABO dataset . . . 71
5.1 General overview of speech emotion recognition system . . . 74
5.2 Pipeline of the proposed method: . . . 75
5.3 From input signal to the MFCC generation: . . . 76
5.4 Concatenation of static, dynamic, and acceleration features . . . 77
5.5 Examples from RAVDESS dataset. . . 79
5.6 Examples from EMODB dataset: . . . 80
5.7 Cepstral features extraction of the speech of fearful person: . . . 81
5.8 Learning curves obtained for classification of 8 emotions using RAVDESS dataset . . . 82
5.9 Confusion Matrix of 8 emotions using RAVDESS dataset . . . 83
5.10 Learning curves obtained for classification of 7 emotions using EMODB dataset . . . 83
5.11 Confusion Matrix of 7 emotions using EMODB dataset . . . 84
2.1 Classification based on categorical model . . . 11
2.2 Body gesture cues and their interpretations . . . 14
2.3 Keypoints output format for pose estimation . . . 16
2.4 Vocalic cues and their interpretations [53] . . . 24
2.5 A comparative chart: cepstral and non-cepstral features [143] . . . . 26
3.1 Performance analysis of tracking algorithms . . . 49
3.2 Tracking rate for hand gesture . . . 49
3.3 Correct classification rate for hand gesture . . . 51
3.4 Comparison between deep Spatio-Temporal method and state of the art methods using FABO dataset . . . 53
4.1 Comparison between our method and state of the art methods using FABO dataset . . . 72
5.1 CNN architecture. L can be 40 or 120 on the basis of the feature vector used . . . 78
5.2 Comparison between RF and CNN using Mean MFCC (40) on the RAVDESS dataset . . . 82
5.3 Comparison between RF and CNN using our temporal derivative cepstral features on the RAVDESS dataset . . . 84
5.4 Comparison between RF and CNN using Mean MFCC (40) on the EMODB dataset . . . 85
5.5 Comparison between RF and CNN using our temporal derivative cepstral features on the EMODB dataset . . . 85
5.6 Comparison between our method and state-of-the-art methods using RAVDESS dataset . . . 85
5.7 Comparison between our method and state-of-the-art methods using EMODB dataset . . . 86
We enumerate here the abbreviations and acronyms used in this report.
AC Affective Computing
AI Artificial Intelligence
AE Auto-Encoder
BN Bayes Network
BOW Bag of Words
BEL Brain Emotional Learning model
BP Back-Propagation
COCO Common Object in Context
CNN Convolutional Neural Network
CCCNN Cross-Channel Convolutional Neural Network
DT Decision Tree
DTW Dynamic Time Warping
DBN Deep Belief Network
DWT Discrete Wavelet Transform
DCN Depthwise Convolutional Network
DSC Depthwise Separable Convolution
DCT Discrete Cosine Transform
ET Ensemble Tree
EBMI Energy Binary Motion Image
FFT Fast Fourier Transform
GMM Gaussian Mixture Model
GNB Gaussian Naive Bayes
GPU Graphics Processing Unit
HMM Hidden Markov Model
HMI History Motion Image
HOG Histogram of Oriented Gradients
HNB Hidden Naïve Bayes
HSOF Horn Schunck Optical Flow
HPI History Pose Image
LSTM Long Short-Term Memory
LPC Linear Predictive Coding
LMT Logistic Model Tree
MCCNN Multi-Channel Convolutional Neural Network
MFCC Mel-Frequency Cepstral Coefficients
MDS Multi-Dimensional Scaling
NN Neural Network
OF Optical Flow
RF Random Forest
ReLU Rectified Linear Unit
SER Speech Emotion Recognition
SVM Support Vector Machine
SVC Support Vector Classifier
STE Short Term Energy
SSAE Stacked Sparse Auto-Encoder
SAE Sparse Auto-Encoder
SC Softmax Classifier
SGD Stochastic Gradient Descent
TN Temporal Normalization
TEO Teager Energy Operator
VGG Visual Geometry Group
ZCR Zero Crossing Rate
Introduction
1.1 Motivations and problem statement
Emotions color our lives and allow us to express different facets of our personality.
In the previous century, reason was widely believed to be omnipotent and emotion was neglected: it was considered an obstacle to the work of reason. Thanks to neuroscience and brain imaging, we now know that the human being is not a purely rational decision-maker and that emotion is a fundamental partner in human cognition, creativity, and decision-making [37]. The emotional state of humans can be obtained from a wide range of behavioral cues and signals that are available through visual, auditory, and physiological expressions of emotion:
• Emotional state through visual expression is evaluated according to the modulation of facial expressions, gestures, postures, and more generally body language. Data is captured by a camera, allowing non-intrusive configurations.
• Emotional state through auditory expression can be estimated as a modulation of the speech signal [97]. In this case, data is picked up by a microphone, which also allows non-intrusive system configurations. However, the processing is difficult to handle when more than one voice is present in the audio stream.
• Emotional state through physiological representation is estimated by the modulation of the activity of the autonomic nervous system (ANS) [31]. The main limitation is related to the intrusiveness of the sensing devices. Moreover, unlike facial expressions, body gestures, or voice, physiological sensors are difficult for users to manipulate freely.
Since emotional content reflects human behavior, automatic emotion recognition is a topic of growing interest. Emotions play an implicit role in the communication process compared to the explicit message given by the lexical level. The behavior to be recognized is complex and subtle, presenting diverse manifestations and depending on many factors (social and cultural context, personality of the speaker, etc.).
Emotion Artificial Intelligence Programming, also called Affective Computing (AC), is a key research topic in Artificial Intelligence (AI) dealing with emotions and machines. AC is related to the study and development of systems and devices that can recognize, interpret, process, and simulate human emotions. It consists of the estimation and measurement of human emotions from a variety of data gathered from different sources like speech rate, voice tone, facial expressions, and body gestures.
This thesis explores the use of visual expressions (body gestures) and auditory expressions (speech rate and voice tone) to develop an emotion recognition system.
These expressions could be real indicators of the emotional state and significant sources of information when other channels like facial expressions are hidden, for instance behind a fake smile or a neutral face, or when there is only a subtle nuance of expression.
There exists a wide range of applications, such as:
• E-learning: The tutor expands the explanation when the user is in a state of confusion (surprised, bored, puzzled) and adds information when he/she is in a state of curiosity (happiness, comfort, relaxation).
• E-therapy: Psychological health services (evaluate the psychological state through the patient’s gesture).
• Enhanced website customization: Evaluate a product based on the web surfers' emotions.
• Entertainment: Affect feedback on players' satisfaction to change some characters in the video game.
• Software engineering: Emotions have a significant impact on developers' productivity and code quality; stress and boredom are often induced by time pressure.
• Job interview: The candidate's behavior during the interview session affects the hiring decision. There is a strong relationship between behaviors, interview outcomes, and job performance.
1.2 Non-verbal behaviors
Non-verbal behavior expresses and reveals human emotions and represents, according to social psychologists, 93% of our interactions with others [8] as shown
Figure 1.1: Elements of personal communication
in Figure 1.1. Non-verbal behavior can be divided into four categories [3]: proxemics, haptics, kinesics, and vocalics, as shown in Figure 1.2.
• Proxemics: relates to how a person uses the space around his/her body in human interactions.
• Vocalics or paralanguages: refers to how speakers express their emotions through voice [154].
• Haptics: refers to the way humans communicate and interact through the sense of touch. Some self-touch behaviors are related to negative feelings such as stress, psychological discomfort, and anxiety.
• Kinesics: relates to the movement of the entire body or of body parts, such as gestures [42, 65] and facial expressions [169]. When we have spontaneous, unconscious, and non-communicative body gestures, we talk about adaptors, as specified in the works of Kipp [95], Ekman and Friesen [39], and McNeill [32].
1.3 Social psychology and job interview
Companies always aim to hire the best employees. The selection of the most suitable person for the job depends on the candidate's answers to the interviewer's questions and on his/her behavior during the interview session. The study of behaviors is based on the psychological interpretation of the movements and gestures accompanying the answers and discussions. In this context, social psychologists have long studied job interviews in order to understand the relationships between behavior, interview outcomes, and job performance. Several companies give more importance to psychological tests based on the observation of the
Figure 1.2: Non-verbal behavior categories
candidate's behavior more than on the answers given, especially for sensitive positions in trade, marketing, investigation, etc. Psychology studies used to rely on manual annotations by observers, but in the last decade the advent of inexpensive audio and video sensors, in conjunction with improved perceptual processing techniques, has enabled the automatic and accurate extraction of behavioral cues, encouraging the conduct of social psychology studies. The use of automatically extracted non-verbal cues in combination with machine learning techniques has led to the development of computational methods for the automatic prediction of emotional state. Many works have investigated the reliability [15] (the agreement level between judges rating candidates) and validity [127]
(the amount of relationship between interview ratings and performance) of job interviews, as well as the correlation between job performance and high-level social variables like dominance [35], interest [23], emergent leadership [33], and personality traits [43]. Particular attention has been paid to the impact of the candidate's non-verbal behavior on the interview outcome. Imada and Hakel [17] affirmed that candidates who used more non-verbal behaviors (facial expressions, body orientation toward the interviewer, eye contact) were perceived as more competent, motivated, and hirable than candidates who did not. Forbes and Jackson [128] showed that candidates who were recruited nodded more and made more eye contact and hand gestures during interview sessions. Similarly, Anderson and Shackleton [106] reported that the most selected applicants produced more facial expressions and gestures during the job interview than non-accepted candidates. Parsons and Liden [30] showed that speech characteristics (voice tone, intensity, pauses, speaking rate, etc.) explained a remarkable amount of variance in the hiring decision. One explanation for the positive connection between candidate non-verbal behavior and recruiting decision can be based on the immediacy hypothesis, which establishes that the candidate reveals through his/her immediacy
Figure 1.3: From behavior to emotion classification
behavior (eye contact, smiling, hand gestures, etc.) a greater perceptual availability, which leads to a constructive outcome on the interviewer and therefore to a positive evaluation.
In this context, these are some applications that could be developed:
• Virtual agents for social coaching
• Job interview simulator
• Hirability impressions in video resumes
• Online selecting service based on pre-recorded questions
1.4 Thesis objective
Our work combines two research topics that have attracted great interest in recent decades: social psychology and affective computing. This combination is realized in order to replace the manual coding performed by an observer with a psychology recognition system based on automatic analysis of non-verbal behaviors, by exploiting the strong points of deep learning and transfer learning techniques, as shown in Figure 1.3. According to reviews done in recent years, 95%
of researchers focused on facial expressions for emotion analysis [42, 22] and neglected body language and paralanguage (speech), which could be real indicators of the emotional state and significant sources of information when other channels like facial expressions are hidden or when there is only a subtle nuance of expression.
1.5 Open Issues
Emotional body gesture recognition is still an open research problem requiring investigation for many reasons. There exists a subtle nuance of expressions that leads to confusion in the interpretation of gestures; in addition, the fuzzy nature of emotional states and their instability over time entail difficulties, so that many emotions are still hard to detect. Moreover, designing a robust method for detection and tracking is challenging, because spontaneous body gestures, characterized by their freedom, are very different from the specific, predefined gestures performed in front of a camera for remote-control tasks. The quality of the training data also influences the final results: adding more acquisition conditions degrades the performance and robustness of an approach and makes such applications more limited. For emotional body gesture recognition from a video sequence, we should focus on all these steps: human detection, pose estimation (detection and tracking of body parts or skeleton), feature extraction, and classification.
To attain our objective, these are some questions that should be asked:
• Structured environment: Are there restrictions on background, lighting, speed of movement of entire body or body parts?
• User requirements: Must the user wear anything special (markers, gloves, glasses, etc)?
• Features extracted: Which low-level (hand-crafted) features are computed (edge, region, silhouette, etc.), or is automatic generation of features possible?
• Representation of time: How is the temporal aspect of gesture repre- sented and used in recognition?
• Body gesture models: Which body model could be useful in our task?
Speech emotion recognition (SER) is also a complex task for several reasons. The first issue for all methods proposed in the literature is the selection of the best features, i.e., those powerful enough to distinguish between the various emotions [20].
Also, the existence of different languages, speaking styles, and accents represents a difficulty, because they directly modify the extracted features such as pitch and energy. Moreover, we can have more than one emotion in the same speech signal:
each part represents a specific emotion, so that defining the boundaries between these parts is a very challenging task. For SER, different steps should be undertaken, namely pre-processing of the speech signal, feature extraction, and classification.
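To make these steps concrete, the following is a minimal sketch of the pre-processing and feature-extraction stages of an SER pipeline. It assumes the librosa library (not prescribed by this thesis); the file path, the number of coefficients, and the choice of MFCCs with first- and second-order derivatives are illustrative and simply anticipate the kind of cepstral representation discussed in Chapter 5.

```python
# Minimal sketch of SER pre-processing and feature extraction (assumption: librosa).
import numpy as np
import librosa

def extract_cepstral_features(wav_path, n_mfcc=40):
    """Load a speech signal and compute MFCCs with their temporal derivatives."""
    y, sr = librosa.load(wav_path, sr=None)                  # load at native sampling rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # static cepstral features
    delta = librosa.feature.delta(mfcc)                      # first-order (dynamic) features
    delta2 = librosa.feature.delta(mfcc, order=2)            # second-order (acceleration) features
    # concatenate static, dynamic, and acceleration features frame by frame
    return np.concatenate([mfcc, delta, delta2], axis=0)

# features = extract_cepstral_features("sample_utterance.wav")
# features.shape -> (3 * n_mfcc, number_of_frames); a classifier is then trained on top
```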
1.6 Thesis plan
This document presents the work carried out to develop a psychology recognition system based on automatic analysis of non-verbal behaviors. In this chapter, a general overview of the thesis is presented: the motivations and open issues of this work are discussed, and the thesis structure, contributions, and list of related publications are given. The second chapter presents a literature review of previous studies dealing with emotion recognition systems. The third chapter explains our approach for emotional body gesture recognition using the part-based model. In the fourth chapter, we explain our proposed deep multi-stage approach using the kinematic-based model. The fifth chapter is devoted to presenting our proposed approach for speech emotion recognition. Finally, we conclude with a general conclusion and possible perspectives.
1.7 Contributions and publications
In this thesis, we propose novel techniques for psychology recognition based on automatic analysis of non-verbal behaviors, specifically kinesics (body gestures) and vocalics (speech), by exploiting the strong points of deep learning and transfer learning techniques.
The contributions of our work can be summarized as follows:
1. Emotional body gesture recognition is a challenging task because of the complexity of gestures, which are rich in variety owing to their high degree of freedom.
In general, there is no exact definition of the output spaces, which are mostly based on geometrical representations that can be shallow. In this thesis we try to find the best way to obtain a good representation of the motion information of body gestures that can be useful for the task of emotion classification. Two models are proposed for emotional body gesture recognition:
a. Part-based model considers the body as a set of components (hands, head, shoulders, torso, arms) which could be detected separately.
• Hands have been more widely used than other body parts for gesturing to express feelings and convey thoughts. Hand gesture recognition faces many challenges. In this thesis we propose a hybrid approach for a good representation of the hand's local movement in a global temporal template. Our multiple hand detection and tracking method could also be useful for Human Computer Interaction applications based on hand gestures.
• Taking into consideration the coordination between body parts (face and hands), we propose a deep spatio-temporal approach that merges
the temporal normalization method with a deep learning method. We demonstrate that deep learning techniques outperform traditional machine learning techniques.
b. The kinematic-based model is used to locate precisely the positions of the joints in the body, which leads to good detection and tracking of the human skeleton. Human pose estimation is generally used as a separate task in the context of activity and action recognition. In this thesis, we combine pose estimation and emotion classification to propose a new deep multi-stage architecture able to deal with both tasks by exploiting the strong points of pre-trained models. We demonstrate that transfer learning techniques outperform traditional machine learning techniques.
2. Speech is a significant source of information for emotion recognition espe- cially when other channels like face or body are hidden. The shape of the vocal tract, tone of the voice, pitch, and other characteristics are influenced by human emotions. In this thesis, we propose a deep temporal-cepstrum representation of features that is effective in encoding those characteristics of speech. The results obtained prove the effectiveness of our method over existing methods in the state of the art.
Publications. The research efforts presented in this dissertation are summarized in the following papers:
• Khalifa, I., Ejbali, R., and Zaied, M. (2018). Hand motion modeling for psychology analysis in job interview using Optical Flow-History Motion Image (OF-HMI). The 10th International Conference on Machine Vision, Vienna, Austria.
• Khalifa, I., Ejbali, R., and Zaied, M. (2019). Body gesture modeling for psychology analysis in job interview based on deep Spatio-temporal approach.
Parallel and Distributed Computing, Applications and Technologies. Communications in Computer and Information Science, vol. 931, pp. 274-284, Springer, Singapore.
• Khalifa, I., Ejbali, R., Schettini, R., and Zaied, M. (2020). Deep Multi-stage approach for emotional body gesture recognition in job interview. The Computer Journal (accepted for publication).
• Khalifa, I., Ejbali, R., Napoletano, P., Schettini, R., and Zaied, M. Deep Temporal-Cepstrum Representation for Speech Emotion Recognition. (future submission to journal).
Emotion recognition system:
Background and related work
2.1 Introduction
Various works have been carried out to develop systems for emotion recognition based on automatic analysis of non-verbal behaviors. Until recently, body gestures and speech have been largely ignored by the community as a source of emotional information in comparison to facial expressions [42]. In this chapter, we present some notions about emotions, such as their types and their different models. Based on these, we present the state of the art for automatic emotion recognition using body gestures and speech.
2.2 Notions about emotions
The term emotion is a combination of energy and motion. It is a response of the organism to a particular stimulus (person, situation, or event). It is typically an intense, short-term experience and the person is usually well aware of it. Emotion episodes include various components like action preparation, appraisal of events, expressive behaviors, subjective feelings, and physiological responses [79].
2.2.1 Emotion types
Three types of emotions exist which are: primary or basic emotions, secondary emotions, and social emotions.
a. Basic emotions: They are activated by particular events or they manifest in precise circumstances by provoking specific behaviors. They are the basis of our reactions, which are not only determined by our rational judgment or
our individual past, but also by our ancestral past. There are six primary emotions: joy, sadness, anger, fear, disgust, and surprise. In fact, these basic emotions are like a primary material, from which all other emotions can be made [37].
b. Secondary emotions: They increase the intensity of reactions over time [37]. Some of them are triggered by thinking about what might have happened or not, unlike basic emotions, which are triggered by actual occurrences (direct reactions to external events). When we feel angry, we may feel ashamed afterward, or when we feel happy, we may feel proud. Several secondary emotions exist, like pride, trust, confidence, relaxation, disappointment, boredom, uncertainty, anxiety, confusion, etc.
c. Social emotions: These emotions are inherent in the relationship with others, like guilt, jealousy, timidity, humiliation, etc. All these emotions are learned and are built up from the primary emotions [37].
2.2.2 Emotion models
The manipulation of emotions with a computer raises many issues. First, at the level of their representation, it is a question of finding a formalism that agrees with the existing psychological results while allowing a simple manipulation.
Then, for a given event, it is necessary to be able to determine the emotional potential associated with it. Based on work in social psychology [5], some measures consider emotional states as categories, others as a multidimensional construct.
2.2.2.1 Categorical model
It is the most popular model in the literature. The universal character of emotions leads to the definition of basic emotions that can be observed in all individuals regardless of their ethnicity or culture. Based on the research of Ekman [112, 113], this model divides emotions into a set of distinct classes that can be described easily. The affective denominations that do not find their place in these classifications are therefore considered as mixtures of primary emotions. The rationale behind the use of this model is that these basic emotions are clearly identifiable in the majority of individuals, especially through non-verbal communication. However, the number of emotions that should be considered is still an open question, because this number differs from one researcher to another, as shown in Table 2.1.
Author Emotion classes
Ekman [114] anger, disgust, fear, joy, sadness, surprise
Tomkins [148] anger, interest, contempt, disgust, distress, fear, joy, shame, surprise
Izard [29] anger, contempt, disgust, distress, fear, guilt, interest, joy, shame, surprise
Plutchik [124] acceptance, anger, anticipation, disgust, fear, joy, sadness, surprise
Table 2.1: Classification based on categorical model
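Under the categorical model, emotion recognition becomes a standard multi-class classification problem over a fixed label set. The minimal sketch below encodes Ekman's six classes from Table 2.1 with a one-hot vector; the encoding is a common convention shown only for illustration, not something prescribed by the cited works.

```python
# Categorical model as a multi-class label set (Ekman's six basic emotions, Table 2.1).
EKMAN_CLASSES = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]

def one_hot(label):
    """Encode an emotion label as a one-hot vector over the categorical classes."""
    vec = [0] * len(EKMAN_CLASSES)
    vec[EKMAN_CLASSES.index(label)] = 1
    return vec

print(one_hot("fear"))  # [0, 0, 1, 0, 0, 0]
```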
Figure 2.1: Dimensional representation of emotions: Russel’s model [36]
2.2.2.2 Dimensional model
It is a popular and theoretical approach in the psychology of human emotions [68, 92], which proposes a continuous representation along several axes or dimensions.
The dimensions include valence (positive or negative character of the emotional experience), arousal or activation (how a person acts under the emotional state), and control (how much control one has over the emotion). Many automatic emotion recognition systems are based on the dimensional representation, like Russell's model [69], due to its simplicity (it divides the space into a limited set of categories like positive vs negative), as shown in Figure 2.1.
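As a toy illustration of the dimensional representation, the sketch below maps a (valence, arousal) pair to a coarse quadrant of Russell's circumplex; the thresholds and example emotion names are illustrative assumptions, not values taken from [69].

```python
# Dimensional model: a point in valence-arousal space mapped to a circumplex quadrant.
def circumplex_quadrant(valence, arousal):
    """Map a (valence, arousal) pair in [-1, 1]^2 to a coarse quadrant label."""
    if valence >= 0 and arousal >= 0:
        return "high-arousal positive (e.g., excited, happy)"
    if valence < 0 and arousal >= 0:
        return "high-arousal negative (e.g., angry, afraid)"
    if valence < 0 and arousal < 0:
        return "low-arousal negative (e.g., sad, bored)"
    return "low-arousal positive (e.g., calm, relaxed)"

print(circumplex_quadrant(0.7, 0.6))    # high-arousal positive (e.g., excited, happy)
print(circumplex_quadrant(-0.4, -0.5))  # low-arousal negative (e.g., sad, bored)
```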
Figure 2.2: Componential representation of emotions: Plutchik’s model [42]
2.2.2.3 Componential model
It is in between the dimensional and categorical models; it arranges the emotions in a hierarchical way where the superior layers are composed of emotions from previous layers. According to Plutchik's model [125], as shown in Figure 2.2, complex emotions are combinations of pairs of primary emotions, for example:
Disapproval = Surprise + Sadness and Contempt = Disgust + Anger.
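The compositional rule of the componential model can be stated very compactly; the sketch below encodes only the two dyads mentioned above and is purely illustrative.

```python
# Componential model: complex emotions as combinations of primary-emotion pairs.
PRIMARY_DYADS = {
    frozenset({"surprise", "sadness"}): "disapproval",
    frozenset({"disgust", "anger"}): "contempt",
}

def combine(emotion_a, emotion_b):
    """Return the complex emotion obtained from two primary emotions, if known."""
    return PRIMARY_DYADS.get(frozenset({emotion_a.lower(), emotion_b.lower()}), "unknown")

print(combine("Surprise", "Sadness"))  # disapproval
print(combine("Disgust", "Anger"))     # contempt
```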
2.3 Emotional body gesture recognition: State of the art
2.3.1 Body gesture cues
Gestures are a crucial component of interpersonal communication, as they are used to decode the vocal content and aid observer comprehension by activating images in the listener's mind and reinforcing attention. Gestures can be identified either from an interpersonal interaction (non-verbal, semiotic communication) or from a physiological signal (reflex result or voluntary muscle contractions). We can classify gestures according to the body parts involved [131]: gestures involving the whole body, head posture and facial expressions, and gestures of the hands, which form the main category of interactive gestures. The
Figure 2.3: Gesture categories [95]
research in this area is linked to the recognition of hand positions [60, 152, 64], the development of human-machine interactions, and the interpretation of sign language. As shown in Figure 2.3, several researchers proposed various gesture classifications; Kipp [95] used a set of six classes of gestures detailed in the work of Ekman and Friesen [39] and McNeill [32], which are: emblem, deictic, iconic, metaphoric, beat, and adaptor, the last of which represents non-communicative and spontaneous gestures. An adaptor is usually unintentional and includes self-touch behaviors such as touching an object or one's own body (scratching the head, touching the nose, etc.). It can inform about the state of the speaker and generally reflects negative emotions like anxiety and uncertainty. The body has been used for gesturing to convey thoughts and express feelings. Based on previous works by researchers like Ekman and Friesen [39] and Gelder [21], Witkower and Tracy [169] wrote a review about body expressions of emotions like pride, shame, disgust, and embarrassment, and presented the existing coding systems for emotional analysis. In another work, Noroozi et al. wrote a survey on body gestures and their interpretations [42]. Different emotion classifications are presented: Baltrusaitis et al. [151] proposed the Geneva Multi-modal Emotion Portrayals-Facial Expression Recognition and Analysis (GEMEP-FERA) database, which consists of 10 actors displaying 5 emotions including anger, fear, relief, sadness, and joy.
Baveye et al. [160] created a video database for affective content analysis (LIRIS-ACCEDE), which contains upper bodies of 64 actors, extracted from different kinds of movies like action, drama, and romance, displaying 4 emotions (sadness, anger, disgust, and fear). Following the feedback obtained from the works of Ekman and Friesen [39], Burgoon [74], and Coulson [90], Gunes and Piccardi [56] identified the correlation between body gestures and emotional state categories, as presented in Table 2.2. They were not limited to the six basic emotions,
Emotion Body gesture
Happiness Arms opened, hand clapping, hands made into fists and kept high.
Sadness Covering face with hands, trunk leaning forward, dropped shoulders, body extended and hands over the head.
Anger Hands on waist, lift right and left hand up, finger point with right or left hand.
Disgust Hands covering the neck, backing, hand on the mouth.
Surprise Two hands covering the cheeks or the mouth.
Fear Crossing arms, covering the body parts, arms around the body/shoulders.
Boredom Hands below the chin, elbow on the table.
Anxiety Tapping tips of fingers on table, hands pressed together in a moving sequence, biting the nails.
Uncertainty Palms up and shoulder shrug, right/left hand touching the chin, forehead, nose, ear or the neck, scratching hair or head.
Table 2.2: Body gesture cues and their interpretations
which are happiness, sadness, disgust, surprise, fear, and anger; Gunes and Piccardi added secondary emotions like anxiety, uncertainty, and puzzlement, which can be real indicators of the emotional state and affect the decision.
2.3.2 Human body models
Body gesture recognition is a challenging task due to the complexity of gestures, which are rich in variety owing to their high degree of freedom. Numerous models have been proposed for the representation of the human body [12] using part-based multiple patches (part-based model), skeleton (kinematic-based model), centroid, multiple points, rectangular or elliptical patch, contour, and silhouette, as shown in Figure 2.4. The part-based model and the kinematic-based model, as presented in Figure 2.5, are the most robust ways of modeling the human body in automatic processing [42], especially when we need to identify body parts that could be useful in the task of emotion recognition.
2.3.2.1 Part-based model
Using this model, the body is considered as a set of components (head, shoulders, hands, arms, torso) which can be detected separately with their specific
Figure 2.4: Object representations [12]:
(a) centroid, (b) multiple points, (c) rectangular patch, (d) elliptical patch, (e) part-based multiple patches, (f) skeleton, (g) contour, (h) control points, (i)
silhouette
Figure 2.5: Part-based model vs kinematic-based model [42]
detectors. According to studies done by social psychologists like Mehrabian [69, 8], hand gestures give clues about the emotional state [12, 61] of the speaker, and they can even help a person become a better communicator. We might also consider the relationships between body parts [65], like face and hand, that could be effective for identifying emotions.
For example, Marcos-Ramiro et al. [7] suggested a technique to construct a Hands Likelihood Map based on Optical Flow in order to detect and track the upper body parts with the most energy, which are the two hands. The motion of the hands is then classified into four classes (hand on table, hidden hand, gesturing, and self-touch).
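To give an intuition of such motion-based likelihood maps, the sketch below computes a dense optical-flow magnitude map between two consecutive grayscale frames with OpenCV's Farneback method and normalizes it as a rough "motion energy" map; it is not the implementation of [7], and the parameter values are illustrative.

```python
# Minimal sketch: optical-flow magnitude as a proxy for high-energy (moving hand) regions.
import cv2
import numpy as np

def motion_likelihood_map(prev_gray, curr_gray):
    """Return a per-pixel map in [0, 1] where large values indicate strong motion."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)      # motion energy per pixel
    if magnitude.max() > 0:
        magnitude = magnitude / magnitude.max()   # normalize to [0, 1]
    return magnitude

# Usage: feed two consecutive grayscale frames of the video sequence;
# thresholding the map gives candidate hand regions to detect and track.
```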
N°   COCO output format     MPII output format     Pose Evaluator output format
0    Nose                   Head                   Head
1    Neck                   Neck                   Torso
2    Right Shoulder         Right Shoulder         Right Upper Arm
3    Right Elbow            Right Elbow            Right Lower Arm
4    Right Wrist            Right Wrist            Left Upper Arm
5    Left Shoulder          Left Shoulder          Left Lower Arm
6    Left Elbow             Left Elbow
7    Left Wrist             Left Wrist
8    Right Hip              Right Hip
9    Right Knee             Right Knee
10   Right Ankle            Right Ankle
11   Left Hip               Left Hip
12   Left Knee              Left Knee
13   Left Ankle             Left Ankle
14   Right Eye              Chest
15   Left Eye
16   Right Ear
17   Left Ear

Table 2.3: Keypoints output format for pose estimation

2.3.2.2 Kinematic-based model
To estimate the body pose, the kinematic-based model has been adopted by some researchers for the tasks of activity and action recognition [40, 9], and the results prove its efficiency. This model is a collection of interconnected joints, as shown in Table 2.3. Based on the kinematic model, some challenging datasets have been proposed in the last few years to make the task of pose estimation less difficult.
The most popular ones are:
• Common Object in Context (COCO) dataset with 18 keypoints [10].
• MPII Human Pose dataset with 15 keypoints [103].
• Human Pose Evaluator dataset with 6 parts [107].
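As a small illustration of the kinematic-based model, the sketch below stores the 18 keypoints of the COCO output format (Table 2.3) as a simple data structure; the helper function and its input format are illustrative assumptions, not part of the datasets themselves.

```python
# The 18 COCO keypoints (ordering as in Table 2.3) and a helper to name raw detections.
COCO_KEYPOINTS = [
    "Nose", "Neck",
    "Right Shoulder", "Right Elbow", "Right Wrist",
    "Left Shoulder", "Left Elbow", "Left Wrist",
    "Right Hip", "Right Knee", "Right Ankle",
    "Left Hip", "Left Knee", "Left Ankle",
    "Right Eye", "Left Eye", "Right Ear", "Left Ear",
]

def to_named_pose(points):
    """points: list of 18 (x, y) tuples in COCO order -> dict keyed by joint name."""
    assert len(points) == len(COCO_KEYPOINTS)
    return dict(zip(COCO_KEYPOINTS, points))

# Example: a dummy skeleton with all joints at the origin
pose = to_named_pose([(0.0, 0.0)] * 18)
print(pose["Right Wrist"])  # (0.0, 0.0)
```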
2.3.3 Existing techniques for gesture recognition
Gesture recognition has gradually become tied to the development of systems able to identify human gestures and decode them to enrich the user experience.
It is a complex task involving different aspects such as motion analysis, motion modeling, machine learning, and even psycholinguistic studies.
Gesture recognition has been utilized in multiple domains like video surveillance, human-computer interaction (HCI), robotics, decision systems, etc. Several tools are used in gesture recognition systems, such as image and video processing, pattern recognition, statistical modeling, computer vision, etc. For decades, this topic has been studied by several researchers, each intervening in a specific phase: human detection, pose estimation (detection and tracking), feature extraction, classification, or regression. In general, the support in which the gesture information is stored is a video. It is then indispensable to apply image and video processing techniques and to be able to exploit them efficiently.
Several techniques exist in literature [138, 129]:
- For detection: Skin color, contour, background subtraction, Adaboost method (a minimal skin-color example is sketched after this list).
- For tracking: Point, kernel, and silhouette tracking.
- For recognition: There exist two categories, static and dynamic:
• For static recognition: Linear and non-linear classifier.
• For dynamic recognition: Hidden Markov Model (HMM), Dynamic Time Warping, Time Delay Neural Network, Finite State Machine, etc.
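As an example of the detection techniques listed above, the sketch below performs a simple skin-color segmentation in HSV space with OpenCV; the threshold values are common illustrative choices, not values taken from the cited works.

```python
# Minimal sketch: skin-color-based detection of candidate hand/face regions.
import cv2
import numpy as np

def skin_mask(bgr_image):
    """Return a binary mask of candidate skin pixels in a BGR image."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)     # illustrative lower HSV bound
    upper = np.array([25, 255, 255], dtype=np.uint8)  # illustrative upper HSV bound
    mask = cv2.inRange(hsv, lower, upper)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # remove small noisy blobs
    return mask

# Usage: contours of the mask (cv2.findContours) give candidate hand/face regions
# that can then be tracked with one of the tracking techniques listed above.
```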
2.3.3.1 Hand gesture modeling
In the last decades, a wide range of applications based on body gesture recognition has emerged. For Human Computer Interaction (HCI) applications, hands have been more widely used than other body parts. Hand detection and tracking in a video sequence faces many challenges: a complex background when there are other objects in the scene together with the hand, complex shapes and motions, variation of hand positions across frames leading to an erroneous representation of features, hand poses with different sizes in the gesture frame, overlap with other regions in the image, etc. Many researchers have strived to improve hand gesture recognition techniques. These techniques can be divided into the wearable glove-based sensor approach and the camera vision-based approach [96, 142].
Hand gestures using the wearable glove-based sensor approach. For this technique, several sensors have been used, such as angular displacement sensors, flex sensors, accelerometer sensors, curvature sensors, etc. They can provide the exact coordinates of palm and finger locations, orientation, and configurations [115].
Several glove systems were proposed by research laboratories over the past 3 decades [83] like Human Glove, Cyber Glove, 5DT Data Glove, and Pinch Glove as shown in Figure 2.6, and only a few of them became commercially available.
For example, Bedregal et al. [25] proposed a method for hand gesture recognition based on fuzzy logic and applied it to the recognition of Brazilian sign language gestures. The suggested method used a glove with