
Department of Computer Science

Master’s Degree in Computer Science

Academic Year 2017/2018

Exploiting Real-Time Facial-Expression Capture as a Means to Improve Collaboration in Shared Virtual Environments

Supervisors: Eng. Franco Tecchia, Prof. Marcello Carrozzino

Candidate: Dr. Giulio Auriemma Mazzoccola

Reviewer


To my family, my senseis and my beautiful doggo.


"Che venga la realtà virtuale, non ne ho nessun timore." ("Let virtual reality come, I have no fear of it.") - Caparezza, L'Infinito


Shared Virtual Environments are a special class of Virtual Reality applications where multiple users participate, often from geographically remote locations, in the same virtual experience. In this context, one of the hardest technical and procedural challenges is user representation: it is extremely complex to present an expressive virtual representation of each user (named Avatar). Avatars are often simple virtual mannequins with no facial expression at all, and while this might be tolerable in some types of applications (gaming being an example), when strong and effective collaboration is needed it can reduce the sense of presence.

This thesis therefore investigates the topic of interpersonal communication in the context of real-time, fully-immersive Shared Virtual Environments, advocating for the introduction of an innovative and effective means to convey verbal and non-verbal communication between users. We considered a learning system in which the apprentice has no feedback but the master's voice, and we added different kinds of video feedback of the master to check whether the experience improved. To achieve this video feedback we integrated into a VR system a camera placed right in front of each user's mouth, and investigated how this additional image sensor can be used to improve expressiveness in VR. All the development related to this work has been conducted so as to be compatible with the two most used engines for VR applications: Unreal Engine by Epic Games and Unity Engine by Unity Technologies.

All the code produced is distributed as open source and can be found on my GitHub page.

All the images have been made by myself or have been taken from a cited paper, from the related article on Wikipedia, from a page linked in the image description or from the official page of the application the image shows.


Contents

1 Introduction
  1.1 Thesis Outline
  1.2 Virtual Reality
  1.3 Visualization Systems
    1.3.1 The CAVE system
    1.3.2 Head Mounted Displays
  1.4 Applications of Virtual Reality
    1.4.1 Military
    1.4.2 Healthcare
    1.4.3 Virtual Heritage
2 Background And State of the Art
  2.1 Virtual Reality Frameworks
    2.1.1 Academic VR Frameworks
    2.1.2 Commercial VR Frameworks
  2.2 Game Engine
    2.2.1 The XVR Framework
    2.2.2 Unreal Engine
    2.2.3 Unity Engine
  2.3 Mouth Tracking
    2.3.1 Active Contour Model
    2.3.2 Image Derivative
    2.3.3 Applications
      Animoji
      FaceRig
  2.4 Blendshape
3 Overall System Design
  3.1 VOIP Chat
    3.1.1 Audio in Unreal
    3.1.2 Audio in Unity
  3.2 Camera system
  3.3 Mouth projection
    3.3.1 Communication with Raspberry
    3.3.2 Communication with the engine
    3.3.3 Video processing
    3.3.4 Lip Sync
  3.4 Virtual Environment
    3.4.1 Menu
    3.4.2 Learning Room
4 Project Implementation
  4.1 VOIP Chat
    4.1.1 Unreal Implementation
    4.1.2 Unity Implementation
  4.2 Tracking library
    4.2.1 Video Acquisition
    4.2.2 Video Communication
    4.2.3 Communication with Engine
    4.2.5 Optimizations
  4.3 Engine Application
    4.3.1 Calibration and Training
    4.3.2 Video Mouth
    4.3.3 Feature Points Mouth
    4.3.4 Lip Sync Mouth
5 Conclusions
  5.1 Future Improvements
    5.1.1 Low level audio implementation
    5.1.2 Machine learning techniques
    5.1.3 Facial expression
Glossary
List of Code Snippets
List of Figures
List of Tables


1 Introduction

The idea behind this thesis is inspired by the previous work done in "Design and development of a fully immersive training simulator for complex industrial maintenance operations" [1]. Its main goal is to improve that learning environment by adding remote audio feedback and a visual feedback of the interlocutor's mouth. While the work realized by Francesco Desogus is an offline standalone application, the natural next step was to add online capabilities and to increase the realism of the simulation through some basic facial expressions on the users' avatars. After some considerations, the idea slightly mutated into something more generic that could be applied to the vast majority of training applications: realizing a plugin for remote audio and video feedback. This plugin should work with the two most used engines for the realization of Virtual Reality applications, i.e. Unreal Engine by Epic Games [57] and Unity Engine by Unity Technologies [58], and should include a spatialized VOIP chat system and a way to receive the video of the other user's mouth.

The objective of realizing a single plugin that handles all of this in both engines is too ambitious a project for a thesis, for several reasons:

• The engines have completely different architectures, and while Unreal's could be investigated thanks to the fact that it is open source, Unity's is hidden behind the layer of abstraction exposed by the engine.

• It would have implied very low-level solutions that could interfere with the underlying mechanisms of the engines, leading to possible subtle bugs that are very hard to find.

• Both engines are very well thought-out and well optimized; this implies that one would have to be a very skilled programmer to obtain better performance with raw code than what can be reached by using the engines' APIs in the right way.


Those are the reasons why the code produced for the audio has been divided into two separate plugins, one optimized for Unreal and one for Unity, each exploiting native engine calls to obtain better performance in the easiest manner. While the motivations above are still valid when talking about how the engines handle graphics, here some deeper considerations are needed. In fact there is a part of the work that is not engine-dependent and can be done in a platform-agnostic way with no restrictions: the acquisition of the video from the camera and its processing¹, which has been realized through a generic library not tied to any engine.

¹ There is also the transmission of this video between the users, but to keep things simple

Things may seem to have become a mess: at the beginning there was a single plugin, now there are three components, and the parts that deal with the video generated by the library are still missing. In the end, however, everything converges into just two separate plugins (one per engine). Details aside, the two plugins handle the same things, each for a different engine: both of them realize an online spatialized VOIP chat, call the library for video processing and use engine-specific primitives to generate the desired result. This way, integrating all those functions into a project becomes a very easy task.
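To make the boundary concrete, the sketch below shows what the interface between the engine-agnostic video library and the two plugins could look like. Every identifier in it is hypothetical and invented purely for illustration; the functions actually exposed by the library are described in chapters 3 and 4.

```cpp
// Hypothetical sketch of a plain C-style boundary between the engine-agnostic
// tracking library and the two engine plugins. None of these names come from
// the thesis code: they only illustrate the idea that both the Unreal plugin
// (native C++) and the Unity package (e.g. via P/Invoke) can call the same
// shared library.
#pragma once
#include <cstdint>

extern "C" {

// Opaque handle to a running capture/processing session.
struct MouthTrackerHandle;

// Starts acquiring frames from the mouth camera identified by `deviceId`.
MouthTrackerHandle* MouthTracker_Start(int deviceId);

// Copies the latest processed frame (RGB24) into a caller-owned buffer and
// writes the 2D lip feature points into `points` as x0, y0, x1, y1, ...
// Returns the number of feature points actually written.
int MouthTracker_Poll(MouthTrackerHandle* handle,
                      std::uint8_t* frameBuffer, int bufferSize,
                      float* points, int maxPoints);

// Stops the capture and releases every resource held by the session.
void MouthTracker_Stop(MouthTrackerHandle* handle);

}  // extern "C"
```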

Once realized, the plugin has been tested on a group of users to measure how much this kind of feedback can increase the realism of the simulation and the performance of a user. During those tests three different kinds of mouth feedback have been compared: one that uses a real video, one that exploits a blendshape to simulate facial expressions, and one that uses lip-sync technology.

1.1 Thesis Outline

This document has been written to give a complete understanding of how all the functions have been realized, the way they work and the main reasons behind the adopted choices. Every chapter is responsible for a specific area, and their order is designed so that the reader always has all the necessary knowledge to deal with the chapter being read. Here is a preview of them:

1. Introduction: The current chapter introduces the main objectives of the thesis and gives a very brief overview of what Virtual Reality is.

2. Background And State of the Art: This chapter exposes all the necessary knowledge about the engines that will be used, in order to have a full understanding of the following chapters. Furthermore it introduces the State of the Art of the problems faced, i.e. a precise description of those problems and a little excursus on how they have been handled by researchers and other programmers.

3. Overall System Design: Here there is a first, high-level description of the adopted solutions and of the way they have been realized. In section 3.1 the audio problem in Unreal and Unity and the way the plugins handle it are introduced. Then in sections 3.2 and 3.3 descriptions of both the hardware and the software architectures are presented, followed by the design of the platform-agnostic library. The chapter ends with a presentation of the application in which the tests have been conducted.

4. Project Implementation: All the low-level details are discussed here. This is the chapter that introduces the programming issues and then presents the way those problems have been solved, showing the snippets of code that handle them. Section 4.2 is the centre of the whole thesis, since it describes the code that realizes the main part of the objective.

5. Conclusions: This is the last chapter, where all the final considerations about the previously described work are made and the future improvements are presented more in depth. Although none of these improvements has been integrated into the system yet, all of them have already been studied and a draft of the way they could be implemented is presented.


1.2 Virtual Reality

Virtual Reality [2] (VR) has seen incredible growth in the latest years, bringing a whole new way of interacting with and experiencing 3D worlds, or even the real world itself enhanced with virtual elements in the case of Augmented and Mixed Reality. With the advent of technologies such as the Oculus Rift [86] and the HTC Vive [87], which can bring realistic VR experiences at a reasonable price, and the slightly different approach brought by Microsoft HoloLens [88] and Google Daydream [89], the interest in VR is constantly increasing. Virtual Reality is an immersive medium which gives the feeling of being entirely transported into a virtual three-dimensional world, and can provide a far more visceral experience than screen-based media. From the user's point of view, the main properties of a Virtual Reality experience are presence, immersion and interaction:

• Presence: it is the mental feeling of being in a virtual space. It is the first level of magic for great VR experiences: the unmistakable feeling that you have been teleported somewhere new. Comfortable, sustained presence requires a combination of the proper VR hardware, the right content and an appropriate system. Presence feeling is strictly related to the user involvement.

• Immersion: it is the physical feeling of being in a virtual space, at a sensory level, by means of interfaces. It is related to the perception of the virtual world as actually existing. The perception is created by surrounding the user with images, sounds and other stimuli that provide an engrossing total environment.

• Interaction: it is the user's capacity to modify the environment and to receive feedback from it in response to his actions. It is related to the realism of the simulation. The interaction can be direct or mediated: in the first case the user interacts directly with the VE (CAVE-like systems); in the second case the user interacts with the VE by means of an avatar, either in first person (HMD systems) or in third person (monitor).


1.3 Visualization Systems

An immersive VR system provides real-time viewer-centered head-tracked perspective with a large angle of view, interactive control, and binocular stereo display. One consideration is the selection of a technological platform to use for the presentation of virtual environments.

This section provides an overview of the two most common visualization systems used in immersive virtual reality applications: CAVE systems and systems based on Head Mounted Displays. As stated in [4], HMDs offer many advantages in terms of cost and portability with respect to CAVE systems. On the other hand, CAVEs provide a broader field of view, higher resolution and a more natural experience for the user.

1.3.1 The CAVE system

A Cave Automatic Virtual Environment (CAVE) [23] is a cubical, room-sized fully immersive visualization system. The acronym is also a reference to the allegory of the Cave in Plato’s Republic in which a philosopher contemplates perception, reality and illusion. The walls of the room are made up of rear-projection screens on which high-resolution projectors display images. The user goes inside of the system wearing shutter/polarizer glasses to allow for stereoscopic viewing.

CAVE systems provide a complete sense of presence in the virtual environment and can be used by multiple users at the same time. Apart from glasses, users inside a CAVE do not need to wear heavy headgear (as in HMD systems).

The SSSA X-CAVE [23] is an immersive visualization system developed by the PERCRO Laboratory of Scuola Superiore Sant'Anna [56]. It is a 4-meter, 4-wall room in which each wall is divided into several tiles in order to ensure a good resolution. The front, right and left screens are subdivided into four back-projected tiles each, for a global resolution of about 2500x1500 pixels per screen, whilst the floor is subdivided into six front-projected tiles for a global resolution of about 2500x2250 pixels. The X-CAVE is managed by means of the XVR technology, exploiting its distributed rendering features on a cluster of five workstations. The system is also provided with 7 cameras for optical position and orientation tracking.

Figure 1.1: CAVE systems.

1.3.2 Head Mounted Displays

A Head Mounted Display (HMD) is a device worn by a user on his head which reflects the way human vision works by exploiting binocular and stereoscopic vision. By using one or two screens with high resolution and high Dots Per Inch (DPI) and showing a different point of view per eye, the headset gives the illusion of depth to the user. Furthermore, the headset’s gyroscope and accelerometer keep track of the head movement, moving and tilting the virtual camera of the world accordingly to achieve a natural realistic feeling of the head movement.

Since the Oculus Rift presentation, many companies have developed their own VR headsets with more or less success, each one proposing innovative solutions to the other big issue for full immersion: interaction. As the user is directly transported into a virtual world, for full immersion he should be able to interact with it as if it were the real world. While this still represents a big challenge in the field, some companies such as HTC and Oculus have proposed effective solutions by tracking the user's movement in a small room or delimited space.

Figure 1.2: Representation of how a VR headset works.

The Motion Controllers are meant to substitute the hands of the user, allowing interaction with the virtual world. Obviously, this still represents a big limitation: there is still no touch feedback, and the user's hand and finger movements are still a challenge to reproduce in the virtual world. Other solutions, such as the ones developed at PERCRO, aim to provide a more realistic feeling of the virtual environment by using haptic devices and body tracking to bring the user's body into the virtual world.

Figure 1.3: (Left) The Oculus Rift and its Motion Controllers. (Right) The HTC Vive and its Motion Controllers.

While VR aims to bring the user in a virtual world, Augmented Reality (AR), aims to bring virtual elements directly into the real world. Both the real world and the virtual elements are still limited to a screen, be it a computer or a mobile screen. By using a camera, the idea is to ”augment” the reality by introducing virtual objects. Only recently, cameras specialized in this field started to become more accessible to consumers and phones started to offer a better tracking system with new possible applications.


Figure 1.4: Microsoft HoloLens headset.

Finally, the latest frontier is the Mixed Reality, which as the name suggests, aims to mix virtual and augmented reality, bringing the user immersion to a new level. By using a VR headset and a high resolution 3D camera, the user is directly transported into a virtual representation of the real world augmented with virtual objects. In the most recent times, Microsoft has made some big steps in this direction by showing its HoloLens, a VR headset loaded with a powerful 3D camera which constantly scans the real world. The camera creates an internal 3D representation of the world used to provide an extremely precise tracking system to position and visualize 3D elements in it. As the user head moves, the headset moves and rotates the objects accordingly, effectively giving to the user the illusion of the presence of virtual objects in the real world. While the whole process is hidden to the final user, it results in a precise, high quality blending of Virtual and Augmented Reality.

Figure 1.5: Different kinds of realities.

The virtual world into which the user is transported can be either a virtual representation of the real world or even a completely new world, realized in such a way as to convince the user of being part of it. As of today, there are no devices yet able to make use of all five human senses. In fact, virtual environments such as videogames or virtual training systems can usually appeal only to human vision and hearing, reducing the user's immersion in the virtual world. For this specific reason, the focus is usually on the vision and hearing senses, trying to make the virtual world as convincing as possible, and letting the user navigate and freely explore it. With specific devices, called haptic interfaces, it is possible to receive tactile feedback in response to actions performed in the virtual world. The PERCRO laboratory of the Scuola Superiore Sant'Anna of Pisa (where this thesis has been developed) works on the development of innovative simulation systems, with the help of haptic interfaces and VR systems.

1.4 Applications of Virtual Reality

Even if the most widely adopted applications of Virtual Reality are in the gaming and entertainment fields, Virtual Reality is not an end in itself, and the literature presents many other kinds of possible applications, some of which are more challenging or unusual than others.

1.4.1 Military

Virtual Reality is adopted by the military [6] for training purposes. This is particularly useful for training soldiers by simulating hazardous situations or other dangerous settings where they have to learn how to react in an appropriate manner. The uses of VR in this field include flight or battlefield simulation, medic training [7] and virtual boot camps. It has proven to be safer and less costly than traditional training methods.


Figure 1.6: VR for parachute simulation.

1.4.2 Healthcare

Healthcare is one of the biggest adopters of virtual reality, encompassing surgery simulation [8], phobia treatment [9] and skills training [10]. A popular use of this technology is in robotic surgery [11], where surgery is performed by means of a robotic device controlled by a human surgeon, which reduces time and the risk of complications. Virtual reality has also been used in the field of remote telesurgery [12], where the operation is performed by a surgeon at a location separate from the patient. Since the surgeon needs to be able to gauge the amount of pressure to apply when performing a delicate procedure, one of the key features of such systems is force feedback [13].

Another technology deeply used in healthcare is Augmented Reality, which enables projecting computer-generated images onto the part of the body to be treated, or combining them with scanned real-time images [14].

1.4.3 Virtual Heritage

A new trend in Virtual Reality is certainly the use of such technology in the field of cultural heritage. Virtual Heritage [15] is a computer-based interactive technology in virtual reality that creates visual representations of monuments, artefacts, buildings and cultures, in order to deliver them openly to global audiences [16].

The aim of virtual heritage [17] is to restore ancient cultures as virtual environments in which the user can immerse himself, in such a way that he can learn about the culture by interacting with the environment.


Figure 1.7: VR for telesurgery.

Figure 1.8: VR for cultural heritage.

Currently, virtual heritage has become increasingly important in the preservation, protection, and collection of cultural and natural history [18], as the historical resources of many countries are being lost and destroyed [19]. With the establishment of new technologies, Virtual Reality can be used as a solution for problematic issues concerning cultural heritage assets.

2 Background And State of the Art

This chapter serves as an introduction to all the preliminary concepts needed to fully understand the whole system. It starts by giving an overview of the frameworks used to realize VR applications, then it gives particular attention to the engines that are currently the most used systems for realizing the vast majority of big VR applications, i.e. Unity and Unreal.

Furthermore, the problems encountered during the development of the thesis are introduced, together with the way they have been studied to gain the maximum advantage from the already existing solutions. Here a little consideration is needed: those solutions are too often wonderful and effective in theory but poor when implemented on real problems. This is mainly due to the fact that most papers lack implementation details and too often remain at a high-level description. Furthermore, it is very difficult to find concrete and simple examples of these solutions, and trying to implement them from scratch always ends in a big mess that makes you lose a lot of time.

Another thing to keep in mind is that during the development of these plugins not a single but multiple hard problems have been faced, so spending too much time on just one of them would have resulted in a multi-year project, which is not exactly what we wanted. Unfortunately this has led to adopting simpler solutions for problems that definitely need more attention, but this will surely be a matter for future developments. After all, it is pretty hard to realize World of Warcraft if you have never even realized Pong.

2.1 Virtual Reality Frameworks

Commercial virtual reality authoring systems allow users to build custom VR applications; some examples are WorldViz [59], EON Studio [60], COVISE [20] or AVANGO [21]. However, these systems are not open source and only allow the programmer to implement specific applications, which often do not offer advanced support for other technologies or case scenarios.

On the other hand, academic VR frameworks are designed to be rather limited in scope: they mostly abstract away PC clusters to drive a multi-screen display device and support a variety of tracking systems, such as Mechdyne's CAVElib [61] and Carolina Cruz-Neira's VR Juggler [22]. Nevertheless, they leave higher-level tasks to the programmer, such as navigation or menu widgets. Some existing open-source VR systems are just as extensible as CalVR and can be used interchangeably with it for many tasks; examples are Oliver Kreylos' VRUI [24], Bill Sherman's FreeVR [25] or Fraunhofer IAO's Lightning [26]. CalVR integrates OpenSceneGraph [62] (OSG) for its graphical output. OpenSceneGraph was modeled after SGI Performer [27] and is syntactically very similar, but it has evolved into its own programming library over the years. However, OpenSceneGraph lacks cluster support, tracking support, a way to describe multi-display layouts, menu widgets and an interaction system. Thus, OpenSceneGraph alone cannot drive VR systems.

In the latest years, many companies have developed their own VR framework. Among them, one of the most common SDKs used in game engines is OpenVR [63], along with its software support, SteamVR [64].

The following sections give an overview of the previously listed frameworks and their peculiarities.

2.1.1 Academic VR Frameworks

CAVELib [61] provides the cornerstone for creating robust, interactive, three-dimensional environments. CAVELib simplifies programming and is considered the most widely used Application Programmer Interface (API) for developing immersive displays. Developers focus on the application while CAVELib handles all the CAVE software details, such as the operating and display systems and other programming components, keeping the system platform independent.

VR Juggler [22] is an open-source virtual reality application development framework maintained by Iowa State University. Their motto "Code once, run everywhere" sums up their goal to simplify common tasks in VR applications.

The Vrui VR [24] toolkit aims to support fully scalable and portable applications that run on a range of VR environments, starting from a laptop with a touchpad, over desktop environments with special input devices such as space balls, to full-blown immersive VR environments ranging from a single-screen workbench to a multi-screen tiled display wall or CAVE.

FreeVR [25] is an open-source virtual reality interface/integration library. It has been designed to work with a wide variety of input and output hardware, with many device interfaces already implemented. One of the design goals was for FreeVR applications to be easily run in existing virtual reality facilities, as well as newly established VR systems. The other major design goal is to make it easier for VR applications to be shared among active VR research sites using different hardware from each other.

OpenSceneGraph [62] is an open-source, high-performance 3D graphics toolkit. It is used by application developers in the fields of visual simulation, games, virtual reality, scientific visualization and modelling. It is written entirely in Standard C++ and OpenGL and it runs on all Windows platforms, OSX, GNU/Linux, IRIX, Solaris, HP-UX, AIX and FreeBSD. OpenSceneGraph is established as the world-leading scene graph technology, widely used in the vis-sim, space, scientific, oil-gas, games and virtual reality industries.

SGI Performer is a powerful and comprehensive programming interface for developers creating real-time visual simulation and other professional, performance-oriented 3D graphics applications.

CalVR combines features from multiple existing VR frameworks into an open-source system. It is a new virtual reality middleware system which was developed from the ground up. In addition, CalVR implements the core functionality of commonly used existing virtual reality middleware, such as CAVElib, VRUI, FreeVR, VR Juggler or COVISE. On top of those, it supports several non-standard VR system configurations, multiple users and input devices, sound effects, and high-level programming interfaces for interactive applications. CalVR consists of an object-oriented class hierarchy which is written in C++.

2.1.2 Commercial VR Frameworks

WorldViz [59] is a virtual reality software company that provides 3D interactive and immersive visualization and simulation solutions for universities, government institutions, and commercial organizations. WorldViz offers a full range of products and support including enterprise-grade software, complete VR systems, custom solution design and application development. Vizard is one of the products offered by WorldViz and it is a virtual reality software toolkit for building, rendering, and deploying 3D visualization & simulation applications. It natively supports input and output devices including head-mounted displays, CAVEs, Powerwalls, 3D TVs, motion capture systems, haptic technologies and gamepads. Vizard uses Python for scripting and OpenSceneGraph for rendering.

EON Reality [60] is a Virtual and Augmented Reality based knowledge-transfer company for industry and education. EON offers a wide range of true cross-platform solutions enabling Augmented Reality, Virtual Reality, and interactive 3D applications to seamlessly work with over 30 platforms. One of its goals is to bring VR and AR into everyday life, from mobile to commercial uses, pushing research and development on holographic solutions and immersive systems.

COVISE [20] stands for COllaborative VIsualization and Simulation Environment. It is an extendable distributed software environment to integrate simulations, post-processing and visualization functionalities in a seamless manner. It was designed for collaborative working, letting engineers and scientists cooperate while spread over a network infrastructure. In COVISE an application is divided into several processing steps which are represented by COVISE modules. Each module is implemented as a separate process and can be arbitrarily spread across different heterogeneous machine platforms. Major emphasis was put on the usage of high-performance infrastructures such as parallel and vector computers and fast networks.


Avocado [21] is an object-oriented framework for the development of distributed and interactive VE applications. Data distribution is achieved by transparent replication of a shared scene graph among the participating processes of a distributed application. Avocado focuses on high-end, real-time virtual environments like CAVEs [23] and Workbenches; therefore, the development is based on SGI Performer.

OpenVR [78] by Valve Software is an API and runtime system that allows access to VR hardware from multiple vendors without requiring that applications have specific knowledge of the hardware they are targeting. Thanks to the OpenVR API it is possible to create an application that interacts with Virtual Reality displays without relying on a specific hardware vendor's SDK. It can be updated independently of the game to add support for new hardware or software updates. The API is implemented as a set of C++ interface classes full of pure virtual functions.
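As a concrete illustration, the minimal sketch below initializes the OpenVR runtime and queries the connected headset. The calls belong to the public OpenVR C++ API, but the error handling is reduced to the bare minimum and the snippet is not taken from the thesis code.

```cpp
// Minimal OpenVR initialization sketch: the same code works for any supported
// headset because the application only talks to the vendor-neutral interface.
#include <openvr.h>
#include <cstdint>
#include <cstdio>

int main() {
    vr::EVRInitError err = vr::VRInitError_None;
    // Ask the runtime for the IVRSystem interface as a "scene" (3D) application.
    vr::IVRSystem* system = vr::VR_Init(&err, vr::VRApplication_Scene);
    if (err != vr::VRInitError_None) {
        std::printf("OpenVR init failed: %s\n",
                    vr::VR_GetVRInitErrorAsEnglishDescription(err));
        return 1;
    }
    // The runtime reports the per-eye render-target size recommended by
    // whatever HMD is connected, without any vendor-specific code.
    std::uint32_t width = 0, height = 0;
    system->GetRecommendedRenderTargetSize(&width, &height);
    std::printf("Recommended per-eye resolution: %ux%u\n", width, height);
    vr::VR_Shutdown();
    return 0;
}
```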

This allows developing a single application that can be released on multiple VR platforms while changing basically nothing but the user input handling, since abstracting the different kinds of controllers is pretty hard and also not very useful.

Many VR applications and engines, such as Unreal Engine and Unity Engine, simplify their integration of VR by using SteamVR [64], which offers an interface between the OpenVR library and the engine. SteamVR, as the name suggests, is realized by Valve Software, and this ensures not only great product quality but also stability and continuous releases together with new versions of OpenVR. Thanks to these plugins it is possible to use all the OpenVR functionalities as native engine components, which makes integrating VR into one's own application much simpler. Furthermore it offers ready-to-use scripts and objects that can simply be inserted into the application to add the most common VR interactions, like teleporting or a UI pointer.

This is indeed the most common choice for a single developer or a small group of people developing a VR application, and it has obviously been chosen to realize the learning environment in which the tests have been conducted.


2.2 Game Engine

A Game Engine is an application that provides a suite of high-level development tools for the realization of a game. It usually offers a GUI in which it is possible to control all the functionalities offered by the engine, together with a set of APIs that allow programmers to define the behaviour of the objects inside the game. Thanks to the layers of abstraction they offer, it is possible to stay focused just on the game itself, letting the engine take care of all the necessary lower-level instruments like shaders or the physics engine.

A core aspect of game engines is their extensibility: thanks to it, it is possible to enrich their functionalities more and more through new plugins. Another much appreciated characteristic is the possibility to compile the same game for multiple platforms with minor changes that are still unrelated to the way the game interfaces with the hardware.

As one could imagine, a VR application is basically a first-person game in which the user is the main character, but this time he is really surrounded by the environment and there is no 2D display to filter the point of view. In fact engines offer a wide range of tools to improve VR applications, from graphics and animation tools to audio management and obviously gameplay and core programming for the basic elements of the application. C++ is widely used among all the engines, and most of them implement one or more of the previously mentioned frameworks to provide a fully immersive VR and AR experience, leaving to the programmer only the task of creating the world and the gameplay.

Now it is clear why Unreal and Unity are extensively used for the creation of VR content, whether games or non-gaming applications (also named "serious games"). Before presenting them, it is worth giving a little introduction to the XVR Framework, one of the first engines to support VR development and to be optimized for this kind of application.

2.2.1 The XVR Framework

XVR [30] is an Integrated Development Environment developed at the Scuola Superiore Sant'Anna. Thanks to a modular architecture and a VR-oriented scripting language, namely S3D, XVR content can be embedded in a variety of container applications. XVR supports a wide range of VR devices (such as trackers, 3D mice, motion capture devices, stereo projection systems and HMDs).

Figure 2.1: Some of the most used game engines nowadays (https://medium.com/@muhsinking/should-you-make-your-own-game-engine-f6b7e3f4b6f5).

Due to its extensive usage by the VRMedia group, a large number of libraries have been developed to support the mentioned devices, along with many more developed at the PERCRO laboratory of the Scuola Superiore Sant'Anna. XVR evolved over the years to include its own engine and, while it remains a powerful and extensible graphics engine, its development has gradually slowed down.

XVR is actually divided into two main modules:

• The ActiveX Control module, which hosts the very basic components of the technology such as versioning check and plug-in interfaces.

• The XVR Virtual Machine (VM) module, which contains the core of the technology, such as the 3D Graphics engine, the Multimedia engine and all the software modules managing the built-in XVR features.

The XVR-VM, like many other virtual machines, comprises a set of bytecode instructions, a set of registers, a stack and an area for storing methods. The XVR Scripting Language (S3D) allows specifying the behaviour of the application, providing the basic language functionalities and the VR-related methods, available as functions or classes. The script is then compiled into a byte-code which is processed and executed by the XVR-VM.

In general, an XVR application can be represented as a main loop which integrates several loops, each one running at its own frequency, such as graphics, physics, networking, tracking, and even haptics, at least for the high-level control loop.

2.2.2 Unreal Engine

Unreal Engine [57] is a graphic engine developed by the Epic Games team and mainly used to develop games and 3D applications. Unreal offers a complete suite of development tools for both professional and amateur developers to deliver high-quality applications across PC, consoles, mobile, VR and AR. The engine has a strong and large C++ framework which can be used to develop any kind of Virtual Environment. It offers a large set of tools such as:

• A user-friendly, customizable and extensible editor with a built-in real-time preview of the application.
• Robust online multiplayer framework.
• Visual effects and particle system.
• Material Editor.
• Animation toolset.
• Full editor in VR.
• Built and thought for VR, AR and XR.
• A large number of free assets.
• Audio engine.


The engine is completely open source and this is its biggest quality. Thanks to this, it is possible to inspect all its features starting from the lines of code implementing them. This brings two main benefits: one is the possibility to find out exactly how the engine does things just by looking inside its tons of lines of code; the other is that every programmer can edit a part of the engine, rebuild the whole source and obtain a completely new feature that exploits the native mechanisms of the engine. The latter is especially useful when Unreal does not provide a specific low-level functionality; in this case it is possible either to edit an existing class or to create a new one to insert directly into the engine. While this allows exploiting all the private or non-exposed methods of the existing classes, a big drawback is that the whole engine has to be rebuilt every time a little modification is made, and this also holds when trying to test the newly inserted code. Furthermore, the fact that Unreal uses C++ as its development language means that during the coding of the game it is also necessary to rebuild the whole game every time, and while there are some optimizations that make the process faster, it is still a bit slow, especially compared to the time required by Unity.

Unfortunately, while it is possible to read all the Unreal source code, its documentation is not so well done, and while all the classes are documented, they usually lack important descriptions and examples.

The basic building block of Unreal is a Module. The engine is implemented as a large collection of modules, each of which adds new functionalities through a public interface. While modules are written in C++ code, they are linked through C# scripts compiled using UnrealBuildTool. Every time a specific module is added, a set of new classes and functions is added to the engine's API and it is then possible to use all of them during game development. To add a module it is necessary to edit a C# script; some modules also need further modification of configuration files to be used correctly.

Unreal works with a big main loop that calls a series of functions in a precise order on all the objects present in the scene. Below are some basic concepts needed to get started with the engine:


• Project: A Project is a unit that holds all the content and code that make up an individual game.

• Objects: The base building blocks in the Unreal Engine are called Objects. Almost everything in the Unreal Engine inherits from (or gets some functionality from) an Object. In C++, UObject is the base class of all objects; it implements features such as garbage collection, metadata (UProperty) support for exposing variables to the Unreal Editor, and serialization for loading and saving.

• Classes: A Class defines the behaviors and properties of a particular entity or Object used in the creation of an Unreal Engine game. Classes are hierarchical as seen in many programming languages and can be created in C++ code or in Blueprints.

• AActors: An Actor is any object that can be placed into a level. Actors are a generic Class that supports 3D transformations such as translation, rotation, and scale. Actors can be created (spawned) and destroyed through gameplay code (C++ or Blueprints). In C++, AActor is the base class of all Actors (a minimal C++ sketch of an Actor with a Component is shown right after this list).

• Components: A Component is a piece of functionality that can be added to an Actor. Components cannot exist by themselves, however they can be added to an Actor, giving it access to the Component functionalities. They may be seen as modules which any entity may implement.

• APawn: An APawn is a subclass of AActor that serves as an in-game avatar or persona, for example a character in a game. Pawns can be controlled by a player or by the game's AI, in the form of non-player characters (NPCs).

• ACharacter: ACharacter is a subclass of APawn that is intended to be used as a player character. The ACharacter subclass includes a collision setup, input bindings for bipedal movement, and additional code for movement controlled by the player.


• Level: A Level is a user defined area of gameplay. Levels are created, viewed, and modified mainly by placing, transforming, and editing the properties of the Actors it contains.

• World: A World contains a list of Levels that are loaded. It handles the list of Levels and the spawning/creation of dynamic Actors (created at runtime).

• AGameMode: An AGameMode object is an abstract entity that is responsible for managing all the game aspects and rules. As an abstract entity it has no model nor physics, and it mainly works through events that are very different from the ones called on AActors. In a multiplayer game it exists only on the server machine, for security reasons.

• AGameInstance: The AGameInstance class contains information shared across all levels within a world, and it is an abstract entity too. It is the first thing created when a game session starts and the last one to be destroyed when the game ends. Unlike the AGameMode, it is replicated in all the instances of an online session and serves to store information that clients can access without breaking the game.
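The hypothetical Actor below ties the concepts of the list together: a C++ class derived from AActor that creates a Component in its constructor and receives the BeginPlay and Tick events from the engine's main loop. Class and file names are illustrative and do not come from the thesis code.

```cpp
// Illustrative Unreal Actor; not part of the thesis code.
#pragma once

#include "CoreMinimal.h"
#include "GameFramework/Actor.h"
#include "Components/StaticMeshComponent.h"
#include "ExampleAvatarActor.generated.h"   // name must match the header file

UCLASS()
class AExampleAvatarActor : public AActor
{
    GENERATED_BODY()

public:
    AExampleAvatarActor()
    {
        // Let the engine call Tick() on this Actor every frame.
        PrimaryActorTick.bCanEverTick = true;

        // Components give the Actor its functionality: here, a renderable mesh
        // that also acts as the root of the Actor's transform hierarchy.
        Mesh = CreateDefaultSubobject<UStaticMeshComponent>(TEXT("Mesh"));
        RootComponent = Mesh;
    }

    virtual void BeginPlay() override
    {
        Super::BeginPlay();          // called once when the Actor enters the Level
    }

    virtual void Tick(float DeltaSeconds) override
    {
        Super::Tick(DeltaSeconds);   // called by the engine's main loop
    }

private:
    UPROPERTY(VisibleAnywhere)
    UStaticMeshComponent* Mesh;
};
```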

The Unreal Engine offers a new, different way of defining the game's core logic and its entities' behavior. The Blueprints Visual Scripting system is a complete gameplay scripting system based on the concept of using a node-based interface to create gameplay elements from within the Unreal Editor. It may be used to define object-oriented classes or objects in the engine. Objects defined by using the Blueprint scripting language are often referred to as 'Blueprints'. The system itself is designed to be powerful and flexible, and is mainly aimed at designers and graphic artists with close to no C++ programming background (for this reason the Epic team is pushing its development). It especially gives the possibility to define functions and objects which can be directly spawned as nodes in a Blueprint graph.

2.2.3 Unity Engine

Unity Engine was born from a collaboration between Nicholas Francis, Joachim Ante and David Helgason to realize an engine that even a programmer who cannot afford expensive licenses could use³. It is a very easy-to-use engine thanks to its very well made documentation and a very powerful GUI that allows the user to do basically everything but coding in a very intuitive manner. Despite the fact that it is written in C++, it uses a scripting system in C#; while this restricts the control the user has over the engine (and also leaves less room for more aggressive optimizations), it avoids very annoying crashes due to the worst monster ever created: the segmentation fault⁴. Although it is mainly used for the development of mobile games, it supports about 30 different platforms.

In the current version, i.e. Unity 5, the engine is entirely based on the Component design pattern, first introduced by the most famous quartet in the computer science world: the Gang of Four [32]. This pattern is one of the most used in game development since it fits very well a lot of common problems, like separation of domains (physics, graphics, AI, etc.) and shared behaviour [33]. So in Unity everything (or almost everything) is just a component of an object. This means that if someone wants to affect the physics of an object he first needs to retrieve the component responsible for the physics, i.e. the Transform, and the same goes for the component that handles the rendering of an object, i.e. the Renderer.

Unity is also strongly event-driven and is based on a big cycle that continuously dispatches events to all the scripts currently in the scene. The time at which those events are called obviously changes based on their type: while some have a fixed timing, e.g. FixedUpdate(), others strictly depend on how fast the hardware is, e.g. Update(). Figure 2.2 gives a simplified view of how this cycle works.

To intercept these events Unity offers a series of C# classes that contain definitions of methods called when the associated event happens. The easiest way to handle a desired event is to create a subclass of the related class and override the method that deals with that specific event. It is worth noting that, unlike in Unreal, there is no need to call the overridden superclass method to make things work well.

³ More about the history of the engine can be found at [31].

⁴ If you are a very good (or a very terrible) programmer Unity is still able to crash, don't

Figure 2.2: Simplified version of the Unity event loop.

To have a clear overview of the engine it is necessary to introduce a few basic concepts:

• GameObject: In Unity the very base class is GameObject, which is the superclass of all the objects present in the scene. Unlike in Unreal, there is no need to create a new class to spawn a new type of object: the behaviour of an object is defined not by its class type, but by its components⁵.

• MonoBehaviour: Components can be defined by creating new C# scripts in which a new class is defined as a subclass of MonoBehaviour. This exposes all the basic methods with which it is possible to interact with the engine, like Start() and Update(). It is the core of the whole engine, since 95% of the code that builds a Unity application lives in its subclasses.

⁵ Unreal also has the concept of components, but while the theory behind them is the same, Unity makes a much more extensive use of them: they are basically the main "objects".

• ScriptableObject: There are two common reasons why a script does not inherit from MonoBehaviour: its job is to handle something less related to the game (like the file system or storage), or it does not define a behaviour but is responsible for storing some kind of data. The latter case is exactly what ScriptableObject was built for. In fact, it allows creating new objects without being attached to any GameObject. Conceptually it is like Unreal's AGameModeBase, but its usage is completely different. For example, one of its common uses is to make objects inside the scene data-driven, avoiding filling the script with tons of public attributes.

• Prefab: Up to this point Unity seems to offer no way to reuse the same GameObject with many behaviours attached to it. This would imply that every time a new entity is spawned, all the wanted scripts need to be attached one at a time. Obviously there is a simpler solution: Unity offers the possibility to create a Prefab, which works much like classes and objects do in programming. Once a scene object becomes a Prefab it is then possible to spawn multiple instances of it by just instantiating it. This allows the user to define a prototype of a given object and then create as many copies of it as he wants.

• Transform: All objects, even the ones used just as parents to group some others, have at least one component that is strictly necessary for the object to exist: the Transform. It is the component that manages the three basic characteristics of everything inside a scene, i.e. its position, its rotation and its scale. Every Transform refers to a parent Transform, which allows position, rotation and scale to be defined in a hierarchical way. For example, most of the time the localPosition attribute is used instead of position, since its 0, 0, 0 point is not the world origin but the position of the parent Transform.


Those five concepts are enough to get started with the Unity Engine, and while they certainly do not cover all the basics, they give enough knowledge to understand well what is written in chapters 3 and 4. Perhaps the only missing thing, though it is very intuitive, is how plugins work in Unity, which is slightly easier than in Unreal.

Here a little clarification is needed: in Unity terminology what we intend to build is a package and not a plugin. The latter is a library of native code that adds new low-level functionalities to the engine itself and that can be included as a shared library; once added, it is usable in all applications. Furthermore, such plugins can be added only in the non-free versions of the engine. A package, instead, is a folder that contains new scripts and resources that can be used within the application in which they have been included. This means that nothing can be added but what the engine itself allows to build (which is still enough for the vast majority of cases).

Anyway, to avoid further confusion and not to make things too specific, the word plugin will be used to mean both Unreal plugins and Unity packages.

2.3 Mouth Tracking

Mouth tracking is the problem of recognizing the shape of the mouth within an image. It is a well-studied problem since it can be applied in a wide range of applications for the most various tasks, from health care [36] [42] to improvements in multimedia systems [37], from robotics [38] to emotion recognition [55], up to virtual facial-expression recreation [40] [41] [52]. The latter area contains all those applications that use the recognized shape of the mouth to model virtual lips on an avatar to improve the realism of its facial expression. This is indeed the kind of application this thesis is interested in, since its purpose is a simplified version of this kind of task.

Most of the described solutions are divided into two different macro-steps:

• The recognition and isolation of the lower part of the face, which contains the mouth. In fact, the domain in which these applications are used usually provides not an image of the mouth alone but a photo of the whole face (or, even worse, of the whole body). So, in order to simplify the task, an isolation of the area of interest is required. This usually happens through machine learning techniques, like some kind of classifier. In [42] a Viola-Jones method is used to find the face of the subject. This method, in turn, uses a Haar classifier, which is also used in [43] together with an SVM to extract the lips from the rest of the image (a hedged code sketch of this step is shown right after this list).

• The extraction of the feature points and the computation of the desired result. In this second phase the lips have already been isolated and it is now possible to operate on an image containing the mouth and nothing more, which greatly simplifies the computation. From this image it is possible to extract some feature points that will then be used by another kind of algorithm. The term feature points defines a set of points that have a central importance in the considered problem; in this case they are most likely some peculiar points along the border of the mouth, like the bend points. The most commonly used method to compute these feature points is to find the contour of the lips and then sample some points at a fixed distance, or other points with specific geometric characteristics. In [42] a method that exploits image processing and geometric features of the mouth is used instead. Once the points have been found, a common solution is to give them as input to a classifier that outputs the desired result, as done in [45], which uses a PNN to recognize Action Units in a sequence of images.
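A hedged sketch of the first macro-step is shown below. It assumes the OpenCV library and one of its stock Haar cascades; the classifier, cascade file and heuristics used in the cited papers (and in the thesis code) may differ.

```cpp
// Face detection with a Haar cascade, then isolation of the lower half of the
// detected face, where the mouth lies. The cascade file name is a placeholder.
#include <opencv2/imgproc.hpp>
#include <opencv2/objdetect.hpp>
#include <vector>

// Returns the region of interest that most likely contains the mouth, or an
// empty rectangle if no face is found.
cv::Rect LocateMouthRegion(const cv::Mat& frameBGR) {
    static cv::CascadeClassifier faceCascade("haarcascade_frontalface_default.xml");

    cv::Mat gray;
    cv::cvtColor(frameBGR, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);              // reduces sensitivity to lighting

    std::vector<cv::Rect> faces;
    faceCascade.detectMultiScale(gray, faces);
    if (faces.empty()) return cv::Rect();

    // Keep the largest detection and restrict the search to its lower half.
    cv::Rect face = faces.front();
    for (const cv::Rect& f : faces)
        if (f.area() > face.area()) face = f;
    return cv::Rect(face.x, face.y + face.height / 2, face.width, face.height / 2);
}
```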

2.3.1 Active Contour Model

The most used technique for image contouring is indeed the Active Contour Model [46], also called the Snake Algorithm. This name is due to the fact that during its execution the algorithm moves the spline as if it were a snake. It is an energy-minimization algorithm that guides a deformable spline through forces that pull it toward the object contour. The main idea behind it is to minimize, at every point representing the spline, the sum of three different types of forces. This sum is indicated by the symbol $E_{snake}$ and the total energy to be minimized over the spline is:


$$E^{*}_{snake} = \int_{0}^{1} E_{snake}(v(s))\, ds = \int_{0}^{1} \left( E_{internal}(v(s)) + E_{image}(v(s)) + E_{con}(v(s)) \right) ds \tag{2.1}$$

The internal energy is the one responsible for preventing the spline from assuming too irregular a shape, and it is composed of a first-order and a second-order term:

$$E_{internal} = E_{cont} + E_{curv} = \frac{1}{2}\,\alpha(s)\,\lvert v_{s}(s)\rvert^{2} + \frac{1}{2}\,\beta(s)\,\lvert v_{ss}(s)\rvert^{2} = \frac{1}{2}\left( \alpha(s)\left\lvert \frac{d\bar{v}}{ds}(s) \right\rvert^{2} + \beta(s)\left\lvert \frac{d^{2}\bar{v}}{ds^{2}}(s) \right\rvert^{2} \right) \tag{2.2}$$

The α and β terms in equation 2.2 weight the first-order and second-order terms of the energy, respectively. The first controls the length of the spline and prevents the points from being too close or too far apart, while the second controls the smoothness and penalizes excessive oscillation.

The second type of energy is the image energy, and it is the core force. It is based, as the name suggests, on the features of the image and tries to attract the spline toward the salient points of the shapes within the image. This force is in turn composed of three kinds of sub-forces:

$$E_{image} = w_{line}E_{line} + w_{edge}E_{edge} + w_{term}E_{term} \tag{2.3}$$

The first energy is the image intensity itself and it is needed to pull the points toward darker or lighter lines. The second term is computed through a second-order derivative of the image and it is used to let the spline be attracted by the edges of the objects. This is the main term of the whole equation 2.1, since it is the one that mainly guides the points toward their destination. The third and last energy is the one that tries to move the points closer to line terminations and corners. The image energy is indeed the main force of the algorithm, since it is the one that differentiates it from one instance to another.

The last term of equation 2.1 is the constraint energy, which is not a fixed kind of energy. It is more of a "free parameter" that can be exploited by the user to make the spline quickly converge toward some particular features of the image, and for this reason it is very domain-specific.

However this algorithm has a very big drawback: it strongly needs knowledge about the searched object. In fact all the energy terms contain some user-tuned parameters that are very context-dependent and cannot be set once and for all. This means that a phase of parameter tuning is necessary to make this algorithm work correctly. Furthermore, while the $E_{image}$ term is less prone to this kind of problem, the $E_{con}$ force is entirely based on the domain in which it is applied. Nevertheless there are some other general-purpose terms that can be included in 2.1, like the balloon force described in [47].

This technique is largely used in mouth tracking since it allows computing the mouth's contour, from which the feature points are then extracted. Both [45] and [52] use it for mouth contouring; a simplified greedy sketch of one iteration of the algorithm is shown below.
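The following sketch is a deliberately simplified, greedy discretization of the model (in the spirit of Williams and Shah): each control point moves to the position in its 3x3 neighbourhood that minimizes a weighted sum of the continuity, curvature and image terms. The per-neighbourhood normalization and the $E_{term}$ and $E_{con}$ contributions are omitted, and this is not the implementation used in the thesis.

```cpp
// One greedy iteration of a simplified Active Contour Model. `grad` is assumed
// to be the Sobel gradient magnitude of the frame (see Section 2.3.2), stored
// row-major with the given width and height.
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Pt { int x, y; };

// Returns how many control points moved; 0 means the snake has converged.
std::size_t SnakeStep(std::vector<Pt>& v, const std::vector<float>& grad,
                      int width, int height,
                      float alpha, float beta, float gamma) {
    const std::size_t n = v.size();

    // Average spacing between consecutive points, used by the continuity term.
    float meanDist = 0.f;
    for (std::size_t i = 0; i < n; ++i) {
        const Pt& a = v[i];
        const Pt& b = v[(i + 1) % n];
        meanDist += std::hypot(float(a.x - b.x), float(a.y - b.y));
    }
    meanDist /= float(n);

    std::size_t moved = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const Pt& prev = v[(i + n - 1) % n];
        const Pt& next = v[(i + 1) % n];
        Pt best = v[i];
        float bestE = std::numeric_limits<float>::max();

        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                const Pt c{v[i].x + dx, v[i].y + dy};
                if (c.x < 0 || c.y < 0 || c.x >= width || c.y >= height) continue;

                // First-order term: keep the points evenly spaced on the spline.
                const float dPrev = std::hypot(float(c.x - prev.x), float(c.y - prev.y));
                const float eCont = std::fabs(meanDist - dPrev);

                // Second-order term: penalize sharp bends (discrete curvature).
                const float cx = prev.x - 2.f * c.x + next.x;
                const float cy = prev.y - 2.f * c.y + next.y;
                const float eCurv = cx * cx + cy * cy;

                // Image term: attract the point toward strong edges.
                const float eImg = -grad[std::size_t(c.y) * width + c.x];

                const float e = alpha * eCont + beta * eCurv + gamma * eImg;
                if (e < bestE) { bestE = e; best = c; }
            }
        }
        if (best.x != v[i].x || best.y != v[i].y) ++moved;
        v[i] = best;
    }
    return moved;
}
```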

Figure 2.4: An example of how the active contour algorithm is applied to mouth recognition.

2.3.2 Image Derivative

During the description of the Snake algorithm, the derivative of the image has been mentioned without explaining what it really is. Since it will be used extensively, it is good to fix some clear concepts about it.

The derivative of an image is a measurement of how much the function F(x, y) changes along the x and y directions. The function F(x, y) is defined as a function that maps the x and y coordinates to a discrete value that represents the intensity of a pixel. To compute this derivative, a convolution matrix that approximates the first-order derivative is used. The result is a matrix with the same size as the image, but filled with the derivative of each pixel at its specific coordinates. This matrix is computed using this formula:

I' = M_{c} \ast I \qquad (2.4)

where I' is the image derivative matrix, M_c is the used kernel and I is the matrix of the image computed using the F(x, y) function. The \ast symbol represents the convolution operator, i.e. the one that applies the M_c kernel to every element of the matrix.
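As a minimal illustration of equation 2.4, the kernel application can be sketched with OpenCV as follows; the central-difference kernel is an illustrative choice, and note that cv::filter2D technically computes a correlation, which differs from a true convolution only by a flip of the kernel.

```cpp
// A minimal sketch of equation 2.4: applying a derivative kernel Mc to an
// image I to obtain the per-pixel derivative matrix I'.
#include <opencv2/opencv.hpp>

cv::Mat applyKernel(const cv::Mat& I)
{
    // A simple central-difference kernel along x (an illustrative choice).
    const cv::Mat Mc = (cv::Mat_<float>(1, 3) << -1.f, 0.f, 1.f);

    cv::Mat Iprime;
    cv::filter2D(I, Iprime, CV_32F, Mc);
    return Iprime;
}
```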

The first-order derivative is mainly used for edge and corner detection within an image. This is due to the fact that pixels lying on a vertical or horizontal edge have a strong gradient along the x or y axis respectively, because there the change between the pixels' colors is far less smooth than the one between two pixels within the same object.

Especially for this purpose, a very common choice to compute the image derivative is the Sobel operator [53]. It uses two 3x3 kernels to compute the x and y derivatives and then returns a matrix in which every position holds the magnitude of the corresponding pixel in the two derivative matrices:

G_x = \begin{pmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{pmatrix} \ast I, \qquad
G_y = \begin{pmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix} \ast I, \qquad
G = \sqrt{G_x^{2} + G_y^{2}} \qquad (2.5)

This operator is usually applied on a one-channel image, so most of the time the image is first converted to grey-scale. Nothing prevents computing the Sobel filter on every channel separately and then merging the results, though. An example of the result of the Sobel filter is shown in figure 2.5.

Figure 2.5: An image before (a) and after (b) the application of the Sobel filter.

It is important to recall that the Sobel operator is just an approximation of the true derivative that trades precision for speed. A common way to remove some noise and obtain better results from the application of the Sobel filter is to first blur the image with a Gaussian blur. This also gives a better approximation of the derivative-of-Gaussian filter, which is indeed more accurate as an operator to compute an image derivative.
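As a concrete illustration of equation 2.5, the sketch below computes the gradient-magnitude image with OpenCV; the kernel size and the blur radius are illustrative choices, not values prescribed by the thesis.

```cpp
// A minimal sketch: grey-scale conversion, light Gaussian blur and Sobel
// derivatives, combined into the gradient magnitude G = sqrt(Gx^2 + Gy^2).
#include <opencv2/opencv.hpp>

cv::Mat sobelMagnitude(const cv::Mat& input)
{
    cv::Mat grey, blurred, gx, gy, magnitude;

    // Work on a single channel, as discussed above.
    cv::cvtColor(input, grey, cv::COLOR_BGR2GRAY);

    // Small Gaussian blur to suppress noise before differentiation.
    cv::GaussianBlur(grey, blurred, cv::Size(3, 3), 0);

    // First-order derivatives along x and y with the 3x3 Sobel kernels.
    cv::Sobel(blurred, gx, CV_32F, 1, 0, 3);
    cv::Sobel(blurred, gy, CV_32F, 0, 1, 3);

    // Per-pixel magnitude of the two derivative images.
    cv::magnitude(gx, gy, magnitude);
    return magnitude;
}
```

The resulting matrix is the kind of gradient-magnitude image that can feed the attracting term of the snake described in section 2.3.1.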

2.3.3 Applications

Mouth tracking, and more generally facial expression tracking, is an emerging technique in the entertainment world. Giving a virtual avatar the precision of a real facial expression, captured from a real human, is a great improvement that can make a big difference. In fact, more and more companies are investigating it, either to improve the realism of their characters, Disney above all [34], or just to add some fun features for their users. It is worth citing two different applications with the very same goal, reached with slightly different instruments.


Animoji

The latest family of iPhones created by Apple [66], i.e. the X series, has brought a lot of improvements to the most famous smartphone. One of them, and also one of the most appreciated, is the new Animoji feature: a program that lets a funny avatar reproduce, in real time, the very same facial expression made by the user facing the phone. This feature was not added to the previous models for a very simple reason: the hardware. The iPhone X and its siblings have a very particular camera mounted on the top of their display. It is not just a single RGB camera: the new "TrueDepth camera system" is composed of a flood illuminator, an infrared camera, a front camera, a dot projector, a proximity sensor and an ambient light sensor. All of them, combined with a powerful chip with a built-in neural network, produce a very realistic result. Thanks to all those components, the system is able to catch most of the user's facial expressions, also from different angles.

Despite its amazing precision, the big drawback of this application is the hardware required to run it, which is clearly not so simple to supply. Anyway, this is not a big surprise, since the house of Cupertino is famous for not giving away its well-made software to everyone.


FaceRig

A more surprising result is the one obtained by the FaceRig software [67]. Like the Animoji feature, it tries to capture the facial expression of a user and reproduce it in real time on different kinds of virtual avatars. This time, however, everything is done without any special camera, just the classic RGB one: the software's only requirement is some kind of camera attached to the computer on which it runs. Despite this minimal hardware, the results of the program are very impressive and at the very same level of the ones obtained using Animoji. Furthermore, it is possible to use the FaceRig avatars on other popular platforms like Twitch and YouTube. Obviously it does not come for free, and while the basic version is pretty cheap, more permissive licences have bigger costs.

Figure 2.7: A snapshot taken from FaceRig application.

For commercial reasons, the technologies behind these programs are not easy to access. Anyway, it is known that they make intensive use of advanced machine learning techniques and image processing, and that is especially true for FaceRig, due to its limited input.


2.4 Blendshape

Figure 2.8: Different poses of a blendshape (https://graphicalanomaly.wordpress.com/2015/07/17/blendshapes/).

The problem of modelling facial expressions on a virtual avatar has brought 3D artists to search for a more dynamic and easy-to-use solution that allows modelling a continuously changing mesh. Here come the blendshapes⁶: they are basically deformable meshes which dynamically change the vertices' geometry without changing their topology. Blendshapes are realized through the definition of multiple deformations of the same mesh. The first step is the definition of a base form for the mesh; further poses are then created and linked to the base one through the definition of a "deformation parameter". At this point it is possible to edit the newly created parameter to obtain a set of different poses that are nothing but interpolations of the positions of the vertices whose position has been changed.
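To picture the role of the deformation parameters, the sketch below shows a plain per-vertex combination in which each pose contributes its offset from the base mesh scaled by its weight; the data layout and names are illustrative and do not reflect the internal representation used by Maya or by the engines.

```cpp
// A minimal sketch of morph-target blending: the final vertex positions are
// the base positions plus the weighted per-vertex offsets of each pose.
#include <vector>

struct Vec3 { float x, y, z; };

std::vector<Vec3> blendShapes(const std::vector<Vec3>& base,
                              const std::vector<std::vector<Vec3>>& poses,
                              const std::vector<float>& weights)
{
    std::vector<Vec3> out = base;
    for (std::size_t p = 0; p < poses.size(); ++p)
        for (std::size_t v = 0; v < base.size(); ++v)
        {
            // Each deformation parameter (typically in [0, 1]) scales the
            // offset of the corresponding pose for this vertex.
            out[v].x += weights[p] * (poses[p][v].x - base[v].x);
            out[v].y += weights[p] * (poses[p][v].y - base[v].y);
            out[v].z += weights[p] * (poses[p][v].z - base[v].z);
        }
    return out;
}
```

Setting a single weight to 1 and all the others to 0 reproduces one of the statically defined poses, while intermediate values interpolate between them.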

Obviously, a mesh can have many different poses linked to as many parameters, which allows the user to easily mix the different poses to create a brand new one, completely different from the statically defined ones. The possibility to add more and more parameters should be used with caution, though, since a blendshape with a lot of different poses can fall into problems like gimbal lock, which deforms the mesh into poses that are not the ones the user would have achieved through the manipulation of the used parameters. Furthermore, the combination of completely incompatible parameters can result in a very poor graphic rendering.

⁶The name, now widely used for this kind of special mesh, is the one with which Maya [68] calls it. Other names referring to the same thing are morph target animation, per-vertex animation and shape interpolation.

Another advantage they offer is that the user has complete control of the actual form of the mesh (within the limits explained above) and can adapt the shape to different kinds of situations without the rigidity of skeletal animation. This allows creating small changes at every single frame, which results in a better and more fluid animation. However, the big drawback is that the geometry of the shape continuously changes, and this is not a light computational task, so it is typically good not to abuse the number of parameters in a single blendshape.


First, a description is provided of how the real-time voice chat that allows speech communication between the users of the system has been implemented. The solutions adopted for both engines are presented separately and with high-level details; some code will be shown in chapter 4.

Then the main part of the thesis is discussed. Before describing the solution adopted to handle the problem of a real-time video feedback that can be integrated in a VR system, an overview of the system used to obtain the video is needed to fully understand the whole application.

3.1 VOIP Chat

To make a comparison between different kinds of video feedback, a working audio system is needed, so the first step in developing the application was to build an online VOIP chat. The two engines taken into consideration deal with it in two very different ways, so making it work in both environments was not trivial. Creating a unique solution that could work in both engines was the first choice, but there were some considerable problems to take into account, mainly related to the overhead introduced by passing the audio to the engine after capturing it outside. Moreover, such a shared plugin would have had to be platform-agnostic in order to work on every kind of system supported by the engines. Last but not least, the plugin would introduce a new thread to handle the audio, which is a useless overkill since both engines have their own audio thread that can be exploited to process sounds in a way compliant with the engine logic, not to mention the speed gained by using the specific optimized audio tools offered by the engines.

So the engine-specific solution was picked as the best choice. A general discussion of the adopted design is impossible since, as stated above, the two engines are very different in how they deal with VOIP chat, so the two solutions will be discussed separately.

Before going into more details, one more important aspect of the chat has to be highlighted: the position of the audio source. In a 3D application, and primarily in a VR application, a big amount of realism can be added by using a 3D audio system rather than the classic 2D omnipresent sound. The latter is the kind of audio used in the majority of games, including 3D ones, and it is basically the way we listen to music every day with our headphones: no matter our position, no matter how much we move, we always hear the same sound. 3D sound, instead, is a technique whose use has grown considerably thanks to virtual reality, and it tries to simulate how a human being perceives sound. In real life no sound is omnipresent and always audible in the same way from both ears: the volume changes with the distance from the audio source, and the amount of sound heard by each ear changes with the orientation with respect to the source. Using this feature in virtual reality can therefore greatly increase the degree of immersion, since the user can also explore the environment using the sense of hearing.

Figure 3.1: Difference between 2D sound and 3D sound (https://www.yarra3dx.com/).

Anyway, it is good not to generalize this reasoning to all possible set-ups, since there are many situations in which the unnatural 2D sound is way better, e.g. environment music, internal self-speech and the voice of a virtual guide.


In fact, it is still present in a lot of 3D and VR applications and sometimes it is preferable for a good user experience. In this case study, though, the use of spatialized audio sources can greatly improve immersion: when considering a real-life learning situation, the instructor and the student perceive the sound coming from the position of the other interlocutor, so reproducing this feeling can increase the sensation of realism of the simulation and create a more comfortable environment.

The responsible for this 3D effect is the head-related transfer function (HRTF). The way a person perceives a sound depends on a lot of factors, mainly related to their physiognomy, like the size and shape of the head, ears and ear canal, the density of the head, and the size and shape of the nasal and oral cavities. All of them contribute to how our brain receives the sound wave and, depending on its shape, the brain is able to detect the direction of the sound in a 3D environment by comparing the cues received at the two ears. Thanks to these cues it is possible to locate the audio source. The responsible for this localization is the head-related impulse response (HRIR), an impulse response that relates the position of our ears to the position of the audio source; the HRTF is nothing but the Fourier transform of this impulse response. In this way the HRTF describes how a sound x(t) is heard by the left ear as xL(t) and by the right ear as xR(t), like in figure 3.2. More information about 3D sound, with particular reference to VR, which is not the main topic of this thesis, can be found in [54].
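In formulas, and under the common assumption that the HRIRs act as linear time-invariant filters, the relation sketched above amounts to a convolution in the time domain, or equivalently a multiplication in the frequency domain:

x_L(t) = (h_L \ast x)(t), \qquad x_R(t) = (h_R \ast x)(t)
X_L(f) = H_L(f)\,X(f), \qquad X_R(f) = H_R(f)\,X(f)

where h_L and h_R denote the left and right HRIRs, H_L and H_R the corresponding HRTFs, and the lowercase/uppercase pairs are time-domain signals and their Fourier transforms.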

Implementing this function from scratch would be a big problem, especially for someone who has never put their hands on audio processing. Luckily, and as one could expect, both engines offer the possibility of choosing between these two kinds of audio for every audio source in the scene. So no effort has to be spent on audio spatialization other than redirecting the voice of the interlocutor to a 3D audio source; the engine will then take care of delivering the audio correctly to the left and right earphones. The only restriction is that the sound must be mono, which is not a problem for sound captured from a microphone, which is usually single-channel.

For this reason, the possibility to switch between 2D and 3D sound has been added as a feature of the system, in order to investigate its impact from the user's point of view.
