
Università degli Studi di Pisa

Scuola Superiore Sant’Anna

Master of Science in Embedded Computing Systems

Natural interaction in fully immersive Virtual Reality: design and development of a natural interaction method using RGBD cameras.

Candidate:

Lorenzo Quadrelli

Supervisors:

Prof. Franco Tecchia, Prof. Marcello Carrozzino

Contents

1 Introduction 1

1.1 Virtual Reality . . . 1

1.2 Purpose of this thesis . . . 3

1.3 Fundamentals of three-dimensional graphics . . . 5

1.3.1 Rendering . . . 5

1.3.2 Virtual objects and meshes . . . 5

1.3.3 Transformation in the rendering process . . . 7

1.3.4 OpenGL . . . 11

1.3.5 Direct3D . . . 12

1.3.6 OpenGL context . . . 12

1.3.7 OpenGL Evolution . . . 13

1.3.8 Shaders . . . 15

2 State of the art 19

2.1 VR state of the art . . . 19

2.1.1 Output devices . . . 20

2.1.2 Input devices . . . 23

2.1.3 Challenges . . . 27

2.2 Hand Gesture Recognition . . . 32

2.2.1 Computer Vision Techniques for Hand Gesture Recognition . . . . 34

2.2.2 Depth Based Hand gesture recognition . . . 36

3 Intel Real Sense 39

3.1 Intel Real Sense platform . . . 39

3.1.1 Intel Real Sense cameras . . . 40

3.1.2 F200 Camera . . . 40

3.1.3 Intel Real Sense SDK . . . 41


3.1.4 SDK overview . . . 43

3.2 Program development . . . 44

3.2.1 RGB and Depth image acquisition . . . 44

3.2.2 Coordinate system . . . 46

3.2.3 UV coordinates . . . 48

4 Camera calibration and ArUco 51

4.1 Camera model and calibration problem . . . 51

4.1.1 Pinhole camera . . . 51

4.1.2 Pinhole Camera Model . . . 52

4.1.3 Camera Matrix . . . 56

4.1.4 Lens . . . 58

4.1.5 ArUco library . . . 60

4.1.6 Calibration implementation . . . 65

5 OpenGL Rendering 73

5.1 XVR and module architecture . . . 74

5.1.1 Shader program . . . 76

5.1.2 Vertex Shader . . . 78

5.1.3 Geometry Shader . . . 85

5.1.4 Pixel Shader . . . 87

5.1.5 XVR DLL program . . . 87

5.1.6 XVR Application . . . 89

5.1.7 Performances . . . 91

6 Rendering fidelity check and pose estimation 93

6.1 Markers Pose . . . 95

6.1.1 Rendering a cloud of points . . . 98

7 Hand Gesture Recognition 103

7.1 Implementation . . . 104

7.1.1 OpenCV . . . 105

7.1.2 Hand Recognition Program Structure . . . 106

7.1.3 Test with XVR . . . 128

8 Fully Immersive Visualization 131

8.1 Terminology . . . 132


8.4 Final Test Development . . . 136

8.5 Hardware set-up . . . 137

8.6 Implementation . . . 137

8.6.1 Data recovery and dynamic texture creation . . . 138

8.6.2 Dynamic Mesh . . . 140

8.6.3 Coordinates transformation . . . 142

8.6.4 Pinch . . . 144

8.6.5 Demo . . . 145

8.6.6 Unreal Results . . . 145

9 Conclusions 149

9.1 Future work . . . 150


List of Figures

1.1 The virtuality continuum . . . 2

1.2 3D Mesh Example . . . 6

1.3 Frontfacing and Backfacing . . . 7

1.4 The view frustum and NDC cube . . . 10

1.5 The rendering transformations . . . 10

1.6 OpenGL old fixed pipeline . . . 14

1.7 OpenGL new programmable pipeline . . . 15

2.1 taxonomy of output devices . . . 21

2.2 HMDs: From left to right, Oculus, HTC vive, Playstation VR . . . 22

2.3 Example of Haptic gloves (GloveOne) . . . 23

2.4 taxonomy of input devices . . . 24

2.5 Oculus controllers . . . 25

2.6 Example of omnidirectional treadmill . . . 26

2.7 Leap Motion tracker . . . 27

2.8 PrioVR suit example. PrioVR is an in-development motion capture and body tracking suit intended for mainstream use. . . 28

2.9 Screendoor effect example . . . 30

2.10 Hand gloves for hand recognition using computer vision techniques. image from [1] . . . 33

2.11 Background subtraction in hand detection . . . 34

3.1 Infrared pattern example . . . 41

3.2 F200 3D camera . . . 41

3.3 F200 3D camera without protective shell . . . 42

3.4 F200 3D camera IR pattern . . . 42

3.5 SDK architecture . . . 43


3.6 Intel Image coordinates . . . 47

3.7 Depth and RGB image mapping difference . . . 47

3.8 Texture Mapping (UV coordinates) . . . 49

4.1 Pinhole camera . . . 52

4.2 Pinhole model geometry . . . 53

4.3 Pinhole model geometry above view . . . 54

4.4 Principal Points example . . . 58

4.5 calibration radial distortion . . . 58

4.6 calibration tangential distortion . . . 60

4.7 examples of ArUco markers . . . 62

4.8 examples of ArUco board . . . 63

4.9 examples of ChArUco board . . . 64

4.10 Example original image with ChArUco board . . . 65

4.11 Calibration Program Workflow . . . 66

4.12 Example detected ChArUco board . . . 69

5.1 Schematic Overview of XVR DLL uses . . . 75

5.2 Vertex Shader IO . . . 78

5.3 flat mesh example . . . 80

5.4 texture filtering . . . 82

5.5 example rendering screenshot . . . 91

5.6 fps of XVR using OpenGL shaders . . . 92

6.1 example displacement between rendered marker and marker pose . . . 95

6.2 find marker pose flowchart . . . 96

6.3 cloud of points and ArUco marker . . . 100

6.4 dynamic mesh and cloud of point first comparison . . . 100

6.5 dynamic mesh and cloud of point second comparison . . . 101

7.1 hand recognition module . . . 107

7.2 example border recognition source image . . . 110

7.3 original image converted in grayscale and noised . . . 110

7.4 Canny detector fail, on the left the edge recognized and on the right the borders recognized shown on the original image . . . 111

7.6 Canny detector result on the blurred image, on the left the edge recognized and on the right the borders recognized shown on the original image . . . 112

7.7 morphological methods original image example . . . 114

7.8 dilate applied at the original image . . . 114

7.9 erode applied at the original image . . . 115

7.10 opening operation (erode followed by dilate) . . . 115

7.11 opening operation on F200 real sample. On the left the source image, on the right the final result . . . 116

7.12 thresholding example . . . 117

7.13 thresholding simply explanation . . . 117

7.14 binary treshold example . . . 118

7.15 example of threshold using the frames from the camera. On the left the source image, on the right the final image. Off course the camera is not located where stated, this is only an example . . . 118

7.16 Threshold problem and solution. On the left is shown what happen if we select as threshold the value of the "nearest pixels". On the right instead the right result . . . 119

7.17 Example hand silhouette; convex hull is coloured in blue, the border in green . . . 120

7.18 Example hand silhouette convexity defects. The red circle show the max-imum depth point of the convexity defect.Yellow circle represent where the convexity defect start and the green circle where it finish. . . 121

7.19 Example hand silhouette convexity defects. The red circle show the max-imum depth point of the convexity defect.Yellow circle represent where the convexity defect start and the green circle where it finish. the purple circle is an approximation the palm centre. . . 122

7.20 Example finger counting, there is an obvious error . . . 123

7.21 Example finger counting with corrections. . . 124

7.22 Example hole formed during pinch gesture (note: the finger count does not reflect the purple segment number any more!) . . . 125

7.23 Example of pinch, the pinch point is the green filled circle . . . 127

7.24 Example of zero depth problem . . . 128

7.25 Example pinch sequence on XVR . . . 129

8.1 Actors, everything in Unreal Engine is an Actor . . . 132


8.4 Create a Material from the UE4 editor . . . 139

8.5 render in front of the user blueprint . . . 143

8.6 Pinch in Unreal . . . 144

8.7 demo room . . . 145

8.8 QR link to first demo . . . 146

8.9 QR link to final demo video . . . 147

8.10 FPS of UE4 viewport . . . 148


Acknowledgements

This is the most personal section of this thesis: I have to thank some people who have been close to me during these years.

First of all, I want to thank my family for supporting me during these not always easy years. I have a particular character that you have had to put up with. Thanks therefore to my mother Roberta, my father Cesare and my sister Michela.

I do not want to make lists; I am sure that if any of the people "mentioned" in this paragraph ever reads it, they will find their name even if it is not written. Thanks therefore to my lifelong friends, those who have been there for years. Thanks to those who are, or have been, there for me.

Thanks to my girlfriend Micol.

Thanks to my university friends, those with whom I shared the weight of the lectures, the heat of the summer exam sessions and the unhealthy coffee from the vending machines. I thank in particular Alessandro, because the last two years without him would certainly have gone differently and would have had another flavour.

I thank my supervisor for the opportunity he gave me to learn about topics I have always been passionate about, and for guiding me through this thesis work.

I thank those who always believed in me and supported me, and those who believed in me in silence. I also thank those who said I would not make it, or who put obstacles in my way: you were not right.


Chapter 1

Introduction

1.1 Virtual Reality

When we speak about Virtual Reality (VR)[2] we refer to computer technologies which allow the user to immerse himself in non-real environments. These artificial environments are built through a multitude of technologies such as head mounted displays, audio headsets and haptic systems. A person using virtual reality equipment is able to "look around" the artificial world and, with high-quality VR, move about in it and interact with virtual features or items. The person feels a sensation of immersion, which is the perception of being physically present in a non-physical world[3]. To increase this sensation the user must be able to interact with the environment in the most natural and intuitive manner. For this reason many immersive technologies (such as gestural controls, motion tracking, gesture recognition, etc.) have been developed and are still the subject of studies. The virtual environment in which the user is projected can be a faithful reproduction of a real environment, or it can be completely different from reality but still convincing for the user, who must feel present and part of it.

The typical applications of VR are wide-ranging:

• Video games[4].

• Cinema and entertainment.

• Education[5][6] (VR is used to provide learners with a virtual environment where they can develop their skills without the real-world consequences of failing).

• Training (medical training, space training, flight and vehicular applications)[7][8].

• Military uses[9].

• Engineering (3D computer-aided design).

• etc.

Sometimes the term “virtual reality” is substituted with “augmented reality”, but there is a difference. Augmented Reality (AR) is a live direct or indirect view of the real world whose elements are supplemented by computer-generated information. Therefore, AR enhances the current perception of reality. By contrast, VR replaces the real world with a simulated one. An example of AR could be the possibility for a user (wearing special AR equipment) to see digital labels/information drawn near the real objects they relate to. With this feature, an employee can see the instructions to install a new machine simply by looking at its components.

A way to understand the difference is the virtuality continuum. The reality-virtuality continuum is a continuous scale ranging between the completely virtual and the completely real. It therefore encompasses all possible variations and compositions of real and virtual objects[10]. We can refer to the intermediate positions as Mixed Reality (MR) systems; therefore, for example, AR is a form of MR.

Figure 1.1: The virtuality continuum

At the present time (2017), there are still no systems capable of providing, simultaneously and effectively, sensory stimuli for all five human senses and in this way creating an artificial environment totally indistinguishable from the real one. Current VR systems instead rely on a combination of technologies such as:


• Real-time¹ highly realistic 3D graphics.

• Stereoscopic visualization of the virtual environment.

• User tracking (or tracking of his body parts).

• Holophonic sound.

• Haptic feedback and force feedback.

• etc.

The main technologies are focused on sight, hearing and sometimes the tactile sense. The overall quality of a virtual reality system does not depend only on the quality of the sensory information provided to the user (graphics, audio). In fact, there are at least two additional points:

• Level of control over the virtual environment. The freer the user feels, the more impressive the experience will be: the innate curiosity of the person will lead him to move objects, touch elements, break something, etc.

• Freedom of movement in the virtual environment. If the user is blocked or bounded in his movements (impossibility of walking, rotating the head, etc.), his experience will be less convincing.

1.2 Purpose of this thesis

As the title states, the purpose of this thesis is the study and implementation of methods for real-time rendering of 3D scenes using depth cameras[11] (RGBD cameras), and subsequently the design and development of a natural interaction method using the same cameras. The rendered objects are intended to be used in virtual environments realized mainly with head mounted display technology. As a prototype application in this thesis, the depth camera has been mounted on the user's helmet; in this way it is possible to render in the virtual world what the user would see in the real world. In particular, the rendering module opens the possibility of bringing the reconstructed user's hands into the VR environment. The solution shall be optimized in order to meet the real-time constraint. It is important to notice, again, that in this field "real time" does not refer to the presence of hard or soft time constraints managed with special algorithms; it simply means that the application must be used in an interactive manner and therefore must be optimized for performance, so that the series of rendered images induces the illusion of movement in the brain of the user. This illusion allows for interaction with the software, which performs its calculations taking user input into account. Back to the subject, the rendering module must be independent, and it shall be possible to integrate it into a bigger third-party environment as a building block for VR or AR applications.

¹ In this field, "real time" does not refer to the presence of hard or soft time constraints managed with special algorithms; it simply means that the application must be used in an interactive manner and therefore must be optimized for performance.

Once the hands were brought into the virtual world, another independent software module was implemented which, thanks to the same camera images and without any other hardware device, can identify elementary hand gestures. In particular, attention was focused on the pinch gesture. The final prototype application is thus a pick&place in the virtual world, where the user moves virtual objects by performing only natural gestures, without any tracking equipment besides the "wearable" camera. To develop this system, the F200 depth camera from the Intel Real Sense platform has been used. The benefits for VR applications are clear:

• The user will have the possibility to see his hands in the virtual world increasing the overall immersion feeling.

• The user will have the possibility to interact naturally with the virtual world objects and so, again, the overall immersion feeling will increase.

In fact, a significant factor in immersion is the accurate and timely tracking and representation of the user's hand in the virtual environment. If the user can see a virtual rendering of his or her hand and its movements relative to the movements of other objects, there is a much better chance that the user will feel that the virtual hand embodies his or her intentions.

In summary, three main modules have been developed for this thesis:

• Intel F200 Real Sense Manager: a software module that manages the RGBD camera and enables the video streams.

• Hand Gesture Recognition: a software module that recognizes gestures, in particular the pinch one.

• ArUco Module: a software module that, thanks to the ArUco library, is used as support for development and could potentially allow AR features.


The three modules described above are independent and can be used together inside bigger third-party environments such as XVR or Unreal Engine.

1.3 Fundamentals of three-dimensional graphics

This section introduces some basic concepts of three-dimensional graphics that will be useful in the following chapters.

1.3.1 Rendering

Rendering refers to the complex activity that generates an image of a virtual scene, whose representation exists in the computer memory, from a certain point of view. The rendering process is executed as a pipeline. Each pipeline stage can be seen as a software module that executes a series of operations on the input data it receives, although nowadays these modules are implemented in the hardware of the graphics adapter to increase performance. In this thesis, and in VR in general, the rendering process must be continuous: after the generation of one image, called a frame, the generation of the following one begins. The framerate is defined as the number of frames produced in one second and is measured in fps. In order to give the user the illusion of a continuous evolution of the scene, the framerate should stay above 30 fps. Below the 30 fps threshold, the user perceives a discontinuity in the video stream that reduces the quality of the application. A brief overview of the rendering pipeline of the OpenGL graphics system is given ahead.
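As an illustration of how the framerate constraint can be monitored, the sketch below (illustrative code, not part of the thesis software; renderScene() is a placeholder name) measures the duration of each render-loop iteration with std::chrono and derives the corresponding fps value.

    #include <chrono>
    #include <cstdio>

    // Placeholder for the application's actual drawing code.
    void renderScene() { /* issue draw calls here */ }

    int main() {
        using clock = std::chrono::steady_clock;
        for (int frame = 0; frame < 5; ++frame) {
            auto start = clock::now();

            renderScene();                                   // produce one frame

            auto end = clock::now();
            double ms = std::chrono::duration<double, std::milli>(end - start).count();
            double fps = ms > 0.0 ? 1000.0 / ms : 0.0;       // instantaneous framerate
            // Below ~30 fps (frame time above ~33 ms) the user perceives discontinuity.
            std::printf("frame time: %.2f ms (%.1f fps)\n", ms, fps);
        }
        return 0;
    }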

1.3.2 Virtual objects and meshes

In computer graphics, we visualize objects that are artificially constructed (or reconstructed). The geometry of real objects is so complex that it is impossible, or at least prohibitive, to represent them completely, so an approximation is needed. There are different approaches to represent three-dimensional objects on a computer. The prominent one is the method that associates to each object a collection of interconnected polygons that approximate its surface. This group of polygons is called a mesh. More in detail, the type of polygon used in the representation is also not unique but, over the years, the triangle approximation became the most used one.
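As a concrete sketch of this representation (with illustrative type names, not taken from the thesis code), a triangle mesh can be stored as an array of vertex positions plus an array of triangles, each referencing three vertex indices:

    #include <array>
    #include <vector>

    struct Vec3 {
        float x, y, z;                  // vertex position in object space
    };

    struct Triangle {
        std::array<unsigned, 3> v;      // indices into the vertex array
    };

    struct Mesh {
        std::vector<Vec3>     vertices;   // memory grows with the number of vertices
        std::vector<Triangle> triangles;  // each triangle approximates a patch of the surface
    };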

Obviously, an object represented through a collection of polygons is not an exact representation. In general the goodness of the approximation depends on the shape of the object we want to encode but, evidently, the higher the number of polygons, the better the final quality will be.


Figure 1.2: 3D Mesh Example

A mesh is represented through the vertex positions of its polygons. A mesh occupies an amount of computer memory roughly proportional to the number of vertices belonging to it. It is again obvious that the time necessary to process and display a mesh is related to the mesh complexity (number of vertices); this is a fundamental problem of computer graphics. It is useful to establish a convention that determines the orientation of a polygon. Without this convention, we cannot associate a normal vector to a polygon, because it could have two possible directions. Furthermore, in the case of a closed mesh² it can be useful to recognize triangles that are not facing the camera and are for this reason hidden.

• If the vertices are ordered anti-clockwise from the camera point of view, then the normal vector points out of the polygon. We'll say that the polygon is front facing.

• If the vertices are ordered clockwise from the camera point of view, then the normal vector points into the polygon. We'll say that the polygon is back facing.

² A closed mesh can be defined as a mesh in which each edge, a connection between two vertices, is shared by exactly two triangles.

Figure 1.3: Frontfacing and Backfacing

If v1, v2, v3 are respectively the first, the second and the third vertex of a triangle, then the normal N can be calculated as follows:

\[ v' = v_2 - v_3, \qquad v'' = v_3 - v_1 \]

\[ n = v' \times v'', \qquad N = \frac{n}{\lVert n \rVert} \]

where $\times$ is the cross product and $\lVert n \rVert$ is the Euclidean norm of the vector n.

In the case of a closed mesh this guarantees that all the normals are oriented towards the outside of the mesh and, furthermore, that the normal of each polygon approximates the normal of the real object at the same point. The back-facing polygons of a closed mesh will be hidden by the front-facing polygons nearer to the camera. The technique which avoids drawing those back-facing polygons is called back-face culling.
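The following sketch puts together the normal formula above and the back-face test it enables (illustrative code under the assumption of a simple Vec3 type and a known camera position; it is not the thesis implementation):

    #include <cmath>

    // A simple 3D vector type (same layout as in the mesh sketch above).
    struct Vec3 { float x, y, z; };

    Vec3 sub(const Vec3& a, const Vec3& b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
    float dot(const Vec3& a, const Vec3& b)  { return a.x * b.x + a.y * b.y + a.z * b.z; }
    Vec3 cross(const Vec3& a, const Vec3& b) {
        return {a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
    }

    // n = (v2 - v3) x (v3 - v1), N = n / ||n||, as in the formula above.
    // Assumes a non-degenerate triangle (||n|| > 0).
    Vec3 faceNormal(const Vec3& v1, const Vec3& v2, const Vec3& v3) {
        Vec3 n = cross(sub(v2, v3), sub(v3, v1));
        float len = std::sqrt(dot(n, n));
        return {n.x / len, n.y / len, n.z / len};
    }

    // Back-face culling test: the triangle is back facing when its normal
    // points away from the camera, i.e. the angle with the vector from the
    // triangle towards the camera is greater than 90 degrees.
    bool isBackFacing(const Vec3& v1, const Vec3& v2, const Vec3& v3, const Vec3& cameraPos) {
        Vec3 n = faceNormal(v1, v2, v3);
        Vec3 toCamera = sub(cameraPos, v1);
        return dot(n, toCamera) < 0.0f;
    }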

1.3.3 Transformation in the rendering process

As stated above, the geometry of a virtual object is represented through a mesh, and the vertices of the mesh define the positions of the polygons of which it is composed. Each vertex is identified by its position in three-dimensional space through the vector v_obj = (x_obj, y_obj, z_obj). The three-dimensional space where the positions of the vertices are defined is called object space, and the coordinates defined in this space are called object coordinates. A virtual object can be placed in a virtual environment through a modelling transformation. The coordinates of the object are then no longer referred to the object space but to a new reference frame common to the whole virtual scene: the coordinates of the vertices are now referred to the world space and are called world coordinates. To obtain the world coordinates from the object coordinates a transformation matrix is used. Usually these operations make use of homogeneous coordinates. In homogeneous coordinates a point in three-dimensional space is represented through a vector of 4 components (x, y, z, w), where w can be seen as a scaling factor; the actual position in Euclidean space is (x/w, y/w, z/w). The matrices used for the transformations are 4-by-4: it may seem complicated, but with 3-by-3 matrices it is not possible to represent translations. Transforming the object coordinates into world coordinates can be done as follows:

\[
\begin{pmatrix} x_{wrld} \\ y_{wrld} \\ z_{wrld} \\ w_{wrld} \end{pmatrix}
= M_{modelling}
\begin{pmatrix} x_{obj} \\ y_{obj} \\ z_{obj} \\ 1 \end{pmatrix}
\tag{1.1}
\]
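For example, assuming a modelling transformation that is a pure translation by (t_x, t_y, t_z), the 4-by-4 matrix and its effect on a vertex in homogeneous coordinates are:

\[
M_{modelling} =
\begin{pmatrix}
1 & 0 & 0 & t_x \\
0 & 1 & 0 & t_y \\
0 & 0 & 1 & t_z \\
0 & 0 & 0 & 1
\end{pmatrix},
\qquad
M_{modelling}
\begin{pmatrix} x_{obj} \\ y_{obj} \\ z_{obj} \\ 1 \end{pmatrix}
=
\begin{pmatrix} x_{obj} + t_x \\ y_{obj} + t_y \\ z_{obj} + t_z \\ 1 \end{pmatrix}
\]

a mapping that no 3-by-3 matrix can reproduce on plain (x, y, z) coordinates.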

Then, in order to capture a frame (a virtual image), it is necessary to take a picture of the scene from a certain position and orientation: it is like having a virtual camera. This new transformation is called viewing. The new coordinates are called eye coordinates and they represent the positions of the vertices in the eye space.

\[
\begin{pmatrix} x_{eye} \\ y_{eye} \\ z_{eye} \\ w_{eye} \end{pmatrix}
= M_{viewing}
\begin{pmatrix} x_{wrld} \\ y_{wrld} \\ z_{wrld} \\ w_{wrld} \end{pmatrix}
= (M_{viewing} \cdot M_{modelling})
\begin{pmatrix} x_{obj} \\ y_{obj} \\ z_{obj} \\ 1 \end{pmatrix}
\tag{1.2}
\]

A computer monitor is a 2D surface, so a rendered 3D scene must be projected onto the computer screen as a 2D image. The projection matrix is used for this projection transformation: it transforms all vertex data from the eye coordinates to the clip coordinates in the clip space.


\[
\begin{pmatrix} x_{clip} \\ y_{clip} \\ z_{clip} \\ w_{clip} \end{pmatrix}
= M_{projection}
\begin{pmatrix} x_{eye} \\ y_{eye} \\ z_{eye} \\ w_{eye} \end{pmatrix}
\tag{1.3}
\]

Furthermore, the visible volume of the 3D scene is defined. This volume is called the view frustum and has the form of a truncated rectangular pyramid. The planes that cut the frustum perpendicular to the viewing direction are called the near plane and the far plane. Objects closer to the camera than the near plane or beyond the far plane are not drawn. Sometimes, the far plane is placed infinitely far away from the camera, so that all objects within the frustum are drawn regardless of their distance from the camera. View frustum culling is the process of removing objects that lie completely outside the viewing frustum from the rendering process; rendering these objects would be a waste of time since they are not directly visible. The clip coordinates are still homogeneous coordinates. The perspective division takes place to transform these coordinates into Cartesian ones, producing the normalized device coordinates. The view volume has been transformed into a cube centered at the origin in which all the vertex coordinates are now normalized from -1 to 1 on all 3 axes. They are similar to window (screen) coordinates, but have not yet been translated and scaled to screen pixels.

\[
\begin{pmatrix} x_{NDC} \\ y_{NDC} \\ z_{NDC} \end{pmatrix}
=
\begin{pmatrix} x_{clip}/w_{clip} \\ y_{clip}/w_{clip} \\ z_{clip}/w_{clip} \end{pmatrix}
\tag{1.4}
\]

The last transformation is the viewport transformation. The NDC coordinates are translated into the window coordinates, which define the position of the vertices inside the application window (x_win, y_win) and a depth z_win (usually 0 < z_win < 1) that can be useful to know when an object is farther away than another.

The viewport transformation depends on the current viewport (the area of the window in which we are drawing), which is specified by 4 integer values (x, y, w, h) that represent the bottom left corner (x, y) of the viewport, the width (w) and the height (h).


Figure 1.4: The view frustum and NDC cube

These integers that describe the viewport area are, finally, specified in pixels.

\[
\begin{pmatrix} x_{win} \\ y_{win} \\ z_{win} \end{pmatrix}
=
\begin{pmatrix}
\frac{w}{2}\, x_{NDC} + \left(x + \frac{w}{2}\right) \\
\frac{h}{2}\, y_{NDC} + \left(y + \frac{h}{2}\right) \\
\frac{1}{2}\, z_{NDC} + \frac{1}{2}
\end{pmatrix}
\tag{1.5}
\]

All the rendering transformations can be summarized in one figure:

Figure 1.5: The rendering transformations

It is important to notice that the transformation chain shown above concerns vertex coordinates; nothing has been said about how pixels are transformed inside the surfaces they belong to, because this is application dependent. The presented transformation chain is the one adopted by OpenGL. Other graphics systems can use a slightly different, but still very similar, chain.
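The whole chain of equations (1.1)-(1.5) can be reproduced in a few lines of C++, here using the GLM mathematics library purely as an illustration (GLM and the numeric values are assumptions of this sketch, not part of the thesis code):

    #include <glm/glm.hpp>
    #include <glm/gtc/matrix_transform.hpp>

    int main() {
        // One vertex given in object coordinates (homogeneous, w = 1).
        glm::vec4 vObj(0.5f, 0.0f, 0.0f, 1.0f);

        // (1.1) modelling: place the object in the world (a simple translation here).
        glm::mat4 M_modelling = glm::translate(glm::mat4(1.0f), glm::vec3(0.0f, 0.0f, -5.0f));

        // (1.2) viewing: the virtual camera, defined by position, target and up vector.
        glm::mat4 M_viewing = glm::lookAt(glm::vec3(0.0f, 1.0f, 0.0f),    // eye
                                          glm::vec3(0.0f, 0.0f, -5.0f),   // centre
                                          glm::vec3(0.0f, 1.0f, 0.0f));   // up

        // (1.3) projection: perspective projection defining the view frustum.
        glm::mat4 M_projection = glm::perspective(glm::radians(60.0f), 16.0f / 9.0f, 0.1f, 100.0f);

        glm::vec4 vClip = M_projection * M_viewing * M_modelling * vObj;  // clip coordinates

        // (1.4) perspective division: clip -> normalized device coordinates.
        glm::vec3 vNDC = glm::vec3(vClip) / vClip.w;

        // (1.5) viewport transformation for a viewport (x, y, w, h) = (0, 0, 1280, 720).
        float x = 0.0f, y = 0.0f, w = 1280.0f, h = 720.0f;
        glm::vec3 vWin((w / 2.0f) * vNDC.x + (x + w / 2.0f),
                       (h / 2.0f) * vNDC.y + (y + h / 2.0f),
                       0.5f * vNDC.z + 0.5f);
        (void)vWin;   // vWin now holds window coordinates, as OpenGL computes them internally
        return 0;
    }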


1.3.4 OpenGL

The OpenGL[12] real-time 2D and 3D graphics Application Programming Interface (API) was introduced in 1992 as an open, vendor neutral, multi-platform and scalable basis for interactive graphics application development. The API is typically used to interact with a graphics processing unit (GPU) to achieve hardware-accelerated rendering. The OpenGL specification describes an abstract API for drawing 2D and 3D graphics. Although it is possible for the API to be implemented entirely in software, it is designed to be implemented mostly or entirely in hardware; the OpenGL interface is almost always implemented in hardware by the producer of the graphics adapter. The specification says nothing on the subject of obtaining and managing an OpenGL context, leaving this as a detail of the underlying windowing system. OpenGL is an evolving API. New versions of the OpenGL specifications are regularly released by the Khronos Group, each of which extends the API to support various new features. The details of each version are decided by consensus between the Group's members, including graphics card manufacturers, operating system designers, and general technology companies such as Mozilla and Google. Because OpenGL is purely concerned with rendering, it does not provide APIs related to input, audio, or windowing. Given that creating an OpenGL context is quite a complex process, and given that it varies between operating systems, automatic OpenGL context creation has become a common feature of several game-development and user-interface libraries, including Allegro and Qt. A few libraries have been designed solely to produce an OpenGL-capable window. The first such library was the OpenGL Utility Toolkit (GLUT), later superseded by freeglut. OpenGL does not impose significant hardware requirements: basically the only constraint is the presence of a framebuffer, that is, a portion of memory, often managed and directly implemented by the graphics card, containing the data of every single pixel that is displayed on the screen. OpenGL function calls can be seen as following a client-server model: the program that uses the OpenGL system is the client and the implementation of the OpenGL API is the server. OpenGL (and an OpenGL context manager) was directly used to implement part of this thesis. Another part of the thesis took advantage of XVR, an innovative development environment devoted to Computer Graphics, VR and AR; but XVR, too, is based on OpenGL. For this reason a paragraph is dedicated to the API and another will be dedicated to OpenGL contexts.


1.3.5 Direct3D

OpenGL is not the only system available to develop graphics applications: Microsoft has established its own solution. Direct3D is a graphics application programming interface (API) for Microsoft Windows. Direct3D is encompassed in the DirectX API and is used to render three-dimensional graphics in applications where performance is important, such as games. Like OpenGL, Direct3D uses hardware acceleration if it is available on the graphics card, allowing for hardware acceleration of the entire 3D rendering pipeline or even only partial acceleration. There are some differences between Direct3D and OpenGL. First of all, Direct3D is proprietary whereas OpenGL is, obviously, open. Since the efficiency of both systems depends on the implementation provided by the hardware producer, and the producers often choose to optimize Direct3D, it happens that the latter performs better. In fact Direct3D is widely used on Windows systems and on the Xbox 360 for videogame development. Implicitly, DirectX was also used in this thesis: Unreal Engine 4, where the last presented modules are implemented, is based on DirectX.

1.3.6 OpenGL context

An application which uses OpenGL begins its execution by building up the window where the objects will be drawn. The window manager (and the operating system on which it depends) will make available a framebuffer. This framebuffer is the target where the OpenGL calls will write the data regarding the pixels displayed in the window. The framebuffer is not directly managed by OpenGL: OpenGL can directly access it, but it does not care how the buffer is obtained. It may be that this buffer is managed through a couple of distinct buffers that are used alternately: while information is written to the first buffer (the back buffer) through the OpenGL calls, the second buffer (the front buffer) is displayed in the window. When the writing operation on the back buffer ends, the buffer swap takes place. This technique is called double buffering and it is used for drawing graphics that show no (or less) stutter³, tearing⁴, and other artifacts.

It is difficult for a program to draw a display so that pixels do not change more than once. For instance, when updating a page of text, it is much easier to clear the entire page and then draw the letters than to somehow erase all the pixels that are not in both the old and new letters. However, this intermediate image is seen by the user as flickering. In addition, computer monitors constantly redraw the visible video page (at around 60 times a second), so even a perfect update may be visible momentarily as a horizontal divider between the "new" image and the un-redrawn "old" image, known as tearing.

³ Stuttering is a term used in computing to describe a quality defect that manifests as irregular delays between frames rendered by the GPU(s).

⁴ Screen tearing is a visual artifact in video display where a display device shows information from multiple frames in a single screen draw.

A software implementation of double buffering has all drawing operations store their results in some region of system RAM; any such region is often called a "back buffer". When all drawing operations are considered complete, the whole region (or only the changed portion) is copied into the video RAM (the "front buffer"); this copying is usually synchronized with the monitor’s raster beam in order to avoid tearing. Double buffering necessarily requires more memory and CPU time than single buffering because of the system memory allocated for the back buffer, the time for the copy operation, and the time waiting for synchronization.

Returning to the topic of this paragraph, once the window has been created, the OpenGL context is created. An OpenGL context is the data structure that collects all the states needed by the server to render an image: it contains references to buffers, textures, shaders, etc. The OpenGL API doesn't define how a rendering context is created; that is up to the native window system, but none of the OpenGL commands can be executed until a rendering context is created and made current. A single OpenGL client can manage more than one context, but it can execute commands only on a single context at a time (the currently active one).
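As an illustration of the steps described above, the sketch below uses GLFW, a context-creation library of the same family as GLUT/freeglut mentioned earlier (the choice of GLFW and the window parameters are assumptions of this example, not the setup used in the thesis), to create a double-buffered window, make its context current and swap the buffers once per frame:

    #include <GLFW/glfw3.h>   // also pulls in the basic OpenGL declarations

    int main() {
        if (!glfwInit())
            return -1;

        // The window system creates the (double-buffered) framebuffer for us.
        GLFWwindow* window = glfwCreateWindow(1280, 720, "OpenGL context example", nullptr, nullptr);
        if (!window) {
            glfwTerminate();
            return -1;
        }

        // No OpenGL command can be issued until a context is created and made current.
        glfwMakeContextCurrent(window);

        while (!glfwWindowShouldClose(window)) {
            // Draw into the back buffer.
            glClearColor(0.1f, 0.1f, 0.1f, 1.0f);
            glClear(GL_COLOR_BUFFER_BIT);
            // ... draw calls would go here ...

            glfwSwapBuffers(window);   // swap back and front buffer (double buffering)
            glfwPollEvents();          // process window-system events
        }

        glfwTerminate();
        return 0;
    }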

1.3.7 OpenGL Evolution

The first version of OpenGL (1.0) was released in 1992. The functionalities of that version were limited and will not be described here. For the purpose of this thesis, instead, it is important to remark on the release in 2003 of OpenGL 1.5, when a series of extensions aimed at modifying the graphics pipeline were released. Before version 1.5 the graphics pipeline was fixed: the functionality of each pipeline stage was preset and the user (programmer) could only change configuration parameters to obtain a different behaviour.

In summary and simplifying:

• Per-Vertex Operations: in this stage the geometry data of the single vertices are received and transformed. The transformations explained in the previous paragraphs, such as modelling, viewing, projection and viewport, take place here. Then the primitives (triangles, lines, etc.) are assembled. Volume clipping is performed, so the primitives outside the view frustum are discarded. The primitives lying on the boundary of the view volume are modified: the part outside is discarded and the part inside is modified, producing new vertices on the boundary. This operation is called primitive clipping.

Figure 1.6: OpenGL old fixed pipeline

• Rasterization: the transformed vertex coordinates and the primitive description produced by the previous stage allow the rasterization stage to write to the framebuffer addresses that correspond to the positions of the primitives we are drawing. That means, at a very basic level, that rasterizers simply take a stream of vertices and transform them into the corresponding 2-dimensional points on the viewer's monitor.

• Per-Fragment Operations: a series of tests and operations that can modify the fragment color generated by the previous stage before it is written to the corresponding pixel in the framebuffer. If any of the tests fails, the fragment is discarded.

• Framebuffer: as said before, it is a buffer whose contents typically consist of color values for every pixel to be shown on the display.

As stated before, starting from OpenGL 1.5 a new paradigm emerged. The new API allows parts of the graphics pipeline to be programmed directly. This frees video card manufacturers from having to implement hardware specifically to meet the API: they simply have to optimize the execution of shaders. In fact, all of the OpenGL features that have since been deprecated can be reproduced through shaders.

In summary and simplifying:

• Programmable Vertex Processor: in this stage the vertex shader is executed. It can implement all the transformations described before on all vertices. The vertex shader is executed once per vertex.


Figure 1.7: OpenGL new programmable pipeline

• Programmable Geometry Processor: this stage executes the geometry shader. The geometry shader is executed once per primitive and allows the primitive itself to be emitted or discarded. So the user can directly program a stage that modifies the overall number of primitives (and therefore vertices).

• Programmable Fragment Processor: this stage executes the fragment shader (also called pixel shader), which processes a fragment generated by the rasterization operations into a set of colors and a single depth value.

1.3.8 Shaders

A shader is a program written in a special programming language that is compiled into an executable designed for execution on one of the stages of the graphics pipeline. Shaders calculate rendering effects on graphics hardware with a high degree of flexibility. Most shaders are coded for a graphics processing unit (GPU), though this is not a strict requirement. Shading languages are usually used to program the programmable GPU rendering pipeline, which has mostly superseded the fixed-function pipeline that allowed only common geometry transformation and pixel-shading functions; with shaders, customized effects can be used. The position, hue, saturation, brightness, and contrast of all pixels, vertices, or textures used to construct a final image can be altered on the fly, using algorithms defined in the shader, and can be modified by external variables or textures introduced by the program calling the shader. GPUs are optimized for shaders and for the floating-point calculations that are very common during rendering. A GPU is normally capable of performing the operations specified by a shader on several parallel cores, for example processing several vertices at the same time. Rarely, a shader runs as an ordinary CPU program; this can happen when the operations specified in the shader are not supported by the particular graphics adapter: this situation is called software fallback. Software fallback involves a low framerate and prevents complex rendering techniques, and so it must normally be avoided. As this thesis has the purpose of real-time rendering, shaders are obviously taken into account for its implementation.

With shaders, a basic graphics pipeline can be schematized as follows:

1. The CPU sends instructions (compiled shading language programs) and geometry data to the graphics processing unit, located on the graphics card. Within the vertex shader, the geometry is transformed.

2. If a geometry shader is in the graphic processing unit and active, some changes of the geometries in the scene are performed.

3. If a tessellation shader is in the graphic processing unit and active, the geometries in the scene can be subdivided.

4. The calculated geometry is triangulated (subdivided into triangles) and then broken down into fragments.

5. Fragments are modified according to the fragment shader.

6. The depth test is performed; fragments that pass will get written to the screen and might get blended into the frame buffer (in truth there are still some operations pending before a pixel arrives in the framebuffer).

When we speak about shaders, "fragment" is more or less a synonym of "pixel", with the only caveat that not every fragment (or pixel) computed by a fragment (or pixel) shader will arrive in the framebuffer and thus be drawn on the screen.

It is not easy to determine the exact number of executions of each kind of shader, even knowing what we are drawing. Suppose, for instance, that we want to draw a triangle as a single primitive and we use a geometry shader to perform operations on the primitive itself (i.e. shift vertices). We have that:

• The vertex shader must be executed once per vertex, then 3 times.

• The geometry shader must be executed once for each primitive. Then it will run once.


• The fragment shader must be executed a number of times equal to the number of pixels covered by the primitive under investigation after rasterization, potentially even thousands of times.

Shaders operate inside the graphics pipeline, so the output of one stage is the input of the next:

• The vertex shader operates on the input data (vertex coordinates and other data) coming from user-defined buffers. In general, you will use VBOs to store the data that will be processed by the vertex processor. A Vertex Buffer Object (VBO) holds vertex array data stored in high-performance graphics memory on the server side and therefore promotes efficient data transfer (a minimal setup sketch is shown after this list). The vertex shader output is sent as input to the next stage.

• The geometry shader receives as input the primitives from the primitive assembly stage. At this stage it is possible that the number of primitives changes (some will be discarded, some will be emitted), and as output we will have the new data of the new primitives.

• The fragment shader works on the data coming from the rasterization stage, on the individual fragments. As output we have data about the processed fragments (color information, depth, etc.) that will undergo other operations before reaching the framebuffer.
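A minimal sketch of the VBO mechanism mentioned in the first item of the list above (assuming an already-current OpenGL context and an extension loader such as GLEW; names and vertex data are illustrative):

    #include <GL/glew.h>   // assumption: GLEW is used to load the OpenGL entry points

    // Three vertices (x, y, z) of a single triangle in object coordinates.
    static const GLfloat triangle[] = {
        -0.5f, -0.5f, 0.0f,
         0.5f, -0.5f, 0.0f,
         0.0f,  0.5f, 0.0f,
    };

    void setupAndDrawTriangle() {
        // Create a Vertex Buffer Object and copy the vertex data into
        // server-side (graphics) memory.
        GLuint vbo = 0;
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof(triangle), triangle, GL_STATIC_DRAW);

        // Describe how attribute 0 of the vertex shader reads the buffer:
        // 3 floats per vertex, tightly packed, starting at offset 0.
        // (On a core-profile context a vertex array object would also be required.)
        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (const void*)0);

        // The vertex shader will now be executed once per vertex (3 times here).
        glDrawArrays(GL_TRIANGLES, 0, 3);

        glDisableVertexAttribArray(0);
        glDeleteBuffers(1, &vbo);
    }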

There are also variables called uniforms. Uniform variables are used to communicate with your vertex or fragment shader from "outside". Uniform variables are read-only and have the same value among all processed vertices. You can only change them within your C++ program.

Then there are varying variables. Varying variables provide an interface between the shaders. For example, vertex shaders compute values per vertex and fragment shaders compute values per fragment. If you define a varying variable in a vertex shader, its value will be interpolated (perspective-correct) over the primitive being rendered and you can access the interpolated value in the fragment shader.

Another type of special variables are the built-in variables. They are usually used for communicating with certain fixed functionality. For example, in the OpenGL Shading Language, by convention all predefined variables start with "gl_" and no user-defined variables may start with this prefix. For example, the variable gl_Position, which stores the position (4 floats) of a vertex, is already defined.
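The following sketch ties these concepts together: a minimal GLSL vertex/fragment shader pair that uses the built-in gl_Position, a uniform matrix set from the C++ program and a varying colour interpolated between the two stages, compiled and linked with the standard OpenGL calls (illustrative code, not the shaders developed in this thesis; error checking omitted for brevity):

    #include <GL/glew.h>

    // Vertex shader: transforms each vertex (built-in gl_Position) and forwards
    // a per-vertex colour through a varying variable.
    static const char* vertexSrc = R"(
    #version 120
    attribute vec3 position;
    uniform mat4 mvp;          // read-only, same value for all vertices
    varying vec3 vColor;       // interpolated over the primitive
    void main() {
        vColor = position * 0.5 + 0.5;
        gl_Position = mvp * vec4(position, 1.0);
    }
    )";

    // Fragment shader: receives the interpolated varying and writes the colour.
    static const char* fragmentSrc = R"(
    #version 120
    varying vec3 vColor;
    void main() {
        gl_FragColor = vec4(vColor, 1.0);
    }
    )";

    GLuint buildProgram() {
        GLuint vs = glCreateShader(GL_VERTEX_SHADER);
        glShaderSource(vs, 1, &vertexSrc, nullptr);
        glCompileShader(vs);

        GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
        glShaderSource(fs, 1, &fragmentSrc, nullptr);
        glCompileShader(fs);

        GLuint program = glCreateProgram();
        glAttachShader(program, vs);
        glAttachShader(program, fs);
        glBindAttribLocation(program, 0, "position");   // matches attribute 0 of the VBO sketch
        glLinkProgram(program);
        return program;
    }

    void useProgramWithMvp(GLuint program, const GLfloat* mvp4x4) {
        glUseProgram(program);
        // Uniforms are set from "outside", i.e. from the C++ program.
        glUniformMatrix4fv(glGetUniformLocation(program, "mvp"), 1, GL_FALSE, mvp4x4);
    }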


The language in which shaders are programmed depends on the target environment. The official OpenGL shading language is OpenGL Shading Language, also known as GLSL, and the official Direct3D shading language is High Level Shader Language, also known as HLSL. However, Cg is a deprecated third-party shading language developed by Nvidia that outputs both OpenGL and Direct3D shaders. Apple released its own shading language called Metal Shading Language as part of the Metal framework.


Chapter 2

State of the art

2.1 VR state of the art

It is strangely difficult to describe the state of the art with regard to virtual reality. Current development in VR technology is happening at unprecedented speed: systems and applications in the domain are presented on a daily or weekly basis. One of the most famous papers about VR was released in 1999 by F. P. Brooks[13].

In the survey conducted by Brooks, VR applications were described as "almost working" or "just working". In 1999, the outstanding tasks that were considered crucial were:

• Latency.

• Rendering massive models (where "massive" meant about 1 million polygons) in real time.

• The type of display technology.

• Haptic technologies.

What we can say today (2017) is that, in less than twenty years, some of these critical points have been overcome or a solution is near.

• Regarding latency, whose importance is crucial[14], HMD performance has improved much in recent years: where the Oculus DK1 had a latency of 40 ms, the Oculus DK2 arrived at 13.3 ms, and the Oculus Rift reaches 11 ms[15][16].

• As for the number of polygons renderable in real time, Nvidia's GeForce GTX 1080 graphics adapter can render 11 billion polygons per second ([17], 2016).


• With regard to display technology, it seems that the Head Mounted Display, which is best explained in the next paragraphs, is becoming the consumer market leader. In fact, the Head-Mounted Display Market was valued at USD 3.25 Billion in 2016 and is expected to reach USD 25.01 Billion by 2022.[18]

• Haptic technologies are the most disparate; at present (2017) there is still no dominant technology in the consumer market. Achieving true full touch simulation is not as simple as one might think at first. Our sense of touch comes from a combination of various different organs. With our hands we can determine almost everything about an object that we could determine by sight (barring its colour), and we can also tell things that we cannot see with our eyes: we can tell hardness, geometry, temperature, texture and weight by handling something.

When providing a taxonomy of the current VR hardware developments, the presented devices often exist only in a prototypical state; most of them are not yet commercially available and may even never be. Nevertheless, it is still possible to categorise the hardware and identify trends. It is not the purpose of this thesis to provide a complete taxonomy or to describe all the possible devices, so only closely related elements or generally noteworthy VR technologies will be described. Several surveys are available: a very recent survey of the use of VR in industry is [19]; general overviews of VR are [20], released in 2016, [21], released in 2015, and [22], from 2013.

2.1.1 Output devices

The preponderant category in current display technology is represented by Head Mounted Displays (HMDs). One of the classical images that comes to mind when thinking about VR is of someone with a device on the head, covering the eyes. There are currently many HMDs on the market. Most of them have stereoscopic displays and tracking systems, enabling the user to see 3D images with a large field of view and have the virtual camera move according to the user's head position. As there is one display for each eye, stereoscopic images are obtained simply by including two virtual cameras in the software. Usually, gyroscopes and accelerometers make head tracking possible. In terms of consumer VR they are all HMDs, either wired or mobile.
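As a sketch of the "two virtual cameras" idea (the GLM library and the interpupillary distance value are assumptions of this example, not specifications of any particular HMD), each eye's view matrix can be obtained by offsetting a common head pose by half the eye separation:

    #include <glm/glm.hpp>
    #include <glm/gtc/matrix_transform.hpp>

    // Builds the per-eye view matrices from the (tracked) head pose.
    // ipd is the interpupillary distance in metres (roughly 0.064 m on average).
    void stereoViewMatrices(const glm::vec3& headPos, const glm::vec3& forward,
                            const glm::vec3& up, float ipd,
                            glm::mat4& leftView, glm::mat4& rightView) {
        glm::vec3 right = glm::normalize(glm::cross(forward, up));  // head's right axis
        glm::vec3 leftEye  = headPos - right * (ipd * 0.5f);
        glm::vec3 rightEye = headPos + right * (ipd * 0.5f);

        // Each eye renders the same scene from a slightly shifted viewpoint;
        // the two images are then shown on the two displays of the HMD.
        leftView  = glm::lookAt(leftEye,  leftEye  + forward, up);
        rightView = glm::lookAt(rightEye, rightEye + forward, up);
    }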


Figure 2.1: Taxonomy of output devices. Image from [20], C. Anthes, R. J. García-Hernández, M. Wiedemann and D. Kranzlmüller, "State of the art of virtual reality technology," 2016 IEEE Aerospace Conference, Big Sky, MT, 2016, pp. 1-19. The authors have opted for a well-known tree visualisation, enhanced with icons indicating the hybrid character of the devices.

Mobile HMDs

Regarding the mobile HMDs, their main property is obviously that of being wireless and usable without an additional PC. Mobile HMDs in most cases use a common smartphone for both display and processing of data[23]. They provide a simple casing, which keeps the phone at a specified distance from the lenses. Google developed the first devices of that kind[24]. Solutions that use more than one smartphone as computing and rendering server have been taken into consideration[25]. A still unsolved issue with the mobile systems is the limited interaction.


Wired HMDs

Figure 2.2: HMDs: From left to right, Oculus, HTC vive, Playstation VR

The second big display category is that of wired HMDs. The feature list of the wired HMDs is diverse, and differentiation goes beyond the traditional quality factors like resolution, Field of View (FOV) or weight. A study regarding the visual quality of virtual reality systems is [26]. Some are equipped with cameras to allow for AR and can be used as video see-through displays, while others include eye tracking.

Besides their optical tracking, the stationary HMDs are all equipped with additional sensors. They contain accelerometers, magnetometers and gyroscopes, and use sensor fusion to combine this information with the optical tracking. These devices are often built using existing display technology from mobile phones.

The three big competitors are the Oculus Rift, the PlayStation VR and the HTC Vive. The devices have different characteristics and it is difficult to state which is better from a general point of view[27]. Some specialized online web pages have released articles about them[28].

The current stationary HMDs are typically empowered by a 6 Degrees of Freedom (DOF) tracking system provided by the manufacturer of the HMD and are connected to a powerful PC. They conceptually focus on seated usage, except for the HTC system, which promises room-scale experiences.

In terms of the actual display, the HTC Vive provides specifications similar to the Oculus Rift. The main difference with respect to the other wired HMDs lies in its tracking range: opposed to many other approaches, the Vive is designed for room-scale use, allowing different application areas.

Haptic Devices

The haptic devices cross different areas. Several approaches exist in the form of vests including vibrotactile elements, while others are clearly hybrid since they are implemented as a controller. All of these approaches are either body worn or carried and form their own sub-category. On the other hand, development in the area of ubiquitous displays providing haptic feedback has been undertaken; an example would be VirWind.

Figure 2.3: Example of Haptic gloves (GloveOne)

Multi Sensory Devices

Additional displays stimulating other senses, which generate tactile or olfactory feedback, exist as well. The suggested olfactory displays for the consumer market are body worn, either as an add-on to upcoming HMD solutions or, alternatively, combined with the display component of the HMD. Ubiquitous olfactory systems as known from the research domain, for example implemented in SpotScence (a projection-based olfactory display that delivers smells in an unencumbering way using a scent projector: the device can deliver smells both spatially and temporally by carrying scented air within a vortex ring launched from an air cannon[29]), have not yet been suggested.

2.1.2 Input devices

We can identify three different categories of input devices focused on input provision for HMD users. The main input category, unsurprisingly, is controllers. The second branch is constituted by navigation devices, which allow the user a more intuitive moving experience. Traditional controller input is often enhanced by tracking technologies, which can further be divided into full body and hand tracking. The developments and technological choices in the input devices are more diverse than in the display branch. A possible taxonomy is shown in Figure 2.4. As for the output devices, not all input devices (because of their number) are examined in the figure. In addition, controllers, which are the most common input devices, have not been used for the final demo of this thesis.


In fact, this thesis focuses on natural interaction (gestures, without any additional sensor besides the RGBD camera).

Figure 2.4: Taxonomy of input devices. Image from [20], C. Anthes, R. J. García-Hernández, M. Wiedemann and D. Kranzlmüller, "State of the art of virtual reality technology," 2016 IEEE Aerospace Conference, Big Sky, MT, 2016, pp. 1-19. The authors have opted for a well-known tree visualisation, enhanced with icons indicating the hybrid character of the devices.

Controllers

The controllers for HMDs are hand worn and provide discrete input in the form of buttons and continuous input by top-mounted joysticks or touchpads with additional 6 DOF tracking information. They may be wired or wireless.

For the early applications on the existing HMD prototypes, conventional game controllers or keyboard and mouse interaction were used. A remarkable point is that often two similar controllers are offered to the user, one per hand.

Figure 2.5: Oculus controllers

Navigation Devices

Navigation devices are used to give the user the illusion of moving through endless spaces and act as an input source for travelling through the virtual environment. The possibilities for physical navigation through virtual environments (VE) are still relatively rudimentary. Most commonly, users can 'move' through high-fidelity virtual environments using a mouse or a joystick. Of course, the most natural way to navigate through VR would be to walk. For small-scale virtual environments one can simply walk within a confined space. While traditional treadmills allow motion in one direction, the current developments in VR support motion on a two-dimensional plane: the Omnidirectional Treadmills (ODTs)[30].

Other technologies also exist, like the slide-mills, which are passive low-friction surfaces, or devices designed for walking in place or sitting (which we consider stationary since the user is not actively walking forward).

Figure 2.6: Example of omnidirectional treadmill

Body and Gesture Tracking

The posture estimation approaches focus on the actual posture of the user's body or upper body, as well as on the gestures of the user's hand. Posture estimation in consumer VR can become a critical feature in order to provide the reasonable self-representation required in HMDs. The tracking technologies are varied: there exist mechanical, optical, magnetic and inertial trackers, etc. More than one tracking system can be used at the same time; for example it is possible to use magnetic tracking enhanced by IMU (Inertial Measurement Unit) data. One quite famous consumer example is the Nintendo Wii controller: this controller uses the data coming from infrared tracking (so an optical system) plus data from accelerometers embedded in the device. In truth the Nintendo controller[31] does not track the body but only the position of the hand. Gesture tracking is an important part of the more general body tracking. The area of gesture capture offers new input approaches for VR. Gestures are captured either optically, via biofeedback or with data gloves. Data gloves are typically based on strain gauge technology using fibre optics (but they can also exploit optical active or passive technologies).

Although not initially designed for VR input, the Leap Motion[32][33] is a device which has found many applications in consumer VR. The Leap Motion is an optical hand tracker based on two cameras and infra-red LEDs covering a hemisphere on top of the device. Special kits to mount the Leap Motion on an Oculus Rift are available. These setups are used for interaction and hand representation in the VE, as well as for using the Oculus Rift as an AR display.

Figure 2.7: Leap Motion tracker

As the purpose of this thesis is to develop a natural interaction method based on optical systems, especially using only an RGBD camera, we will focus mainly on these. To not completely disregard the existence of biofeedback gloves or similar systems, we can simply say that, even granting their possible advantages in accuracy, these systems need additional hardware, imply additional costs, etc. A noteworthy advantage of using only an RGBD camera (which could in principle be easily included in the HMD body) is the absence of devices that the user has to wear except the helmet itself.

To the intuitive question "why not just use the Leap Motion for gesture recognition?" it is possible to reply that, using an RGBD camera, it is also possible to obtain RGB images that can be used to texturize dynamic meshes created at runtime. So with only one camera we can do both rendering and gesture recognition. Furthermore, as will be explained in the chapter on the Intel Real Sense platform, RGBD cameras like the F200, SR300, etc. are already on the market and integrated into many mid/high-range notebooks.

2.1.3 Challenges

The technological challenges discussed in the field of consumer VR deal with display quality, to improve the experience, and with latency issues, in order to reduce potential cybersickness. All of these problems are well known in the scientific community, but novel approaches are often undertaken in consumer VR to solve them.


Figure 2.8: PrioVR suit example. PrioVR is an in-development motion capture and body tracking suit intended for mainstream use.

Cybersickness

One of the major challenges is cybersickness[34], also called VR sickness. Virtual reality sickness occurs when exposure to a virtual environment causes symptoms that are similar to motion sickness symptoms. The most common symptoms are general discomfort, headache, stomach awareness, nausea, vomiting, pallor and sweating.

One theory that explains why this happens is the sensory conflict theory. Sensory conflict posits that sickness will occur when a user’s perception of self-motion is based on incongruent sensory inputs from the visual system, vestibular system, and non-vestibular proprioceptors, and particularly so when these inputs are at odds with the user’s expectation based on prior experience[35]. Applying this theory to virtual reality, sickness can be minimized when the sensory inputs inducing self-motion are in agreement with one another.

Research has uncovered some clear indications of certain conditions that cause VR sickness. It seems that the images projected in virtual reality have a major impact on sickness. The refresh rate of on-screen images is often not high enough when VR sickness occurs. Because the refresh rate is slower than what the brain processes, it causes a discord between the processing rate and the refresh rate, which causes the user to perceive glitches on the screen. When these two components do not match up, the user can experience the same feelings as in simulator and motion sickness.

The resolution of the animation can also cause users to experience this phenomenon. When animations are poor, they cause another type of discord between what is expected and what is actually happening on the screen. When on-screen graphics do not keep pace with the users’ head movements, this can trigger a form of motion sickness.

Another trigger of virtual reality sickness is a disparity in apparent motion between the visual and vestibular stimuli. Essentially, there is a disagreement between what the stimuli from the eyes send to the brain and what the stimuli from the inner ear send to the brain. This is essentially at the heart of both simulator and motion sickness. In virtual reality, the eyes transmit that the person is running and jumping through a virtual dimension, while the ears transmit that no movement is occurring and that the body is sitting still. Since there is this discord between the eyes and the ears, a form of motion sickness can occur.

Display Related Topics

Current HMDs take advantage of the technological advancements of mobile phone and tablet displays. These advancements reduce and sometimes eliminate certain causes of motion sickness. Resolution, pixel density and contrast values have significantly improved since the rise of smartphones and tablet computing. But these are not the only factors: one of the issues with the first available prototype of the Oculus Rift[36] was the persistence of the displayed content. This was caused by LCD display technology, where pixels under constant illumination lead to a perceptible smearing during rotation.

The issue is no longer present when using low-persistence displays incorporating OLED technology. In this case, low exposure times reminiscent of strobe lights produce a flashing image, which is sometimes still distracting but is an improvement in comparison to the blurry images produced by previous displays.

All current HMDs use lenses to increase the quality of the visualisation, but this requires some careful considerations which are discussed in the following. Using lenses to magnify a planar image causes distortion of the image. When perceiving the image through regular lenses, the result is so-called pincushion distortion: straight lines seem to bend inward towards the centre of the image, which gives it the appearance of a pincushion. This distortion has to be compensated by pre-distorting the image inversely. The inverse distortion of a pincushion distortion is a barrel distortion.

Furthermore, when using lenses to magnify displayed content, colour shifts can take place. This is due to the lenses having a wavelength-dependent index of refraction, and it produces colour smearing and distortion. It can be corrected by performing one render pass per colour channel, which obviously leads to a performance loss. This step is typically implemented in shaders.
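To make the two correction steps more concrete, the following is a minimal C++ sketch of the classic radial (barrel) pre-distortion applied to a texture coordinate, with a per-channel scale factor hinting at how chromatic aberration can be compensated. The coefficients k1, k2 and the per-channel scales are purely illustrative values, not taken from any real headset profile.

#include <array>

// Coordinates are normalised so that (0,0) is the lens centre.
struct Vec2 { float x, y; };

Vec2 barrelDistort(Vec2 p, float k1, float k2, float chromaScale)
{
    float r2 = p.x * p.x + p.y * p.y;          // squared radius from lens centre
    float f  = 1.0f + k1 * r2 + k2 * r2 * r2;  // radial distortion factor
    // chromaScale slightly enlarges/shrinks the pattern per colour channel
    // to counteract the wavelength-dependent refraction of the lens.
    return { p.x * f * chromaScale, p.y * f * chromaScale };
}

// Source texture coordinates for the three colour channels of one output pixel.
std::array<Vec2, 3> sampleCoordsRGB(Vec2 p)
{
    const float k1 = 0.22f, k2 = 0.24f;            // hypothetical coefficients
    return {{ barrelDistort(p, k1, k2, 0.996f),    // red
              barrelDistort(p, k1, k2, 1.000f),    // green (reference)
              barrelDistort(p, k1, k2, 1.014f) }}; // blue
}

In a real system this computation would run per fragment in a shader; the C++ form is used here only to show the mathematics.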


Another problem related to the display (but not caused directly by the lens) is the so-called screen door effect. Well known from projection technology, it is a visual artefact of displays where the fine lines separating pixels (or subpixels) become visible in the displayed image. This can be seen in digital projector images and regular displays under magnification or at close range, and it describes the visible gaps between the actual pixels. The effect becomes noticeable when displays are magnified through lenses. In current VR technology, especially with HMDs, this is irritating for some users. Diffuser screens can be applied to reduce the effect, with the consequence of blurring the image (for example, general-use matte screen protectors have been used).

Figure 2.9: Screendoor effect example

(author note: if the graphic quality of the medium on which you are reading this document is not sufficiently high, the screen door effect will not be visible in the image shown)

This effect can be mitigated by deliberately setting the projected image slightly out of focus, which blurs the boundaries of each pixel into its neighbours. Unfortunately this leads to another problem: when the image becomes blurry, most users attempt to compensate by constantly trying to refocus, which again leads to eye strain. In the future, the screen door effect will be solved by increasing the resolution of the displays until the gaps are no longer visible. An example of the screen door effect is shown in Figure 2.9.


Design Approaches

Many VR applications follow conservative approaches which try to avoid cybersickness issues. Oculus has released a best practice guide in their documentation explaining these and other problems, in order to improve the initial output of early-stage VR developers[37].

Cockpit views are common: they provide a static frame of reference and solve the cabling problem, since by design of the application the user is meant to stay seated. First-person view seems to be avoided in many cases. A comfort zone within the user’s reach provides space to place 3D avatars, which is the preferred perspective in many of the Rift line-up applications.

An issue that many developers struggle with is scale. In traditional 3D worlds on 2D displays, scale is not as important as long as it feels correct in the 2D projection. When developing stereoscopic 3D content, developers have to take into account a different, life-like perception of scale.

Motion is in most cases forward and the duration of acceleration is often short. Constant perspective movements should be avoided. Ultimately, every tool that lets the user control the camera motion as much as possible through their real movement is used, which limits the discrepancies between what is perceived in the virtual world and what happens in the real world. The rendered image must correspond directly with the user’s physical movements. This is one of the reasons why the first virtual reality applications and demos were roller coaster simulations or similar: in that case the discrepancy between "virtual" and "real" acceleration is complete, but all the movements of the camera are determined by the movements of the user looking around.

Tracking Systems

Traditionally widespread tracking systems, like mechanical, magnetic or acoustic systems, are losing relevance in current developments. The present issues are sensor fusion and the reduction of tracking latencies. The most prominent tracking technique in the current VR market is to optically track infra-red (IR) diodes[38].

The cable-bound HMDs use additional information from gyroscopes, accelerometers and magnetometers to stabilise the information gained from optical tracking. Sensor fusion techniques are in this case used to determine the position and orientation of the HMD. Since most smartphones currently have gyroscopes, accelerometers and magnetometers that can be accessed programmatically, mobile HMD systems exploit these to keep track of the user’s position and orientation.


Two different general approaches exist for determining an object’s position or orientation: outside-in and inside-out. In the outside-in case, trackers (e.g. cameras, when optical tracking is used) are fixed in the scene while markers are positioned on the tracked objects. For inside-out, the optical trackers are fixed on the tracked objects whereas the markers are fixed in the environment.

Another tracking solution is the IMU (Inertial Measurement Unit). Standard IMUs are inertial trackers and nowadays contain 3 accelerometers, 3 gyroscopes and 3 magnetometers, placed orthogonally to each other in order to provide six degrees of freedom. However, time-dependent drift is still a problem, and some devices use visual tracking by an external camera to mitigate it. For example, the Oculus Rift, using this technique and a 1 kHz sampling frequency[39], reaches approximately 1 mm or 0.25° accuracy.

Regarding optical tracking there are two main approaches: the common one is the passive approach, the newer one is the active approach. In the passive approach, used widely in industry and academia, passive retro-reflective markers are used and IR light is projected into the tracked area. The camera-based approaches in the upcoming systems instead use active IR LEDs on the objects to be tracked (active approach).
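As a concrete illustration of sensor fusion, the following C++ sketch blends gyroscope and accelerometer data with a complementary filter, limited to the pitch angle for brevity. Real HMD tracking fuses the full 3D orientation (usually with quaternions) and adds optical corrections; the 0.98/0.02 weights and the axis convention used here are illustrative only.

#include <cmath>

class PitchEstimator {
public:
    // gyroRate: angular rate around the pitch axis [rad/s]
    // ax, ay, az: accelerometer reading [m/s^2]
    // dt: time since the last sample [s], e.g. 0.001 at a 1 kHz rate
    float update(float gyroRate, float ax, float ay, float az, float dt)
    {
        // Integrating the gyroscope is smooth but drifts over time.
        float gyroPitch = pitch_ + gyroRate * dt;
        // The gravity vector gives an absolute, drift-free but noisy pitch.
        float accPitch = std::atan2(-ax, std::sqrt(ay * ay + az * az));
        // Blend: trust the gyro at high frequency, the accelerometer at low.
        pitch_ = 0.98f * gyroPitch + 0.02f * accPitch;
        return pitch_;
    }
private:
    float pitch_ = 0.0f;
};

The same idea, extended to three axes and combined with camera measurements, is what keeps drift bounded in the optical-plus-inertial setups described above.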

2.2 Hand Gesture Recognition

In recent years, research efforts seeking to provide more natural, human-centered means of interacting with computers have gained growing interest. A particularly important direction is that of perceptive user interfaces, where the computer is endowed with perceptive capabilities that allow it to acquire both implicit and explicit information about the user and the environment. Vision has the potential of carrying a wealth of information in a non-intrusive manner and at a low cost, therefore it constitutes a very attractive sensing modality for developing perceptive user interfaces. Proposed approaches for vision-driven interactive user interfaces resort to technologies such as head tracking, face and facial expression recognition, eye tracking and gesture recognition.

In particular, in this thesis we will focus on hand gesture recognition. Hand gestures have been one of the most common and natural communication media among human beings. Hand gesture recognition research has gained a lot of attention because of its applications to interactive human-machine interfaces and virtual environments[40]. Most of the recent works related to hand gesture interface techniques have been categorized as:


• Glove-based methods.

• Vision-based methods.

Glove-based gesture interfaces require the user to wear a cumbersome device and generally carry a load of cables that connect the device to a computer. The glove-based approach can use optical sensors (infrared light, for example) or fibre strain sensing. The computational cost of the tracking is limited.

Figure 2.10: Hand gloves for hand recognition using computer vision techniques. image from [1]

Vision-based methods provide great flexibility; however, their computational complexity is quite high. To handle this challenge, coloured markers or data gloves have been employed to simplify the vision tasks. Although these wearable markers circumvent skin segmentation, they place an additional burden on users and can feel unnatural when performing hand gestures. There are many vision-based techniques, but since we have a depth camera we will focus on techniques that take advantage of this information in addition to the classic RGB stream. In fact, the 3D depth information can be used to extract hand silhouettes for robust hand gesture recognition in a comfortable and efficient way, by simply thresholding the depth map to isolate the hands.
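As a minimal sketch of this thresholding idea, the OpenCV snippet below segments hand candidates from a 16-bit depth map expressed in millimetres (the format delivered by many RGBD cameras), assuming the hands are the closest objects to the sensor. The near/far limits are illustrative values, not tuned for any specific device.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

cv::Mat segmentHandsByDepth(const cv::Mat& depth16, // CV_16UC1, millimetres
                            double nearMm = 200.0,
                            double farMm  = 600.0)
{
    cv::Mat mask;
    // Keep only pixels whose depth falls inside the [near, far] interval.
    cv::inRange(depth16, cv::Scalar(nearMm), cv::Scalar(farMm), mask);
    // Clean up speckle noise and small holes left by the depth sensor.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN, kernel);
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE, kernel);
    return mask; // binary mask: 255 where a hand candidate is present
}

The resulting binary mask is the hand silhouette that later stages (contour extraction, fingertip detection, gesture classification) operate on.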

Other methods exist that are not taken into consideration here. One regards biofeedback sensors and the other concerns multitouch screens. The latter may be a good method for mobile devices, but it limits the distance between users and computers. Biofeedback sensors are used to detect the electroencephalogram (EEG) activity of a trained person; these sensors are often used on injured persons to assist their movements together with some other kind of device.


2.2.1 Computer Vision Techniques for Hand Gesture Recognition

Most complete interactive hand-tracking systems can be considered to be comprised of three layers[41]: detection, tracking and recognition. The detection layer is responsible for defining and extracting visual features that can be attributed to the presence of hands in the field of view of the camera(s). The tracking layer is responsible for performing temporal data association between successive image frames, so that, at each moment in time, the system may be aware of “what is where”. Moreover, in model-based methods, tracking also provides a way to maintain estimates of model parameters, variables and features that are not directly observable at a certain moment in time. Last, the recognition layer is responsible for grouping the spatio-temporal data extracted in the previous layers and assigning the resulting groups labels associated with particular classes of gestures.
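The following C++ sketch only illustrates how such a layered structure could pass data from detection to tracking to recognition. The type and member names are hypothetical; they do not belong to any library used in this work.

#include <string>
#include <utility>
#include <vector>

struct HandObservation {                         // output of the detection layer
    std::vector<std::pair<int, int>> silhouette; // pixels attributed to a hand
};

struct HandState {                               // maintained by the tracking layer
    float palmX = 0.0f, palmY = 0.0f;            // "what is where" across frames
    std::vector<std::pair<float, float>> fingertips;
};

class GestureRecognizer {                        // recognition layer
public:
    // Groups the spatio-temporal data produced by the lower layers and
    // assigns a label to the resulting sequence (placeholder logic).
    std::string classify(const std::vector<HandState>& history) const
    {
        return history.empty() ? "none" : "unknown";
    }
};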

Detection

Figure 2.11: Background subtraction in hand detection

From top left to bottom right: 1) original background image, 2) frame (N-1), 3) frame N, 4) subtraction of frame N and the original background, 5) subtraction of sequential frames. Images taken from [42].

The primary step in gesture recognition systems is the detection of hands and the segmentation of the corresponding image regions. This segmentation is crucial because it isolates the task-relevant data from the image background before passing them to the subsequent tracking and recognition stages. A large number of methods have been proposed in the literature that utilize several types of visual features and, in many cases, their combination[43]. Such features are skin color, shape, motion and anatomical
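A minimal OpenCV sketch of the two subtraction strategies illustrated in Figure 2.11 (difference against a reference background image, and difference between consecutive frames) is shown below; the threshold value is illustrative and would need tuning for real footage.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Returns a binary mask of the pixels that differ significantly between
// the two input colour images.
cv::Mat diffMask(const cv::Mat& a, const cv::Mat& b, double thresh = 30.0)
{
    cv::Mat grayA, grayB, diff, mask;
    cv::cvtColor(a, grayA, cv::COLOR_BGR2GRAY);
    cv::cvtColor(b, grayB, cv::COLOR_BGR2GRAY);
    cv::absdiff(grayA, grayB, diff);                       // per-pixel difference
    cv::threshold(diff, mask, thresh, 255, cv::THRESH_BINARY);
    return mask;                                           // foreground/motion pixels
}

// Background vs. current frame:  diffMask(background, frameN)
// Consecutive frames (motion):   diffMask(framePrev, frameN)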
