UNIVERSITÀ DI PISA
Facoltà di Scienze
Matematiche Fisiche e Naturali
Corso di Laurea Magistrale in Informatica
Master's Thesis
Exploring the Mixed Reality
capabilities of the HTC Vive
Candidate:
Davide Busato
Supervisors:
Dott. Gianpaolo Palma, Dott. Paolo Cignoni
Co-examiner:
Acknowledgments
I would like to take the opportunity to acknowledge the people who have assisted me in one way or another during this project. I will start by thanking my supervisor, Dr. Gianpaolo Palma, who was always available to support me when I ran into trouble or had questions. Thanks also to Paolo Malomo and Manuele Sabbadin, with whom I shared the room where I conducted the project, for their useful comments and the many funny moments we had. I would also like to thank Prof. Paolo Cignoni and the other professors of the courses I followed during these years for the passion and motivation they put into their lectures and for inspiring me to explore computer graphics and many other CS topics. Last but not least, I would like to express my gratitude to my friends and my family for supporting me throughout the writing of this thesis and in my life in general.
Abstract
This thesis explores the possibility of providing an experience that combines virtual and real elements when using one of the recent headsets for virtual reality. The term used to describe the intertwining of elements coming from the two realities is Mixed Reality. The range of possible uses of this technology was largely untested until a few years ago, but the recent release of many commodity devices for virtual and augmented reality has raised interesting ideas about the possibility of blending the virtual and physical worlds, providing a new type of experience to the user. My investigation was done using the HTC Vive, one of the high-end head mounted displays of the moment, leveraging both the SDK made available by its producers and some publicly available software tools. The driving idea of the project was to use the front facing camera, which is directly mounted on the Vive, as a basis for optical tracking of simple objects, in order to generate a 3D rendering correctly aligned with their real counterparts. To support this aim, tracking approaches based on feature detection and camera calibration procedures have been used. Unfortunately, some technical difficulties during the various experiments hindered the achievement of good results and highlighted the limitations of the Vive in this type of endeavor, suggesting the use of more capable hardware to achieve a better integration of real and virtual worlds.
Contents
1 Introduction
  1.1 Background
  1.2 Project Focus
2 State of the Art
  2.1 Virtual Reality
  2.2 VR Applications
  2.3 VR Systems
  2.4 Augmented Reality
  2.5 HW/SW Components
    2.5.1 Displays
    2.5.2 Computers
    2.5.3 Sensors and Tracking
3 Aim, Objectives and Tools
  3.1 Project Aim
  3.2 Project Objectives
    3.3.1 HTC Vive
    3.3.2 Unity
    3.3.3 HW/SW Technologies
4 Implementation And Results
  4.1 Coordinate Systems
  4.2 HTC Vive Setup
  4.3 Real Time Pose estimation with OpenCV
  4.4 Camera Calibration
    4.4.1 Intrinsic parameters and lens distortion
    4.4.2 Distortion removal with OpenCV
    4.4.3 Image undistortion on the GPU
    4.4.4 Camera to Head Transformation via the Tsai method
5 Conclusions
List of Figures
2.1 The Virtuality Continuum
2.2 The Uncanny Valley
2.3 Dale's Cone
2.4 VR system loop
2.5 Perception, cognition, action loop
2.6 AR feedback loop
2.7 Horizontal and Vertical FOVs
2.8 VST display
2.9 OST display
2.10 Immersive HMD
2.11 Ocularity
2.12 VR glasses
2.13 CAVE
2.14 Projection AR
2.15 Pinhole camera model
3.2 Room Setup
3.3 Tracking Representation
3.4 Vive Controllers
3.5 VR Software Layers
3.6 Unity Interface with SteamVR
3.7 VR features support in Unity vs SteamVR
4.1 Radial distortions
4.2 Vive camera image with radial distortion
4.3 Checkerboard distortion
4.4 Checkerboard with distortion correction
4.5 Undistortion Map encoded as a RGBA image
4.6 Camera pose estimation using Tsai
4.7 Projector configuration
1. Introduction
1.1 Background
Starting from 2010 we have been witnessing a new wave of revolutionary technology trends. This new wave is represented by an innovative paradigm of human-computer interaction made possible by technologies called Virtual Reality (sometimes called Virtual Environments), Augmented Reality and Mixed Reality. Even if they are certainly not new, the recent wave has been boosted by the development of better underlying technologies able to sustain their requirements and provide a convincing user experience. Equally important is the fact that recently these technologies, tools and their new applications have arrived on the consumer market and have reached a wide audience. Among the technological developments, on one side the mobile sector opened new possibilities with its small, lightweight devices, high density displays and high resolution cameras. On the other side, computer processors and coprocessors are now capable of handling fast and rich 3D graphics and other types of output data, of quickly integrating input data coming from various sensors, and of executing computer vision algorithms to detect and process human movements. All of this makes it possible to achieve the two key elements on which Virtual Reality is based: immersion and presence. Remember that Virtual Reality (VR, for short)
is a name used to denote a set of technologies where a human and a computer system interact in a way that the computer perceives the user's state and provides sensory stimuli that make him feel as if he is in an alternate reality. Immersion in this context refers to the level of engagement offered by the multi-modal experience generated by the system, while presence denotes the user's response to such stimuli and is considered as the feeling of being effectively in another place. Immersion, and then presence, are traditionally obtained with two types of devices: head mounted displays directly worn on the head by the user on one side, and space-fixed environments with big projection based panels, such as the CAVE and domes, on the other. These are used at least for immersive experiences and can be paired with other types of devices for touch/haptic input and feedback, but one can also consider other forms of VR where the level of immersion is lower, such as a PC desktop configuration with traditional mouse and keyboard, or at most with shutter glasses for stereoscopic image viewing. Anyway, today the core component of a true VR experience is the head mounted display (HMD, for short), in which the sensory stimulation happens through the combination of its panels and optics elements, which provide a pair of stereoscopic images to engage the user's sight, coupled with stereophonic headphones for sound and special handheld devices for touch, used both for input and for output feedback. The immersive effect is sustained by a tight loop between the user and a computer system connected to the HMD: the user interacts with the perceived environment, the system tracks his state using various pose sensors on the HMD and on other devices worn by the user, and then it generates new sensory stimuli to maintain the user's feeling of presence. So, for example, the system enables the user to look around in the virtual environment while the system maintains a dynamic internal model of the environment and of the user. The underlying enabling technologies guarantee the equation "Fast refresh rate + accurate user tracking + stereoscopic 3D rendering = VR can achieve presence".
Back in 2010 the first prototype of a new HMD called Oculus Rift was developed with the aim of obtaining a low cost product with a wide field of view, low latency and a not-so-bulky design, at least in comparison with the alternatives available at the time. It was with that prototype that VR (re)gained attention and people started realizing that VR could finally be ready for a wider audience. Since then many VR companies entered the consumer market and developed their systems. Last year Facebook, HTC and Sony released their own products, namely Facebook's Oculus Rift, Valve and HTC's Vive, and Sony's PlayStation VR (initially known as Project Morpheus), and for this reason 2016 has been dubbed the "Year of Virtual Reality". These three headsets have high refresh rates (90-120Hz), modern display technologies (AMOLED panels) and fast tracking mechanisms that rank them as high end VR devices. In addition to these, there are also simpler solutions aimed at easier portability and more casual users: their main feature is that they are mobile based VR devices, so a smartphone has to be inserted in a slot in the headset and provides both the device screens and the processing capabilities. Examples are Google's Cardboard, the more recent Google Daydream View and the Samsung Gear VR. Their quality and the level of engagement offered depend mainly on the smartphone used; they are essentially used as 360° photo or video viewers, so they are fairly limited when compared to the previously cited top three headsets, but their low price contributes to the diffusion of VR concepts. On top of these and other VR devices there is a rising field of new applications. The experiences that VR creates with its simulated worlds, alternate realistic or
imaginary realities, the amount of data that a VR system can record about human behavior, and the new learning and communication possibilities of this new medium open the door to many ideas and experiments. Entertainment and video games come immediately to mind in this respect (this business actually accounts for a large part of planned sales for both the software and hardware VR industry), but the power of simulation proved to be useful and has been applied to many other endeavors as well, such as personnel training in military and medical surgery environments, industrial prototyping, marketing and advertising to create a new type of connection between customers and products, education, where VR technology can help in exploring difficult or complex data/phenomena or historical settings, and also studies about human psychology and cognition, human-computer interaction or human relations in shared virtual environments.
Next to immersive virtual reality, big opportunities for application developers rely on Augmented Reality technologies, which are meant to combine real and virtual content in an interactive way. In fact, immersive applications usually restrict and isolate the user from his daily context and are probably better suited for special and niche activities (the primary force behind the VR market today is the big base of gaming enthusiasts who want to reach new levels of immersion in their favorite games). Instead, an augmentation of our real life experience with digital content that provides personal data, or useful knowledge contextualized with what we see, could prove to be more successful, at least among a wider public. For example, Augmented Reality (AR, for short) could be used as an interface to interact with and control the modern smart objects in our physical environment. From a technical point of view, Augmented Reality poses some unique challenges due to the requisites it tries to meet. Aligning virtual content with the real world is not a simple problem, since it requires accurate tracking of various systems (user, objects, the world), fast integration of data coming both from tracking and from internal models of the augmented environment, and the combination of artificial and real stimuli (at least, in visual terms). Complex computer vision algorithms are often used in conjunction with optical sensors like cameras to recognize real objects and determine the poses for virtual graphics overlays. In this context, obtaining a high degree of reliability, especially with dynamic changes, is a true challenge (not to mention, for example, the evaluation of illumination conditions to provide correct virtual shadows). As for the integration of the two types of visuals, different types of displays are used. Smartphones or tablets with their handheld screens can provide a window on an augmented reality, while projectors can be used to project symbols, text or new textures directly on real surfaces. Head mounted displays are also used for AR. Usually these HMDs are classified into Optical-See-Through displays, which use a transparent material that does not occlude the view while generating the virtual overlays on it, and Video-See-Through displays, which capture frames of the real world from a camera, process them adding virtual elements and produce the combination on screen panels near the eyes. The current leader among AR devices is the Microsoft HoloLens, classifiable as optical-see-through with holographic waveguides that create a 3D depth effect, and capable of accurate tracking. However, this and other modern HMDs for AR still have some latency and a limited field of view.
AR and VR are now considered complementary to each other, but it is believed that they will converge at some point in the future. Consider for example what happens if an AR system is equipped with the ability to scan a real environment and reproduce it, or part of it, in digitized form, leaving only some real elements in the final scene presented to the
user. This would be considered a sort of augmented virtuality controlled by a device that could let the user select among different levels of real and virtual.
1.2 Project Focus
For my project I had the possibility to work with one of the high-end VR HMDs of the moment, the HTC Vive. Firstly, I tried several application examples to explore some of its features and potential. In particular, I took an interest in a novelty of this HMD, not available in competing products on the market: the presence of a front facing camera mounted directly on the main headset. I was already aware of this feature, having read some news about it in the period before the official release, and I was thinking about the possible implications of its use. It turns out that the official use of this camera is to film the area immediately in front of the wearer, allowing him to look around without having to take the device off and put it back on, with the aim of helping him avoid unpleasant collisions with physical obstacles. The camera has no depth sensing capabilities; it is only an RGB camera which acts in a pass-through mode, in the sense that it grabs RGB frames with minimal processing and passes them directly to the VR application, where they can be seen on a 3D floating window inside the virtual world, or, for a more interesting effect, these frames can be paired with an image processing step which applies a sort of edge filter to enhance the silhouettes of visible objects and presents the result overlaid on the currently displayed VR scene with a "Tron mode" effect. There are no depth cues generated by the different parallax of the two eyes, because what we get from the camera is only a single 2D image; nevertheless, the user's movement still provides some parallax that gives a sense of depth when viewing the nearby area in this mode. It was this mode that suggested to me the possibility of combining real and virtual elements in applications for the HTC Vive. Think for example of a game where you can grab the controllers or other real manipulable objects and see them come into existence also in the virtual world. The possibilities for blending real and virtual with the Vive were still untested at the start of the project, so I accepted the challenge to explore some routes in this context, focusing the research on the tools made available by the Vive's creators and on some standard vision techniques for tracking. The intent was to obtain some results for mixed reality using only the commodity hardware of the Vive, exploring some of its capabilities for this type of endeavor, inspired by the fact that its default pass-through mode, even if enhanced with image based processing, may not be its only useful application.
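As an aside, the following is a minimal Python/OpenCV sketch of the kind of edge-enhancement pass described above. It assumes the camera frame and the rendered VR frame are already available as same-sized BGR images (the actual frame grabbing through the Vive/SteamVR API is not shown), and the function name and thresholds are illustrative choices, not those of the Vive software.

import cv2
import numpy as np

def tron_overlay(camera_frame_bgr, rendered_frame_bgr):
    """Overlay the silhouettes of a camera frame on a rendered VR frame.

    Both inputs are assumed to be 8-bit BGR images of the same size.
    """
    gray = cv2.cvtColor(camera_frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)   # reduce noise before edge detection
    edges = cv2.Canny(gray, 50, 150)           # binary edge map of the real scene
    overlay = rendered_frame_bgr.copy()
    overlay[edges > 0] = (255, 200, 0)         # paint the detected edges in a bright color
    return overlay

The resulting image can then be shown on the headset panels instead of the plain rendered frame, giving the outlined view of the real surroundings described above.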
2. State of the Art
2.1 Virtual Reality
In the last few years innovative and more convincing technologies for virtual reality have finally arrived. They are leading to a new revolution in the way we communicate and interact with computers, with a potentially bigger impact than the personal computer revolution of the 80s or mobile phones/computing of the 00s. Now, if we ask ourselves what VR means, we see that it is not so simple to define, for various reasons, starting from the fact that the term appears to be an oxymoron. Certainly there are as many definitions as there are people who try to describe it. We are probably accustomed to what popular media say about it: for example, virtual reality is presented as a technology that can place people inside computer simulated worlds or realities. But what is it and how can it realize this? Let's try first to define what VR is, without making any reference to the technologies that make it possible today. This is necessary to provide some understanding of the various dimensions taking part in the human-VR interaction. We can start by considering what reality and virtual mean, although even these two terms alone are not easily describable. Ignoring complicated philosophical implications, we can say that reality is something we perceive through our day-to-day life experience, whereas for the term virtual, an often cited dictionary definition states: "something that is being in essence or effect, but not in fact"([1]). Is it possible to associate these two terms? And how? The answer is yes, and from this question we can start thinking about the idea of using our imagination to build one important element of every virtual experience: a description of a virtual world with its objects, characters and rules. A world which is not real but plausible, in the sense that, once we are inside it, it seems as real as possible to our human senses. Obviously an imaginary world can't be real; however, VR aims to present it in a coherent way to the participants, tricking their senses and mind into thinking that they are living a kind of dream/experience in full awareness. What are the possible types of worlds? We can think of:
• Simulated world, in which certain aspects of the real world are simulated to achieve some task, such as training or testing some prototype.
• Symbolic world, in which structural or other properties of some entities or concepts of a phenomenon are presented to the user in a way that enhances his mental model of a particular environment.
• Imaginary world, an unreal/fantastic world in which the physical laws of time and space as we know them may not be valid, since they can be entirely established by the VR designer or by the VR user himself, and so there is no restriction on what one can do.
These worlds are forms of realities alternative to our daily one, but they need not necessarily be separated from the real world. We can consider the possibility of a fusion which allows us to see a sort of continuum between the real and the virtual, where we have some mixed realities characterized by the coexistence of real and virtual elements
at various degrees. When real elements are prevalent and we use the virtual ones to augment the real, we speak of augmented realities; conversely, when the real augments the virtual, we speak of augmented virtualities. Milgram and Kishino([2]) define it as follows:
Mixed Reality (MR, for short) involves merging of real and virtual worlds somewhere along the “virtuality continuum” which connects completely real environments to completely virtual ones.
Figure 2.1: The Virtuality Continuum.
Virtual worlds are part of modern communication media through which people can live the experiences, ideas and concepts defined by some source. This high level form of communication is based on lower level forms, a sort of energy transfer between the user and another entity, called the VR system. Since today's VR systems are based on current hardware and software technologies, those energy transfers, on which the VR experience is based, are forms of human-computer interaction. A first task a VR system has to deal with is human perception. Virtual worlds contain objects, entities and characters which have their properties or attributes, such as weight, color, sound, texture, shape and temperature. These properties can be communicated to the user through various senses: color through sight, sound through hearing, and so on. The number of involved modalities, the congruence between sensations and their quality are measures of the actual degree to which a VR system generates stimuli onto the sensory receptors of users. On one side, there are these measures of the level of engagement offered by the system and by its technologies to guide the user's mind. Using these degrees of engagement, VR experiences can also be classified by the level of immersion as low-immersive, semi-immersive or fully immersive VR. On the other side, there is the user, the organism who receives the stimuli, reacts and interprets them. The subjective response to immersion is called presence. The user can have variable degrees of presence: he can perceive the connection with the computer system, or he can discriminate between real and virtual, or he can be so immersed as to consider the experience as lived and not simply perceived. This last case is usually associated with a VR experience with a high level of presence, where one has the idea of being in a different place than the original real one, with a physical and mental state of unawareness of the real world and of the interference of the VR system. Presence can be conceived as a form of illusion. Several components contribute to sustain this illusion:
• Physical presence or place illusion([3]): the feeling of being in a real place occurs when the user senses the virtual environment and its objects as if they were in a 3D real space. The point of view, visual parameters and visual cues are important elements of immersion to convey this type of presence. So, for example, if the user moves, the visual, audio and haptic stimuli change as the virtual scene also moves; objects become larger and louder when moving toward them, and turning the head shows the world to the left and right of the user,
• Self Embodiment: presence improves when the user has the perception of his own body and when it matches his movements, for example when visual and physical motions match. An interesting fact is that presence doesn't necessarily degrade when the body doesn't look like the real user's aspect: the human mind can easily associate what it perceives at the location of the user with his body; on the other hand, motion match is more important,
• Physical interaction: looking or hearing around is not enough; to increase presence the VR system should also provide direct sensory feedback to users according to their location and actions. For example, objects should become manipulable when the user touches them,
• Social and Behavioral Presence: the illusion of communicating with believable characters (computer or human controlled) in the same environment;
Presence does not necessarily require photorealism in representation fidelity; more important factors are environment responsiveness, visual/physical match and depth cues. Representation fidelity obviously moves along a continuum, where we have photorealism and a replica of the real world at one end, and an abstract, surreal world at the other. We should also consider the fact that photorealism and real reality may not be the target we are trying to achieve with VR, but a target we are trying to surpass in some way. It has also been observed that user comfort, in the presence of characters simulating real people, increases as their representation fidelity rises, but if full realism is not reached, creepiness and revulsion emerge due to small differences in their appearance. This phenomenon is known as the Uncanny Valley, as proposed by M. Mori([4]).
Figure 2.2: The Uncanny Valley.

Since it is not simple today to provide an exact appearance of ourselves in a virtual experience, we should expect a tendency to use more cartoon-like characters instead of near-photorealistic humans. Similar concepts can be applied to interaction fidelity, the correspondence between the physical actions for a virtual task and the physical actions for the equivalent real task([5]). Here, as for the previous type of fidelity, the chosen degree of fidelity depends on the original design requisites of the VR system. For example, if magical powers were to be given to the user, the VR system could let him grab and move things at a distance.
Watching television or a portrait on the wall are activities that share some common features with VR in their ability to capture the senses and the mind of the human, but in some way these activities are not complete, because they require a certain level of knowledge or past experience to be fully comprehended and they exhibit a lack of interaction. So, VR aims to go a step further in this direction, making the learning and communication experience more complete. VR is a truly interactive medium of communication; this, and the increased sensory stimulation through sight, hearing and touch, makes possible an active and dynamic learning experience that can help fill any gaps in the user's knowledge while he's trying to understand the virtual subject. For example, the participant may want to change perspective, explore around, rotate and flip objects, flip switches, etc. Furthermore, in virtual worlds users can potentially modify time and space, change the flow of events, and interact freely with multiple participants. All of this allows for a better understanding of a story, learning a concept, practicing a skill, or simply greater fun.
Figure 2.3 shows Edgar Dale's Cone of Experience. As can be seen, there is a range of abstraction degrees, from concrete at the bottom to abstract at the top, which describes a progression of learning experiences starting from direct purposeful experiences to less direct and more abstract ones based on text, verbal symbols and pictures, that together leverage both motor and cognitive skills. The idea is to propose these many levels of learning through a VR experience.
2.2 VR Applications
As a new paradigm of human communication and human-computer interaction, VR grabbed the interest of many people and spread across several application areas ([6, 7]):
• Entertainment industry: one of the first and most prominent industries to use and promote VR techniques. Gamers' dream of entering and living in their favorite video games is certainly not new. VR games usually present a first or third person perspective of the gamer's avatar and allow the exploration of large and realistic worlds filled with characters while evolving the game's plot. Similar ideas inspired applications for immersive cinema experiences.
• Industrial environments: VR can be used to test and perceive prototype products through virtual prototyping, to train operators in safe and variable contexts, and to employ telepresence, a specific kind of VR that simulates ways of interacting with a remote real environment through telemanipulators, useful for operating in hazardous environments.
• Marketing and advertising: present products, enhance product-client interaction and study client responses and feedback.
• Travel, Tourism: VR can give a taste of faraway holiday places.
• Visualization: VR makes multidimensional data perceptible and easily accessible for humans, which is useful for scientific, engineering, medical and architectural purposes.
• Training, Education: simulating flight or driving for civilian and military purposes, touring physical museums and exhibiting artworks in virtual museums (digital heritage fruition). Another interesting idea is using the first person perspective provided by VR to allow people to feel empathy for someone else's situation.
• Medicine: surgery simulators for medical students help them practice their difficult and risky work. Virtual patients can be modeled on real ones to simulate the intervention before the actual surgery. A surgeon can use haptic interfaces (which look and feel like actual surgical instruments) to practice surgical procedures or to guide surgical robots. VR can also be combined with special devices to support treatments and rehabilitation procedures, both for physical/motor injuries and for psychological traumas. In fact, one of the most popular applications is the treatment of phobias and traumas by exposure to the situations that cause negative feelings.
2.3 VR Systems
The interaction is made possible by the ability of a VR system to respond to user actions. Just to support this important requirement related to interactivity, a VR system combines several hardware devices and human-computer communication techniques. As we will see below, it's a matter of continuously transforming information from one domain to another (correctly and preferably as fast as possible). Typical VR systems use a closed loop approach to handle interaction: the sensory stimulation and the user's actions are reciprocally dependent (video games provide a famous example of a closed-loop approach). Starting from the user, special input devices or motor interfaces are needed to implement a translation from human output, such as location, head and eye orientation, hand positions, etc., into a digital input that allows the system to track the user's state.
Figure 2.4: VR system loop.
A software application running on computer hardware can then handle the simulation's aspects, such as performing collision detection or updating dynamic geometry, AI driven characters and user interactions. Software algorithms perform all of this by updating an internal computer based representation of the virtual world involved. Keeping the interactive nature of the experience means that the system has to translate the world back from a computer format to a user friendly format. This transformation is called rendering. Typically a VR system performs visual, aural and touch rendering. To close the loop, special output displays or sensorial interfaces, such as screens, audio speakers, headphones and haptic devices, are used to convey the rendering results and notify the user about the progression of the virtual world.
This loop can be considered a transposition of the "perception, cognition, action" loop ([8]) of human behavior in the real world. To guarantee a high quality transposition, the system should minimize any inconsistencies between a user's reaction to real stimuli and to artificial stimuli and execute its loop in a stable and fluid manner. This imposes some real-time constraints on the system. Failure to meet these types of requirements means having a noticeable latency and possibly introducing a time lag in the user's perception. Along with inaccurate device calibration or inaccurate tracking, this can result in a break of the user's presence and destroy the VR experience.
When designing a VR system, three fundamental aspects must be analyzed and modeled:
• the human activity in real and virtual environments,
• the interfaces for immersion and interaction,
• the application content.
2.4 Augmented Reality
Even if one can imagine a myriad of applications for immersive virtual reality, other forms of realities, along the previously cited continuum, can be useful for human activities. Consider the fact that immersive virtual reality restricts the user to a specific place and removes external influences. This type of escapist dream is certainly intriguing, but it is unlikely to be the only useful application one can think of. There are many areas where the idea of combining real and virtual in daily life would be very useful. Augmented reality is the name given to the set of contents and technologies developed to support this idea. In AR the physical world is partially overlaid with digital information such as text, symbols, images and sounds, and less frequently touch or smell and taste effects. One thing that distinguishes AR from classic VR or augmented virtuality is the contextualization or localization of the digital contents with respect to the user's environment. Furthermore, digital information requires a precise control and alignment with the corresponding real entities it augments. To this end, a registration procedure is used. It guarantees a convincing overlapping of virtual elements even when the user's viewpoint changes. Registration is also used because AR requires a very high level of interactivity: while the user explores a scene, the AR supporting system should be able to continually present a visualization registered with the objects in that scene. This interactivity is backed up by a tightly coupled feedback loop. In this loop, AR systems continuously perform sensing and tracking of features related both to the user and to the environment through various types of sensors and software algorithms, they handle input and commands coming from the user, and possibly access the network to download and upload data. Virtual content registration and visualization close the loop. A spatial model of the current real scene and its virtual augmented data is maintained in some internal database of the system. At each loop iteration this model is processed to present a seamless real-virtual integration: it helps to handle collisions, occlusions and shadows between objects.
2.5 HW/SW Components
The hardware devices for VR/AR systems are classified as Displays, devices that produce stimuli for the human senses, Computers, devices that process inputs and outputs, and Sensors, devices that acquire information from the real world. In some cases these components are installed into a single device.
2.5.1 Displays
There are many special hardware devices used to bring the rendered sensory output produced by the VR system to the user. Let's consider mainly those made for vision, our dominant and most sophisticated sense.
Visual Perception
About 70% of the overall sensory data sent to the brain is delivered by human vision([9]). Let's mention some notable facts about the visual system, as this is useful to understand some design principles of VR displays for vision. Schematically, our vision system consists of the two eyes, which receive the visual stimuli and propagate them to the brain, which interprets them. A first property is the angular size of the field of view, known as FOV. Horizontally, it is usually 120° for each eye; considering the overlap, we have a total FOV of 200°-220°. Vertically it is approximately 130°. The area where the two FOVs overlap is called binocular overlap. In that area, the use of both eyes allows us to perceive the so called binocular depth cues: the most important one is called stereopsis. Given the disparity conveyed by the left and right images, the human brain is able to deduce an estimated depth of visible objects using the change in their relative angles, especially for nearby objects, which have a bigger angular offset.
Figure 2.7: Human horizontal and vertical fields of view.
The area of best visual acuity is called the fovea. It covers about 1-2°. Outside its area acuity falls rapidly, but we usually move our eyes in a range of about 50°, or we directly move our head, to regain it. Humans can also accommodate a large dynamic range of intensities, expressible as a ratio of about $10^{10}$ between the maximum and minimum light intensities that our eyes can perceive ([10]), covering different viewing conditions, from dark to bright.
Visual Displays
A first classification of visual displays is based on the virtuality degree offered:
• Fully Immersive: stereoscopic displays used for fully immersive classic virtual reality. The real world view is completely occluded. They are equipped with sensors to track head position and orientation in order to update the user's view in the virtual space.
Figure 2.8: A VST display captures the real world, modifies the image electronically and combines virtual and real.
Figure 2.9: An OST display uses an optical component to overlay virtual elements on a user's view of the real world.
• Video See-Through (VST, for short): this term denotes fully immersive displays, potentially stereoscopic, used for AR or, more generally, a method to obtain virtual augmentations in MR applications. The output images are generated by combining real images captured by video cameras with virtual elements generated by a graphics processor,
• Optical See-Through (OST, for short): displays used for augmented reality. Users see the real world directly through optic elements. Text, symbols and other computer generated graphics are overlaid on the optical material to augment the real view. This and the previous type are examples of augmentation methods used in AR systems.
A common display device for vision is the HMD, or Head Mounted Display, a fully immersive device. The user wears it on the head and its display elements are positioned in front of the eyes. HMDs come in several form factors, sizes and configurations; they consist of at least one image source and optics components (displays and lenses).
Figure 2.10: A binocular HMD (here the Oculus Rift) for immersive virtual games.
From being available only to laboratories for a niche market and being equipped with low resolution displays some decades ago, their production has moved to a mass consumer market thanks to the recent development of cost effective high resolution displays for smartphones and tablets. Since then manufacturers have applied many imaging and display technologies within their HMDs: from Liquid Crystal Displays (LCD) to Active Matrix Organic Light Emitting Diode (AMOLED) panels, Digital Light Projectors (DLP) and Liquid Crystal on Silicon Microdisplays (LCoS). These technologies enabled flexibility and easy integration within the small form factors of near eye displays.
Modern HMDs can be customized on different features relating, for example, to the display channels, tracking sensors and performance. A common classification is based on ocularity, that is, the number of eyes they serve:
• Binocular: double viewing channel, one for each eye, with slightly offset viewpoints to create the stereoscopic view which is the common approach to convey 3D content to the user. Among their features are a large field of view, binocular overlap and possible depth cue effects. They are not always very lightweight and their information processing can be intensive, but they provide an immersion effect.
• Monocular: single viewing channel, one eye involved. The other eye can see the real world around. They have small form factors, but they provide no stereo depth cues (no stereopsis) and no immersion.
• Biocular: single viewing channel, both eyes involved. Used when immersion capability is required, but without stereopsis.
Figure 2.11: Monocular, Binocular, and Biocular configurations, illustrated here with night vision goggles.
HMDs are not the only visual displays used today. If we consider their placement relative to the user, moving from wearable devices in head space to body space devices, we identify the handheld devices, such as smartphones and tablet computers, which, thanks to technology miniaturization and their spread, have become the main platform for AR applications. Equipped with a back-facing camera, several processors and a visual display mounted in a single casing, these compact devices can be used in a video-see-through mode: the camera captures real images, the processor combines them with virtual elements using a proper alignment and presents the result through the display. Smartphones can also be used as near-eye displays in cardboard headsets:
their displays, plus added optical elements and cheap components, realize a low-cost HMD system where stereoscopy is usually enabled by a software application running on the smartphone processor. In this configuration they can be used for VR applications in a fully immersive mode or, by using their camera, as Video See Through AR devices. For VST AR, the idea is that once one is able to extract frames from a camera stream and track the camera's pose relative to one or more objects, a display backed by a computer connected to such a camera constitutes a minimal setup for building a video-see-through display.
Figure 2.12: VR glasses for smartphones.
At a farther placement from the user we identify world space displays. Usually located at fixed positions, these displays come in several configurations and sizes: from simple PC desktop monitors and large multi-panel displays to hemispherical surfaces, each displaying stereo imagery. Usually these configurations are designed to be enjoyed not only by a single user but by a group of people. While for HMDs the display optics deliver a separate image to each eye, for world space displays users wear special glasses which are synchronized to receive the correct image for each eye. Interaction is possible by tracking a selected user's pose, which drives the movements and viewpoint changes, while for the other participants the experience is obviously a bit more passive.
Figure 2.13: Examples of CAVEs (Cave Automatic Virtual Environment), walled, projector-driven 3D display systems that allow users to move around and see their graphics data.
Not all world space devices rely on 2D electronic screens or panel based devices. There are other types of displays, called volumetric 3D displays ([11]), which render images directly in 3D space by exploiting the properties of light wave patterns. Holographic displays and light fields are advanced examples of these approaches. Spatial AR is another technique which provides virtual augmentations over the real world without using any explicit screen, relying instead on projector displays. It uses a digital projection mapping technique to apply patterns of light, emitted by a projector, onto real object surfaces, enhancing their aspect by adding details, shadows or animated textures. The simplest version doesn't require projector tracking as long as the scene remains static. Other variants of spatial AR require the user's pose tracking ([12]) and let virtual augmentations appear not only on surfaces but also in free space as 3D objects, object tracking to project on moving objects, projector tracking to allow a dynamic working volume, or the use of HMDs with an integrated projector which casts light patterns onto objects whose surfaces are made of retro-reflective materials.
Figure 2.14: Projection based AR. Camera sensors capture images which are then used to generate an aligned virtual imagery to be projected directly on the real objects. The user does not wear any equipment, except possibly for some pose sensors.
2.5.2 Computers
Computers in this context denote mainly the hosts which execute the acquisition and integration of sensor data, the virtual world simulation and the output generation for the various sensorial displays. From a hardware point of view, fast processors, fast memories and fast graphics units are required to keep a small lag between the user's actions on the input interfaces and the perception of the responses through the output interfaces. In a VR/AR setting the location of computers must also be addressed, especially for non-fixed display configurations. Some configurations separate computers from the other sensorial devices, so one needs to consider how to place them in a non-intrusive way. Associated problems are how to power the system itself and how to provide fast and reliable communication channels with the other devices, for example with a headset. Currently these connections are made by wires, since wireless speeds are not sufficient, but fast wireless connections are actively under study today. Smartphones and tablets are equipped with moderately simpler computational units with respect to traditional standalone desktop or laptop computers, but they are self contained and they are able to perform tracking with their sensors and virtual graphics generation autonomously, so they can be used for augmented reality or combined with a case with lenses for a VR experience. More complex operations or bigger data models are usually moved to an online server which can provide the needed information through an internet connection. In this case the latency of the connection is crucial. Even with traditional desktop PCs, a single server or multiple servers can be used to manage the virtual world content, possibly leaving only graphics and input handling to the connected clients.
Camera Representation
As an aside, let’s see a geometric model used in computer software to generate a visual graphics display. It is a model that should resemble how we see the world with our eyes, or at least, through a video camera, and it is based on the notion of a virtual camera. Keep in mind that for
abstraction or simplification purposes, virtual cameras in many graphics applications are defined in a simpler way than how real cameras work. The common model used is the pinhole camera model for perspective projections. Its features are a center of projection called the optical center, an image plane, and the projection of the optical center on the image plane, a point called the principal point. The optical center is also the origin of the camera frame. The model tells us how to project a 3D point q of an observed object, expressed in its object space, to a point p in a target image space. The projection can be expressed in the following form using homogeneous coordinates:
$$\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} \propto M \begin{pmatrix} x_o \\ y_o \\ z_o \\ 1 \end{pmatrix}$$
where $M$ can be decomposed into the following product of matrices: $M \simeq K\,C\,O$
where, starting from the right, we have first a 3x4 matrix $O$, the object matrix, which transforms the coordinate vector denoting the point q in its object space into another coordinate vector expressing the same point in world space. Then we have another 3x4 matrix, $C$, which transforms the coordinate vector from world into camera space. Note that the inverse of $C$ describes the pose of the camera in world space and corresponds to the so-called extrinsic camera parameters. Finally there is a 3x3 matrix $K$, which combines a standard projection matrix, projecting from the 3D camera space onto the continuous 2D image plane, with some numeric factors to obtain the final pixel coordinates. It is this last matrix that uses the so-called intrinsic camera parameters: the focal length combined with the pixel scale factors, the local center of the image $(u_c, v_c)$ and a skew factor $s_\theta$ for the image plane axes.

$$K = \begin{pmatrix} f s_x & f s_\theta & u_c \\ 0 & f s_y & v_c \\ 0 & 0 & 1 \end{pmatrix}$$
These parameters are properties of the camera itself, so assuming no change in the camera configuration, they can be calibrated in an offline step, while the other transformations, which depend on the pose of the camera or of the object in space, usually change much more frequently and have to be tracked.
Figure 2.15: Pinhole camera model for perspective projections.
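As a concrete illustration, the following short Python sketch applies the pinhole model just described to project a 3D point to pixel coordinates. The intrinsic values and the camera pose are made-up placeholders (they are not the calibrated parameters of the Vive camera), and the object matrix O is assumed to be the identity, so object and world space coincide.

import numpy as np

# Hypothetical intrinsic matrix K (focal length in pixels, principal point, no skew).
K = np.array([[450.0,   0.0, 320.0],
              [  0.0, 450.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Camera extrinsics C = [R | t]: world-to-camera rotation and translation.
R = np.eye(3)                          # camera aligned with the world axes
t = np.array([[0.0], [0.0], [2.0]])    # world origin 2 units in front of the camera
C = np.hstack([R, t])                  # 3x4 world-to-camera matrix

def project(point_world):
    """Project a 3D world point to pixel coordinates with the pinhole model."""
    p = np.append(point_world, 1.0)    # homogeneous coordinates
    uvw = K @ C @ p                    # (u*w, v*w, w)
    return uvw[:2] / uvw[2]            # perspective divide

print(project(np.array([0.1, -0.05, 0.0])))   # a point slightly off the optical axis

Swapping in a calibrated K and a tracked [R | t] is exactly what a virtual camera in an AR/MR renderer does to keep the synthetic view consistent with the real one.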
2.5.3 Sensors and Tracking
Some parameters that define a VR/AR system can be calibrated offline. The term calibration denotes a procedure used to find some reference values or to adjust the values for one or more parameters of a device or
of a tool so that it delivers its measures on a correct scale. For example, in a geometric setting one could calibrate a fixed object transformation by finding the relation between the object's coordinate system and a reference world system. Calibration is usually performed at discrete times: it may be once in a device's life or every time before starting to use it. It may be man-operated or auto-executed. Anyway, many parameters of a system require a continuous act of measuring: for example, the spatial properties of moving objects or of the user himself, which can change at every iteration of the system loop and require a dynamic registration through some kind of tracking technology. Sensors are devices that can be used for tracking purposes. In general, they are used to take measurements related to physical properties of the environment in which they work: measurements of radiations such as visible or infrared light, radio signals, magnetic flux, but also sound, gravity, inertia, distances and angles. Signal sources can be both passive, for example natural light, or active, i.e. made by some other device requiring a free line of sight to reduce perturbation. In any case, for each signal, the strength, direction, time of flight or phase can be measured. For pose tracking in a VR/AR system we need sensors able to deliver six degrees of freedom (DOF), three for position and three for orientation. Mechanical sensors are among the oldest approaches. These stationary devices are based on a chain of mechanical arms manipulated by the user from one end; from the known lengths of the arms and the tracked angles at every joint, the position and orientation of the end effector can be computed. More modern devices used today for tracking purposes are the GPS (Global Positioning System), which measures the time of flight of satellite signals to compute longitude, latitude and, less frequently, altitude, the wireless, mobile and other network infrastructures, which can send position-related information, magnetometers, which measure the earth's magnetic field to provide an orientation measure, and inertial sensors: these types of sensors are sourceless, meaning they do not require interaction with an external source. Examples are gyroscopes, which are devices for measuring rotational velocity and computing orientation, and accelerometers, devices used for position and inclination computations. Another type of tracking approach is optical tracking. Made possible by modern ubiquitous digital cameras supplying millions of pixels, optical tracking is nowadays part of almost all AR systems. Optical tracking encompasses many techniques in which cameras are used to capture the movement of objects, often using computer vision algorithms to analyze image sequences. A common application of cameras in AR is the computation of the camera pose, used to render a 3D object as an overlay over a real scenery, using specific artificial features called markers or fiducials. The range of optical systems goes from systems that use cameras to track active or passive light markers on objects, to systems that do not use markers at all but rely only on natural feature detection, to systems that project structured light patterns, actively highlighting interesting features and then capturing those patterns with a camera, to laser ranging techniques that can measure the time of flight of a light impulse going from a source to a receiver bouncing on a surface. Illumination conditions are important factors to consider: relying only on natural light sources, the sun or artificial lights can hinder the quality of the captured images, leading to unsatisfactory visual feature detection. A common countermeasure is using some active light sources, typically infrared LEDs, together with a camera sensor equipped with an infrared filter, able to obtain high contrast images. Infrared LEDs can be mounted at fixed positions near the cameras, illuminating the tracking area where objects with retro-reflective materials reflect the light, making them visible and easily detectable by the
cameras, or LEDs can be mounted directly on the tracked objects, as long as they are powered with batteries or in some other way. Moreover, we can make a distinction based on the location of the cameras or sensors used for optical tracking: when cameras are placed at fixed locations around the scene, typically indoor areas, and track infrared light reflected or emitted, we talk about outside-in tracking, while when cameras or sensors are mounted on an HMD or on other objects and capture light from fixed emitters, tracking the pose with respect to those sources, we talk about inside-out tracking. In an outside-in tracking setting, spherical markers covered with reflective materials are usually used. Markers such as spheres are organized in a specific configuration or shape to allow the system to match a visual image of them with a corresponding internal digital model. Markers are usually needed to provide artificial features when the natural ones on the scene's objects are not sufficient to enable reliable tracking. In an optical system with markers, first a digital and easy to track model of the marker is created, then a corresponding physical model is produced and used to instrument the objects to track; finally, cameras capture their images and, with the help of software vision algorithms, the system processes the captured images searching for the markers' features. Marker detection provides not only a target pose for rendering but also content information for the augmentations.
If artificial features are not used because of their visual clutter, or are not desirable because of the needed environment instrumentation, natural features are the way to go. The optical system can still use a model based tracking approach, provided that a corresponding reference to match is available: for human made objects it can be a CAD model, otherwise it can be obtained with a preliminary 3D scanning process. There are also methods that don't require a pre-made model but instead build a map of the environment while tracking, as done by SLAM (Simultaneous Localization And Mapping) techniques.
Natural feature tracking (NFT) uses features known as keypoints, which are relevant points of the target objects detectable in the captured images. Not all objects exhibit easily detectable features, since this depends on their surface properties, so a variant is to recognize object shapes, silhouettes, corners or edges. A relatively simple technique based on keypoints is tracking by detection: a system adopting this approach leverages computer vision techniques to process each frame individually, detecting its features and building a descriptor structure for each feature. Both for feature detection and for descriptor creation, many computational tools are available in computer vision. Feature detectors track visually distinct points, called corners. Each method defines the properties a point must have to be considered as a corner, guaranteeing a certain degree of coherence and robustness during tracking when changes in viewpoint or in lighting conditions happen. For example, a classic approach is the Harris detector([13]), which computes, for each point, a 2x2 auto-correlation matrix measuring the local changes at that point. The idea is that a corner must have a strong gradient or curvature both in the x and in the y direction. These curvatures are expressed by the magnitude of the eigenvalues of that matrix. However, the Harris method does not compute those eigenvalues directly but employs a threshold value to make a binary decision on the point with a scoring function based on the determinant and the trace of the matrix. Another approach is the FAST detector ([14]), which classifies a point as a corner if around it there exists a sufficiently long arc of pixels exhibiting a high color contrast with the central point. This arc belongs to a fixed discretized circle around the point, and different heuristics exist on which points to use and in which order to evaluate them to make the test as fast as possible. Once a point has been identified as a keypoint, for searching purposes it can be saved together with a data structure describing its local neighborhood in a way that allows its recovery in other images even when geometric transformations or illumination changes occur. A simple one is an entire image patch around the point. More sophisticated and robust approaches are the SIFT, BRIEF and SURF descriptors. The standard SIFT approach builds a 4x4 grid out of an image patch centered on the keypoint, and for each grid cell it computes a histogram with 8 bins relating to the gradients of the cell pixels. The descriptor is obtained as the concatenation of the histogram bins of each cell, resulting in a vector of size 4x4x8 = 128.
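For reference, the Harris scoring function mentioned above is commonly written as $R = \det(A) - k\,(\mathrm{trace}\,A)^2$, where $A$ is the 2x2 auto-correlation matrix and $k$ is an empirical constant (typically 0.04-0.06). The snippet below is a minimal sketch of keypoint detection and description with OpenCV in Python; the input file name is a hypothetical captured frame, and a recent opencv-python build is assumed (SIFT has been part of the main package since version 4.4).

import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # hypothetical captured frame

# FAST corner detection: points with a contiguous arc of high-contrast pixels around them.
fast = cv2.FastFeatureDetector_create(threshold=25)
fast_keypoints = fast.detect(img, None)

# SIFT: detects keypoints and builds the 128-dimensional (4x4x8) descriptors
# described above, usable to match the same physical point across different views.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(fast_keypoints), len(keypoints), descriptors.shape)   # descriptors: (N, 128)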
From the correspondences between 2D keypoints on the image and the 3D features of a model, the camera pose with respect to that model can be computed. The camera pose can then be used to create a virtual image that is consistent with the real camera view, which means that in an AR/MR application a virtual camera modeling the real one should be added for this purpose. However, the pose is not sufficient to completely describe the geometrical transformation underlying the virtual image formation process, because other parameters are needed to complete the model (focal length, viewing angle, etc.). These parameters, called intrinsic parameters, can either be assumed as given, if provided by the camera manufacturer, or estimated with one of the two most used algorithms for camera calibration (the one proposed by Tsai [15] and the one by Zhang [16]) using a set of 2D-3D correspondences related to a known pattern used as a calibration target. With a calibrated camera, the problem of estimating its pose (made of 6 degrees of freedom, 3 for translation and 3 for rotation) is referred to as the Perspective-n-Point (PnP) problem. At runtime, when the pose has to be recomputed to keep the alignment dynamic, a PnP solver can be applied, coupled with a robust refinement step (the RANSAC [19] approach or its later improvements) to guarantee a robust process, reducing the effect of spurious data due to perturbations such as image noise, tracking misses and erroneous 2D-3D correspondences.
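A minimal sketch of this robust pose estimation step, using OpenCV, is given below. The 2D-3D correspondences and the intrinsic matrix are hypothetical placeholders standing in for the output of the matching and calibration stages.

    import numpy as np
    import cv2

    # Placeholder 3D model points (object frame) and their 2D projections (pixels).
    object_points = np.random.rand(20, 3).astype(np.float32)         # hypothetical
    image_points = np.random.rand(20, 2).astype(np.float32) * 600.0  # hypothetical

    # Intrinsics from a previous calibration (assumed values).
    K = np.array([[450.0, 0.0, 306.0],
                  [0.0, 450.0, 230.0],
                  [0.0, 0.0, 1.0]], dtype=np.float32)
    dist = np.zeros(5, dtype=np.float32)  # assume distortion already removed

    # Robust PnP: RANSAC rejects outlier correspondences, then refines the pose.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points, image_points, K, dist,
        reprojectionError=3.0, iterationsCount=100)

    if ok:
        R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
        print("camera pose: R =", R, "t =", tvec.ravel())

The rotation and translation returned here express the model pose in the camera frame, which is exactly the transformation an AR/MR renderer needs to position the corresponding virtual content.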
When using the tracking-by-detection approach, analyzing images individually is not always effective, both in terms of resource usage and of tracking results; better results can be obtained by exploiting frame-to-frame coherency, making use at each frame of information from the previous one and adopting an incremental approach to feature detection (the Kanade-Lucas-Tomasi tracking algorithm [20] tracks features from a starting image using optical flow; another method is zero-normalized cross-correlation).
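A minimal frame-to-frame tracking step with the pyramidal Lucas-Kanade implementation available in OpenCV might look like the sketch below; prev_gray, next_gray and the initial corner points are assumed to come from the capture and detection stages.

    import cv2

    def klt_step(prev_gray, next_gray, prev_pts):
        """Track keypoints from prev_gray to next_gray with pyramidal Lucas-Kanade."""
        next_pts, status, err = cv2.calcOpticalFlowPyrLK(
            prev_gray, next_gray, prev_pts, None,
            winSize=(21, 21), maxLevel=3)
        good = status.ravel() == 1  # keep only points tracked successfully
        return prev_pts[good], next_pts[good]

    # Example usage with two hypothetical grayscale frames:
    # prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
    #                                    qualityLevel=0.01, minDistance=7)
    # old, new = klt_step(prev_gray, next_gray, prev_pts)

Tracking this way is much cheaper than re-detecting features from scratch at every frame, and a full detection pass only needs to run when too many points are lost.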
3.
Aim, Objectives and Tools
Let us briefly review the ideas that drove the whole thesis project and the key and supporting tools that helped make it possible.
3.1
Project Aim
The project aim is to explore the possibilities of using the HTC Vive for mixed reality applications. The idea comes from the availability of a front camera which, when warning the user that he is getting close to the border of the device's tracking area, shows a visualization of the nearby environment and objects integrated in some way with the virtual scene. This feature opens the door to the opportunity of blending virtual and real worlds, enabling mixed reality experiences. Among the many ideas that can be thought of, a preliminary idea of mine was to implement an application that lets the user wearing the Vive hold a model generated by a 3D printer or by another manufacturing process and see a matching virtual representation with enhanced visualization features of its surface appearance. This could be useful for the rapid prototyping of production objects, as one can use fabrication techniques with simplified materials to obtain an artifact that resembles the final object, at least in its structure, allowing the interested users to get an idea of the possible result by letting them touch the prototype to sense its shape while seeing and evaluating its final visual features supplied by the Vive technology, potentially modifying some attributes in an interactive way. A similarly useful application could be the visual augmentation of historical artifacts whose real surface is no longer available.
3.2
Project Objectives
Today cameras are an effective technology in AR and MR applications, especially when used in conjunction with an output display, since they can be used both to capture a real scene and as tracking devices. So I took the opportunity to answer the question of whether the camera of the Vive could be used not only as a pass-through camera reporting images but also to perform some optical tracking, coupled with computer vision algorithms doing the hard work. Thinking of an object held in one hand, optical tracking applied to camera images could help recognize its pose and determine how to render it with a custom surface appearance. Given a 3D internal model of the object, as in my case, one can employ a tracking-by-detection approach which finds correspondences between 2D feature keypoints on an image and 3D points on the model and solves for a geometrical transformation that expresses the object pose with respect to the camera. So, while running, an application could ask the camera to grab frames and then perform the above procedure to find the transformations needed to render the object. The real-virtual mixing in all of this is realized by the fact that, while the user senses a real handheld object, he can look at its 3D version recreated or enhanced in the virtual world. Moreover, other parts of the real world can be taken in, for example the room background, and added to the application scenery; in any case, since the user cannot see the real world directly, this way of operating can be considered a form of
video-see-through augmentation: camera images are grabbed, sent to the application to be processed, and the results are finally combined with the virtual content on the HMD displays. Things are never so simple, though: apart from the computational requirements needed for real-time optical tracking, the camera pose and the head pose do not coincide. A correct rendering of an object has to be done with respect to the user's head, or better to his eyes, so in some way one has to model the transformation between the camera and head poses. The camera-head offset can be seen in the Vive Room View mode, where the virtual representations of the controllers appear shifted from the camera view precisely because the virtual controllers are rendered with respect to where the user's eyes are.
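The per-frame flow described above can be summarized by a generic loop like the following sketch. Note that it uses a plain webcam through cv2.VideoCapture purely for illustration (the actual Vive camera is reached through the SteamVR/OpenVR runtime, and the final compositing happens inside the rendering engine), so the device index and the processing steps are assumptions, not the real implementation.

    import cv2

    cap = cv2.VideoCapture(0)  # placeholder device; not the actual Vive camera path

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 1) undistort the frame using previously calibrated parameters (omitted here)
        # 2) detect/track features and estimate the object pose (PnP, see above)
        # 3) hand the pose and the frame to the renderer, which draws the virtual
        #    content from the head/eye viewpoint and composites it for the HMD
        cv2.imshow("pass-through preview", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
            break

    cap.release()
    cv2.destroyAllWindows()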
Moreover, with an indirect vision system, where real scenes are captured and presented to the user through a display, the result can suffer from optical distortion, limited dynamic range and latency between the virtual and real parts (latency can be mitigated by matching their relative speeds, while in direct see-through mode latency is more problematic).
A minimal pipeline supporting all of this should be aware of the spatial models of the involved components and of their relationships: the user's eyes, the head-mounted display, the camera and the object to be rendered with augmentations. Some of these components can move independently, others cannot or are fixed; in any case, to guarantee a good alignment of the various coordinate systems for rendering or other purposes, the transformations between these components and/or with a fixed world system should be determined. For static positioning the transformation is fixed, so a calibration procedure can be used; for a moving object it must be estimated at runtime by tracking. The virtual contents of the application are rendered using traditional computer graphics techniques and, to this end, at least a virtual camera has to be added. Also, the previously cited transformations and other offline or online parameters are needed to compute the traditional model-view and projection matrices for 3D rendering from a virtual camera positioned somewhere in the virtual world.
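As a minimal numerical sketch of how such transformations compose (all matrices here are hypothetical 4x4 homogeneous transforms with made-up offsets), an object pose estimated in camera space can be brought into the rendering space by chaining it with the camera-to-head offset and the tracked head pose:

    import numpy as np

    def pose(R, t):
        """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
        T = np.eye(4)
        T[:3, :3] = R
        T[:3, 3] = t
        return T

    # Hypothetical inputs: identity rotations, made-up offsets in meters.
    T_world_head = pose(np.eye(3), [0.0, 1.6, 0.0])       # head pose from tracking
    T_head_camera = pose(np.eye(3), [0.0, -0.05, -0.07])  # fixed offset from calibration
    T_camera_object = pose(np.eye(3), [0.0, 0.0, -0.4])   # object pose from PnP

    # Object pose in the world/rendering frame:
    T_world_object = T_world_head @ T_head_camera @ T_camera_object
    print(T_world_object)

The rendering engine then combines a pose of this kind with the projection parameters of the virtual camera to build the usual model-view and projection matrices.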
As for the information coming from the real world, since we are talking about a camera feeding the application, it is important to align the camera space and the virtual rendering space: alignment, in the end, means that real objects and their virtual counterparts are seen or perceived in the same position by the user. A good part of the experiment was therefore dedicated to the calibration of the Vive camera, i.e. an offline procedure to estimate the camera parameters needed to guarantee a correct alignment.
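The offline calibration referred to here follows the standard chessboard procedure available in OpenCV; the sketch below assumes a hypothetical set of images named calib_*.png, and the board dimensions and square size are placeholders.

    import glob
    import numpy as np
    import cv2

    board = (9, 6)   # inner corners of the hypothetical chessboard
    square = 0.025   # square size in meters (placeholder)

    # 3D coordinates of the board corners in the board's own plane (z = 0).
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square

    obj_points, img_points = [], []
    for path in glob.glob("calib_*.png"):       # hypothetical image set
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, board)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    if not img_points:
        raise SystemExit("no chessboard images found")

    # Estimate the intrinsic matrix K and the lens distortion coefficients.
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, gray.shape[::-1], None, None)
    print("reprojection RMS error:", rms)

The resulting intrinsics and distortion coefficients are exactly the parameters needed by the PnP and undistortion steps mentioned earlier.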
I decided to pursue the objectives with the aid of the Unity platform which, after an evaluation phase, became my main development framework, since it streamlines the development of interactive graphics applications, mainly games but not only those, letting the developer focus less on low-level graphics details and more on content. Moreover, the latest versions of Unity support the HTC Vive through a plugin
which adds SteamVR support and C# bindings for OpenVR interfaces,
which are useful tools for developing with the Vive. Then, taking inspiration from spatial AR, a form of AR which uses projectors, at fixed locations or integrated in a headset, to project a custom image on the surfaces of real objects and alter their appearance, I tried to implement a similar effect as a debugging step for the calibration estimation, this time projecting from a virtual projector, i.e. a projector represented by a virtual entity in a Unity application, tied to the user's view of the scene so that it matches the geometrical relation between the Vive camera and the user's head. Scene images are captured by the camera and then projected onto a virtual object, represented in my case by the 3D model of one of the Vive controllers.
3.3
Tools
Firstly, I introduce the main VR tool at my disposal, the HTC Vive.
3.3.1
HTC Vive
Valve Corporation, a well-known entertainment software developer, began a collaboration with HTC, a Taiwanese manufacturer of smartphones, tablets and consumer electronics, with the aim of developing a virtual reality headset with room-scale tracking capabilities. Their product, the HTC Vive, was released in April 2016. It consists primarily of a head mounted display, two wireless handheld controllers and two base stations. The head mounted display is a binocular display with two flat AMOLED panels with a 1080x1200 resolution per eye. The optics are based on Fresnel lenses, offering reduced distortions and artifacts, and the horizontal field of view is about 110°. An interesting novelty is the choice of room setups offered by the runtime system: the user can experience VR applications either in a seated/standing mode, if only limited space is available, or with a room-scale setup. In the latter case, the HTC Vive room-scale tracking technology turns an area of up to 3.5x3.5 meters of the user's room into a 3D space, allowing the user to navigate by walking around and to use the motion-tracked handheld controllers to manipulate and interact with virtual entities, experiencing in this way a very immersive environment. Another interesting feature is the availability of a built-in, front-facing video camera
Figure 3.1: A virtual grid appears to warn the user when he is in proximity of the play area bounds. When the camera is active, the Vive system also overlays a processed view of the scene.
positioned in the headset. Apart from the range of possible uses of such a camera that emerge from the possibility of blending the virtual and physical worlds in some way, its earliest application is related to user safety during room-scale experiences, in which users are in some sense blind to the real world because of the headset they are wearing. To prevent accidents with nearby obstacles during virtual navigation, the Vive runtime system, through its “Chaperone” feature and the embedded camera, captures a representation of the environment in the user's field of view and displays it through glowing outlines and silhouettes, giving users enough time to avoid collisions.
Tracking
The HTC Vive headset uses inertial measurement units (IMUs) as the primary position and rotation tracking sensors, providing quick updates (1000 Hz sampling, 500 Hz reporting). Since IMU computations are subject to drift due to double-integration errors, about 60 times per second they are corrected with an absolute position reference given by the “Lighthouse” tracking system. The latter is a new tracking system for monitoring position and orientation which uses two or more base stations as infrared emitters, usually placed at opposite corners of the room to avoid the occlusion or line-of-sight problems to which they are vulnerable. Each base station is a small rectangular box containing an IR beacon that emits a synchronization pulse and two laser emitters that spin rapidly, sweeping horizontal and vertical laser planes that continuously flood the environment. Sync pulses and laser sweeps are captured by an array of infrared sensors mounted on the HMD: by measuring when and where the sensors are hit, the Vive system is able to find the exact position of each infrared receptor, and hence of the head display, relative to the base stations (a simplified model of this timing computation is sketched below, after the figures). This tracking method is an innovative example of an inside-out tracking system, since the infrared emitters are positioned at fixed places in the user's room while the sensors on the HMD are used to track poses relative to those emitters. This approach is opposed to outside-in tracking, where fixed sensors or cameras track IR light reflected by passive markers or emitted by active LEDs mounted on the HMD. To complete the Vive system for VR applications another component is necessary: a host computer, typically a quite powerful modern PC. The host, through its hardware components and specifically designed software modules, runs the simulation while handling low-level concerns such as the integration of sensor data coming from the headset devices, and
Figure 3.2: Room setup for reliable tracking. Mounted on the walls are the two base stations.
Figure 3.3: A representation of Lighthouse in action. The sweeping IR laser triggers small IR sensors on the tracked devices.
timing, synchronization and prediction issues.
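As anticipated above, a toy model of how a laser sweep is turned into an angle could look like the following sketch; it assumes an idealized rotor with a known period (a 60 Hz rotation is used here as a placeholder) and ignores the actual Lighthouse data format and calibration details.

    import math

    def sweep_angle(t_sync, t_hit, rotor_period=1.0 / 60.0):
        """Angle of a sensor relative to the base station for one sweep plane.

        t_sync: time of the synchronization pulse (seconds)
        t_hit:  time the rotating laser plane hits the sensor (seconds)
        rotor_period: assumed rotor period; a real station needs its calibrated value.
        """
        return 2.0 * math.pi * ((t_hit - t_sync) / rotor_period)

    # Two such angles (horizontal and vertical sweeps) give a ray from the base
    # station to the sensor; combining the rays of many sensors with the known
    # sensor layout on the HMD yields its pose.
    angle = sweep_angle(t_sync=0.0, t_hit=0.004)
    print(math.degrees(angle))  # about 86 degrees for this made-up timing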
Controllers
The HTC Vive uses two wireless, battery-powered handheld controllers as input devices to allow users to interact with the virtual world and its entities. The two controllers, shown in the figure, exhibit an ergonomic and balanced design made up of two parts: a toroidal upper part, which hosts a collection of tracking sensors, and a lower handle with a circular trackpad and several buttons for various functions. As for the HMD, 6DOF tracking is made possible by a combination of IMU sensors and the laser tracking technology of the Lighthouse system. Being wireless, button actions and sensor data are sent via a wireless transmitter to a receiver on the HMD, and from there to the host for input processing. The two controllers are also used as output devices, providing haptic feedback that can, for example, inform the user about the success or failure of his actions.
Figure 3.4: The HTC Vive controllers, shown above, are used for smooth interaction and navigation in virtual reality simulations.
HTC Vive Camera
Unfortunately, I have not been able to find detailed specifications for the camera mounted on the Vive (apart from the fact that it is manufactured by Sunny Optical Technology [21]); it seems those specs have not been released to the community yet. Looking at the headset, the camera is centered between the eye positions, placed lower and facing slightly downwards. It is an RGB camera without depth-sensing capabilities, so it cannot reconstruct and re-project a scene and therefore cannot show objects in their correct real-world locations. I did manage to acquire some technical information using the facilities for application development with the Vive: for example, it appears to be based on a VGA sensor with a native frame resolution of 612x460 and a refresh rate between 30 Hz and 60 Hz. When used in a VR app, the camera can either provide an RGB frame that is glued as a texture to a floating quad lying on top of one of the controllers, or it can be used in a Room View mode in which the frame showing the nearby area,