
Università di Pisa and Scuola Superiore Sant'Anna

Master Degree in Computer Science and Networking (MCSN)

Low Latency Capture Compression and Visualization of 3D Scenes Using RGBD Cameras for Telepresence

Candidate: Matteo Mazza

Supervisors: Prof. Franco Tecchia, Prof. Marcello Carrozzino

Examiner: Prof. Roberto Grossi


Thanks!

A big thank you goes to my mother and another to my father who, with their tireless support, allowed me to undertake this path and get this far, contributing to my personal growth. I also thank my sister for being there, always ready to give me advice.

A special thanks goes to my girlfriend Giulia, for supporting me throughout this journey.


Contents

1 Introduction
  1.1 Objectives of the Thesis
  1.2 Thesis Outline

2 Related Work
  2.1 Telepresence System
  2.2 Pin-hole Camera Model
  2.3 3D Acquisition
  2.4 Microsoft Kinect
    2.4.1 How it Works
    2.4.2 Device Accuracy
    2.4.3 Hole Filling
  2.5 Camera Calibration
    2.5.1 Intrinsic Calibration
    2.5.2 Extrinsic Calibration
  2.6 Data Processing and Rendering
    2.6.1 Mesh Merging
    2.6.2 Color Merging
  2.7 Depth Data Compression
    2.7.1 Holoimage Technique
    2.7.2 Direct Depth Encoding

3 Capture, Compression and Visualization
  3.1 Device Calibration
  3.2 Hole Filling
  3.3 Depth Data Compression
    3.3.1 Direct Depth Encoding Implementation
    3.3.2 Spiral Encoding
    3.3.3 Monochrome Depth Compression
      3.3.3.1 Accuracy Based Compression
      3.3.3.2 Quantization Based Compression
      3.3.3.3 Adaptive Compression
  3.4 System Pipeline
    3.4.1 Capture and Compression
    3.4.2 Rendering the 3D Scene
    3.4.3 A System Implementation

4 Tests and Measurements
  4.1 Acquisition Tests
  4.2 Calibration Tests
  4.3 Compression Tests
    4.3.1 Direct Depth Encoding
    4.3.2 Spiral Encoding

    4.3.3 Accuracy Based Compression
    4.3.4 Quantization Based Compression
    4.3.5 Adaptive Compression
  4.4 Rendering Tests

5 Conclusions and Future Works
  5.1 Future Works
    5.1.1 Automated Calibration
    5.1.2 Compression Parameter Tuning

Appendix A
  A.1 Bilateral Filter
  A.2 Matrix Form for Extrinsic Calibration


1 Introduction

Communication systems have always been studied to improve system availability and the user experience. The main purpose of these systems is to make the communication feel as natural as possible to the user. Starting from the first electric telegraphs, evolving into telephones and, later on, into videotelephony, communication systems improved the user experience by increasing the feeling of being co-located with the people one wants to communicate with.

With electric telegraphs, people could significantly reduce the time necessary to communicate information over very large distances. With the advent of telephony, communication times were further reduced and people's voices became an essential part of the communication. The user experience improved even more with videotelephony, because the people involved in a video call can see each other as if they were seated facing each other in the same room.

Telepresence systems aim to improve in several ways the feeling of being co-located with a remote user. Some of these telepresence systems create a shared virtual environment in which the people using the telepresence system are immersed. To accomplish this effect, different hardware devices can be adopted: for example, the shared virtual environment can be navigated inside cave installations or, as another example, the users can wear head mounted displays. However, the shared virtual environment is not the only available solution. Other telepresence systems create a window effect to extend the local environment: they show the remote environment through a monitor, so that the user experience is similar to seeing the remote environment as a room behind a window. Yet another approach is to insert a remote user into the local environment. This is accomplished using see-through rendering devices that represent the remote users as holograms.

Several new technological aspects must be studied to realize such systems, ranging from user tracking to data compression and visualization. Moreover, a variety of input/output hardware devices are available. A system can be specialized for a specific set of them, or it can aim to support the great majority of devices whenever possible. For example, one system can be specifically realized to create a window effect using a monitor for rendering and a camera for user tracking, while another one can be realized to support the great majority of head mounted display (HMD) rendering devices.

In telepresence scenarios, the user's perception is strictly related to the user's movements in the environment. In fact, the remote scene is no longer a flat 2D video, but a 3D representation of a remote environment. The user's position and orientation in space must be tracked in real time to render the scene from the correct viewpoint at each frame. These tracking procedures can be realized in different ways depending on the system architecture: for example, some systems can delegate them to a head mounted display (HMD), while others can implement them through eye tracking.

The capturing method must also acquire geometry information about the scene. It should be implemented using multiple devices that must be calibrated to build a consistent geometry representation of the scene. However, the main challenge in realizing a 3D telepresence system is the bandwidth available for streaming 3D information. With respect to standard video conferences, the amount of data to be transmitted for a 3D telepresence system is substantially higher. The scene is no longer represented by a simple 2D image stream and an audio stream: the 2D images are replaced by the geometry of the scene and its texture mapping, making geometry compression a key aspect to be investigated.

1.1 Objectives of the Thesis

This thesis investigates some of the key aspects of a 3D telepresence system. It starts from the capturing methods, analyzing different technologies and discussing their characteristics. It deals with the capturing devices, their calibration, the capturing software and the software quality enhancers. Linked together, these aspects enable the system to capture a 3D scene in real time.

Then the thesis inspects the rendering techniques for the acquired 3D scene. This aspect is analyzed independently of the chosen rendering device; thus, it can be applied, for example, to an HMD, a see-through device, a stereoscopic monitor or a standard rendering device.

Moreover, scene compression is analyzed to enable a telepresence system to be adopted on a network with limited bandwidth, such as the Internet. Instead of devising new compression algorithms, this thesis tries to exploit widely adopted standard video compressors to compress both the video streams and the geometry information. This approach aims to leverage the extensive work done on video compression in past years.

Finally, it proposes a system architecture for telepresence systems. It aims to enable the development of telepresence systems that exploit hardware acceleration support. There are already a lot of cheap devices with such hardware support, ranging from laptops to single-board computers. The approach of this thesis is to let them exploit their hardware to execute a telepresence system.

1.2 Thesis Outline

The next chapter gives an overview of some relevant works about the main aspects involved in 3D telepresence systems. First, a more formal definition of a telepresence system is given. Then, video capture and 3D scene capture are presented, illustrating different technologies; pros and cons are discussed for different use cases. Next, the Microsoft Kinect capturing device is analyzed, because its characteristics make it suitable for adoption in this thesis. Since multiple capturing devices are used to capture a 3D scene, the chapter proceeds with a camera calibration section. Then some data processing and rendering techniques are reviewed and, finally, the state of the art of geometry compression is investigated.

The third chapter investigates a new calibration technique for RGBD capturing devices. Its second section further investigates the hole filling technique. Then several depth data compression techniques are described and, finally, a system architecture is proposed.

In the fourth chapter, the implemented tests are described and their outcomes are shown and discussed. The tests performed cover different topics: acquisition tests, calibration tests, compression tests and rendering tests.


2 Related Work

This chapter describes the state of the art of some concepts and techniques exploited in telepresence systems. There is a wide range of different telepresence systems, but they share some common aspects. One of these common aspects is that the starting point for all these systems is the acquisition of a scene. This acquisition can be realized with a variety of devices, each one with its pros and cons. A second common aspect is that, after the 3D data of the different devices is acquired, it is merged to form a consistent representation of the scene in a unique reference system, so that it can be rendered for the user. However, the rendering devices can differentiate telepresence systems, because they range from simple monitors to HMDs (Head Mounted Displays), cave installations or other more experimental devices.

The system described in this thesis is akin to several works related to telepresence systems, real-time streaming, compression, calibration, and 3D acquisition and reconstruction. The remainder of this chapter describes concepts and established techniques in these areas.


2.1 Telepresence System

A telepresence system aims to create the feeling that one person is in a remote place or is co-located with another person. This illusion can be realized with different paradigms. Among them, there are:

• a remote user appearing inside the local environment [30], shown in Figure 2.1 (a);

• a remote space appearing to extend the local environment [43, 12], shown in Figure 2.1 (b);

• a local user immersed in a remote place [23], shown in Figures 2.1 (c), (d).

Whatever scenario is chosen, implementing a telepresence system requires a live 3D description of the scene and the ability to see that scene from an arbitrary point of view [46]. Thus, 3D acquisition and reconstruction are the key points of free-viewpoint video systems. A general purpose telepresence system is formalized in [46].

Moreover, the streaming of the geometry [38, 62, 36] is another key point to be considered when implementing remote telepresence systems.

2.2 Pin-hole Camera Model

In computer vision and computer graphics, the pinhole camera model is often used as a reasonable model of how a camera captures a 3D scene. The pin-hole camera model describes the relationship between the coordinates of points in 3D space and the coordinates of their projections on the image plane of an ideal pin-hole camera.

There are relevant differences between an ideal pin-hole camera and a real one. Some of these differences concern the ideal optical center and the lens distortion. In the real world, the ideal optical center cannot exist because it cannot be a single geometric point, and the lens distortion cannot be neglected, depending on the kind of lens adopted. Moreover, in the pin-hole camera model, the image plane is shifted along the z axis by 2f, with f the focal distance of the camera. This last mathematical artifice yields an image that is not inverted, unlike the one a real camera captures.

Figure 2.1: (a) An audience interacts with a remote participant. (b) Displays used to extend the local environment with a remote room. (c) A user immersed in a virtual environment. (d) A snapshot of a virtual environment.

Figure 2.2: the pin-hole camera model

Figure 2.2 (b) shows the ideal pin-hole camera, while Figure 2.2 (a) shows the pin-hole camera model with the shifted plane.

As shown again in Figure 2.2 (a), it is possible to identify 3 reference systems:

• $\tau = (O; \hat{X}, \hat{Y}, \hat{Z})$: the reference system fixed to the camera, centered in the ideal hole of the camera, with the $Z$ axis directed opposite to the camera view direction.

• $\tau^* = (C; \hat{x}, \hat{y})$: the reference system centered in the principal point (which is usually the center of the camera image).

• $\bar{\tau} = (D; \hat{u}, \hat{v})$: the discrete reference system centered in the top-left corner of the camera image. Each vector $(u, v)^T$ with $u \in \mathbb{N}$ and $v \in \mathbb{N}$ identifies a pixel of the image.

The focal distance f is known and depends only on the kind of camera adopted; the unit of measure usually used is pixels. Thus, for each pixel p on the image plane with coordinates $(u, v)^T$, it is possible to find its coordinates $(x, y)^T$ by knowing the principal point, and therefore to compute the $\overrightarrow{Op}$ vector.

From the similar triangles of the pin-hole projection, the following relations are obtained:

$$\frac{x}{f} = \frac{X}{Z} \quad \text{and} \quad \frac{y}{f} = \frac{Y}{Z}$$

These relations can be applied to compute the $\overrightarrow{OP}$ vector, provided that its $Z$ component is given. This is the case in a typical 3D acquisition scenario. The latter equations can also be expressed in matrix form as follows:

$$Z \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = \begin{pmatrix} f_x & 0 & 0 \\ 0 & f_y & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \tag{2.1}$$

and, for the $(u, v)$ components:

$$Z \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \tag{2.2}$$

Note that since the focal distance and the principal point coordinates are expressed in pixel units, they need to be scaled according to the adopted camera resolution.
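A minimal numeric sketch of inverting equation 2.2 is shown below: given a pixel (u, v) and its measured depth Z, it recovers the camera-space coordinates. The intrinsic values used here are purely illustrative placeholders, not calibrated data.

```python
import numpy as np

def backproject(u, v, Z, fx, fy, cx, cy):
    """Invert the pin-hole projection of equation 2.2: given pixel
    coordinates (u, v) and the depth Z of the point, recover its
    camera-space coordinates (X, Y, Z)."""
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return np.array([X, Y, Z])

# Illustrative intrinsics (placeholders, not values from a real calibration)
fx, fy, cx, cy = 580.0, 580.0, 320.0, 240.0
print(backproject(400, 300, 2.0, fx, fy, cx, cy))  # depth given in metres
```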

2.3 3D Acquisition

3D capturing technologies are used for a variety of different aims, with different constraints and requirements. They range from the 3D scanning of sculptures for art museums to 3D scanning for film-making motion capture, from 3D military reconstruction to 3D scans for industry 4.0. Thus, there is a lot of research for different scenarios and different aims. The 3D acquisition can be performed with different kinds of devices and different techniques, each one with its own advantages and limitations [15].

Each device has different costs and different performance trade-offs. For example, some focus on very low measurement errors, while others work with very low acquisition times, and others again minimize intrusiveness. Examples of the main techniques for 3D acquisition are:

• time-of-flight [20, 25]

Reconstruction of the 3D scene exploiting the physical constant of the speed of light.

A scanner emits a light impulse and waits for it to return to a sensor, registering the elapsed time. Multiplying this time by the speed of light gives the distance travelled by the light impulse; dividing it by two gives the distance of the illuminated point from the scanner:

$$d(t) = \frac{t c}{2}$$

Another approach is to emit light continuously, modulating its power sinusoidally. The distance of the surface is measured from the phase difference between the light emitter and the sensor. The frequency of the emitted light can be tuned to be invisible to human eyes, to avoid intrusiveness during the 3D acquisition.

• triangulation

Reconstruction of the 3D scene building triangles starting from known information.

It can be realized in several ways. For example, the triangulation can be performed between a laser emitter and an RGB sensor, as shown in Figure 2.3. Knowing the distance $d$ between the emitter and the sensor, the angle β formed by the projecting line and the emitter-sensor line, and the angle α between the emitter-sensor line and the camera-target line, it is possible to compute the distance $P_1P_3$ (a minimal sketch of this computation is given after this list).

Figure 2.3: The system computes the distance $P_1P_3$ with triangulation exploiting a camera and a laser emitter

Given a pixel in the camera image, the α value can be calculated exploiting the orientation of the camera and the pin-hole camera model as described in section 2.2.

Figure 2.4: The system computes the distance $P_1P_3$ with triangulation exploiting two cameras

Another kind of scene reconstruction exploiting triangulation can be realized without a laser emitter, using two cameras as in Figure 2.4. From the two frames acquired by the two cameras, we have to identify a pixel $p_1$ in the first camera and a pixel $p_2$ in the second one, both representing the same feature in the scene. After $p_1$ and $p_2$ are selected, the angles α and β and the distance $P_1P_3$ can be computed.

These two approaches have pros and cons. The most important difference is that the former cannot be used in real time, because the laser emitter is rotated by an angle Δβ at each camera frame. On the other hand, the latter needs to solve the correspondence problem: for each point in one image, the corresponding point in the second image must be found [28].

• structured light [59]

Reconstruction of the 3D scene by triangulation, solving the correspondence problem.

Even if two fixed cameras are calibrated, acquiring the 3D information by triangulation can be difficult: for each point in each image of the first camera, the corresponding point in the corresponding image of the second camera needs to be found.

The structured light technique simplifies this task by substituting one camera with an emitter device that projects a light pattern onto the scene. Knowing the emitted pattern, it is possible to look for groups of pixels in the acquired images that represent pieces of the pattern, thus solving the correspondence problem. Then simple triangulation is applied to retrieve the depth map. This can be computed in real time for each camera frame with very little intrusiveness, considering that the light pattern can be emitted in the infrared.
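As anticipated in the triangulation item above, the following is a minimal sketch of the laser-camera triangulation. It assumes that α and β are the angles at the camera and at the emitter, both measured against the baseline of length d, so the camera-target distance follows from the law of sines; the numeric values are purely illustrative.

```python
import math

def triangulate_distance(d, alpha, beta):
    """Distance from the camera (P1) to the illuminated point (P3), given the
    camera-emitter baseline d and the angles alpha (at the camera) and beta
    (at the emitter), both measured from the baseline. By the law of sines,
    the angle at the target P3 is pi - alpha - beta."""
    gamma = math.pi - alpha - beta          # angle at the target point
    return d * math.sin(beta) / math.sin(gamma)

# Illustrative values: 10 cm baseline, angles expressed in radians
print(triangulate_distance(0.10, math.radians(80), math.radians(75)))
```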

2.4 Microsoft Kinect

A widely used device in telepresence scenarios appears to be the Microsoft Kinect sensor. Introduced as a gaming input device specialized in human pose estimation, it started to be exploited as a real-time depth sensor in different scenarios [10, 29, 51, 61, 63, 57], thanks also to its cheapness and wide availability. It can be used as a pose-estimation gaming device as well as a model capture device [54], or as a 3D sensor for computer vision mounted on a remotely controlled robot to acquire a 3D map of a place [53].

The Microsoft Kinect device exploits the structured light technique to acquire 640×480 depth maps at 30 frames per second. Moreover, it provides color images matched with the depth maps, meaning that the pixels with coordinates $(u, v)^T$ in a color image and in the paired depth map refer to the same point in space [50]. Combined with its wide availability and cheapness, these characteristics make this device well suited for adoption in telepresence systems. In fact, some systems are based on an array of Microsoft Kinect sensors strategically placed and calibrated to provide a unified mesh of the 3D scene, which is rendered from the perspective of a tracked user [42, 43, 44, 46, 50].

2.4.1 How it Works

The Kinect's inner workings are intellectual property of PrimeSense. This technology exploits an IR emitter to project a speckle pattern that is constant over time. The speckle pattern is used to measure the depth of each pixel from the camera through triangulation, and all the relevant computation can be performed on chip. As observed in [58], the pattern has some repetitions, symmetries and some brighter key points that probably simplify the computation so that it can be executed with few resources.


Figure 2.5: Kinect’s IR pattern [58] (a) spot pattern (b) 3x3 repetition of the pattern

2.4.2 Device Accuracy

This commodity hardware, like all hardware, has some limitations. It cannot measure arbitrary depths but is limited to a range shorter than 5 meters [4]. Moreover, its accuracy decreases quadratically with the distance, as measured in [43]. However, these limitations still let this device be a valid option for many telepresence systems where the captured scene is limited to a reasonable area.

Moreover, some works study the noise of the Microsoft Kinect in the acquired depth maps [39, 40].

2.4.3 Hole Filling

Another limitation of this device is its sensitivity to the captured surface materials. Since its 3D capturing strategy is based on the recognition of a projected light pattern, materials with a high reflectivity, like glass, become difficult to acquire, and this can introduce holes in the produced depth maps. Moreover, adopting an array of Microsoft Kinect sensors to acquire a scene can worsen the holes issue [45], because multiple overlapping light projections disturb the pattern recognition of each device.

When a group of pixels of the IR sensor is not illuminated by any piece of the pattern, the triangulation cannot be computed because no pattern matching can be done in that region of the image. Thus, the depth map is generated with a black hole, meaning that no depth is measured for those pixels. Using multiple devices that illuminate the scene together leads to IR interference caused by multiple patterns overlapping in the scene. This interference further decreases the accuracy of the pattern matching, increasing the holes of the generated depth maps too.

However, the pixels that are present in the depth maps are correctly evaluated with a negligible increase in the mean error, because the interference decreases the probability of a correct pattern match but does not decrease the accuracy of the triangulation performed on the matched regions of the image. Moreover, if the holes are caused by the overlapping patterns of two cameras, a hole in the depth map of one camera will usually not appear in the depth map of the second camera, meaning that one of the two cameras successfully disambiguates its pattern [42].

In [21], the RGBD cameras were modified to selectively switch their IR projectors on and off. The camera acquisitions can then be scheduled to mitigate the IR interference problem; the selection of the acquiring devices can be done taking into account the user viewpoint and the occluded surfaces of the scene. Although this strategy allows the adoption of software schedulers, two kinds of approaches have also been proposed to solve the IR interference problem while keeping all the RGBD devices active at the same time:

Figure 2.6: multiple Kinects cause IR interference [45]. (a) left: the image produced by only one Kinect without interference; center: the image produced by one Kinect with IR interference caused by multiple Kinects; right: the image produced by one Kinect with the IR interference filtered by motion. (b) left: IR pattern interference; right: IR pattern disambiguation with motion. Top: idealized images; bottom: real images.

• hardware solution

A first idea is to optically filter each camera so that it only sees the light of its own projector. This cannot be done because the peak-to-peak range of wavelengths measured by a spectrometer on a group of Kinects is 2.6 nm, too narrow a range to be filtered [42].

Another attempt is to install stepper motors on top of the IR devices to create a significant movement of the projected pattern during the cameras' integration time. This movement lets each camera disambiguate its own pattern, which looks well defined with respect to the other cameras' patterns; the latter appear blurred and only increase the surrounding noise, as shown in Figure 2.6 (b).

As shown in [45], this approach works as expected but it has some disadvantages: since the RGB sensor is rigidly attached to the IR sensor on the same device, a color image de-blurring step is required; moreover, the light available to each IR camera is decreased [42].

• software solution

Application of a filter to each depth frame acquired by the sensor, trying to fill the holes.

The filter proposed in [42] is based on a two-pass median filter. It works better with small holes and it can be written in an efficient parallel version to be executed on a GPU, on the basis of the branchless median filter implementation of McGuire [48]; its effect is shown in Figure 2.6 (a).

2.5 Camera Calibration

To acquire a correct 3D scene by combining data from multiple devices, the limits of the pin-hole camera model become an important issue. In fact, significant lens distortions can be a problem, especially in cheap devices.

It is possible to determine and correct the lens distortions with a good approximation through an intrinsic calibration (subsection 2.5.1) for each camera. After that, the extrinsic calibration (subsection 2.5.2) can be computed to determine the geometric transformation that maps the coordinate system of one camera to the coordinate system of another one.

2.5.1 Intrinsic Calibration

Intrinsic calibration is the procedure of finding some coefficients, associated with each camera, to correct its tangential and radial distortions. Since these distortions are caused by the lens of the device, they are constant in time but differ from camera to camera. Two examples of radial distortion are shown in Figure 2.7.

Figure 2.7: examples of radial distortion. (a) barrel distortion (b) pincushion distortion

To correct radial distortion, the coefficients $k_1, k_2, k_3, k_4, k_5$ and $k_6$ are introduced, while to correct tangential distortion, the coefficients $p_1$ and $p_2$ are used. These coefficients modify the projection of equation 2.2, so let us rewrite it in another form:

$$Z \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \equiv \begin{cases} u = f_x \frac{X}{Z} + c_x \\ v = f_y \frac{Y}{Z} + c_y \end{cases} \equiv \begin{cases} u = f_x x' + c_x \\ v = f_y y' + c_y \end{cases} \quad \text{with } x' = \frac{X}{Z},\ y' = \frac{Y}{Z} \tag{2.3}$$

Now the lens distortion can be corrected by inserting the $k_i$ and $p_i$ coefficients in the formula, as shown in the openCV [17] camera calibration documentation [1]:

$$\begin{cases} u = f_x x'' + c_x \\ v = f_y y'' + c_y \end{cases}$$

with

$$x'' = x' \frac{1 + k_1 r^2 + k_2 r^4 + k_3 r^6}{1 + k_4 r^2 + k_5 r^4 + k_6 r^6} + 2 p_1 x' y' + p_2 (r^2 + 2 x'^2)$$

$$y'' = y' \frac{1 + k_1 r^2 + k_2 r^4 + k_3 r^6}{1 + k_4 r^2 + k_5 r^4 + k_6 r^6} + 2 p_2 x' y' + p_1 (r^2 + 2 y'^2)$$

$$r^2 = x'^2 + y'^2, \quad x' = \frac{X}{Z}, \quad y' = \frac{Y}{Z} \tag{2.4}$$

The $k_i$ and $p_i$ coefficients depend only on the camera lenses. So, once a camera is calibrated, as opposed to the focal distance and the principal point ($f_x$, $f_y$, $c_x$ and $c_y$), the camera distortion coefficients can be stored and reused regardless of the adopted resolution.
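A minimal numeric sketch of the distortion model of equation 2.4 is given below; it projects a camera-space point to distorted pixel coordinates. The coefficient values are purely illustrative placeholders, not calibrated data.

```python
def project_with_distortion(X, Y, Z, fx, fy, cx, cy, k, p):
    """Project a camera-space point to pixel coordinates applying the
    radial/tangential distortion model of equation 2.4.
    k = (k1, k2, k3, k4, k5, k6), p = (p1, p2)."""
    x, y = X / Z, Y / Z                      # normalized coordinates x', y'
    r2 = x * x + y * y
    radial = (1 + k[0]*r2 + k[1]*r2**2 + k[2]*r2**3) / \
             (1 + k[3]*r2 + k[4]*r2**2 + k[5]*r2**3)
    xd = x * radial + 2*p[0]*x*y + p[1]*(r2 + 2*x*x)
    yd = y * radial + 2*p[1]*x*y + p[0]*(r2 + 2*y*y)
    return fx * xd + cx, fy * yd + cy

# Illustrative intrinsics and distortion coefficients (placeholders)
print(project_with_distortion(0.3, 0.1, 2.0, 580, 580, 320, 240,
                              k=(0.1, -0.02, 0.0, 0.0, 0.0, 0.0), p=(0.001, 0.0)))
```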

2.5.2 Extrinsic Calibration

Extrinsic calibration is the process of calibrating multiple cameras so that a unique scene can be acquired consistently with multiple devices. The aim is to obtain all the point coordinates expressed in one single reference system [35].

If the system adopts $n$ cameras, to solve this problem at least $n - 1$ geometric transformations $T_{ij} : \mathbb{R}^3 \to \mathbb{R}^3$ must be found, such that $T_{ij}$ transforms the coordinates of points in the reference system $\tau_i$ of camera $i$ into coordinates in the reference system $\tau_j$ of camera $j$. Moreover, without loss of generality, for each camera $j$ there must exist a sequence of transformations that maps coordinates from $\tau_1$ to $\tau_j$. Since these geometric transformations are linear maps and can be expressed as matrices, they can be composed (by multiplying the associated matrices) to find the $T_{1i}$ transform for each camera $i$.

Once the cameras are physically placed in an environment, the extrinsic calibration phase can take place and a transform matrix can be stored for each camera, regardless of its resolution.

In the telepresence system described in [42], the extrinsic calibration of multiple Kinect devices is performed exploiting openCV [17] routines that compute the transform matrix between two cameras, taking as input the intrinsic parameters and multiple pairs of images containing a known pattern, such as a chessboard with a predetermined cell size. In that system, each Kinect device is directly calibrated against a designated master Kinect device. Moreover, only the calibration of the color sensor is performed, because the Kinect driver is able to map pixels of the depth map to the color pixels.
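A sketch of this kind of pairwise chessboard calibration with openCV is shown below. The corner detection and stereo calibration calls are standard openCV APIs; the frame-gathering list `captured_pairs`, the board parameters and the previously estimated intrinsics (`K_a`, `dist_a`, `K_b`, `dist_b`) are assumptions made for illustration.

```python
import cv2
import numpy as np

# Assumed chessboard geometry: 9x6 inner corners, 40 mm cells
pattern_size = (9, 6)
cell_size = 0.04
obj_single = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
obj_single[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * cell_size

obj_points, img_points_a, img_points_b = [], [], []
for gray_a, gray_b in captured_pairs:   # placeholder: synchronized grayscale frame pairs
    found_a, corners_a = cv2.findChessboardCorners(gray_a, pattern_size)
    found_b, corners_b = cv2.findChessboardCorners(gray_b, pattern_size)
    if found_a and found_b:
        obj_points.append(obj_single)
        img_points_a.append(corners_a)
        img_points_b.append(corners_b)

image_size = (640, 480)                 # Kinect color resolution
# K_a, dist_a, K_b, dist_b come from a previous intrinsic calibration
_, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
    obj_points, img_points_a, img_points_b,
    K_a, dist_a, K_b, dist_b, image_size,
    flags=cv2.CALIB_FIX_INTRINSIC)
# R, T bring points from the first camera's reference system into the second one's
```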

If the adopted device does not have a (good enough) driver that maps depth-map coordinates to color coordinates, the extrinsic calibration between the color and depth sensors cannot be trivially achieved with the same procedure, because the depth sensor does not see the chessboard pattern. A new chessboard pattern can be constructed with black cells made of a different material, in such a way that the black cells completely reflect the IR Kinect pattern; the Kinect device then cannot match its pattern and cannot compute the depth of the black cells. Since the missing depth values are stored as 0-depth pixels, by reading the depth images as grey-scale color images the chessboard pattern becomes visible and the classic calibration procedure can be exploited again [14].

2.6 Data Processing and Rendering

Once the Kinect devices are calibrated, the scene can be acquired in real time and rendered from a free point of view. The point of view can be manually controlled or automatically detected by tracking the user. For example, in [43] the user's point of view is estimated by tracking the eye positions, while with a head mounted display the point of view can be computed from the display's sensors.

Once the depth maps and the color images are acquired by the Kinect devices, the data processing proposed in [42] is completely executed on the GPU. The following list itemizes its processing steps:

• acquisition

It is the process that, when new data is available from the devices, reads it and uploads it to the GPU.

• software filter

It applies the hole filling filter described in Section 2.4.3 to soften the effects of the IR interference generated by multiple Kinect devices.

• mesh creation

For each depth map, it creates a triangle mesh. Each pixel of the depth map is associated with a vertex of a triangle, computing its 3D components according to the pin-hole camera model shown in Section 2.2, with the lens distortion correction shown in Section 2.5.1.

• color texture

For each triangle of each mesh, it applies the color texture from the Kinect color image. Since only one color image is associated with each mesh, each vertex of each mesh only needs to be associated with its UV coordinates.

• merge meshes

It merges the data of all devices, taking into account the geometric transforms of the calibrations described in Section 2.5.2, the quality index described in Section 2.6.1 and the color issues illustrated in Section 2.6.2.

• smooth rendering

While the next 3D scene is not yet constructed, this process continues to render the last constructed scene from the user's viewpoint at the required frame rate.

However, the reconstructed 3D scene can lack some parts of the real environment because of occlusions. To make up for the missing image information, texture synthesis is performed in [52].

2.6.1 Mesh Merging

At some point in the pipeline, multiple meshes have been generated from the data of different devices. These meshes need to be merged to form the geometry of the entire scene. The meshes are not completely disjoint because the cameras are focused on the same scene and share some scene details as shown in Figure 2.8.


Figure 2.8: colored contributions of multiple kinect devices [44]

If the devices are not calibrated, the coordinates of each mesh vertex cannot be expressed in a unique reference system, so it is not possible to trivially merge the generated mesh of each device. Moreover, even when a good calibration cannot be achieved, a mesh merging phase can be exploited to build the merged geometry of the scene.

As shown in [22], it is possible to adopt the ICP algorithm proposed by Besl and McKay in [16] to find fine grained calibration corrections. This algorithm can be applied independently of the geometry representation: it can operate over voxels, triangles, etc. Moreover, an efficient and fast version of the ICP algorithm is also adopted in a real-time scenario by Kinect Fusion [54] to find the transformation associated with the movement of the camera from frame to frame. Although this is not exactly the same aim, it can be useful to find calibration corrections from the images of two devices.

In the work shown in [43] a different approach is used. The meshes are generated by multiple calibrated Kinect devices, so all vertices of each mesh are expressed in a unique reference system. Although the measurement errors lead to imperfect vertex overlapping, the overlapping meshes (which are composed of triangles created from data of different devices related to the same object in the scene) are rendered independently, without a geometry merge phase. For these overlapping triangles the measurement error can be different, so a quality index is associated with each pixel of the depth maps on the basis of the accuracy concept described in Section 2.4.2. After the independent renderings are completed, the merging phase occurs in window space. For each fragment, an uncertainty z-range in which the fragment can be located is considered, based on the possible measurement errors. If multiple fragments of different independent renderings reside in the same uncertainty z-range, the one with the highest quality index is rendered instead of the one with the lowest z value.

In the adopted model, the quality of each pixel color does not depend only on the distance of the represented point from the camera; rather, it directly depends on the area of the surface represented by the pixel. This area is related to the distance and to the orientation of the object with respect to the camera. In fact, given a plane, the closer its normal vector is to being perpendicular to the camera orientation vector, the wider the area represented per pixel; vice versa, the closer its normal vector is to being parallel to the camera orientation vector, the smaller the represented area, as shown in Figure 2.9. Moreover, to simplify the computation, only one quality index is stored for each vertex of the mesh, assuming that the depth measurements also depend on the object orientation with respect to the camera:

$$\text{quality} = \frac{\cos(\theta)}{\text{distance}^2} \tag{2.5}$$

The quality index can be used for comparison and thresholding during mesh merging to provide better geometry and better color textures.
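A small sketch of how the quality index of equation 2.5 and the window-space comparison could be combined is shown below; the uncertainty threshold and the fragment representation are illustrative assumptions, not the exact implementation of [43].

```python
import math

def quality(theta, distance):
    """Quality index of equation 2.5: theta is the angle between the surface
    normal and the camera viewing direction, distance is the camera-point distance."""
    return math.cos(theta) / (distance ** 2)

def merge_fragments(fragments, z_uncertainty=0.02):
    """Pick the fragment to display among overlapping candidates at one pixel.
    Each fragment is a (z, quality, source) tuple. If a further fragment falls
    within the uncertainty z-range of the closest one, the higher-quality
    fragment wins; otherwise the closest one is kept."""
    fragments = sorted(fragments, key=lambda f: f[0])      # sort by depth
    best = fragments[0]
    for frag in fragments[1:]:
        if frag[0] - best[0] <= z_uncertainty and frag[1] > best[1]:
            best = frag
    return best

print(merge_fragments([(1.50, 0.20, "camera A"), (1.51, 0.35, "camera B")]))
```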

2.6.2 Color Merging

Since even devices of the same model exhibit different color gamuts, a color merging technique is needed, especially with cheap devices like the Kinect [27]. Moreover, these devices can be subject to different automatic exposure controls, given that the devices are located in different places of a lit room. Although the library provides a function to control the exposure, the Kinect driver does not implement it.

Figure 2.9: the angle θ between the normal to the camera plane $N_p$ and the normal of the surface to be acquired $N_c$ is a source of color degradation and causes the decrease of the quality index

The color merging phase aims to smooth the color noise that can occur in surfaces acquired by multiple cameras and to obtain a better user perception of the rendered scene.

The merging procedure can be split into two functionalities: building a color matching function for each camera and applying the color matching functions. A color matching function maps each color from one camera to a new color: the one that would supposedly be acquired by the reference camera. Since exposure changes do not happen very frequently, in [42] the matching functions are built outside the real-time pipeline with a CPU algorithm, while the color matching functions are applied at each camera frame.

To build the color matching functions, a set of pixel pairs (p, q) needs to be found such that pixel p belongs to the frame of the reference camera and pixel q belongs to the frame of the other camera. Moreover, p and q must be related to the same 3D point in the scene, so that they should have the same color. How to find this set of pixels is described in Section 2.6.1, where the overlapping fragments, within a threshold range, in window space are generated exactly from this set of pixel pairs. This set also contains noise due to light reflections or to the different quality of the acquired data, as explained again in Section 2.6.1. Thus, the matching function is built with a RANSAC-based algorithm to get a better approximation, ignoring the outlier values generated by noise.
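The sketch below shows one way a RANSAC-style per-channel color matching function could be fitted from such pixel pairs; the linear gain/offset model, the tolerance and the iteration count are assumptions made for illustration.

```python
import random
import numpy as np

def fit_channel_ransac(ref_vals, src_vals, iters=200, inlier_tol=10.0):
    """Fit a linear map src -> ref for one color channel, ignoring outliers.
    ref_vals / src_vals: 1-D float arrays of corresponding channel values (0..255)."""
    best_model, best_inliers = (1.0, 0.0), 0
    for _ in range(iters):
        i, j = random.sample(range(len(src_vals)), 2)
        if src_vals[i] == src_vals[j]:
            continue                                    # degenerate sample, skip
        gain = (ref_vals[i] - ref_vals[j]) / (src_vals[i] - src_vals[j])
        offset = ref_vals[i] - gain * src_vals[i]
        inliers = np.abs(gain * src_vals + offset - ref_vals) < inlier_tol
        if inliers.sum() > best_inliers:
            best_inliers, best_model = inliers.sum(), (gain, offset)
    return best_model  # apply as: matched_value = gain * value + offset

# Toy corresponding samples (reference camera vs. other camera), purely illustrative
ref = np.array([40, 80, 120, 160, 200], dtype=float)
src = np.array([50, 95, 140, 185, 230], dtype=float)
print(fit_channel_ransac(ref, src))
```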

2.7 Depth Data Compression

To achieve real-time streaming of the 3D scene, the data must be compressed before being sent over the network. In [37] an efficient point-cloud compression strategy is presented; however, it only enables real-time geometry decoding, while the compression step is executed off-line. In a telepresence system aiming at real-time streaming, this scenario cannot be accepted: the compression and decompression steps must both be executed in real time.

Although video-data compression is a well-known topic exploited in a wide variety of use cases, 3D-geometry compression is not yet a widely explored one. Since standard video codec algorithms perform very efficiently, as a first attempt they were used to compress 2D representations of a 3D scene. Some approaches exist to construct such 2D representations of the scene, and most of them are inspired by the holoimage technique. The key idea is to construct the 2D representation of the scene given a viewpoint and an orientation in space. Like taking a picture, this procedure cannot faithfully represent the entirety of a generic scene because of possible occlusions.

However, standard video codec algorithms do not perform very well when compressing 2D representations of 3D data [47, 55, 56], since they are optimized for human visual perception of images and videos. Typical examples of these optimizations are a loss of color information with respect to luminance information (color downsampling) and a non-uniform luminance quantization.

Research in this field continues, trying to adapt standard video coding approaches to 3D data compression [32, 41, 47, 53, 55, 56].

Moreover other approaches exist.

In [13, 26, 60], JPEG compression is exploited to compress, again, a 2D representation of the 3D scene. Although this is again a lossy compression, the main differences with respect to video-compression techniques are that no inter-frame information can be exploited (as H.264 motion compensation does) and that the space to be compressed is RGB instead of a chroma-luminance encoding.

The compression of a 2D representation of the 3D scene cannot be a general solution because it cannot take all the data into account with the same prominence. In fact, some groups of 3D points can be assigned to a small region of the 2D image or, worse, they may not even be shown in the 2D image. This occurs because occluded surfaces imply discarded data and quasi-perpendicular surfaces squeeze points into a very small region of the 2D image. Thus, the representation fidelity is strictly dependent on the point of view from which the 2D image is generated. However, other more general approaches exist, and they take into account all the 3D information without underestimating any subset of the data.

Mesh-based compression is explored in [34] and [11]. This approach creates an approximated mesh by looking for approximately flat surfaces in a point-cloud 3D representation. For high compression ratios, a low number of polygons is generated. Thus, a high compression ratio means these polygons cover a wider area, losing scene details. This process is also exploited by Microsoft HoloLens to implement spatial mapping, as shown in figure 2.10 (b).

Figure 2.10: (a) a mesh-based compression step (b) Microsoft HoloLens spatial mapping [3]

Research also proceeds by building lossy compression algorithms for 3D data based on an intra-prediction scheme [19]. This work is grounded on strong standards, which are widely used for video compression in everyday life.

Another interesting approach to 3D data compression can be found in [49], where a multiview video acquisition scheme is exploited to build an inter-view prediction to compress all the acquired data. As shown in figure 2.11, the standard temporal prediction scheme is executed side by side with the inter-view prediction, represented with red arrows.

2.7.1 Holoimage Technique

The holoimage is a technique to encode 3D geometry into a 2D representation. It is grounded on 3D scanner technologies, which it completely emulates in a virtualized environment [33, 31].

Figure 2.11: inter-view prediction (red arrows) and temporal prediction (black arrows) using hierarchical B pictures for 5 cameras and a GOP size of 8 pictures [49]

An empty scene is populated with the geometry to be encoded. Then a virtual projector projects a fringe pattern onto the scene. When the virtual camera captures the scene in 2D images, the projected pattern appears distorted because of the geometry. Knowing the pattern, the distance between the camera and the projector, and their orientation in space, it is possible to recover the 3D geometry.

Since the environment is purely virtual, the perspective projector and camera can be substituted with ideal orthogonal ones to simplify the process, as shown in figure 2.12. Moreover, the controlled environment avoids the noise sources of the real world, such as surface reflectivity and non-uniform ambient light. This makes it possible to encode the image with only 2 fringes in 2 color channels. Furthermore, it makes it possible to use the third channel to encode additional information, avoiding a common limitation of phase-shifting techniques: in fact, big discontinuities in the geometry can lead to errors in the decoding phase [33, 13].

Nevertheless, some of the limitations of phase-shifting techniques still occur:

• the angle between projector and camera is relevant for accuracy. However, increasing the accuracy also means putting some geometry in a shadow area; thus, this geometry is lost during the encoding phase.

• overlapping surfaces cannot be encoded with a single pair of projector and camera. This is because the viewpoint from which we project and acquire the scene determines the piece of geometry we can encode.

Figure 2.12: Diagram for the decoding of a single depth value z using a reference plane (the surface with z = 0) [33]

The encoding and decoding procedures can be executed directly on the GPU, modeling the virtual projector with a shader. Each vertex in the projector space is associated with an RGB color value dependent on the X coordinate of the vertex, as shown in equation 2.6. The output of the encoding is shown in figure 2.14 (a), where the camera is rotated by an angle θ as shown in figure 2.12. Figure 2.13 represents the intensity of each channel for each possible X value.

$$V_r(x, y, z) = \frac{1}{2}\left(1 + \sin\frac{2\pi x}{P}\right)$$

$$V_g(x, y, z) = \frac{1}{2}\left(1 + \cos\frac{2\pi x}{P}\right)$$

$$V_b(x, y, z) = S \cdot \left\lfloor \frac{x}{P} \right\rfloor + \frac{1}{2}S + \frac{1}{2}(S - 2)\cos\left(\frac{2\pi \, Mod(x, P)}{P_1}\right) \tag{2.6}$$

where $P$ is the fringe width, $P_1 = \frac{P}{K + \frac{1}{2}}$ is the local fringe pitch with $K$ an integer, $S$ represents the gray-scale stair height, and $Mod$ is the modulus operator.

Figure 2.13: graph representing the channel intensities as the x coordinate varies [32]

After the encoding process takes place, the resulting images can be compressed with standard lossless or lossy compressors, such as PNG or JPEG. Moreover, a sequence of images can be compressed with standard video compressors like H.264.

Since equation 2.6 encodes the fringe projection in the RGB components of the images, if H.264 is directly applied to the RGB-encoded frames, an intermediate step is required to convert the RGB components into YUV components. This conversion can be a great source of noise, as shown in figure 2.14 (b) [32]. Thus, encoding directly in the YUV components is preferable.

Figure 2.14: (a) an example of virtual fringe projection (b) decoded noise of H.264 encoding applied to RGB-encoded frames [32]

2.7.2 Direct Depth Encoding

Based on the holoimage technique, this approach aims to avoid the possible shadow regions that the projected fringe pattern cannot reach due to the virtual system configuration. The key idea is to directly encode the z value in each pixel [13].

Again, a virtual environment is set up to host the 3D geometry and an orthogonal camera is placed to acquire the scene. At the end of the standard rendering pipeline, just before the rasterization step, it is possible to execute the direct depth encoding with a fragment shader exploiting the z-buffer.

For each fragment, 3 values need to be computed and stored in the 3 image channels. Similarly to the holoimage, 2 channels are used for fringe patterns and the third one is filled with a cosine-wrapped stair signal.

For each pixel coordinate $\langle x, y \rangle$ of an RGB image, the red and green values with the fringe patterns and the blue value with the cosine-wrapped stair value can be computed as shown in equation 2.7. Thus, the depth value Z is directly encoded in the 3 channels without emulating any projector.

$$I_r(x, y) = \frac{1}{2}\left(1 + \sin\frac{2\pi Z(x, y)}{P}\right)$$

$$I_g(x, y) = \frac{1}{2}\left(1 + \cos\frac{2\pi Z(x, y)}{P}\right)$$

$$I_b(x, y) = S \cdot \left\lfloor \frac{Z(x, y)}{P} \right\rfloor + \frac{1}{2}S + \frac{1}{2}(S - 2)\cos\left(\frac{2\pi \, Mod(Z(x, y), P)}{P_1}\right) \tag{2.7}$$

where $P$ is the fringe width, $P_1 = \frac{P}{K + \frac{1}{2}}$ is the local fringe pitch with $K$ an integer, $S$ represents the gray-scale stair height, $Mod$ is the modulus operator, and $Z(x, y)$ is the z value of the fragment with pixel coordinates $\langle x, y \rangle$, extracted from the Z-buffer.
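The sketch below evaluates the three channels of equation 2.7 for one normalized depth value; the values of P, K and S are illustrative placeholders, and in practice they must be chosen so that the blue channel stays within the valid image range.

```python
import math

def encode_depth_fringe(z, P=0.1, K=4, S=0.1):
    """Direct depth encoding of equation 2.7 for a single depth value z
    (assumed normalized to [0, 1]). Returns the (R, G, B) channel values;
    scaling/clipping to the image range is left out of this sketch."""
    P1 = P / (K + 0.5)                       # local fringe pitch
    r = 0.5 * (1 + math.sin(2 * math.pi * z / P))
    g = 0.5 * (1 + math.cos(2 * math.pi * z / P))
    b = S * math.floor(z / P) + 0.5 * S + 0.5 * (S - 2) * \
        math.cos(2 * math.pi * math.fmod(z, P) / P1)
    return r, g, b

print(encode_depth_fringe(0.37))
```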

A similar approach is also implemented in [56], where the 3D data to be encoded is directly acquired from a Microsoft Kinect. Here, no virtual projection is required since the input data is already provided in a convenient format. As shown in section 2.4, the Kinect provides 16-bit depth maps that can be processed as grey-scale images, resulting from a real camera acquisition. These images cannot be considered the output of an orthogonal projection but of a perspective one. Nevertheless, the encoding process can be executed on GPUs to speed up the computation.

As shown in figure 2.15, different functions are adopted to directly encode the depth values. Equations 2.8 and 2.9 describe how to implement 2 triangular waves and 1 linear function, as opposed to the 2 sinusoidal waves and the cosine-wrapped stair function of the direct depth encoding described in [13].

Figure 2.15: depth encoding scheme[56]

$$L(d) = \frac{d}{w} \tag{2.8}$$

$L(d)$ is a linear mapping of a depth value $d$ into the $[0, 1]$ interval. The $w$ value is set to $2^{16}$ because the Microsoft Kinect depth map precision per pixel is 16 bits.

$$H_a(d) = \begin{cases} \left(\frac{L(d)}{P/2}\right) \bmod 2 & \text{if } \left(\frac{L(d)}{P/2}\right) \bmod 2 \le 1 \\ 2 - \left(\frac{L(d)}{P/2}\right) \bmod 2 & \text{otherwise} \end{cases}$$

$$H_b(d) = \begin{cases} \left(\frac{L(d) - P/4}{P/2}\right) \bmod 2 & \text{if } \left(\frac{L(d) - P/4}{P/2}\right) \bmod 2 \le 1 \\ 2 - \left(\frac{L(d) - P/4}{P/2}\right) \bmod 2 & \text{otherwise} \end{cases} \tag{2.9}$$

where $P$ is the period of the triangular waves, normalized to a $[0, 1]$ depth range.

The $L(d)$ function is used to decode a low-precision depth value, while $H_a(d)$ and $H_b(d)$ (fast-changing functions) are used to recover the fine-grained depth variation.

$H_a(d)$ and $H_b(d)$ are defined to be the same function with a different phase because, in the decoding step, for each depth value $d$ there exists a neighbourhood of $d$ in which either $H_a(d)$ or $H_b(d)$ is linear and can therefore be used to recover the fine-grained depth value.
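As a small illustration of equations 2.8 and 2.9, the sketch below encodes a single 16-bit Kinect depth sample into the three values intended for the Y, U and V channels; the period P used here is an illustrative choice.

```python
def encode_depth_triangular(d, w=2**16, P=0.04):
    """Triangular-wave depth encoding of equations 2.8 and 2.9.
    d is a raw 16-bit depth value; returns (L, Ha, Hb) in [0, 1],
    intended for the Y, U and V channels respectively."""
    L = d / w                                # equation 2.8: coarse depth
    m_a = (L / (P / 2)) % 2                  # equation 2.9: triangular waves
    Ha = m_a if m_a <= 1 else 2 - m_a
    m_b = ((L - P / 4) / (P / 2)) % 2
    Hb = m_b if m_b <= 1 else 2 - m_b
    return L, Ha, Hb

print(encode_depth_triangular(23456))
```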

3 Capture, Compression and Visualization

This chapter proposes a system architecture and software components to enable the capture, compression and visualization of 3D scenes for streaming purposes. The input devices of the proposed system are a set of calibrated RGBD cameras. Their video and depth streams are compressed through a standard video compressor to be sent over a network to a rendering node. The rendering node collects the video and depth streams of each RGBD device. Moreover, it aggregates the depth information of the cameras, reconstructing the geometry of the scene. Finally, it renders the scene from the tracked user's viewpoint.

Section 3.1 presents a calibration strategy to perform extrinsic calibration between fixed RGBD cameras pointing at the same region of space. Section 3.2 presents the proposed hole-filling software component. Section 3.3 presents several approaches to depth data compression. Section 3.4 presents the overall system pipeline from a high-level perspective.

3.1 Device Calibration

Fixed Microsoft Kinect sensors are adopted as color and depth cameras to acquire the 3D scene. This section proposes an alternative method for extrinsic calibration. Although this extrinsic calibration technique is proposed for Microsoft Kinect cameras, it can also be applied to similar acquisition devices.

The extrinsic calibration procedure described in section 2.5.2 is often used to calibrate nearby cameras, and it performs better when the chessboard pattern occupies a great portion of both camera images. This has some limitations. Cameras placed far apart require a very big chessboard pattern, like the one in figure 3.1. Moreover, cameras placed with quite different orientations cannot exploit this calibration procedure very well even using a big pattern, unless the camera resolution is especially high.

Figure 3.1: extrinsic calibration with a big chessboard pattern [1]

This section proposes an alternative solution to exploit the depth maps of Microsoft Kinect during the process of extrinsic calibration.

As explained in section 2.5.2, extrinsic calibration is a procedure that aims to find the linear mapping that translates the coordinates of one camera space into those of another camera space. Considering 2 coordinate systems τ and τ*, we can formalize the procedure as the process of finding $A$ in equation 3.1, given that $(X, Y, Z)^T$ are the coordinates of a point with respect to τ and $(X^*, Y^*, Z^*)^T$ are the coordinates of the same point with respect to τ*. The equation exploits homogeneous coordinates to express the transformation as a matrix multiplication.

$$\begin{pmatrix} X^* \\ Y^* \\ Z^* \\ W^* \end{pmatrix} = \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ W \end{pmatrix} \tag{3.1}$$

Given this formalization of the problem, it is necessary to find the 16 unknown values of the equation, but the matrix $A$ has some known properties that reduce this number to 12. In fact, the matrix can always be expressed in the form of equation 3.2:

$$A = \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \end{pmatrix} \equiv \begin{pmatrix} r_{00} & r_{01} & r_{02} & t_0 \\ r_{10} & r_{11} & r_{12} & t_1 \\ r_{20} & r_{21} & r_{22} & t_2 \\ 0 & 0 & 0 & 1 \end{pmatrix} \tag{3.2}$$

where $R = (r_{ij})$ is a rotation matrix and $T = (t_0, t_1, t_2)^T$ is a translation vector, so that equation 3.1 can be rewritten as equation 3.3:

$$\begin{pmatrix} X^* \\ Y^* \\ Z^* \end{pmatrix} = \begin{pmatrix} r_{00} & r_{01} & r_{02} \\ r_{10} & r_{11} & r_{12} \\ r_{20} & r_{21} & r_{22} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} + \begin{pmatrix} t_0 \\ t_1 \\ t_2 \end{pmatrix} \tag{3.3}$$

Thus, considering again equation 3.1 as a linear system with 12 unknown values ($a_{ij}$ with $i = 0 \ldots 2$ and $j = 0 \ldots 3$), it is possible to find these values given enough points whose coordinates are known in advance in both τ and τ*. Since there are 12 unknown values and each known point gives 3 equations (as shown in equation 3.4), at least 4 points are required to find the transform matrix.

$$\begin{pmatrix} X^* \\ Y^* \\ Z^* \\ 1 \end{pmatrix} = \begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \equiv \begin{cases} X^* = a_{00} X + a_{01} Y + a_{02} Z + a_{03} \\ Y^* = a_{10} X + a_{11} Y + a_{12} Z + a_{13} \\ Z^* = a_{20} X + a_{21} Y + a_{22} Z + a_{23} \end{cases} \tag{3.4}$$

As shown in figure 3.3, instead of placing a well defined pattern visible in both camera scenes, 4 calibration markers like the one in figure 3.2 are placed in random positions. Of course, these markers must be visible to both cameras in order to take the coordinates of the markers in both reference systems.

Figure 3.2: calibration marker

One color image from each camera is taken and the ⟨u, v⟩ coordinates of the markers in each image are identified. In this simple implementation, the ⟨u, v⟩ coordinates of the markers are identified by a user pointing and clicking over the images. This process can be automated, as explained in section 5.1. Moreover, each marker is associated with the depth value d read from the depth images at the same ⟨u, v⟩ coordinates. Actually, a mean value over a set of acquired samples is associated, to reduce the noise contribution. This simple mapping can be done thanks to the Kinect driver, which associates the ⟨u, v⟩ coordinates of the depth and color images as described in section 2.4. Exploiting the pin-hole camera model described in section 2.2, starting from the ⟨u, v, d⟩ values it is possible to compute the ⟨X, Y, Z⟩ coordinates in camera space. Repeating the same process for both cameras, the coordinates of the 4 points expressed in both coordinate systems are found. Thus, equation 3.5 can be solved. This equation can be expressed in matrix form for easier computation, as shown in appendix A.2.

$$\begin{cases} x^*_1 = a_{00} x_1 + a_{01} y_1 + a_{02} z_1 + a_{03} \\ y^*_1 = a_{10} x_1 + a_{11} y_1 + a_{12} z_1 + a_{13} \\ z^*_1 = a_{20} x_1 + a_{21} y_1 + a_{22} z_1 + a_{23} \\ x^*_2 = a_{00} x_2 + a_{01} y_2 + a_{02} z_2 + a_{03} \\ y^*_2 = a_{10} x_2 + a_{11} y_2 + a_{12} z_2 + a_{13} \\ z^*_2 = a_{20} x_2 + a_{21} y_2 + a_{22} z_2 + a_{23} \\ x^*_3 = a_{00} x_3 + a_{01} y_3 + a_{02} z_3 + a_{03} \\ y^*_3 = a_{10} x_3 + a_{11} y_3 + a_{12} z_3 + a_{13} \\ z^*_3 = a_{20} x_3 + a_{21} y_3 + a_{22} z_3 + a_{23} \\ x^*_4 = a_{00} x_4 + a_{01} y_4 + a_{02} z_4 + a_{03} \\ y^*_4 = a_{10} x_4 + a_{11} y_4 + a_{12} z_4 + a_{13} \\ z^*_4 = a_{20} x_4 + a_{21} y_4 + a_{22} z_4 + a_{23} \end{cases} \tag{3.5}$$

where $x_i$ is the $x$ coordinate of point $i$ in τ and $x^*_i$ is the $x$ coordinate of point $i$ in τ*.
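A minimal sketch of solving equation 3.5 from four (or more) marker correspondences in the least-squares sense is shown below; the marker coordinates used here are purely illustrative.

```python
import numpy as np

def estimate_transform(points_src, points_dst):
    """Estimate the 3x4 affine transform A of equation 3.5 such that
    points_dst ~= A @ [points_src; 1], from >= 4 corresponding 3D points.
    points_src, points_dst: arrays of shape (N, 3) in the two camera spaces."""
    src_h = np.hstack([points_src, np.ones((len(points_src), 1))])  # (N, 4)
    # Solve the three output coordinates jointly in the least-squares sense
    A, _, _, _ = np.linalg.lstsq(src_h, points_dst, rcond=None)
    return A.T  # rows contain the 12 unknown coefficients a_ij

# Illustrative marker coordinates (metres) in the two camera reference systems
src = np.array([[0.1, 0.2, 1.5], [0.8, -0.1, 2.0], [-0.4, 0.3, 2.5], [0.2, -0.5, 3.0]])
dst = np.array([[0.5, 0.2, 1.4], [1.1, -0.1, 1.8], [0.0, 0.3, 2.6], [0.6, -0.5, 2.9]])
print(estimate_transform(src, dst))
```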

To better exploit this method, several depth maps from each camera are captured while keeping the markers fixed in the scene. Then, for each marker, its depth coordinate d is computed as the mean of all the acquisitions to reduce the noise impact on the calculation. Moreover, this process works better when the markers are placed quite distant from each other, both in depth and in the x, y coordinates.

Figure 3.3: 4 calibration markers. (a-b) color frames of the first and second Kinect (c-d) depth frames of the first and second Kinect

If the capturing device is not provided with a driver that automatically maps depth pixels into color pixels, the same approach adopted in [14], already discussed in section 2.5.2, can be exploited to calibrate both depth and color sensors. The center of the markers can be located in both color and depth frames, provided that the markers are built with a reflecting material that prevents the capturing device from reconstructing the depth values of some pixels. For example, the black pixels of the example marker in figure 3.2 can be realized with aluminum film.

3.2 Hole Filling

The depth maps acquired from RGBD devices are subject to black holes caused by interference with other RGBD devices, difficult light conditions or reflective surfaces. Thus, a software hole filling filter is also implemented and tested. All captured input frames are filtered before they proceed through the system pipeline. The hole filling algorithm is implemented as a GLSL compute shader and its output is stored in a GPU texture. This allows transferring the depth map and the color frame to the GPU only once during the computation pipeline and performing the entire computation on the GPU.

The first implementation is based on the median filter, which aims to preserve the edges. The branchless median filter implementation of McGuire [48] is taken as a starting point for a fast GPU implementation. However, increasing the radius of the hole filling filter causes the computation time to increase significantly. This limits the computation to small radii (like 2 or 3 pixels), so that only small holes can be filled.

The second implementation of the hole filling is based on a bilateral fil-ter described in appendix A.1. Its computation time increases linearly with the input dimension (i.e. the radius of the filter), opposite to the search of the median value. This characteristic enables to fill larger holes in smaller time, but this algorithm does not have a good edge preser-vation as the median filter. The edges of the objects are smoothed to the near values in the depth map creating unwanted artifacts in the reconstructed mesh that link foreground objects to background objects. Moreover, since the edges are not preserved in the depth maps, the color texture mapping is not accurate as well and color pixels of the back-ground can be mapped to foreback-ground meshes and vice versa. Figure 3.4


shows the output of the latter implementation. As previously explained, the holes are filled with plausible depth values, but too much smoothing is introduced in the output depth map and close surfaces are smoothly merged together. In this figure the depth map is not rendered as a grey-scale image but includes contour lines marking pixels of equal depth, to better show even low depth variations.

Figure 3.4: On the right side, the raw input data with some holes (i.e.: missing depth data). On the left side, the filtered data with the simple bilateral filter.

To improve the hole filling based on the bilateral filter, the contribution weights are tuned to avoid depth values of different surfaces taking part and interfering in the hole filling process. Moreover, the RGB frame is also processed to consider the color distance as an additional weight. Considering all the pixels $p_{a+i,b+j}$ that participate in the output value for pixel $p_{a,b}$, their weights are dynamically calculated based on the pixel distance between $p_{a+i,b+j}$ and $p_{a,b}$, that is $\sqrt{i^2 + j^2}$, and the depth distance $|p_{a+i,b+j} - p_{a,b}|$. As shown in figure 3.5, which presents depth maps of a book held in one hand in the middle of a room, this improved filter prevents the edge smoothing effect around the book. If the smoothed depth pixels, which this improvement aims to eliminate, had reached the geometry creation step, they would have created unwanted surfaces linking the book to the background wall.

Figure 3.5: bottom right: a captured frame without filter application; bottom left: the bottom right frame filtered with the simple bilateral hole filling filter. top right: a captured frame without filter application; top left: the top right frame filtered with the improved bilateral hole filling filter.
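A minimal CPU-side sketch of the improved weighting is reported below: each neighbour contributes a Gaussian factor for its spatial distance, its depth distance and its color distance from the centre pixel. The thesis implements this as a GLSL compute shader; the C++ form, the sigma values and the handling of hole centres are assumptions used only for illustration.

```cpp
// Sketch of the improved bilateral weighting: spatial, depth and color distances
// each contribute a Gaussian factor. Depth value 0 is assumed to mark a hole.
#include <cmath>
#include <cstdint>
#include <vector>

struct RGB { uint8_t r, g, b; };

static double colorDist2(const RGB& a, const RGB& b)
{
    double dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

uint16_t filterPixel(const std::vector<uint16_t>& depth, const std::vector<RGB>& color,
                     int width, int height, int a, int b, int radius,
                     double sigmaSpace, double sigmaDepth, double sigmaColor)
{
    const uint16_t dc = depth[b * width + a];          // may be 0 (hole)
    const RGB cc = color[b * width + a];
    double wSum = 0.0, dSum = 0.0;
    for (int j = -radius; j <= radius; ++j) {
        for (int i = -radius; i <= radius; ++i) {
            int x = a + i, y = b + j;
            if (x < 0 || y < 0 || x >= width || y >= height) continue;
            uint16_t dn = depth[y * width + x];
            if (dn == 0) continue;                     // neighbours with no depth do not contribute
            // Spatial term based on sqrt(i^2 + j^2).
            double w = std::exp(-(i * i + j * j) / (2.0 * sigmaSpace * sigmaSpace));
            if (dc != 0) {                             // depth term only when the centre has a depth
                double dd = double(dn) - double(dc);
                w *= std::exp(-(dd * dd) / (2.0 * sigmaDepth * sigmaDepth));
            }
            // Color term computed on the mapped RGB frame.
            double cd2 = colorDist2(color[y * width + x], cc);
            w *= std::exp(-cd2 / (2.0 * sigmaColor * sigmaColor));
            wSum += w;
            dSum += w * dn;
        }
    }
    return (wSum > 0.0) ? uint16_t(dSum / wSum + 0.5) : dc;
}
```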

3.3 Depth Data Compression

The next aim is to adapt a standard video coding technique like H.264 to depth streaming. Thus, the compression investigation starts with the


Figure 3.6: Example of U-V color plane, Y value = 0.5, represented within RGB color gamut[9]

implementation of some of the algorithms described in section 2.7, in order to get a baseline and to make meaningful comparisons. Given that the input devices are Microsoft Kinects, direct-depth-encoding-like schemes are implemented. This choice is made because the acquired data are already in the right format to be processed without losing information. H.264 encodes YUV frames, where the Y component is the luminance value and the U, V components represent the chrominance. Figure 3.6 shows the H.264 encoding scheme and the U, V space at a fixed luminance.

Starting from the direct depth encoding described in [13] and the encoding described in [56], their depth spaces are studied.

As described in section 2.7, these depth encoders construct a YUV image to be compressed with H.264. They use the Y channel to encode coarse grain depth information and the U, V channels to encode fine grain depth information. Actually, both encoding schemes use the U, V channels in a very similar way: one encodes sinusoidal waves, the other encodes triangular waves, as shown in the example in figure 3.7. Nevertheless, regardless of the video encoding, these depth encoders do


not exploit the full U,V space. The U, V channels are directly related to one another because V is a shifted version of U, so the U, V encoding space is the one depicted in figure 3.8. The H.264 encoding and decoding steps are sources of noise, and the redundancy of these two channels helps in achieving a better depth decoding.


Figure 3.7: UV mapping for a sample of depth space (a) UV sinusoidal waves (b) UV triangular waves


Figure 3.8: UV encoding space for (a) UV sinusoidal waves (b) UV triangular waves
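As a rough illustration of the shape of these encoders (the exact definitions are equations 2.7 and 2.9, given earlier in the document; the formula below is an assumed representative form, not the thesis equation), the sinusoidal variant maps a normalized depth d ∈ [0, 1) to the YUV channels roughly as follows.

```cpp
// Illustrative only: approximate shape of a sinusoidal direct-depth encoder.
// The real encoder is given by equation 2.7; period P, scaling and offsets
// used here are assumptions.
#include <cmath>

struct YUV { double y, u, v; };

YUV encodeDepthSinusoidal(double d /* normalized depth in [0,1) */, double P /* periodicity */)
{
    const double PI = 3.14159265358979323846;
    YUV out;
    out.y = d;                                        // coarse depth in the luminance channel
    out.u = 0.5 + 0.5 * std::cos(2.0 * PI * d / P);   // fine depth as a cosine wave
    out.v = 0.5 + 0.5 * std::sin(2.0 * PI * d / P);   // the same wave, shifted by a quarter period
    return out;
}
```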

3.3.1 Direct Depth Encoding Implementation

As a first experiment, the direct-depth encoding of [13] is implemented. Figure 3.9 shows the plotted U, V space of one encoded frame before


and after the compression/decompression steps. As expected, a lot of noise is introduced by the x264 encoder during decompression. In the U, V decoding, after scaling and shifting, the core of the decoding computation is $\theta = \arctan\frac{v'}{u'}$. The noise can therefore be considered as a $\Delta\theta$ added to each encoded pixel, which reduces the problem to a single coordinate. Figure 3.10 plots a colored U, V space after the decompression step to show how much compression noise affects the decompressed frames. The next implementation concerns the encoding strategy of [56], which adopts triangular waves. Figure 3.11 shows the plotted U, V space of one encoded frame before and after the compression/decompression steps.
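A hedged sketch of the decoding core follows: after shifting the decoded U, V values back around the origin, the recovered angle carries the fine-grain depth information, and the compression noise shows up as an added Δθ. The normalization of U, V to [0, 1] and the use of atan2 are assumptions of this sketch.

```cpp
// Sketch of the U,V decoding core: recover the encoded angle from the
// decompressed chroma values. atan2 is used instead of arctan(v'/u') to
// obtain the full [0, 2*pi) range.
#include <cmath>

double decodeAngle(double u, double v)
{
    const double PI = 3.14159265358979323846;
    double uPrime = u - 0.5;             // shift the circle back around the origin
    double vPrime = v - 0.5;
    double theta = std::atan2(vPrime, uPrime);
    if (theta < 0.0) theta += 2.0 * PI;  // wrap to [0, 2*pi)
    return theta;                        // H.264 noise appears here as an added delta-theta
}
```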

Figure 3.9: U,V space of (a) a direct-depth encoded frame before H.264 compression (b) a direct-depth encoded frame after H.264 decompression. Since a lot of pixels in the original frame are encoded with the same U,V coordinates, in these images they are plotted with transparency, with a high value in the alpha channel. Thus, the pixel luminance in these images can be interpreted as the frequency with which pixels are encoded at those specific U,V coordinates.

Tuning the periodicity parameter P of the sinusoidal waves in the encoding equation 2.7 and of the triangular waves in the encoding equation 2.9,


Figure 3.10: U,V space of a direct-depth encoded frame after H.264 decompression. Red points have an error greater than $\Delta\theta = \frac{\pi}{16}$; the other points go from green to yellow with increasing errors lower than $\Delta\theta = \frac{\pi}{16}$. Moreover, the blue lines plot the $\Delta\theta = \frac{\pi}{16}$ angle. Since a lot of pixels in the original frame are encoded with the same U,V coordinates, in these images they are plotted with transparency, with a high value in the alpha channel. Thus, the pixel luminance can be interpreted as the frequency with which pixels are encoded at those specific U,V coordinates, while the colors are mixed together.

it is possible to show the changing distribution of encoded and compressed points in the U,V space. When the periodicity values are too low, the encoded UV space is not fully exploited, while when the periodicity values are too high, the UV space is fully populated but the Y channel loses importance in the decoding. It is important to notice that H.264 compression (and video compression in general) down-samples the UV channels, relying on the greater importance of luminance with respect to chroma. Moreover, parameters exist to decrease the chroma importance in favor of luminance during the lossy compression steps. Figure 3.12 shows the point distribution in the UV space at the variation


Figure 3.11: U,V space of (a) an encoded frame before H.264 compression (b) an encoded frame after H.264 decompression. Since a lot of pixels in the original frame are encoded with the same U,V coordinates, in these images they are plotted with transparency, with a high value in the alpha channel, and red points represent bigger errors. Thus, the pixel luminance can be interpreted as the frequency with which pixels are encoded at those specific U,V coordinates.

of the P parameter. Given that a higher bitrate is used to decrease the noise in these plots, the lower periodicity does not exploit the full UV space. Increasing the periodicity decreases the noise, because the Y channel then holds less information and the compressed data spends fewer bits to encode it.

Furthermore, tuning the bitrate compression parameter with a fixed periodicity P, the noise increases as the bitrate decreases, as shown in figure 3.13.

3.3.2 Spiral Encoding

Starting from the ideal plots of figure 3.8, the aim of this paragraph is to better exploit the UV space. These 2 encoding strategies distribute


Figure 3.12: Different periodicity values in encoding with a fixed bitrate, increasing from left to right. The first row shows plots of the UV space encoded with the triangular waves of equation 2.9; the second row shows the sinusoidal wave encoding of equation 2.7.

Figure 3.13: Same periodicity values in encoding with, from left to right, decreasing bitrate. The first row shows plots of the UV space encoded with the triangular waves of equation 2.9; the second row shows the sinusoidal wave encoding of equation 2.7.

encoded points on 2 different paths. Considering the UV space as a square of side length equal to 1, the first encoding scheme, described by equation 2.9, is composed of 4 segments for a total path length of $4\sqrt{\frac{1}{2}} \approx 2.83$, while the second encoding scheme, described by equation 2.7, has a path length of $\pi \approx 3.14$.

To better exploit the UV space, a possibility is to distribute the points on a spiral path, but this has the drawback that the encoding functions are not periodic. Thus, the UV channels are used to encode the coarse grain of the depth values while the Y channel is used to encode the fine grain depth information. The difference with the previously described encoding strategies resides in the fact that the H.264 UV down-sampling will affect the most-significant bits instead of the least-significant ones. Given that the data to be encoded comes from a depth camera, the spatial continuity of depth data can be exploited to develop a post-processing filter that corrects the most-significant bits of the decoded data. Moreover, the Y channel encodes the least-significant bits at full resolution without being affected by down-sampling.

Microsoft Kinect provides depth values as 16 bit integer numbers with a precision of 0.1 mm. Thus, to subdivide each depth value into two values (one that will be encoded in the Y channel and one that will be encoded in the UV channels), a first step can be a decrease of the depth accuracy to a resolution of 0.8 mm. This can be achieved by simply discarding the three least-significant bits, as shown in equation 3.6. Since the depth values retrieved from the Microsoft Kinect do not completely cover the range starting from 0 meters, a better approach consists in finding a minimum depth value $d_{min}$ for the target application

and in normalizing the resulting depth range to 13 bits, as shown in equation 3.7. This approach provides a better accuracy, lower than 0.8 mm, by reducing the encoded depth range.

$$d' = \left\lfloor \frac{d}{2^3} \right\rfloor \qquad (3.6)$$

$$d' = \left\lfloor \frac{(d - d_{min})\, 2^{13}}{2^{16} - d_{min}} \right\rfloor \qquad (3.7)$$

The same approach can be exploited by finding a maximum value $d_{max}$ for the depth range that cannot be exceeded, as shown in equation 3.8.

$$d' = \left\lfloor \frac{(d - d_{min})\, 2^{13}}{2^{16} - d_{min} - d_{max}} \right\rfloor \qquad (3.8)$$

Given that $d'$ is a 13 bit representation of the acquired depth value d, its 5 most-significant bits can be encoded in the UV channels and the remaining 8 bits can be trivially encoded in the Y channel. To encode 5 bit values on a spiral path in the UV space, it is possible to simply create a mapping between all $2^5 = 32$ possible values and 32 coordinates of points belonging to the spiral path.
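A minimal sketch of the range reduction of equation 3.7 and of the channel split follows: the 13 bit value is divided into its 5 most-significant bits, used as an index into the 32 entry spiral look-up table that provides the U, V coordinates, and its 8 least-significant bits, stored directly in Y. The look-up table is assumed to be precomputed (one possible construction is sketched at the end of this section); the clamping policy and the names are illustrative assumptions.

```cpp
// Sketch: reduce a 16 bit Kinect depth value to 13 bits (equation 3.7) and
// split it into a 5 bit spiral index for U,V plus an 8 bit value for Y.
#include <array>
#include <cstdint>

struct UV { float u, v; };

uint16_t quantizeDepth13(uint16_t d, uint16_t dMin)
{
    if (d < dMin) d = dMin;                                     // clamp below the working range
    uint32_t q = (uint32_t(d - dMin) << 13) / (65536u - dMin);  // equation 3.7
    return uint16_t(q & 0x1FFF);                                // 13 bit result
}

// spiralLUT maps each of the 32 possible 5 bit values to a point on the spiral path.
void encodePixel(uint16_t d, uint16_t dMin,
                 const std::array<UV, 32>& spiralLUT,
                 uint8_t& Y, float& U, float& V)
{
    uint16_t d13 = quantizeDepth13(d, dMin);
    uint8_t msb5 = uint8_t(d13 >> 8);     // 5 most-significant bits -> U,V (spiral index)
    Y = uint8_t(d13 & 0xFF);              // 8 least-significant bits -> Y
    U = spiralLUT[msb5].u;
    V = spiralLUT[msb5].v;
}
```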

The spiral chosen is an Archimedean spiral due to its features: it ensures that any ray from the origin intersects successive turnings of the spiral with a constant separation distance. This allows a good management of the point distribution, exploiting the UV space in its entirety.

Figure 3.14: Archimedean spiral[7]

r = a + bθ (3.9)

Equation 3.9 is the polar equation of the Archimedean spiral, where b controls the distance between successive turnings and a rotates the spiral. More precisely, the distance between successive turnings is 2πb. Starting from this equation, the parameters need to be tuned to obtain a spiral path that allows 32 points to be distributed while exploiting the UV space at best.


First, the parametric form of the spiral is calculated:

$$
\begin{cases}
x(t) = (a + bt)\cos(t) \\
y(t) = (a + bt)\sin(t)
\end{cases}
\qquad (3.10)
$$

Then, the curvilinear abscissa of the curve is calculated. The key idea is to distribute the 32 points equidistantly on the spiral path, taking into account also the distance between points placed on different turnings of the spiral.

$$
\begin{aligned}
\alpha(t) &= (x(t), y(t)) \\
S(t) &= \int_0^t \|\alpha'(u)\|\,du \\
S(t) &= \int_0^t \Big( b^2\cos^2(u) + (a + bu)^2\sin^2(u) - 2(ab + b^2 u)\sin(u)\cos(u) \\
&\qquad\quad + b^2\sin^2(u) + (a + bu)^2\cos^2(u) + 2(ab + b^2 u)\sin(u)\cos(u) \Big)^{1/2} du \\
S(t) &= \int_0^t \sqrt{b^2\cos^2(u) + (a + bu)^2\sin^2(u) + b^2\sin^2(u) + (a + bu)^2\cos^2(u)}\,du \\
S(t) &= \int_0^t \sqrt{b^2(\cos^2(u) + \sin^2(u)) + (a + bu)^2(\sin^2(u) + \cos^2(u))}\,du \\
S(t) &= \int_0^t \sqrt{b^2 + (a + bu)^2}\,du
\end{aligned}
\qquad (3.11)
$$

Since the rotation of the curve is not important, the a parameter can be set freely; let it be a = 0. Thus, the curvilinear abscissa of equation 3.11 can be simplified:

$$
\begin{aligned}
S(t) &= \int_0^t \sqrt{b^2 + (a + bu)^2}\,du \\
S(t) &= \int_0^t \sqrt{b^2 + (bu)^2}\,du \\
S(t) &= \int_0^t b\sqrt{1 + u^2}\,du \\
S(t) &= \frac{b}{2}\left( t\sqrt{t^2 + 1} + \sinh^{-1}(t) \right)
\end{aligned}
\qquad (3.12)
$$

Distributing the points equidistantly along the curvilinear abscissa, a good distribution of points in the XY plane is obtained if the distance between successive points is similar to the distance between successive turnings. An exception occurs near the origin, where the curvature of the spiral path is higher.
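A sketch of how the 32 look-up table points can be generated from equation 3.12 is shown below: the total arc length for the chosen number of turnings is divided into 32 equal steps, the parameter t of each step is recovered by numerically inverting S(t) with bisection, and the resulting x, y coordinates are scaled into the unit UV square. The number of turnings, the bisection iteration count and the final scaling are assumptions used only for illustration.

```cpp
// Sketch: place 32 equidistant points (by arc length) on an Archimedean spiral
// r = b*t (a = 0) and map them into the unit UV square, using the closed form
// S(t) = b/2 * (t*sqrt(t^2+1) + asinh(t))  (equation 3.12).
#include <array>
#include <cmath>

struct UV { float u, v; };

static double arcLength(double b, double t)
{
    return 0.5 * b * (t * std::sqrt(t * t + 1.0) + std::asinh(t));
}

// Invert S(t) = s by bisection on [0, tMax]; S is monotonically increasing.
static double invertArcLength(double b, double s, double tMax)
{
    double lo = 0.0, hi = tMax;
    for (int i = 0; i < 60; ++i) {               // enough iterations for double precision
        double mid = 0.5 * (lo + hi);
        (arcLength(b, mid) < s ? lo : hi) = mid;
    }
    return 0.5 * (lo + hi);
}

std::array<UV, 32> buildSpiralLUT(double b, double turns /* e.g. 3 turnings, an assumption */)
{
    const double PI = 3.14159265358979323846;
    const double tMax = 2.0 * PI * turns;
    const double total = arcLength(b, tMax);
    const double rMax = b * tMax;                // outermost radius, used for normalization
    std::array<UV, 32> lut;
    for (int k = 0; k < 32; ++k) {
        double s = total * (k + 1) / 32.0;       // equidistant arc-length samples
        double t = invertArcLength(b, s, tMax);
        double x = b * t * std::cos(t);
        double y = b * t * std::sin(t);
        lut[k] = { float(0.5 + 0.5 * x / rMax),  // scale and offset into the unit UV square
                   float(0.5 + 0.5 * y / rMax) };
    }
    return lut;
}
```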
