Human Body Pose Estimation - UNIVERSITÀ DEGLI STUDI DI PARMA Dottorato di Ricerca in Tecnologie

86 Chapter 5. Model-based Object Recognition

(a) (b) (c)

Figure 5.3: On the left, the kinematic chains used to describe the human model. On the centre, the skeleton of the model. This is the base position of the model in the first frame. On the right, the base mesh generated by the skeleton.

node. A joint (root) corresponding to the pelvis is the top of the hierarchy. Five dif-ferent kinematic chains are linked to it to describe the entire body. Figure5.3a de-scribes the kinematic model used for the skeleton, which is shown in Figure 5.3b.

The distances between two joints are based on the size of a standard model and are tuned according to the evolution of the Dimensions parameter vector (see next para-graph). Since the anatomical limits of the human body do not allow some rotations, the degrees of freedom of the whole skeleton are only 29.

Dimensions

Dimensionsis a vector of 7 parameters that define the actual size of the mesh. The reference model is defined as an average-height man, and these scaling parameters affect the distances between the joints and the thickness of the mesh built around them. These parameters are used as multipliers of the standard mesh, i.e., when they are equal to 1, the reference model is used as is; when they are larger than 1 the model is taller/fatter than the standard and when they are smaller than 1 the model is shorter/thinner.

5.3 Human Body Pose Estimation 87

Mesh

The actual mesh (Figure5.3c) is composed of 11 cylinder-spheres [10], whose posi-tions depend on the location of the skeleton’s joints and whose widths depend on the Dimensionsparameters. Two of them are used to describe the torso, one represents the head, and the remaining eight stand for the arms and the legs.

In conclusion, the total number of parameters subjected to DE optimization is 29 angles + 7 scale factors + 6 values that represent the position and orientation of the model with respect to the world reference system, for a total of 42 parameters. The model is intrinsically hierarchical, and this feature is also exploited during the opti-mization. In fact, the optimization is not performed all in one step, but is split into four steps, in each of which only some parameters of the model are optimized. In this way, one high-dimensional problem can be seen as four lower-dimensional problems, at the cost of performing four different and consecutive optimization processes. Af-ter one optimization step ends, its resulting parameAf-ters are frozen and the next step considers them as constants. This choice has been made after some tests made clear that tackling the pose estimation problem in a single step led to large errors.

5.3.2 The distance function

The fitness of a candidate solution is proportional to the sum of the distances of all points in the input cloud from the mesh. For each point in the cloud, the distance from each cylinder that composes the mesh is calculated using a variant of the method presented in [10]. This method has been chosen due to the very light computation needed to calculate the distance between a solid and a point. The shortest distance (di) from each point i to the closest part of the mesh undergoes this simple equation:

d⁰i=

( d_i^α if di≤ β β^α otherwise

The two parameters α and β affect the precision of the localization. A value of α > 1 can be used to help the model ignore small details, but be more focused on reaching all parts of the body (particularly hands and feet) while a value of α < 1

88 Chapter 5. Model-based Object Recognition

Figure 5.4: The repetition of these five steps compose the body pose estimation pro-cess. On the right the percentage of time spent on every function is reported. The total does not add up to 100% because initialization and result retrieval are not included.

induce the model to refine good solutions. β can be seen as the distance beyond which a point does not affect the pose estimate any more. In this way, the behavior of the model becomes more robust with respect to outliers. The final fitness is then calculated as the sum of all d⁰s terms.

5.3.3 Implementation details

Like all methods described in this Chapter, this application has been implemented on GPU. In this way, the huge amount of computation needed can be performed in a fast way and reach real-time performance (average 43.9 ± 3.4 ms for a frame). The final architecture is composed of 5 different functions (see Figure5.4). The first and the last implement the DE algorithm, while the remaining three are in charge of pose estimation and fitness evaluation:

• “Compute Skeleton” reads all the DE elements to be evaluated in parallel and uses them to generate the location of the skeleton joints. Despite the hierar-chical structure of the model, all joints can be computed in parallel in order to save time. A joint that depends on other joint locations, must go back to the root of the hierarchy and compute the chain;

• “Compute Mesh” generates the positions of the cylinders that compose the mesh using the joint locations and size modifiers;

5.3 Human Body Pose Estimation 89

• “Evaluate Fitness” computes, for each point, the distance from the closest part of the mesh and adds them up to obtain the final fitness. This is by far the most computationally expensive part of the algorithm.

5.3.4 Experimental Results

This algorithm has been tested on a publicly available dataset [45]. It is composed of 28 sequences collected using a Time-of-Flight camera, with a resolution of 176×144 pixels. During the acquisition, the subject was wearing markers, whose location can be used as ground truth to assess the precision of the pose estimation. This dataset has also been used in [6,45,161,190], which allows a comparison with the results reported therein to be made.

Every sequence was processed 10 times; the average results (compared to the other four approaches) are presented in Figure5.5. The sequences of this dataset are approximately sorted by difficulty. While in the first ones the subject performs only small movements, in the last ones the movements are more complicated (including a 360^◦ rotation and a tennis serve). Our method is one of the worst performing on the simpler sequences (probably due to our mesh definition, since a cylinder is not a perfect approximation of a limb) while, when processing the most complicated ones, results are comparable to the state-of-the art, showing that our method has good generalization properties and good tracking results.

Figure 5.5: Comparison with the state-of-the-art. Results represent the average dis-tance over all markers on all frames of the sequence over 10 independent runs.

90 Chapter 5. Model-based Object Recognition

Nel documento UNIVERSITÀ DEGLI STUDI DI PARMA Dottorato di Ricerca in Tecnologie dell’Informazione (pagine 99-104)