HUMAN BODY MODELING AND TRACKING USING VOLUMETRIC REPRESENTATION: SELECTED RECENT STUDIES AND POSSIBILITIES FOR EXTENSIONS


Cuong Tran, Mohan M. Trivedi
Computer Vision and Robotics Research Laboratory
University of California, San Diego

ABSTRACT

Articulated human body modeling and tracking from vision data is an attractive research area with many potential applications. A large body of related work already exists, so follow-up studies need both a clear picture of the high-quality existing work and an awareness of the research frontier. With that objective, this paper reviews the subarea of model-based methods for human body modeling and tracking using volumetric (voxel) data. We focus on analyzing and comparing some recent techniques, especially those from the past two years, in order to highlight trends in the domain and to point out limitations of the current state of the art. Based on this analysis, we discuss our idea of combining Laplacian Eigenspace (LE) based voxel segmentation [20] with the Kinematically Constrained Gaussian Mixture Model (KC-GMM) method [3] to obtain a more powerful human body pose estimation system, and we discuss other possibilities for future work.

Keywords: vision-based, markerless, human body pose estimation, volumetric reconstruction

1. INTRODUCTION

Vision-based pose estimation and tracking of the articulated human body is the problem of estimating the kinematic parameters of a body model (such as joint positions and joint angles) from a static image or a video sequence as the body's position and configuration change over time. Related research in this area includes body pose estimation, hand pose estimation, and head pose estimation. Among those, the most extensive subfield is body pose estimation, which refers to an articulated body model normally consisting of torso, head, and four limbs but without details of the hands, feet, or facial variation.
A good body pose estimation system has many potential applications, including advanced Human Computer Interaction (HCI), 3D animation, intelligent environments, and robot control. Compared to previous technologies that use markers or specific devices, markerless vision-based approaches provide more natural, non-contact solutions. This is, however, a very challenging task. One major reason is the very high dimensionality of the pose configuration space; e.g., in [3], 19 DOF (degrees of freedom) are used for the body model and 27 DOF for the hand model. Moreover, we also have to deal with other common issues in computer vision such as self-occlusion and variation in lighting conditions, shadows, and object appearance (e.g. different clothes and hair).

Surveys of techniques for human body pose estimation can be found in [14, 15, 17, 23], each with a different focus and taxonomy. Werghi [23] provided a general overview of both 3D human body scanner technologies and approaches dealing with such scanned data, which focus on one or more of the following topics: body landmark detection, body scanned data segmentation, body modeling, and body tracking. Poppe [17] surveyed pose estimation techniques and described two divisions: into 2D and 3D approaches, depending on whether the goal is a 2D or a 3D pose representation, and into model-based and model-free approaches, depending on whether an a priori kinematic body model is employed. That survey split the pose estimation process into a modeling process, which is the construction of the likelihood function, and an estimation process, which is concerned with finding the optimal pose given the likelihood. Moeslund et al. [14] split the pose estimation process into initialization, tracking, pose estimation, and recognition.
In [15], they also provided an updated review of advances in human motion capture for the period starting in 2000.

We see that it is not easy to have a unified taxonomy for the broad area of human body modeling and tracking. In Figure 1, we give a simple block diagram of a generic human body pose estimation system, in which we first need some components to extract useful features from the input vision data and then a procedure to infer the body pose from the extracted features. We can loosely categorize related research studies into monocular [9, 12, 18] and multi-view approaches [1, 2, 3, 4, 5, 6, 10, 11, 13, 16, 20]. Compared to
monocular view, multi-view data can help to reduce the self-occlusion issue and provides more information, which makes the pose estimation task easier and improves accuracy. Among multi-view approaches, some methods use 3D features reconstructed from multiple views [1, 2, 3, 4, 5, 6, 13, 20], e.g. volumetric (voxel) data, while others still use 2D features [10, 11, 16], e.g. color, edges, and silhouettes. Because the real body pose is in 3D, using voxel data avoids repeatedly projecting the 3D body model onto the image planes to compare it against the extracted 2D features. Furthermore, reconstructed voxel data avoids the image scale issue. These advantages allow the design of simple algorithms, and we can make use of our knowledge about the shapes and sizes of body parts. For example, Mikic et al. [13] used specific information about the shape and size of the head and torso in a hierarchical growing procedure (detecting the head first, then the torso, then the limbs) for body model acquisition that can be used effectively even when there are large displacements between frames. Several methods that use voxel data alone indicate that voxel data is a strong cue for body pose estimation. Of course, there is an additional computational cost for voxel reconstruction, but efficient techniques for this task have also been developed [1, 4, 5, 19].

Another input used in many methods is a predefined kinematic model of the human body. These methods are called model-based methods: there is an underlying kinematic model and a procedure to fit that model onto the real input data. There are also model-free methods, which assume no underlying kinematic model and contain procedures to learn a direct mapping from feature space to pose configuration space.
Although information from an underlying kinematic body model can help to improve accuracy and robustness, the advantage of model-free methods is that they do not suffer from the (re)initialization issue and can be used to initialize model-based methods.

Regarding the pose estimation result, two research directions have emerged. One aims only to extract high-level abstract information about the motion and posture of the body, which can then be applied to, for example, gesture classification. The other aims to recover the real (full) 3D motion and posture of the human body. The latter is more challenging but also worth pursuing, because it provides more general, principled methods that can be adapted to extract different high-level abstract information depending on the application area. Moreover, various interaction styles and applications explicitly rely on full 3D pose information. This paper focuses on model-based methods for real 3D human body pose estimation using reconstructed voxel data.

The remainder of the paper is organized as follows. Section 2 reviews selected recent model-based methods for human body pose estimation using voxel data. Motivated by this review, Section 3 presents the idea of combining LE-based voxel segmentation and the KC-GMM method into a more powerful system for human body pose estimation. Finally, Section 4 gives some concluding remarks and discusses directions for future work.

Figure 1. Block diagram of a generic human body pose estimation system. A dashed line means that the underlying kinematic model may or may not be used. Gray boxes show the focus of this paper: model-based methods that use voxel data and aim to extract the full 3D posture.

Figure 2. Flowchart of common model-based methods for articulated human body pose estimation using voxel data. Dashed boxes mean that some methods may or may not have all of these steps. The initialization/segmentation step may be called at each frame or only when we need to initialize or re-initialize the body model during tracking. A summary of the steps contained in some selected methods is shown at the bottom.

2. A REVIEW OF SELECTED RECENT MODEL-BASED METHODS FOR HUMAN BODY POSE ESTIMATION USING VOXEL DATA

Table 1. Summary of selected model-based methods for body pose estimation using voxel data. The last three rows are recent methods chosen for more detailed discussion.

| Authors | Initialization | Model | Estimation procedure | Evaluation | Comment |
|---|---|---|---|---|---|
| Delamarre et al. '01 [6] | A priori known model with manual initial placement | 3D primitive shape model (truncated cones, spheres, parallelepipeds as components) | Tracking-based method (Kalman filter). Uses physical forces, a simpler form of Iterative Closest Point (ICP), to make the 3D model and the voxel data intersect | Visual only | Not really a method for estimating pose in voxel space (only uses 2 views with epipolar geometry reconstruction) -> limitation due to possible ambiguities |
| Mikic et al. '03 [13] | Hierarchically grows model over data from head to torso, to limbs | Ellipsoidal, cylindrical components. Described with twists | Tracking-based method. Uses Kalman filter to predict next pose, then updates using growing procedure and Bayesian Networks | Visual only | Fully automated; can track even large displacements |
| Cheung et al. '03 [5] | CSP alignment and segmentation/clustering | Skeletal body model | Uses Colored Surface Points (CSP). Hierarchical SFS alignment to recover motion, shape and joints | Synthesized ground-truth | Fully automated; segmentation via motion |
| Sundaresan et al. '07 [20] | Voxel segmentation in LE and a procedure for body parts registration | 6-chain representation of body. Superquadric components in body model | Segments voxel data in Laplacian Eigenspace (LE). Probabilistically registers segmented voxels to body parts, then estimates skeletal and superquadric parameters | Synthesized ground-truth | Fairly general LE-based voxel segmentation; fully automated; however, sensitive to noise in voxel data |
| Cheng et al. '07 [3] | Manual initialization of body part dimensions and initial pose | Ellipsoidal components. Described by Kinematically Constrained Gaussian Mixture Model (KC-GMM) | Integrates kinematic constraints in KC-GMM model. Derives EM algorithm with KC-GMM for pose estimation (no additional projection step) | Synthesized and marker-based motion capture for ground-truth | Has generality: has been applied to both body and hand; however, requires a manual initialization |
| Caillette et al. '08 [1] | A priori known skeletal model. Initialized with K-means blob fitting procedure | Skeletal body model and Gaussian blobs | Tracking-based method. Breaks complex movement into basic motions. Uses Variable Length Markov Model (VLMM) to predict candidate pose. Evaluates with blob fitting. Uses colored voxels for more robust tracking | Manually annotated ground-truth | Fast (real-time); however, limited to trained movement sequences |

Figure 2 shows a typical flowchart of common model-based pose estimation methods using voxel data. There are five main steps: camera calibration/data capture, voxel reconstruction, initialization/segmentation (segmenting voxel data into different body parts), modeling/estimation (estimating the pose using the current frame only), and tracking (using temporal information from previous frames when estimating the body pose in the current frame). The first two steps are common to all methods of this kind, while different methods may cover different combinations of the last three steps. A summary of which steps are contained in some selected methods is shown at the bottom of Figure 2.

Compared to previous surveys [14, 15, 17, 23], this review focuses on analyzing model-based methods for body pose estimation using voxel data. In addition to some methods already mentioned in previous surveys (prior to
2007), we emphasize in our analysis some selected methods from 2007 and 2008 that could be considered the current state of the art: a general probabilistic method that can be applied to both body and hand models (Cheng et al. '07 [3]), a method with real-time performance (Caillette et al. '08 [1]), and a new general LE-based method for voxel segmentation (Sundaresan et al. '07 [20]).

For the data capture and voxel reconstruction steps, several human body scanner technologies are mentioned in [23]; among them, we are concerned with vision-based techniques that reconstruct voxel data of the body from multiple perspective cameras. A common approach is shape-from-silhouette (visual hull). First, the images from multiple synchronized cameras are segmented into object silhouettes using background subtraction techniques [7, 8]. Then efficient shape-from-silhouette techniques [4, 5, 19] can be used to retrieve the 3D voxel data. Another approach, shape-from-photo-consistency (photo hull) [21], uses other features from the photos (not just the silhouette) to obtain a more accurate geometry of the reconstructed hull. For body pose estimation, the more accurate voxel geometry is not necessary, so the visual hull is more appropriate (it should be faster and more robust to noise).

The modeling and tracking steps can be considered as a mapping from the input space of voxel data Y and the information in the predefined model C (e.g. kinematic constraints) to the body model configuration space $\Theta$:

$$M : (Y, C) \mapsto \Theta$$

The body model configuration contains both static parameters (i.e. the shape and size of each body component) and dynamic parameters (i.e. the mean and orientation of each body component); the static parameters are estimated in the initialization step. Some methods have an automatic initialization step, like [13, 20], while others require a priori known or manually initialized static parameters.
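The shape-from-silhouette (visual hull) reconstruction described in this section can be illustrated with a minimal sketch. It is not the implementation of any of the cited systems [4, 5, 19]: the function name `carve_visual_hull`, the use of 3x4 projection matrices, and the nearest-pixel lookup are all assumptions made for illustration. A voxel is kept only if its center projects onto a foreground pixel in every view.

```python
import numpy as np

def carve_visual_hull(silhouettes, projections, grid_points):
    """Keep the voxels whose centers project inside every silhouette.

    silhouettes : list of (H, W) boolean arrays, True = foreground
    projections : list of (3, 4) camera projection matrices (assumed calibrated)
    grid_points : (N, 3) array of voxel-center coordinates in world space
    Returns a boolean array of length N, True = voxel is inside the hull.
    """
    n = grid_points.shape[0]
    homogeneous = np.hstack([grid_points, np.ones((n, 1))])  # (N, 4)
    occupied = np.ones(n, dtype=bool)
    for sil, P in zip(silhouettes, projections):
        h, w = sil.shape
        uvw = homogeneous @ P.T                      # homogeneous pixel coordinates
        in_front = uvw[:, 2] > 1e-9                  # discard points behind the camera
        depth = np.where(in_front, uvw[:, 2], 1.0)   # avoid division by zero
        u = np.rint(uvw[:, 0] / depth).astype(int)   # pixel column
        v = np.rint(uvw[:, 1] / depth).astype(int)   # pixel row
        inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(n, dtype=bool)
        hit[inside] = sil[v[inside], u[inside]]
        occupied &= hit                              # intersection over all views
    return occupied
```

Because each view can only remove voxels, a noisy silhouette with missing foreground pixels carves away true body voxels; this is one reason the quality of background subtraction matters for the later segmentation steps.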
The main differences between methods of this kind are in the body model that they use and in how they implement the mapping procedure M. Methods that have a modeling step but no tracking step are also called single-frame-based methods, while methods with a tracking step are called tracking-based methods. Because the tracker in tracking-based methods may be lost over long sequences, multiple hypotheses at each frame can be used to improve the robustness of tracking. The single-frame-based approach is a more difficult problem because it does not make any assumptions about time coherence. However, this kind of approach is needed for the initialization or re-initialization of tracking-based methods.

Regarding the evaluation step, some methods only have visual evaluation, while others have both visual and quantitative evaluation using ground-truth data obtained from synthesized data, manual annotation, or a marker-based motion capture system.

According to the factors mentioned above, a summary of several recent model-based methods for human body pose estimation using volumetric data is shown in Table 1. In the following sections, we discuss in more detail some selected state-of-the-art methods, published in the past two years, to emphasize the important results and limitations of each one.

2.1. Kinematically Constrained Gaussian Mixture Model (KC-GMM) approach for both body and hand pose estimation [3]

This is one of very few methods that have been applied (with experimental results) to both body models and hand models. Among several methods competing in the Workshop on Evaluation of Articulated Human Motion and Pose Estimation - CVPR EHuM2 (including [3, 10, 16]), this method won first prize. The hand model and body model used in this method are shown in Figure 3(a). For the hand, there are 16 components with 27 DOF (degrees of freedom). For the body, there are 11 components with 19 DOF.
The pose estimation procedure of this method follows the paradigm of probabilistic clustering. Each body/hand component is described by a Gaussian, and the set of components is kinematically constrained according to the predefined model. The goal is then to estimate the optimal parameters of the Gaussian Mixture Model (GMM) under those kinematic constraints. The kinematic constraints are represented by three equations corresponding to three types of constraint: the spherical (3 DOF) constraint, the Hardy-Spicer (2 DOF) constraint, and the revolute (1 DOF) constraint:

$$c_s(\theta) = p_i + R_{0i} a_{ij} - (p_j + R_{0j} a_{ji})$$

$$c_h(\theta) = (R_{0i} q_{ij})^{\top} (R_{0j} q_{ji})$$

$$c_r(\theta) = R_{0i} q_{ij} - R_{0j} q_{ji}$$

where $\theta$ is the embodiment of the kinematic constraints and all configuration parameters; $p_i$, $p_j$ are the means of components i and j; $R_{0i}$, $R_{0j}$ are the rotations of the components relative to the world coordinates; $a_{ij}$, $a_{ji}$ are the joint positions in the component coordinate frames (the origin is at the component center); and $q_{ij}$, $q_{ji}$ are the rotation axes of each component in its own component coordinate frame. We can interpret these equations as follows: $c_s = 0$ means that the joint locations on the two components coincide, which gives a 3 DOF constraint; $c_h = 0$ means that the two rotation axes are perpendicular (their inner product vanishes), which combined with $c_s = 0$ gives a 2 DOF constraint; $c_r = 0$ means that the two rotation axes are aligned, which combined with $c_s = 0$ gives a 1 DOF constraint. In a previous work by the same authors [2], these constraint equations are satisfied by adding a constraining step (C-step) to the EM algorithm for Gaussian mixture estimation.
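To make the three constraint types concrete, the following sketch evaluates their residuals for a single joint between two components. It is an illustration under assumptions, not code from [2] or [3]: the function names and the helper `rot_z` are hypothetical, and the Hardy-Spicer residual is written as an inner product of the two world-frame axes, chosen so that a zero residual matches the stated meaning (perpendicular axes).

```python
import numpy as np

def spherical_residual(p_i, R_i, a_ij, p_j, R_j, a_ji):
    """c_s: gap between the joint location seen from component i and from j."""
    return (p_i + R_i @ a_ij) - (p_j + R_j @ a_ji)

def hardy_spicer_residual(R_i, q_ij, R_j, q_ji):
    """c_h: inner product of the world-frame rotation axes (0 when perpendicular).

    Assumption: inner-product form, consistent with 'axes are perpendicular'.
    """
    return float((R_i @ q_ij) @ (R_j @ q_ji))

def revolute_residual(R_i, q_ij, R_j, q_ji):
    """c_r: difference of the world-frame rotation axes (0 when aligned)."""
    return R_i @ q_ij - R_j @ q_ji

def rot_z(angle):
    """Hypothetical helper: rotation by `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
```

For example, with component i at the origin, component j rotated by `rot_z(np.pi / 6)`, and `p_j` chosen as `p_i + R_i @ a_ij - R_j @ a_ji`, the spherical residual vanishes; using the z axis as the rotation axis of both components also makes the revolute residual vanish, which corresponds to a hinge (1 DOF) joint such as an elbow.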

However, this additional C-step in the Gaussian mixture estimation may compete with the M-step and cause instability in the optimization. The primary contribution of [3] is the removal of this C-step by incorporating the kinematic constraints into the probability model, in the form of a prior probability, to obtain a Kinematically Constrained Gaussian Mixture Model (KC-GMM):

$$P(Y, c \mid \theta) = P(c \mid \theta) \prod_{n} P(y_n \mid c, \theta)$$

where $Y = \{y_n\}$ represents the distribution of the input voxel data. The EM algorithm for this new probability model is then derived to estimate the Gaussian component parameters, which can be interpreted as the body configuration. This method is quite general and was applied successfully to both HumanEva body data and synthesized hand data. Some visual and quantitative results from [3] are shown in Figure 4. However, this method is not fully automated because it requires a careful manual initialization step, which is an obstacle to using it in real-time applications. Another issue of the KC-GMM method is that, due to the nature of the EM algorithm, it can get stuck in a sub-optimal solution, especially when there is a large displacement between frames.

Figure 3. Some body/hand models used in the analyzed methods: (a) hand model and body model used in [3]; (b) body model used in [1]; (c) 6-chain, skeletal, and superquadric body models used in [20].

2.2. Real-time approach using Variable Length Markov Model (VLMM) pose prediction followed by a Gaussian blob fitting procedure for body modeling and tracking [1]

Because the model of the articulated body is complex and high dimensional, the running speed of the pose estimation algorithm is a real issue. Real-time performance is a prominent goal of this method, and it is one of very few methods that report run-time performance. The body model used in this method is shown in Figure 3(b); it consists of a skeletal model and Gaussian blobs attached to the bones of this skeletal model.
For real-time performance in voxel reconstruction, the authors proposed not to perform binary segmentation of the input images but instead to compute, for each 2D sample, a measure of the distance to the background model. The statistics of these distances across the available views are then used to classify voxels. In this method, color information is also kept with each voxel, which allows more robust tracking.

This is a tracking-based method that exploits temporal dependencies from previous frames. For more accurate and efficient hypothesis propagation (pose prediction), complex human activities such as dancing are broken into elementary movements using a variant of the EM algorithm (partitioning the parameter space into Gaussian clusters). The transitions between clusters are predicted using a Variable Length Markov Model (VLMM), which can explain high-level behaviors over a long history. The likelihood is evaluated with a Gaussian blob fitting procedure. This blob-fitting procedure can detect tracking failures, e.g. when the best achieved likelihood is below a threshold. A re-initialization can then be requested by performing blob fitting from all clusters, which, however, might cause a considerable lag.

Figure 5 shows some experimental results of this method, which indicate an improvement compared to some standard particle filter based algorithms. The run-time performance was also reported: the total time for both voxel reconstruction and body pose inference is around 110 milliseconds, depending on the configuration parameters. However, the performance of this method depends largely on the correctness of the prediction result, which means that it requires a good training phase and it
could only work well with some specific types of trained movements. The current implementation of this method does not handle new movements that were not seen in the training data.

Figure 4. Experimental results in [3]: The first and second rows are visual results with HumanEva II body data and synthesized hand data. The third and fourth rows are quantitative results (joint position/angle) for synthesized hand data. (Plot panels: overall positional error at a voxel resolution of 0.5 cm; axes: frame index, spatial error [cm], angular error [degrees].)

Figure 5. Experimental results in [1]: The visual result with a ballet sequence and the quantitative results (joint position/angle), which compare the performance of the method in [1] with some standard particle filter based methods.

Figure 6. Experimental results in [20] with synthesized data. The authors also experimented with scanned data, real captured data, and the HumanEva II dataset.

Figure 7. LE-based voxel segmentation in [20] performed successfully in the case of self-contact, which some previous voxel segmentation algorithms do not address.

2.3. Laplacian Eigenspace (LE) based approach for body modeling [20]

This is a kind of skeletonization method that obtains the skeletons of the individual articulated body chains. The voxel segmentation technique in this method is quite general and can handle poses with self-contact, i.e. when one or more limbs touch other body parts. In this method, the voxel data is first segmented into 6 chains representing the body (torso, head, and four limbs). Based on this segmentation, a more detailed skeletal model and a superquadric model representing the body are estimated. These representations of the body (6-chain, skeletal, and superquadric models) are shown in Figure 3(c).
A main contribution of this method is to discover the
properties of the Laplacian Eigenspace (LE) transformation after comparing several manifold techniques (Laplacian, Isomap, MDS, ...). When mapped into a high-dimensional (e.g. 6D) LE, the voxel data of body chains such as limbs, whose length is greater than their thickness, form smooth 1-D curves, which can then be used to segment the voxel data into different body chains.

Figure 8. Intended steps of a method combining LE-based voxel segmentation and the KC-GMM method for automatic initialization and tracking of the human body: (1) acquire body voxel data at a specific initial pose; (2) segment the initial voxel data using spline fitting in LE; (3) initialize the body model using the segmented voxel data; (4) continue tracking with the KC-GMM method.

The procedure for the LE mapping is as follows. First, we compute the adjacency matrix W of the voxel data, such that $W_{ij} = 1$ only if voxel i is a neighbor of voxel j. Then, we compute the matrix D such that $D_{ii} = \sum_{k=1}^{m} W_{ik}$ and $D_{ij} = 0$ for $i \neq j$. The first d eigenvectors of $L = D - W$ with minimum eigenvalues give us the d basis vectors of the needed LE. After mapping into LE, there are several 1-D curves corresponding to different body chains. A spline fitting process is used to segment the curves, which results in the segmentation of their respective body chains. The segmented voxel clusters are then registered to their actual body chains using a probabilistic registration procedure. Next, using general knowledge about human stature, a procedure estimates the skeletal and superquadric models of the body. The authors experimented with several synthesized, scanned, and real captured body data sets (e.g. Figure 6). Figure 7 shows that the proposed LE-based voxel segmentation performed successfully in the case of self-contact, which was not addressed in some previous voxel segmentation algorithms.
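The LE mapping procedure described in this section (adjacency matrix W, degree matrix D, eigenvectors of L = D - W) can be sketched in a few lines. This is a minimal illustration, not the implementation of [20]: the function name `laplacian_eigenspace`, the 26-neighborhood used to define neighbors, and the choice to skip the constant eigenvector (which carries no shape information for a connected voxel set) are assumptions. The dense matrices limit it to small voxel sets.

```python
import numpy as np

def laplacian_eigenspace(voxels, d=6):
    """Map occupied voxels to a d-dimensional Laplacian Eigenspace.

    voxels : (m, 3) integer array of occupied voxel indices
    Returns an (m, d) array; row i is the LE coordinate of voxel i.
    """
    voxels = np.asarray(voxels)
    # W_ij = 1 when voxels i and j are neighbors.
    # Assumption: 26-neighborhood, i.e. Chebyshev distance exactly 1.
    cheb = np.abs(voxels[:, None, :] - voxels[None, :, :]).max(axis=2)
    W = (cheb == 1).astype(float)
    D = np.diag(W.sum(axis=1))            # D_ii = sum_k W_ik, zero off the diagonal
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues returned in ascending order
    # Assumption: drop the first (constant) eigenvector and keep the next d.
    return eigvecs[:, 1:d + 1]
```

For a single straight chain of voxels, the first LE coordinate varies monotonically along the chain, which is the property that lets a 1-D curve (and then a spline) be fitted to each limb.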
Their experimental result with the HumanEva dataset, however, was not good (only around 9% of the total frames were successfully segmented and registered), which indicates the sensitivity of the LE-based segmentation step to voxel noise; this affects all subsequent steps.

3. IDEA OF COMBINING LE-BASED VOXEL SEGMENTATION AND KC-GMM METHOD INTO A MORE POWERFUL SYSTEM FOR HUMAN BODY POSE ESTIMATION

As discussed above, a desired improvement of the KC-GMM method [3] is an automated initialization step. A possible solution is to use the results in [13], which rely on specific information about the shape and size of the head and torso in a hierarchical growing procedure for body model acquisition: locating the head first, then the torso, then the limbs. In doing so, however, we would lose the generality advantage of the KC-GMM method, which means we could not apply it to other articulated models such as the hand. The voxel segmentation using the LE transformation in [20] has this generality; for example, we should be able to apply it to the hand, because fingers also have a length greater than their thickness (it should be mentioned, however, that the subsequent steps in [20], probabilistic registration and model estimation, are specific to the body). Therefore, LE-based voxel segmentation would be a more appropriate choice for improving the KC-GMM method with an automated initialization step.

Regarding the LE-based method for body modeling [20], the experiment with the HumanEva dataset indicates that the voxel segmentation step is sensitive to noise and that failure in this initial step affects the whole process. This motivates the idea that, instead of doing voxel segmentation at every frame, we only use it for initialization. In subsequent frames, a tracking-based method that exploits temporal information, such as the KC-GMM method, could help to overcome the sensitivity to noise to some extent (the KC-GMM method obtained quite good experimental results on the HumanEva dataset).
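The "segment once, then track" idea can be sketched as a warm-started EM loop. The sketch below uses a plain, unconstrained Gaussian mixture, so it is explicitly not the KC-GMM method of [3] (the kinematic prior is omitted); the function names `em_step` and `track` and the fixed iteration count are assumptions. The initial means, covariances, and weights stand in for the output of a one-time segmentation of the first frame.

```python
import numpy as np

def em_step(points, means, covs, weights, reg=1e-6):
    """One EM iteration of a plain (unconstrained) Gaussian mixture."""
    n, dim = points.shape
    k = len(means)
    log_r = np.empty((n, k))
    for j in range(k):
        diff = points - means[j]
        inv = np.linalg.inv(covs[j])
        _, logdet = np.linalg.slogdet(covs[j])
        maha = np.einsum("ni,ij,nj->n", diff, inv, diff)
        log_r[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet + dim * np.log(2 * np.pi))
    # E-step: responsibilities, normalized in log space for stability.
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted means, covariances, and mixing weights.
    nk = r.sum(axis=0) + 1e-12
    new_means = (r.T @ points) / nk[:, None]
    new_covs = np.empty_like(covs)
    for j in range(k):
        diff = points - new_means[j]
        new_covs[j] = (r[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(dim)
    return new_means, new_covs, nk / nk.sum()

def track(frames, means, covs, weights, iters=10):
    """Run EM on each frame, starting from the previous frame's estimate."""
    history = []
    for points in frames:
        for _ in range(iters):
            means, covs, weights = em_step(points, means, covs, weights)
        history.append(means.copy())
    return history
```

Because each frame starts from the previous estimate, component identities are preserved as long as the displacement between frames is small relative to the component sizes; a large jump can still leave EM in a sub-optimal solution, which is the limitation noted for EM-based tracking in Section 2.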
The intended steps of a combined method following this idea are shown in Figure 8. The body starts at a specific initial pose (e.g. a stretch pose) that clearly reveals the body's structure. The LE transformation is applied to segment the body voxels at this initial pose into different parts (e.g. limbs). With a well-chosen initial pose, we can expect a good segmentation result. Also, because we require the initial pose to be specific, it is possible to develop a simple procedure to initialize the body model from the segmented voxels. After the body model is initialized, the KC-GMM method is used for body pose estimation in subsequent frames.

4. CONCLUDING REMARKS AND FUTURE WORK

In this paper, we provide a review of the subarea of model-based methods for real human body pose estimation using volumetric data. After a brief overview that puts this subarea in context, we focus on analyzing and comparing several selected methods, especially some recent methods from the past two years, to highlight their important results, including increasing generality, real-time performance, and a new general LE-based method for voxel segmentation.

Based on this analysis, we discuss our idea of a method that combines LE-based voxel segmentation and the KC-GMM method for automatic human body model initialization and tracking using voxel data. Implementing this idea is our immediate follow-up work.

We can think of several other directions for future work on improving the performance and robustness of current pose estimation methods. First, we can keep trying to combine good characteristics from different methods to obtain a more robust one. For example, we may want to incorporate some kind of prediction information, as done in [1, 13], into the proposed combined method. Second, we can find ways to use both 3D voxel features and 2D features. In [1, 5], color information is associated with the voxel data. Other 2D features such as edges and appearance models should also be useful. Regarding the major difficulty of the high-dimensional body pose configuration space, we can also exploit the divide and conquer principle by breaking the problem into lower-dimensional ones, like the hierarchical estimation of body pose in [13] (detect the head first, then the torso, and so on) or the breaking of complex human movement into basic motions in [1].

There are also some open related research areas that should be mentioned. The first is human body pose estimation at multiple levels (e.g. body level, head level, hand level), which was mentioned in [22]. Such a multilevel human body pose estimation system has two benefits: combined information from different levels is more useful (e.g. in an intelligent environment, the combination of body pose, hand pose, and head pose would give a better interpretation of human status/intention), and information from different levels can support each other and help to improve the estimation performance. However, typical approaches in the area deal with body pose estimation, hand pose estimation, and head pose estimation separately.
Therefore, it is worth studying why typical approaches handle only one task at a time and finding a way to achieve a full body model (e.g., including body, head, and hands). Another open related research area worth addressing is the simultaneous pose estimation and tracking of multiple objects.

ACKNOWLEDGEMENT

We would like to thank our colleagues at the CVRR lab, especially Dr. Shinko Cheng, for useful discussions and assistance.

REFERENCES

[1] F. Caillette, A. Galata, T. Howard, "Real-time 3-D Human Body Tracking Using Learnt Models of Behaviour", Computer Vision and Image Understanding (109), 2008.
[2] S. Cheng, M. Trivedi, "Multimodal Voxelization and Kinematically Constrained Gaussian Mixture Model for Full Hand Pose Estimation: An Integrated Systems Approach", IEEE Int. Conference on Computer Vision Systems, pages 34-42.
[3] S. Cheng, M. Trivedi, "Articulated Human Body Pose Inference from Voxel Data Using a Kinematically Constrained Gaussian Mixture Model", CVPR EHuM2: 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation.
[4] G. Cheung and T. Kanade, "A Real-time System for Robust 3D Voxel Reconstruction of Human Motions", IEEE Proc. Computer Vision and Pattern Recognition Conference.
[5] G. Cheung, S. Baker, and T. Kanade, "Shape-From-Silhouette of Articulated Objects and Its Use for Human Body Kinematic Estimation and Motion Capture", IEEE Computer Vision and Pattern Recognition Conference.
[6] Q. Delamarre and O. Faugeras, "3D Articulated Models and Multiview Tracking With Physical Forces", Computer Vision and Image Understanding, 81(3), 2001.
[7] A. Doshi, M. Trivedi, "Hybrid Cone-Cylinder Codebook Model for Foreground Detection with Shadow and Highlight Suppression", IEEE International Conference on Advanced Video and Signal based Surveillance, Nov.
[8] T. Horprasert, D. Harwood, and L. S. Davis, "A Statistical Approach for Real-time Robust Background Subtraction and Shadow Detection", IEEE Proceedings ICCV Frame-Rate Workshop, September.
[9] E. Hunter, "Visual Estimation of Articulated Motion Using Expectation-Constrained Maximization Algorithm", PhD thesis, University of California, San Diego.
[10] Z. Husz, A. Wallace, "Evaluation of a Hierarchical Partitioned Particle Filter with Action Primitives", CVPR EHuM2: 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation.
[11] D. Knossow, R. Ronfard, R. Horaud, "Human Motion Tracking with A Kinematic Parameterization of Extremal Contours", International Journal of Computer Vision, vol. 79.
[12] M.W. Lee, R. Nevatia, "Human Pose Tracking in Monocular Sequence Using Multi-level Structured Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

[13] I. Mikic, M. Trivedi, E. Hunter, P. Cosman, "Human Body Model Acquisition and Tracking using Voxel Data", International Journal of Computer Vision, pages 199-223, July 2003.
[14] T. Moeslund and E. Granum, "A Survey of Computer Vision Based Human Motion Capture", Computer Vision and Image Understanding, 81(3), 2001.
[15] T. Moeslund, A. Hilton, and V. Kruger, "A Survey on Advances in Vision-based Human Motion Capture and Analysis", Computer Vision and Image Understanding.
[16] R. Poppe, "Evaluating Example-based Pose Estimation: Experiments on the HumanEva Sets", CVPR EHuM2: 2nd Workshop on Evaluation of Articulated Human Motion and Pose Estimation.
[17] R. Poppe, "Vision-based Human Motion Analysis: An Overview", Computer Vision and Image Understanding, vol. 108, pages 4-18.
[18] D. Ramanan, D.A. Forsyth, and A. Zisserman, "Tracking People by Learning Their Appearance", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[19] G. Slabaugh, B. Culbertson, and T. Malzbender, "A Survey of Methods for Volumetric Scene Reconstruction from Photographs", International Workshop on Volume Graphics, 2001.
[20] A. Sundaresan, R. Chellappa, "Model Driven Segmentation of Articulating Humans in Laplacian Eigenspace", IEEE Transactions on Pattern Analysis and Machine Intelligence.
[21] G. Slabaugh, R. Schafer, M. Hans, "Image Based Photo Hulls", International Symposium on 3D Data Processing Visualization and Transmission.
[22] M. Trivedi, "Human Movement Capture and Analysis in Intelligent Environments", Machine Vision and Applications, vol. 14.
[23] N. Werghi, "Segmentation and Modeling of Full Human Body Shape From 3-D Scan Data: A Survey", IEEE Transactions on Systems, Man, and Cybernetics, Part C, 37(6), 2007.
