Low Cost Motion Capture

R. Budiman, M. Bennamoun, D.Q. Huynh
School of Computer Science and Software Engineering
The University of Western Australia
Crawley WA 6009 AUSTRALIA
Email: budimr01@tartarus.uwa.edu.au, {bennamou,du}@csse.uwa.edu.au

Abstract

Traditionally, computer animation techniques were used to create the movements of an object. Unfortunately, these techniques require much human intervention to work out the different joint angles for each movement. Not only is the task very time-consuming, but the movements created are often unrealistic. Modern motion capture techniques overcome these problems by capturing the actual movements of a performer (e.g. a human being) from the detected positions or angles of sensors or optical markers placed on the subject. Despite its advantages, motion capture has always been considered an expensive technology. In this paper, we describe a low cost motion capture system that uses two low cost webcams. We also present experimental results of the 3D reconstruction of the lower body of a human subject.

Keywords: Motion capture, Mean-shift algorithm, Camera calibration, 3D reconstruction

1 Introduction

Motion capture, or mocap, is a technique for digitally recording the movements of real beings, usually humans or animals. Traditionally, computer animation techniques are used to create the movements of a being. However, these techniques have proven to be time consuming and difficult. Motion capture is considered a better technique for accurately generating movements for computer animation.

There are three types of motion capture techniques [1]. The first is optical motion capture, in which photogrammetry is used to establish the position of an object in 3D space from its observed locations in the 2D fields of view of a number of cameras. The second is magnetic motion capture, where the position and orientation of magnetic sensors are calculated with respect to a transmitter. The last is electro-mechanical motion capture, which involves modelling movements using a body suit with sensors attached.

The need for optical motion capture can be justified by the fact that this technique can cover a large active area and, owing to the light weight of the markers, gives the subject more freedom of movement. Despite these advantages, optical motion capture technologies have been known to be expensive. The high cost is mainly attributable to the hardware components (i.e. high speed cameras).

In this paper, we describe the design and implementation of a low cost optical motion capture system that requires only two low cost calibrated webcams. The system falls under the optical motion capture category, as computer vision techniques are employed to establish the joint positions of a subject. As all motion capture systems involve a tracking phase, we adopt the mean-shift algorithm as the basis of object tracking. While our current system is constrained by several limitations, such as the inability to handle occlusion, it still demonstrates the fundamental idea of motion capture and provides input to animation applications such as Poser [2].

The outline of this paper is as follows. Section 2 gives a brief overview of the mean-shift algorithm. Section 3 describes the hardware components and the setup of our system. Experiments and results are reported in Section 4. Finally, conclusions and future work are given in Section 5.
2 Mean-shift: An overview

The mean-shift algorithm [3, 4, 5] is one of the tracking techniques commonly used in computer vision research when the motion of the object to be tracked cannot be described by a motion model, such as the one required by the Kalman filter [6]. The colour and texture information that characterizes the object can be grouped together to form the feature vector used in the tracking process. The algorithm requires only a small number of iterations to converge and can easily adapt to changes in the scale of the tracked object.

The key notion in the mean-shift algorithm is the definition of a multivariate density function with a kernel function K(x) over a region in the image:

f(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right),

where \{x_i, i = 1, \ldots, n\} is the set of points falling inside a window of radius h centred at x. There are a number of kernel functions that one can choose from; the commonly used ones are the Normal, Uniform, and Epanechnikov kernels. At each iteration, the algorithm produces the mean-shift vector that describes the movement of the region enclosing the tracked target. As the mean-shift vector is defined in terms of the negative gradient of the kernel profile, a kernel function with the simplest profile gradient is preferred. Amongst the commonly used kernel functions above, the Epanechnikov kernel, whose profile has a constant gradient, is preferable to the other two.

Comaniciu et al. [3, 4] formulate the target estimation problem as the derivation of the estimate that maximizes the Bayes error associated with the target model and target candidate distributions. This approach builds on the observation that the larger the probability of error, the more similar the two distributions are. Based on this observation, the Bhattacharyya coefficient [7] is used as the similarity measure between the two distributions.
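To make the update rule concrete, the following is a minimal sketch of one mean-shift iteration on a greyscale frame, with an Epanechnikov-weighted intensity histogram as the density, in the spirit of [3, 4]. It is an illustration under our own assumptions (uint8 images, a square window that stays inside the frame, and function names of our choosing), not the system's actual code.

    import numpy as np

    def epanechnikov_weights(radius):
        # Epanechnikov kernel on a square patch: 1 - r^2 inside the
        # normalized radius, zero outside.
        ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
        r2 = (xs ** 2 + ys ** 2) / float(radius ** 2)
        return np.where(r2 < 1.0, 1.0 - r2, 0.0)

    def weighted_histogram(patch, kernel, m=128):
        # Kernel-weighted intensity histogram: the density estimate
        # restricted to m intensity bins inside the window.
        bins = (patch.astype(np.float64) * m / 256.0).astype(int)
        hist = np.bincount(bins.ravel(), weights=kernel.ravel(), minlength=m)
        return hist / hist.sum()

    def mean_shift_step(frame, center, target_hist, radius=6, m=128):
        # One mean-shift update: move the window centre by the weighted
        # mean of pixel offsets. With the Epanechnikov profile the profile
        # gradient is constant, so the update reduces to a plain weighted
        # average (Comaniciu et al. [3, 4]).
        cy, cx = center
        patch = frame[cy - radius:cy + radius + 1, cx - radius:cx + radius + 1]
        kernel = epanechnikov_weights(radius)
        cand_hist = weighted_histogram(patch, kernel, m)
        bins = (patch.astype(np.float64) * m / 256.0).astype(int)
        # Bhattacharyya-derived per-pixel weights sqrt(q_u / p_u).
        w = np.sqrt(target_hist[bins] / np.maximum(cand_hist[bins], 1e-12))
        ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
        dy = (w * ys).sum() / w.sum()
        dx = (w * xs).sum() / w.sum()
        return (int(round(cy + dy)), int(round(cx + dx)))

The target histogram target_hist would be built once from the marker's window in the first frame, using the same weighted_histogram routine.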
3 System description

The setup of our motion capture system is intended to be low cost. The necessary pieces of equipment are two low-cost webcams, two tripods, and a calibration frame. The block diagram in Fig. 1 shows all the components of the system; each is described in detail below.

Figure 1: System block diagram.

The system uses two low cost webcams for motion capture. Each webcam, mounted on a tripod, must be calibrated prior to any experiments. The current version of our system focuses on capturing movements of the lower part of the body only. This requires a total of 9 white circular markers to be placed on the following joints (see Fig. 2): the hip (1), the two upper legs (2), the knees (2), the ankles (2), and the feet (2). To simplify the tracking process, we darken the background with a black curtain and instruct the subject to wear a dark, non-glossy, tight suit so that the white circular markers can be easily detected. This requirement is not considered a limitation of the system, as most movie editing systems require the background to be of a certain colour (often blue) for easy segmentation.

Figure 2: (a) Setup of the system. (b) A subject with white circular markers on the lower part of his body.

The two webcams are directly connected to a PC via two USB ports. This allows video images captured by the webcams to be immediately transferred to the PC for processing. We currently use functions from the Matlab Image Acquisition Toolbox for image acquisition; however, equivalent functions from other application software could be used as well.

3.1 Camera calibration

Camera calibration is the step that determines the 3 × 4 matrix mapping coordinates in the 3D world onto the 2D image. The matrix can be recovered linearly via a method commonly referred to as the DLT (Direct Linear Transform) [8, 9], using at least 6 known non-coplanar reference scene points and their corresponding image points. In our system, we use a calibration target with two orthogonal faces, each of which carries 6 reference points. The calibration target also implicitly defines a global coordinate system in the scene that can be referenced by other applications, such as Poser [2], for graphics rendering. Without moving the calibration target, each webcam was calibrated in turn. The calibration process produces two 3 × 4 matrices, one for each webcam.
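As a sketch of the linear recovery step, the DLT stacks two equations per reference point and solves the homogeneous system by SVD. This is a textbook formulation [8, 9] under our own naming, not necessarily the exact routine used in the system.

    import numpy as np

    def dlt_calibrate(world_pts, image_pts):
        # Estimate the 3x4 projection matrix P from at least 6 known
        # non-coplanar world points (X, Y, Z) and their image points (u, v).
        rows = []
        for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
            # Each correspondence gives two linear equations in the 12
            # entries of P, from u = (p1 . X)/(p3 . X) and v = (p2 . X)/(p3 . X).
            rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
            rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
        A = np.asarray(rows, dtype=np.float64)
        # The right singular vector of the smallest singular value is the
        # least-squares solution, up to scale.
        _, _, Vt = np.linalg.svd(A)
        return Vt[-1].reshape(3, 4)

Calibrating each webcam against the same, unmoved target yields the two matrices and fixes the common global coordinate system.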
3.2 Detection of markers

The nine markers are detected via a thresholding process, which involves choosing a threshold value t from the pixel intensity range 0 to 255. Given a threshold t, the system converts all intensity values within a grey scale image that are greater than t into 1, and all intensity values less than t into 0, producing a binary image. Since the scene has been much simplified for marker detection, we can inspect the intensity histogram to compute the threshold value automatically. As expected, the intensity histogram is bi-modal. Consecutive frequency values in the histogram are examined to determine the flat region that separates the two modes, and the threshold value is then estimated from this flat region.
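The paper does not spell out the rule used to locate the flat region, so the sketch below takes a common stand-in: smooth the histogram and pick its lowest point between the two modes. Splitting the intensity range at 128 to find the dark and bright peaks is our assumption for this simplified scene.

    import numpy as np

    def bimodal_threshold(gray):
        # Histogram of a uint8 greyscale frame.
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        # Light smoothing so single-bin dips are not mistaken for the valley.
        smooth = np.convolve(hist.astype(np.float64), np.ones(9) / 9.0, mode="same")
        dark_peak = int(np.argmax(smooth[:128]))           # background mode (assumed below 128)
        bright_peak = 128 + int(np.argmax(smooth[128:]))   # marker mode (assumed above 128)
        # Threshold: deepest point of the flat region between the two modes.
        return dark_peak + int(np.argmin(smooth[dark_peak:bright_peak + 1]))

    # t = bimodal_threshold(gray)
    # binary = (gray > t).astype(np.uint8)   # markers map to 1, background to 0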
3.3 Automatic labelling of markers

The 9 markers are automatically labelled using a heuristic method. At the start of each experiment, the subject must adopt the standing pose shown in Fig. 3. After all nine markers have been detected, the system labels the top middle marker as marker #1. In the initial standing pose there are four markers on each leg. Hence, the four markers positioned to the left of marker #1 are labelled markers #2 to #5, and the four markers to the right of marker #1 are labelled markers #6 to #9. The assignment of marker numbers depends on the y component of the marker coordinates: of the four markers on the left side, marker #2 is the one whose y value is smallest among the four. The same labelling rule is applied to the four markers on the right side of marker #1.

3.4 Mean-shift tracking and 3D reconstruction of markers

The mean-shift algorithm is employed to track the nine white markers independently. The system setup described above allows the tracking to be done on grey level images rather than colour images. The feature we use for tracking is therefore simply the pixel intensity values, and the density function is the intensity histogram inside the kernel window. Note that, as the Epanechnikov kernel function assigns higher weights to points near the centre of the kernel, the intensity histogram is computed with these weighting factors incorporated.

There are 3 free parameters that can be set to fine-tune the performance of the mean-shift algorithm:

1. The radius, h, of the kernel window.
2. The threshold value, ε, used for terminating the tracking iteration between consecutive images.
3. The number of histogram bins, 1 < m < 255, for storing the frequencies of pixel intensity values inside the kernel window.

We describe in the following section the values these parameters were set to in our experiments.

For the computation of the 3D coordinates of each marker, the two 3 × 4 matrices obtained above are combined to give 4 linear equations in the detected image coordinates of the marker in the two images. The 3D coordinates of each marker, relative to the implicit global coordinate system defined by the calibration frame, can then be estimated using least-squares.
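One standard way to set up this least-squares step is linear triangulation: each view contributes two of the four equations, and the SVD yields the homogeneous solution. The sketch below is our illustration of that setup, not necessarily the authors' exact formulation.

    import numpy as np

    def triangulate(P1, P2, uv1, uv2):
        # Recover the 3D marker position from its projections (u, v) in the
        # two views, given the two 3x4 calibration matrices P1 and P2.
        def two_rows(P, uv):
            u, v = uv
            # u * (row 3 of P) - (row 1 of P) = 0, and similarly for v.
            return [u * P[2] - P[0], v * P[2] - P[1]]
        A = np.asarray(two_rows(P1, uv1) + two_rows(P2, uv2))
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]                  # homogeneous solution, up to scale
        return X[:3] / X[3]         # dehomogenize to (X, Y, Z)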
4 Results

Many experiments have been conducted to test the tracking algorithm and the 3D reconstruction of the markers. We also evaluated the performance of the mean-shift algorithm using different values of the free parameters discussed in Section 3.4 above. In most of our experiments, we found that h = 6 ± 1 pixels, ε = 10⁻⁴, and m = 128 gave the best performance. The result of tracking and 3D reconstruction using these parameters in one of our experiments is presented in Fig. 3.

In every experiment, we tested our system by tracking the movement of the markers over 200 frames. Each webcam took an image in sequence, and mean-shift tracking of the markers was performed on it. It is not possible to synchronize the two webcams on-line in software: we use a single-processor computer, so the execution of instructions has to be interleaved, and the instructions for acquiring images from the two webcams cannot be executed simultaneously. We found that there is a 0.016 second delay between the acquisition of an image by the first webcam and by the second. The human subject can perform small movements from the initial standing position while the system tracks the markers' movements.
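To show how the three parameters interact, a hypothetical driver around the mean_shift_step sketch given after Section 2 could look as follows. Since the window centre there is kept at integer pixel positions, an ε below one pixel simply means "stop when the window no longer moves".

    import numpy as np

    def track_marker(frame, center, target_hist, h=6, eps=1e-4, max_iter=20):
        # Iterate mean-shift updates until the displacement falls below eps.
        for _ in range(max_iter):
            new_center = mean_shift_step(frame, center, target_hist, radius=h)
            shift = np.hypot(new_center[0] - center[0], new_center[1] - center[1])
            center = new_center
            if shift < eps:
                break
        return center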

Figure 3: (a) and (b): tracking results of the mean-shift algorithm on the 9 white markers using an h value of 6 pixels (sequences 1 and 2); (c) and (d): the result of 3D reconstruction (front and side views).

From our experiments, we found that the radius of the kernel window is a crucial parameter for the performance of the mean-shift algorithm. Indeed, it has also been reported in [5] that a window size that is too large can cause the tracker to be more easily distracted by background clutter, while a window size that is too small can cause the kernel to roam around on a likelihood plateau around the mode, leading to poor object localization. In Fig. 4, we show the result of tracking using an h value of 10 pixels in another experiment. We found that this h value is too large to be used as the radius, as it can sometimes encapsulate two markers within one kernel window. As shown in Fig. 4(a) and 4(b), a kernel window that is too large allows a white marker to drift slightly away from the centre of, yet still be enclosed within, the window.

Figure 4: (a) and (b): tracking results of the mean-shift algorithm on the 9 white markers using an h value of 10 pixels (sequences 1 and 2); (c) and (d): the result of 3D reconstruction (front and side views).

5 Conclusion and Future Work

We have presented a low cost motion capture system using two webcams. While the current version of our system only captures the movement of the lower part of the subject's body, it can be extended to include the upper part, allowing full body movements to be animated. The current labelling algorithm can also be modified to cater for initial poses other than the standing one. Furthermore, since the h value is an important parameter of the mean-shift algorithm and affects the overall performance of our motion capture system, instead of relying on human intervention to provide an initial h value, the system could be improved to determine this value automatically and to adapt to scale changes during tracking. The notion of low cost motion capture is important for demonstrating the fundamental idea of motion capture and for providing input to various advanced animation applications.

References

[1] Meta Motion, Motion Capture - What is it?, http://www.metamotion.com/motion-capture/motion-capture.htm, 2004.

[2] e-frontier America, Inc., Poser 6, http://www.efrontier.com/go/poser hpl, 2005.

[3] D. Comaniciu, V. Ramesh, and P. Meer, Real-Time Tracking of Non-Rigid Objects using Mean Shift, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 142–149, 2000.

[4] D. Comaniciu and P. Meer, Mean Shift Analysis and Applications, in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1197–1203, 1999.

[5] R. T. Collins, Mean-Shift Blob Tracking through Scale Space, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 18–20, 2003.

[6] G. Welch and G. Bishop, An Introduction to the Kalman Filter, Tech. Rep. 95-041, Department of Computer Science, University of North Carolina, Chapel Hill, 1995.

[7] T. Kailath, The Divergence and Bhattacharyya Distance Measures in Signal Selection, IEEE Trans. on Communications, vol. 15, pp. 52–60, Feb 1967.

[8] Y. I. Abdel-Aziz and H. M. Karara, Direct Linear Transformation from Comparator to Object Space Coordinates in Close-Range Photogrammetry, in ASP Symposium on Close-Range Photogrammetry (H. Karara, ed.), Urbana, Illinois, pp. 1–18, 1971.

[9] C. C. Slama, C. Theurer, and S. W. Henriksen, eds., Manual of Photogrammetry. Falls Church, Virginia, USA: American Society of Photogrammetry and Remote Sensing, 1980.