Hand Segmentation Using 2D/3D Images

S. E. Ghobadi, O. E. Loepprich, K. Hartmann, O. Loffeld, "Hand Segmentation Using 2D/3D Images", Proceedings of Image and Vision Computing New Zealand 2007, pp. 64-69, Hamilton, New Zealand, December 2007.

Hand Segmentation Using 2D/3D Images

S. E. Ghobadi, O. E. Loepprich, K. Hartmann, and O. Loffeld
Center for Sensor Systems (ZESS), University of Siegen, Paul-Bonatz-Str. 9-11, Siegen, Germany.
{Ghobadi, Loepprich, Hartmann, Loffeld}@zess.uni-siegen.de

Abstract

This paper describes a fast and robust segmentation technique based on the fusion of 2D/3D images for gesture recognition. These images are provided by a novel 3D Time-of-Flight (TOF) camera which has been implemented in our research center (ZESS). Using modulated infrared lighting, this camera generates an intensity image together with range information for each pixel of a Photonic Mixer Device (PMD) sensor. The intensity and range data are fused and used as the input to the segmentation algorithm. The proposed segmentation technique is based on the combination of two unsupervised clustering approaches, K-Means and Expectation Maximization (EM), which both attempt to find the centers of natural clusters in the fused data. The experimental results show that the proposed gesture segmentation technique can successfully segment the hand from the user's body, face, arm and other objects in the scene under varying illumination conditions in real time.

Keywords: Segmentation, 2D/3D Images, Gesture, K-Means, Expectation Maximization

1 Introduction

A fast and robust segmentation is the first challenge in vision-based gesture recognition for real-time man-machine interaction. Gesture segmentation approaches based on intensity information or color images (2D images) suffer from low efficiency and a lack of robustness against cluttered backgrounds and varying lighting conditions.
For this reason, in 2D gesture segmentation the problem is usually simplified by introducing assumptions or applying constraints either to the scene or to the user. These limitations include having the user wear special gloves [1], [2], [3], controlling the illumination in the scene, avoiding objects with skin color in the image, and using markers [4]. To overcome these difficulties and improve the robustness of the segmentation, much research has been done in recent years based on range images [1], [4], [5], [6], [7]. The range information can be provided by different kinds of cameras, such as a laser range camera [4], a stereo camera [5], a Coded Light Approach (CLA) camera [1] and a Time-of-Flight (TOF) camera [6], [7], [8]. No matter which kind of camera system is used, the performance of gesture segmentation for real-time applications relies on how fast and how robust it is. These requirements depend on the frame rate of the camera and the quality of the range images it delivers. Although the CLA and laser range cameras provide highly accurate range images, they suffer from a low acquisition rate. In [1] Caplier et al. used a CLA camera with a frame rate of 1 Hz, and Heisele and Ritter used a laser range camera with a frame rate of 7 Hz [4]. The stereo vision camera also has a low frame rate, due to the computational time needed to calculate the disparity map from the right and left images; in [5] Nefian et al. used a stereo camera with a frame rate of 11 Hz. Another open issue with stereo vision cameras is that stereo range images are heavily affected by texture and environmental illumination conditions. We have already discussed this problem in [8]. The 3D Time-of-Flight (TOF) sensors, on the other hand, are becoming very attractive in the man-machine interaction field [6], [7], [8] because they provide gray level images with reliable range data for each pixel.
While the range images of a TOF camera are independent of texture and lighting conditions, they are somewhat affected by the color of the object. This is because the range in a TOF camera is calculated from the phase difference between the modulated infrared light transmitted to the object and the infrared light received back from it. As different colors have different reflection factors, the range image is affected by the color of the object, i.e., two objects with different colors at the same distance might get different range data in a TOF image.

This paper on the one hand addresses the solution to this problem for robust gesture segmentation by fusing the intensity and range information into a 2D feature space, and on the other hand proposes a fast segmentation technique using our novel 3D TOF camera. We have used the combination of the two following unsupervised clustering techniques for segmentation:

- K-Means clustering
- Expectation Maximization

The paper continues as follows: Section 2 presents an overview of the camera system we have used. In Section 3 the data fusion is discussed. Section 4 introduces the clustering techniques used for the segmentation in this paper. Section 5 summarizes our experimental results, while Section 6 concludes this work.

2 3D Time-of-Flight Camera

Range imaging in a 3D Time-of-Flight camera is the fusion of a distance measurement technique with the imaging aspect. The principle of range measurement in a TOF camera is based on measuring the time the light needs to travel from one point to another. This time t, the so-called time of flight, is directly proportional to the distance d the light travels:

d = c · t    (1)

where c denotes the speed of light.

Our 3D non-scanning Time-of-Flight (TOF) camera system consists of an infrared lighting source, a Photonic Mixer Device (PMD) sensor [9], and an FPGA-based processing and communication unit. The lighting source illuminates the scene with a modulated infrared light signal, generated by a MOSFET-based driver and a bank of high-speed infrared emitting diodes at a frequency of 20 MHz. The illuminated scene is observed by an intelligent pixel array (PMD) through an optical lens for focusing, where each pixel on the PMD sensor samples the amount of modulated light and determines the turnaround time of the modulated light [8]. Typically this is done by using continuous modulation and measuring the phase delay in each pixel [10].
Assuming continuous sinusoidal or rectangular modulation, the distance is calculated as follows [10]:

d = c · φ / (4π · f_mod)    (2)

where f_mod denotes the modulation frequency and φ = 2π · f_mod · t represents the phase delay. At a modulation frequency of 20 MHz the unambiguous distance is equal to 15 meters, i.e., the maximum distance for the target is 7.5 meters, because the illumination has to cover the distance twice: from the sender to the target and back to the sensor chip. To calculate the phase delay, the autocorrelation function of the electrical and optical signal is analyzed by a phase-shift algorithm. Using four samples A1, A2, A3 and A4, each shifted by 90 degrees, the phase delay φ can be calculated using the following equation [8]:

φ = arctan((A1 − A3) / (A2 − A4))    (3)

In addition to the phase shift of the signal, the strength a of the received signal and the gray scale value b are given respectively by [8]:

a = sqrt((A1 − A3)² + (A2 − A4)²) / 2    (4)

b = (A1 + A2 + A3 + A4) / 4    (5)

The TOF camera, unlike the stereo vision camera, is texture independent, and since the range is calculated directly in each pixel unit with minimal processing, a very high frame rate, dependent on the exposure time, can be obtained. In our work we have used a PMD sensor with a resolution of 3K (64x48 pixels). The exposure time is set to 5 msec at a frequency of 20 MHz. Under these conditions, and using a USB 2.0 communication protocol, a frame rate of 50 images per second is obtained, which is suitable for real-time applications. The range accuracy of the camera under the mentioned conditions is about ±1 cm. An example of a TOF image, including the intensity and range images, is shown in Figure 1. The range image is color coded such that the pixels in the foreground, which represent objects closer to the camera, are brighter. These two images are fused to provide the input data for the segmentation algorithm.

Figure 1: An example of a TOF image (left: range image, right: intensity image).
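As a concrete illustration, Eqs. (2)-(5) can be reproduced in a few lines of Python. This is our own minimal numerical sketch, not the camera's on-chip implementation; the function name is hypothetical, and we use arctan2 rather than a plain arctan so that the quadrant of the phase is resolved:

```python
import numpy as np

C = 299_792_458.0   # speed of light [m/s]
F_MOD = 20e6        # modulation frequency [Hz], as in the paper

def pmd_pixel(a1, a2, a3, a4):
    """Recover distance, signal strength and gray value from the four
    correlation samples A1..A4 (each shifted by 90 degrees)."""
    phi = np.arctan2(a1 - a3, a2 - a4)              # Eq. (3)
    phi = np.mod(phi, 2 * np.pi)                    # phase delay in [0, 2*pi)
    a = np.sqrt((a1 - a3) ** 2 + (a2 - a4) ** 2) / 2  # Eq. (4): signal strength
    b = (a1 + a2 + a3 + a4) / 4                     # Eq. (5): gray value
    d = C * phi / (4 * np.pi * F_MOD)               # Eq. (2): at most 7.5 m at 20 MHz
    return d, a, b
```

For example, samples (1, 0, -1, 0) correspond to a phase delay of π/2, i.e., a quarter of the 7.5 m unambiguous target range.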

3 Range and Intensity Image Fusion

The TOF camera delivers three data items for each pixel at each time step: intensity, range and the amplitude of the received modulated light. The intensity image of the TOF camera, comparable to the intensity images of CCD or CMOS cameras, depends on the environmental lighting conditions, whereas the range image and the amplitude of the received modulated light are mutually dependent: they both depend on the reflection factor of the object, i.e., the constitution of the object. None of these individual data items can be used alone to achieve a robust segmentation under varying lighting and color conditions. Fusing these data provides new feature information which is used to improve the performance of the segmentation technique. In this paper we have used a basic technique for fusing the range and intensity data which has already been used in other fields, such as SAR imaging. The range data in a TOF image depend on the reflection factor of the object surface (how much light is reflected back from the object). Therefore, there is a correlation between the intensity and range vector sets in a TOF image. These two vector sets are fused to derive a new data set, the so-called phase, which indicates the angle between the intensity and range vector sets and is derived as follows. First, from the intensity and range data of each image a new set of complex numbers C is derived:

C_rc = g_rc + i · d_rc    (6)

where g_rc corresponds to the normalized gray value and d_rc represents the normalized range information for the pixel in row r and column c of the intensity and range images respectively. Next, the phase φ_rc of each complex number is calculated in the polar coordinate system for the whole array of pixels:

φ_rc = arctan(d_rc / g_rc)    (7)

The phase values and the range data are then combined into a 2D feature space in which each pixel is described by a feature vector f_rc containing range and phase information.
f_rc = (d_rc, φ_rc)    (8)

Here, the range denotes the position of the object in the Z direction of the world coordinate system, which is aligned with the optical axis. The 2D information in the XY coordinate system is neglected, because in our gesture segmentation problem the pixels with similar XY information do not necessarily belong to the same physical object.

4 Segmentation

Segmentation is the first step of image processing in computer vision applications such as gesture recognition. It is the process of distinguishing the object of interest from the background as well as from the surrounding non-interesting objects. In other words, image segmentation aims at a better recognition of objects by grouping the image pixels or finding the boundaries between the objects in the image. Gesture segmentation in this paper is treated as a clustering problem. Clustering is an unsupervised learning technique which identifies groups of unlabeled data based on some similarity measure. Each group of unlabeled data, the so-called cluster, corresponds to an image region, while each data point is a feature vector which represents a pixel of the image. Two techniques have been combined and employed for clustering in this paper; they are discussed in the next sections.

4.1 K-Means Technique

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem by partitioning the data set {x_1, x_2, ..., x_N} into some number K of clusters. For each data point x_n, a binary membership function is defined as:

r_nk = 1 if x_n is assigned to cluster k, and 0 otherwise.

K-means aims at minimizing the objective function given by [12]:

J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ||x_n − μ_k||²    (9)

where ||x_n − μ_k|| is the distance between the data point x_n and the cluster center μ_k. In fact, the goal is to find the values of the {r_nk} and the {μ_k} that minimize J.
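To make the fusion step of Section 3 concrete, the following Python sketch builds the per-pixel feature vectors of Eqs. (6)-(8) from a gray image and a range image. The function name and the min-max normalization are our own assumptions; the paper only states that both channels are normalized:

```python
import numpy as np

def fuse_features(gray, depth):
    """Build the 2D feature vectors (d_rc, phi_rc) of Eqs. (6)-(8).

    gray and depth are 2D arrays of the same shape; both are
    normalized to [0, 1] before fusing."""
    g = (gray - gray.min()) / (gray.max() - gray.min() + 1e-12)
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-12)
    # phi is the angle of the complex number g + i*d (Eq. (7));
    # arctan2 handles pixels with g = 0 gracefully
    phi = np.arctan2(d, g)
    # one feature vector per pixel, flattened to an (N, 2) array
    return np.stack([d.ravel(), phi.ravel()], axis=1)
```

The resulting (N, 2) array is exactly the shape a clustering routine expects: one row per pixel, one column per feature.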
This is done through an iterative procedure in which each iteration involves two successive steps, corresponding to successive optimizations with respect to the r_nk and the μ_k [12]. The main advantages of this algorithm are its simplicity and speed. The computational cost of K-means is O(KN), which allows it to run on large data sets. However, K-means is a data-dependent algorithm: although it can be proved that the procedure always terminates, it does not necessarily achieve a global minimum. Since K-means is a distance-based, hard-membership algorithm, every data point, at each iteration, is assigned uniquely to one, and only one, of the clusters. For data points which lie roughly midway between cluster centers, the hard assignment to the nearest cluster might not be the most appropriate one. By adopting a probabilistic approach, like Expectation Maximization (EM), a soft assignment of data points can be obtained.

4.2 Expectation Maximization Technique

Expectation Maximization (EM) is a powerful method for finding the maximum likelihood solution for models with latent variables. This approach can be used for image segmentation, where each segment (cluster) is mathematically represented by a parametric Gaussian distribution; the entire data set is then modeled by a mixture of these distributions. EM consists of iterating two steps, the E-step and the M-step. After initialization, these two steps are repeated until the algorithm converges to a maximum likelihood estimate. The implementation of EM is as follows [12]:

Initialization: The parameters to be learned are initialized. These consist of the means μ_k, the covariances Σ_k and the mixing coefficients π_k.

Expectation: In the expectation step the expected value E(z_nk) for each data point x_n is calculated. It is the probability that x_n was generated by the k-th distribution:

E(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j)    (10)

Maximization: Once the expected values E(z_nk) have been calculated, the parameters are re-estimated:

μ_k^new = (1 / N_k) Σ_{n=1}^{N} E(z_nk) x_n    (11)

Σ_k^new = (1 / N_k) Σ_{n=1}^{N} E(z_nk) (x_n − μ_k^new)(x_n − μ_k^new)^T    (12)

π_k^new = N_k / N    (13)

where

N_k = Σ_{n=1}^{N} E(z_nk)    (14)

Evaluation: In this step the log likelihood is evaluated and the convergence of the parameters or of the log likelihood is checked.
If the convergence criterion is not satisfied, the algorithm returns to the expectation step.

4.3 K-Means Expectation Maximization (KEM) Technique

As already mentioned, the K-means algorithm has a hard membership function, so a small shift of a data point can flip it to a different cluster. The solution to this problem is to replace the hard clustering of K-means with soft probabilistic assignments [12]. In our paper this is done by the EM algorithm, because EM has no strict boundaries between clusters and a data point is assigned to each cluster with a certain probability. However, techniques such as EM might yield poor clusters if the parameters are not initialized properly. To solve this problem we have proposed a technique which combines K-means with EM, called KEM. This technique is similar to that presented in [13]. It employs K-means as the initial clustering step to find the initial cluster centers. This reduces the sensitivity to the initial points and yields centers which are widely spread within the data. These centers are used as the initial parameters for EM, which then iterates to find the local maximum.

5 Experiments and Results

All experiments have been done in real time. The range and intensity images are taken directly in each snapshot of a TOF camera based on a PMD sensor. The resolution of the camera we have used is 3K (64x48 pixels). The modulation frequency and the exposure time have been set to 20 MHz and 5 msec respectively. Under these conditions the frame rate of the camera is about 50 images per second, including the intensity and range images. Each image is segmented using the K-Means Expectation Maximization (KEM) technique discussed in the last section. The frame rate of the segmented images is above video frame rate, which is suitable for real-time gesture tracking and recognition. We evaluate our segmentation technique for the three following cases: the gesture posed in the foreground of a simple scene; the gesture posed in the foreground of a cluttered and complex scene; and a sequence of the gesture moving from the foreground to the background.
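The KEM procedure of Section 4.3 can be sketched compactly in NumPy. This is an illustrative reimplementation under our own naming (kmeans and kem are hypothetical helpers), not the authors' real-time code; a small regularization term is added to the covariances for numerical stability:

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Hard-assignment K-means minimizing Eq. (9); returns the cluster centers."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center (the r_nk of Eq. (9))
        labels = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            pts = X[labels == k]
            if len(pts):
                mu[k] = pts.mean(axis=0)
    return mu

def kem(X, K, em_iters=30, seed=0):
    """KEM: K-means centers seed an EM fit of a Gaussian mixture
    (Eqs. (10)-(14)); returns hard labels and the final means."""
    N, D = X.shape
    mu = kmeans(X, K, seed=seed)                  # K-means initialization
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(em_iters):
        # E-step: responsibilities E(z_nk), Eq. (10)
        r = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(cov[k])
            norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(cov[k]))
            r[:, k] = pi[k] * np.exp(-0.5 * (diff @ inv * diff).sum(-1)) / norm
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters, Eqs. (11)-(14)
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return r.argmax(axis=1), mu
```

For the segmentation task, X would be the (N, 2) array of fused (range, phase) feature vectors and K the number of clusters (six in the experiments reported below).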
Figure 2 depicts some images of the first case, in which the gesture is posed in the foreground of a simple scene. In these images the gesture is posed at a distance of over 10 cm from the background, torso or face. The first row in Figure 2 shows the intensity images. The second row shows the color-coded range images, in which the pixels of the background are darker than the pixels of the gesture in the foreground. The third row shows the results of segmentation using the KEM technique with six clusters. Images 1 to 3 show the gesture against a plain background, while images 4 to 8 show the gesture in scenes where the user's body, face or arm are observed as well. As can be seen from the segmented images, the pixels belonging to the gesture have been grouped into one cluster very well, without any error. In this case, since the distance of the gesture from the background, torso or face (over 10 cm) exceeds the statistical noise of the TOF range images (about 2 cm), the range information alone, without fusion with the intensity data, can be used as the single feature for the segmentation algorithm, and it yields the same results as when the fusion of range and phase is employed as the input feature vector.

Figure 2: Results of gesture segmentation in a simple scene. (First row: intensity images, second row: range images, third row: segmented images.)

In the second case, the gesture is posed in the foreground of a cluttered and complex scene; some images of this case are shown in Figure 3. Here the lighting conditions as well as the colors of the objects affect the TOF images and make the problem more complicated. In this case we have segmented the images once using only the range data as a single feature, and once using the fusion of the range and phase data. The first and second rows of Figure 3 show the intensity and range images respectively. The third row shows the results of segmentation using only the range data, while the last row shows the segmented images using the fused data. As can be seen from the results, the segmentation using only range information contains too many errors, and the pixels belonging to the gesture are not well isolated from the pixels of the other objects. In images 1 and 2 the range data are affected by color: black surfaces (the shirt in image 1 and the hair in image 2) reflect little infrared light, so the range data take wrong values for these objects. This is one of the problems of the TOF camera which we have already discussed in this paper. In range images 4 and 6, the face is not illuminated well by the lighting system, so we get some errors in the range data. The range data in images 3 and 5 are also erroneous, because the distance of the gesture to the torso, face, arm or other objects in these images is smaller than the statistical error of the TOF image (about 2 cm). These images show that the TOF range images alone cannot provide a robust gesture segmentation under these conditions. Fusing the range data with the intensity images, as proposed in our paper, solves this problem: as the segmentation results based on the fused data in the last row of Figure 3 show, the pixels belonging to the gesture have been grouped into one cluster and the gesture is segmented very well from the face, torso and other objects in the complex scene.

Figure 3: Results of gesture segmentation in a complex scene. (First row: intensity images, second row: range images, third row: segmented images using the range feature, fourth row: segmented images using the fusion of range and phase features.)

Figure 4: Results of gesture segmentation in a sequence of movement from the foreground to the background. (First row: intensity images, second row: range images, third row: segmented images.)

In the third case, Figure 4 shows a sequence of the gesture moving from the foreground to the background in steps of 15 cm. As in the previous figures, the first and second rows show the intensity and range images respectively, while the third row shows the segmented images. The hand gesture is segmented from the user's body, face and arm very well throughout the sequence, except for the image in which the hand gesture and the face are posed at the same distance from the camera and both have the same intensity values. This is the case in which the segmentation fails. The problem can be solved using our novel 2D/3D multi-camera [14], which employs two sensors to provide a high-resolution color image together with the range information. In our ongoing project these images are used for gesture recognition.

6 Conclusion

This paper describes a fast and robust gesture segmentation technique based on the fusion of the range and intensity images of a TOF camera. This camera provides these images at a frame rate of 50 images per second, which is suitable for real-time gesture tracking and recognition applications. Two unsupervised clustering techniques, K-means and Expectation Maximization, are combined to segment the images. The results show that the proposed technique is on the one hand fast enough to deliver the segmented images at video frame rate, and on the other hand robust enough to be used as the first processing step in 2D/3D gesture tracking and gesture recognition. We have also shown that in the case where the gesture gets the same range and intensity information as the face, the segmentation fails. The solution to this problem is to use our novel 2D/3D camera, which delivers a high-resolution color image together with the range information.

7 Acknowledgments

This research has been supported by the DFG "Dynamisches 3D Sehen - Multicam" project and the DAAD IPP program at the Center for Sensor Systems (ZESS) in Germany.

References

[1] A. Caplier, L. Bonnaud, S. Malassiotis, and M. G.
Strintzis, "Comparison of 2D and 3D analysis for automated cued speech gesture recognition," in 9th International Conference on Speech and Computer, SPECOM 2004, 2004.

[2] C. Keskin, O. Aran, and L. Akarun, "Real time gestural interface for generic applications," in European Signal Processing Conference, EUSIPCO 2005, 2005.

[3] T. Burger, A. Caplier, and S. Mancini, "Cued speech hand gestures recognition tool," in European Signal Processing Conference, EUSIPCO 2005, 2005.

[4] B. Heisele and W. Ritter, "Segmentation of range and intensity image sequences by clustering," in IEEE International Conference on Information Intelligence and Systems, 1999.

[5] A. V. Nefian, R. Grzeszczuk, and V. Eruhimov, "A statistical upper body model for 3D static and dynamic gesture recognition from stereo sequences," in IEEE International Conference on Image Processing, 2001.

[6] B. S. Goektuerk and C. Tomasi, "3D head tracking based on recognition and interpolation using a time-of-flight depth sensor," in IEEE Conference on Computer Vision and Pattern Recognition, 2004.

[7] Z. Mo and U. Neumann, "Real-time hand pose recognition using low-resolution depth images," in IEEE Conference on Computer Vision and Pattern Recognition, 2006.

[8] S. Ghobadi, K. Hartmann, W. Weihs, C. Netramai, O. Loffeld, and H. Roth, "Detection and classification of moving objects - stereo or time-of-flight images," in Computational Intelligence and Security, IEEE, 2006.

[9] PMD, "PhotonICs PMD 3k-S: 3D video sensor array with active SBI."

[10] T. Moeller, H. Kraft, and J. Frey, "Robust 3D measurement with PMD sensors," PMD Technologies GmbH.

[11] S. Gokturk, H. Yalcin, and C. Bamji, "A time of flight depth sensor - system description, issues and solutions," in IEEE Workshop on Real-Time 3D Sensors, 2004.

[12] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[13] S. Nasser, R. Alkhaldi, and G. Vert, "A modified fuzzy k-means clustering using expectation maximization," in IEEE World Congress on Computational Intelligence, 2006.

[14] T. Prasad, K. Hartmann, W. Weihs, S. E. Ghobadi, and A. Sluiter, "First steps in enhancing 3D vision technique using 2D/3D sensors," in Computer Vision Winter Workshop, 2006.


Usability study of 3D Time-of-Flight cameras for automatic plant phenotyping 93 Usability study of 3D Time-of-Flight cameras for automatic plant phenotyping Ralph Klose, Jaime Penlington, Arno Ruckelshausen University of Applied Sciences Osnabrück/ Faculty of Engineering and Computer

More information

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Anil K Goswami 1, Swati Sharma 2, Praveen Kumar 3 1 DRDO, New Delhi, India 2 PDM College of Engineering for

More information

Complex Sensors: Cameras, Visual Sensing. The Robotics Primer (Ch. 9) ECE 497: Introduction to Mobile Robotics -Visual Sensors

Complex Sensors: Cameras, Visual Sensing. The Robotics Primer (Ch. 9) ECE 497: Introduction to Mobile Robotics -Visual Sensors Complex Sensors: Cameras, Visual Sensing The Robotics Primer (Ch. 9) Bring your laptop and robot everyday DO NOT unplug the network cables from the desktop computers or the walls Tuesday s Quiz is on Visual

More information

Lecture 19: Depth Cameras. Visual Computing Systems CMU , Fall 2013

Lecture 19: Depth Cameras. Visual Computing Systems CMU , Fall 2013 Lecture 19: Depth Cameras Visual Computing Systems Continuing theme: computational photography Cameras capture light, then extensive processing produces the desired image Today: - Capturing scene depth

More information

Estimating the wavelength composition of scene illumination from image data is an

Estimating the wavelength composition of scene illumination from image data is an Chapter 3 The Principle and Improvement for AWB in DSC 3.1 Introduction Estimating the wavelength composition of scene illumination from image data is an important topics in color engineering. Solutions

More information

Depth Camera for Mobile Devices

Depth Camera for Mobile Devices Depth Camera for Mobile Devices Instructor - Simon Lucey 16-423 - Designing Computer Vision Apps Today Stereo Cameras Structured Light Cameras Time of Flight (ToF) Camera Inferring 3D Points Given we have

More information

Depth Sensors Kinect V2 A. Fornaser

Depth Sensors Kinect V2 A. Fornaser Depth Sensors Kinect V2 A. Fornaser alberto.fornaser@unitn.it Vision Depth data It is not a 3D data, It is a map of distances Not a 3D, not a 2D it is a 2.5D or Perspective 3D Complete 3D - Tomography

More information

PMD [vision] Day Vol. 3 Munich, November 18, PMD Cameras for Automotive & Outdoor Applications. ifm electronic gmbh, V.Frey. Dr.

PMD [vision] Day Vol. 3 Munich, November 18, PMD Cameras for Automotive & Outdoor Applications. ifm electronic gmbh, V.Frey. Dr. R PMD [vision] Day Vol. 3 Munich, November 18, 2010 Dr. Volker Frey ifm electronic gmbh PMD Cameras for Automotive & Outdoor Applications Stand: 27.10.2010 Seite 1 I Working Principle PMD distance measurement

More information

Modern Medical Image Analysis 8DC00 Exam

Modern Medical Image Analysis 8DC00 Exam Parts of answers are inside square brackets [... ]. These parts are optional. Answers can be written in Dutch or in English, as you prefer. You can use drawings and diagrams to support your textual answers.

More information

Accurate Image Registration from Local Phase Information

Accurate Image Registration from Local Phase Information Accurate Image Registration from Local Phase Information Himanshu Arora, Anoop M. Namboodiri, and C.V. Jawahar Center for Visual Information Technology, IIIT, Hyderabad, India { himanshu@research., anoop@,

More information

IBL and clustering. Relationship of IBL with CBR

IBL and clustering. Relationship of IBL with CBR IBL and clustering Distance based methods IBL and knn Clustering Distance based and hierarchical Probability-based Expectation Maximization (EM) Relationship of IBL with CBR + uses previously processed

More information

ROBUST OBJECT TRACKING BY SIMULTANEOUS GENERATION OF AN OBJECT MODEL

ROBUST OBJECT TRACKING BY SIMULTANEOUS GENERATION OF AN OBJECT MODEL ROBUST OBJECT TRACKING BY SIMULTANEOUS GENERATION OF AN OBJECT MODEL Maria Sagrebin, Daniel Caparròs Lorca, Daniel Stroh, Josef Pauli Fakultät für Ingenieurwissenschaften Abteilung für Informatik und Angewandte

More information

Laser sensors. Transmitter. Receiver. Basilio Bona ROBOTICA 03CFIOR

Laser sensors. Transmitter. Receiver. Basilio Bona ROBOTICA 03CFIOR Mobile & Service Robotics Sensors for Robotics 3 Laser sensors Rays are transmitted and received coaxially The target is illuminated by collimated rays The receiver measures the time of flight (back and

More information

Short Survey on Static Hand Gesture Recognition

Short Survey on Static Hand Gesture Recognition Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

An indirect tire identification method based on a two-layered fuzzy scheme

An indirect tire identification method based on a two-layered fuzzy scheme Journal of Intelligent & Fuzzy Systems 29 (2015) 2795 2800 DOI:10.3233/IFS-151984 IOS Press 2795 An indirect tire identification method based on a two-layered fuzzy scheme Dailin Zhang, Dengming Zhang,

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

3D Computer Vision. Structured Light I. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light I. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light I Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering Digital Image Processing Prof. P.K. Biswas Department of Electronics & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Image Segmentation - III Lecture - 31 Hello, welcome

More information

Selection of Scale-Invariant Parts for Object Class Recognition

Selection of Scale-Invariant Parts for Object Class Recognition Selection of Scale-Invariant Parts for Object Class Recognition Gy. Dorkó and C. Schmid INRIA Rhône-Alpes, GRAVIR-CNRS 655, av. de l Europe, 3833 Montbonnot, France fdorko,schmidg@inrialpes.fr Abstract

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

Range Sensors (time of flight) (1)

Range Sensors (time of flight) (1) Range Sensors (time of flight) (1) Large range distance measurement -> called range sensors Range information: key element for localization and environment modeling Ultrasonic sensors, infra-red sensors

More information

Multiple Model Estimation : The EM Algorithm & Applications

Multiple Model Estimation : The EM Algorithm & Applications Multiple Model Estimation : The EM Algorithm & Applications Princeton University COS 429 Lecture Nov. 13, 2007 Harpreet S. Sawhney hsawhney@sarnoff.com Recapitulation Problem of motion estimation Parametric

More information

Pedestrian counting in video sequences using optical flow clustering

Pedestrian counting in video sequences using optical flow clustering Pedestrian counting in video sequences using optical flow clustering SHIZUKA FUJISAWA, GO HASEGAWA, YOSHIAKI TANIGUCHI, HIROTAKA NAKANO Graduate School of Information Science and Technology Osaka University

More information

Introduction to Trajectory Clustering. By YONGLI ZHANG

Introduction to Trajectory Clustering. By YONGLI ZHANG Introduction to Trajectory Clustering By YONGLI ZHANG Outline 1. Problem Definition 2. Clustering Methods for Trajectory data 3. Model-based Trajectory Clustering 4. Applications 5. Conclusions 1 Problem

More information

Problem definition Image acquisition Image segmentation Connected component analysis. Machine vision systems - 1

Problem definition Image acquisition Image segmentation Connected component analysis. Machine vision systems - 1 Machine vision systems Problem definition Image acquisition Image segmentation Connected component analysis Machine vision systems - 1 Problem definition Design a vision system to see a flat world Page

More information

Interpolation is a basic tool used extensively in tasks such as zooming, shrinking, rotating, and geometric corrections.

Interpolation is a basic tool used extensively in tasks such as zooming, shrinking, rotating, and geometric corrections. Image Interpolation 48 Interpolation is a basic tool used extensively in tasks such as zooming, shrinking, rotating, and geometric corrections. Fundamentally, interpolation is the process of using known

More information

Motion in 2D image sequences

Motion in 2D image sequences Motion in 2D image sequences Definitely used in human vision Object detection and tracking Navigation and obstacle avoidance Analysis of actions or activities Segmentation and understanding of video sequences

More information

Evaluation of Moving Object Tracking Techniques for Video Surveillance Applications

Evaluation of Moving Object Tracking Techniques for Video Surveillance Applications International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Evaluation

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Robert Collins CSE598G. Intro to Template Matching and the Lucas-Kanade Method

Robert Collins CSE598G. Intro to Template Matching and the Lucas-Kanade Method Intro to Template Matching and the Lucas-Kanade Method Appearance-Based Tracking current frame + previous location likelihood over object location current location appearance model (e.g. image template,

More information

Topics for thesis. Automatic Speech-based Emotion Recognition

Topics for thesis. Automatic Speech-based Emotion Recognition Topics for thesis Bachelor: Automatic Speech-based Emotion Recognition Emotion recognition is an important part of Human-Computer Interaction (HCI). It has various applications in industrial and commercial

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-

More information

Sensor technology for mobile robots

Sensor technology for mobile robots Laser application, vision application, sonar application and sensor fusion (6wasserf@informatik.uni-hamburg.de) Outline Introduction Mobile robots perception Definitions Sensor classification Sensor Performance

More information

DTU M.SC. - COURSE EXAM Revised Edition

DTU M.SC. - COURSE EXAM Revised Edition Written test, 16 th of December 1999. Course name : 04250 - Digital Image Analysis Aids allowed : All usual aids Weighting : All questions are equally weighed. Name :...................................................

More information

Expectation Maximization (EM) and Gaussian Mixture Models

Expectation Maximization (EM) and Gaussian Mixture Models Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation

More information

Cellular Learning Automata-Based Color Image Segmentation using Adaptive Chains

Cellular Learning Automata-Based Color Image Segmentation using Adaptive Chains Cellular Learning Automata-Based Color Image Segmentation using Adaptive Chains Ahmad Ali Abin, Mehran Fotouhi, Shohreh Kasaei, Senior Member, IEEE Sharif University of Technology, Tehran, Iran abin@ce.sharif.edu,

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Real-Time Human Detection using Relational Depth Similarity Features

Real-Time Human Detection using Relational Depth Similarity Features Real-Time Human Detection using Relational Depth Similarity Features Sho Ikemura, Hironobu Fujiyoshi Dept. of Computer Science, Chubu University. Matsumoto 1200, Kasugai, Aichi, 487-8501 Japan. si@vision.cs.chubu.ac.jp,

More information

Monocular Human Motion Capture with a Mixture of Regressors. Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France

Monocular Human Motion Capture with a Mixture of Regressors. Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France Monocular Human Motion Capture with a Mixture of Regressors Ankur Agarwal and Bill Triggs GRAVIR-INRIA-CNRS, Grenoble, France IEEE Workshop on Vision for Human-Computer Interaction, 21 June 2005 Visual

More information

Mixture Models and EM

Mixture Models and EM Table of Content Chapter 9 Mixture Models and EM -means Clustering Gaussian Mixture Models (GMM) Expectation Maximiation (EM) for Mixture Parameter Estimation Introduction Mixture models allows Complex

More information

MOVING OBJECT DETECTION USING BACKGROUND SUBTRACTION ALGORITHM USING SIMULINK

MOVING OBJECT DETECTION USING BACKGROUND SUBTRACTION ALGORITHM USING SIMULINK MOVING OBJECT DETECTION USING BACKGROUND SUBTRACTION ALGORITHM USING SIMULINK Mahamuni P. D 1, R. P. Patil 2, H.S. Thakar 3 1 PG Student, E & TC Department, SKNCOE, Vadgaon Bk, Pune, India 2 Asst. Professor,

More information

Lecture 10: Semantic Segmentation and Clustering

Lecture 10: Semantic Segmentation and Clustering Lecture 10: Semantic Segmentation and Clustering Vineet Kosaraju, Davy Ragland, Adrien Truong, Effie Nehoran, Maneekwan Toyungyernsub Department of Computer Science Stanford University Stanford, CA 94305

More information

Unsupervised Human Members Tracking Based on an Silhouette Detection and Analysis Scheme

Unsupervised Human Members Tracking Based on an Silhouette Detection and Analysis Scheme Unsupervised Human Members Tracking Based on an Silhouette Detection and Analysis Scheme Costas Panagiotakis and Anastasios Doulamis Abstract In this paper, an unsupervised, automatic video human members(human

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

Computer Vision 6 Segmentation by Fitting

Computer Vision 6 Segmentation by Fitting Computer Vision 6 Segmentation by Fitting MAP-I Doctoral Programme Miguel Tavares Coimbra Outline The Hough Transform Fitting Lines Fitting Curves Fitting as a Probabilistic Inference Problem Acknowledgements:

More information

An embedded system of Face Recognition based on ARM and HMM

An embedded system of Face Recognition based on ARM and HMM An embedded system of Face Recognition based on ARM and HMM Yanbin Sun 1,2, Lun Xie 1, Zhiliang Wang 1,Yi An 2 1 Department of Electronic Information Engineering, School of Information Engineering, University

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Fitting D.A. Forsyth, CS 543

Fitting D.A. Forsyth, CS 543 Fitting D.A. Forsyth, CS 543 Fitting Choose a parametric object/some objects to represent a set of tokens Most interesting case is when criterion is not local can t tell whether a set of points lies on

More information

Robust Pose Estimation using the SwissRanger SR-3000 Camera

Robust Pose Estimation using the SwissRanger SR-3000 Camera Robust Pose Estimation using the SwissRanger SR- Camera Sigurjón Árni Guðmundsson, Rasmus Larsen and Bjarne K. Ersbøll Technical University of Denmark, Informatics and Mathematical Modelling. Building,

More information

A Comparison between Active and Passive 3D Vision Sensors: BumblebeeXB3 and Microsoft Kinect

A Comparison between Active and Passive 3D Vision Sensors: BumblebeeXB3 and Microsoft Kinect A Comparison between Active and Passive 3D Vision Sensors: BumblebeeXB3 and Microsoft Kinect Diana Beltran and Luis Basañez Technical University of Catalonia, Barcelona, Spain {diana.beltran,luis.basanez}@upc.edu

More information

L16. Scan Matching and Image Formation

L16. Scan Matching and Image Formation EECS568 Mobile Robotics: Methods and Principles Prof. Edwin Olson L16. Scan Matching and Image Formation Scan Matching Before After 2 Scan Matching Before After 2 Map matching has to be fast 14 robots

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Inline Computational Imaging: Single Sensor Technology for Simultaneous 2D/3D High Definition Inline Inspection

Inline Computational Imaging: Single Sensor Technology for Simultaneous 2D/3D High Definition Inline Inspection Inline Computational Imaging: Single Sensor Technology for Simultaneous 2D/3D High Definition Inline Inspection Svorad Štolc et al. svorad.stolc@ait.ac.at AIT Austrian Institute of Technology GmbH Center

More information

Operators-Based on Second Derivative double derivative Laplacian operator Laplacian Operator Laplacian Of Gaussian (LOG) Operator LOG

Operators-Based on Second Derivative double derivative Laplacian operator Laplacian Operator Laplacian Of Gaussian (LOG) Operator LOG Operators-Based on Second Derivative The principle of edge detection based on double derivative is to detect only those points as edge points which possess local maxima in the gradient values. Laplacian

More information

Norbert Schuff VA Medical Center and UCSF

Norbert Schuff VA Medical Center and UCSF Norbert Schuff Medical Center and UCSF Norbert.schuff@ucsf.edu Medical Imaging Informatics N.Schuff Course # 170.03 Slide 1/67 Objective Learn the principle segmentation techniques Understand the role

More information

The main problem of photogrammetry

The main problem of photogrammetry Structured Light Structured Light The main problem of photogrammetry to recover shape from multiple views of a scene, we need to find correspondences between the images the matching/correspondence problem

More information

IMAGE SEGMENTATION. Václav Hlaváč

IMAGE SEGMENTATION. Václav Hlaváč IMAGE SEGMENTATION Václav Hlaváč Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception http://cmp.felk.cvut.cz/ hlavac, hlavac@fel.cvut.cz

More information

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi hrazvi@stanford.edu 1 Introduction: We present a method for discovering visual hierarchy in a set of images. Automatically grouping

More information

Photoneo's brand new PhoXi 3D Camera is the highest resolution and highest accuracy area based 3D

Photoneo's brand new PhoXi 3D Camera is the highest resolution and highest accuracy area based 3D Company: Photoneo s.r.o. Germany Contact: Veronika Pulisova E-mail: pulisova@photoneo.com PhoXi 3D Camera Author: Tomas Kovacovsky & Jan Zizka Description of the innovation: General description Photoneo's

More information

Computer Vision I. Dense Stereo Correspondences. Anita Sellent 1/15/16

Computer Vision I. Dense Stereo Correspondences. Anita Sellent 1/15/16 Computer Vision I Dense Stereo Correspondences Anita Sellent Stereo Two Cameras Overlapping field of view Known transformation between cameras From disparity compute depth [ Bradski, Kaehler: Learning

More information

Automatic Detection of Texture Defects using Texture-Periodicity and Gabor Wavelets

Automatic Detection of Texture Defects using Texture-Periodicity and Gabor Wavelets Abstract Automatic Detection of Texture Defects using Texture-Periodicity and Gabor Wavelets V Asha 1, N U Bhajantri, and P Nagabhushan 3 1 New Horizon College of Engineering, Bangalore, Karnataka, India

More information

Supplemental Material: Detailed, accurate, human shape estimation from clothed 3D scan sequences

Supplemental Material: Detailed, accurate, human shape estimation from clothed 3D scan sequences Supplemental Material: Detailed, accurate, human shape estimation from clothed 3D scan sequences Chao Zhang 1,2, Sergi Pujades 1, Michael Black 1, and Gerard Pons-Moll 1 1 MPI for Intelligent Systems,

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

A Feature Point Matching Based Approach for Video Objects Segmentation

A Feature Point Matching Based Approach for Video Objects Segmentation A Feature Point Matching Based Approach for Video Objects Segmentation Yan Zhang, Zhong Zhou, Wei Wu State Key Laboratory of Virtual Reality Technology and Systems, Beijing, P.R. China School of Computer

More information

DEVELOPMENT OF A PROBABILISTIC SENSOR MODEL FOR A 3D IMAGING SYSTEM

DEVELOPMENT OF A PROBABILISTIC SENSOR MODEL FOR A 3D IMAGING SYSTEM 24th International Symposium on on Automation & Robotics in in Construction (ISARC 2007) Construction Automation Group, I.I.T. Madras DEVELOPMENT OF A PROBABILISTIC SENSOR MODEL FOR A 3D IMAGING SYSTEM

More information

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim,

More information

Lecture 11: E-M and MeanShift. CAP 5415 Fall 2007

Lecture 11: E-M and MeanShift. CAP 5415 Fall 2007 Lecture 11: E-M and MeanShift CAP 5415 Fall 2007 Review on Segmentation by Clustering Each Pixel Data Vector Example (From Comanciu and Meer) Review of k-means Let's find three clusters in this data These

More information

Online Improvement of Time-of-Flight Camera Accuracy by Automatic Integration Time Adaption

Online Improvement of Time-of-Flight Camera Accuracy by Automatic Integration Time Adaption Online Improvement of Time-of-Flight Camera Accuracy by Automatic Integration Time Adaption Thomas Hoegg Computer Graphics and Multimedia Systems Group University of Siegen, Germany Email: thomas.hoegg@uni-siegen.de

More information

A Novel Criterion Function in Feature Evaluation. Application to the Classification of Corks.

A Novel Criterion Function in Feature Evaluation. Application to the Classification of Corks. A Novel Criterion Function in Feature Evaluation. Application to the Classification of Corks. X. Lladó, J. Martí, J. Freixenet, Ll. Pacheco Computer Vision and Robotics Group Institute of Informatics and

More information

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points] CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.

More information

Overview of Active Vision Techniques

Overview of Active Vision Techniques SIGGRAPH 99 Course on 3D Photography Overview of Active Vision Techniques Brian Curless University of Washington Overview Introduction Active vision techniques Imaging radar Triangulation Moire Active

More information

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Sea Chen Department of Biomedical Engineering Advisors: Dr. Charles A. Bouman and Dr. Mark J. Lowe S. Chen Final Exam October

More information