3D Hand and Fingers Reconstruction from Monocular View

1. Research Team
Project Leader: Prof. Isaac Cohen, Computer Science
Graduate Students: Sung Uk Lee

2. Statement of Project Goals
The need for accurate hand motion tracking has been increasing in HCI and in applications requiring an understanding of user motions. Hand gestures can serve as a convenient user interface for various computer applications in which the state and the actions of the users are automatically inferred from a set of video cameras. The objective is to extend current mouse/keyboard interaction techniques so that the user can behave naturally in the immersed environment, with the system perceiving and responding appropriately to user actions. HCI-related applications remain the most frequent; however, we have focused our effort on the accurate detection and tracking of un-instrumented hands for assessing user performance in accomplishing a specific task.

3. Project Role in Support of IMSC Strategic Plan
Multimodal communications rely on speech, haptics, and visual sensing of the user in order to characterize the user's emotional and/or physical state. The proposed research focuses on augmenting the interaction capabilities of a multimodal system by providing automatic tracking of hand motion and recognition of the corresponding hand gestures. The objective here is to augment or replace the mouse-and-keyboard paradigm with functionalities relying on natural hand motion for driving a specific application.

4. Discussion of Methodology Used
Inferring hand motion from a monocular view is challenging, as the detected 2D silhouette is not sufficient to depict a complicated hand pose. Therefore, several studies introduce a hand model and a model-based approach. Although these methods can track articulated hand motion from a monocular view, many of them are computationally too expensive and have difficulties in addressing the hand's self-occlusions.
We revised the articulated hand model proposed in the previous year to account for natural hand motion constraints; the revised model allows tracking the detected hand efficiently in the presence of self-occlusions and has a low computational cost. After segmentation of the hand silhouette from the video stream, an articulated hand model is fitted onto the 2D mask of the hand. For initialization, the 3D hand model configuration, such as joint offsets and sizes, is refined using a favorable stretched-out pose; Section 4.2 describes the initialization process in detail. The first joints connecting the palm and each finger form a rigid shape. Once the 3D model is refined, the global hand pose is computed using this rigid-form constraint, and the motion of each finger is then estimated. To reconstruct an accurate pose, the 3D model is updated during tracking by re-projecting it onto the detected 2D silhouette. An overview of the proposed approach is shown in Figure 1.
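The per-frame loop just described (silhouette segmentation, global pose estimation, per-finger estimation, re-projection) can be sketched as follows. This is an illustrative skeleton only: every function name and body below is a hypothetical stub standing in for a stage of the pipeline, not the authors' implementation.

```python
import numpy as np

def segment_silhouette(frame):
    """Stub: segment the hand silhouette into a binary 2D mask."""
    return frame > 0

def estimate_global_pose(mask, model):
    """Stub: rigid palm pose [R_g, T_g]; here T_g is just the mask centroid."""
    ys, xs = np.nonzero(mask)
    return {"T_g": np.array([xs.mean(), ys.mean()]), "R_g": 0.0}

def estimate_finger_angles(mask, model, pose):
    """Stub: bending angles under the planar 3-joint-arm finger model."""
    return np.zeros((5, 3))  # 5 fingers x 3 joints

def track_frame(frame, model):
    mask = segment_silhouette(frame)
    pose = estimate_global_pose(mask, model)
    angles = estimate_finger_angles(mask, model, pose)
    # Re-project the adjusted 3D configuration onto the 2D silhouette and
    # carry the refined state over to the next frame.
    return {"pose": pose, "angles": angles}

frame = np.zeros((8, 8))
frame[2:6, 2:6] = 1.0
state = track_frame(frame, {})
```

Each stub would be replaced by the corresponding technique of Sections 4.1 and 4.2; only the control flow mirrors Figure 1.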
Figure 1. Overview of the proposed approach.

4.1 Modeling Hand Constraints
Without any constraints, analyzing human hand motion is an exhaustive task. However, the joints of a hand are interrelated and their motions are related by physical constraints. Huang et al. [4] showed in detail that human hand motion is highly constrained; they proposed a learning approach to model the hand configuration space sensed directly from a CyberGlove. In this project, three types of constraints are defined for natural hand motion. Figure 2 shows the articulated hand model proposed to model hand motion constraints.

Figure 2. The articulated hand model.

Each joint j_fi is a revolute joint, where f is the index of the fingers enumerated from thumb to little finger (f = 0 ~ 4). All the first joints (i = 0) connecting the palm and each finger have 2 degrees of freedom, and the other joints have a single rotational axis. Therefore, the suggested model has 20 DOF.

Intra-Finger Constraints. The main constraint considered in this category concerns the finger joints during bending: the number of rotational axes per joint is constrained. By using the planar 3-joint-arm finger model, the DOF of a finger is reduced to 4. The first joint of a finger, j_f0, has two rotation axes (one for the bending motion and one for the abduction/adduction motion), while the other joints have a single rotation axis. Table 1 shows the range of each joint angle for the bending motion.

Table 1. Joint angle constraints for bending motion
j_fi       i = 0          i = 1         i = 2
f = 0      -10° ~ +45°    0° ~ +90°     -5° ~ +90°
f = 1~4    -15° ~ +90°    -5° ~ +90°    -5° ~ +90°

Each pair of contiguous joints satisfies the following constraint:

    θ_fi = a_i · θ_f(i-1)    (1)
where the finger index f = 1 to 4 and the joint index i = 1, 2. The coefficient a_i is defined by (limit angle of j_fi) / (limit angle of j_f(i-1)). This type of constraint is not strictly enforced, but it is an efficient guide for joint offset estimation.

Inter-Finger Constraints. The proposed model has fixed geometric proportions among the j_f0 (the offsets of the first joints of all fingers). No joints can overlap each other in 3D coordinates. Interrelated motion constraints for the abduction/adduction motion between adjacent fingers are also defined in this category: an adjacent finger tends to follow the dominant articulated motion.

Global Motion Constraints. This category includes some restrictions on the target motion. The tracking objective is natural hand motion, beginning with a favorable pose for initialization. No external torque on a joint is expected, and the motion should be smooth. With these constraints, we address the problem of tracking articulated hand motion with self-occlusion, rotation, and translation; the proposed approach does not address scaling or depth estimation from a monocular view. Some recent studies introduce constraints similar to the ones mentioned above; however, they are rarely modeled as an explicit form of equations.

Figure 3. Model initialization. (a) Distance map. (b), (c) Approaching the median axis (minimum distance) along the gradient of the distance map. (d) Joint position refinement. (e) Coarse 2D model with ellipsoid fitting. (f) An example after refinement.

Figure 4. Kinematic finger model.

4.2 Hand Model Fitting with Hand Constraints
A sequence of hand motion begins with a favorable initial pose, as shown in Figure 3. By using an ellipsoid fitting [12] on top of the segmented hand silhouette, a rough scale and orientation of the hand are computed. Borgefors' two-pass algorithm [12] is used to compute the distance map d(x, y) of the hand image for the model refinement:

    D(x, y) = { -d(x, y),  (x, y) ∈ Silhouette
              {  d(x, y),  otherwise            (2)

By minimizing the distance D(x, y) along its gradient ∇D(x, y), each joint offset can be refined.
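A minimal sketch of the signed distance map of equation (2), using a 3-4 chamfer variant of Borgefors' two-pass algorithm, might look as follows. This is an illustrative re-implementation for a boolean silhouette mask, not the original code (the report cites the OpenCV library [12] for this step); distances are in chamfer units (3 per axial step, 4 per diagonal step).

```python
import numpy as np

def chamfer_distance(mask):
    """Borgefors two-pass 3-4 chamfer distance.

    mask: boolean array; returns, for every True pixel, the chamfer
    distance to the nearest False pixel (False pixels get 0).
    """
    INF = 10**6
    h, w = mask.shape
    d = np.where(mask, INF, 0).astype(np.int64)
    # Forward pass: scan top-left to bottom-right.
    for y in range(h):
        for x in range(w):
            if d[y, x] == 0:
                continue
            for dy, dx, cost in ((0, -1, 3), (-1, -1, 4), (-1, 0, 3), (-1, 1, 4)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    d[y, x] = min(d[y, x], d[ny, nx] + cost)
    # Backward pass: scan bottom-right to top-left.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            for dy, dx, cost in ((0, 1, 3), (1, 1, 4), (1, 0, 3), (1, -1, 4)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    d[y, x] = min(d[y, x], d[ny, nx] + cost)
    return d

def signed_distance(mask):
    """Equation (2): negative inside the silhouette, positive outside."""
    return np.where(mask, -chamfer_distance(mask), chamfer_distance(~mask))
```

Descending the gradient of this signed map pulls model joints toward the silhouette's median axis, which is the refinement step illustrated in Figure 3(b)-(d).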
Each joint width (thickness) can also be estimated with D(x, y). Figure 3 shows an example of model initialization in 2D. Based on this information, the initial configuration of the 3D hand model is refined.
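The intra-finger constraints of Section 4.1 can be sketched as a range clamp (Table 1) combined with the coupling of equation (1). The numeric ranges below follow the non-thumb row of Table 1, and a_i is taken as the ratio of the joints' limit angles as the text defines it; this is an illustrative sketch, used only as a soft guide for joint estimation, not the authors' code.

```python
import numpy as np

# Joint index i -> bending range in degrees (Table 1, fingers f = 1~4).
LIMITS = {0: (-15.0, 90.0), 1: (-5.0, 90.0), 2: (-5.0, 90.0)}

def clamp(theta, i):
    """Clamp a bending angle to the Table 1 range of joint i."""
    lo, hi = LIMITS[i]
    return float(np.clip(theta, lo, hi))

def coupled_angles(theta0):
    """Predict theta_1 and theta_2 from theta_0 via equation (1):
    theta_i = a_i * theta_{i-1}, with a_i the ratio of limit angles."""
    angles = [clamp(theta0, 0)]
    for i in (1, 2):
        a_i = LIMITS[i][1] / LIMITS[i - 1][1]
        angles.append(clamp(a_i * angles[-1], i))
    return angles
```

Because the constraint is not strictly enforced, a tracker would use these predicted angles only to initialize or regularize the per-joint search, not as hard values.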
Using the inter-finger constraints and the global motion constraints, a hand motion can be decomposed into a global pose and individual finger motions. At time t, three of the first joints j_f0 form a palm-triangle PT(t) whose three sides have lengths w_t^1, w_t^2, and w_t^3. Given two consecutive palm-triangles PT(t) and PT(t+1), the proportions of corresponding sides provide the global pose. By the global motion constraint, j_00 is the origin of the hand configuration; thus, the translation vector T_g can be estimated from two consecutive 2D positions of j_00 detected from the hand silhouette. By the rigid motion constraint, the ratio w_t^p / w_{t+1}^p gives the angle for the rotation vector R_g. In Figure 5, the bright violet palm-triangle shows the estimation results of the global pose.

Figure 5. 3D hand reconstruction 1: the 1st row shows the result of the global pose estimation with the palm-triangle and the re-projected hand model; the 2nd row shows the reconstructed 3D hand model.

Figure 6. 3D hand reconstruction 2: the 1st row shows the original hand pose; the 2nd row represents the reconstructed pose; the 3rd row is a different angle of view.

After estimating the global rotation and translation [R_g, T_g], each joint angle of a finger can be estimated from the equations in Figure 4. The length of each link in a finger was fixed at initialization; therefore, the joint angle θ_i for the bending motion can be computed explicitly. The proposed tracking technique is based on tracking by synthesis: to maximize the likelihood, possible sets of configurations are tested. The adjustment of the hand model is guided by the motion constraints and the configuration of the previous frame. The disparity vector of the closest boundary between the model segment and the detected silhouette is computed for estimating the position of a joint. The adjusted 3D hand configuration is projected back onto the 2D image and updated from the 2D image measurements back and forth. Figure 5 shows some results of motion tracking.
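One reading of the global-pose step above can be sketched as follows: the translation T_g is the 2D displacement of the palm origin, and, under the rigid motion constraint, a palm-triangle side of length w_t that foreshortens to w_{t+1} = w_t cos(θ) suggests a rotation of θ about an axis in the image plane. This interpretation of the side-length ratio is an assumption on our part, and the function below is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def global_pose(tri_t, tri_t1):
    """tri_t, tri_t1: (3, 2) arrays of palm-triangle vertices at t and t+1,
    with the palm origin j_00 as the first vertex (image coordinates)."""
    # Translation T_g: displacement of the palm origin between frames.
    T_g = tri_t1[0] - tri_t[0]
    # Rigid-motion constraint: the ratio of corresponding side lengths
    # w_{t+1} / w_t = cos(theta) under an in-plane-axis rotation.
    w_t = np.linalg.norm(tri_t[1] - tri_t[0])
    w_t1 = np.linalg.norm(tri_t1[1] - tri_t1[0])
    theta = np.arccos(np.clip(w_t1 / w_t, -1.0, 1.0))
    return T_g, theta
```

A full estimate of R_g would combine the angles recovered from all three sides; the single-side version keeps the geometry of the constraint visible.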
The reconstructed 3D hand model and its re-projected 2D model match the detected silhouette correctly. Various hand poses, including self-occlusion, and the corresponding results are shown in Figure 6.

5. Short Description of Achievements in Previous Years
The 3D shape of the hands is reconstructed from 2D hand silhouettes extracted from two calibrated cameras using a GC-based approach. Hand pose information can be computed by matching these synthesized shapes to a pre-defined hand model or by learning hand postures using a 3D shape description. The framework we previously developed for 3D body reconstruction
[14] has been modified for 3D hand reconstruction. The detected silhouettes are segmented into separate fingers: we compute edge information in the segmented hand region (inside the silhouettes) to reconstruct each finger as a cylindrical form. This local edge analysis allows us to reconstruct the 3D hand shape accurately whether or not the fingers are splayed.

5a. Detail of Accomplishments During the Past Year
The goal of this project is to understand and reconstruct human hand motion with natural articulation and occlusion in real time. The objective of this research effort is to augment or replace the mouse and keyboard paradigm with functionalities relying on natural hand motion for driving a specific application. During this year we have focused on inferring accurate hand motion from a monocular view. An articulated model is fitted accurately by the proposed model refinement technique, and the global pose of the hand can be computed by characterizing the orientation of the palm. Individual finger motion is modeled and tracked as a planar arm. We have mainly addressed the problem of tracking simultaneous rotation, translation, and self-occlusion in a monocular view. The proposed approach is easy to extend to a real-time, multi-view framework, although the target motion is somewhat restricted.

6. Other Relevant Work Being Conducted and How this Project is Different
A large number of studies have been proposed for capturing human hand motion. Some of them focus on a very specific domain and thus make a large number of restrictive assumptions. In [9, 10, 11], trajectories of hand motion are considered for classifying human gestures and are sufficient for the studied domain. Among the frameworks that have been proposed, we can distinguish two types: appearance-based and model-based methods. In [7], an appearance-based method is proposed for characterizing the various hand configurations and finger poses.
This relies on a learning-based approach and requires training the algorithm on a large data set that depicts all possible configurations. Similarly, in [8] the articulated pose of a hand is formulated as a database indexing problem: from a large set of synthetic hand images, the closest matches to an input hand image are retrieved as the estimated pose. Model-based approaches rely on the physical properties of hand and finger motion [1, 2, 3] in order to infer it from a single view or multiple views. Huang et al. [5] proposed a model-based approach that integrates constraints of natural hand motion. They also proposed a learning approach to model the kinematic constraints directly from a real input device such as the CyberGlove [4]. They focused on the analysis of local finger motion and constraints from observed data. For articulated motion tracking, a Sequential Monte Carlo algorithm based on importance sampling is employed. From a fixed viewpoint this produces good results; however, global hand position and motion are not dealt with. Recently, studies integrating various cues such as motion, shading, and edges have been proposed. Metaxas et al. [6] compute the model-driving forces using a gradient-based optical flow constraint and edges. A forward recursive dynamic model is introduced to track the motion in response to 3D-data-derived forces applied to the model. They assume that the global hand motion is rigid; no simultaneous articulated motion is expected. When the fingers undergo significant relative motion with occlusions, these cues hardly drive the hand model.
7. Plan for the Next Year
We plan to extend the capabilities of the current system to a multi-view framework. We will also focus on the development of an efficient joint angle tracking technique for both real-time finger motion tracking and hand localization in 3D.

8. Expected Milestones and Deliverables
For the next five years, we will focus on expanding the current method toward the recognition of hand gestures, as well as tracking the hand motion of a person manipulating objects.
2004-2005: Articulated model fitting and tracking. Efficient tracking technique for real-time performance.
2005-2006: Gesture recognition from the inferred articulated model. Computer interaction using basic hand gestures.
2006-2008: Multimodal interaction using hand motion.

9. Member Company Benefits
We are investigating two potential applications: manipulation of 3D objects displayed on a 3D LCD, and evaluation of handwriting deficiencies. We initiated a collaboration with occupational therapy specialists who identify this technology as very important for characterizing handwriting deficiencies among persons afflicted with ADD syndrome. A further potential use of the technology is monitoring operators performing maintenance tasks on expensive machinery.

10. References
[1] J. M. Rehg and T. Kanade, "Model-Based Tracking of Self-Occluding Articulated Objects", in ICCV '95.
[2] Q. Delamarre and O. Faugeras, "Finding Pose of Hand in Video Images: A Stereo-Based Approach", in IEEE International Conference on Automatic Face and Gesture Recognition, Japan, 1998.
[3] B. Stenger, P. R. S. Mendonça, and R. Cipolla, "Model-Based 3D Tracking of an Articulated Hand", in CVPR, Volume II, pages 310-315, Kauai, USA, December 2001.
[4] J. Lin, Y. Wu, and T. S. Huang, "Modeling Human Hand Constraints", in Workshop on Human Motion, Austin, Dec. 2000.
[5] Y. Wu, J. Lin, and T. S. Huang, "Capturing Natural Hand Articulation", in ICCV '01, pp. 426-432.
[6] S. Lu, G. Huang, D. Samaras, and D. Metaxas, "Model-based Integration of Visual Cues for Hand Tracking", in Proc. WMVC 2002.
[7] R. Rosales, V.
Athitsos, L. Sigal, and S. Sclaroff, "3D Hand Pose Reconstruction Using Specialized Mappings", in IEEE ICCV, Canada, Jul. 2001.
[8] V. Athitsos and S. Sclaroff, "Estimating 3D Hand Pose from a Cluttered Image", in CVPR 2003.
[9] R. Cutler and M. Turk, "View-Based Interpretation of Real-Time Optical Flow for Gesture Recognition", in Proc. Face and Gesture Recognition, pp. 416-421, 1998.
[10] A. Nishikawa, A. Ohnishi, and F. Miyazaki, "Description and Recognition of Human Gestures Based on the Transition of Curvature from Motion Images", in Proc. Face and Gesture Recognition, pp. 552-557, 1998.
[11] Y. Xiong and F. Quek, "Gestural Hand Motion Oscillation and Symmetries for Multimodal Discourse: Detection and Analysis", in Proc. Workshop on Computer Vision and Pattern Recognition for HCI, June 2003.
[12] Intel Corp., Open Source Computer Vision Library Reference Manual, 1999.
[13] L. Sciavicco and B. Siciliano, Modeling and Control of Robot Manipulators, McGraw-Hill, NY, 1996.
[14] I. Cohen and M. W. Lee, "3D Body Reconstruction for Immersive Interaction", Second International Workshop on Articulated Motion and Deformable Objects, Palma de Mallorca, Spain, 21-23 November 2002.