3D Hand Pose Reconstruction Using Specialized Mappings


Boston University Computer Science Tech. Report No. 2000-22, Dec. 2000 (revised Apr. 2001)

To appear in Proc. IEEE International Conf. on Computer Vision (ICCV), Canada, Jul. 2001.

Rómer Rosales, Vassilis Athitsos, Leonid Sigal, and Stan Sclaroff
Boston University, Computer Science Department
111 Cummington St., Boston, MA

Abstract

A system for recovering 3D hand pose from monocular color sequences is proposed. The system employs a non-linear supervised learning framework, the specialized mappings architecture (SMA), to map image features to likely 3D hand poses. The SMA's fundamental components are a set of specialized forward mapping functions and a single feedback matching function. The forward functions are estimated directly from training data, which in our case are examples of hand joint configurations and their corresponding visual features. The joint angle data in the training set is obtained via a CyberGlove, a glove with 22 sensors that monitor the angular motions of the palm and fingers. In training, the visual features are generated using a computer graphics module that renders the hand from arbitrary viewpoints given the 22 joint angles. The viewpoint is encoded by two real values; therefore 24 real values represent a hand pose. We test our system both on synthetic sequences and on sequences taken with a color camera. The system automatically detects and tracks both hands of the user, calculates the appropriate features, and estimates the 3D hand joint angles and viewpoint from those features. Results are encouraging given the complexity of the task.

1 Introduction

The estimation of hand pose from visual cues is a key problem in the development of intuitive, non-intrusive human-computer interfaces. The shape and motion of the hand during a gesture can be used to recognize the gesture and classify it as a member of a predefined class.
The importance of hand pose estimation is evident in other areas as well; e.g., video coding, video indexing/retrieval, sign language understanding, computer-aided motion analysis for ergonomics, etc.

In this paper, we address the problem of recovering 3D hand pose from a monocular color sequence. Our solution to this problem makes use of concepts from stochastic visual segmentation, computer graphics, and non-linear supervised learning. Our contribution is an automatic system that tracks the hand and estimates its 3D configuration on every frame, that does not impose any restrictions on the hand shape, does not require manual initialization, and can easily recover from estimation errors.

This work was supported in part through an Office of Naval Research Young Investigator Award and National Science Foundation grants (IIS and EIA).

Figure 1: Hand pose estimation overview.

2 Related Work

Several existing systems include automated hand detection and tracking. Such systems typically make restrictive assumptions on the domain: only hands move, the hands are the fastest moving objects in the scene [17, 38, 40, 3, 21, 23], hands are skin colored, or they are the only skin-colored objects in the scene [21, 34]. Often the background is assumed to be static and known [21, 8]. Some systems use such assumptions to obtain several possible regions where the hands are, and use matching with appearance-based models to choose among those regions [38, 9]. Stochastic tools, such as Kalman filtering [34, 38, 40], can be used to predict the hand position in a future frame. Overall, hand detection and tracking algorithms tend to perform well in restricted environments, where assumptions about the number, location, appearance and motion of hands are valid, and the background is known. Reliable performance in more general domains is still beyond the current state of the art.

Previous systems' representation of hand pose varies widely. For certain applications, hand trajectories can be sufficient for gesture classification [3, 23].
However, in some domains, knowledge of more detailed hand configuration must be used to disambiguate between different gestures; e.g., in signed languages. Pose can be estimated in 2D or 3D. Most 2D-based approaches try to match the image of the hand with view-based models corresponding to a limited number of predefined hand poses [9, 4, 38, 11, 17, 35, 34]. In [20] the Condensation algorithm is used to track the index and thumb of a hand. Such methods are valid in restricted domains, in which users are observed from a known viewpoint, performing a limited variety of motions.

One limitation of view-based methods is that pose recognition is not viewpoint invariant. Images of the same 3D hand shape from different viewpoints, or even rotated examples of the same image, would be considered different poses. Some of those limits have been addressed by using multiple cameras [35] and stereo [8]; naturally, such methods will not work on monocular sequences. Our approach avoids this limitation through the use of a probabilistic model, the Specialized Mappings Architecture (SMA), that maps image features to likely 3D hand poses.

An approach related to SMA is described in [39], where a system is trained with views corresponding to many different hand orientations and viewpoints. Some training views are labeled with the 3D pose category they correspond to, but most of them are unlabeled. The categories of the unlabeled data are treated as missing values in a D-EM (Discriminant Expectation-Maximization) framework. The system can recognize 14 hand configurations, observed from a variety of viewpoints. A difference between that approach and ours is that, in their system, the configuration estimation is formulated as a classification problem, in which a finite number of classes are defined. Our SMA approach is based on regression rather than classification, allowing for theoretically continuous solutions of the estimation problem. Sometimes, such continuous solutions are preferable to simply recognizing a limited number of classes. For example, in a virtual reality application, we may want to accurately reconstruct the hand of the user in the virtual environment and estimate the effects of that particular configuration on the environment. Even in cases where the ultimate goal is classification, accurate 3D information can improve recognition by making it robust to viewpoint variations.

An important decision in estimating 3D pose is the representation and parameterization. Link-and-joint models are used by [25, 31], whereas a mesh model is used by [10]. In those three systems, the hand configuration at the beginning of a sequence must be known a priori.
In addition, self-occlusions and fast motions make it hard to maintain accuracy while tracking. Our proposed SMA approach avoids these drawbacks.

SMA is related to machine learning models [16, 12, 6, 28] that use the principle of divide-and-conquer to reduce the complexity of the learning problem by splitting it into several simpler ones. In general these algorithms try to fit surfaces to the observed data by (1) splitting the input space into several regions, and (2) approximating simpler functions to fit the input-output relationship inside these regions. The splitting process may create a new problem: how to optimally partition the problem such that we obtain several sub-problems that can be solved using the specific solver capabilities (i.e., the form of the mapping functions). In SMAs, we address this problem by solving for the partitions and the mappings simultaneously.

In the work of [6], hard splits of the data were used, i.e., the parameters in one region only depend on the data falling in that region. In [16], some of the drawbacks of the hard-split approach were pointed out (e.g., an increase in the variance of the estimator), and an architecture that uses soft splits of the data, the Hierarchical Mixture of Experts, was described. In this architecture, as in [12], at each level of the tree a gating network is used to control the influence (weight) of the expert units (mapping functions) used to model the data. However, in [12] arbitrary subsets of the expert units can be chosen. Unlike these architectures, in SMAs the mapping selection is done using a feedback matching process, currently in a winner-take-all fashion, although soft splitting is done during training. In applications where a feedback map can be computed easily and accurately, this is an important advantage. Also, the shape of the regions that determine ownership by a given specialized function is general; we do not assume any fixed functional form or discriminant function to define these regions (as gating networks do).
With respect to work on learning-based approaches for estimating articulated body pose, Point Distribution Models have been applied to recovering upper-body pose from silhouettes or skin-colored blobs [1, 24]. In [13], a Gaussian probability model for short human motion sequences was built. However, this method assumes that 2D tracking of joints in the image is given. In [2], the manifold of human body configurations was modeled via a hidden Markov model and learned via entropy minimization. In [33] dynamic programming is used to calculate the best global labeling of the joint probability density function of the position and velocity of body features; it was also assumed that it is possible to track these features for pairs of frames. These last three approaches model the dynamics of motion, a problem that in general requires much more training data to build a reasonable approximation to the underlying probability distribution.

3 Overview

An overview of our approach can be seen in Fig. 1. First, the SMA is trained: a number of example hand joint configurations are acquired using a CyberGlove (at approx. 15 Hz). The CyberGlove measures 22 angular DOF of the hand. Computer graphics software can be used to render a shaded view of any hand configuration captured by the CyberGlove. Using this computer graphics rendering function, we can generate a uniform sampling of viewpoints over the whole view sphere, and render views (images) of every hand configuration from all sampled viewpoints. We can then use image processing to extract a visual feature vector from each of the images generated; in our case we extract moment-based features, but other features are possible [13]. This process yields a set $\Psi = \{\psi_i\}$, where $\psi_i$ is each of the hand joint configurations from each viewpoint (see footnote 1), and a set $\Upsilon = \{\upsilon_i\}$, where $\upsilon_i$ is the vector of visual features corresponding to each $\psi_i$. These sets $\Psi$ and $\Upsilon$ constitute samples from the input-output relationship that we will attempt to learn using our architecture. Given a new image of a hand, we compute its visual feature vector.
We then compute the mapping from this feature vector to the most likely 24 DOF hand configuration. Note that this mapping is highly ambiguous. In fact the relationship is many-to-many; therefore no single function can perform this task. Using the Specialized Mapping Architecture (SMA), we split (partition) this mapping into many mappings. Each of these hopefully simpler problems is then solved using a different specialized function. The SMA learning scheme solves for partitions and mappings simultaneously. The SMA tries to learn a multiple mapping so that, when performing inference, given a vector of visual features, an output in the output space of hand configurations can be provided.

Footnote 1: This vector is then composed of 22 internal pose parameters plus two global orientation parameters.

The right column of Fig. 1 shows a diagram of the inference process. First, video input is obtained and, using a segmentation module, regions with a high likelihood of being skin colored are found. From these regions we extract visual features (e.g., moments). The resulting vector of visual features is then presented to the SMA, which generates several output estimates, one of which is chosen using a defined cost function. Most of the details, including the processes of learning and inference by SMA, are presented in the following sections. Our approach can easily integrate different choices of features. Furthermore, the same approach can be used to estimate the pose of articulated objects other than hands.

4 Hand Shape Representation

The hand model that we use is implemented in the VirtualHand programming library [36]. The parameters of the model are 22 joint angles. For the index, middle, ring and pinky finger, there is an angle for each of the distal, proximal and metacarpophalangeal joints. For the thumb, there is an inner joint angle, an outer joint angle and two angles for the trapeziometacarpal joint. There are also abduction angles between the following pairs of successive fingers: index/middle, middle/ring and ring/pinky. Finally, there is an angle for the palm arch, an angle measuring wrist flexion and an angle measuring wrist bending towards the pinky finger.

The VirtualHand library provides tools that can render an artificial hand from an arbitrary viewpoint, given values for the 22 angles. Fig. 3 shows examples of hand renderings. Using a CyberGlove (manufactured by Virtual Technologies) we collected about 2,400 examples of hand poses (parameterized as vectors containing the 22 angles). We rendered each pose from 86 different viewpoints. Those viewpoints formed an approximately uniformly distributed set on the surface of a sphere centered at the hand. The synthetic images obtained this way were used for training and testing as described in the experimental results.
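The text states only that the 86 viewpoints are approximately uniformly distributed on a sphere centered at the hand; it does not say how they were generated. As a minimal sketch, assuming a Fibonacci (golden-angle) lattice as the sampling scheme, the viewpoints and their two-value encoding (latitude and longitude) could be produced as follows. The function names `sample_view_sphere` and `to_lat_lon` are hypothetical.

```python
import math

def sample_view_sphere(n_views=86):
    """Return n_views roughly evenly spread unit vectors on the view sphere.

    Assumption: a Fibonacci (golden-angle) lattice; the paper does not
    specify which sampling scheme was actually used.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    views = []
    for i in range(n_views):
        z = 1.0 - 2.0 * (i + 0.5) / n_views   # height, strictly inside (-1, 1)
        r = math.sqrt(1.0 - z * z)            # radius of the circle at that height
        phi = i * golden_angle                # longitude advances by the golden angle
        views.append((r * math.cos(phi), r * math.sin(phi), z))
    return views

def to_lat_lon(v):
    """Encode a unit view direction by two real values (latitude, longitude)."""
    x, y, z = v
    return math.asin(z), math.atan2(y, x)

views = sample_view_sphere(86)
lat_lon = [to_lat_lon(v) for v in views]
```

Each of the roughly 2,400 glove poses would then be rendered once per entry of `views`, and the pair in `lat_lon` appended to the 22 joint angles to form the 24-value pose vector.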
The VirtualHand library was also used to reconstruct the estimated 3D hand shape for testing data, based on the output of our system.

5 Learning Algorithm

The estimation paradigm used in this work consists of mapping the observed low-level visual features to hand joint configurations. The underlying approach for finding this mapping is based on the Specialized Mappings Architecture (SMA), a non-linear supervised learning architecture. Given an input space (visual features) and an output space (hand configurations), SMAs consist of several specialized forward mapping functions $\phi_k$ and a feedback matching function $\zeta$, which in this case is known (visual features can be obtained given the joint configurations by using computer graphics based rendering). In order to estimate these mappings, we use a supervised learning approach with training data $Z = \{z_1, \ldots, z_N\}$, where each $z_i = (\psi_i, \upsilon_i)$ pairs an output $\psi_i$ (hand joint angles) with an input $\upsilon_i$ (visual features). Our architecture generates a series of $M$ functions $\phi_k$, each of which is specialized to map certain inputs (its specialized domain) better than others. The specialized domain can be, for example, a region of the input space. However, the specialized domain of $\phi_k$ can be more general than just a connected region in the input space. We propose to determine these specialized domains and functions simultaneously.

Figure 2: SMA diagram illustrating (a) an estimated SMA model with $M$ specialized functions mapping subsets of the training data (each subset is drawn with a different color) and (b) the inference process, in which a given observation is mapped by all the specialized functions and a feedback matching step is then performed to choose the best of the $M$ estimates.

Fig. 2(a) illustrates the basic idea of this model. We use different colors (gray-levels) to represent the domain of each specialized function. At initialization, random colors are assigned to each point; the goal is to find an optimal mapping and partition that is efficient in reducing some error function.
Once the model has been learned, our mapping may look like Fig. 2(a), in which each function is in charge of mapping certain inputs only.

5.1 Probabilistic Model

Let the training sets of output-input observations be $\Psi = \{\psi_1, \psi_2, \ldots, \psi_N\}$ and $\Upsilon = \{\upsilon_1, \upsilon_2, \ldots, \upsilon_N\}$ respectively. We will use $z_i = (\psi_i, \upsilon_i)$ to define a given output-input training pair; $Z = \{z_1, \ldots, z_N\}$ represents our observed training set. Define the unobserved random variables $y_i$ with $i = 1, \ldots, N$, and $Y = (y_1, y_2, \ldots, y_N)$. In our model the variables $y_i$ have as domain the discrete set $C = \{1, \ldots, M\}$ of labels for the specialized functions, and $y_i$ can be thought of as the number of the function used to map data point $i$; therefore $M$ is the number of specialized functions in the model. Define the model parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_M, \lambda)$, where $\theta_k$ represents the parameters of the mapping function $k$. The vector $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_M)$, where $\lambda_k$ represents $P(y_i = k \mid \theta)$. Using Bayes' rule and assuming independence among observations, we have the joint probability of the observed and hidden variables conditioned on our model parameters (treating the inputs $\upsilon_i$ as given):

$$p(Z, Y \mid \theta) = p(Z \mid Y, \theta)\, P(Y \mid \theta) = \prod_{i=1}^{N} p(\psi_i \mid \upsilon_i, y_i, \theta)\, P(y_i \mid \theta) \qquad (1)$$

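5.2 SMA Parameter Estimation and the EM Algorithm

The optimization problem defined by Eq. 1 is computationally very expensive. Here, the probabilistic parameter estimation problem is approached under the Expectation Maximization (EM) algorithm framework [5]. We use the notation followed by [22]. Note that Eq. 1 makes reference to a still undefined distribution $p(\psi_i \mid \upsilon_i, y_i, \theta)$. Several options have been proposed [27]. Here we will use a Gaussian distribution whose mean is given by the output of the possibly non-linear mapping function $\phi_k$ and whose covariance is $\Sigma_k$:

$$p(\psi_i \mid \upsilon_i, y_i = k, \theta) = \mathcal{N}\big(\psi_i;\ \phi_k(\upsilon_i, \theta_k),\ \Sigma_k\big) \qquad (2)$$

Using this distribution, the E-step consists of finding $\tilde{P}^{(t)}(Y) = P(Y \mid Z, \theta)$. In our case, this factorizes as:

$$\tilde{P}^{(t)}(y_i = k) = \frac{\lambda_k\, p(\psi_i \mid \upsilon_i, y_i = k, \theta)}{\sum_{j \in C} \lambda_j\, p(\psi_i \mid \upsilon_i, y_i = j, \theta)} \qquad (3)$$

$$= \frac{\lambda_k\, |\Sigma_k|^{-1/2} \exp\!\big(-\tfrac{1}{2}\, e_{ik}^{\top} \Sigma_k^{-1} e_{ik}\big)}{\sum_{j \in C} \lambda_j\, |\Sigma_j|^{-1/2} \exp\!\big(-\tfrac{1}{2}\, e_{ij}^{\top} \Sigma_j^{-1} e_{ij}\big)}, \qquad e_{ik} = \psi_i - \phi_k(\upsilon_i, \theta_k) \qquad (4)$$

The M-step consists of finding $\theta^{(t)} = \arg\max_{\theta} E_{\tilde{P}^{(t)}}[\log p(Z, Y \mid \theta)]$. Using our model, it can be shown that:

$$\theta^{(t)} = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{k \in C} \tilde{P}^{(t)}(y_i = k) \big[ \log p(\psi_i \mid \upsilon_i, y_i = k, \theta) + \log P(y_i = k \mid \theta) \big] \qquad (5)$$

This gives the following update rules for $\lambda_k$ and $\Sigma_k$ (where Lagrange multipliers were used to incorporate the constraint $\sum_k \lambda_k = 1$):

$$\lambda_k = \frac{1}{N} \sum_{i=1}^{N} \tilde{P}^{(t)}(y_i = k) \qquad (6)$$

$$\Sigma_k = \frac{\sum_{i} \tilde{P}^{(t)}(y_i = k)\, e_{ik}\, e_{ik}^{\top}}{\sum_{i} \tilde{P}^{(t)}(y_i = k)} \qquad (7)$$

The update for $\theta_k$ depends on the form of $\phi_k$. Here we have chosen a non-linear function of the form:

$$\phi_k(x, \theta_k)_c = g_2\Big( \sum_{l_2=1}^{L_2} w^{(2)}_{c\, l_2}\, g_1\Big( \sum_{l_1=1}^{L_1} w^{(1)}_{l_2\, l_1}\, x_{l_1} \Big) \Big) \qquad (8)$$

where $x_l$ is the $l$-th component of the visual feature vector, $w^{(1)}$ and $w^{(2)}$ are weights and biases (part of $\theta_k$), $g_1$ and $g_2$ are a sigmoidal and a linear function respectively, $L_1$ and $L_2$ are the numbers of nodes in each layer, and $c$ is just the dimension index of the output vector. This is a 1-hidden-layer feed-forward network. Unfortunately, using this function (as would be the case with most non-linear functions) forces us to use iterative optimization for the M-step.

5.3 Stochastic Learning

The update equations described above are useful to find a local minimum given the initial values of the parameters.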
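To make the E-step and the prior update concrete, here is a minimal sketch under stated assumptions: each specialized function is an arbitrary Python callable, and each Gaussian has an isotropic variance (a single number per function) rather than a full covariance matrix. The optional `temperature` argument raises the posteriors to the power 1/T before renormalizing, which is one way to realize the annealing of Sec. 5.3; T = 1 gives the plain E-step. The names `e_step` and `update_priors` are hypothetical.

```python
import math

def e_step(psi, upsilon, maps, lam, var, temperature=1.0):
    """Posterior P(y_i = k) for every training pair (psi_i, upsilon_i).

    psi, upsilon : lists of output / input vectors (lists of floats)
    maps         : list of M callables, each mapping an input to an output
    lam          : list of M prior probabilities
    var          : list of M isotropic variances (simplifying assumption)
    """
    posteriors = []
    for p, u in zip(psi, upsilon):
        d = len(p)
        log_w = []
        for f, l, v in zip(maps, lam, var):
            err2 = sum((a - b) ** 2 for a, b in zip(p, f(u)))
            log_joint = (math.log(l)
                         - 0.5 * d * math.log(2.0 * math.pi * v)
                         - 0.5 * err2 / v)
            log_w.append(log_joint / temperature)
        top = max(log_w)                        # subtract the max for stability
        w = [math.exp(x - top) for x in log_w]
        total = sum(w)
        posteriors.append([x / total for x in w])
    return posteriors

def update_priors(posteriors):
    """Prior update: lambda_k is the average posterior of function k."""
    n = len(posteriors)
    return [sum(row[k] for row in posteriors) / n
            for k in range(len(posteriors[0]))]

# Toy problem: two 1-D maps and two training pairs, each explained by one map.
maps = [lambda u: [u[0]], lambda u: [-u[0]]]
psi = [[1.0], [-2.0]]
upsilon = [[1.0], [2.0]]
post = e_step(psi, upsilon, maps, lam=[0.5, 0.5], var=[0.5, 0.5])
hot = e_step(psi, upsilon, maps, lam=[0.5, 0.5], var=[0.5, 0.5], temperature=100.0)
new_lam = update_priors(post)
```

In the toy problem each pair is claimed almost entirely by the map that reproduces it, while at a high temperature the posteriors become nearly even. The update of the network weights themselves is omitted, since it requires iterative optimization.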
In order to improve this process, and avoid some of the local minima that inevitably arise, we use an annealing schedule on the $\tilde{P}^{(t)}$ probabilities during the M-step. In this way, we redefine:

$$\tilde{P}^{(t)}_T(y_i = k) = \frac{\exp\!\big(\log \tilde{P}^{(t)}(y_i = k) / T\big)}{\sum_{j \in C} \exp\!\big(\log \tilde{P}^{(t)}(y_i = j) / T\big)}$$

In our experiments the temperature parameter $T$ decays exponentially. This step not only helps in avoiding local minima, but also creates two desirable effects. It forces $\tilde{P}^{(t)}_T(y_i = k)$ to be binary (either 1 or 0) at low temperatures; as a consequence, each point will tend to be mapped by only one specialized function at the end of the optimization. Moreover, it makes $\tilde{P}^{(t)}_T(y_i = k)$ ($k = 1, \ldots, M$) fairly even at high temperatures, making the optimization less dependent on initialization.

Note that there is no closed-form solution for the M-step as described above. In practice we have decided to perform two or three iterations per M-step. Another source of randomness added to the process so far described consists in choosing data points at random (uniformly distributed) when performing the M-step. These two variants of the M-step have been justified in the sense of a partial M-step [22].

5.4 Feedback Matching

Once the model parameters have been estimated, each specialized function maps (with different levels of accuracy) the whole input space. Therefore, the following question arises: during reconstruction, given a point in input space, how do we choose the mapping function $\phi_k$ that should be used to map this point?

Fig. 2(b) illustrates the inference process. When generating an estimate $\hat{h}$ of body pose given an input $x$ (the gray point with a dark contour in the lower plane), SMAs generate a series of output hypotheses $H = \{h_k\}_k$ obtained using

$$h_k = \phi_k(x, \theta_k) \qquad (9)$$

with $k \in C$ (illustrated by each of the points pointed to by the arrows). Given the set $H$, we define the most accurate hypothesis to be the one that minimizes a cost function $F(\zeta(h_k), x)$ over $k$. In this paper we use:

$$k^{*} = \arg\min_{k} \big(\zeta(h_k) - x\big)^{\top} \Sigma_{\Upsilon}^{-1} \big(\zeta(h_k) - x\big) \qquad (10)$$

and make $\hat{h} = h_{k^{*}}$, where $\Sigma_{\Upsilon}$ is the covariance matrix of the elements in the set $\Upsilon$ (i.e., the input vectors in our training set) and $k^{*}$ is the assigned label. In Fig. 2(b) we can see that each of the points in the output space is mapped back to the input space; once in this space, these points can be compared (using a given cost function, e.g., Eq. 10) to the initial input observation. The form of the cost function could vary; using Eq. 10 is the same as assuming that $p(x \mid h) = \mathcal{N}(x;\ \zeta(h),\ \Sigma_{\Upsilon})$.

6 Hand Detection and Segmentation

Some of our test data consists of video sequences collected with a color digital camera. In those sequences the background is static, there is only one person present, and the person is facing towards the camera. Our system tracks both hands of the user automatically, using a skin color tracker.

In the first frame of the sequence, the tracker needs to be initialized by locating in the image the objects that we want to track. That could be done by applying a skin detector system, like the one described in [15]. However, using that detector, clothes are sometimes labeled as skin because of their color. We can locate and segment the hands more accurately using the fact that their color is very similar to the color of the face. The position of the face can be found reliably using a face detector system [29]. For each pixel in the detected face we compute a measure of how skin-like the pixel color is. That measure is based on histograms of skin and non-skin color distributions, computed from a database of thousands of images in which regions were labeled as skin and non-skin. Those labeled images were frames from commercially available DVD movies. We select the top 50% of the pixels in the face, for which the measure of skin similarity is the highest. For each of those pixels we compute its color, and we find the mean color of all selected pixels. Then, for each pixel in the image, we calculate the distance of its color from the mean color. We label as skin all pixels for which that distance is less than a threshold. The threshold we use is 17, for RGB values between 0 and 255. The objects we want to track are the three largest connected components of the skin pixels. One of them overlaps with the face, and the other two are considered to be the hands. We initialize the skin tracker with the position of the face and hand regions, and the tracker locates the face and hands in the rest of the frames in each sequence.

The skin tracker models the skin color distribution as a histogram in HSV space. It can handle distributions that change from one frame to the next, because of varying illumination or motion with respect to light sources.
The changes in skin color that occur in a new frame are modeled as the results of translating, rotating and scaling the current histogram. Furthermore, the evolution of the histogram is modeled as a second-order Markov process. The tracker is initialized in the first frame, by being told which regions to track, and it estimates the initial color distribution. In the next 8-30 frames, in addition to tracking and adapting the skin color histogram, it also learns the parameters of the Markov process. After the learning stage, it uses those parameters to predict the color distribution in every new frame, while still updating the Markov model based on the actual histogram that is observed in the new frame. The learning and tracking stages are explained in detail in [32].

Our simple hand detection and tracking algorithm would not work at any frame where the hands overlap with each other or with the face. In our video sequences we took care to avoid such situations. Our system could be made more general by including modules to predict occlusion of an object by another and to detect when those objects are separated again. A similar approach has been successfully applied in the domain of multiple person tracking with occlusion handling [26].

7 Experimental Results

The described approach was tested in experiments with training data consisting of approximately 30 sequences obtained through the use of a CyberGlove. Input-output pairs were generated using computer graphics by rendering from 86 viewpoints roughly uniformly distributed on the view sphere. The output consisted of the 24 pose parameters of a human hand (22 joint angles plus two viewpoint values), linearly encoded by nine real values using Principal Component Analysis (PCA). The input consisted of seven real-valued Hu moments [14] computed on synthetically generated silhouettes of the hand. Hu moments are functions of central image moments. They are invariant to translation, scaling, and rotation on the image plane.
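As an illustration of these invariances, the sketch below computes only the first two of the seven Hu invariants from normalized central moments of a binary silhouette, and checks them on a shifted and a rotated copy of a small shape. It assumes the silhouette is a non-empty list of rows of 0/1 values; the function name `hu_first_two` and the toy shape are hypothetical.

```python
def hu_first_two(img):
    """First two Hu invariants of a non-empty binary silhouette.

    img is a list of rows containing 0/1 values; pixels are treated as
    unit point masses located at their (x, y) grid coordinates.
    """
    pts = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    m00 = float(len(pts))
    cx = sum(x for x, _ in pts) / m00          # centroid removes translation
    cy = sum(y for _, y in pts) / m00

    def eta(p, q):
        """Normalized central moment; dividing by a power of m00 removes scale."""
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pts)
        return mu / m00 ** (1.0 + (p + q) / 2.0)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    return e20 + e02, (e20 - e02) ** 2 + 4.0 * e11 ** 2

shape = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# Same shape moved one pixel right and one pixel down.
shifted = [[0] * 6] + [[0] + row[:-1] for row in shape[:-1]]
# Same shape rotated by 90 degrees in the image plane.
rotated = [list(r) for r in zip(*shape[::-1])]

h_shape, h_shifted, h_rotated = (hu_first_two(s) for s in (shape, shifted, rotated))
```

The three results agree up to floating-point rounding, which is why the feature vector carries no information about where the hand is in the image or how it is rotated within the image plane.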
These invariances ease the observation process (e.g., we do not need to be concerned about where and how large the hand appears in the image). However, rotation invariance makes hand rotation parallel to the image plane unobservable. For the real experiments, observation inputs were obtained by tracking skin color distributions [32].

Approximately 300,000 images were generated synthetically. Of these, 8,000 were used for training and the rest for testing. We used cross-validation for early stopping of the training procedure to avoid overfitting. In the experiments shown, the number of specialized functions was set to 30. Each of these functions was a one-hidden-layer, feed-forward network with 5 hidden neurons. The annealing temperature decayed exponentially with the iteration number of the EM algorithm. Other experiments were performed to test the convergence and fitting properties of the model; due to space limitations these results will not be presented in this paper.

7.1 Quantitative Experiments

Fig. 3 shows example hand configuration estimates obtained in representative test frames (not in the training set). Synthetic images were used in this experiment because ground-truth data was available for quantitative performance evaluation. As can be seen in the figure, self-occluding configurations are harder, but the estimate is still close to ground truth given that neither human intervention nor pose initialization was required.

In order to provide quantitative measures of performance, test data was used to generate viewpoint-dependent error measures. Fig. 4 shows the mean squared error and its variance per viewpoint at the equator, 4(a), and at different latitudes, 4(b). Note in 4(a) that for views on the equator the error is smaller at longitudes close to the view of the palm (from different latitudes).
These performance differences are most likely due to the fact that at side-view angles there is an increased amount of self-occlusion, and also because the projections involve fewer pixels, reducing the samples used to calculate image moments. In 4(b) we can observe that reconstruction errors increase at the poles of the view sphere, where there is also little information projected onto the image plane. While the MSE result is encouraging, the variance suggests that certain hand poses are not accurately recovered (we found that they mostly correspond to complex hand configurations coming from the American Sign Language part of our data).
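The per-viewpoint statistics described above can be sketched as follows. This is a minimal illustration, assuming that the error of one test sample is the squared difference between true and estimated joint angles averaged over the joints; the text does not spell out this normalization, and the function name `per_view_error` and the toy data are hypothetical.

```python
from collections import defaultdict

def per_view_error(samples):
    """Group test samples by viewpoint and summarize their squared error.

    samples: iterable of (view_id, true_angles, estimated_angles).
    Returns {view_id: (mean squared error, variance of the squared error)}.
    """
    errors = defaultdict(list)
    for view, truth, estimate in samples:
        # Assumption: per-sample error is averaged over the joint angles.
        sq = sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth)
        errors[view].append(sq)
    stats = {}
    for view, errs in errors.items():
        mean = sum(errs) / len(errs)
        var = sum((x - mean) ** 2 for x in errs) / len(errs)
        stats[view] = (mean, var)
    return stats

# Toy data: two samples seen from view 0, one from view 1.
toy = [
    (0, [0.0, 0.0], [1.0, 1.0]),
    (0, [0.0, 0.0], [0.0, 0.0]),
    (1, [1.0, 2.0], [1.0, 2.0]),
]
stats = per_view_error(toy)
```

A large variance for a viewpoint, as in view 0 of the toy data, indicates that some poses seen from that viewpoint are recovered much worse than others even when the mean error looks acceptable.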

[Figure 4 plots: (a) "Mean error and variance for all views on the viewsphere equator", MSE in the 22-joint configuration space versus view (x 2π/16 rads.); (b) "Mean error and variance grouped by latitude", MSE in the 22-joint configuration space versus latitude (x 2π/16 rads.), evenly spaced from -90° to +90°.]

Figure 4: Quantitative experimental results. Mean squared error in the reconstruction is shown (a) at the equator of the view sphere, varying the longitude, and (b) at different latitudes, averaging over all the longitudes. Longitude and latitude are in radians, with the reference direction being a view towards the palm of the hand.

7.2 Experiments with Real Sequences

In the next set of experiments, we tested the system against real segmented visual data. The sequences were segmented to yield blobs that corresponded to hands in each frame, as described in Sec. 6. The resulting reconstruction for several relatively complex gesture sequences is shown in Fig. 5. Note that given blob images, recovering 3D hand pose is a difficult task even for a human observer. This difficulty is increased by performing inference from blob moments, which have inferior descriptive power. Methods for addressing this issue are covered further in Sec. 8.

7.3 3D Reconstruction Reliability

It should be noted that SMAs can provide a measure of reconstruction reliability by using the log-probabilities computed in Eq. 10. Ambiguous inputs can be discovered by looking at the relative scores given by Eq. 10 (another option is to look at the entropy of $P(h_k \mid x)$). This is extremely important because, even though the forward maps are designed to handle ambiguities, the inference process clearly still suffers from them. Therefore, it can be impossible to recover some configurations with enough reliability. As an example, in Fig. 5, configurations 5-6 have a low reliability score, even though we obtain good estimates.
Some of the competing hypotheses include estimates that are also consistent (in terms of the visual features used) with the input presented, and some of these consistent hypotheses are far from the true 3D reconstruction. Thus, the system could easily have chosen one of the bad estimates instead of the good ones shown in configurations 5-6.

8 Discussion and Conclusions

In this paper we addressed the problem of recovering 3D hand pose from a monocular color sequence. The main contributions of our work are:

1. A single observed frame can be used for estimation (see footnote 2). As a consequence, no manual initialization is required. Furthermore, the sequence can start with the hand in any position and orientation.
2. No limitation is imposed on the camera viewpoints allowed.
3. The system does regression rather than classification, thereby providing a continuum of pose estimates rather than recognizing a finite number of classes.
4. A novel non-linear supervised learning framework is adapted to the pose estimation problem. This framework allows us, among other things, to avoid the pitfalls of explicit tracking and to measure the reliability of estimates during inference.
5. Reconstruction can be accomplished at near frame rate.

The main advantage of using SMAs in this domain over other function estimation paradigms is that they allow modeling of the ambiguous input-output relationships that arise. For instance, different hand configurations can generate the same visual features, due to self-occlusion. Different visual features can be related to a single hand configuration, due to inaccurate observations or variations in hand morphology. An SMA splits (partitions) the problem into simpler mapping problems. This allows for modeling different parts of the output space independently, as well as for computing multiple possible configurations in ambiguous situations. However, so far we choose only one estimate. This is an interesting aspect not fully addressed in this paper; Sec. 7.3 expands a little on this topic.
In our current implementation, temporal context is not used for improving the output estimates during mapping, but only for segmentation. The hand pose is re-estimated at every frame given the segmented data. We expect that using previous estimates in computing the current hand shape will improve accuracy, and we plan to extend our approach to allow this. However, frame independence allows a very attractive inference time of $O(M)$, with $M$ specialized functions.

Our algorithm could be used as a front end in several gesture recognition applications that take the hand configuration as input. Current systems rely almost exclusively on non-vision techniques to obtain such data, such as CyberGloves [7, 18, 19, 30] and color markers [11]. An automated computer vision technique like ours imposes no restrictions on users. It can also be used in domains where we do not have control of the data collection, and therefore cannot require the use of more sophisticated input devices.

In future work, we plan to experiment with sets of features that are richer and more descriptive than binary silhouettes; e.g., orientation histograms or other texture features. Using stereo should further increase the accuracy of the system, by providing more shape constraints than a single 2D image does. Finally, more sophisticated models of temporal dependencies, like linear Gaussian models in general [11, 34, 37], could be used in the feedback matching to guide the choice of the best reconstruction. Even though we have a useful estimate of confidence, given by Eq. 10, we are looking at alternatives for decreasing the error variance.

3D hand pose reconstruction from a single image is a very difficult task, and at present no fully general solution to the problem exists. Our results show that it is possible to approach this problem using a combination of vision and statistical learning tools. We consider this an important step considering the complexity of the task and the low descriptive power of the features currently employed.
2 We insist that in applications where highly correlated frames can be observed, it is imperative to use this temporal information. However, the ability of our framework to estimate hand pose given only a single frame affords automatic initialization and faster estimation, and could be used as a bootstrap mechanism in more complex systems.

References

[1] R. Bowden, T. Mitchell, and M. Sarhadi. Non-linear statistical models for the 3D reconstruction of human body pose and motion from monocular image sequences. Image Vision Comp., 18(9).
[2] M. Brand. Shadow puppetry. In ICCV.
[3] R. Cutler and M. Turk. View-based interpretation of real-time optical flow for gesture recognition. In Face and Gesture Recognition.
[4] T.J. Darrell, I.A. Essa, and A.P. Pentland. Task-specific gesture analysis in real-time using interpolated views. PAMI, 18(12).
[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood estimation from incomplete data. Journal of the Royal Statistical Society (B), 39(1).
[6] J.H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19:1-141.
[7] M. Fröhlich and I. Wachsmuth. Gesture recognition of the upper limbs: From signal to symbol. In I. Wachsmuth and M. Fröhlich, editors, Gesture and Sign Language in Human-Computer Interaction, Gesture Workshop, Bielefeld, Germany.
[8] R. Grzeszczuk, G. Bradski, M.H. Chu, and Jean-Yves Bouguet. Stereo based gesture recognition invariant to 3D pose and lighting. In CVPR, volume 1.
[9] R. Hamdan, F. Heitz, and L. Thoraval. Gesture localization and recognition using probabilistic visual learning. In CVPR, volume 2.
[10] T. Heap and D. Hogg. Towards 3D hand tracking using a deformable model. In Face and Gesture Recognition.
[11] H. Hienz, K. Kraiss, and B. Bauer. Continuous sign language recognition using hidden Markov models. In Intl. Conf. on Multimodal Interfaces, volume 4, pages 10-15.
[12] G. Hinton, B. Sallans, and Z. Ghahramani. A hierarchical community of experts. Learning in Graphical Models, M. Jordan (editor).
[13] N. Howe, M. Leventon, and B. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In NIPS.
[14] M. K. Hu. Visual pattern recognition by moment invariants. IRE Trans. Inform. Theory, IT(8).
[15] M.J. Jones and J.M. Rehg. Statistical color models with application to skin detection. In CVPR.
[16] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6.
[17] M. Kohler. Special topics of gesture recognition applied in intelligent home environments. In Proceedings of the Gesture Workshop.
[18] R. Liang and M. Ouhyoung. A real-time continuous gesture recognition system for sign language. In Face and Gesture Recognition.
[19] J. Ma, W. Gao, C. Wang, and J. Wu. A continuous Chinese sign language recognition system. In Face and Gesture Recognition.
[20] J.P. MacCormick and M. Isard. Partitioned sampling, articulated objects, and interface-quality hand tracking. In ECCV.
[21] J. Martin, V. Devin, and J.L. Crowley. Active hand tracking. In Face and Gesture Recognition.
[22] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, M. Jordan (editor).
[23] A. Nishikawa, A. Ohnishi, and F. Miyazaki. Description and recognition of human gestures based on the transition of curvature from motion images. In Face and Gesture Recognition.
[24] E-J. Ong and S. Gong. Tracking hybrid 2D-3D human models through multiple views. In ICCV Workshop on Modelling People, Corfu, Greece.
[25] J.M. Rehg. Visual Analysis of High DOF Articulated Objects with Application to Hand Tracking. PhD thesis, Electrical and Computer Eng., Carnegie Mellon University.
[26] R. Rosales and S. Sclaroff. Improved tracking of multiple humans with trajectory prediction and occlusion modeling. In IEEE CVPR Workshop on the Interpretation of Visual Motion.
[27] R. Rosales and S. Sclaroff. Specialized mappings and the estimation of body pose from a single image. In IEEE Human Motion Workshop, Austin, TX.
[28] R. Rosales and Stan Sclaroff. Inferring body pose without tracking body parts. In CVPR.
[29] H.A. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. In CVPR, pages 38-44.
[30] H. Sagawa and M. Takeuchi. A method for recognizing a sequence of sign language words represented in a Japanese sign language sentence. In Face and Gesture Recognition.
[31] N. Shimada, Y. Shirai, Y. Kuno, and J. Miura. Hand gesture estimation and model refinement using monocular camera - ambiguity limitation by inequality constraints. In Face and Gesture Recognition.
[32] L. Sigal, S. Sclaroff, and V. Athitsos. Estimation and prediction of evolving color distributions for skin segmentation under varying illumination. In CVPR.
[33] Y. Song, Xiaoling Feng, and P. Perona. Towards detection of human motion. In CVPR.
[34] T. Starner and A. Pentland. Real-time American sign language recognition using desk and wearable computer based video. PAMI, 20(12).
[35] A. Utsumi and J. Ohya. Multiple-hand-gesture tracking using multiple cameras. In CVPR, volume 1.
[36] Virtual Technologies, Inc., Palo Alto, CA. VirtualHand Software Library Reference Manual, August.
[37] C. Vogler and D. Metaxas. Toward scalability in ASL recognition: Breaking down signs into phonemes. In Proceedings of the Gesture Workshop.
[38] J. Weng and Y. Cui. Recognition of hand signs from complex backgrounds. In R. Cipolla and A. Pentland, editors, Computer Vision for Human-Machine Interaction. Cambridge University Press.
[39] Y. Wu and T.S. Huang. View-independent recognition of hand postures. In CVPR, volume 2, pages 88-94.
[40] M. Yang and N. Ahuja. Recognizing hand gesture using motion trajectories. In CVPR, volume 1, 1999.


Figure 3: Example reconstruction of several synthetic test sequences. Each set (2 rows each) consists of (a) input images, (b) reconstruction. Because our approach can provide us with a reconstruction confidence, we used this to show high-medium-low confidence estimates (one pair of rows each).

Figure 5: Reconstruction obtained from performing hand segmentation on a human subject. The two top pairs of rows show good reconstruction, while the bottom pair shows examples of bad performance. Reconstruction is shown from a fixed viewpoint (latitude and longitude given in rads.).
