ASL Recognition Based on a Coupling Between HMMs and 3D Motion Analysis


An earlier version of this paper appeared in the proceedings of the International Conference on Computer Vision, pp. 363-369, Mumbai, India, January 4-7, 1998.

Christian Vogler and Dimitris Metaxas
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA
vogler@gradient.cis.upenn.edu, dnm@central.cis.upenn.edu

Abstract

We present a framework for recognizing isolated and continuous American Sign Language (ASL) sentences from three-dimensional data. The data are obtained by using physics-based three-dimensional tracking methods and then presented as input to Hidden Markov Models (HMMs) for recognition. To improve recognition performance, we model context-dependent HMMs and present a novel method of coupling three-dimensional computer vision methods and HMMs by temporally segmenting the data stream with vision methods. We then use the geometric properties of the segments to constrain the HMM framework for recognition. We show in experiments with a 53 sign vocabulary that three-dimensional features outperform two-dimensional features in recognition performance. Furthermore, we demonstrate that context-dependent modeling and the coupling of vision methods and HMMs improve the accuracy of continuous ASL recognition.

1 Introduction

American Sign Language (ASL) is the primary mode of communication for many deaf people in the USA. It is a highly inflected language with sophisticated grammatical properties, which strongly constrain the order and appearance of signs. Because of these constraints, it provides an appealing test bed for understanding more general principles governing human motion and gesturing, including human-computer gesture interfaces. Such interfaces are essential in virtual reality applications, where the user must be able to manipulate virtual objects by gesturing. A working ASL recognition system could also facilitate interaction of deaf people with their surroundings.
To date, most attempts at ASL recognition have either used only two-dimensional computer vision methods, or they have used other input devices, such as datagloves, instead of computer vision, to collect input from the signer [18, 3, 23].

In this paper we present a new approach to ASL recognition. First, we use computer vision methods to extract the three-dimensional parameters of a signer's arm motions. We then use Hidden Markov Models (HMMs) to recognize isolated and continuous ASL utterances from the three-dimensional input. We develop context-dependent modeling of HMMs and methods for coupling the application of HMMs and the application of three-dimensional computer vision methods to improve continuous recognition performance. Our approach attempts to overcome some of the limitations of the previous approaches that use two-dimensional visual input, do not use context-dependent modeling, or do not couple computer vision methods with HMMs [18, 3, 17, 12].

Three-dimensional image-based shape and motion tracking of a human's arm and hand is difficult because of the complexity of the motions and occlusion effects. Recently, a methodology has been developed [8, 10] that allows three-dimensional tracking of human motion from multiple images. In this paper we augment this methodology to track the three-dimensional motion of a subject's arms and hands from multiple images. This method is based on the use of deformable models, whose shape and motion fit the given image sequences based on occluding contour information and theorems from projective geometry. The output of this method consists of the three-dimensional motion parameters of the subject's arms. For efficiency reasons, and because arm movements already carry much of the information needed for recognizing ASL signs, we do not use the hand information in this paper.

Apart from obtaining accurate data, ASL recognition is difficult, because there are always statistical variations in the way humans perform motions, even with identical meaning.
In addition, in continuous utterances, there are no clear boundaries between individual signs. HMMs provide a framework for capturing statistical variations in both position and duration of the movement, as well as implicit segmentation of the input stream. Furthermore, continuous recognition is complicated by coarticulation effects, that is, the pronunciation¹ of a sign is influenced by the preceding and following signs. Coarticulation effects can be partly alleviated by training context-dependent HMMs.

¹ By pronunciation we mean motion. We follow the terminology of spoken language linguistics where applicable.

The theory behind HMMs makes several assumptions that are often not valid in practice. For this reason, we develop a new approach that couples computer vision methods with HMM modeling. It is based on a temporal segmentation process that operates by extracting geometric properties of the three-dimensional computer vision parameters. These properties are obtained independently from the HMM algorithms and are used to impose additional constraints on HMM-based recognition.

To test our algorithms and assumptions, we performed a series of experiments based on a vocabulary consisting of 53 different signs that make extensive use of space. We experimented with both isolated and continuous ASL recognition for both three-dimensional and two-dimensional data. As HMMs require large amounts of training data and the computer vision process is computationally expensive, we used data from an Ascension Technologies Flock of Birds and computer vision processes interchangeably.

Our goal is to discover and analyze a usable framework for both isolated and particularly continuous ASL recognition. We do not address more general gesture recognition topics and signer independence in this paper. Neither do we address the involved aspects of ASL linguistics [1] at this point, but obviously, a viable future ASL recognition system should be able to handle them.

In the following sections, we discuss related work and give an overview of the theory behind the vision methods and HMMs. Afterward, we address the use of HMMs for isolated and continuous ASL recognition, and the coupling of computer vision processes with the HMM algorithms. Finally, we outline data collection and provide experimental results for isolated and continuous recognition and the coupling of computer vision and HMMs.

2 Previous Work

Previous work on sign language recognition focuses primarily on fingerspelling recognition and isolated sign recognition. Some work uses neural networks [3, 22]. For this work to apply to continuous ASL recognition, the problem of explicit temporal segmentation must be solved, which is a limitation that HMM-based recognition does not have. Mohammed Waleed Kadous [23] uses Power Gloves to recognize a set of 95 isolated Auslan signs with 80% accuracy, with an emphasis on computationally inexpensive methods.
Kirsti Grobel and Marcell Assam [4] use HMMs to recognize isolated signs with 91.3% accuracy out of a 262 sign vocabulary. They extract the features from video recordings of signers wearing colored gloves.

There is very little previous work on continuous ASL recognition. Thad Starner and Alex Pentland [18] use a view-based approach to extract two-dimensional features as input to HMMs with a 40 word vocabulary. Yanghee Nam and Kwang Yoen Wohn [12] use three-dimensional data as input to HMMs for continuous recognition of a very small set of gestures.

3 Model-based Tracking of a Human's Arms

In this section we give a brief overview of our formulation that allows the three-dimensional arm shape and motion estimation from multiple images [, 7, 8, 10]. Our approach consists of two parts.

The first part [, 7] consists of an active, integrated approach that reliably identifies the parts of a moving articulated object and estimates their shape and motion from a controlled set of motions that reveal the object's structure. We use the algorithm developed in [, 7], which segments the apparent body contour of a moving human into the constituent parts. Initially, a single deformable model is used to fit the image data. As the model deforms to fit the subsequent image contours, which are themselves deformed by the motion of the human, a novel Human Body Part Identification Algorithm (HBPIA) identifies all the body parts. By applying the HBPIA iteratively over the subsequent frames, all the moving parts are identified. In addition, we have extended this algorithm to allow the estimation of the three-dimensional shape of a subject's body parts, based on the integration of images taken from three orthogonally placed cameras. We used this methodology to estimate the three-dimensional shape of the subject's arms shown in the examples in Section 7.
It is worth noting that we have recovered the lower arm and the hand as one part, since in our ASL recognition experiments we did not use the motion of the lower arm and the hand relative to each other.

The second part of the algorithm consists of using the extracted three-dimensional shape of the arm to track the three-dimensional position and orientation of a subject's body parts [8]. To alleviate difficulties arising from occlusion and degenerate views during the unconstrained movement of the arm, we use three calibrated cameras placed in a mutually orthogonal configuration. At every image frame and for each body part, we derive a subset of the cameras that provide the most informative views for tracking. This active and time-varying selection is based on the visibility of a part and the observability of its predicted motion from a certain camera. Once a set of cameras has been selected to track each part, we use concepts from projective geometry to relate points on the occluding contour to points on the three-dimensional shape model. Using a physics-based modeling approach, we transform this correspondence, in addition to two-dimensional forces arising from the discrepancy between the model's occluding contour and the image data, into generalized forces that are applied to the model to estimate the model's translational and rotational degrees of freedom. To improve the tracking results further, the dynamic system is embedded within an extended Kalman filter framework, and we use the predicted motion of the model at each frame to establish point correspondences between occluding contours and the three-dimensional model.

We used this two-step approach to track the motion of the subject's arms performing the ASL gestures, as shown in Section 7. The output of the system is a set of rotation and translation parameters that we use as input to the HMMs and the vision-based segmentation algorithm presented in the following sections.

4 Hidden Markov Models

Hidden Markov Models (HMMs) are a type of statistical model. They have been used successfully in speech recognition, and recently in handwriting, gesture, and sign language recognition. We now give a summary of the basic theory behind HMMs, which is covered in detail in [15].

4.1 Definition of HMMs

An HMM $\lambda$ consists of a number $N$ of states $S_1, S_2, \ldots, S_N$, together with transitions between states. The system is in one of the HMM's states at any given time. At regularly spaced discrete time intervals, the system takes an outgoing transition from its current state to a new state. Each transition from $S_i$ to $S_j$ has an associated probability $a_{ij}$ of being taken. Hence, $\sum_{j=1}^{N} a_{ij} = 1$. Each state $S_i$ also has an initial probability $\pi_i$ of the system starting in $S_i$. In addition, each state $S_i$ generates output $O \in \Omega$, which is distributed according to a probability distribution function $b_i(O) = P\{\text{Output is } O \mid \text{System is in } S_i\}$.

An example is given in Figure 1. The model depicted there is also an example of a left-right model; that is, $a_{ij} > 0$ implies $j \ge i$. In other words, transitions only flow forward from lower states to the same state or higher states, but never backward. This topology is the most commonly used one for modeling processes over time.

Figure 1: Example left-right HMM with its transition and output probabilities. Left-right means that transitions occur only from left to right, and never backward.

4.2 The Three Fundamental HMM Problems

There are three fundamental problems in HMM theory:

(1) For a sequence of observations $O = O_1 O_2 \ldots O_T$, $O_t \in \Omega$, compute the probability $P(O \mid \lambda)$ that an HMM $\lambda$ generated $O$.
(2) For some $O$ and an HMM $\lambda$, recover the most likely state sequence $q_1 q_2 \ldots q_T$ that generated $O$.
(3) Adjust the parameters of an HMM $\lambda$ such that they maximize $P(O \mid \lambda)$ for some $O$.
The first problem corresponds to maximum likelihood recognition of an unknown data sequence with a set of HMMs, each of which corresponds to a sign. For each HMM, the probability $P(O \mid \lambda)$ is computed that it generated the unknown sequence, and then the HMM with the highest probability is selected as the recognized sign. For computing $P(O \mid \lambda)$, let $Q = q_1 q_2 \ldots q_T$ be a state sequence in $\lambda$:

$$\alpha_t(i) = P(O_1 O_2 \ldots O_t,\; q_t = S_i \mid \lambda), \quad 1 \le i \le N \tag{1}$$
$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{2}$$
$$\alpha_1(i) = \pi_i\, b_i(O_1) \tag{3}$$
$$\alpha_{t+1}(j) = b_j(O_{t+1}) \sum_{i=1}^{N} \alpha_t(i)\, a_{ij}, \quad 1 \le t \le T-1 \tag{4}$$

These equations assume that the $O_t$ are independent, and they make the Markov assumption that a transition depends only on the current state, a fundamental limitation of HMMs. This method is called the forward-backward algorithm and computes $P(O \mid \lambda)$ in $O(N^2 T)$ time.

The second problem corresponds to finding the most likely path $Q$ through an HMM $\lambda$, given an observation sequence $O$, and is equivalent to maximizing $P(Q, O \mid \lambda)$. Let

$$\delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P(q_1 q_2 \ldots q_t = S_i,\; O_1 O_2 \ldots O_t \mid \lambda) \tag{5}$$
$$\delta_{t+1}(j) = b_j(O_{t+1}) \max_{1 \le i \le N} \{\delta_t(i)\, a_{ij}\} \tag{6}$$
$$\max_{Q} P(Q, O \mid \lambda) = \max_{1 \le i \le N} \{\delta_T(i)\} \tag{7}$$

$\delta_t(i)$ corresponds to the maximum probability of all state sequences that end up in $S_i$ at time $t$. Equations 6 and 7 follow from Equation 5 by induction on $t$. The Viterbi algorithm is a dynamic programming algorithm that, using Equation 7, computes both the maximum probability $P(Q, O \mid \lambda)$ and the state sequence $Q$ in $O(N^2 T)$ time.

The recovery of the state sequence makes the Viterbi algorithm invaluable for continuous recognition, since it bypasses the difficult problem of segmenting the utterances into their individual parts. Instead, a sequence of HMMs corresponding to individual signs is concatenated into a network, as schematically depicted in Figure 2. Thus, the most likely state sequence recovers the sequence of signs.

The Viterbi algorithm also has the property that it can be optimized with the beam-searching algorithm.
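The forward recursion (Equations 1 to 4) and the Viterbi recursion (Equations 5 to 7) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the output probabilities $b_i(O_t)$ are assumed to be precomputed into a table `b[t][i]`, and plain probabilities are used instead of the log probabilities a practical system would need to avoid underflow.

```python
def forward_likelihood(pi, a, b):
    """Return P(O | lambda) using the forward recursion.

    pi[i]   : initial probability of state i
    a[i][j] : transition probability from state i to state j
    b[t][i] : output probability b_i(O_t), precomputed for every frame t
    """
    n, num_frames = len(pi), len(b)
    alpha = [pi[i] * b[0][i] for i in range(n)]               # Equation 3
    for t in range(1, num_frames):                            # Equation 4
        alpha = [b[t][j] * sum(alpha[i] * a[i][j] for i in range(n))
                 for j in range(n)]
    return sum(alpha)                                         # Equation 2


def viterbi(pi, a, b):
    """Return (max_Q P(Q, O | lambda), most likely state sequence)."""
    n, num_frames = len(pi), len(b)
    delta = [pi[i] * b[0][i] for i in range(n)]
    backpointers = []
    for t in range(1, num_frames):
        step, new_delta = [], []
        for j in range(n):
            # Best predecessor of state j (the max in Equation 6).
            best_i = max(range(n), key=lambda i: delta[i] * a[i][j])
            step.append(best_i)
            new_delta.append(b[t][j] * delta[best_i] * a[best_i][j])
        backpointers.append(step)
        delta = new_delta
    last = max(range(n), key=lambda i: delta[i])              # Equation 7
    path = [last]
    for step in reversed(backpointers):                       # backtrack
        path.append(step[path[-1]])
    path.reverse()
    return delta[last], path


# Toy two-state left-right model with three frames (made-up numbers).
pi = [1.0, 0.0]
a = [[0.5, 0.5], [0.0, 1.0]]
b = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
likelihood = forward_likelihood(pi, a, b)
best_prob, best_path = viterbi(pi, a, b)
```

For isolated recognition, `forward_likelihood` would be evaluated once per sign model and the model with the highest value selected. For continuous recognition, `viterbi` would be run over the concatenated network of sign models.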
While updating $\delta_{t+1}(j)$, beam searching considers only those states in the HMM network for which $\delta_t(i)$ is above a threshold value. The assumption is that if the probability of a partial path through the network becomes too low, it cannot contribute to the most likely path. Beam searching is essential for making large-scale applications tractable.

Figure 2: Concatenation of HMMs into a network (initial state, HMM 1, HMM 2, ..., HMM n, final state).

The third problem corresponds to training the HMMs with data, such that they are able to recognize previously unseen data correctly after the training phase. There exists no analytical solution for maximizing $P(O \mid \lambda)$ for given observation sequences, but an iterative procedure, called the Baum-Welch procedure, maximizes $P(O \mid \lambda)$ locally. In the case of continuous density output probabilities, the reestimation process works as follows. Define $b_j(O)$ as

$$b_j(O) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(O, \mu_{jm}, \Sigma_{jm}),$$

where $M$ describes the number of mixtures, $j$ is the state number, $c_{jm}$ describes the weight of mixture $m$ in state $j$, and $\mathcal{N}$ is a Gaussian density with mean $\mu_{jm}$ and covariance matrix $\Sigma_{jm}$. Define the backward variable $\beta$ as

$$\beta_t(i) = P(O_{t+1} O_{t+2} \ldots O_T \mid q_t = S_i, \lambda) \tag{8}$$
$$\beta_T(i) = 1 \tag{9}$$
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j) \tag{10}$$
$$1 \le i \le N, \quad 1 \le t \le T-1 \tag{11}$$

Furthermore, define $\xi$ and $\gamma$ as

$$\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)} \tag{12}$$
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j) \tag{13}$$

$\sum_t \xi_t(i, j)$ can be interpreted as the expected number of transitions from $S_i$ to $S_j$; likewise $\sum_t \gamma_t(i)$ can be interpreted as the expected number of transitions taken from $S_i$. With these interpretations, the reestimation formulae for the transition and output probabilities are

$$\bar{\pi}_i = \gamma_1(i) \tag{14}$$
$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{15}$$
$$\bar{c}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j, m)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j, k)} \tag{16}$$
$$\bar{\mu}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j, m)\, O_t}{\sum_{t=1}^{T} \gamma_t(j, m)} \tag{17}$$
$$\bar{\Sigma}_{jm} = \frac{\sum_{t=1}^{T} \gamma_t(j, m)\, (O_t - \mu_{jm})(O_t - \mu_{jm})^{\top}}{\sum_{t=1}^{T} \gamma_t(j, m)} \tag{18}$$

Here $\gamma_t(j, m)$ denotes the probability of being in state $S_j$ at time $t$ with mixture component $m$ accounting for $O_t$; with a single mixture it reduces to $\gamma_t(j)$. Repeated use of this procedure converges to a local maximum of the probability [15], typically after 5-10 iterations.

5 Use of HMMs for ASL Recognition

In the previous sections we reviewed the extraction of three-dimensional features from computer vision and the HMM theory. We now discuss how they fit in the framework of ASL recognition.
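Returning briefly to the reestimation procedure of Section 4.2, the backward variable (Equations 8 to 11) and the transition update (Equations 12, 13, and 15) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function name is hypothetical, the output probabilities are held fixed in a precomputed table `b[t][i]`, only the transition probabilities are updated, and no scaling against numerical underflow is applied.

```python
def reestimate_transitions(pi, a, b):
    """One Baum-Welch update of the transition probabilities.

    pi[i]   : initial probability of state i
    a[i][j] : current transition probability from state i to state j
    b[t][i] : output probability b_i(O_t), held fixed in this sketch
    Returns (new transition matrix, P(O | lambda) under the old parameters).
    """
    n, num_frames = len(pi), len(b)

    # Forward variable alpha_t(i), Equations 1 to 4.
    alpha = [[pi[i] * b[0][i] for i in range(n)]]
    for t in range(1, num_frames):
        prev = alpha[-1]
        alpha.append([b[t][j] * sum(prev[i] * a[i][j] for i in range(n))
                      for j in range(n)])
    likelihood = sum(alpha[-1])

    # Backward variable beta_t(i), Equations 8 to 11.
    beta = [[1.0] * n for _ in range(num_frames)]
    for t in range(num_frames - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(a[i][j] * b[t + 1][j] * beta[t + 1][j]
                             for j in range(n))

    # Expected transition counts: sums over t of xi_t(i, j) and gamma_t(i).
    xi_sum = [[0.0] * n for _ in range(n)]
    gamma_sum = [0.0] * n
    for t in range(num_frames - 1):
        for i in range(n):
            for j in range(n):
                xi = (alpha[t][i] * a[i][j] * b[t + 1][j] * beta[t + 1][j]
                      / likelihood)                            # Equation 12
                xi_sum[i][j] += xi
                gamma_sum[i] += xi                             # Equation 13

    # Equation 15; states that were never visited keep their old row.
    new_a = [[xi_sum[i][j] / gamma_sum[i] if gamma_sum[i] > 0.0 else a[i][j]
              for j in range(n)] for i in range(n)]
    return new_a, likelihood


# Toy two-state left-right model with three frames (made-up numbers).
pi = [1.0, 0.0]
a = [[0.5, 0.5], [0.0, 1.0]]
b = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
new_a, old_likelihood = reestimate_transitions(pi, a, b)
```

Transitions that start at zero stay at zero, so a left-right topology is preserved by the update.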
HMMs are an attractive choice for processing three-dimensional sign data, because their state-based nature enables them to describe how a sign changes over time and to capture variations in the duration of signs, by remaining in a state for several time frames.

There are two ways to approach the recognition problem that pose very different research problems. Isolated recognition attempts to recognize one single sign at a time. Hence, it is based on the assumption that each sign can be individually extracted and then individually recognized. Continuous recognition, on the other hand, attempts to recognize an entire stream of signs, without any artificial pauses or any other form of marked boundaries between the individual signs. Clearly, continuous recognition is desirable for the most natural interaction possible between humans and machines, but it is also much more difficult to tackle than isolated recognition. The next two subsections discuss each of the two approaches in detail.

5.1 Isolated Recognition

Isolated sign recognition assumes that each sign can be extracted individually. This requires clearly marked boundaries between signs. Such a boundary could simply be silence, that is, a brief resting phase after each sign, during which the signer performs no movements. Silence is easily detected through an analysis of the global variance over the hand movements.

Once there are clearly marked boundaries between signs, HMM recognition is comparatively straightforward. The recognition process extracts the signal corresponding to each sign individually. It then picks the HMM that yields the maximum likelihood for that signal as the recognized sign.

Training the HMMs to maximize recognition performance is also comparatively straightforward. Initially, all signs in the training set are labeled. For each sign in the dictionary, the training procedure then computes the mean and covariance matrix over the data available for that sign and assigns them uniformly as the initial output probabilities to all states in the corresponding HMM. It also assigns initial transition probabilities uniformly to the HMM's states. Unlike the initial output probabilities, initial transition probabilities do not greatly influence the performance of the fully trained HMMs.

The training procedure then runs the Viterbi algorithm repeatedly on the training samples, so as to align the training data along the HMM's states. The aligned data are then used to estimate better output probabilities for each state individually. This realignment yields major improvements in recognition performance, because it increases the chances of the Baum-Welch reestimation algorithm converging to an optimal or a near-optimal maximum. After constructing these bootstrapped HMMs, the training procedure finishes by reestimating each HMM in turn with the Baum-Welch reestimation algorithm outlined in Section 4.2.

By far the most challenging problem in isolated recognition is extracting a feature vector that optimizes recognition performance. Even after obtaining accurate three-dimensional data from our computer vision method described in Section 3, we found that the features used for recognition and the way that they are represented greatly influence recognition performance. The experimental results given in Section 8.1 demonstrate how the feature vector affects performance.

There are several reasons why performance is so sensitive to the choice of feature vector. First, some features carry more information than others; for example, three-dimensional features are more reliable than two-dimensional ones. Second, some features are more invariant to changes in orientation and position than others; for example, polar coordinates are more invariant to rotations than Cartesian coordinates [1]. Third, the statistical properties of some features change, depending on the duration of a sign.
For this reason, the positions of the hands in three-dimensional space perform better than the velocities of the hands (see also Section 8.2). Fourth, the statistical distribution of the features during the course of a sign seems to play a role. For some features, their distribution fits Gaussian densities naturally, whereas for others it does not. If the latter explanation holds true, we should see a major improvement in recognition performance from using multiple Gaussian mixtures as the output probabilities for HMMs, instead of using just one single Gaussian density. However, we did not experiment with multiple mixtures because of the lack of sufficient training data.

The number of states and the topology used for the HMMs are also important. Sign language as a time-varying process lends itself naturally to a left-right model topology. Finding the optimum number of states, which depends on the frame rate and on the complexity of the signs involved, is an empirical process. We used the same model topology for all signs, and determined experimentally that for our task the model depicted in Figure 3 was sufficient. The output probabilities were single Gaussian densities with diagonal covariance matrices, because we had insufficient training data for multiple mixtures.

Figure 3: Left-right HMM topology for isolated ASL recognition.

5.2 Continuous Recognition

Continuous sign recognition, on the other hand, is much harder than isolated sign recognition. There is no silence between the signs, so the straightforward method of using silence to distinguish boundaries fails. Here HMMs offer the compelling advantage of being able to segment the streams of signs automatically with the Viterbi algorithm. Coarticulation effects further complicate continuous recognition.
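As a concrete illustration of the model setup described in Section 5.1, the sketch below builds a left-right transition matrix with uniform initial transition probabilities and evaluates a single Gaussian output density with diagonal covariance. It is an assumption-laden example rather than the authors' configuration: the function names and the `max_jump` parameter are hypothetical, and the exact arcs of the topology in Figure 3 are not reproduced here.

```python
import math


def left_right_transitions(num_states, max_jump=1):
    """Uniform initial transition probabilities for a left-right model.

    From state i the model may stay in i or move forward by at most
    max_jump states; backward transitions get probability zero.
    """
    a = []
    for i in range(num_states):
        targets = range(i, min(i + max_jump, num_states - 1) + 1)
        row = [0.0] * num_states
        for j in targets:
            row[j] = 1.0 / len(targets)
        a.append(row)
    return a


def diag_gaussian_log_density(x, mean, var):
    """Log of a Gaussian density with diagonal covariance.

    x, mean, var are equal-length sequences; var holds the diagonal
    entries of the covariance matrix (all assumed positive).
    """
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


transitions = left_right_transitions(4)
log_density = diag_gaussian_log_density([0.0], [0.0], [1.0])
```

With a diagonal covariance, each state needs only one mean and one variance per feature dimension, which keeps the number of parameters small when training data are scarce.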
We now discuss coarticulation effects in detail, before we describe the techniques needed to train HMMs for continuous recognition.

5.2.1 The Coarticulation Problem

Coarticulation means that the pronunciation of a sign is influenced by the preceding and following signs. One of the most visible effects of coarticulation in ASL is that a wide range of movements are inserted between signs. For example, the sign for FATHER is performed by repeatedly tapping the forehead, and the sign for READ is performed in neutral space in front of the chest. If these two signs are performed in succession, an extra movement from the forehead to neutral space appears (Figure 4). This phenomenon is called movement epenthesis [5]. We discuss its implications for ASL recognition more thoroughly in [20].

Figure 4: Movement epenthesis. The arrow in the middle picture indicates an extra movement between the signs for FATHER and READ that is not present in their lexical forms.

Speech recognizers handle coarticulation by training phoneme context-dependent HMMs. They train a separate model for each possible combination of three phonemes in sequence that could occur during natural speech. In principle, the same idea applies to sign language recognition, and we performed some experiments to verify its applicability (see Section 8.3).

A possible way to train context-dependent models for ASL recognition is to use whole signs as the phonological unit in ASL.² Thus, triphone context-dependent models from speech recognition correspond to tri-sign context-dependent models in ASL recognition. In other words, a separate model is trained for each combination of three signs in sequence. The first and the third sign in the sequence form the context for the middle sign, with which the model is associated.

Tri-sign context-dependent modeling, however, is prohibitively expensive, because it requires $O(V^3)$ models overall, where $V$ is the vocabulary size. Collecting the large amount of training data necessary to obtain reliable estimates for the models is intractable even for small vocabulary sizes. This intractability is a negative consequence of using whole signs as the phonological unit. Unlike for speech recognition, which has to handle only approximately 40 classes of allophones, there is no upper bound on the number of models required for ASL recognition with whole signs as the smallest unit.

Therefore, we used only bi-sign context-dependent models, which require a model for every possible combination of two signs. The model is associated with the second sign, and the first sign forms its preceding context. Bi-sign context-dependent modeling requires $O(V^2)$ models. Although this complexity is an improvement over $O(V^3)$, it is still too large for anything but a small vocabulary.

Speech recognizers reduce the number of models required by using the observation that many contexts are very similar.
Therefore, they tie the parameters of the models corresponding to similar contexts, such that the transition and output probabilities are shared between these models. This technique significantly reduces the number of distinct models.

Parameter tying is also applicable to ASL recognition, but it is not as effective as for speech recognition. The main reason for the reduced effectiveness is that movement epenthesis inserts many movements unrelated to the signs' lexical forms. The implication is that context-dependent models will work well only with prohibitively large amounts of training data.

In fact, it is questionable whether context-dependent modeling is a good solution to the coarticulation problem in ASL recognition at all. Movement epenthesis is a phonological process in ASL and should be treated as such; that is, the movements induced by epenthesis are separate phonemes. Using context-dependent models to capture them is implausible from a phonological point of view. It seems to make more sense to model the movements explicitly. We follow up on this idea in [20] and show that it leads to better recognition performance.

² This assumption is not correct: Whole signs are not the smallest unit in ASL phonology, but this topic is beyond the scope of this paper.

5.2.2 The Training Procedure

A sign in our data collected at natural signing speeds was between 10 and 45 frames long, not counting the frames needed for the transition between signs. Because of the movements between signs, the HMM topology must be more flexible than the one described for isolated recognition in Section 5.1. These considerations led us to use the left-right model shown in Figure 5.

Figure 5: Topology of the context-dependent model. The arcs that skip states allow the modeling of variabilities in the duration of different signs.

As for isolated recognition, we determined the optimal number of states experimentally.
For the output probabilities, we chose a single Gaussian density with diagonal covariance, as we had insufficient training data for estimating full-rank covariance matrices.

Training continuous recognition models is much harder than training isolated recognition models, because it is difficult to obtain good initial estimates of the HMM parameters. Viterbi realignment (see Section 5.1) works only if the training data are accurately labeled, including the boundaries between the individual signs. Obtaining these boundaries is very difficult and time-consuming; even humans have trouble determining where a sign ends and the next one starts.

The alternative to using Viterbi realignment is using a flat-start scheme. It consists of computing the global mean and covariance matrix over the entire training data set and assigning these as the initial output probabilities to the HMMs. We used this scheme to initialize the HMMs.

We then used embedded training [24] to reestimate the HMMs. Each iteration of this procedure concatenates the HMMs corresponding to the individual signs in a training sentence into a single large HMM. It then reestimates the parameters of the large HMM with a single iteration of the Baum-Welch algorithm described in Section 4.2, as usual. The reestimated parameters, however, are not immediately applied to the individual HMMs. Instead, they are pooled in accumulators, and applied to the individual HMMs only after the training procedure has iterated over all sentences in the training set. Hence, embedded training effectively trains all models in parallel with the entire training set. It yields better parameter estimates than training the HMMs independently [24].

In the case of context-independent models, using the flat-start scheme followed by several embedded training runs is all that is necessary to train HMMs for recognition. Context-dependent models are more difficult to train than context-independent models, because the training involves two extra steps. These consist of generating the context-dependent models, and tying the parameters of HMMs with similar contexts (see also Section 5.2.1).

The first extra step, which consists of generating the context-dependent models, requires care, because for context-dependent models there exist far fewer training examples per model than for context-independent models. In this case, embedded training is likely to yield the best parameter estimates for context-dependent models if they have already been initialized with better values than the global mean and covariance matrix from the flat-start scheme. Therefore, we ran several embedded training runs on the context-independent models and then generated context-dependent models with the same parameters as the context-independent models. It is vital to avoid overtraining the context-independent models by keeping the number of initial training passes low. The probabilities should not have fully converged yet. Otherwise, using context-dependent models actually decreases recognition performance.

The second extra step, which consists of tying the parameters, is also vital to the context-dependent models' performance, especially because of our relative lack of training data. Tying parameters reduces the number of models, as signs with similar contexts then share a common model. As a result, more training data per model becomes available.
Unfortunately, parameter tying is a highly empirical process. Our experiments indicated that tying the transition probabilities properly had the greatest influence on recognition results. We used the ending locations of the signs in the preceding context to decide on the tying. For example, the signs for BROTHER and SISTER end in the same location. As a result, the two models for a sign occurring after the signs for BROTHER or SISTER, such as LIKE, can share the same transition probabilities. We also used the ending locations to decide on tying the output probabilities. For our data set, the tying process reduced the number of models to less than one sixth of their original number.

6 Coupling of Vision and HMMs

In the preceding section we reviewed how HMMs can be used for ASL recognition. The use of HMMs alone, however, imposes some limitations, one of which is insufficiency of training data, especially while training context-dependent models. Furthermore, the probability theory assumptions underlying the HMM theory, as described in Section 4.2, are often not valid: Successive observations are often not independent, the transition from one state to the next often depends not only on the current state, but also on the state history, and the distribution of observations does not necessarily resemble a normal density.

Another problem is that the HMM theory does not provide for any dynamic weighting of features depending on a sign's context. For example, the invariant features for some signs, such as I, are the endpoints of their movements with respect to a body part, and the movements are unimportant. For other signs, only the movements are invariant. The parts of the feature set that should be examined and ignored for each class of signs are mutually exclusive.

To alleviate these limitations, we investigated the coupling of the HMM recognition process with an independent computer vision-based motion analysis that temporally segments the signal and extracts its geometric properties.
The idea is that a sign can be described in terms of one or more geometric primitives, such as hand movements along a line, in a plane, or a circle. This idea is supported by the existence of transcription systems, such as HamNoSys [14], that base the description of the movements on geometric primitives. The presence of three-dimensional information is crucial for the coupling to work. In the past, geometric fitting of planes has already been used for rough segmentation [12], but not for providing additional information about the nature of the fits to the HMM recognition process.

6.1 Segmentation of the Signal

To extract the geometric properties of the continuous signal estimated with our computer vision methods, it must first be segmented temporally into its parts. Any change of the type of arm movement is likely to be accompanied by a dip in the velocity. Thus, minima in the absolute values of the velocity vector provide strong hints at segmentation boundaries. However, there are typically many more velocity minima than segmentation boundaries. Thus, the segmentation process must provide facilities to merge adjacent segments.

After performing initial segmentation based on velocities, our algorithm attempts to fit geometric primitives to the individual segments. These currently consist of lines, planes, and holds³ at a position in space.

³ A hold is a short period of time during which no hand movements take place.
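The initial velocity-based segmentation can be sketched as follows. This is a minimal illustration, assuming positions sampled at a fixed interval `dt` and a plain finite-difference velocity estimate; the function names are hypothetical, and no smoothing or merging is performed.

```python
import math


def speeds(positions, dt):
    """Magnitude of the finite-difference velocity between consecutive samples."""
    return [math.dist(a, b) / dt for a, b in zip(positions, positions[1:])]


def candidate_boundaries(speed):
    """Indices of local minima in the speed signal.

    A plateau of equal values is reported once, at its first sample.
    """
    return [
        i
        for i in range(1, len(speed) - 1)
        if speed[i] < speed[i - 1] and speed[i] <= speed[i + 1]
    ]


# Hypothetical speed profile with two dips.
profile = [1.0, 0.2, 0.9, 1.1, 0.3, 0.8]
boundaries = candidate_boundaries(profile)
```

Every index returned is only a candidate boundary. Because there are typically more minima than true boundaries, a later merge step still has to decide which of them to keep.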

The fit of a hold is determined by computing the covariance matrix over the segment's position data. If there is little movement, the eigenvalues of the matrix in every direction are small, and consequently its trace is small.

The least-squares fit of a line is governed by

$$\sum_i d_i^2 = \sum_i \left\| p_i - (p_i \cdot u)\, u \right\|^2 \qquad (19)$$

where $d_i$ is the distance of $p_i$ to the line, and $u$ is the line's unit direction vector. Let $P$ be a matrix containing the points $p_i$ in the segment as its row vectors. Minimizing Equation 19 with respect to $u$ corresponds to maximizing $u^T P^T P u$. By Rayleigh's principle, the maximal-eigenvalue eigenvector of $P^T P$ maximizes this expression, which is equivalent to the maximal-eigenvalue eigenvector of the points' covariance matrix. This eigenvector is the line's direction vector. The other two eigenvalues indicate the goodness of fit: the smaller they are with respect to the largest eigenvalue, the better the fit.

The least-squares fit of a plane is governed by

$$\sum_i d_i^2 = \sum_i \left| p_i \cdot n \right|^2 \qquad (20)$$

where $d_i$ is the distance of $p_i$ to the plane, and $n$ is the plane's unit normal vector. If $P$ is a matrix containing the points $p_i$ as its row vectors, the minimal-eigenvalue eigenvector of $P^T P$ minimizes Equation 20 with respect to $n$. Hence, minimizing this equation is equivalent to finding the minimal-eigenvalue eigenvector of the points' covariance matrix. The other two eigenvalues indicate the goodness of fit: the larger they are with respect to the smallest eigenvalue, the better the fit.

Using least-squares fitting is based on the assumption that the signal noise term is captured by a normal distribution. If this assumption is not valid, the least-squares estimator is likely to yield poor results, because of its sensitivity to outliers. On the other hand, in three-dimensional space, the least-squares estimator is much easier to compute than more robust estimators.
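The three fits above reduce to an eigen-analysis of the segment's covariance matrix, which can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: the eigenvectors are found by plain power iteration (a stand-in for whatever eigen-solver was actually used), the hold threshold is a free parameter, and all function names are hypothetical.

```python
import math


def covariance(points):
    """3x3 covariance matrix of a list of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
            for i in range(3)]


def trace(m):
    return m[0][0] + m[1][1] + m[2][2]


def dominant_eigenpair(m, iterations=200):
    """Power iteration on a symmetric positive semidefinite 3x3 matrix."""
    v = [1.0, 0.7, 0.3]  # arbitrary start vector (an assumption of this sketch)
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v  # zero matrix: every direction is an eigenvector
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]
    return sum(v[i] * mv[i] for i in range(3)), v


def is_hold(points, threshold):
    """A hold shows little movement, so the covariance trace is small."""
    return trace(covariance(points)) < threshold


def fit_line(points):
    """Return (direction, residual); residual is the sum of the two
    smaller eigenvalues, so a small value means a good line fit."""
    c = covariance(points)
    largest, direction = dominant_eigenpair(c)
    return direction, trace(c) - largest


def fit_plane(points):
    """Return (normal, smallest eigenvalue); a small value means a good fit.

    The minimal-eigenvalue eigenvector of C is the dominant eigenvector
    of trace(C) * I - C, which is again positive semidefinite.
    """
    c = covariance(points)
    t = trace(c)
    shifted = [[(t if i == j else 0.0) - c[i][j] for j in range(3)]
               for i in range(3)]
    shifted_largest, normal = dominant_eigenpair(shifted)
    return normal, t - shifted_largest
```

For points sampled along a straight line, `fit_line` returns a residual close to zero, and for points lying in the plane z = 0, `fit_plane` returns a normal close to (0, 0, 1). Power iteration converges slowly when two eigenvalues are nearly equal, so a library eigen-solver would be the safer choice in practice.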
It would be interesting to compare the least-squares estimator's performance on temporal segmentation to the performance of robust regression estimators [13], such as the least median of squares estimator [2, 11], or the repeated median estimator [1, ].

After the initial fit, the algorithm pools the primitives into a directed acyclic graph (DAG), schematically depicted in Figure 6. Note that the individual segments are not mutually exclusive; for example, data can fit both a line and a plane. If the algorithm fails to fit any geometric primitives to some segment, it inserts the segment into the DAG as a wild card, which is defined conservatively to match any kind of geometric primitive. It then tries to merge adjacent segments if they are compatible, in order to eliminate spurious segmentation boundaries. We defined adjacent segments to be compatible for a merge if they shared the same type of geometric primitive in similar orientations, and if the merged segment still fit the same type of geometric primitive as its constituting segments. In addition, we considered a wild card to be compatible with another geometric primitive if this primitive also fit the merged segment.

Figure 6: Geometric primitives pooled into a DAG (arc labels in the figure: Hold, Line, Line, Plane). Circles denote segmentation boundaries. Dotted arcs denote possible null transitions; they are necessary to compensate for spurious segments. Sometimes data can fit multiple geometric primitives; in this DAG the data of the first two segments fit both a hold followed by a line, and a simple line.

The DAG now gives all possible segment sequences that are a valid representation of the signal. If a sequence is to be valid, it must be obtainable by tracing a path through the DAG from the leftmost segmentation boundary to the rightmost segmentation boundary. In the example given in Figure 6 the sequences "Hold, Line, Plane" and "Line, Plane" would both be valid sequences, but "Plane, Plane" would not, because the latter does not lie on any path through that DAG.
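The validity test on a sequence of primitives amounts to tracing labeled arcs from the leftmost to the rightmost boundary. The sketch below assumes the DAG is stored as a list of `(from_boundary, to_boundary, primitive)` arcs; the arc layout of `FIGURE_DAG` is a reading of the figure caption rather than data from the experiments, null transitions are left out, and the names are hypothetical.

```python
WILD = "*"  # wild card: matches any kind of geometric primitive


def is_valid_sequence(arcs, start, end, sequence):
    """True if `sequence` labels some path from `start` to `end`."""
    frontier = {start}
    for primitive in sequence:
        frontier = {
            dst
            for src, dst, label in arcs
            if src in frontier and label in (primitive, WILD)
        }
        if not frontier:
            return False
    return end in frontier


# The first two segments fit a hold followed by a line, or one single line.
FIGURE_DAG = [
    (0, 1, "Hold"),
    (1, 2, "Line"),
    (0, 2, "Line"),
    (2, 3, "Plane"),
]

ok_long = is_valid_sequence(FIGURE_DAG, 0, 3, ["Hold", "Line", "Plane"])
ok_short = is_valid_sequence(FIGURE_DAG, 0, 3, ["Line", "Plane"])
rejected = is_valid_sequence(FIGURE_DAG, 0, 3, ["Plane", "Plane"])
```

Tracking a set of reachable boundaries instead of enumerating paths keeps the check linear in the number of arcs per primitive, even when many segments fit more than one primitive.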
This discussion has so far ignored the possibility of spurious segments arising from the vision analysis. That is, the analysis might recognize a segment that should be part of another, but the merge process fails to merge it into another segment. The main reason for the existence of spurious segments is undersampling. If a segment consists of very few samples, it is often impossible to extract reliable information from it. Our algorithm attempts to solve this problem by adding arcs to the DAG from each segmentation boundary to the next (represented by the dotted arcs in Figure 6). Thus, a path through the DAG can optionally skip these spurious segments.

6.2 Using the Motion Analysis with HMMs

Each sign in the vocabulary has one or more associated templates that comprise the sign's geometric primitives with weights of each feature's relative importance. These


primitives are matched against those in the DAG. Assuming that the segmentation process yields correct results, the following must be true: If a sequence of signs is represented by the input signal, the sequence of geometric primitives corresponding to the signs must form a path through the DAG. We call such a sequence of signs valid with respect to the computer vision DAG.

This observation suggests an application of the motion analysis as a backup check for the HMM framework. First recognize a candidate sentence from the input signal via the Viterbi algorithm. Then generate all possible sequences of geometric primitives corresponding to the recognized signs and construct another DAG from them. Using dynamic programming, match the two DAGs against each other. If the two DAGs share a common path, accept the candidate sentence as correct. Otherwise, reject the candidate sentence as incorrect.

The justification for this algorithm comes from the following properties of the DAGs: If the two DAGs share a common path, there is a sequence of geometric primitives that forms a path through the computer vision DAG. Furthermore, this sequence of geometric primitives is one of the possible sequences generated from the candidate sentence. Thus, the candidate sentence is valid with respect to the computer vision DAG. Conversely, if no such common path exists, none of the sequences of geometric primitives generated from the candidate sentence forms a path through the computer vision DAG. Thus, the candidate sentence is not valid with respect to the computer vision DAG and should be rejected.

6.3 Discussion of the Coupling

The HMM recognition algorithm and the vision matching algorithm complement each other. The advantages of the HMM recognition method are automatic segmentation during both training and recognition, and a fully formalized training procedure. The disadvantages are poor performance in the presence of insufficient training data, no formal way to weight features dynamically, and possible violations of the stochastic independence assumptions.
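The backup check of Section 6.2 can be sketched as a search for a common path through the two DAGs. The code below is an illustration under stated assumptions: a breadth-first search over pairs of nodes with a visited set stands in for the dynamic programming step, feature weights are ignored, every template is assumed to be non-empty, and the data layout and function names are hypothetical.

```python
from collections import deque

WILD = "*"   # wild card arc in the vision DAG: matches any primitive
NULL = None  # null transition in the vision DAG: skips a spurious segment


def candidate_dag(templates_per_sign):
    """Chain the alternative primitive templates of each recognized sign.

    Node k is the boundary after the k-th sign; nodes inside a template
    receive fresh tuple identifiers.
    """
    arcs = []
    for k, templates in enumerate(templates_per_sign):
        for t, template in enumerate(templates):
            prev = k
            for p, primitive in enumerate(template):
                is_last = p == len(template) - 1
                nxt = k + 1 if is_last else (k, t, p)
                arcs.append((prev, nxt, primitive))
                prev = nxt
    return arcs, 0, len(templates_per_sign)


def share_common_path(vision, candidate):
    """True if some start-to-end path exists in both DAGs at once."""
    v_arcs, v_start, v_end = vision
    c_arcs, c_start, c_end = candidate
    seen = {(v_start, c_start)}
    queue = deque(seen)
    while queue:
        a, b = queue.popleft()
        if a == v_end and b == c_end:
            return True
        successors = []
        for src, dst, label in v_arcs:
            if src != a:
                continue
            if label is NULL:
                successors.append((dst, b))  # skip one vision segment
            else:
                successors.extend(
                    (dst, d)
                    for s, d, lab in c_arcs
                    if s == b and (label == WILD or lab == label)
                )
        for state in successors:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False


# Hypothetical vision DAG with a null transition over every segment.
vision = (
    [(0, 1, "Hold"), (1, 2, "Line"), (0, 2, "Line"), (2, 3, "Plane"),
     (0, 1, NULL), (1, 2, NULL), (2, 3, NULL)],
    0,
    3,
)
accepted = share_common_path(vision, candidate_dag([[["Line"]], [["Plane"]]]))
rejected = share_common_path(vision, candidate_dag([[["Plane"]], [["Plane"]]]))
```

A candidate sentence whose signs are described by a line followed by a plane is accepted here, whereas one requiring two planes is rejected even though null transitions let the search skip segments.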
The advantages of the vision matching method are the possibility of weighting the relative importance of features dynamically, and independence from insufficient training data. A significant disadvantage is that estimating the geometric properties of the signs in the vocabulary requires manual labeling and analysis of the data. Furthermore, segmentation must be done explicitly, which raises the possibility of spurious segments, as described in Section 6.1, or the possibility of missing segments. Coarticulation sometimes also changes the geometric properties of the signal, such that the templates for the correct sequence of signs no longer match the actual signal. Coping with the changes in the geometric properties is an important task for future research.

7 Data Collection

For our experiments we collected data, using both our computer vision system, and an Ascension Technologies Flock of Birds. The reason for using the latter was that it is faster at this point than the computer vision system, and hence more suitable for prototyping. The computer vision system yields the rotation and translation of each segment of the arm, as described in Section 3. Figure 7 gives an example of the computer vision tracking process. The images show the high accuracy of the computer vision system; in fact, it is comparable to the accuracy achieved by the Flock of Birds system.

The Flock of Birds system consists of a magnet and six sensors that detect their rotation and translation with respect to the magnet at 25 frames per second. We used the data from both systems interchangeably with a simple alignment of coordinate systems. The coordinate system was right-handed, with the origin at the base of the signer's spine and the axis facing up.

Figure 7: Fitting the three-dimensional models to the signer's arms. From top to bottom, the signs for FATHER, I, and MAIL are displayed. From left to right, the front, side, and top views are displayed.

We used the 53-sign vocabulary listed in Table 1.
Their pronunciations followed the ASL dialect used in the Philadelphia, PA, area. The goals in choosing the vocabulary were to be able to express sentences that could have occurred in a natural conversation, and to make intensive use of the signing space, so as to demonstrate the advantages of three-dimensional data over two-dimensional data. We collected 48 continuous ASL sentences, each between

2 and 12 signs long, with a total of 2345 signs. The only constraints on the order and occurrence of signs were those dictated by the grammar of ASL [1]. Furthermore, we collected examples of each sign for isolated recognition. Because part of the data were corrupted during the collection process, we discarded all signs for which we did not have enough intact training examples. This left 5 examples over a range of 40 signs. Each sign had at least examples available for the training set, and 2 examples available for the test set.

Category     Signs used
Nouns        America, Christian, Christmas, book, brother, chair, college, family, father, friend, interpreter, language, mail, mother, name, paper, president, school, sign, sister, teacher
Pronouns     I, my, you, your, how, what, where, why
Verbs        act, can, give, have, interpret, like, make, read, sit, teach, try, visit, want, will, win
Adjectives   deaf, good, happy, relieved, sad
Other        if, from, for, hi

Table 1: The complete 53-sign vocabulary

8 Experiments

We performed isolated, continuous, and vision-HMM coupled ASL recognition experiments. We used Entropic's Hidden Markov Model Toolkit (HTK) Version 2.02 for training and testing in all of our experiments.

8.1 Isolated Recognition Experiments

The goal of the isolated recognition experiments was to discover a set of features that maximizes HMM recognition performance. We used different features in our experiments, including wrist position coordinates of both hands (denoted by 'ž MŸ ), wrist position expressed in polar coordinates in the -ž plane (denoted by A ^ ), polar coordinates in the -Ÿ plane (denoted by ˆ ' A ), wrist position expressed in spherical coordinates (denoted by ^ ) M ), and wrist orientation angle (denoted by ), as well as derivatives of these (denoted by a dot). We also combined several features in some experiments. We ran repeated experiments, more than 4 4ˆ44 total, with different features and randomly selected training and test sets on a per-experiment basis.
Three quarters of the examples for each sign were in the training set and the rest were in the test set. Each selection yielded 178 test examples per experiment. Some typical results are given in Table 2. In addition, we performed experiments to compare the merits of using three-dimensional coordinates versus two-dimensional coordinates by projecting the coordinates on planes. The results are shown in Table 3.

Features               Avg.     Std. dev.   B        W       N
^žh MŸ                 8.42%    0.%         100.0%   3.8%    43
', Ÿ                   8.72%    0.7%        100.0%   5.5%    44
A ' A, A ^, ^žh MŸ     8.78%    0.78%       100.0%   4.%     882
' ) N                  .48%     1.31%       100.0%   3.3%    210
Q ž Ÿ.87% 1.21% 100.0% 3.3% % 0.2% 100.0% 5.5% 17
^žh MŸ, Ÿ              .28%     1.04%       8.%      3.8%    120
ª )                    5.8%     1.2%        8.%      2.1%    150

Table 2: Results of isolated sign recognition with three-dimensional features. The columns give the average percentage of correctly recognized signs, the standard deviation, the best case (B), the worst case (W), and the number of experiments (N). All experiments used a test set of 178 signs.

Features   Avg.     Std. dev.   B        W      N
' A        8.0%     1.2%        100.0%   4.%    118
'ž         7.75%    1.20%       100.0%   4.%    118

Table 3: Results of isolated sign recognition with two-dimensional features. The meaning of the columns is the same as in Table 2.

8.2 Analysis of Isolated Recognition

The low error rates of the best feature sets show that with a good selection of features, the hand movements alone, without hand configuration information, carry sufficient information to discriminate among many different signs. Polar coordinates slightly outperformed Cartesian coordinates. A combination of both yielded the best results, although the difference is not significant. However, the standard deviation of the combined feature set was lowest, indicating that a complex feature vector is more robust than a simple feature vector.

Position coordinates significantly outperformed velocities. The reason for the poor performance of velocity features is that the statistical properties of the velocities change with variations in the sign's duration.
In contrast, the statistical properties of position coordinates are largely unaffected by the duration of signs, because HMMs absorb variations in duration through transitions looping back to the same state. Yet, position coordinates have the significant disadvantage that they are not invariant with respect to location. The lack of invariance will cause problems for future applications that attempt to capture commonalities between movements at different locations in space.

Three-dimensional features performed better than two-dimensional features, although the difference is not large. The difference would probably become more significant with a larger vocabulary. The differences in standard deviation, however, indicate that three-dimensional features are more robust than two-dimensional features.

An important consequence of the experiments' results is that the performance of the feature vectors depends on the actual examples in the training set, all other factors being equal. Thus, only performing a large number of experiments yields reliable estimates of the relative merits of different features.

8.3 Continuous Recognition Experiments

We split the 48 sentences randomly into a training set with 38 examples and a test set with 7 examples (containing 45 signs). Each sign in the vocabulary occurred at least once in the test set. The training and test sets were the same throughout all experiments, and no portion of the test set was used for training in any way. We ran three-dimensional experiments with and without context-dependent HMMs, and two-dimensional experiments (by projecting the data on planes; the results given are the best that we found).

In accordance with the results from isolated experiments that position coordinates perform better than velocities, and that a complex feature vector is more robust than a sparse one, we chose a single combined feature vector for both hands. It consisted of Cartesian and polar position coordinates, velocities, and wrist orientation angles. The task grammar was a simple word loop, so every sign was equally likely at any time in the HMM network.

Table 4 shows the experimental results. We use word accuracy as our evaluation criterion. It is computed by subtracting the number of insertion errors from the number of correctly spotted signs.
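As a worked illustration of this criterion, the sketch below expresses word accuracy as a percentage. It assumes the difference is normalized by the total number of signs N in the test set, which is the convention of HTK-style scoring; the function names and the counts in the example are hypothetical and are not taken from Table 4.

```python
def word_accuracy(correct, insertions, total):
    """Percentage word accuracy: (H - I) / N * 100."""
    if total <= 0:
        raise ValueError("total number of signs must be positive")
    return 100.0 * (correct - insertions) / total


def percent_correct(correct, total):
    """Percentage of correctly spotted signs, ignoring insertions: H / N * 100."""
    if total <= 0:
        raise ValueError("total number of signs must be positive")
    return 100.0 * correct / total


# Hypothetical counts: 90 correct signs, 5 insertions, 100 signs in total.
accuracy = word_accuracy(90, 5, 100)
correct_only = percent_correct(90, 100)
```

Because insertion errors are subtracted, word accuracy is never higher than the plain percentage of correct signs, and it can even become negative when a recognizer inserts many spurious signs.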
The number of words in the result for two-dimensional data is lower than in the other results, because for one sentence the Viterbi beam-searching optimization pruned all paths through the HMM network (see also Section 4.2).

Type of experiment        Word accuracy   Details
3D context independent    87.71%          H=41, D=8, S=32, I=1, N=45
3D context dependent      8.1%            H=424, D=, S=2, I=14, N=45
2D context dependent      83.3%           H=34, D=14, S=44, I=1, N=452

Table 4: Results of continuous recognition experiments. H denotes the number of correct signs, D the number of deletion errors, S the number of substitution errors, I the number of insertion errors, and N the total number of signs in the test set.

8.4 Analysis of Continuous Recognition

The results are clearly in favor of using three-dimensional data over two-dimensional for continuous recognition. The .3 percent difference is large, although, according to our experiences with isolated recognition, one experiment is not enough to estimate the real difference reliably.

Context-dependent models outperformed context-independent models, but the increase in performance was small, probably to a large extent because of insufficient training data; context-dependent modeling requires huge amounts of data to become effective. Also, cross-sign context-dependent modeling for ASL is implausible from a phonological point of view (see Section 5.2.1). The alternative is modeling movement epenthesis directly, and it appears to perform better [20].

More than half of the substitution errors in each experiment were confusions between I and MY, and YOU and YOUR, which differ only in hand configuration. We expect that adding features describing the hand configuration will improve recognition performance significantly.

Repeating the context-dependent experiment with five-best recognition showed that the absence of a strong grammar for constraining the HMM network degrades recognition performance significantly. In many cases, the correct sentence was the only grammatical sentence among the five best candidates.
In other cases, all five candidates were ungrammatical. Unfortunately, using a strong grammar for a test set as diverse as ours is not practical, because the size of an HMM network grows exponentially with the number of rules present in the grammar. Statistical language models, such as bigram models, have proved to be an effective solution to this problem in speech recognition. We show in [20] that bigram language models are promising for ASL recognition as well. However, they require a large corpus of labeled real-world data to become truly effective. Presently, no such corpus exists for ASL.

8.5 Coupling Experiments

To investigate the effects of coupling the three-dimensional motion analysis with the HMM framework, we performed two experiments. In the first experiment, we analyzed all sentences in the test set with our motion analysis, so as to provide an upper bound on its performance. If the motion analysis had worked perfectly, it should have accepted all of these 7 test sentences. In reality, however, it rejected 10 out of these 7 sentences.

A closer look at the 10 rejected sentences revealed that five of these were not recognized correctly by the context-dependent HMMs either. Thus, it is likely that these five
