Extraction of Human Activities as Action Sequences Using pLSA and PrefixSpan

Takuya TONARU, Tetsuya TAKIGUCHI, Yasuo ARIKI
Graduate School of Engineering, Kobe University / Organization of Advanced Science and Technology, Kobe University
tonaru@me.cs.scitec.kobe-u.ac.jp, takigu@kobe-u.ac.jp, ariki@kobe-u.ac.jp

Abstract

In this paper, we propose a framework for recognizing human activities in our daily life. Since a human activity is represented as a sequence of actions, the actions are recognized from videos and then the frequently-occurring human activities can be extracted from them. We show experimental results on data taken in a deskwork environment to demonstrate the performance of the proposed framework. The experimental results were as follows: an 86.0% averaged recall rate and a 78.3% averaged precision rate were obtained in extracting human activities.

1. Introduction

Today, it is easy to record individual daily activities in video sequences. Analyzing human activities in video sequences is valuable for tasks that can give helpful information to users or support their lives. For example, at a desk in an office, workers mainly use computers, sometimes drink coffee, or wear headphones to listen to music. If someone drinks too much coffee, a life-support system that analyzes his activities can issue a warning about his health. Hence our goal is to automatically detect, categorize, and recognize daily human activities. There has been much research on the recognition of simple actions [1] [2], such as running, walking, hand waving, boxing, etc. Niebles showed interesting results for unsupervised learning and recognition of multiple actions using pLSA models [2]. However, in an actual environment, a person acts by combining various simple actions. Hence, recognition of daily human activity cannot be achieved by merely extending the previous framework. Previous research has represented human activity as a symbolic sequence of actions in a hierarchy. One popular approach applied a Stochastic Context-Free Grammar (SCFG) to the symbolic sequence of actions to analyze its structure [3] [4].
However, the grammar was given manually. Hamid analyzed human activity in a kitchen environment using a SuffixTree built from a sequence of interactions with key-objects [5]. In this paper, we propose a method to analyze human activities in video by detecting and categorizing actions based on an unsupervised learning approach, and to recognize the human activities from these actions based on sequential data mining. The learning cost of obtaining a symbolic sequence of actions can be reduced by adopting the unsupervised approach. Under the assumption that daily human activities appear frequently, sequential data mining shows strong potential for obtaining frequently-appearing activities from symbolic sequences of actions in a video.
(a) reaching for a cup  (b) taking a cup  (c) putting a cup down
Figure 1. A sequence of actions forming the activity of Drinking Coffee. The number in the lower left indicates each action.

2. Activity representation

We define human activity in this section. A human activity consists of various actions, and it is represented as a symbolic sequence of actions. For example, an activity S in which a person is drinking coffee is represented as a sequence of actions as follows:

S = 8 6 9

The numbers 8, 6, and 9 indicate the actions of reaching for a cup, taking the cup, and putting the cup down, respectively, as shown in Fig. 1. An activity of drinking coffee is usually represented as a flow of actions such as taking the cup, lifting the cup to the mouth, and putting the cup down. A temporal flow of such actions constitutes a human activity.

3. Approach

Our method consists of two phases. In the first phase, a histogram sequence of actions is obtained using a human action categorizing method [2]. In the second phase, the obtained action histogram sequence is converted into a discretized symbolic sequence of actions, and human activities are extracted using PrefixSpan based on their frequency.

3.1. Human action categorizing method

This method extracts spatial-temporal features and learns the action models using a pLSA model. Here, a brief review of this method is given.

3.1.1. Feature representation: Assuming a stationary camera or a process that can account for camera motion, separable linear filters are applied to the video to obtain the response function

R = (I * g * h_ev)^2 + (I * g * h_od)^2    (1)

where I is a gray-scale pixel of the image, g(x, y; σ) is a 2D Gaussian smoothing kernel applied only along the spatial dimensions, and h_ev and h_od are a quadrature pair of 1D Gabor filters applied temporally, which are defined as h_ev(t; τ, ω) = cos(2πtω) exp(−t²/τ²) and
h_od(t; τ, ω) = sin(2πtω) exp(−t²/τ²). The two parameters σ and τ correspond to the spatial and temporal scales of the filters, respectively. To make the response function effective, we use ω = 4/τ. This function detects regions where spatially complex motion occurs: a region with complex motion induces a strong response, but a region with simple translational motion does not. The spatial-temporal interest points are extracted around the local maxima of the response function. At each interest point, a spatial-temporal cube is extracted that contains the output of the response function. Its size is approximately six times the spatial and temporal scales along each dimension. To obtain a motion descriptor, the brightness gradients are computed at all the pixels in the cube and concatenated to form a vector. Then PCA is applied to reduce the dimensionality of the descriptors. In order to obtain the cluster prototypes, a k-means algorithm is applied to the descriptors. Then each descriptor is assigned a descriptor type by mapping it to the nearest prototype. Therefore, the collection of descriptors included in a video is represented as a histogram of descriptor types. Hereafter, we will refer to the descriptor types as words in videos.

3.1.2. Action categorization by pLSA: pLSA (Probabilistic Latent Semantic Analysis) is a technique used in the analysis of co-occurrence data. This method can find meaningful topics, which correspond to motion categories, in terms of words in videos.

Figure 2. Graphical model of the symmetric parameterized version of pLSA, with parameters P(z), P(w|z), and P(d|z).

We can create a co-occurrence table N between a word w_i in W = {w_1, ..., w_M} and a video d_j in D = {d_1, ..., d_N} using the feature extraction method described in 3.1.1. In addition, there is a latent topic variable z_k in Z = {z_1, ..., z_K}, which is not observed.
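As an illustration of the feature-extraction step above, the response function of Eq. (1) can be sketched in Python as follows. This is a minimal sketch with NumPy/SciPy; the function name, the filter support of ±3τ, and the default parameter values are our own assumptions, not part of the original method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def response_function(video, sigma=2.0, tau=4.0):
    """Spatio-temporal response R = (I * g * h_ev)^2 + (I * g * h_od)^2.

    video: array of shape (T, H, W), grayscale frames.
    sigma: spatial scale of the 2D Gaussian smoothing kernel g.
    tau:   temporal scale of the quadrature pair of 1D Gabor filters.
    """
    omega = 4.0 / tau  # as in the text, omega = 4 / tau
    # Gaussian smoothing along the spatial dimensions only (not time).
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    # Quadrature pair of 1D temporal Gabor filters (support is an assumption).
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    h_ev = np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    # Convolve each pixel's intensity profile temporally with both filters.
    ev = np.apply_along_axis(lambda s: np.convolve(s, h_ev, mode="same"), 0, smoothed)
    od = np.apply_along_axis(lambda s: np.convolve(s, h_od, mode="same"), 0, smoothed)
    return ev**2 + od**2
```

Interest points would then be taken around the local maxima of the returned response volume, and descriptor cubes extracted around them.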
Assuming that the observation pairs (w_i, d_j) are generated independently under the condition of the latent topic variable z_k, the joint probability model is given by

P(w_i, d_j) = Σ_{k=1}^{K} P(z_k) P(w_i | z_k) P(d_j | z_k)    (2)

where P(w_i | z_k) is the probability of word w_i occurring in action category z_k, and P(d_j | z_k) is the probability of video d_j occurring in action category z_k. This model is the symmetric parameterized version of the generative model [6], and its graphical model is shown in Fig. 2. We then determine the model parameters P(z), P(w|z), and P(d|z) by maximization of the log-likelihood function
L = Σ_{i=1}^{M} Σ_{j=1}^{N} n(w_i, d_j) log P(w_i, d_j)    (3)

where n(w_i, d_j) denotes the word frequency, that is, the number of times word w_i occurred in video d_j. Maximizing the log-likelihood function yields a model that gives high probability to the words that appear in the videos. The log-likelihood function is maximized with the Expectation-Maximization (EM) algorithm. When testing the model, each word in the testing video d_test is labeled topically by finding the maximum of the posterior

P(z_k | w_i, d_test) = P(w_i | z_k) P(z_k | d_test) / Σ_{l=1}^{K} P(w_i | z_l) P(z_l | d_test)    (4)

Since P(z | d_test) is not obtained during training, it must be computed; this can be done using the EM algorithm in the same way as when training the model.

3.2. Extraction of activities

3.2.1. Action recognition by human action categorization: It is necessary to prepare video clips that include actions as learning data. However, in our method, it is not necessary to clip each action precisely from the videos, because the pLSA model is a multi-topic analysis method. If two actions occur consecutively without a non-movement gap, they are clipped as one video sequence, and the pLSA model can still find the action categories separately as latent topics. Accordingly, video sequences for learning are extracted easily and automatically from videos. When learning with the pLSA model, it is necessary to decide the number of topics K, which is the number of categorized actions. If K is large, the action vocabulary becomes large, but the model responds sensitively to small differences in the features. If K is small, the model deals well with noise, but the action vocabulary becomes small. Future research will consider how to set K automatically.

Figure 3. Conversion of the probabilistic sequence into a discretized symbolic sequence (e.g., aaabbbbccc...dd...).
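The EM estimation of the pLSA parameters P(z), P(w|z), and P(d|z) described in Section 3.1.2 can be sketched as follows. This is a minimal illustration; the random initialization, iteration count, and function signature are our own choices.

```python
import numpy as np

def plsa(n, K, n_iter=50, seed=0):
    """Minimal pLSA trained with EM (symmetric parameterization).

    n: (M, N) co-occurrence table n(w_i, d_j) of word counts per video.
    K: number of latent topics (action categories).
    Returns P(z) of shape (K,), P(w|z) of shape (M, K), P(d|z) of shape (N, K).
    """
    rng = np.random.default_rng(seed)
    M, N = n.shape
    Pz = np.full(K, 1.0 / K)
    Pw_z = rng.random((M, K)); Pw_z /= Pw_z.sum(axis=0)
    Pd_z = rng.random((N, K)); Pd_z /= Pd_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: posterior P(z_k | w_i, d_j) for every (word, video) pair.
        joint = Pz[None, None, :] * Pw_z[:, None, :] * Pd_z[None, :, :]  # (M, N, K)
        post = joint / joint.sum(axis=2, keepdims=True)
        # M-step: re-estimate the multinomials from expected counts.
        weighted = n[:, :, None] * post                   # (M, N, K)
        Pw_z = weighted.sum(axis=1); Pw_z /= Pw_z.sum(axis=0)
        Pd_z = weighted.sum(axis=0); Pd_z /= Pd_z.sum(axis=0)
        Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum()
    return Pz, Pw_z, Pd_z
```

For a test video, the same E/M loop can be run with P(w|z) held fixed to obtain P(z|d_test), which is then plugged into Eq. (4).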
3.2.2. Converting into a discretized symbolic sequence: The result of action recognition for the testing video d_test is a histogram sequence of actions computed frame by frame. This histogram is P(z_k | d_test), as described in Section 3.1.2. The histogram sequence is smoothed for denoising, and each frame is replaced by the action symbol with the maximum probability, as shown in Fig. 3. Next, consecutive identical symbols are merged into one, as shown in Fig. 4. In addition, since a human activity is a sequence of consecutive actions, if a non-movement duration is longer than some threshold, the sequence is split into two sequences.

Figure 4. Conversion into symbolic sequences by merging consecutive identical symbols and splitting at time sections with no action (e.g., aaabbbbccc...dd... becomes a b c, then d).

3.2.3. Extracting human activities: We assume that human daily activities appear frequently. To extract activities, PrefixSpan (Prefix-projected Sequential PAtterN mining) [7], commonly used in sequential data mining, is employed. As shown in Fig. 5, frequent subsequences are discovered as patterns in a sequence database, where the occurrence frequency of a subsequence is no less than the minimum support. Its general idea is to examine only the prefix subsequences and project only their corresponding postfix subsequences into projected databases. In each projected database, sequential patterns are grown by exploring only locally frequent patterns [7]. The mining result is a list of action sequences sorted in the order of frequency. Next, the extracted sequences are manually labeled as activities if they represent human activities.

Figure 5. Frequent subsequence extraction by PrefixSpan: the input sequences a c d, a b c, c b a, and a a b with minimum support threshold 2 yield the frequent patterns a:5, a b:2, a c:2, b:3, and c:3.
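The frequent-subsequence mining of Section 3.2.3 can be sketched with a minimal PrefixSpan as follows. This sketch counts sequence-level support (the number of input sequences containing a pattern, which can differ from the occurrence counts shown in Fig. 5); function and variable names are ours.

```python
def prefixspan(db, minsup):
    """Minimal PrefixSpan over sequences of single items.

    db: list of sequences (lists of hashable action symbols).
    minsup: minimum number of sequences a pattern must appear in.
    Returns a list of (pattern, support) pairs.
    """
    results = []

    def mine(prefix, projected):
        # Count, for each item, how many projected postfixes contain it.
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, sup in sorted(counts.items(), key=lambda kv: -kv[1]):
            if sup < minsup:
                continue
            pattern = prefix + [item]
            results.append((pattern, sup))
            # Project: keep the postfix after the first occurrence of item.
            new_proj = [seq[seq.index(item) + 1:] for seq in projected if item in seq]
            mine(pattern, new_proj)

    mine([], db)
    return results
```

Running it on the input of Fig. 5 with minimum support 2 recovers the patterns a, b, c, a b, and a c, while d is pruned as infrequent.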
4. Experimental results

4.1. Experimental conditions

We verified the validity of our algorithm using a 70-minute-long video in which a person is working at a desk in the laboratory. In the video, the person uses a computer and sometimes drinks coffee, wears or removes headphones, picks up or throws away tissues, and scratches his head. No one else appears in the video, and the person does not leave the desk. The resolution of the video image is 160 × 120. The spatio-temporal features were extracted as described in Section 3.1.1 with the two parameters σ = 11 and τ = 19. A codebook containing 400 codewords was created from the training-set descriptors. The number of latent topics K was set to 13, and the minimum support value of PrefixSpan was set to 3. A symbolic sequence was split into two if the non-movement duration was longer than 120 frames.

4.2. Experimental results

PrefixSpan extracted 43 frequent sequences, from which six activities were identified in the order of frequency. Table 1 shows the extracted human activities. Fig. 6 shows examples of the extracted human activities as images.

Table 1. Human activities extracted by the proposed method

Activity             Frequency   Sequence   Recall   Precision
Drink coffee             16      6 9         1.00      0.91
                          7      6 11 9
Remove headphones         7      4 10 3      0.86      0.86
Pick up tissues           5      8 12        0.80      0.80
Scratch the head          4      4 13        0.50      0.67
Wear headphones           3      4 7         1.00      0.86
                          3      4 10 7
Throw away tissues        3      12 10 9     1.00      0.60

In Table 1, two different sequences appear for the same activity. For example, Drink coffee has two different sequences: 6 9 and 6 11 9. This is caused by slow actions: in Fig. 6(a) and 6(b), the action in the middle was inserted when the arm motion was very slow. In Table 1, the averaged recall and precision are 86.0% and 78.3%, respectively.
The definitions of recall and precision are as follows:

Recall = True positive / (True positive + False negative) × 100 [%]
Precision = True positive / (True positive + False positive) × 100 [%]

The definitions of true positive, false positive, and false negative are given as follows:

True positive: the number of correctly extracted activities
False positive: the number of falsely extracted activities
False negative: the number of true activities not extracted

5. Conclusion

We proposed a framework for recognizing human activities by analyzing videos. The goal of our work is to automatically convert a video sequence into a symbolic sequence of actions and to extract frequently-occurring human activities from the symbolic sequences. In the future, we plan to extract human activities directly from an action histogram sequence by taking into consideration the duration of actions and by permitting multiple activity candidates.

Figure 6. Human activities extracted by the proposed method: (a) drinking coffee, (b) removing headphones, (c) picking up tissues.
6. References

[1] C. Schuldt, I. Laptev, and B. Caputo, Recognizing Human Actions: A Local SVM Approach, ICPR, pp. 32-36, 2004.
[2] J. C. Niebles, H. Wang, and L. Fei-Fei, Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words, British Machine Vision Conference, pp. 1249-1258, 2006.
[3] Y. Ivanov and A. Bobick, Recognition of Visual Activities and Interactions by Stochastic Parsing, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 852-872, 2000.
[4] D. Minnen, I. Essa, and T. Starner, Expectation Grammars: Leveraging High-Level Expectations for Activity Recognition, CVPR, pp. 626-632, 2003.
[5] R. Hamid, S. Maddi, A. Bobick, and I. Essa, Unsupervised Analysis of Activity Sequences Using Event-Motifs, VSSN, pp. 71-78, 2006.
[6] T. Hofmann, Probabilistic Latent Semantic Indexing, SIGIR, pp. 50-57, 1999.
[7] J. Pei, J. Han, B. Mortazavi-Asl, and H. Pinto, PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, ICDE, pp. 215-224, 2001.

Authors

Takuya Tonaru is a graduate student at Kobe University.

Tetsuya Takiguchi received the Dr. Eng. degree in information science from Nara Institute of Science and Technology, Nara, Japan, in 1999. From 1999 to 2004, he was a researcher at IBM Research, Tokyo Research Laboratory, Japan. He is currently a Lecturer at Kobe University. From May 2008 to September 2008 he was a visiting scholar at the University of Washington. His research interests include speech and image processing. He received the Awaya Award from the Acoustical Society of Japan in 2002. He is a member of the IEEE, the Information Processing Society of Japan, and the Acoustical Society of Japan.

Yasuo Ariki received his B.E., M.E., and Ph.D. in information science from Kyoto University in 1974, 1976, and 1979, respectively. He was an assistant professor at Kyoto University from 1980 to 1990, and stayed at Edinburgh University as a visiting academic from 1987 to 1990. From 1990 to 1992 he was an associate professor and from 1992 to 2003 a professor at Ryukoku University. Since 2003 he has been a professor at Kobe University.
He is mainly engaged in speech and image recognition and is interested in information retrieval and databases. He is a member of the IEEE, IPSJ, JSAI, ITE, and IIEEJ.