Extraction of Human Activities as Action Sequences using pLSA and PrefixSpan

Takuya TONARU, Tetsuya TAKIGUCHI, Yasuo ARIKI
Graduate School of Engineering, Kobe University
Organization of Advanced Science and Technology, Kobe University
tonaru@me.cs.scitec.kobe-u.ac.jp, takigu@kobe-u.ac.jp, ariki@kobe-u.ac.jp

Abstract

In this paper, we propose a framework for recognizing human activities in daily life. Since a human activity is represented as a sequence of actions, the actions are first recognized from videos, and the frequently occurring human activities are then extracted from them. We demonstrate the performance of the proposed framework on data recorded in a deskwork environment. In extracting human activities, an averaged recall rate of 86.0% and an averaged precision rate of 78.3% were obtained.

1. Introduction

Today, it is easy to record individual daily activities in video sequences. Analyzing human activities in video sequences is valuable for tasks that give helpful information to users or support their lives. For example, at a desk in an office, workers mainly use computers, sometimes drink coffee, or wear headphones to listen to music. If someone drinks too much coffee, a life-support system that analyzes his activities could issue a warning about his health. Hence our goal is to automatically detect, categorize and recognize daily human activities.

Much research has been carried out on the recognition of simple actions [1] [2], such as running, walking, hand waving and boxing. Niebles showed interesting results for unsupervised learning and recognition of multiple actions using pLSA models [2]. However, in an actual environment, a person acts by combining various simple actions. Hence, recognition of daily human activity cannot be achieved by merely extending the previous framework.

Previous research has represented human activity hierarchically as a symbolic sequence of actions. One popular approach applied Stochastic Context-Free Grammar (SCFG) to the symbolic sequence of actions to analyze its structure [3] [4]. However, the grammar was given manually. Hamid analyzed human activity in a kitchen environment using a Suffix Tree built from a sequence of interactions with key objects [5].

In this paper, we propose a method to analyze human activities in video by detecting and categorizing actions with an unsupervised learning approach, and by recognizing the human activities from these actions with sequential data mining. Adopting the unsupervised approach reduces the learning cost of obtaining a symbolic sequence of actions. Under the assumption that daily human activities appear frequently, sequential data mining shows strong potential for obtaining frequently appearing activities from the symbolic sequences of actions in a video.

Figure 1. A sequence of actions forming the activity "Drinking Coffee": (a) reaching for a cup, (b) taking a cup, (c) putting a cup down. The number in the lower left indicates each action.

2. Activity representation

We define human activity in this section. A human activity consists of various actions, and it is represented as a symbolic sequence of actions. For example, an activity S in which a person is drinking coffee is represented as a sequence of actions as follows:

$$S = \langle 8\ 6\ 9 \rangle$$

The numbers 8, 6, and 9 indicate the actions of reaching for a cup, taking the cup, and putting the cup down, respectively, as shown in Fig. 1. An activity of drinking coffee is usually represented as a flow of actions such as taking the cup, lifting the cup to the mouth, and putting the cup down. A temporal flow of such actions constitutes a human activity.

3. Approach

Our method consists of two phases. In the first phase, a histogram sequence of actions is obtained using the human action categorizing method [2]. In the second phase, the obtained action histogram sequence is converted into a discretized symbolic sequence of actions, and human activities are extracted using PrefixSpan based on their frequency.

3.1. Human action categorizing method

This method extracts spatial-temporal features and learns the action models using a pLSA model. A brief review of the method follows.

3.1.1. Feature representation: Assuming a stationary camera, or a process that can account for camera motion, separable linear filters are applied to the video to obtain the response function

$$R = (I * g * h_{ev})^2 + (I * g * h_{od})^2 \tag{1}$$

where $I$ is the gray-scale pixel intensity of the image, $g(x, y; \sigma)$ is a 2D Gaussian smoothing kernel applied only along the spatial dimensions, and $h_{ev}$ and $h_{od}$ are a quadrature pair of 1D Gabor filters applied temporally, defined as

$$h_{ev}(t; \tau, \omega) = \cos(2\pi t \omega) \exp(-t^2/\tau^2), \qquad h_{od}(t; \tau, \omega) = \sin(2\pi t \omega) \exp(-t^2/\tau^2).$$

The two parameters $\sigma$ and $\tau$ correspond to the spatial and temporal scales of the filters, respectively. In practice, we use $\omega = 4/\tau$. This function detects regions in which complex motion occurs: a region with complex motion induces a strong response, whereas a region with simple translational motion does not.

The spatial-temporal interest points are extracted around the local maxima of the response function. At each interest point, a spatial-temporal cube is extracted that contains the output of the response function. Its size is approximately six times the spatial and temporal scales along each dimension. To obtain a motion descriptor, the brightness gradients are computed at all the pixels in the cube and concatenated to form a vector. PCA is then applied to reduce the dimensionality of the descriptors. To obtain the cluster prototypes, a k-means algorithm is applied to the descriptors, and each descriptor is assigned a descriptor type by mapping it to its prototype. A collection of descriptors in a video is therefore represented as a histogram of the descriptor types. Hereafter, we refer to the descriptor types as words in videos.

3.1.2. Action categorization by pLSA: pLSA (Probabilistic Latent Semantic Analysis) is a technique used in the analysis of co-occurrence data. It can find meaningful topics, which correspond to motion categories, in terms of the words in videos.

Figure 2. pLSA graphical model of the symmetric parameterized version (nodes d, z, w; parameters P(z), P(d|z), P(w|z)).

We can create a co-occurrence table N between a word $w_i$ in $W = \{w_1, \ldots, w_M\}$ and a video $d_j$ in $D = \{d_1, \ldots, d_N\}$ using the feature extraction method described in Section 3.1.1. In addition, there is a latent topic variable $z_k$ in $Z = \{z_1, \ldots, z_K\}$, which is not observed.
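The construction of the co-occurrence table can be sketched in a few lines. This is only an illustration under stated assumptions: the descriptors are taken as already PCA-reduced vectors, the codebook stands in for the k-means prototypes, and the function names `nearest_prototype` and `cooccurrence_table` as well as the toy data are hypothetical, not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_prototype(descriptor: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook word closest to the descriptor
    (squared Euclidean distance; the first index wins on ties)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(descriptor, codebook[i]))


def cooccurrence_table(videos: List[List[Vector]], codebook: List[Vector]) -> List[List[int]]:
    """Build table[i][j] = n(w_i, d_j): how often word i occurs in video j."""
    table = [[0] * len(videos) for _ in codebook]
    for j, descriptors in enumerate(videos):
        for descriptor in descriptors:
            table[nearest_prototype(descriptor, codebook)][j] += 1
    return table


# Toy data (hypothetical): 2-D descriptors, M = 3 words, N = 2 videos.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
videos = [
    [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1)],
    [(0.0, 0.8), (0.1, 1.2), (0.05, 0.05), (0.0, 0.9)],
]
table = cooccurrence_table(videos, codebook)
```

Each column of `table` is the word histogram of one video, which is the input that pLSA works on.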
Assuming that the observation pairs $(w_i, d_j)$ are generated independently under the condition of the latent topic variable $z_k$, a joint probability model is given by

$$P(w_i, d_j) = \sum_{k=1}^{K} P(z_k)\, P(w_i \mid z_k)\, P(d_j \mid z_k) \tag{2}$$

where $P(w_i \mid z_k)$ is the probability of a word $w_i$ occurring in an action category $z_k$, and $P(d_j \mid z_k)$ is the probability of video $d_j$ occurring in an action category $z_k$. This model is a symmetric parameterized version of the generative model [6], and its graphical model is shown in Fig. 2. We then determine the model parameters $P(z)$, $P(w \mid z)$ and $P(d \mid z)$ by maximizing the log-likelihood function

$$L = \sum_{i=1}^{M} \sum_{j=1}^{N} n(w_i, d_j) \log P(w_i, d_j) \tag{3}$$

where $n(w_i, d_j)$ denotes the word frequency, that is, the number of times word $w_i$ occurred in video $d_j$. Maximizing the log-likelihood function yields a model that gives high probability to the words that appear in the video. The log-likelihood function is maximized with the Expectation Maximization (EM) algorithm.

When testing the model, each word $w_i$ in the test video $d_{test}$ is assigned a topic by finding the maximum of the following posterior:

$$P(z_k \mid w_i, d_{test}) = \frac{P(w_i \mid z_k)\, P(z_k \mid d_{test})}{\sum_{l=1}^{K} P(w_i \mid z_l)\, P(z_l \mid d_{test})} \tag{4}$$

Since $P(z \mid d_{test})$ is not known in advance, it must be computed. This can be done using an EM algorithm in the same way as in training the model.

3.2. Extraction of activities

3.2.1. Action recognition by human action categorization: Video clips that include actions must be prepared as learning data. However, in our method, it is not necessary to clip each action precisely from the videos, because the pLSA model is a multi-topic analysis method. If two actions occur consecutively without a non-movement gap, they are clipped as one video sequence, and the pLSA model can find these action categories separately as latent topics. Accordingly, video sequences for learning are extracted easily and automatically from videos.

When learning with the pLSA model, the number of topics K, which is the number of categorized actions, must be decided. If K is large, the action vocabulary becomes large, but the model responds sensitively to small differences in the features. If K is small, the model deals well with noise, but the action vocabulary becomes small. Future research will consider how to choose K automatically.

Figure 3. Conversion into a discretized symbolic sequence: the probabilistic sequence is discretized into a symbolic sequence such as aaabbbbccc...dd...***.
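As an illustration of the EM fitting reviewed in Section 3.1.2, the following is a minimal pure-Python sketch of pLSA training for the symmetric model of Eq. (2), together with the log-likelihood of Eq. (3). The function names, the random initialization, the fixed iteration count and the toy count table are all assumptions made for this example; the paper does not describe its implementation at this level of detail.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def normalize(values: List[float]) -> List[float]:
    """Scale non-negative values to sum to 1 (unchanged if they sum to 0)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def log_likelihood(counts: List[List[int]], p_z: List[float],
                   p_w_z: Matrix, p_d_z: Matrix) -> float:
    """Eq. (3): sum of n(w_i, d_j) * log P(w_i, d_j), with P(w_i, d_j) from Eq. (2)."""
    total = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n > 0:
                p = sum(p_z[k] * p_w_z[k][i] * p_d_z[k][j] for k in range(len(p_z)))
                total += n * math.log(p)
    return total


def plsa_em(counts: List[List[int]], num_topics: int, iterations: int = 50,
            seed: int = 0) -> Tuple[List[float], Matrix, Matrix]:
    """Fit P(z), P(w|z), P(d|z) by EM; counts[i][j] = n(w_i, d_j)."""
    rng = random.Random(seed)
    num_words, num_videos = len(counts), len(counts[0])
    p_z = [1.0 / num_topics] * num_topics
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(num_words)])
             for _ in range(num_topics)]
    p_d_z = [normalize([rng.random() + 0.1 for _ in range(num_videos)])
             for _ in range(num_topics)]
    for _ in range(iterations):
        exp_w = [[0.0] * num_words for _ in range(num_topics)]
        exp_d = [[0.0] * num_videos for _ in range(num_topics)]
        for i in range(num_words):
            for j in range(num_videos):
                if counts[i][j] == 0:
                    continue
                # E-step: posterior P(z_k | w_i, d_j), proportional to the summand of Eq. (2).
                post = normalize([p_z[k] * p_w_z[k][i] * p_d_z[k][j]
                                  for k in range(num_topics)])
                for k in range(num_topics):
                    exp_w[k][i] += counts[i][j] * post[k]
                    exp_d[k][j] += counts[i][j] * post[k]
        # M-step: re-estimate the parameters from the expected counts.
        p_z = normalize([sum(row) for row in exp_w])
        p_w_z = [normalize(row) for row in exp_w]
        p_d_z = [normalize(row) for row in exp_d]
    return p_z, p_w_z, p_d_z


# Toy table (hypothetical): words 0-1 occur in videos 0-1, words 2-3 in videos 2-3.
counts = [[4, 3, 0, 0], [2, 5, 0, 0], [0, 0, 3, 4], [0, 0, 6, 1]]
p_z, p_w_z, p_d_z = plsa_em(counts, num_topics=2)
```

The posterior of Eq. (4) for a test video has the same form as the E-step above, with $P(z \mid d_{test})$ in place of $P(z)\,P(d \mid z)$. A common choice, assumed here rather than stated in the text, is to keep $P(w \mid z)$ fixed while estimating it.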

3.2.2. Converting into a discretized symbolic sequence: The result of action recognition for the test video $d_{test}$ is a histogram sequence of actions computed frame by frame. This histogram is $P(z_k \mid d_{test})$ as described in Section 3.1.2. The histogram sequence is smoothed for denoising, and each frame is replaced by the action symbol with the maximum probability, as shown in Fig. 3. Next, consecutive identical symbols are merged into one, as shown in Fig. 4. In addition, since a human activity is a sequence of consecutive actions, if a non-movement duration is longer than some threshold, the sequence is split into two sequences.

Figure 4. Conversion into symbolic sequences by merging and splitting: the symbolic sequence aaabbbbccc...dd...*** is merged and split into the symbolic sequences a b c d, where *** denotes a time section with no action.

3.2.3. Extracting human activities: We assume that daily human activities appear frequently. To extract activities, PrefixSpan (Prefix-projected Sequential PAtterN mining) [7], which is commonly used in sequential data mining, is employed. As shown in Fig. 5, frequent subsequences are discovered as patterns in a sequence database, where the occurrence frequency of a subsequence is no less than the minimum support. The general idea is to examine only the prefix subsequences and to project only their corresponding postfix subsequences into projected databases. In each projected database, sequential patterns are grown by exploring only locally frequent patterns [7]. The mining result is a list of action sequences sorted in order of frequency. Next, the extracted sequences are manually labeled as activities if they represent human activities.

Figure 5. Frequent subsequence extraction by PrefixSpan. Input symbolic sequences: (1) a c d, (2) a b c, (3) c b a, (4) a a b, with symbol counts a: 5, b: 3, c: 3, d: 1 and a minimum support threshold of 2. Projected databases: prefix a gives (1) c d, (2) b c, (4) a b; prefix b gives (2) c, (3) a; prefix c gives (1) d, (3) b a; prefix a b gives (2) c; prefix a c gives (1) d. Output: a: 5, a b: 2, a c: 2, b: 3, c: 3.
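The merge-and-split step and the PrefixSpan mining can be sketched as follows. This is a minimal illustration, not the authors' implementation: it starts from per-frame symbols (after smoothing and taking the most probable action), it assumes that pauses shorter than the threshold are simply dropped, and the names `to_symbol_sequences` and `prefixspan` are hypothetical. Support is counted as the number of sequences that contain a pattern, which is the usual PrefixSpan definition; under that definition the single symbol a in the Fig. 5 example has support 4, whereas the figure reports 5, which appears to count occurrences.

```python
from itertools import groupby
from typing import Dict, List, Sequence, Tuple


def to_symbol_sequences(frame_symbols: Sequence[str], idle: str,
                        max_idle: int) -> List[List[str]]:
    """Merge runs of identical per-frame symbols and split at long idle runs."""
    sequences: List[List[str]] = []
    current: List[str] = []
    for symbol, run in groupby(frame_symbols):
        run_length = sum(1 for _ in run)
        if symbol != idle:
            current.append(symbol)      # a run of equal symbols becomes one symbol
        elif run_length > max_idle and current:
            sequences.append(current)   # long pause: close the current sequence
            current = []
    if current:
        sequences.append(current)
    return sequences


def prefixspan(database: List[List[str]], min_support: int) -> Dict[Tuple[str, ...], int]:
    """Return {pattern: support} for all subsequences with support >= min_support."""
    results: Dict[Tuple[str, ...], int] = {}

    def grow(prefix: Tuple[str, ...], projected: List[List[str]]) -> None:
        # Count each symbol at most once per projected suffix.
        support: Dict[str, int] = {}
        for suffix in projected:
            for symbol in set(suffix):
                support[symbol] = support.get(symbol, 0) + 1
        for symbol in sorted(support):
            if support[symbol] < min_support:
                continue
            pattern = prefix + (symbol,)
            results[pattern] = support[symbol]
            # Project: keep what follows the first occurrence of the symbol.
            grow(pattern, [s[s.index(symbol) + 1:] for s in projected if symbol in s])

    grow((), [list(s) for s in database])
    return results


# Per-frame symbols (hypothetical), with "*" marking frames without movement.
frames = list("aaabbbbccc") + ["*"] * 2 + list("dd") + ["*"] * 5 + list("ab")
sequences = to_symbol_sequences(frames, idle="*", max_idle=3)

# The four input sequences of Fig. 5, mined with minimum support 2.
database = [list("acd"), list("abc"), list("cba"), list("aab")]
patterns = prefixspan(database, min_support=2)
ranked = sorted(patterns.items(), key=lambda item: -item[1])
```

Sorting the result by support, as in `ranked`, gives the frequency-ordered list from which activities are then labeled by hand.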

4. Experimental results

4.1. Experimental conditions

We verified the validity of our algorithm using a 70-minute-long video in which a person is working at a desk in the laboratory. In the video, the person uses a computer and sometimes drinks coffee, wears or removes headphones, picks up or throws away tissues, and scratches his head. No one else appears in the video, and the person does not leave the desk. The resolution of the video image is 160 × 120. The spatio-temporal features were extracted as described in Section 3.1.1 with the two parameters σ = 11 and τ = 19. A codebook containing 400 codewords was created from the training set descriptors. The number of latent topics K was set to 13, and the minimum support value of PrefixSpan was set to 3. A symbolic sequence was split into two if the non-movement duration was longer than 120 frames.

4.2. Experimental results

PrefixSpan extracted 43 action sequences, and six activities were obtained from them in order of frequency. Table 1 shows the extracted human activities. Fig. 6 shows examples of extracted human activities as images.

Table 1. Human activities extracted by the proposed method

Activity             Frequency   Sequence    Recall   Precision
Drink coffee         16          6 9         1.00     0.91
                     7           6 11 9
Remove headphones    7           4 10 3      0.86     0.86
Pick up tissues      5           8 12        0.80     0.80
Scratch the head     4           4 13        0.50     0.67
Wear headphones      3           4 7         1.00     0.86
                     3           4 10 7
Throw away tissues   3           12 10 9     1.00     0.60

In Table 1, some activities are represented by two different sequences. For example, Drink coffee has two different sequences: 6 9 and 6 11 9. This is caused by slow action speed. In Fig. 6(a) and 6(b), the action in the middle was inserted when the arm moved very slowly. In Table 1, the averaged recall and precision are 86.0% and 78.3%, respectively.
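The reported averages can be checked against the per-activity values in Table 1. The short sketch below assumes that they are unweighted means over the six activities, which the text does not state explicitly; the variable names are illustrative.

```python
# (recall, precision) per activity, copied from Table 1.
table1 = {
    "Drink coffee": (1.00, 0.91),
    "Remove headphones": (0.86, 0.86),
    "Pick up tissues": (0.80, 0.80),
    "Scratch the head": (0.50, 0.67),
    "Wear headphones": (1.00, 0.86),
    "Throw away tissues": (1.00, 0.60),
}

# Unweighted (macro) averages over the activities, in percent.
avg_recall = 100.0 * sum(r for r, _ in table1.values()) / len(table1)
avg_precision = 100.0 * sum(p for _, p in table1.values()) / len(table1)

print(f"averaged recall: {avg_recall:.1f}%, averaged precision: {avg_precision:.1f}%")
```

Under this assumption the means come out at 86.0% and 78.3%, in agreement with the figures quoted above.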
Recall and precision are defined as follows:

Recall = True positive / (True positive + False negative) × 100 [%]
Precision = True positive / (True positive + False positive) × 100 [%]

True positive, false positive and false negative are defined as follows:

True positive: the number of correctly extracted activities
False positive: the number of falsely extracted activities
False negative: the number of true activities not extracted

5. Conclusion

We proposed a framework for recognizing human activities by analyzing videos. The goal of our work is to automatically convert a video sequence into a symbolic sequence of actions and to extract frequently occurring human activities from the symbolic sequences. In the future, we plan to extract human activities directly from an action histogram sequence by taking the duration of actions into consideration and by permitting multiple activity candidates.

Figure 6. Human activities extracted by the proposed method: (a) drinking coffee, (b) removing headphones, (c) picking up tissues.

6. References

[1] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing Human Actions: A Local SVM Approach," ICPR, pp. 32-36, 2004.
[2] J.C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words," British Machine Vision Conference, pp. 1249-1258, 2006.
[3] Y. Ivanov and A. Bobick, "Recognition of Visual Activities and Interactions by Stochastic Parsing," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 852-872, 2000.
[4] D. Minnen, I. Essa, and T. Starner, "Expectation Grammars: Leveraging High-Level Expectations for Activity Recognition," CVPR, pp. 626-632, 2003.
[5] R. Hamid, S. Maddi, A. Bobick, and I. Essa, "Unsupervised Analysis of Activity Sequences Using Event-Motifs," VSSN, pp. 71-78, 2006.
[6] T. Hofmann, "Probabilistic Latent Semantic Indexing," SIGIR, pp. 50-57, 1999.
[7] J. Pei, J. Han, M. Behzad, and H. Pinto, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," ICDE, pp. 215-224, 2001.

Authors

Takuya Tonaru is a graduate student at Kobe University.

Tetsuya Takiguchi received the Dr. Eng. degree in information science from Nara Institute of Science and Technology, Nara, Japan, in 1999. From 1999 to 2004, he was a researcher at IBM Research, Tokyo Research Laboratory, Japan. He is currently a Lecturer with Kobe University. From May 2008 to September 2008 he was a visiting scholar at the University of Washington. His research interests include speech and image processing. He received the Awaya Award from the Acoustical Society of Japan in 2002. He is a member of the IEEE, the Information Processing Society of Japan, and the Acoustical Society of Japan.

Yasuo Ariki received his B.E., M.E. and Ph.D. in information science from Kyoto University in 1974, 1976 and 1979, respectively. He was an assistant professor at Kyoto University from 1980 to 1990, and stayed at Edinburgh University as a visiting academic from 1987 to 1990. From 1990 to 1992 he was an associate professor and from 1992 to 2003 a professor at Ryukoku University. Since 2003 he has been a professor at Kobe University. He is mainly engaged in speech and image recognition and is interested in information retrieval and databases. He is a member of IEEE, IPSJ, JSAI, ITE and IIEEJ.