Human Action Recognition Using Silhouette Histogram

Chaur-Heh Hsieh, *Ping S. Huang, and Ming-Da Tang
Department of Computer and Communication Engineering, Ming Chuan University, Taoyuan 333, Taiwan, ROC
*Department of Electronic Engineering, Ming Chuan University, Taoyuan 333, Taiwan, ROC
{hsiehch, pshuang}@mail.mcu.edu.tw, s766076@ss24.mcu.edu.tw

Abstract

This paper presents a method for human action recognition using silhouette histograms. The human silhouette is obtained by background subtraction and then mapped into three polar coordinate systems that characterize three parts of a human figure respectively. Each polar coordinate system is quantized by partitioning it into several cells with different radii and angles. Evaluation on the Weizmann dataset shows that our method achieves better performance than an existing silhouette-based system.

Keywords: silhouette, polar coordinate system, classifier.

1 Introduction

Human action recognition from videos is an important research area of computer vision. It can be applied to a variety of application domains, such as video surveillance, video retrieval and human-computer interaction systems. Many human action recognition methods have been proposed, and they are mainly divided into two kinds of approaches: shape-based and flow-based. The first utilizes human silhouettes as features, which are commonly extracted by background subtraction [1-5]. The second adopts motion information from human movement, such as optical flow [6-8]. Classification techniques commonly used for feature matching include Nearest-Neighbour (NN) [1, 6], Support Vector Machines (SVM) [9] and Hidden Markov Models (HMM) [2, 9, 10].

The silhouette-based method aims to recognize actions by characterizing the shapes of the actor's silhouette through space-time. This method is very popular since it is computationally efficient and robust to variations in clothing and lighting [11].
Nevertheless, it often suffers from artifacts such as shadows, which reduce the recognition accuracy. The flow-based method computes the optical flow field between adjacent frames and uses it as features for action recognition. This is suitable for recognizing small objects [6]. However, it is computationally expensive, and the optical flow field is a very coarse feature, so that different actions may exhibit similar flows over short periods of time [11].

Copyright 2011, Australian Computer Society, Inc. This paper appeared at the 34th Australasian Computer Science Conference (ACSC 2011), Perth, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 113. M. Reynolds, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

Bobick and Davis [2] stack all silhouettes of a sequence along the temporal axis to form the Motion-History Image (MHI). MHI suffers from self-occlusion for complex and highly varying movements. Chen et al. [3] propose a method using the star skeleton, which is obtained by connecting the centroid to the gross extremities of a human contour. Chuang et al. [4] and Hsieh et al. [5] present triangulation-based schemes for recognizing various human actions. Hsiao et al. [1] propose the temporal-state shape context (TSSC), which captures local space-time shape characteristics effectively. It eliminates the time-warping effects of local feature methods by segmenting the video sequence into several temporal states defined by soft fuzzy intervals. The distance between two action clips is defined as the sum of shape context distances over all frames. Besides estimating the distance between two actions, the TSSC method can also provide spatial consistency information. Motivated by the TSSC approach, this paper presents a human action recognition method using three silhouette histograms.
The human silhouette extracted by background subtraction is mapped into three different polar coordinate systems that characterize three parts of a human figure respectively. Each polar coordinate system is quantized by partitioning it into several cells with different radii and angles.

This paper is organized as follows. Section 2 introduces the proposed method. Experimental results and discussion are presented in Section 3. Finally, conclusions are given in Section 4.

2 Proposed Method

2.1 Overview of Proposed System

The proposed system includes four main processes, as shown in Figure 1. First, the human silhouette is extracted from the input video by a background subtraction method [12]. Then, the extracted silhouette is mapped into three polar coordinate systems that characterize three parts of a human figure respectively. The largest circle covers the motion of the whole human body, and the other two circles capture the effect that arms and legs have on the human action/silhouette; that is why their two centres are placed between the shoulders and between the hips, respectively. Each polar coordinate system is quantized by partitioning

it into several cells with different radii and angles. By counting the number of silhouette pixels falling into each cell at a particular frame, the silhouette histogram of that frame is obtained. By collecting a sequence of silhouette histograms, a descriptor for a video clip is generated and used to describe the human action. Based on the silhouette histogram descriptor, an action classifier is trained and then used to recognize the action of an input video clip.

Figure 1: System flow chart

2.2 Silhouette Extraction

We assume video sequences are captured by a still camera. Thus, background subtraction is applied to obtain the human silhouette in each frame. The well-known Gaussian Mixture Model (GMM) [12] is used to update the background. Background subtraction is given by

    O_t(x, y) = 1 if |F_t(x, y) - B_t(x, y)| > T, and 0 otherwise    (1)

where B_t(x, y) is the background image, F_t(x, y) is the input frame and T is a threshold. The background is refreshed by

    B_{t+1}(x, y) = (1 - α) B_t(x, y) + α F_t(x, y)    (2)

where α is the learning rate (0 < α < 1).

2.3 Polar Transform

To describe the human shape effectively, the Cartesian coordinate system is transformed into the polar coordinate system through the following equations:

    r_i = sqrt((x_i - x_c)^2 + (y_i - y_c)^2)    (3)

    θ_i = tan^{-1}((y_i - y_c) / (x_i - x_c))    (4)

where (x_i, y_i) is the coordinate of a silhouette pixel in the Cartesian coordinate system, and (r_i, θ_i) are the radius and angle in the polar coordinate system. The centre of the silhouette (x_c, y_c) is calculated by

    x_c = (1/N) Σ_i x_i,    y_c = (1/N) Σ_i y_i    (5)

where N is the total number of silhouette pixels. Existing approaches [1, 4, 5] often use a single polar coordinate system to describe the human posture. However, our investigation indicates that a single coordinate system is not sufficient to discriminate postures with small differences.
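The silhouette extraction and polar transform steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names, the default values of T and α, and the array layout are our own assumptions.

```python
import numpy as np

def extract_silhouette(frame, background, T=30):
    """Eq. (1): threshold |F_t - B_t| to obtain a binary silhouette mask."""
    return (np.abs(frame.astype(float) - background) > T).astype(np.uint8)

def refresh_background(background, frame, alpha=0.05):
    """Eq. (2): running average, B_{t+1} = (1 - alpha) * B_t + alpha * F_t."""
    return (1 - alpha) * background + alpha * frame

def polar_transform(mask):
    """Eqs. (3)-(5): map silhouette pixels to (r, theta) about the centroid."""
    ys, xs = np.nonzero(mask)                 # silhouette pixel coordinates
    xc, yc = xs.mean(), ys.mean()             # Eq. (5): centroid
    r = np.hypot(xs - xc, ys - yc)            # Eq. (3): radius
    theta = np.arctan2(ys - yc, xs - xc) % (2 * np.pi)  # Eq. (4), in [0, 2*pi)
    return r, theta, (xc, yc)
```

Note that `arctan2` is used instead of a plain `tan^{-1}` so the quadrant of each pixel is resolved unambiguously before the angle is wrapped into [0, 2π).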
In this work, we design a method containing three polar coordinate systems (three circles), defined as:

C1: the circle that encloses the whole human body.
C2: the circle that encloses the upper part of the body.
C3: the circle that encloses the lower part of the body.

To verify the effectiveness of the proposed three-circle method, we apply it to describe two different actions in the Weizmann dataset: wave1 (waving one hand) and wave2 (waving two hands), as shown in Figure 2. The Weizmann dataset is widely used in related research. The silhouette histograms obtained by C1 and C2 are shown in Figure 3 and Figure 4, respectively. The two histograms from C1 are very similar, so their discriminability between action types is poor. On the contrary, the two histograms from C2 show superior discriminability.

2.4 Histogram Computation

To generate the histogram of each polar coordinate system, we partition each polar coordinate space into K cells by uniformly dividing the radius into n parts and the angle into m orientations, such that K = m * n. A K-bin histogram is then obtained by

    H(k) = #{P_i | (r_i, θ_i) of pixel P_i falls in cell k}    (6)

where the ranges of θ_i and r_j are defined by

    θ_i ∈ [2πi/m, 2π(i+1)/m),  i = 0, 1, ..., m-1

and

    r_j ∈ [j·r_max/n, (j+1)·r_max/n),  j = 0, 1, ..., n-1.

Here r_max is the radius of circle C1, C2, or C3.

Figure 2: The actions of waving one hand (wave1) and waving two hands (wave2)

Figure 3: Silhouette histograms obtained by C1 for two waving actions.
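The cell quantization of Eq. (6) can be sketched as below. This is a hedged illustration under our own conventions: pixels at or beyond r_max are clipped into the outermost radial band, and the per-silhouette normalization described in Section 3 is folded in here for convenience.

```python
import numpy as np

def silhouette_histogram(r, theta, r_max, m=8, n=3):
    """Eq. (6): count pixels in m angular x n radial cells (K = m*n bins)."""
    ang = np.minimum((theta / (2 * np.pi) * m).astype(int), m - 1)
    rad = np.minimum((r / r_max * n).astype(int), n - 1)   # clip outliers
    hist = np.zeros(m * n)
    np.add.at(hist, ang * n + rad, 1)      # accumulate pixel counts per cell
    total = hist.sum()
    return hist / total if total > 0 else hist   # normalize by pixel count
```

With the paper's m = 8 and n = 3 this yields the 24-bin histogram used for each circle.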

Figure 5 shows three frames of different actions, each mapped onto the three polar coordinate systems. Each histogram count is divided by the total number of pixels falling into that coordinate space; this normalization avoids bias between silhouettes of different sizes. Finally, the normalized histograms are concatenated into a feature vector that describes the human posture at a particular time instant.

3 Experimental Results

Figure 4: Silhouette histograms obtained by C2 for two waving actions.

Our algorithm is evaluated on the Weizmann dataset, which includes nine persons performing ten actions. Figure 6 shows 8 sample frames with different actions from this dataset. The result is compared with that of the TSSC method [1]. The video format is 180 × 144 at a frame rate of 25 fps. We adopt the nearest-neighbour classifier to identify the action types. The leave-one-out [13] evaluation method is used: one clip of a video is used for testing, and the remaining clips of all videos form the training set. This process is repeated until all video clips have been tested. A video clip is a fixed-length segment of video data, say ten frames. The clips are generated by dividing a complete video into a designated number of clips. Neighbouring clips can overlap by a fixed length, called the overlapped length. In our experiments, we set the clip length to 10 frames and the overlapped length to 5 frames; in total, 96 clips are generated from 90 videos. Therefore, the feature dimension for each video clip of 10 frames is 720 (24 × 3 × 10). To improve the computational efficiency of classification, principal component analysis (PCA) is adopted to reduce the dimension of the feature vector. The recognition performance is evaluated with the confusion matrices shown in Table 1 and Table 2, generated by the TSSC method and our proposed method respectively. The results are obtained with a clip length of 10 and an overlapped length of 5.
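The clip construction and classification stage described above can be sketched as follows. The clip length, overlap and 720-D feature dimension follow the paper, but the PCA and nearest-neighbour implementations here are minimal stand-ins, not the authors' code.

```python
import numpy as np

def make_clips(frame_feats, clip_len=10, overlap=5):
    """Slide a clip_len-frame window with the given overlap and flatten
    each clip into one feature vector (e.g. 10 x 72 -> 720-D)."""
    step = clip_len - overlap
    return np.array([frame_feats[i:i + clip_len].ravel()
                     for i in range(0, len(frame_feats) - clip_len + 1, step)])

def pca_fit(X, k):
    """Return the mean and top-k principal axes of the clip features."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def nn_classify(train_X, train_y, x):
    """Nearest-neighbour rule: label of the closest training clip."""
    return train_y[np.argmin(np.linalg.norm(train_X - x, axis=1))]
```

A test clip would be projected with the fitted PCA axes, i.e. `(x - mu) @ axes.T`, before being passed to `nn_classify`.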
Figure 5: Human silhouettes of different actions, each mapped onto three polar coordinate systems

Generally, the recognition accuracy increases with the number of cells. However, more cells also raise the time complexity. In our experiments we chose m = 8 and n = 3, so a 24-bin histogram is obtained for each polar coordinate system.

The procedure for calculating the silhouette histograms can be organized into the following steps. First, we compute the centre of the human silhouette and divide the silhouette into an upper part and a lower part according to the centre position. The centres of the upper and lower silhouettes are then computed individually, and these three centre positions are taken as the origins of the respective polar coordinate systems. Second, we compute the height of each human silhouette, which is used to calculate the radius of C1; the radius of C2 or C3 is half the radius of C1. Third, we apply Equation (6) to compute three histograms separately for each human silhouette.

Figure 6: Weizmann dataset

Table 1: Confusion matrix of the TSSC method (NN)
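The three-circle construction in the steps above can be sketched as follows. This is a hedged sketch: the function name is our own, and taking the C1 radius as half the silhouette height is our assumption, since the paper only states that the radius is derived from the height.

```python
import numpy as np

def three_circles(mask):
    """Return the three polar origins (C1, C2, C3) and their radii
    for a binary silhouette mask."""
    ys, xs = np.nonzero(mask)
    yc = ys.mean()                          # overall centre height
    height = ys.max() - ys.min() + 1
    r1 = height / 2.0                       # assumed: C1 radius from height
    upper = ys <= yc                        # split at the overall centre
    centres = [(xs.mean(), yc),             # C1: whole body
               (xs[upper].mean(), ys[upper].mean()),    # C2: upper part
               (xs[~upper].mean(), ys[~upper].mean())]  # C3: lower part
    radii = [r1, r1 / 2.0, r1 / 2.0]        # C2, C3 use half of C1's radius
    return centres, radii
```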

Table 2: Confusion matrix of our method (NN)

For the TSSC method, 45 clips are identified incorrectly, giving a recognition rate of 95.3%, whereas our method achieves a higher recognition accuracy of 98.33%. Notice that the bend and skip actions cause more errors for both methods. We have also investigated the effects of the overlapped length and the clip length; the results are shown in Table 3 and Table 4, respectively. Table 3 indicates that the more frames are overlapped, the better the recognition accuracy both methods achieve. The same conclusion holds for the clip length. The results in the two tables show that our proposed method achieves better recognition accuracy under various settings of clip length and overlapped length.

    Overlapped frames   TSSC     Proposed method
    7                   97.96%   98.9%
    5                   95.3%    98.33%
    3                   92.90%   96.59%

Table 3: Recognition rates under different numbers of overlapped frames

    Clip length   TSSC     Proposed method
    10            95.3%    98.33%
    15            97.35%   99.08%
    20            98.59%   99.48%

Table 4: Recognition rates under different clip lengths

Table 5: Confusion matrix of the TSSC method (SVM)

Table 6: Confusion matrix of our method (SVM)

We argue that the leave-one-out test on clips is not entirely objective, because parts of a testing video clip are included in the training set. We therefore design a test method called leave-n-out: one video is held out for testing and the remaining videos are used for training. Then n clips are extracted from the test video, each clip is classified separately, and a majority voting scheme gives the final decision. Table 5 shows the TSSC result using the leave-n-out method and Table 6 shows the result of our method. The results indicate that the recognition accuracy of TSSC degrades significantly, while our method maintains superior performance.
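The leave-n-out decision rule described above can be sketched as a small helper. The clip classifier is passed in as a callable; the function name is our own.

```python
from collections import Counter

def leave_n_out_decision(test_clips, classify):
    """Classify each clip of the held-out video independently, then
    take a majority vote over the per-clip labels."""
    votes = [classify(clip) for clip in test_clips]
    label, _ = Counter(votes).most_common(1)[0]
    return label
```

In the paper's protocol, `classify` would be the nearest-neighbour (or SVM) clip classifier trained on all remaining videos.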
4 Conclusions

In this paper, we have developed an action recognition method that employs three polar coordinate systems to characterize different parts of the human posture. The human silhouette is mapped into a feature vector containing three histograms. Experimental results show that our method obtains superior performance to the TSSC method: on the Weizmann dataset, the proposed method achieves a recognition rate of 98.33%. However, like other silhouette-based approaches, the proposed method requires the extraction of a complete silhouette. Although our method improves robustness to incomplete silhouettes, the problem has not been completely solved. Furthermore, since the experiments were conducted in a well-controlled environment, more realistic outdoor scenes need to be tested. These issues will be investigated in future work.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments and suggestions. This research is supported in part by the National Science Council, Taiwan, under grant NSC 99-2632-E-30-00-MY3.

5 References

[1] Hsiao, P.C., Chen, C.S. and Chang, L.W. (2008): Human action recognition using temporal-state shape context. Proc. International Conference on Pattern Recognition, 1-4.

[2] Bobick, A.F. and Davis, J.W. (2001): The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23: 257-267.

[3] Chen, H.S., Chen, H.T., Chen, Y.W. and Lee, S.Y. (2006): Human action recognition using star skeleton. Proc. 4th ACM International Workshop on Video Surveillance and Sensor Networks, 171-178.

[4] Chuang, C.C., Hsieh, J.W., Tsai, L.W. and Fan, K.C. (2008): Human action recognition using star templates and Delaunay triangulation. Proc. International Conference on Intelligent Information Hiding and Multimedia Signal Processing, 179-182.

[5] Hsieh, J.W., Hsu, Y.T., Liao, H.Y. Mark and Chen, C.C. (2008): Video-based human movement analysis and its application to surveillance systems. IEEE Transactions on Multimedia, 10: 372-384.

[6] Efros, A.A., Berg, A.C., Mori, G. and Malik, J. (2003): Recognizing action at a distance. Proc. Ninth IEEE International Conference on Computer Vision, 726-733.

[7] Liu, C., Yang, Y. and Chen, Y. (2009): Human action recognition using sparse representation. Proc. IEEE International Conference on Intelligent Computing and Intelligent Systems, 184-188.

[8] Liu, J., Ali, S. and Shah, M. (2008): Recognizing human actions using multiple features. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1-8.

[9] Chang, C.C. and Lin, C.J. (2001): LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[10] Petkovic, M., Jonker, W. and Zivkovic, Z. (2001): Recognizing strokes in tennis videos using Hidden Markov Models. Proc. International Conference on Visualization, Imaging and Image Processing, Marbella, Spain.

[11] Ke, Y., Sukthankar, R. and Hebert, M. (2007): Spatio-temporal shape and flow correlation for action recognition. Proc. Visual Surveillance Workshop.

[12] Benezeth, Y., Jodoin, P., Emile, B., Laurent, H. and Rosenberger, C. (2008): Review and evaluation of commonly-implemented background subtraction algorithms. Proc. International Conference on Pattern Recognition, 1-4.

[13] Sun, X., Chen, M. and Hauptmann, A. (2009): Action recognition via local descriptors and holistic features. Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 58-65.