QMUL-ACTIVA: Person Runs detection for the TRECVID Surveillance Event Detection task

Similar documents
Evaluation of Local Space-time Descriptors based on Cuboid Detector in Human Action Recognition

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos

Evaluation of local descriptors for action recognition in videos

An evaluation of local action descriptors for human action classification in the presence of occlusion

IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim

G3D: A Gaming Action Dataset and Real Time Action Recognition Evaluation Framework

Lecture 18: Human Motion Recognition

Human Focused Action Localization in Video

CS229: Action Recognition in Tennis

An Efficient Part-Based Approach to Action Recognition from RGB-D Video with BoW-Pyramid Representation

Observing people with multiple cameras

Action Recognition in Low Quality Videos by Jointly Using Shape, Motion and Texture Features

Fast Realistic Multi-Action Recognition using Mined Dense Spatio-temporal Features

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions

Action Recognition in Video by Sparse Representation on Covariance Manifolds of Silhouette Tunnels

MoSIFT: Recognizing Human Actions in Surveillance Videos

Human Activity Recognition Using a Dynamic Texture Based Method

Features Preserving Video Event Detection using Relative Motion Histogram of Bag of Visual Words

EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION. Ing. Lorenzo Seidenari

WP1: Video Data Analysis

A Motion Descriptor Based on Statistics of Optical Flow Orientations for Action Classification in Video-Surveillance

Automatic visual recognition for metro surveillance

Learning Realistic Human Actions from Movies

Content-based image and video analysis. Event Recognition

Human Motion Detection and Tracking for Video Surveillance

Beyond Actions: Discriminative Models for Contextual Group Activities

Person Action Recognition/Detection

People detection in complex scene using a cascade of Boosted classifiers based on Haar-like-features

Space-Time Shapelets for Action Recognition

Adaptive Action Detection

Leveraging Textural Features for Recognizing Actions in Low Quality Videos

A Survey of Computer Vision based methods for Violent action Representation and Recognition

ETISEO, performance evaluation for video surveillance systems

Categorizing Turn-Taking Interactions

Learning realistic human actions from movies

Temporal Feature Weighting for Prototype-Based Action Recognition

Action Recognition with HOG-OF Features

EigenJoints-based Action Recognition Using Naïve-Bayes-Nearest-Neighbor

A Two-stage Scheme for Dynamic Hand Gesture Recognition

Exploiting Spatio-Temporal Scene Structure for Wide-Area Activity Analysis in Unconstrained Environments

Large-scale Video Classification with Convolutional Neural Networks

Multimodal Ranking for Non-Compliance Detection in Retail Surveillance

Seminar Heidelberg University

Action recognition in videos

Human Action Recognition in Videos Using Hybrid Motion Features

HOG-based Pedestrian Detector Training

Person Re-identification for Improved Multi-person Multi-camera Tracking by Continuous Entity Association

Velocity adaptation of space-time interest points

Visual Action Recognition

AN ENHANCED ATTRIBUTE RERANKING DESIGN FOR WEB IMAGE SEARCH

CAP 6412 Advanced Computer Vision

Mobile Visual Search with Word-HOG Descriptors

Exploiting Multi-level Parallelism for Low-latency Activity Recognition in Streaming Video

View Invariant Movement Recognition by using Adaptive Neural Fuzzy Inference System

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Middle-Level Representation for Human Activities Recognition: the Role of Spatio-temporal Relationships

Action Recognition by Learning Mid-level Motion Features

Spatial-Temporal correlatons for unsupervised action classification

Real-time multi-view human action recognition using a wireless camera network

Spatio-temporal Shape and Flow Correlation for Action Recognition

Action Recognition & Categories via Spatial-Temporal Features

Segmenting Objects in Weakly Labeled Videos

Multimedia Information Retrieval

A Large-scale Benchmark Dataset for Event Recognition in Surveillance Video

Robust Classification of Human Actions from 3D Data

Object Detection in a Fixed Frame

Video Alignment. Literature Survey. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin

Human Daily Action Analysis with Multi-View and Color-Depth Data

LIG and LIRIS at TRECVID 2008: High Level Feature Extraction and Collaborative Annotation

Three-Dimensional Object Detection and Layout Prediction using Clouds of Oriented Gradients

Functionalities & Applications. D. M. Gavrila (UvA) and E. Jansen (TNO)

Columbia University High-Level Feature Detection: Parts-based Concept Detectors

Bag of Optical Flow Volumes for Image Sequence Recognition 1

2010 ICPR Contest on Semantic Description of Human Activities (SDHA) M. S. Ryoo* J. K. Aggarwal Amit Roy-Chowdhury

TEVI: Text Extraction for Video Indexing

Real-Time Action Recognition System from a First-Person View Video Stream

Consumer Video Understanding

UNDERSTANDING human actions in videos has been

Object Recognition 1

Person identification from spatio-temporal 3D gait

Detection and recognition of moving objects using statistical motion detection and Fourier descriptors

Human activity recognition in the semantic simplex of elementary actions

Motion Interchange Patterns for Action Recognition in Unconstrained Videos

Dynamic Human Shape Description and Characterization

C. Premsai 1, Prof. A. Kavya 2 School of Computer Science, School of Computer Science Engineering, Engineering VIT Chennai, VIT Chennai

Minimizing hallucination in Histogram of Oriented Gradients

Action Recognition Using Hybrid Feature Descriptor and VLAD Video Encoding

Human Action Recognition Using Silhouette Histogram

P-CNN: Pose-based CNN Features for Action Recognition. Iman Rezazadeh

Multiple-Person Tracking by Detection

Activity Representation Using 3D Shape Models

Learning discriminative space-time actions from weakly labelled videos

Incremental Action Recognition Using Feature-Tree

Human Action Classification

Optimizing Trajectories Clustering for Activity Recognition

Trademark Matching and Retrieval in Sport Video Databases

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Transcription:

QMUL-ACTIVA: Person Runs detection for the TRECVID Surveillance Event Detection task

Fahad Daniyal and Andrea Cavallaro
Queen Mary University of London
Mile End Road, London E1 4NS (United Kingdom)
{fahad.daniyal,andrea.cavallaro}@eecs.qmul.ac.uk
http://www.eecs.qmul.ac.uk/~andrea/

(This work was supported in part by the EU under the FP7 project APIDIS (ICT-216023).)

Abstract: We discuss an event detection approach based on local feature modeling, built on spatio-temporal cuboids and perspective normalization (QMUL-ACTIVA 3 / p-baseline 1). Motion information is compared against examples of events learned from a training dataset to define a similarity measure. This similarity measure is then analyzed in both space and time to identify frames containing instances of the event of interest (a person running in an airport building). Features are analyzed locally to enable the differentiation of simultaneously occurring events in different portions of an image frame. The performance is quantified on the TRECVID 2010 surveillance event detection dataset.

Description

Event recognition in video has gained much attention in automated video surveillance, content understanding and ranking [1, 2, 3]. The events to be identified are characterized by the individual actions of a target or by its interactions with the environment or other targets. A considerable amount of work has been dedicated to event detection in simple datasets (e.g. KTH [4] and Weizmann [5]), whereas more recent works focus on real-world scenarios [1, 6, 7]. These scenarios are often characterized by different levels of complexity in the scene due to crowdedness, clutter or unfavorable camera placement. Under such constraints, several state-of-the-art methods underperform, as demonstrated in various trials of [7].

Most event recognition approaches can be divided into two main steps, namely feature extraction and classification.
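To make the classification half of this pipeline concrete, the sketch below scores frame-level motion features against learned "Person Runs" exemplars and reports runs of consecutive high-similarity frames as event instances. This is a minimal illustration, not the authors' implementation: the function names, the negative-exponential similarity kernel and the thresholds are all assumptions.

```python
import numpy as np

def exemplar_similarity(feature, exemplars):
    """Similarity of one feature vector to the closest learned exemplar
    (negative exponential of the L2 distance; exemplars is an N x D array)."""
    d = np.linalg.norm(exemplars - feature, axis=1)
    return np.exp(-d.min())

def detect_event_frames(frame_features, exemplars, threshold=0.5, min_run=5):
    """Report maximal runs of at least `min_run` consecutive frames whose
    similarity to the trained event model exceeds `threshold`."""
    sims = np.array([exemplar_similarity(f, exemplars) for f in frame_features])
    detections, start = [], None
    for t, above in enumerate(sims > threshold):
        if above and start is None:
            start = t                      # run of high-similarity frames begins
        elif not above and start is not None:
            if t - start >= min_run:
                detections.append((start, t))
            start = None
    if start is not None and len(sims) - start >= min_run:
        detections.append((start, len(sims)))  # run extends to the last frame
    return detections
```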

Table 1: Results of the Person Runs event detection for the dry-run and the evaluation datasets. (Key: Ref: number of ground-truth (GT) events; Sys: number of events generated by the proposed method; TP: number of correct detections; FP: number of false positives; FN: number of missed detections; DCR: Detection Cost Rate.)

                    Ref   Sys   TP    FP   FN   Actual DCR   Minimum DCR
Dry run              63   476   34   215   29        0.531         0.307
Dry run v1           63   327   49   134   14        0.290         0.273
Final submission    107   360   36   223   71        0.737         0.628

Feature extraction involves the conversion of video data into a set of representative features that describe the event of interest, whilst the classification step compares these features against models generated from known labeled samples. The proposed approach analyzes event candidates generated from motion vectors. For feature extraction we employ a 2D macro-block grid on pairs of frames; the blocks are then merged into groups using contextual information. We then perform a spatio-temporal analysis to make a final decision.

To quantify the performance of the proposed method we test it on the TRECVID surveillance event detection dataset [7] for the Person Runs event. This dataset contains dense scenes, clutter, considerable variations in viewpoint and high variability among different instances of the same action. We discuss the results on the dry-run and on the final-submission datasets.

The dry-run dataset is composed of 54,890 seconds of video at 25 fps. For this run we used around 2300 instances of people running in different locations of the scene. The results of this evaluation, as reported by the National Institute of Standards and Technology (NIST), are given in Table 1 (top row). The proposed system produced a relatively high number of detections (Sys = 476), of which 227 had a low confidence and were not included in the computation of the Actual DCR. However, when analyzing the results, we noticed that some sample events taken from the ground truth of the training set were not reliable, because the corresponding motion vectors were not sufficiently clear to trigger a detection. Hence, as a second stage, we carried out a manual verification to select the sample events to be included in the training. Moreover, the formulation of the proposed method was changed so that events that occurred simultaneously but were spatially separated could be detected as separate events. The results of these improvements (Dry run v1) are shown in Table 1 (middle row). By improving the quality of the training data and allowing for the simultaneous occurrence of events of interest, the number of correct detections increased (by 15) and the number of false positives decreased (by 81). The scores for Dry run v1 were generated in-house using the TRECVID Framework for Detection Evaluations (F4DE, http://www.itl.nist.gov/iad/mig/tools/).
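The macro-block feature step can be pictured as follows: dense optical flow between a frame pair is averaged per block, with a per-row gain standing in for the perspective normalization mentioned in the abstract. This is a sketch under stated assumptions: Farneback flow and the linear depth gain are illustrative choices, not the paper's actual estimator.

```python
import cv2
import numpy as np

def block_motion_features(prev_gray, curr_gray, block=16):
    """Mean optical flow per macro-block between two grayscale frames.
    Farneback dense flow stands in for the motion estimator; the per-row
    gain is a crude perspective normalization (assumes apparent speed
    shrinks roughly linearly with distance from the camera)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    rows, cols = h // block, w // block
    feats = np.zeros((rows, cols, 2), dtype=np.float32)
    for r in range(rows):
        gain = h / float(block * (r + 1))  # amplify far-field (upper) rows
        for c in range(cols):
            patch = flow[r * block:(r + 1) * block, c * block:(c + 1) * block]
            feats[r, c] = patch.reshape(-1, 2).mean(axis=0) * gain
    return feats
```

For reference, the DCR values in Table 1 can be approximately recomputed from the raw counts, assuming the standard TRECVID SED cost parameters (CostMiss = 10, CostFA = 1, RTarget = 20 events/hour). NIST's scorer additionally handles no-score regions, so in-house recomputations may differ slightly.

```python
def detection_cost_rate(n_ref, n_fp, n_fn, hours,
                        cost_miss=10.0, cost_fa=1.0, r_target=20.0):
    """TRECVID SED cost metric: DCR = P_miss + beta * R_FA, where
    P_miss = FN/Ref, R_FA = FP/hours, beta = cost_fa / (cost_miss * r_target)."""
    p_miss = n_fn / n_ref
    r_fa = n_fp / hours
    return p_miss + (cost_fa / (cost_miss * r_target)) * r_fa

# Dry-run row of Table 1: 54,890 s of video, Ref = 63, FP = 215, FN = 29
print(detection_cost_rate(63, 215, 29, hours=54890 / 3600.0))  # ~0.53
```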

Figure 1: Examples of true positive detections for the Person Runs event. Row 1: camera 1; rows 2-3: camera 2; rows 4-5: camera 3; rows 6-7: camera 5. No Person Runs event is present in camera 4.

Figure 2: Examples of missed events due to an occlusion (a) and to motion direction away from the camera (b).

Figure 3: Examples of simultaneous events that are close to each other and therefore detected as a single event.

In the final stage of testing (final submission) we submitted the results of the proposed method on approximately 44 hours of test data. The number of training samples was increased to 4753. The results of this final submission, as reported by NIST, are shown in Table 1 (bottom row). Examples of detected running people (true positives) are shown in Fig. 1. The proposed method is able to detect some difficult events: in the first row (middle image) and in row 3 (last image), for example, we detect a running kid who is partially occluded. In the second and third rows, we detect events that happen without significant changes in scale. However, when a person is running away from or towards the camera (Fig. 2(a)), the magnitude of the motion vectors is not large enough and we are not able to discriminate a running person from a walking person. Similarly, people that are in the far field and are similar in appearance to the background do not generate any significant motion vectors and hence are not detected (Fig. 2(b)). Finally, although the proposed method can separate events happening simultaneously, when people are running very close to each other we are at the moment unable to separate the two events (Fig. 3). This limitation could be overcome by performing object detection in the frames of interest.
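The "spatially separated simultaneous events" refinement can be sketched as a connected-component grouping over the macro-block grid; the dilation step also illustrates why two people running very close together fuse into one detection (Fig. 3). The use of scipy.ndimage and the gap parameter are illustrative assumptions, not the authors' formulation.

```python
import numpy as np
from scipy import ndimage

def split_simultaneous_events(active_blocks, min_gap_blocks=2):
    """Group event-candidate macro-blocks into spatially separated events.
    `active_blocks` is a boolean grid of blocks whose similarity exceeded
    the detection threshold. Components closer than `min_gap_blocks` are
    merged first, which mirrors the reported limitation that people
    running very close to each other collapse into a single event."""
    grown = ndimage.binary_dilation(active_blocks, iterations=min_gap_blocks)
    labels, n_events = ndimage.label(grown)
    events = []
    for k in range(1, n_events + 1):
        ys, xs = np.where((labels == k) & active_blocks)
        if ys.size:  # bounding box (in block units) of the original active blocks
            events.append((ys.min(), xs.min(), ys.max(), xs.max()))
    return events
```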

Summary

We presented a technique for automatically detecting people running in real-world scenarios. The proposed approach is based on motion features and on spatio-temporal modeling. We demonstrated the performance of the proposed approach on the TRECVID 2010 airport dataset with very encouraging results. As future work, we will apply the proposed method to other types of events, such as people jumping, standing up, sitting down or moving in the opposite direction to the normal flow.

References

[1] W. Chiu and D. Tsai, "A macro-observation scheme for abnormal event detection in daily-life video sequences," EURASIP Journal on Advances in Signal Processing, 2010.
[2] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, Jun. 2008, pp. 1-8.
[3] F. Daniyal, M. Taj, and A. Cavallaro, "Content-aware ranking of video segments," in Proc. of ACM/IEEE Int. Conf. on Distributed Smart Cameras, Sep. 2008, pp. 1-9.
[4] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proc. of Int. Conf. on Pattern Recognition, 2004, pp. 32-36.
[5] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Proc. of IEEE Int. Conf. on Computer Vision, Oct. 2005, pp. 1395-1402.
[6] M. Chen, L. Mummert, P. Pillai, A. Hauptmann, and R. Sukthankar, "Exploiting multi-level parallelism for low-latency activity recognition in streaming video," in Proc. of ACM Conf. on Multimedia Systems, 2010, pp. 1-12.
[7] A. Smeaton, P. Over, and W. Kraaij, "Evaluation campaigns and TRECVID," in Proc. of ACM Int. Workshop on Multimedia Information Retrieval, Oct. 2006, pp. 321-330.