To Appear in ACCV-98, Mumbai, India. Material Subject to ACCV Copyrights.

Visual Surveillance of Human Activity

Larry Davis 1, Sandor Fejes 1, David Harwood 1, Yaser Yacoob 1, Ismail Haritaoglu 1, Michael J. Black 2

1 Computer Vision Laboratory, University of Maryland, College Park, MD 20742
2 Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304
lsd, fejes, harwood, yaser, hismail@umiacs.umd.edu, black@parc.xerox.com

1 Introduction

In this paper we provide an overview of recent research conducted at the University of Maryland's Computer Vision Laboratory on problems related to surveillance of human activities. Our research is motivated by considerations of a ground-based mobile surveillance system that monitors an extended area for human activity. During motion, the surveillance system must detect other moving objects and identify them as humans, animals, or vehicles. When one or more persons are detected, their movements need to be analyzed to recognize the activities in which they are involved. Ideally, the surveillance system would be able to accomplish this even while continuing to move; alternatively, the system could stop and stare at the part of the scene containing people.

In Section 2 we describe a novel approach to the problem of detecting independently moving objects from a moving ground camera, and illustrate the approach on sequences taken in very cluttered environments. Current research focuses on the problem of classifying those independently moving objects as people based on a combination of their appearance and movement. In Section 3 we describe a system that can track multiple moving people using sequences taken from a stationary camera. This system of algorithms, which has been implemented on a PC and can process 10-30 frames per second (depending on the number of people within the field of view and the resolution of the imagery), uses a hierarchy of tracking modules to identify and follow people's heads, torsos, feet, ... Finally, in Section 4 we explain how the recovered motion of these people can be classified into various activity classes using a principal component model of the time variation of the motion of the body parts.

The support of the Defense Advanced Research Projects Agency (ARPA Order No. C635) and the Office of Naval Research (grant N000149510521) is gratefully acknowledged.

2 Detecting independent motion using directional motion estimation

This section briefly describes an application of the theory developed in [2] to the problem of detecting independent motion in long image sequences. The approach, which is based on two simple geometric observations about directional components of flow fields, allows general camera motion, a large camera field of view (FOV), and scenes with large depth variation; no point correspondences are required. Due to the projection method, the original problem of detecting independent motion is reduced to a combination of robust line fitting and one-dimensional search. More details about the method can be found in [1, 3].

2.1 Properties of directional components of visual displacement fields

Given an optical flow field, we construct a scalar field by projecting the optical flow vectors onto a given projection direction. Our approach to motion estimation is then based on analyzing cross-sections of this projected flow field; in particular, cross-sections both parallel and orthogonal to the chosen projection direction. This analysis leads to recovering the projections of the camera motion, which we call the directional motion parameters. In the simple case of a narrow-FOV camera, the rotational projected flow is constant along the parallel cross-sections and varies linearly along the orthogonal cross-sections; we call this the linearity property. Since the projected translational flow is zero at the projection of the FOE on any parallel cross-section, the second observation leads to what we call the divergence property: points to the "left" of the projected FOE in a parallel cross-section have projected flow less than the flow at the projected FOE, and points to the "right" have greater projected flow.

Since we do not know, a priori, what the projected rotational flow is, we estimate at each point along every parallel cross-section the flow value which best satisfies the divergence property. This is essentially equivalent to estimating a flow value that minimizes negative depth [6] along that parallel cross-section. Orthogonal cross-sections of these new projected flow values are then constructed. Finally, using the linearity property, the projected rotation parameters are estimated by finding the orthogonal cross-section on which the projected (new) flow values are best fit by a linear model. Extensions of the algorithm to large FOVs (accomplished by embedding it into a recursive derotation framework) or very small FOVs (in which the divergence property of the projected flow cannot be used) can be easily achieved. In any event, once the projected motion parameters are estimated, we know by the epipolar constraint that along any parallel cross-section, all points to the left of the projected FOE should have projected motion less than the final estimate of rotation at the projected FOE obtained from the linear model, and all points to the right should have greater projected flow.
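The projection step itself is simple to state in code. Below is a minimal numpy sketch, assuming a dense flow field given as arrays u and v and, for readability, a horizontal projection direction so that parallel cross-sections are image rows and orthogonal cross-sections are image columns; the function names are illustrative, not from the paper.

```python
import numpy as np

def project_flow(u, v, theta):
    """Project a dense flow field (u, v) onto the unit direction
    (cos(theta), sin(theta)), yielding the scalar projected-flow field."""
    return u * np.cos(theta) + v * np.sin(theta)

def cross_sections(p):
    """Cross-sections of the projected field for a horizontal
    projection direction (theta = 0): parallel cross-sections are
    image rows, orthogonal cross-sections are image columns."""
    parallel = [p[r, :] for r in range(p.shape[0])]
    orthogonal = [p[:, c] for c in range(p.shape[1])]
    return parallel, orthogonal
```

On these cross-sections the linearity and divergence properties become one-dimensional tests, which is what makes the subsequent fitting and search inexpensive.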

2.2 The detection algorithm

The ability to verify the epipolar constraint for arbitrary flow fields using only low-dimensional projections of the original flow field provides a simple basis for detecting independently moving objects. For this purpose we need to incorporate only one, or a small collection, of directional components of the flow field. The image locations where the linearity and divergence constraints of the projections are violated are considered regions with independent motion.

Fig. 1. Detection of moving people (top and bottom) and a moving vehicle (middle) from a hand-carried camera. Each row illustrates two (non-consecutive) frames of long image sequences taken several seconds apart.

In practice, one must take into account the fact that the linear fitting process used to estimate the projected rotation parameters must be robust both to measurement error in flow estimation and to errors introduced by the presence of independently moving points; and, in the detection of independently moving points, one must take into account measurement error in flow estimation. The first problem is addressed using robust line fitting, in which the parallel cross-sections corrupted by independent motion are eliminated from the fit by a repeated-median-based robust line estimator [7]. The second problem is addressed by first assuming that the parallel cross-sections included in the robust line fit ("inliers") in fact do not include any independently moving points. This allows us to identify detection thresholds that adapt to changes in imaging conditions. Intuitively, for each parallel inlier cross-section we find the projected flow vector which most violates the assumption that no pixel along an inlier cross-section is moving independently. We then consider the worst violator amongst all the inlier cross-sections, and use the magnitude of the difference in flow between that pixel and the flow at the projected FOE on that cross-section as our adaptive threshold. This threshold is then applied to the remaining "outlier" cross-sections. This simple automatic adaptive thresholding procedure provides a good trade-off between sensitive detection and a low false alarm rate, and significantly improves on detection algorithms that apply fixed thresholds.
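The repeated-median estimator cited as [7] has a compact closed form, and the adaptive threshold can be computed directly from the inlier cross-sections. The numpy sketch below shows both; treating the largest magnitude of deviation from the flow at the projected FOE as the "worst violator" is a simplification of the divergence-based test described above, and all names are illustrative.

```python
import numpy as np

def repeated_median_line(x, y):
    """Siegel's repeated-median line estimator [7], robust to up to
    50% outliers: slope = med_i med_{j!=i} (y_j - y_i) / (x_j - x_i)."""
    n = len(x)
    slopes = np.empty(n)
    for i in range(n):
        dx = np.delete(x, i) - x[i]
        dy = np.delete(y, i) - y[i]
        ok = dx != 0                      # skip coincident abscissae
        slopes[i] = np.median(dy[ok] / dx[ok])
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)
    return slope, intercept

def adaptive_threshold(inlier_sections, foe_flow):
    """Worst deviation, over all inlier cross-sections, between a
    pixel's projected flow and the flow at that section's projected
    FOE; this magnitude is then applied as the detection threshold
    on the outlier cross-sections."""
    return max(np.max(np.abs(sec - f))
               for sec, f in zip(inlier_sections, foe_flow))
```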

In order to further improve the reliability and robustness of the algorithm, frame-by-frame instantaneous detections need to be integrated over both space and time. We employ temporal integration over motion trajectories, using tracking to verify detections and eliminate short-term drop-outs. Finally, a spatial integration groups the independently moving pixels that pass the temporal analysis, based on coherence in location and velocity.

Figure 1 shows three examples of detecting independent motion from a hand-carried camera. The camera FOV is relatively large (55°) while the scenes contain differing degrees of depth variation. In all examples the primary input for our algorithm was the simple normal flow as a particular directional component of the flow field. In each frame, dark pixels indicate local detections verified by the temporal filter. The highlighted bounding boxes represent groupings of these detections using spatial and velocity coherence. Current research focuses on characterizing the appearance and motion of independently moving objects in order to classify them as people, vehicles, etc.

3 The W4 System

W4 is a real-time system for tracking people and their body parts in monochromatic imagery. It constructs dynamic models of people's movements to answer questions about what they are doing, and where and when they act. It constructs appearance models of the people it tracks so that it can track people (who?) through occlusion events in the imagery. In this section we describe the computational models employed by W4 to detect and track people and their parts. These models are designed to overcome the inevitable errors and ambiguities that arise in dynamic image analysis, including instability in segmentation processes over time, splitting of objects due to coincidental alignment of object parts with similarly colored background regions, etc.

W4 has been designed to work with only monochromatic video sources, either visible or infrared. While most previous work on detection and tracking of people has relied heavily on color cues, W4 is designed for outdoor surveillance tasks, and particularly for night-time or other low-light situations. In such cases, color will not be available, and people need to be detected and tracked based on weaker appearance and motion cues. W4 is a real-time system. It is currently implemented on a dual-processor Pentium PC and can process between 10 and 30 frames per second, depending on the image resolution (typically lower for IR sensors than video sensors) and the number of people in its field of view.

In the long run, W4 will be extended with models to recognize the actions of the people it tracks. Specifically, we are interested in interactions between people and objects - e.g., people exchanging objects, leaving objects in the scene, taking objects from the scene. The descriptions of people - their global motions and the motions of their "parts" - developed by W4 are designed to support such activity recognition. W4 currently operates on video taken from a stationary camera, and many of its image analysis algorithms would not generalize easily to images taken from a moving camera.

Other ongoing research in our laboratory attempts to develop both appearance and motion cues from a moving sensor that might alert a system to the presence of people in its field of regard [4].

W4 consists of five computational components: background modeling, foreground object detection, motion estimation of foreground objects, object tracking and labeling, and locating and tracking human body parts. The background scene is statically modeled by the minimum and maximum intensity values and the temporal derivative for each pixel recorded over some period, and is updated periodically. For each frame in the video sequence, foreground objects are detected by frame-difference thresholding, connected component analysis, and morphological analysis. These foreground objects are tracked and labeled by a forward matching process from previously detected objects to currently detected objects. Motion models, based on matching silhouette edges of foreground objects in two successive frames and a recursive least squares method, are used during object tracking to estimate the expected location of objects in future frames. A cardboard human model of a person in a standard upright pose is used to model the human body and to locate human body parts (head, torso, hands, legs and feet). Those parts are tracked using dynamic template matching methods. Figure 2 illustrates some results of the W4 system.

Fig. 2. Examples of using the cardboard model to locate body parts in different situations: four people meet and talk (first row), a person sits on a bench (second row), two people meet (third row).
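To make the background model in the components paragraph above concrete, here is a minimal numpy sketch of per-pixel minimum/maximum/temporal-derivative background statistics and the resulting foreground test. The sensitivity factor k and the function names are assumptions for illustration, not taken from the paper, and the connected-component and morphological cleanup steps are omitted.

```python
import numpy as np

def train_background(frames):
    """Per-pixel background statistics over a training period:
    minimum intensity m, maximum intensity n, and the largest
    absolute interframe difference d (the 'temporal derivative')."""
    stack = np.stack(frames).astype(np.float32)   # (T, H, W) grayscale
    m = stack.min(axis=0)
    n = stack.max(axis=0)
    d = np.abs(np.diff(stack, axis=0)).max(axis=0)
    return m, n, np.maximum(d, 1.0)               # avoid zero thresholds

def foreground_mask(frame, m, n, d, k=2.0):
    """Mark a pixel as foreground when it falls more than k*d below
    the background minimum or above the background maximum; k is an
    assumed sensitivity factor."""
    f = frame.astype(np.float32)
    return (m - f > k * d) | (f - n > k * d)
```

In a full pipeline the mask would then be cleaned by morphological filtering and grouped into objects by connected component analysis, as described above.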

4 Activity Modeling and Recognition

Activity representation and recognition are central to the interpretation of human movement. Several issues affect the development of models of activities and the matching of observations to these models:

- Repeated performances of the same activity by the same human vary even when all other factors are kept unchanged.
- Similar activities are performed by different individuals in slightly different ways.
- Delineation of the onset and ending of an activity can sometimes be challenging.
- Similar activities can be of different temporal durations.
- Different activities may have significantly different temporal durations.

There are also imaging issues that affect the modeling and recognition of activities:

- Occlusions and self-occlusions of body parts during activity performance.
- The projection of the movement trajectories of body parts depends on the observation viewpoint.
- The distance between the camera and the human affects image-based measurements, due to the projection of the activity.

Fig. 3. Image sequence of "walking" and the five-part tracking (frames 1040, 1100, 1190, 1270).

An observed activity can be viewed as a vector of measurements over the temporal axis. Consider as an example Figure 3, which shows both selected frames from an image sequence of a person walking in front of a camera and the model-based tracking of five body parts (i.e., arm, torso, thigh, calf and foot). We developed (see [8] for details) a method for modeling and recognizing these temporal measurements while accounting for some of the above variations in activity execution. This method is based on the hypothesis that a reduced-dimensionality model of activities such as "walking" can be constructed using principal component analysis (PCA, or an eigenspace representation) of example signals ("exemplars"). Recognition of such activities is then posed as matching between the principal component representation of the observed activity (the "observation") and these learned models, subjected to "activity-preserving" transformations (e.g., change of execution duration, small change in viewpoint, change of performer, etc.).
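A minimal numpy sketch of this eigenspace scheme follows: learning a low-dimensional activity basis from exemplar measurement vectors, and scoring an observation by its reconstruction residual. The affine, activity-preserving time warping used in the matching of [8] is omitted here, and the function names are illustrative.

```python
import numpy as np

def learn_activity_basis(exemplars, k=6):
    """PCA/eigenspace model of an activity. Each exemplar is the
    temporal measurement vector of one performance (motion
    parameters of the tracked body parts over time, flattened)."""
    X = np.stack(exemplars)                 # (num_exemplars, T * parts)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]                     # top-k activity bases

def match_score(observation, mean, basis):
    """Project the observation onto the activity basis and return
    the reconstruction residual; recognition assigns the activity
    whose learned model yields the smallest residual."""
    y = observation - mean
    coeffs = basis @ y
    return np.linalg.norm(y - basis.T @ coeffs)
```

With k=6, this matches the six activity bases used in the experiments below.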

4.1 Experimental Results

We employ a recently proposed approach for tracking human motion using parameterized optical flow [5]. This approach assumes that an initial segmentation of the body into parts is given, and tracks the motion of each part using a chain-like model that exploits the attachments between parts to achieve tracking of body parts in the presence of non-rigid deformations of the clothing that covers the parts.

A set of 44 sequences of people walking in different directions was used for testing. The model of multi-view walking was constructed from the walking pattern of one individual, while the testing involved eight subjects. The first six activity bases were used. The confusion matrix for the recognition of the 44 instances of walking directions is shown in Table 1. Each column shows the best matches for each sequence. The walkers had different paces and stylistic variations, some of which were recovered well by the affine transformation. Also, time shifts were common, since only coarse temporal registration was employed prior to recognition.

Walking Direction  Parallel  Diag.  Away  Forward
Parallel               11       2     -       -
Diagonal                3      14     -       1
Perp. Away              -       -     6       -
Perp. Forw.             1       1     1       4
Total                  15      17     7       5

Table 1. Confusion matrix for recognition of walking direction

Next, we illustrate the modeling and recognition of a set of activities that we consider challenging for recognition. We chose four activities that are overall quite close in performance: walking, marching, line-walking (a form of walking in which the two feet step on a straight line and spatially touch when both are on the ground), and kicking while walking. Each cycle of these four activities lasts approximately 1.5 seconds. We acquired tens of sequences of subjects performing these four activities as observed from a single viewpoint. Temporal and stylistic variabilities in the performance of these activities are common. Clothing and lighting variations also affected the accuracy of the recovery of motion measurements from these image sequences.

Table 2 shows the confusion matrix for recognition of a set of 66 test activities. These activities were performed by some of the same people who were used for model construction, as well as by new performers. Variations in performance were accounted for by the affine transformation used in the matching: up to 30% speed-up or slow-down, as well as up to 15 frames of temporal shift, were accommodated.

Activity       Walk  Line-Walk  Walk to Kick  March
Walk            11       3           -           3
Line-Walk        3      24           -           1
Walk to Kick     -       -          12           -
March            1       1           -           7
Total           15      28          12          11

Table 2. Confusion matrix for recognition results

References

1. S. Fejes and L.S. Davis. Detection of independent motion using directional motion estimation. Technical Report CS-TR-3815, CAR-TR-866, University of Maryland, 1997.
2. S. Fejes and L.S. Davis. Direction-selective filters for egomotion estimation. Technical Report CS-TR-3814, CAR-TR-865, University of Maryland at College Park, 1997.
3. S. Fejes and L.S. Davis. What can projections of flow fields tell us about the visual motion. To appear in Proceedings of the International Conference on Computer Vision, Bombay, India, January 1998.
4. S. Fejes and L.S. Davis. Exploring visual motion using projections of flow fields. Proc. of the DARPA Image Understanding Workshop, New Orleans, LA, 1997.
5. S.X. Ju, M. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. Proc. Int. Conference on Face and Gesture, Vermont, 1996, 561-567.
6. S. Negahdaripour and B.K.P. Horn. Using depth-is-positive constraint to recover translational motion. Proceedings of the Workshop on Computer Vision, pages 138-144, 1987.
7. A.F. Siegel. Robust regression using repeated medians. Biometrika, 69:242-244, 1982.
8. Y. Yacoob and M. Black. Parameterized modeling and recognition of activities. To appear in ICCV-98.

also aected the accuracy of the recovery of motion measurements from these image sequences. Table 2 shows the confusion matrix for recognition of a set of 66 test activities. These activities were performed by some of the same people who were used for model construction as well as new performers. Variations in performance were accounted for by the ane transformation. Up to 30% speed-up or slow-down as well as up to 15 frames of temporal shift were accounted for by the ane transformation used in the matching. Activity Walk Line-Walk Walk. to Kick March Walk 11 3 3 Line-Walk 3 24 1 Walk to Kick 12 March 1 1 7 Total 15 28 12 11 Table 2. Confusion matrix for recognition results References 1. S. Fejes and L.S. Davis. Detection of independent motion using directional motion estimation. Technical Report CS-TR-3815, CAR-TR-866, University of Maryland, 1997. 2. S. Fejes and L.S. Davis. Direction-selective lters for egomotion estimation. Technical Report CS-TR-3814,CAR-TR-865, University of Maryland at College Park, 1997. 3. S. Fejes and L.S. Davis. What can projections of ow elds tell us about the visual motion. To appear in Proceedings of the International Conference on Computer Vision, Bombay, India, January 1998. 4. S. Fejes, L.S. Davis \Exploring Visual Motion Using Projections of Flow Fields" Proc. of the DARPA Image Understanding Workshop, New Orleans, LA,1997 5. S. X. Ju, M. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. Proc. Int. Conference on Face and Gesture, Vermont, 1996, 561-567. 6. S. Negahdaripour and B.K.P. Horn. Using depth-is-positive constraint to recover translational motion. Proceedings of the Workshop on Computer Vision, pages 138{144, 1987. 7. A.F. Siegel. Robust regression using repeated medians. Biometrika, 69:242{244, 1982. 8. Y. Yacoob and M. Black Parameterized Modeling and Recognition of Activities. To appear in ICCV-98.