Unusual Event Detection using Sparse Spatio-Temporal Features and Bag of Words Model


Proceedings of the 2013 IEEE Second International Conference on Image Information Processing (ICIIP-2013)

Unusual Event Detection using Sparse Spatio-Temporal Features and Bag of Words Model

Balakrishna Mandadi, EEE Department, Indian Institute of Technology Guwahati, Guwahati, India
Dr. Amit Sethi, EEE Department, Indian Institute of Technology Guwahati, Guwahati, India

Abstract: We present a system for unusual event detection in single fixed camera surveillance video. Instead of taking a binary or multi-class supervised learning approach, we take a one-class classification approach, assuming that the training dataset contains only usual events. The videos are modeled using a bag of words model for documents, where the words are prototypical sparse spatio-temporal feature descriptors extracted along moving objects in the scene of observation. We learn a probabilistic model of the training data as a corpus of documents, each containing a certain probabilistic mixture of latent topics, using the Latent Dirichlet Allocation framework. In this framework, topics are in turn modeled as probabilistic mixtures of words. Unusual events are video clips that deviate probabilistically by more than a threshold from the distribution of the usual events. Our results indicate potential to learn usual events from a few examples, reliable flagging of unusual events, and sufficient speed for practical applications.

Keywords: Automated surveillance; Unusual/abnormal/rare event detection; Latent Dirichlet allocation

I. INTRODUCTION

The ubiquitous presence of surveillance cameras, due to the increased threat to public life and property, calls for automatic methods to detect abnormal activities. Most present-day surveillance systems have no automation to detect illegal or unusual activities, and it is expensive and impractical to appoint a human observer to monitor the activities of each and every system. Hence the need for automatic methods is increasing day by day.
Unusual activity detection is a part of automated surveillance: it provides a pre-warning that is further processed by security personnel to take higher-level decisions. The notion of an unusual event is difficult to define because such events are unpredictable and even change with the scene of observation. For example, loitering is a usual activity in public parks but unusual in airports. In general, unusual activities can be defined as those whose occurrence is rare or unexpected in a given scenario. Given a long video sequence, events which occur repeatedly over time are considered usual. With myriad possibilities of events in a video scene, it is quite difficult to label all types of events and classify them using supervised learning. Furthermore, since examples of unusual events will be few, if present at all, the learning would have to be based on highly unbalanced classes. Treating the problem as one-class classification is a better choice. The problem of detecting unusual events can be approached in different ways. Tracking-based methods [2] are a good choice in constrained environments where only a limited set of activities is possible. These methods use tracks of moving objects to model activities based on the speed, size, direction and silhouettes of objects along these tracks [2]. By tracking, the activity of one object can be separated from other co-occurring activities. The advantage of these methods is that complex motion patterns can be easily modeled using the extracted trajectories, but they are sensitive to tracking errors. Explicit tracking of objects in crowded scenes is very complex and is easily prone to errors due to frequent occlusions. The other kind of approach [1] directly uses motion feature vectors instead of tracks to describe video clips. In these approaches, simple low-level visual features like motion histograms are extracted as the feature set.
As there is no detection and tracking, a particular activity cannot be separated from other simultaneously co-occurring activities; for example, the motion of a pedestrian cannot be separated from that of a car without tracking. To alleviate this problem, mixture models of some sort are used to learn the various co-occurring activities in the scene. We also use this kind of approach, without resorting to explicit tracking. Zhong and Shi [3] were among the first to use the "hard to describe but easy to verify" notion for detecting unusual events in video sequences. By slicing a long video sequence into small segments and representing them with simple motion features, they clustered the video segments using importance feature signal extraction. Unusual video segments are the ones with small inter-cluster similarity. This approach is well suited for finding interesting events in a large pool of video but may not be able to detect unusual events over live streaming video. Despite its offline nature, the method has become popular due to its simplicity and has inspired later methods to use powerful statistical models to make it online. Wang et al. [1] and Varadarajan et al. [4] used natural language processing models like Latent Dirichlet Allocation (LDA) [5] and Probabilistic Latent Semantic Analysis (pLSA) [6] to model activities by treating every video segment as a document and quantized optical flow features as the vocabulary. Our approach to detecting unusual events resembles that of [1], but we use a different set of features to represent a video and different abnormality measures to detect unusual events. As motion information from optical flow alone may not model various activities efficiently, we use quantized spatio-temporal corner descriptors, which can capture both the shape and motion information of moving objects, to model the various activities over the scene of observation. To model the activities we also use the LDA model as in [1], but with different measures of abnormality instead of the likelihood measure, since the likelihood can remain high for an unusual activity co-occurring with a usual activity, which is undesirable. We detect unusual events by measuring the deviation of a new test clip's inferred parameters from those of the model using probabilistic distances like the Bhattacharyya distance and the Kullback-Leibler (KL) divergence.

The rest of the paper is organized as follows. Section II gives the general mathematical formulation of the problem. Section III discusses the video features and representation. Section IV discusses the generative LDA model. Section V gives experimental results, followed by a conclusion.

II. MATHEMATICAL FORMULATION

Let $V(x, y, t)$ represent a small video clip of $T$ frames duration, which may contain a non-separable set of activities (e.g., a walking pedestrian while a car is moving on the road), where $x$, $y$ and $t$ are the spatial and temporal coordinates respectively. Let $V = \{V_1, V_2, V_3, \ldots, V_N\}$ be the training dataset, obtained by slicing a long video sequence into overlapping or non-overlapping equal-length segments. The dataset can be labeled (the label being usual or unusual) or unlabeled, depending on whether a supervised or unsupervised approach is used to detect a given video event as unusual. Feature representation of a video clip can be done in many ways (we use a histogram of sparse feature descriptors to represent a video). In general, let $\lambda_{V_i}(l)$ be the feature representation of video clip $V_i$, $i = 1, \ldots, N$, where $l$ is the feature parameter.
Now the original dataset can be replaced by the feature dataset

$\lambda_V = \{\lambda_{V_1}(l), \lambda_{V_2}(l), \ldots, \lambda_{V_N}(l)\}$ (1)

To learn the various sets of activities or motion patterns, a probabilistic model is generally fitted to the training data. Let $M(l; \theta)$ be the generative model to be fitted to the training video dataset $\lambda_V$, where $\theta$ is the parameter vector learned while fitting the model to the dataset. As $\theta$ is tuned to the dataset $\lambda_V$, $M(l; \theta)$ captures certain beliefs about the environment. Now, given a new test video clip $V_s$ of the same duration and its feature representation $\lambda_{V_s}(l)$, its deviation from the model is calculated as $D_s$ using an appropriate probabilistic distance (PD) measure:

$D_s = PD\big(\lambda_{V_s}(l),\, M(l; \theta)\big)$ (2)

The new test clip $V_s$ is flagged as unusual if $D_s > D_{th}$, a threshold predetermined from the training sequence. The model can be supervised or unsupervised, but in crowded environments it is usually very difficult to get the good training examples of unusual events that supervised models require. To address the unavailability of unusual examples, we model the activities seen during training using unsupervised learning. In general, we learn a probability distribution over the training examples and mark test cases that are improbable under this distribution as unusual. The probability distribution should be flexible and loose enough not to raise false alarms for usual activities in the test examples, yet not so loose that it misses obviously unusual cases.

III. VIDEO REPRESENTATION

With the success of sparse spatio-temporal (ST) corner descriptors for action recognition, we use the same set of features to represent a video clip. Different algorithms are available for detecting spatio-temporal interest points; the most popular among them are Harris3D [7] and the periodic detector [8]. Harris3D, developed by Laptev and Lindeberg, is the extension of the popular Harris corner detection algorithm [9] to videos.
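As a concrete illustration of the decision rule in (2), the sketch below (function and variable names are ours, not the paper's) flags a test clip whose distance from the model exceeds the largest distance observed on training clips, which is the thresholding strategy the experiments later adopt:

```python
import numpy as np

def flag_unusual(train_distances, test_distances):
    """One-class decision rule: a clip is unusual if its deviation from
    the learned model exceeds the largest deviation seen in training."""
    d_th = np.max(train_distances)            # D_th from training clips
    return np.asarray(test_distances) > d_th  # boolean flags per test clip

# toy example with made-up distances
flags = flag_unusual([0.2, 0.5, 0.4], [0.3, 1.7])
```

Any probabilistic distance can be plugged in as the source of the distances; the rule itself is independent of the choice.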
The periodic detector proposed by Dollar et al. [8] uses linear filters along the temporal direction to detect interest points. Because of its robustness and simplicity in our experiments, we use the periodic detector of Dollar et al. [8] to represent a video clip. For any feature detection algorithm, a response function is usually evaluated at each location. The response function for the periodic detector uses separable 1D temporal Gabor filters and is of the form

$R(x, y, t) = \big(J * g(\sigma) * h_{ev}\big)^2 + \big(J * g(\sigma) * h_{od}\big)^2$ (3)

where $R$ is the response function and $J$ is the image intensity at location $(x, y, t)$.

Figure 1: (a) One of the frames from the 5 sec test video clips. (b) The detected 3D interest points at different time instances, projected onto the frame shown in (a), using the periodic detector. (Source code for the periodic detector was downloaded from the author's homepage.)
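A minimal NumPy/SciPy sketch of the response function (3) follows, using the Gaussian spatial kernel and the quadrature temporal Gabor pair with omega = 4/tau from Dollar et al. [8]; the parameter values and function name are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def periodic_response(video, sigma=2.0, tau=3.0):
    """Response function of the periodic detector (Dollar et al.):
    R = (J * g * h_ev)^2 + (J * g * h_od)^2, where g is a 2D spatial
    Gaussian and (h_ev, h_od) is a quadrature pair of 1D temporal Gabor
    filters with omega = 4 / tau. `video` is a (T, H, W) float array."""
    omega = 4.0 / tau
    t = np.arange(-int(4 * tau), int(4 * tau) + 1)
    h_ev = np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    # spatial smoothing only (axis 0 is time, so sigma_t = 0)
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    even = convolve1d(smoothed, h_ev, axis=0)  # temporal convolution
    odd = convolve1d(smoothed, h_od, axis=0)
    return even**2 + odd**2
```

Interest points are then the local maxima of the returned response volume.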

Here $g(x, y; \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/(2\sigma^2)}$ is the spatial smoothing kernel, and $h_{ev}$ and $h_{od}$ are a pair of quadrature Gabor filters of the form

$h_{ev}(t) = \cos(2\pi t \omega)\, e^{-t^2/\tau^2}$ and $h_{od}(t) = \sin(2\pi t \omega)\, e^{-t^2/\tau^2}$

with $\omega = 4/\tau$. The Gabor filters are bandpass in nature. The response function is evaluated at every possible location along the three dimensions, and the locations that fire a high response are spatio-temporal corners. Figure 1 shows the ST corners (interest points) detected in a small video clip, projected onto a single frame; they correspond to corner points of moving objects in the scene. To efficiently represent a video clip using the obtained ST corners, cuboids of a certain size are extracted around these interest points [8]. We empirically determined 9x9x7 to be a good cuboid size for the tested scenes; in general, it depends on the size of the interesting objects and their pace. To capture the shape and motion information of moving objects, the extracted cuboids are represented using a feature vector of 3D gradients.

Vocabulary: To build the vocabulary for the bag of words model, the obtained descriptors are quantized using k-means clustering. The cluster centers are the prototype words that act as the vocabulary. The number of clusters depends on the variety of descriptors in the scene of observation and can be determined empirically. To allow for new, unseen descriptors which may be part of unusual events during testing, a counter-prototype is defined for each prototype feature: descriptors that are nearest to a prototype center but farther away than the farthest descriptor of its cluster belong to the counter-prototype. The location of moving objects is also important for unusual activity detection.
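The prototype/counter-prototype quantization just described can be sketched as follows; a small Lloyd-style k-means stands in for whatever implementation the authors used, and all names are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):  # spread the remaining initial centers out
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return centers, labels

def build_vocabulary(descriptors, k=500):
    """k prototype words plus, per prototype, the radius to its farthest
    member; descriptors outside that radius at test time map to the
    prototype's counter-prototype word (ids k .. 2k-1)."""
    centers, labels = kmeans(descriptors, k)
    dists = np.linalg.norm(descriptors - centers[labels], axis=1)
    radii = np.array([dists[labels == i].max() if np.any(labels == i) else 0.0
                      for i in range(k)])
    return centers, radii

def quantize(desc, centers, radii):
    """Word id in [0, 2k): prototypes first, counter-prototypes after."""
    i = int(np.argmin(np.linalg.norm(centers - desc, axis=1)))
    return i if np.linalg.norm(desc - centers[i]) <= radii[i] else i + len(centers)
```

This doubles the effective vocabulary, which matches the 2K word count used for the total vocabulary size below.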
To get the location information, the frame is divided into H horizontal and V vertical patches, and the count of detected corner points within each patch is appended to the vocabulary histogram. Now the total number of possible words in the vocabulary is 2K + H x V (K prototypes, K counter-prototypes, and one location word per patch). Our feature representation $\lambda_V$ of a video clip is this histogram of the prototype descriptors present in the clip. By treating every video clip as a document and the prototype feature descriptors as the vocabulary, we model the activities using the LDA model, explained below.

IV. LATENT DIRICHLET ALLOCATION

LDA [5] is a probabilistic generative model successfully used in document processing to extract the semantic meaning of documents. A document is modeled as an unordered collection of words drawn from a defined vocabulary. Only the counts of individual words matter in a bag of words model; their order is neglected. The histogram of words captures their co-occurrence, which is used to model different topics. For videos, the documents are video clips and the vocabulary is the set of prototype descriptors defined in the previous section. LDA assumes that a document (video clip) is generated from a random mixture of K latent topics (atomic activities). Topics are sets of co-occurring words captured from the corpus of documents during training. LDA is a fully generative model and does not suffer from the overfitting problem of its predecessor pLSA [10, 11].

Figure 2: Graphical representation of the LDA model as in [5] (the shaded node represents the observed variable).

As LDA is a widely accepted model, we directly take some of the equations and notation required for our problem; for a complete explanation refer to [5]. The multinomial parameter of topics over a document, $\theta$, is treated as random with a known Dirichlet distribution:

$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1}$

where $\Gamma(x)$ is the Gamma function, $\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\, dt$, defined for $x$ not a non-positive integer.
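The generative process LDA assumes can be sketched directly from the Dirichlet and multinomial draws above; the sizes K and V and the symmetric priors here are illustrative only:

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document under LDA: theta ~ Dirichlet(alpha); then for
    each word, topic z_n ~ Multinomial(theta) and word w_n ~
    Multinomial(beta[z_n]). alpha: (K,); beta: (K, V) rows summing to 1."""
    theta = rng.dirichlet(alpha)                       # topic mixture
    z = rng.choice(len(alpha), size=n_words, p=theta)  # topic labels
    w = np.array([rng.choice(beta.shape[1], p=beta[zn]) for zn in z])
    return theta, z, w

rng = np.random.default_rng(0)
K, V = 3, 10
beta = rng.dirichlet(np.ones(V), size=K)  # per-topic word distributions
theta, z, w = generate_document(np.ones(K), beta, n_words=50, rng=rng)
```

In our setting a "document" of this form is one 5-second clip's bag of quantized descriptor words.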
Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of topic labels $z$ and a set of words $w$ is given by

$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$ (4)

where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$. Integrating over $\theta$ and summing over $z$, we obtain the marginal distribution of a document (video clip):

$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$ (5)

Taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus D:

$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$ (6)

The LDA model can be represented graphically as shown in Figure 2. There are three levels to the representation. The parameters $\alpha$ and $\beta$ are corpus-level parameters, assumed to be sampled only once in the process of generating a corpus. The variables $\theta_d$ are document-level parameters, sampled once per document. Finally, the variables $z_{dn}$ and $w_{dn}$ are word-level variables, sampled once for every word in each document.

The LDA model is described above; its parameters $\alpha$ and $\beta$ must be learned from a training corpus (a set of related documents). The inference problem is computing the posterior distribution of the hidden variables given a document:

$p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)}$ (7)

The denominator of (7), given by (5), is intractable due to the coupling between $\theta$ and $\beta$. As the posterior distribution is intractable for exact inference, the authors of [5] proposed a variational Bayes approximation, relaxing the link between the coupled parameters. The general idea is to approximate the complex posterior with a simple free variational distribution by minimizing the difference between them. The free variational posterior is

$q(\theta, z \mid \gamma, \varphi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \varphi_n)$ (8)

where $\gamma$ and $(\varphi_1, \varphi_2, \ldots, \varphi_N)$ are the free variational parameters, estimated by minimizing the difference between the true posterior (7) and the variational posterior (8). A tight lower bound on the log likelihood translates directly into the following optimization problem [5]:

$(\gamma^*, \varphi^*) = \arg\min_{(\gamma, \varphi)} D\big(q(\theta, z \mid \gamma, \varphi)\, \|\, p(\theta, z \mid w, \alpha, \beta)\big)$ (9)

where $D(\cdot \| \cdot)$ is the Kullback-Leibler divergence. The optimal values of the free variational parameters are found by setting the first derivatives to zero:

$\varphi_{ni} \propto \beta_{i w_n} \exp\{\mathrm{E}_q[\log(\theta_i) \mid \gamma]\}$ (10)

$\gamma_i = \alpha_i + \sum_{n=1}^{N} \varphi_{ni}$ (11)

$\mathrm{E}_q[\log(\theta_i) \mid \gamma] = \psi(\gamma_i) - \psi\left(\sum_{j=1}^{K} \gamma_j\right)$ (12)

where $\psi$ is the first derivative of the $\log \Gamma$ function. Equations (10) and (11) are coupled and are solved iteratively. The model parameters $\alpha$ and $\beta$ are learned using the variational EM algorithm as explained in [5]. Our aim is to infer the topic mixture weights $\gamma_{test}$ present in a new test document (video clip) using (10), (11) and (12).
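Updates (10)-(12) amount to a simple coordinate-ascent loop for a single document; a sketch follows, assuming an already-learned beta and using SciPy's digamma for psi (the function name is ours):

```python
import numpy as np
from scipy.special import digamma

def infer_topic_mixture(word_ids, alpha, beta, iters=100):
    """Per-document variational inference, iterating updates (10)-(12):
      phi_ni  proportional to  beta[i, w_n] * exp(E_q[log theta_i | gamma])
      gamma_i = alpha_i + sum_n phi_ni
    word_ids: indices of the N observed words; alpha: (K,); beta: (K, V)."""
    K, N = len(alpha), len(word_ids)
    gamma = alpha + N / K                 # standard uniform initialization
    for _ in range(iters):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())  # eq. (12)
        phi = beta[:, word_ids].T * np.exp(e_log_theta)      # eq. (10)
        phi /= phi.sum(axis=1, keepdims=True)                # normalize rows
        gamma = alpha + phi.sum(axis=0)                      # eq. (11)
    return gamma, phi
```

The returned gamma plays the role of the inferred topic mixture weights for a test clip.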
This variational mixture parameter is compared with the model parameter $\alpha$ to determine the deviation between the two.

Measures of abnormality: As both $\gamma_{test}$ and $\alpha$ are parameters of Dirichlet distributions, statistical distance measures like the Kullback-Leibler (KL) divergence and the Bhattacharyya distance can be used to measure the deviation between the two. The KL divergence between two Dirichlet distributions [12] is given by

$D_{KL}(\alpha_a, \alpha_b) = \int_X p_a(X; \alpha_a) \ln \frac{p_a(X; \alpha_a)}{p_b(X; \alpha_b)}\, dX$ (13)

$= \ln\Gamma\left(\sum_i \alpha_{ai}\right) - \ln\Gamma\left(\sum_i \alpha_{bi}\right) + \sum_i \left\{\ln\Gamma(\alpha_{bi}) - \ln\Gamma(\alpha_{ai})\right\} + \sum_i (\alpha_{ai} - \alpha_{bi}) \left[\psi(\alpha_{ai}) - \psi\left(\sum_j \alpha_{aj}\right)\right]$ (14)

The Bhattacharyya distance between two Dirichlet distributions [13] with parameters $\alpha_a$ and $\alpha_b$ is given by

$D_{BC}(\alpha_a, \alpha_b) = -\ln \int_X \sqrt{p_a(X; \alpha_a)\, p_b(X; \alpha_b)}\, dX$ (15)

$= \ln\Gamma\left(\sum_i \frac{\alpha_{ai} + \alpha_{bi}}{2}\right) - \sum_i \ln\Gamma\left(\frac{\alpha_{ai} + \alpha_{bi}}{2}\right) + \frac{1}{2} \sum_i \left\{\ln\Gamma(\alpha_{ai}) + \ln\Gamma(\alpha_{bi})\right\} - \frac{1}{2} \left\{\ln\Gamma\left(\sum_i \alpha_{ai}\right) + \ln\Gamma\left(\sum_i \alpha_{bi}\right)\right\}$ (16)

V. EXPERIMENTAL RESULTS

As no proper benchmark was available at the time, we collected our own videos from our campus. We provide results using both optical flow features and spatio-temporal features. The video dataset consists of a 45 minute video, of which 35 minutes are used for training; it contains general activities that are seen daily. To test for unusual events, an acted video was captured containing unusual activities like driving in wrong directions and entering the restricted lawn area. A small part of the normal video and the acted video is kept aside for testing. The video sequence is sliced into small non-overlapping segments of 5 seconds duration, which are treated as documents. After slicing into short video segments, a total of 370 video segments for training and 258 video segments for testing are obtained. Of the 258 test video clips, the first 158 contain daily-seen usual activities and the remaining segments contain unusual activities.
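The closed forms (14) and (16) can be implemented directly with SciPy's log-Gamma and digamma; a sketch (function names are ours):

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_dirichlet(a, b):
    """KL divergence between Dirichlet(a) and Dirichlet(b), eq. (14)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a0 = a.sum()
    return (gammaln(a0) - gammaln(b.sum())
            + np.sum(gammaln(b) - gammaln(a))
            + np.sum((a - b) * (digamma(a) - digamma(a0))))

def bhattacharyya_dirichlet(a, b):
    """Bhattacharyya distance between Dirichlet(a) and Dirichlet(b), eq. (16)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    m = (a + b) / 2.0  # Dirichlet parameter of the geometric mean density
    return (gammaln(m.sum()) - np.sum(gammaln(m))
            + 0.5 * np.sum(gammaln(a) + gammaln(b))
            - 0.5 * (gammaln(a.sum()) + gammaln(b.sum())))
```

Both distances are zero when the two parameter vectors coincide and grow as the inferred mixture drifts from the model's prior, which is exactly the abnormality signal used below.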
Each video segment is represented as a histogram of quantized optical flow features [1] and as a histogram of quantized ST corner descriptors as explained in section III. The results are discussed below.

Optical flow as vocabulary: The optical flow [14] vocabulary for the bag of words model is created as explained in [1], with added bins for magnitude. The frame size of 240x320 is divided into a 24x32 grid of spatial patches, each patch containing 10x10 pixels. By dividing the magnitude and direction of the optical flow over a patch into 4 bins each, a 16 bin histogram per patch is calculated by counting the number of pixels belonging to each bin. All the 16-bin patch histograms are concatenated to represent a video segment (document). The total vocabulary size equals the length of the concatenated histogram, i.e. 24x32x16 = 12,288. The number of topics K is empirically determined to be 20. The optical flow histograms of the training video segments are used to learn the parameters of the model. Given a new video clip, the topic distribution present in the clip is inferred from the model by iterating (10), (11) and (12). The deviation of the new test clip's topic distribution from the model parameter is measured using the KL divergence given by (14). As is clear from the KL divergence curve in Figure 3, the separation between unusual clips (159-258) and usual clips (1-158) is not very distinguishable, because the noisy optical flow motion features represent the video insufficiently. This problem is alleviated by representing a video using sparse ST corner descriptors, as shown in Figure 4. Figure 3 also shows some of the frames from the unusual video segments corresponding to high deviation from the learned model.

Figure 3: (a) KL divergence abnormality measure for test video clips using optical flow as vocabulary. (b) Frames corresponding to high KL divergence. (1, 4, 5, 6) People moving in the restricted lawn area. (2, 3) Bicycles moving in wrong directions. (Bounding boxes are hand labeled to visualize unusual activities over the scene.)

ST features as vocabulary: The vocabulary for ST corner descriptors is built as explained in section III. It is generated by quantizing the ST descriptors using k-means clustering with the empirically chosen k = 500. For location information, the frame is divided into 6x8 patches and the ST features within each patch are counted. Hence the total vocabulary size is 2K + H x V = 1,048, which is much smaller than that of the optical flow representation. The number of topics K in the mixture model is manually selected as 20 (increasing the value of K results in repetitive atomic activities or mixtures of already-learned activities, which are redundant). The topic distribution of a given test clip is inferred as explained in section IV. Instead of the likelihood measure of abnormality, which may show up high for an unusual activity co-occurring with a usual one, we use the KL divergence and Bhattacharyya distance given by (14) and (16). The separation between unusual video clips (159-258) and usual clips (1-158) is clearly evident in Figure 4. The Bhattacharyya distance and KL divergence curves are almost identical except for their numerical range. The threshold is calculated as the maximum distance of the training clips from the model parameter.

Figure 4: (a) KL divergence curves and (b) Bhattacharyya distance for the test data sets using the ST feature representation. (c) Some of the frames corresponding to unusual video (high divergence). (1, 3, 5) Motion in the restricted lawn area. (2, 6) Driving in wrong directions. (4) Unusual hand movement of the observer, recorded while capturing the videos. (Bounding boxes are hand labeled to visualize unusual activity.)

Some of the frames of unusual video segments corresponding to temporally local maximal abnormality are shown in Figure 4. Most of the detected unusual activities involve driving in wrong directions and entering restricted lawn areas, which were not seen during training. Due to the limited training videos used for the experiments, a few patterns which would be expected to be usual are detected as unusual; these are labeled as false alarms (FA) in Figure 4. The false alarms, shown in Figure 5, correspond to a car reversing (hence moving in the wrong direction) and taking a U-turn, neither of which was seen widely during training. The present implementation has a latency of 5 sec, i.e. there is a 5 sec delay in determining an unusual event over live streaming video; this latency can be reduced by taking overlapping clips. To allow for spatial localization, we also tried dividing the frame into patches and learning a separate LDA model for each patch, but the results were not satisfactory: a patch cannot accommodate a whole activity, so everything appears new.

Figure 5: (a) Frames corresponding to normal video events but detected as unusual (false alarms). (b) Frames corresponding to unusual events but detected as usual (missed detections).

VI. CONCLUSION

For one-class learning of the activities over a scene of observation, video representation and generative model selection are the primary considerations. Previous methods used optical flow features to represent a video and language models to describe the scene. Motion features from optical flow alone may not be sufficient to model every scene, especially outdoors, where optical flow gives erroneous results. To alleviate this problem, we used a robust representation of video clips based on sparse interest points, which have been successful in action recognition. To learn the various co-occurring activities, a popular bag of words generative model, LDA, is used. Experimental results show that ST features performed well compared to optical flow features for detecting unusual events in outdoor scenarios. Future scope for the approach lies in video representation and online learning of the model parameters. Recently, deep learning techniques [15] have become popular for learning informative features from video without hand-designed features. To continuously update the model to a changing environment, the recently proposed online version of LDA [16] can be readily used here.

REFERENCES

[1] X. Wang, X. Ma, and W. E. L. Grimson, "Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, 2009.
[2] C. Stauffer and W. E. L. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, 2000.
[3] H. Zhong, J. Shi, and M. Visontai, "Detecting Unusual Activity in Video," in CVPR, 2004.
[4] J. Varadarajan and J.-M. Odobez, "Topic Models for Scene Analysis and Abnormality Detection," in 9th International Workshop on Visual Surveillance, 2009.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, 2003.
[6] T. Hofmann, "Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization."
[7] I. Laptev and T. Lindeberg, "Space-time Interest Points," in ICCV, 2003.
[8] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior Recognition via Sparse Spatio-Temporal Features," in VS-PETS, 2005.
[9] C. Harris and M. Stephens, "A Combined Corner and Edge Detector," in Proc. of Fourth Alvey Vision Conference, 1988.
[10] M. Steyvers and T. Griffiths, "Probabilistic Topic Models," Handbook of Latent Semantic Analysis, 2007.
[11] H. Shan, A. Banerjee, and N. C. Oza, "Discriminative Mixed-Membership Models," in Proceedings of the Ninth IEEE International Conference on Data Mining, 2009.
[12] W. D. Penny, "Kullback-Leibler Divergences of Normal, Gamma, Dirichlet and Wishart Densities," Wellcome Department of Cognitive Neurology, 2001.
[13] T. W. Rauber, A. Conci, T. Braun, and K. Berns, "Bhattacharyya Probabilistic Distance of the Dirichlet Density and its Application to Split-and-Merge Image Segmentation," in 15th International Conference on Systems, Signals and Image Processing, 2008.
[14] B. D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," 1981.
[15] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, "Learning Hierarchical Invariant Spatio-Temporal Features for Action Recognition with Independent Subspace Analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[16] M. D. Hoffman, D. M. Blei, and F. Bach, "Online Learning for Latent Dirichlet Allocation," in NIPS, 2010.


More information

Action Recognition using Randomised Ferns

Action Recognition using Randomised Ferns Action Recognition using Randomised Ferns Olusegun Oshin Andrew Gilbert John Illingworth Richard Bowden Centre for Vision, Speech and Signal Processing, University of Surrey Guildford, Surrey United Kingdom

More information

Probabilistic Generative Models for Machine Vision

Probabilistic Generative Models for Machine Vision Probabilistic Generative Models for Machine Vision bacciu@di.unipi.it Dipartimento di Informatica Università di Pisa Corso di Intelligenza Artificiale Prof. Alessandro Sperduti Università degli Studi di

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

Mixture Models and EM

Mixture Models and EM Table of Content Chapter 9 Mixture Models and EM -means Clustering Gaussian Mixture Models (GMM) Expectation Maximiation (EM) for Mixture Parameter Estimation Introduction Mixture models allows Complex

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information

Ensemble of Bayesian Filters for Loop Closure Detection

Ensemble of Bayesian Filters for Loop Closure Detection Ensemble of Bayesian Filters for Loop Closure Detection Mohammad Omar Salameh, Azizi Abdullah, Shahnorbanun Sahran Pattern Recognition Research Group Center for Artificial Intelligence Faculty of Information

More information

Computer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction

Computer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction Computer vision: models, learning and inference Chapter 13 Image preprocessing and feature extraction Preprocessing The goal of pre-processing is to try to reduce unwanted variation in image due to lighting,

More information

Probabilistic Graphical Models Part III: Example Applications

Probabilistic Graphical Models Part III: Example Applications Probabilistic Graphical Models Part III: Example Applications Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2014 CS 551, Fall 2014 c 2014, Selim

More information

Visuelle Perzeption für Mensch- Maschine Schnittstellen

Visuelle Perzeption für Mensch- Maschine Schnittstellen Visuelle Perzeption für Mensch- Maschine Schnittstellen Vorlesung, WS 2009 Prof. Dr. Rainer Stiefelhagen Dr. Edgar Seemann Institut für Anthropomatik Universität Karlsruhe (TH) http://cvhci.ira.uka.de

More information

Introduction to Trajectory Clustering. By YONGLI ZHANG

Introduction to Trajectory Clustering. By YONGLI ZHANG Introduction to Trajectory Clustering By YONGLI ZHANG Outline 1. Problem Definition 2. Clustering Methods for Trajectory data 3. Model-based Trajectory Clustering 4. Applications 5. Conclusions 1 Problem

More information

Augmented Reality VU. Computer Vision 3D Registration (2) Prof. Vincent Lepetit

Augmented Reality VU. Computer Vision 3D Registration (2) Prof. Vincent Lepetit Augmented Reality VU Computer Vision 3D Registration (2) Prof. Vincent Lepetit Feature Point-Based 3D Tracking Feature Points for 3D Tracking Much less ambiguous than edges; Point-to-point reprojection

More information

Vision and Image Processing Lab., CRV Tutorial day- May 30, 2010 Ottawa, Canada

Vision and Image Processing Lab., CRV Tutorial day- May 30, 2010 Ottawa, Canada Spatio-Temporal Salient Features Amir H. Shabani Vision and Image Processing Lab., University of Waterloo, ON CRV Tutorial day- May 30, 2010 Ottawa, Canada 1 Applications Automated surveillance for scene

More information

Human Activity Recognition Using a Dynamic Texture Based Method

Human Activity Recognition Using a Dynamic Texture Based Method Human Activity Recognition Using a Dynamic Texture Based Method Vili Kellokumpu, Guoying Zhao and Matti Pietikäinen Machine Vision Group University of Oulu, P.O. Box 4500, Finland {kello,gyzhao,mkp}@ee.oulu.fi

More information

Class 9 Action Recognition

Class 9 Action Recognition Class 9 Action Recognition Liangliang Cao, April 4, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Visual Recognition

More information

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison*

Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Tracking Trends: Incorporating Term Volume into Temporal Topic Models Liangjie Hong*, Dawei Yin*, Jian Guo, Brian D. Davison* Dept. of Computer Science and Engineering, Lehigh University, Bethlehem, PA,

More information

HIGH spatial resolution Earth Observation (EO) images

HIGH spatial resolution Earth Observation (EO) images JOURNAL OF L A TEX CLASS FILES, VOL. 6, NO., JANUARY 7 A Comparative Study of Bag-of-Words and Bag-of-Topics Models of EO Image Patches Reza Bahmanyar, Shiyong Cui, and Mihai Datcu, Fellow, IEEE Abstract

More information

Semantic Segmentation. Zhongang Qi

Semantic Segmentation. Zhongang Qi Semantic Segmentation Zhongang Qi qiz@oregonstate.edu Semantic Segmentation "Two men riding on a bike in front of a building on the road. And there is a car." Idea: recognizing, understanding what's in

More information

Localized Anomaly Detection via Hierarchical Integrated Activity Discovery

Localized Anomaly Detection via Hierarchical Integrated Activity Discovery Localized Anomaly Detection via Hierarchical Integrated Activity Discovery Master s Thesis Defense Thiyagarajan Chockalingam Advisors: Dr. Chuck Anderson, Dr. Sanjay Rajopadhye 12/06/2013 Outline Parameter

More information

Extended target tracking using PHD filters

Extended target tracking using PHD filters Ulm University 2014 01 29 1(35) With applications to video data and laser range data Division of Automatic Control Department of Electrical Engineering Linöping university Linöping, Sweden Presentation

More information

Crowd Event Recognition Using HOG Tracker

Crowd Event Recognition Using HOG Tracker Crowd Event Recognition Using HOG Tracker Carolina Gárate Piotr Bilinski Francois Bremond Pulsar Pulsar Pulsar INRIA INRIA INRIA Sophia Antipolis, France Sophia Antipolis, France Sophia Antipolis, France

More information

Leveraging Textural Features for Recognizing Actions in Low Quality Videos

Leveraging Textural Features for Recognizing Actions in Low Quality Videos Leveraging Textural Features for Recognizing Actions in Low Quality Videos Saimunur Rahman, John See, Chiung Ching Ho Centre of Visual Computing, Faculty of Computing and Informatics Multimedia University,

More information

Text Modeling with the Trace Norm

Text Modeling with the Trace Norm Text Modeling with the Trace Norm Jason D. M. Rennie jrennie@gmail.com April 14, 2006 1 Introduction We have two goals: (1) to find a low-dimensional representation of text that allows generalization to

More information

MR IMAGE SEGMENTATION

MR IMAGE SEGMENTATION MR IMAGE SEGMENTATION Prepared by : Monil Shah What is Segmentation? Partitioning a region or regions of interest in images such that each region corresponds to one or more anatomic structures Classification

More information

Evaluation of Moving Object Tracking Techniques for Video Surveillance Applications

Evaluation of Moving Object Tracking Techniques for Video Surveillance Applications International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Evaluation

More information

IN computer vision develop mathematical techniques in

IN computer vision develop mathematical techniques in International Journal of Scientific & Engineering Research Volume 4, Issue3, March-2013 1 Object Tracking Based On Tracking-Learning-Detection Rupali S. Chavan, Mr. S.M.Patil Abstract -In this paper; we

More information

Unusual Activity Analysis using Video Epitomes and plsa

Unusual Activity Analysis using Video Epitomes and plsa Unusual Activity Analysis using Video Epitomes and plsa Ayesha Choudhary 1 Manish Pal 1 Subhashis Banerjee 1 Santanu Chaudhury 2 1 Dept. of Computer Science and Engineering, 2 Dept. of Electrical Engineering

More information

Lecture 18: Human Motion Recognition

Lecture 18: Human Motion Recognition Lecture 18: Human Motion Recognition Professor Fei Fei Li Stanford Vision Lab 1 What we will learn today? Introduction Motion classification using template matching Motion classification i using spatio

More information

Action recognition in videos

Action recognition in videos Action recognition in videos Cordelia Schmid INRIA Grenoble Joint work with V. Ferrari, A. Gaidon, Z. Harchaoui, A. Klaeser, A. Prest, H. Wang Action recognition - goal Short actions, i.e. drinking, sit

More information

Unsupervised Identification of Multiple Objects of Interest from Multiple Images: discover

Unsupervised Identification of Multiple Objects of Interest from Multiple Images: discover Unsupervised Identification of Multiple Objects of Interest from Multiple Images: discover Devi Parikh and Tsuhan Chen Carnegie Mellon University {dparikh,tsuhan}@cmu.edu Abstract. Given a collection of

More information

Spatial-Temporal correlatons for unsupervised action classification

Spatial-Temporal correlatons for unsupervised action classification Spatial-Temporal correlatons for unsupervised action classification Silvio Savarese 1, Andrey DelPozo 2, Juan Carlos Niebles 3,4, Li Fei-Fei 3 1 Beckman Institute, University of Illinois at Urbana Champaign,

More information

Efficient Acquisition of Human Existence Priors from Motion Trajectories

Efficient Acquisition of Human Existence Priors from Motion Trajectories Efficient Acquisition of Human Existence Priors from Motion Trajectories Hitoshi Habe Hidehito Nakagawa Masatsugu Kidode Graduate School of Information Science, Nara Institute of Science and Technology

More information

Gradient of the lower bound

Gradient of the lower bound Weakly Supervised with Latent PhD advisor: Dr. Ambedkar Dukkipati Department of Computer Science and Automation gaurav.pandey@csa.iisc.ernet.in Objective Given a training set that comprises image and image-level

More information

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical

More information

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University NIPS 2008: E. Sudderth & M. Jordan, Shared Segmentation of Natural

More information

Improving Recognition through Object Sub-categorization

Improving Recognition through Object Sub-categorization Improving Recognition through Object Sub-categorization Al Mansur and Yoshinori Kuno Graduate School of Science and Engineering, Saitama University, 255 Shimo-Okubo, Sakura-ku, Saitama-shi, Saitama 338-8570,

More information

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.11, November 2013 1 Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial

More information

Towards the completion of assignment 1

Towards the completion of assignment 1 Towards the completion of assignment 1 What to do for calibration What to do for point matching What to do for tracking What to do for GUI COMPSCI 773 Feature Point Detection Why study feature point detection?

More information

Leow Wee Kheng CS4243 Computer Vision and Pattern Recognition. Motion Tracking. CS4243 Motion Tracking 1

Leow Wee Kheng CS4243 Computer Vision and Pattern Recognition. Motion Tracking. CS4243 Motion Tracking 1 Leow Wee Kheng CS4243 Computer Vision and Pattern Recognition Motion Tracking CS4243 Motion Tracking 1 Changes are everywhere! CS4243 Motion Tracking 2 Illumination change CS4243 Motion Tracking 3 Shape

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Lecture 21 : A Hybrid: Deep Learning and Graphical Models

Lecture 21 : A Hybrid: Deep Learning and Graphical Models 10-708: Probabilistic Graphical Models, Spring 2018 Lecture 21 : A Hybrid: Deep Learning and Graphical Models Lecturer: Kayhan Batmanghelich Scribes: Paul Liang, Anirudha Rayasam 1 Introduction and Motivation

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

CS231A Section 6: Problem Set 3

CS231A Section 6: Problem Set 3 CS231A Section 6: Problem Set 3 Kevin Wong Review 6 -! 1 11/09/2012 Announcements PS3 Due 2:15pm Tuesday, Nov 13 Extra Office Hours: Friday 6 8pm Huang Common Area, Basement Level. Review 6 -! 2 Topics

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Motion Tracking and Event Understanding in Video Sequences

Motion Tracking and Event Understanding in Video Sequences Motion Tracking and Event Understanding in Video Sequences Isaac Cohen Elaine Kang, Jinman Kang Institute for Robotics and Intelligent Systems University of Southern California Los Angeles, CA Objectives!

More information

Learning and Recognizing Visual Object Categories Without First Detecting Features

Learning and Recognizing Visual Object Categories Without First Detecting Features Learning and Recognizing Visual Object Categories Without First Detecting Features Daniel Huttenlocher 2007 Joint work with D. Crandall and P. Felzenszwalb Object Category Recognition Generic classes rather

More information

Automatic Tracking of Moving Objects in Video for Surveillance Applications

Automatic Tracking of Moving Objects in Video for Surveillance Applications Automatic Tracking of Moving Objects in Video for Surveillance Applications Manjunath Narayana Committee: Dr. Donna Haverkamp (Chair) Dr. Arvin Agah Dr. James Miller Department of Electrical Engineering

More information

SURVEY PAPER ON REAL TIME MOTION DETECTION TECHNIQUES

SURVEY PAPER ON REAL TIME MOTION DETECTION TECHNIQUES SURVEY PAPER ON REAL TIME MOTION DETECTION TECHNIQUES 1 R. AROKIA PRIYA, 2 POONAM GUJRATHI Assistant Professor, Department of Electronics and Telecommunication, D.Y.Patil College of Engineering, Akrudi,

More information

Structured Models in. Dan Huttenlocher. June 2010

Structured Models in. Dan Huttenlocher. June 2010 Structured Models in Computer Vision i Dan Huttenlocher June 2010 Structured Models Problems where output variables are mutually dependent or constrained E.g., spatial or temporal relations Such dependencies

More information

Articulated Pose Estimation with Flexible Mixtures-of-Parts

Articulated Pose Estimation with Flexible Mixtures-of-Parts Articulated Pose Estimation with Flexible Mixtures-of-Parts PRESENTATION: JESSE DAVIS CS 3710 VISUAL RECOGNITION Outline Modeling Special Cases Inferences Learning Experiments Problem and Relevance Problem:

More information

Efficient Object Tracking Using K means and Radial Basis Function

Efficient Object Tracking Using K means and Radial Basis Function Efficient Object Tracing Using K means and Radial Basis Function Mr. Pradeep K. Deshmuh, Ms. Yogini Gholap University of Pune Department of Post Graduate Computer Engineering, JSPM S Rajarshi Shahu College

More information

Parallelism for LDA Yang Ruan, Changsi An

Parallelism for LDA Yang Ruan, Changsi An Parallelism for LDA Yang Ruan, Changsi An (yangruan@indiana.edu, anch@indiana.edu) 1. Overview As parallelism is very important for large scale of data, we want to use different technology to parallelize

More information

Preliminary Local Feature Selection by Support Vector Machine for Bag of Features

Preliminary Local Feature Selection by Support Vector Machine for Bag of Features Preliminary Local Feature Selection by Support Vector Machine for Bag of Features Tetsu Matsukawa Koji Suzuki Takio Kurita :University of Tsukuba :National Institute of Advanced Industrial Science and

More information

Chapter 3: Supervised Learning

Chapter 3: Supervised Learning Chapter 3: Supervised Learning Road Map Basic concepts Evaluation of classifiers Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Summary 2 An example

More information

Bilevel Sparse Coding

Bilevel Sparse Coding Adobe Research 345 Park Ave, San Jose, CA Mar 15, 2013 Outline 1 2 The learning model The learning algorithm 3 4 Sparse Modeling Many types of sensory data, e.g., images and audio, are in high-dimensional

More information

Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization

Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization Partitioning Algorithms for Improving Efficiency of Topic Modeling Parallelization Hung Nghiep Tran University of Information Technology VNU-HCMC Vietnam Email: nghiepth@uit.edu.vn Atsuhiro Takasu National

More information

CS 231A Computer Vision (Fall 2012) Problem Set 3

CS 231A Computer Vision (Fall 2012) Problem Set 3 CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest

More information

Multi-Camera Occlusion and Sudden-Appearance-Change Detection Using Hidden Markovian Chains

Multi-Camera Occlusion and Sudden-Appearance-Change Detection Using Hidden Markovian Chains 1 Multi-Camera Occlusion and Sudden-Appearance-Change Detection Using Hidden Markovian Chains Xudong Ma Pattern Technology Lab LLC, U.S.A. Email: xma@ieee.org arxiv:1610.09520v1 [cs.cv] 29 Oct 2016 Abstract

More information

CPSC 425: Computer Vision

CPSC 425: Computer Vision 1 / 31 CPSC 425: Computer Vision Instructor: Jim Little little@cs.ubc.ca Department of Computer Science University of British Columbia Lecture Notes 2016/2017 Term 2 2 / 31 Menu March 16, 2017 Topics:

More information

Day 3 Lecture 1. Unsupervised Learning

Day 3 Lecture 1. Unsupervised Learning Day 3 Lecture 1 Unsupervised Learning Semi-supervised and transfer learning Myth: you can t do deep learning unless you have a million labelled examples for your problem. Reality You can learn useful representations

More information

Probabilistic Location Recognition using Reduced Feature Set

Probabilistic Location Recognition using Reduced Feature Set Probabilistic Location Recognition using Reduced Feature Set Fayin Li and Jana Košecá Department of Computer Science George Mason University, Fairfax, VA 3 Email: {fli,oseca}@cs.gmu.edu Abstract The localization

More information

Learning Human Motion Models from Unsegmented Videos

Learning Human Motion Models from Unsegmented Videos In IEEE International Conference on Pattern Recognition (CVPR), pages 1-7, Alaska, June 2008. Learning Human Motion Models from Unsegmented Videos Roman Filipovych Eraldo Ribeiro Computer Vision and Bio-Inspired

More information

Multiple-Choice Questionnaire Group C

Multiple-Choice Questionnaire Group C Family name: Vision and Machine-Learning Given name: 1/28/2011 Multiple-Choice naire Group C No documents authorized. There can be several right answers to a question. Marking-scheme: 2 points if all right

More information

Object detection using non-redundant local Binary Patterns

Object detection using non-redundant local Binary Patterns University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2010 Object detection using non-redundant local Binary Patterns Duc Thanh

More information

Trajectory Analysis and Semantic Region Modeling Using A Nonparametric Bayesian Model

Trajectory Analysis and Semantic Region Modeling Using A Nonparametric Bayesian Model Trajectory Analysis and Semantic Region Modeling Using A Nonparametric Bayesian Model Xiaogang Wang 1 Keng Teck Ma 2 Gee-Wah Ng 2 W. Eric L. Grimson 1 1 CS and AI Lab, MIT, 77 Massachusetts Ave., Cambridge,

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION. Ing. Lorenzo Seidenari

EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION. Ing. Lorenzo Seidenari EVENT DETECTION AND HUMAN BEHAVIOR RECOGNITION Ing. Lorenzo Seidenari e-mail: seidenari@dsi.unifi.it What is an Event? Dictionary.com definition: something that occurs in a certain place during a particular

More information

Generative and discriminative classification techniques

Generative and discriminative classification techniques Generative and discriminative classification techniques Machine Learning and Category Representation 2014-2015 Jakob Verbeek, November 28, 2014 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.14.15

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

A Novel Model for Semantic Learning and Retrieval of Images

A Novel Model for Semantic Learning and Retrieval of Images A Novel Model for Semantic Learning and Retrieval of Images Zhixin Li, ZhiPing Shi 2, ZhengJun Tang, Weizhong Zhao 3 College of Computer Science and Information Technology, Guangxi Normal University, Guilin

More information

MoSIFT: Recognizing Human Actions in Surveillance Videos

MoSIFT: Recognizing Human Actions in Surveillance Videos MoSIFT: Recognizing Human Actions in Surveillance Videos CMU-CS-09-161 Ming-yu Chen and Alex Hauptmann School of Computer Science Carnegie Mellon University Pittsburgh PA 15213 September 24, 2009 Copyright

More information

Contextualized Trajectory Parsing with Spatio-temporal Graph

Contextualized Trajectory Parsing with Spatio-temporal Graph 1 2 3 4 5 Contextualized Trajectory Parsing with Spatio-temporal Graph Xiaobai Liu 1, Liang Lin 1, Hai Jin 2 1 University of California, Los Angeles, CA 2 Huazhong University of Science Technology, China

More information

Comparing Local Feature Descriptors in plsa-based Image Models

Comparing Local Feature Descriptors in plsa-based Image Models Comparing Local Feature Descriptors in plsa-based Image Models Eva Hörster 1,ThomasGreif 1, Rainer Lienhart 1, and Malcolm Slaney 2 1 Multimedia Computing Lab, University of Augsburg, Germany {hoerster,lienhart}@informatik.uni-augsburg.de

More information

A TRAJECTORY CLUSTERING APPROACH TO CROWD FLOW SEGMENTATION IN VIDEOS. Rahul Sharma, Tanaya Guha

A TRAJECTORY CLUSTERING APPROACH TO CROWD FLOW SEGMENTATION IN VIDEOS. Rahul Sharma, Tanaya Guha A TRAJECTORY CLUSTERING APPROACH TO CROWD FLOW SEGMENTATION IN VIDEOS Rahul Sharma, Tanaya Guha Electrical Engineering, Indian Institute of Technology Kanpur, India ABSTRACT This work proposes a trajectory

More information

Motion Estimation and Optical Flow Tracking

Motion Estimation and Optical Flow Tracking Image Matching Image Retrieval Object Recognition Motion Estimation and Optical Flow Tracking Example: Mosiacing (Panorama) M. Brown and D. G. Lowe. Recognising Panoramas. ICCV 2003 Example 3D Reconstruction

More information

Multiple Motion and Occlusion Segmentation with a Multiphase Level Set Method

Multiple Motion and Occlusion Segmentation with a Multiphase Level Set Method Multiple Motion and Occlusion Segmentation with a Multiphase Level Set Method Yonggang Shi, Janusz Konrad, W. Clem Karl Department of Electrical and Computer Engineering Boston University, Boston, MA 02215

More information

Video Google faces. Josef Sivic, Mark Everingham, Andrew Zisserman. Visual Geometry Group University of Oxford

Video Google faces. Josef Sivic, Mark Everingham, Andrew Zisserman. Visual Geometry Group University of Oxford Video Google faces Josef Sivic, Mark Everingham, Andrew Zisserman Visual Geometry Group University of Oxford The objective Retrieve all shots in a video, e.g. a feature length film, containing a particular

More information

FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU

FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU 1. Introduction Face detection of human beings has garnered a lot of interest and research in recent years. There are quite a few relatively

More information

Face and Nose Detection in Digital Images using Local Binary Patterns

Face and Nose Detection in Digital Images using Local Binary Patterns Face and Nose Detection in Digital Images using Local Binary Patterns Stanko Kružić Post-graduate student University of Split, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture

More information

MULTI-VIEWPOINT TRACKING FOR VISUAL SURVEILLANCE USING SINGLE DOME CAMERA

MULTI-VIEWPOINT TRACKING FOR VISUAL SURVEILLANCE USING SINGLE DOME CAMERA International Journal of Latest Research in Science and Technology Volume 3, Issue 2: Page No.66-72,March-April, 2014 http://www.mnjournals.com/ijlrst.htm ISSN (Online):2278-5299 MULTI-VIEWPOINT TRACKING

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C. D-Separation Say: A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked by C if it contains a node such that either a) the arrows on the path meet either

More information

CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM

CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM 1 PHYO THET KHIN, 2 LAI LAI WIN KYI 1,2 Department of Information Technology, Mandalay Technological University The Republic of the Union of Myanmar

More information

Expectation Maximization. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University

Expectation Maximization. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University Expectation Maximization Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University April 10 th, 2006 1 Announcements Reminder: Project milestone due Wednesday beginning of class 2 Coordinate

More information