Using dynamic Bayesian network for scene modeling and anomaly detection


SIViP (2010) 4:1-10
DOI 10.1007/s11760-008-0099-7

ORIGINAL PAPER

Using dynamic Bayesian network for scene modeling and anomaly detection

Imran N. Junejo
INRIA-Rennes, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
e-mail: ijunejo@cs.ucf.edu

Received: 7 July 2008 / Revised: 4 December 2008 / Accepted: 5 December 2008 / Published online: 8 January 2009
Springer-Verlag London Limited 2008

Abstract: In this paper, we address the problem of scene modeling for performing video surveillance. The problem consists of using the trajectories, obtained by observing objects in a scene, to construct a scene model that can be used to distinguish a normal and acceptable behavior from an atypical one. In this regard, the proposed method is divided into a training phase and a testing phase. During the training phase, the input trajectories are used to identify the different paths or routes commonly taken by the objects in a scene. Important discriminative features are then extracted from these identified paths to learn a dynamic Bayesian network (DBN). During the testing phase, the learned network is used to classify incoming trajectories based on their size, location, speed, acceleration, and spatio-temporal curvature characteristics. The proposed method (i) handles trajectories of varying lengths, (ii) automatically detects the number of paths present in a scene, and (iii) introduces a novel usage of the DBN, which is very intuitive and accurately captures the dynamics of the scene. We show results on four datasets of varying lengths, demonstrating both path clustering and anomalous behavior detection.

Keywords: Path modeling · Dynamic Bayesian network · Video surveillance

1 Introduction

Video surveillance has brought computer vision into the limelight in recent years. This is primarily due to increased security concerns, but also because the technology has advanced to a stage where we no longer have to wait for hours to get results from a number-crunching machine. The task of processing a video sequence, performing background subtraction and detecting foreground objects, in addition to performing other higher-level event detection tasks, can now be done in real time. One such higher-level task is building a scene model for performing video surveillance (we use the terms path modeling and scene modeling interchangeably).

Once an acceptable or normal object behavior is obtained from a scene, the path modeling problem basically involves building a system that is able to learn the routes or paths most commonly taken by objects in that scene. Based on this learned model, we aim to classify an incoming or test behavior as conforming to our model or not. For example, consider the problem of monitoring an area of interest, e.g. a building entrance, a parking lot, a port facility, an embassy, or an airport lobby, using stationary cameras. Our goal in these scenarios would be to model the behavior of objects of interest, e.g. cars or pedestrians, with the intent to correctly detect any unusual movement or activity when it occurs. Although the solution we propose is general, the objects of interest addressed in this paper are pedestrians. As objects tend to follow well-established lines of travel while entering or exiting a scene, due to the presence of benches, trees, etc., we contend that it is critical to identify these different areas, which receive extra attention from pedestrians. We refer to this process of segmenting different lines of travel as determining the paths in the model. A path or route can be defined as any established line of travel or access [1]. This is the region that is most used by the objects. A trajectory can be defined as a path followed

by an object moving through the space. An example is shown in Fig. 1, where a person walking on a paved path is tracked by the system and correctly assigned a unique identifier.

However, the definition of an unusual behavior might differ between applications. People waiting several minutes by a conveyer belt at the airport may be considered acceptable, while the same may not be true for a person standing outside a bank. Similarly, a person running on a sidewalk may be acceptable behavior for a certain application but may not be suitable at an airport lobby. This difference of context is mediated by dividing the process into a training phase and a testing phase. Thus, for any application, the training phase consists of the acceptable behaviors that objects may demonstrate in a scene, obtained in terms of their tracked trajectories. Any test trajectory having characteristics different from the scene model built in the training phase shall be discarded as abnormal, flagging an alert.

It has recently been argued by Junejo and Foroosh [2] that a camera needs to be calibrated for performing path modeling. This is primarily due to the effects of perspective projection. That is, objects tend to grow larger as they approach the camera and smaller as they move away from it, making it difficult to characterize objects in terms of their sizes and motions. However, we contend that this perspective effect is also a characteristic of the scene, depending on both the intrinsic and extrinsic parameters of the camera as well as on the locations and orientations of the paths present in the scene. We also make a novel usage of an object's size by using its bounding box information. This gives us prior knowledge about the expected size of an object at any location, allowing us to reject, for example, the presence of a car where a pedestrian was expected.
In this paper, we provide a novel method to perform path modeling for video surveillance. Mainly, our contributions are: (i) a simple and intuitive method to segment the trajectories obtained during the training phase into spatially different paths by the application of eigendecomposition, (ii) the extraction of useful novel features from each trajectory present in a detected path, characterizing the location, velocity, acceleration, spatio-temporal curvature, and size of the observed pedestrians, and (iii) a novel usage of the dynamic Bayesian network (DBN) to learn each unique path detected in the scene, so that normal behavior can be distinguished from abnormal behavior.

The rest of the paper is organized as follows: a brief review of the related work is given next. The process of segmenting input trajectories into different paths is described in Sect. 2. Novel features are extracted from these segmented paths to learn a path model for each detected path by using the DBN, as described in Sect. 3. We show promising results in Sect. 4 before concluding.

1.1 Related work

It is beyond the scope of the current work to summarize the existing work on video surveillance; we therefore refer the reader to a recent survey [3]. Similarly, commonly used distance measures and their comparisons can be found in [4-7].

For video surveillance, Grimson et al. [8] use a distributed system of cameras to cover a scene, and employ an adaptive tracker to detect moving objects. A set of parameters is recorded for each detected object, e.g. position, direction of motion, velocity, size, and aspect ratio of each connected region. Tracked patterns (e.g. the aspect ratio of a tracked object) are used to classify objects or actions. Tracks are clustered using spatial features based on the vector quantization approach. Once these clusters are obtained, unusual activities are detected by matching incoming trajectories to the clusters.
Thus, unusual activities are outliers in the clustered distributions. Boyd et al. [9] demonstrate the use of network tomography for statistical tracking of activities in a video sequence. The method estimates the number of trips made from one region to another based on the inter-region boundary traffic counts accumulated over time. It does not track an object through the scene but only logs the event when an object crosses a boundary. The method determines only the mean traffic intensities based on the calculated statistics, and no information is given about trajectories. Johnson et al. [10] use a neural network to model the trajectory distribution for event recognition and prediction. Piciarelli et al. [11] cluster the trajectories based on normalized Euclidean distances, Khalid and Naftel [12] employ a Mahalanobis classifier for the detection of anomalous trajectories, and Calderara et al. [13] use a mixture of von Mises distributions for abnormal behavior detection.

However, in terms of path modeling, the most relevant work is that of Makris and Ellis [1,14], who develop a spatial model to represent routes in an image. Once the trajectory of a moving object is obtained, it is matched against routes already existing in a database using a simple distance measure. If a match is found, the existing route is updated by a weight update function; otherwise, a new route with some initial weight is created for the new trajectory. Spatially proximal routes are merged together and a graph representation of the scene is generated. One limitation of this approach is that only spatial information is used for trajectory clustering and behavior recognition. The system cannot distinguish between a person walking and a person lingering around, or between a person running and walking, since their models and measurements are not Euclidean. There also does not exist any stopping criterion for the merging of routes.
Following the work of Makris and Ellis [1], Junejo and Foroosh [15] recently proposed a method that overcomes some of the limitations of [1], in addition

to calibrating the camera by observing the pedestrians. Trajectories are clustered into paths by performing Normalized Cuts [16]. By applying dynamic time warping (DTW), a path envelope is built, representing the spatial extent of each path. Features are then extracted from these paths and each path is divided into segments. The mean and standard deviation of these features are computed to create a Gaussian representation of each path. The Mahalanobis distance is used to check the conformity of a test trajectory with the created model. However, they argue that a camera needs to be calibrated for performing this task, whereas we show that this is not necessary, as perspective effects also represent a characteristic of the scene. We show results on their dataset and obtain comparable results. Moreover, they adopt the tedious approach of constructing a path envelope and dividing each path into segments. We overcome this drawback with the help of techniques from machine learning.

Recently, Wright and Pless [17] used the 3D structure tensor for representing global patterns of local motion. Zhang et al. [18] built a generic rule induction framework based on trajectory series analysis for learning events in a scene. Jiang et al. [19] propose an HMM-based solution for event detection by performing dynamic hierarchical clustering. They only use object positions to train the HMM, which severely limits the applicability of their method, as they are not able to account for crookedness or for differences in velocity between the training and testing trajectories. Moreover, they train an HMM on each trajectory, which can be very time consuming. A hierarchical approach was also adopted by Fu et al. [20]. Wang et al. [21] describe an unsupervised framework, where the scene is initially segmented into people and vehicles. However, they only use size information for this initial yet crucial step.
This feature is very prone to perspective effects and generally leads to unstable results [2] (a person on a skateboard might appear to be of the same size as a car or a golf cart in the scene). Their proposed clustering criterion is similar to normalized cuts [16], using only the location and velocity features, which might not be able to distinguish between a running person and a person riding a bicycle. In addition to these features, we make use of the spatio-temporal curvature feature, which more accurately captures the dynamics of a pedestrian's motion.

In contrast to the existing work, we propose a simple yet efficient solution that avoids the noise-prone task of camera calibration [2] and does not construct a path envelope or mean path [1], which can lead to serious model drift [1]. For this problem of path modeling, we introduce a novel usage of the DBN, which is able to account for the instantaneous dependencies between different time instances of a tracked object. Moreover, we extract features that can differentiate between people walking at different speeds or velocities, and between people walking with different spatio-temporal characteristics (i.e. in a crooked zigzag path); we also obtain prior information regarding the size of the object that should be visible at a particular location.

Fig. 1 People generally tend to follow established lines of travel. In this example, a person is walking straight on a paved pathway, and is assigned a unique identifier by the object tracker. For our method, instead of tracking the centroid of an object, we track the feet locations.

2 Behavior class identification

The object tracker is able to correctly detect an object as it enters the scene and successfully track it until it exits. For our case of a single stationary camera, we use the tracker proposed by Javed and Shah [22].
With the unique label given by the tracker to a new object in the scene, we are able to correctly maintain a history of the object as it traverses the scene, as shown in Fig. 1. This history, in 2D image coordinates, can be represented as:

T_i = {l^i_1, l^i_2, ..., l^i_n}   (1)

where T_i is the trajectory of an object labeled i = 1, 2, ..., N, N is the total number of observed pedestrians throughout the whole video sequence, and l^i_t is the location of observed object i at time instance t. People walk at different speeds, thus the length of the trajectory T_i will be different for different people. The vector l^i_t represents the bounding box of the detected object and contains (t_x, t_y, w, h), where (t_x, t_y) denote the x and y coordinates of the top-left point of the bounding box, and (w, h) represent the width and the height of the bounding box, respectively. However, instead of using (t_x, t_y) for learning the path/scene model, we extract the feet location of the object at each time instance, denote it by (x, y), and use it along with the two quantities (w, h) for learning, as we discuss in Sect. 3. The feet location is computed simply as the midpoint of the bottom line of the bounding box.

2.1 Identifying paths from motion trajectories

Before we can learn the motion behavior of objects in the scene, we need to segment the different classes of object

motion. That is, we need to identify the number of paths S in the scene, and assign each trajectory its appropriate path label. In this regard, we aim to adopt an approach that is easy to implement, is able to distinguish between spatially dissimilar trajectories, and is flexible enough to accommodate different definitions of acceptable behavior.

Fig. 2 (a) Trajectories used for learning the scene model and for building the affinity matrix W. By successive application of the eigenvalue decomposition, different clusters, representing different paths in the scene, are obtained, as shown in (b)-(e).

In order to perform this task, we need to measure the affinity between trajectories. We define the N × N symmetric affinity matrix W = [w(i, j)], where 1 ≤ i, j ≤ N, as:

W = | w(1,1)  w(1,2)  ...  w(1,N) |
    |    *    w(2,2)  ...  w(2,N) |
    |    .       .     .      .   |
    |    *       *    ...  w(N,N) |

where * indicates symmetric values and N indicates the total number of trajectories in the training dataset. Each entry of this matrix is computed as:

w(i, j) = exp( -max{d(T_i, T_j), d(T_j, T_i)} / (2 σ_w²) )   (2)

where σ_w controls the decay of the similarities (set to σ_w = 2 in our experiments), w(i, i) = 0, and d(T_i, T_j) is a distance between the xy-coordinates of the two trajectories T_i and T_j. Our task at this stage is to cluster the input trajectories into paths based only on their spatial similarity. To do so, we choose the Hausdorff distance to compute d(T_i, T_j) in (2), defined as:

d(T_i, T_j) = max_{a ∈ T_i} min_{b ∈ T_j} ||a − b||   (3)

One advantage of using the Hausdorff distance is that trajectories of different lengths can be compared; it is also fairly efficient to compute. So far, we have computed the Hausdorff distance between all the trajectories T = {T_1, ..., T_N} using Eq. (2) and have built the affinity matrix W. In order to segment the entire trajectory (i.e.
the training) dataset into different paths and to determine the total number of classes S in the scene, rather than looking at the first eigenvector of W, we look at its generalized eigenvectors [23]. Let d(i) = Σ_j w(i, j) be the total affinity of a trajectory T_i to all the other trajectories in the training dataset. Let D be the N × N diagonal matrix with d on its diagonal; the generalized eigenvectors γ_i are then defined as solutions to

(D − W) γ_i = λ_i D γ_i   (4)

and the second generalized eigenvector is the γ_i corresponding to the second smallest λ_i. In other words, once we have constructed W, we apply the eigendecomposition recursively until the value associated with the second generalized eigenvector is above a threshold. Shi and Malik [23] show that the second generalized eigenvector is the solution to a continuous version of a discrete problem, where the solution is a segmentation that minimizes the affinity between groups normalized by the affinity within each group. After applying this process, the entire training set of trajectories is clustered into spatially similar paths; we denote the total number of paths detected in the scene by S. Each trajectory is then assigned the label of its corresponding path.

The result of applying the above method to our dataset is shown in Fig. 2. All trajectories used for learning are depicted in Fig. 2a. We then perform the generalized eigendecomposition, which divides the matrix into two segments. The eigendecomposition is again applied on the smaller of the two segments for further division, while thresholding on the second generalized eigenvector. The different detected paths used by the objects in the scene are shown in Fig. 2b-e.

3 Path modeling

In the previous section, we described an approach for segmenting trajectories into different paths. Once this is achieved, we need to extract useful features from these extracted paths so that the scene can be accurately modeled.
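The path-identification step above can be sketched as follows. This is a minimal illustration under our own helper names (`hausdorff`, `affinity_matrix`, `bipartition`), not the authors' implementation: it computes the symmetric Hausdorff affinity of Eqs. (2)-(3) and a single split via the generalized eigenproblem of Eq. (4); applying `bipartition` recursively, with a threshold on the second eigenvalue as a stopping rule, would yield the S paths.

```python
import numpy as np
from scipy.linalg import eigh

def hausdorff(Ti, Tj):
    # Directed Hausdorff distance of Eq. (3): max over points a in Ti of the
    # distance from a to its nearest point b in Tj. Trajectories are (n, 2)
    # arrays of xy feet locations.
    D = np.linalg.norm(Ti[:, None, :] - Tj[None, :, :], axis=2)
    return D.min(axis=1).max()

def affinity_matrix(trajs, sigma_w=2.0):
    # Eq. (2): w(i,j) = exp(-max{d(Ti,Tj), d(Tj,Ti)} / (2 sigma_w^2)), w(i,i) = 0.
    N = len(trajs)
    W = np.zeros((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            d = max(hausdorff(trajs[i], trajs[j]), hausdorff(trajs[j], trajs[i]))
            W[i, j] = W[j, i] = np.exp(-d / (2.0 * sigma_w**2))
    return W

def bipartition(W):
    # One step of Eq. (4): solve (D - W) v = lambda D v and split the
    # trajectories by the sign of the second generalized eigenvector.
    D = np.diag(W.sum(axis=1))  # D must be nonsingular (no isolated trajectory)
    vals, vecs = eigh(D - W, D)  # eigenvalues returned in ascending order
    return vals[1], vecs[:, 1] >= 0
```

On a toy set of four short trajectories forming two spatially distant pairs, the sign pattern of the second generalized eigenvector recovers the two groups.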
We aim to extract features that enable us to distinguish between (i) spatially dissimilar trajectories, (ii) trajectories of people walking at different speeds, (iii) crooked trajectories, (iv) trajectories where people make unexpected u-turns, and (v) trajectories where people unexpectedly vary their speed.

The novel set of features extracted from the trajectory of an object i at time instance t is an 11-tuple: ψ^i_t = (x, y, w, h, v, v_x, v_y, a, a_x, a_y, κ).

Feature description: As described above, (x, y) represents the feet location of a detected object, essential for characterizing the positional information of the trajectories present in the path model. (w, h) are the width and the height, respectively, of the bounding box around the detected object. For example, consider the case where we only observe pedestrians during the training phase. The observed values of w and h will be quantitatively lower than those of, for example, the bounding box around an observed car. This feature is thus very useful for detecting objects of unusual or unexpected size in the scene.

In order to distinguish between objects moving at different speeds, we compute the velocity feature v, derived as the magnitude of the first derivative of the x and y positions. Similarly, we compute v_x and v_y, the first derivatives of the x and y coordinates, respectively. This feature is essential for differentiating between people walking at different velocities, e.g. walking vs. running. In this paper, for the sake of simplicity, we discard the velocity direction information. It could, however, be very useful in cases where, for instance, road traffic is observed and all vehicles are expected to follow one particular direction.

There might also occur a case where a pedestrian suddenly changes his/her velocity. Many reasons can be attributed to this behavior: for example, a person falls while walking, a running person suddenly stops, or a person intermittently alternates between a slow and a fast walk. To address this type of scenario, we compute the second derivatives of the x and y positions, i.e. the acceleration a and its x and y components a_x and a_y, respectively.
In order to distinguish a crooked trajectory from a straight-line motion trajectory, we extract the velocity and acceleration discontinuities, i.e. the spatio-temporal curvature feature. This is where the importance of the velocity and acceleration features is greatly realized, for instance in distinguishing between a person walking normally in a straight line and a drunkard walking waywardly. At any time instance, this feature is computed as [24]:

κ = sqrt( a_y² + a_x² + (v_x a_y − a_x v_y)² ) / ( v_x² + v_y² + 1 )^{3/2}   (5)

3.1 Learning the dynamic Bayesian network

From the previous step, the scene is clustered into S paths or classes, and each trajectory T_i thus carries the label of the class to which it belongs. Let C = {C_1, C_2, ..., C_S} be the set of class labels. Let the set of extracted features for a trajectory i in class C_s be represented by Γ^{C_s}_i = {ψ^1_i, ψ^2_i, ..., ψ^n_i}. Similarly, let Γ = {Γ^{C_1}, Γ^{C_2}, ..., Γ^{C_S}} be the set representing the features of all the trajectories used in our training process. We aim to learn the scene model by applying machine learning techniques to this feature set.

Fig. 3 The dynamic Bayesian network (DBN) used in our method. The blue shaded circles are the observed variables while the white circles are the latent variables. q_c represents the path label of a trajectory during the training phase, while z_n and y_n are the continuous latent variable and the 11D input variable, respectively.

As described above, a common approach adopted for learning the set of input trajectories is to build a path envelope and to divide each trajectory into segments [1], which can be a very tedious task. Other approaches that use machine learning techniques, such as HMMs, assume equal lengths for all trajectories in a path [19]. By adopting the DBN [25], we aim to provide a solution that overcomes the restrictions imposed by the existing methods.
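As a sketch, the 11-tuple of this section can be assembled per time step from the tracked feet locations and bounding boxes. The paper does not fix a differencing scheme for the derivatives, so the use of `np.gradient` (central finite differences) here is our assumption:

```python
import numpy as np

def trajectory_features(x, y, w, h):
    # Per-frame feature vectors psi_t = (x, y, w, h, v, v_x, v_y, a, a_x, a_y, kappa).
    # x, y: feet locations; w, h: bounding-box width and height (equal-length arrays).
    vx, vy = np.gradient(x), np.gradient(y)    # first derivatives (velocity components)
    ax, ay = np.gradient(vx), np.gradient(vy)  # second derivatives (acceleration components)
    v, a = np.hypot(vx, vy), np.hypot(ax, ay)  # speed and acceleration magnitudes
    # Spatio-temporal curvature of Eq. (5)
    kappa = np.sqrt(ay**2 + ax**2 + (vx * ay - ax * vy)**2) / (vx**2 + vy**2 + 1)**1.5
    return np.stack([x, y, w, h, v, vx, vy, a, ax, ay, kappa], axis=1)
```

A straight, constant-speed walk yields zero acceleration and zero curvature, while a zigzag or a sudden stop raises a and κ, which is exactly what the path models exploit.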
In addition, by employing the linear dynamical system (LDS), we learn the dynamics of the motion. Our goal is to propose a general solution to the problem, in which individual components can be modified to cater to any specific application at hand.

The trajectory of a tracked object (a car or a pedestrian) is temporal (time-series) data. For capturing the instantaneous correlations of this sequential data, it is natural to use directed graphical models. The arcs (or edges) between different time slices of these models can be either directed or undirected. When these arcs are directed, the models are known as DBNs. DBNs of different topologies exist; the DBN that we use in this paper is shown in Fig. 3. This network is an example of an LDS (also commonly known as the Kalman filter [26]). Although this network appears similar to the input-output HMM (IOHMM) [27], the latent node z_n is continuous in our model. For a trajectory belonging to path label c (we omit the subscript i to keep the notation clutter-free), q_c is a discrete state variable with S possible states, and y_n is the continuous 11D input feature vector.

The number of nodes per slice is three, and the observable variables are shaded blue while the latent (hidden) variable is shaded white, as shown in the figure. Each of the state variable q_c and the input variable y_n is connected to the latent variable z_n. The state-space description of this model is:

z_n = f_c(z_{n−1}, q_c)   (6)
y_n = g_c(z_n, q_c)   (7)

where f_c and g_c are arbitrary differentiable (Gaussian) functions, q_c ∈ C is a discrete state variable containing the class (or path) label of a trajectory, and z_n and y_n are the continuous latent variable and the 11D input (observation) variable, respectively. Thus the observable variable q_c represents the path label of each feature vector. As shown in Fig. 3, the input variable q_c, in addition to the output variable y_n, influences either the latent variables or the output variables or both. The network has the task of predicting the next state based on the current inputs q_c and y_n and the previous state z_{n−1}. As the LDS is a linear Gaussian model, the joint distribution over all the variables, as well as the marginals and the conditionals, will all be Gaussian [25]. Hence each latent variable z_n is taken as Gaussian, and the observed 11D feature variable y_n is also assumed Gaussian. The conditional distributions, in terms of noisy linear equations, are given by

z_n = A_c z_{n−1} + w^c_n   (8)
y_n = C_c z_n + v^c_n   (9)
z_1 = μ_0 + u   (10)

where the noise terms w^c, v^c, and u are zero-mean Gaussians:

w^c ∼ N(w^c | 0, B_c)   (11)
v^c ∼ N(v^c | 0, Σ_c)   (12)
u ∼ N(u | 0, V_0)   (13)

and the Gaussian distribution of the initial latent variable is given as:

p(z_1) = N(z_1 | μ_0, V_0)   (14)

Thus the parameters of a path with label c are denoted by Θ_c = {A_c, B_c, C_c, Σ_c, V_0, μ_0}.
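The generative process of Eqs. (8)-(14) can be sketched as below, under hypothetical parameters Θ_c. This only illustrates the state-space equations for one path label; it is not the EM learning, which the paper performs with the BNT toolbox [28].

```python
import numpy as np

def sample_lds(A, B, C, Sigma, mu0, V0, n_steps, seed=0):
    # Generative sketch of the per-path LDS:
    #   z_1 ~ N(mu0, V0)                           Eqs. (10), (13)-(14)
    #   z_n = A z_{n-1} + w_n,  w_n ~ N(0, B)      Eqs. (8), (11)
    #   y_n = C z_n + v_n,      v_n ~ N(0, Sigma)  Eqs. (9), (12)
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(mu0, V0)
    zs, ys = [], []
    for _ in range(n_steps):
        zs.append(z)
        ys.append(C @ z + rng.multivariate_normal(np.zeros(C.shape[0]), Sigma))
        z = A @ z + rng.multivariate_normal(np.zeros(len(mu0)), B)
    return np.array(zs), np.array(ys)
```

In the paper, y_n is the 11D feature vector of Sect. 3, so C_c maps the latent state into 11 observation dimensions; the dimensions used here are illustrative.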
Therefore, once the membership of trajectories to their corresponding paths is determined, a unique DBN is trained for each path model using the features extracted from the corresponding trajectories. We use the standard method of maximum likelihood to determine the model parameters Θ_c through the expectation-maximization (EM) algorithm. The publicly available BNT toolbox [28] is used for learning the proposed DBN.

3.2 Anomaly detection

Once the labeled trajectories have been used to learn the DBN model M, we can begin to test the ability of our method to distinguish normal behavior from abnormal behavior. The model M is used to classify an unseen behavior pattern as belonging to one of the S model classes obtained from the training set. Once a test trajectory T_p is obtained, its membership is verified against each of the created models, as

Ŝ = arg max_c p(T_p | M_c) / N_p   (15)

where c ∈ C, and N_p is the length of the test trajectory T_p. Thus, a trajectory is assigned to the class M_c for which (15) is maximum. However, it is also critical to detect whether the test trajectory conforms to our constructed scene model. To do so, for every trajectory T_k in the model class c, we first compute the following quantity:

L̂_c = min_k p(T_k | M_c) / N_k   (16)

Thus L̂_c is the minimum of the weights that the trajectories obtain from a given path model. Based on this, an abnormal trajectory is identified as:

p(T_p | M_Ŝ) / N_p < L̂_Ŝ   (17)

i.e. if the incoming test trajectory fails to obtain substantial support from any path of our model, that trajectory is rejected as containing unusual or abnormal behavior. As we show in the next section, this measure works very well, and we obtain better results than the existing approaches without resorting to camera calibration, path envelopes, or other hierarchical clustering approaches.
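The decision rules of Eqs. (15)–(17) can be sketched independently of the underlying DBN. In the sketch below, log-likelihoods are used instead of raw probabilities for numerical stability, and the per-path Gaussian `loglik` in the demonstration is a toy stand-in for the learned LDS likelihood p(T | M_c); the path names and parameters are hypothetical.

```python
from math import log, pi

def learn_thresholds(loglik, train_trajs_by_path):
    """Eq. (16): minimum length-normalized support of the training trajectories."""
    return {c: min(loglik(T, c) / len(T) for T in trajs)
            for c, trajs in train_trajs_by_path.items()}

def classify_and_flag(loglik, T_p, classes, thresholds):
    """Eqs. (15) and (17): assign the best class, then reject weak support."""
    N_p = len(T_p)
    scores = {c: loglik(T_p, c) / N_p for c in classes}   # Eq. (15)
    c_hat = max(scores, key=scores.get)
    return c_hat, scores[c_hat] < thresholds[c_hat]       # True => abnormal

# Toy stand-in for log p(T | M_c): each path scores 1D samples under a Gaussian.
params = {"path_A": (0.0, 1.0), "path_B": (5.0, 1.0)}     # hypothetical (mean, var)

def loglik(T, c):
    mu, var = params[c]
    return sum(-0.5 * (log(2 * pi * var) + (y - mu) ** 2 / var) for y in T)

train = {"path_A": [[0.1, -0.2, 0.0], [0.3, 0.2]],
         "path_B": [[4.9, 5.1], [5.2, 5.0, 4.8]]}
thr = learn_thresholds(loglik, train)
print(classify_and_flag(loglik, [0.0, 0.1], params, thr))    # ('path_A', False)
print(classify_and_flag(loglik, [20.0, 21.0], params, thr))  # flagged abnormal: True
```

Normalizing by trajectory length keeps the comparison in (17) fair between short and long trajectories, which is why N_p and N_k appear in all three equations.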
The novel set of extracted features allows us to account for changes in size, location, and velocity, and for acceleration discontinuities in the motions of the observed objects.

4 Experiments and results

We rigorously test the proposed method on a variety of test sequences. As described above, we first detect pedestrians in a scene and perform tracking. Often, due to tracking errors or noise in the image, the obtained trajectories are noisy. To remove this effect, we perform smoothing. Another important issue while performing object tracking is occlusion. When an occlusion occurs, the accurate position and velocity of the occluded object cannot be determined. The types of occlusion are: (i) inter-object occlusion, which occurs when one object blocks

the view of other objects in the field of view of the camera. The background subtraction method gives a single region for occluding objects; if two initially non-occluding objects later occlude one another, this condition can be easily detected. (ii) Occlusion by thin scene structures, such as poles or trees, which causes an object to break into two regions; in this scenario more than one extracted region can belong to the same object. (iii) Occlusion by large structures, which causes objects to disappear completely for a certain amount of time, i.e. there is no foreground region representing such objects. More details on how we handle these occlusions during the tracking process can be found in [22].

Fig. 4 Training and testing for G_1: a depicts all the trajectories in the dataset G_1. After path clustering, two distinct paths are obtained, shown in b and c. Two test results are shown in d and e. A bicyclist is detected in d, whose velocity is higher than that of the trained model; hence it is flagged red, i.e. unusual behavior. e shows the flagged trajectory of a person that makes a u-turn in the scene.

Although our tracking can handle occlusions to a great degree, not all cases can be handled correctly. As a result, we obtain incorrect trajectories, which affect our trajectory clustering method. During the training phase, two cases are considered: 1) inter-object occlusion occurs: this kind of occlusion generates incomplete trajectories, i.e. a trajectory starts from one end of the image and ends before reaching the image boundary (a possible exit point). We ignore such a trajectory and do not use it in our path-building phase. 2) A new trajectory is generated not at the boundary of the image, but well inside the image plane. This generally occurs when a scene structure causes an object to break, or when the tracker assigns new trajectories to objects emerging from occlusion.
We also ignore this type of trajectory. During the testing phase, trajectories resulting from occlusion are not treated specially: even if such a trajectory satisfies the spatial proximity feature, it fails the motion and spatio-temporal features, because there is no information about the velocity and curvature at the missing sections of the trajectory.

Testing on real data: We test the proposed scene modeling method on four video sequences provided by Junejo and Foroosh [2]. These sequences are captured from a single camera at an image resolution of 320 × 240, and we label each dataset as G_i, where i = 1, 2, 3, 4. For visualization purposes, a conforming trajectory is marked green, while an abnormal trajectory is marked red in the results below. Once all the trajectories in the scene are obtained for the training phase, the affinity matrix W is constructed by comparing the trajectories pair-wise. The purpose of this step is to distinguish between spatially and visually distinct paths. This segmentation can vary from application to application; for example, while monitoring a road for traffic surveillance, distinct paths may be based on the speed of the motorists, in addition to segmenting different lanes on the road. Eigendecomposition is then performed on the computed W. Recursive application of this decomposition is a form of segmentation, dividing the matrix (and hence the set of trajectories used in its construction) into two sets. The decomposition is applied again on the larger of the two sets until the values of the second generalized eigenvector are above a certain threshold; more on this can be found in [23]. Useful features are then extracted from these trajectories. Once the object trajectories are segmented into paths, the DBN [28], as described in Sect. 3, is constructed for each segmented path. During the test phase, the support for the test trajectory is computed from each path.
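The recursive eigendecomposition described above can be sketched as follows. This is a simplified stand-in for the scheme of [16,23]: it bipartitions by the sign of the second generalized eigenvector of (D − W)x = λDx (computed via the normalized Laplacian), recurses on the larger set, and uses the second eigenvalue λ_2 as a simple stopping rule in place of the eigenvector threshold mentioned in the text. The affinity matrix and thresholds are illustrative.

```python
import numpy as np

def second_eigpair(W):
    """lambda_2 and the second generalized eigenvector of (D - W) x = lam D x,
    obtained from the symmetrically normalized Laplacian."""
    d = W.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(d)) - D_isqrt @ W @ D_isqrt
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    return vals[1], D_isqrt @ vecs[:, 1]

def recursive_bisect(W, idx=None, lam_thresh=0.5, min_size=2):
    """Split by the sign of the second eigenvector; recurse on the larger set."""
    if idx is None:
        idx = np.arange(W.shape[0])
    if len(idx) <= min_size:
        return [idx]
    lam, v = second_eigpair(W[np.ix_(idx, idx)])
    if lam > lam_thresh:                 # set is well connected: stop splitting
        return [idx]
    a, b = idx[v >= 0], idx[v < 0]
    if len(a) == 0 or len(b) == 0:
        return [idx]
    big, small = (a, b) if len(a) >= len(b) else (b, a)
    return [small] + recursive_bisect(W, big, lam_thresh, min_size)

# Toy affinity: two groups of trajectories with high within-group similarity.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
paths = recursive_bisect(W)
print(sorted(sorted(p.tolist()) for p in paths))  # [[0, 1, 2], [3, 4, 5]]
```

Any symmetric non-negative affinity matrix can be plugged in; in the paper, W comes from the pair-wise trajectory comparison described above.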
The trajectory is assigned the class label that maximizes its marginal probability, or rejected when the probability is below the threshold.

Testing on G_1: This is a small dataset of 3,730 frames, containing 15 instances of pedestrians walking in the scene. All the trajectories for this class are plotted in Fig. 4a. In this scene, people move horizontally or vertically, following the paved paths. Thus, two distinct paths are obtained, as shown in Fig. 4b and c.

Fig. 5 Training and testing for G_2: a All the trajectories in dataset G_2. Three unique paths are clustered from the training sequence, shown in b–d. Results are shown in e–h. Two cases of normal behavior are detected, as shown in e and g, where the accepted trajectories are plotted green. Two abnormal cases, one containing a person walking on the grass (f) and another containing a person walking zigzag (h), are flagged red.

Fig. 6 Testing for G_3: a depicts the rejected trajectory of a golf cart in the scene; this trajectory is rejected based on its unusual size and high speed. b, d–f depict four cases of walking behavior that match our scene model, as the objects follow the learned paths. c shows the case of a person making a left turn while entering from the right; this behavior is not part of the training process and is hence rejected. Another case is shown in g, where a person wanders on the grass; this is rejected based on its spatial characteristics.

Once the scene model has been learned, we test on two trajectories. One trajectory is that of a bicyclist, as shown in Fig. 4d, coming from the top of the scene and moving towards the left. This is marked as unusual for two reasons: it has a different spatial signature and a non-conforming velocity. Another case is shown in Fig. 4e. Here, a person moves to the left of the scene and then makes a u-turn. This is detected as unusual behavior as well. As shown in Fig. 4b and c, the training trajectories consist of people walking in almost straight lines, so this specific trajectory is correctly detected to be abnormal. This particular case shows the strength of the proposed method, as detection of this type of trajectory was not possible in [2,14].

Testing on G_2: This is a sequence of 9,284 frames with 27 different trajectories forming three different path clusters. The length of the trajectories varies from 250 points to almost 800 points.
Figure 5a depicts all the trajectories obtained in this sequence. The scene contains a T-shaped paved path. The proposed method successfully dissects the training set into three paths, as shown in Fig. 5b–d. The training phase consisted of people walking straight, and of people coming in from the left of the scene and going right or left. A normal case is detected in Fig. 5e, where a person walks in a straight line, obtaining the highest probability from cluster one (cf. Fig. 5b). Figure 5f contains the trajectory, marked red, of a person walking on the grass; this trajectory fails to get support from any of the clustered paths and is hence marked red. Similarly, a case of a person making a right turn is detected as normal behavior (cf. Fig. 5g). Figure 5h depicts the case of a person walking zigzag. This is where our spatio-temporal curvature feature is useful, and this trajectory is correctly marked as unusual.

Testing on G_3: This is a longer sequence, containing over 20 min of data forming over 100 trajectories of people walking. Figure 2a shows all the trajectories of the training sequence. Four clustered paths are shown in Fig. 2b–e. The first cluster contains people walking from the top and making a left turn, the second cluster contains people walking from the top and going down, the third cluster consists of people making a right turn, and the fourth and largest cluster contains people walking in a straight line. Test results for G_3 are shown in Fig. 6. The case of a golf cart observed in the scene is rejected on account of its size and speed, as shown in Fig. 6a. Four cases of

acceptable walking are shown in Fig. 6b, d, e, and f, plotted in green. A negative case is shown in Fig. 6c, where a pedestrian makes a left turn while coming in from the right of the scene. This is rejected because this path was not detected in the clustering process. Similarly, the case of a person deviating from the paved path and walking onto the grass is shown in Fig. 6g. This trajectory is spatially different, fails to get support from any of the paths in our scene model, and is thus rejected as an abnormal trajectory.

Fig. 7 Testing for G_4: a–c depict three classes of clustered trajectories. Some of the test results are shown. Four detected cases of unusual behavior are shown in d–f and h, where the observed pedestrians do not follow the constructed model and are marked in red. Samples of two positive detections are shown in g and i, where the trajectory is marked blue. See text for more details.

Testing on G_4: This is the longest test sequence, with a resolution of 320 × 240 pixels. The dataset contains tracks obtained from a surveillance camera observing the scene for almost 183 minutes. The sequence contains more than 500 trajectories of people walking, cars driving by, etc. Figure 7a–c show the three clustered paths for this scene. The first cluster contains people walking from the bottom of the scene and making a left turn, or vice versa. The second cluster covers the visible portion of a car parking lot, which is frequented by people as well. The third cluster consists of people walking from the bottom of the scene, walking straight and then disappearing around the corner of the building. The test sequence contains around 30 trajectories; some results are shown in Fig. 7d–i. Two normal cases of walking are marked in blue, as shown in Fig. 7g and i, belonging to clusters 1 and 2, respectively. An abnormal case of a person entering the scene and performing a zigzag motion is shown in Fig. 7d, marked in red.
A case of a person entering from the right of the scene (next to the building) and going across the scene is detected as abnormal (red trajectory in Fig. 7e), as this kind of trajectory is not present in our training model. Similarly, a wayward motion is detected as unusual, as shown in Fig. 7f, where a person enters the scene, goes onto the road, and makes loops. Figure 7g depicts an interesting case: a person comes into the scene and sits on the chair for some time. This scenario is correctly detected as non-conforming to our model, as the training phase did not include any such behavior. As shown above, the proposed method works robustly and achieves very good results. We have proposed a very general solution, yet one flexible enough to be adapted to many applications that monitor the behavior of objects in a scene.

5 Conclusion

To address the growing need for new and innovative methods for efficient and accurate path modeling in video surveillance, we propose a novel method based on DBNs whose performance is shown to be comparable with

the existing state-of-the-art methods. The advantage of our method, however, lies in doing away with the extra processing commonly carried out while performing scene modeling, such as calibrating cameras, obtaining prior knowledge of the number of different paths, or constructing path envelopes. In this paper, we showed that the trajectories obtained during the training phase can be clustered into different paths by applying eigendecomposition. These identified paths are then used to model the scene: a novel feature set characterizing the motion of pedestrians, in terms of their speeds, sizes, locations, and discontinuities in their velocities and accelerations, is estimated from these clustered paths. These features are then used to train a DBN, which is then used during the testing phase to either accept or reject test trajectories. We obtain good results on four datasets, demonstrating the practicality, generality, and accuracy of the proposed approach.

References

1. Makris, D., Ellis, T.: Path detection in video surveillance. Image Vis. Comput. J. (IVC) 20, 895–903 (2002)
2. Junejo, I., Foroosh, H.: Trajectory rectification and path modeling for video surveillance. In: Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV), pp. 1–7 (2007)
3. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Trans. Syst. Man Cybern. C 34, 334–352 (2004)
4. Zhang, Z., Huang, K., Tan, T.: Comparison of similarity measures for trajectory clustering in outdoor surveillance scenes. In: Proceedings of the 18th IEEE International Conference on Pattern Recognition (ICPR), vol. III, pp. 1135–1138 (2006)
5. Vlachos, M., Gunopulos, D., Kollios, G.: Robust similarity measures for mobile objects trajectories. In: Proceedings of the DEXA Workshops, pp. 721–728 (2002)
6. Keogh, E.J., Pazzani, M.J.: Scaling up dynamic time warping for datamining applications. Knowl. Discov. Data Min., pp. 285–289 (2000)
7. Rangarajan, K., Allen, W., Shah, M.: Matching motion trajectories using scale-space. Pattern Recognit., 595–610 (2003)
8. Grimson, W., Stauffer, C., Romano, R., Lee, L.: Using adaptive tracking to classify and monitor activities in a site. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22–29 (1998)
9. Boyd, J.E., Meloche, J., Vardi, Y.: Statistical tracking in video traffic surveillance. In: Proceedings of the 7th IEEE International Conference on Computer Vision (ICCV), pp. 163–168 (1999)
10. Johnson, N., Hogg, D.: Learning the distribution of object trajectories for event recognition. Image Vis. Comput. 14, 583–592 (1996)
11. Piciarelli, C., Foresti, G., Snidara, L.: Trajectory clustering and its applications for video surveillance. In: Proceedings of the 2nd IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 40–45 (2005)
12. Khalid, S., Naftel, A.: Classifying spatiotemporal object trajectories using unsupervised learning of basis function coefficients. In: Proceedings of the 3rd ACM International Workshop on Video Surveillance and Sensor Networks, pp. 45–52 (2005)
13. Calderara, S., Cucchiara, R., Prati, A.: Detection of abnormal behaviors using a mixture of von Mises distributions. In: Proceedings of the AVSBS, pp. 141–146 (2007)
14. Makris, D., Ellis, T.: Learning semantic scene models from observing activity in visual surveillance. IEEE Trans. Syst. Man Cybern. B 35, 397–408 (2005)
15. Junejo, I., Foroosh, H.: Euclidean path modeling for video surveillance. Elsevier J. Image Vis. Comput. (IVC) 26, 512–528
16. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 22, 888–905 (2000)
17. Wright, J., Pless, R.: Analysis of persistent motion patterns using the 3D structure tensor. In: Proceedings of the IEEE Workshop on Motion and Video Computing, pp. 14–19 (2005)
18. Zhang, Z., Huang, K., Tan, T., Wang, L.: Trajectory series analysis based event rule induction for visual surveillance. In: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2007)
19. Jiang, F., Wu, Y., Katsaggelos, A.: Abnormal event detection from surveillance video by dynamic hierarchical clustering. In: Proceedings of the 14th IEEE International Conference on Image Processing (ICIP), vol. V, pp. 145–148 (2007)
20. Fu, Z., Hu, W., Tan, T.: Similarity based vehicle trajectory clustering and anomaly detection. In: Proceedings of the 12th International Conference on Image Processing (ICIP), pp. 602–605 (2005)
21. Wang, X., Tieu, K., Grimson, E.: Learning semantic scene models by trajectory analysis. In: Proceedings of the 9th European Conference on Computer Vision (ECCV), pp. 1–8 (2006)
22. Javed, O., Shah, M.: Tracking and object classification for automated surveillance. In: Proceedings of the 7th European Conference on Computer Vision (ECCV), pp. 439–443 (2002)
23. Weiss, Y.: Segmentation using eigenvectors: a unifying view. In: Proceedings of the 7th IEEE International Conference on Computer Vision (ICCV), pp. 975–982 (1999)
24. Rao, C., Shah, M.: A view-invariant representation of human action. Int. J. Comput. Vis. 50(2), 203–226 (2002)
25. Bishop, C.M.: Pattern Recognition and Machine Learning, 1st edn. Springer, Berlin, ISBN 0-387-31073-8 (2006)
26. Welch, G., Bishop, G.: An introduction to the Kalman filter. ACM SIGGRAPH 2001 Courses (2001)
27. Bengio, Y., Frasconi, P.: An input output HMM architecture. In: Proceedings of the Advances in Neural Information Processing Systems (NIPS), vol. 7, pp. 427–434 (1995)
28. Murphy, K.P.: The Bayes Net Toolbox for MATLAB