Social Network Model for Crowd Anomaly Detection and Localization


Rima Chaker, Zaher Al Aghbari, Imran N. Junejo

Abstract

In this work, we propose an unsupervised approach for crowd scene anomaly detection and localization using a social network model. Using a window-based approach, a video scene is first partitioned at spatial and temporal levels, and a set of spatio-temporal cuboids is constructed. Objects exhibiting scene dynamics are detected and the crowd behavior in each cuboid is modeled using local social networks (LSN). From these local social networks, a global social network (GSN) is built for the current window to represent the global behavior of the scene. As the scene evolves with time, the global social network is updated accordingly using the LSNs, to detect and localize abnormal behaviors. We demonstrate the effectiveness of the proposed Social Network Model (SNM) approach on a set of benchmark crowd analysis video sequences. The experimental results reveal that the proposed method outperforms the majority, if not all, of the state-of-the-art methods in terms of anomaly detection accuracy.

Keywords: crowd modeling, social network model, crowd analysis, anomaly detection, anomaly localization, scene understanding, video surveillance.

1. Introduction

A crowd is defined as a collection of a large number of people in a confined space. Socio-psychological studies [49][50] have shown that people in a crowd tend to walk in groups, thus forming collective entities [31], each of which has a specific goal and similar characteristics such as speed and trajectory. Early detection, or prediction, of abnormal behaviors occurring in surveillance scenes is of utmost significance: by alerting human operators, potentially dangerous consequences can be reduced or prevented. However, the analysis of crowded scenes is a very challenging task, since the analysis of human actions is still not a fully solved problem. The significance of understanding crowd scenes lies in its potential applications, such as crowd management [41], video surveillance [3], and public space design [2]. Recently, crowd motion segmentation [42][5], crowd density estimation [7][8], and the identification of individuals' behavioral goals within a crowd [6] have all been subjects of active research across different disciplines. This problem presents challenges of great complexity due to: (1) occlusion between individual objects, (2) random variations in the density of people over time, (3) low-resolution videos with dynamic backgrounds, and (4) the inherent difficulty in accurately modeling crowd behavior. What is needed is an automatic system for analyzing crowd scenes and alerting human operators once anomalous activities are detected, so that dangerous situations can be prevented.

Anomaly detection refers to modeling the normal scene behavior and then detecting behavior that does not conform to it. Thus, behavior patterns that appear frequently are referred to as normal behaviors and those appearing rarely are referred to as abnormal behaviors. In [10], anomaly detection is broadly classified into two types, namely local and global. Local abnormal behavior corresponds to the behavior of a group of objects in a localized region that differs from that of their neighbors in spatio-temporal terms [16]. On the other hand, global abnormal behavior corresponds to the abnormal behavior of a group of

Figure 1: A typical scenario (anomalies circled in red). (a) The region of unstable flow of pilgrims circling around the Kaaba is detected. (b) Sample frame with a detected anomaly (bicycle) in the UCSD dataset.

objects in the whole scene. The key to accurate detection of abnormal behavior is the selection of an appropriate model that properly captures both the local and the global behavior. Figure 1-(a) shows a typical scenario: the red circle marks the detected region of unstable flow around the Kaaba in Mecca. Another example is illustrated in Figure 1-(b): the appearance of the bicyclist, circled in red, represents an anomaly with respect to the overall behavior of its surrounding neighbors.

In this paper, we aim at detecting local and global abnormal behaviors in crowd scenes using a social network model: a data structure consisting of nodes and links between the nodes. In the crowd scene context, nodes can represent people and links reflect the social relationships among the people. First, the unsupervised approach extracts dense tracklets from the crowd motion data in a scene. Second, the video scene is partitioned at spatial and temporal levels; as a result, a set of spatio-temporal cuboids is constructed. The granularity of scene partitioning is proportional to the crowd density. Third, we cluster the objects in each cuboid based on the features of their tracklets, such as velocity, curvature and direction, to build the local social networks, which model the objects' local behavior. Fourth, for each of the subsequent time windows, the global social network is updated incrementally using its local social networks and the previous window's global social network. By analyzing these social networks (local and global), normal, or dominant, behavior and abnormal behavior can be identified. An earlier version of this work appeared in [51].

2. Related Work

Crowd behavior analysis comprises motion information extraction and behavior modeling. The model is then used to distinguish between normal and abnormal behavior. Basharat et al. [5] use object tracking [43] to detect unusual events in image sequences. Similarly, Ali et al. [12,13] track subjects in high-density crowd scenes captured from a distance. They learn the direction of motion as prior information based on a force model (floor fields). However, their method requires a manual selection of the individuals to be tracked in the crowd, which hinders automatic recognition of unexpected behavior. Moreover, floor fields are unreliable in crowded scenes as they result in highly inconsistent trajectories. For motion modeling, features such as optical flow [11], tracklets [26], or Mixtures of Dynamic Textures [16] are extracted at the pixel level. Different models are then built to cope with occlusion and clutter, including the Gaussian Mixture Model [21] and the Social Force Model [10]. For example, Mehran et al. [10] explore the socio-psychological concept of social force in combination with optical flow to compute interaction forces, which are later combined with Latent Dirichlet Allocation to model normal behaviors and detect abnormal ones. This method is further extended in [11] using Particle Swarm Optimization, in addition to the social force model, to optimize the computed interaction force and thus detect global abnormal activities. Ali and Shah [13] utilize the idea of coherent structures in fluid dynamics for

Figure 2: Scenario of detecting anomalies using the social network model: objects are detected and tracked. A spatio-temporal partitioning is constructed, producing a set of spatio-temporal cuboids that capture spatial and temporal features. A hierarchical social network is built to model crowd behavior. At the bottom level of this hierarchical network, a spatial clustering is applied on each cuboid to detect local anomalies in its local social network. Moving up the hierarchical social network, a hierarchical clustering approach is employed to build the global social network. A temporal clustering is then applied on the global social network to detect global anomalies in each time window. An on-line mechanism is applied to update the global social network for any subsequent time windows.

segmenting dominant crowd flows and detecting flow instabilities. Gaidon et al. [38] structure a video as a tree of nested motion components composed of short-duration point trajectories (tracklets). Chongjing et al. [26] analyze motion patterns by clustering the extracted tracklets in dynamic crowd scenes. The authors of [46] use a spatio-temporal Laplacian eigenmap to extract different crowd activities from videos. Despite the many different representations of video events, many existing works ignore the importance of contextual anomaly in the field of crowd analysis. A contextual anomaly arises when an individual exhibits behavior similar to others but anomalous in a specific context (e.g. neighborhood) [15]. Jiang et al. [15] focus on detecting contextual anomalies in the context of motion using statistical analysis. Leach et al. [18] detect subtle context-dependent behavioral anomalies based on contextual information. Besides motion information, other works include important object features such as appearance or size. Mahadevan et al. [16] apply Mixtures of Dynamic Textures (MDT) to jointly model the appearance and dynamics of crowded scenes. Their approach investigates both temporal and spatial abnormalities. Due to the reported heavy computational cost of [16], Reddy et al. [17] propose a more robust anomaly detection algorithm with relatively low complexity that analyzes size, motion and texture.

An important aspect of crowd behavior analysis is event/behavior recognition. Regular motion patterns such as direction and speed [24,25,40] can be used to estimate the behavior of a crowd in a given environment; a behavior that deviates from the normal behavior is considered abnormal. Two types of approaches are commonly used: object-based approaches and holistic approaches [10]. In object-based

approaches, the crowd is considered as a collection of individuals. Ozturk et al. [24] propose an approach for clustering a set of flow vectors into local dominant motion flows, which are later combined to determine the global dominant motion flows in a crowd scene. In holistic approaches, a crowd, or a portion of a crowd, is treated as a single entity to estimate the regular and abnormal motions. For example, Mehran et al. [10] explored the social force model, which is based on socio-psychological studies, to model the behavior of a crowd.

Anomaly Detection Techniques: To ensure public safety, the main objective of crowd analysis involves modeling the crowd dynamics and detecting video anomalies in the scene. However, detecting anomalies in crowd scenes is a challenging task due to the following [1][2]: the large number of moving objects in crowd scenes easily weakens a local anomaly detector; abnormal events are difficult to model, as they are rare and last for a short period of time; and it is difficult to obtain a training dataset that covers every possible normal behavior. The authors of [48] propose an informative structural context descriptor (SCD), in addition to the 3-D discrete cosine transform (DCT), for describing the crowd individual. Ullah et al. [20], Mehran et al. [10] and Cui et al. [22] detect abnormal events in escape-panic scenes. Ullah et al. [20] initialize a fixed grid of particles to extract the crowd motion features, and a Gaussian Mixture Model [27] is adopted to learn the crowd behavior. The closest works to the proposed method, in terms of considering people's social behaviors, are [10] and [22]. Mehran et al. [10] attempt to detect abnormal events with a social force model; a bag-of-words method and Latent Dirichlet Allocation are exploited to discriminate between normal and abnormal frames, and abnormal areas are localized as those exhibiting higher force magnitudes. Cui et al. [22] propose interaction energy potentials to model group activities based on social behavior analysis and finally detect escape-panic behavior in a crowd. Saligrama et al. [19] categorize approaches for detecting abnormal behavior in crowd scenes into two types: local abnormal event (LAE) or global abnormal event (GAE). For LAE, most of the state-of-the-art methods extract motion or appearance features from local patches, as in Mahadevan et al. [16]. For GAE, Mehran et al. [10] detect abnormal crowd behavior by adopting the social force model and then using Latent Dirichlet Allocation to discriminate abnormal frames from normal ones. The above methods are often computationally expensive [20]. We propose a simple yet robust approach in which motion features are extracted from corner features by repeatedly generating features-to-track over a temporal window using the KLT (Kanade-Lucas-Tomasi) tracker [28,39]. In addition, our method is application-independent: it detects abnormal behaviors in videos from different applications. The proposed method not only detects anomalous events accurately, but also adapts itself to both spatial and temporal changes witnessed in the environment over time. An overview of the proposed method is shown in Figure 2.

3. Scene Modeling with Social Networks

We are given a set of objects in a crowd scene, O = {o_1, o_2, ..., o_N}, where N is the number of objects and each o_i \in R^d is a feature vector, based on spatial and temporal characteristics, describing an individual object, with d being the feature dimensionality.
In order to capture the dynamics of the crowd, we extract motion tracklets [24,25] using the KLT keypoint tracker [39]. A tracklet, t_i, is a fragment of a long trajectory, tracked across a small number of frames. The short duration of tracklets limits drifting problems, i.e. trajectories deviating from the underlying tracked object.
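As a concrete illustration of this step, the sketch below extracts short tracklets with OpenCV's KLT tracker over one temporal window. This is not the authors' code; the window length and minimum feature distance follow values reported later in the paper, while the remaining parameter values are assumptions.

```python
# Illustrative sketch (not the authors' implementation): short KLT tracklets
# over a fixed-length temporal window using OpenCV.
import cv2
import numpy as np

def extract_tracklets(frames, window_len=50, min_distance=3, max_corners=500):
    """Track corner features across `window_len` frames; each tracklet is an
    array of (x, y) positions of one feature point."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames[:window_len]]
    pts = cv2.goodFeaturesToTrack(gray[0], maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=min_distance)
    tracklets = [[p.ravel()] for p in pts]
    alive = np.ones(len(pts), dtype=bool)
    prev_img, prev_pts = gray[0], pts
    for img in gray[1:]:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_img, img, prev_pts, None)
        for i, (p, ok) in enumerate(zip(nxt, status.ravel())):
            if alive[i] and ok:
                tracklets[i].append(p.ravel())   # extend the live tracklet
            else:
                alive[i] = False                 # lost features end their tracklet
        prev_img, prev_pts = img, nxt
    return [np.array(t) for t in tracklets if len(t) > 1]
```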

3.1 Similarity Features

In order to group tracklets that exhibit similar behavior, we focus on selecting features that account for (i) the direction and magnitude of the motion, (ii) the distance between the moving objects, and (iii) the motion curvature of the object. Thus, we use the following measures.

Cosine Similarity: Let d_i and d_j denote the dominant directions of tracklets t_i and t_j, respectively. The cosine similarity is defined as [36]:

S_{dir}(t_i, t_j) = \frac{1}{2} \left( 1 + \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|} \right)    (1)

Magnitude Similarity: Let m_i and m_j denote the magnitudes (i.e. the distance between the first and the last spatial coordinates) of tracklets t_i and t_j, respectively. The magnitude similarity is defined as:

S_{mag}(t_i, t_j) = 1 - \frac{|m_i - m_j|}{\max(m_i, m_j)}    (2)

Combining both similarity measures linearly produces a weighted similarity measure [36]:

S_{dm}(t_i, t_j) = \alpha \, S_{dir}(t_i, t_j) + (1 - \alpha) \, S_{mag}(t_i, t_j), with 0 \le \alpha \le 1,    (3)

where α is the parameter that balances the effect of the direction and the magnitude of the two tracklets.

Velocity Similarity Measure: The velocity is computed for each tracklet and Dynamic Time Warping (DTW) is used to measure the velocity similarity S_{vel}(t_i, t_j) between two tracklets t_i and t_j. We use the following local distance measure:

d_{vel}(t_i, t_j) = \exp \left( - \frac{d_x^2}{2\sigma_x^2} - \frac{d_y^2}{2\sigma_y^2} \right)    (4)

where d_x and d_y represent the velocity distances between the two tracklets along the x-axis and y-axis, respectively, and σ_x and σ_y are the standard deviation parameters of the x-velocity and the y-velocity, respectively.

Spatio-Temporal Curvature Similarity Measure: This measure, which captures discontinuities in the velocity, acceleration and position of an object, is given by:

\kappa = \frac{|v_x a_y - v_y a_x|}{(v_x^2 + v_y^2)^{3/2}}    (5)

where v_x and v_y are the x and y components of the velocity and a_x and a_y are the x and y components of the acceleration. The resulting similarity, denoted S_{curv}(t_i, t_j), is computed using DTW with the following local distance measure:

d_{curv}(t_i, t_j) = \exp \left( - \frac{(\kappa_i - \kappa_j)^2}{2\sigma_\kappa^2} \right)    (6)

where κ_i - κ_j is the curvature distance between tracklets t_i and t_j, and σ_κ is the standard deviation parameter of the spatio-temporal curvature.
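The sketch below shows how the pairwise similarities of Equations (1)-(6) can be computed for tracklets given as arrays of (x, y) positions. The normalizations are reconstructed from the text above, so the constants and the plain DTW routine are assumptions rather than the authors' implementation.

```python
# Illustrative sketch of the tracklet similarity measures (Eqs. (1)-(6)).
import numpy as np

def dtw_similarity(a, b, local_sim):
    """Plain DTW alignment; returns the average local similarity along the
    optimal path (1 = identical, 0 = dissimilar)."""
    if len(a) == 0 or len(b) == 0:
        return 0.0
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 1.0 - local_sim(a[i - 1], b[j - 1])   # DTW minimises dissimilarity
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return 1.0 - D[len(a), len(b)] / (len(a) + len(b))

def direction_magnitude_sim(ti, tj, alpha=0.5):
    """Eqs. (1)-(3): cosine similarity of dominant directions, magnitude
    similarity, and their weighted combination."""
    di, dj = ti[-1] - ti[0], tj[-1] - tj[0]
    cos = 0.5 * (1 + np.dot(di, dj) / (np.linalg.norm(di) * np.linalg.norm(dj) + 1e-9))
    mi, mj = np.linalg.norm(di), np.linalg.norm(dj)
    mag = 1.0 - abs(mi - mj) / (max(mi, mj) + 1e-9)
    return alpha * cos + (1 - alpha) * mag

def velocity_sim(ti, tj, sigma_x=1.0, sigma_y=1.0):
    """Eq. (4): DTW over per-frame velocities with a Gaussian local similarity."""
    vi, vj = np.diff(ti, axis=0), np.diff(tj, axis=0)
    local = lambda u, v: np.exp(-(u[0] - v[0])**2 / (2 * sigma_x**2)
                                - (u[1] - v[1])**2 / (2 * sigma_y**2))
    return dtw_similarity(vi, vj, local)

def curvature_profile(t):
    """Eq. (5): spatio-temporal curvature from velocity and acceleration."""
    v = np.diff(t, axis=0)
    a = np.diff(v, axis=0)
    v = v[1:]                                     # align lengths with acceleration
    return np.abs(v[:, 0] * a[:, 1] - v[:, 1] * a[:, 0]) / \
           ((v[:, 0]**2 + v[:, 1]**2)**1.5 + 1e-9)

def curvature_sim(ti, tj, sigma_k=1.0):
    """Eq. (6): DTW over curvature profiles with a Gaussian local similarity."""
    local = lambda u, v: np.exp(-(u - v)**2 / (2 * sigma_k**2))
    return dtw_similarity(curvature_profile(ti), curvature_profile(tj), local)
```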

Figure 3: Spatio-temporal cuboids at various spatial and temporal scales (represented by the upper arrows). The scale representation scheme is performed in the opposite direction (represented by the lower arrows).

The similarity measures defined above are used by the proposed method (SNM) to cover the following cases. Cosine Similarity: this covers tracklets with zero Euclidean distance but moving in different directions; they are considered dissimilar by SNM. Magnitude Similarity: this applies to tracklets moving in the same direction but having different lengths; a short tracklet is not considered similar to a long tracklet. Velocity Similarity: spatially dissimilar tracklets moving in the same direction and having almost equal lengths are not considered similar if they exhibit different motion behavior. Spatio-Temporal Curvature Similarity: tracklets similar in all the measures defined above but with different curvatures are considered dissimilar.

We are now able to define our two social similarity measures between two tracklets t_i and t_j:

Definition 1 (Velocity-based Social Similarity Measure). Let S_{dm}(t_i, t_j) denote the direction-magnitude similarity and S_{vel}(t_i, t_j) denote the velocity similarity between the two tracklets t_i and t_j. Then the social similarity measure between t_i and t_j is defined as

S^{v}_{soc}(t_i, t_j) = \beta \, S_{dm}(t_i, t_j) + (1 - \beta) \, S_{vel}(t_i, t_j), with 0 \le \beta \le 1,    (7)

where β is a parameter that balances the effect of direction and magnitude on one hand and the velocities of the two tracklets on the other.

Definition 2 (Curvature-based Social Similarity Measure). Let S_{dm}(t_i, t_j) denote the direction-magnitude similarity and S_{curv}(t_i, t_j) denote the spatio-temporal curvature similarity between the two tracklets t_i and t_j. Then the social similarity measure between t_i and t_j is defined as

S^{\kappa}_{soc}(t_i, t_j) = \gamma \, S_{dm}(t_i, t_j) + (1 - \gamma) \, S_{curv}(t_i, t_j), with 0 \le \gamma \le 1,    (8)

where γ is a parameter that balances the effect of direction and magnitude on one hand and the spatio-temporal curvatures of the two tracklets on the other.
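Continuing the sketch above, the two social similarity measures of Definitions 1 and 2 are convex combinations of the helpers already defined; the default weights follow the values reported in Section 4, under the parameterization assumed here.

```python
# Illustrative sketch of Eqs. (7) and (8); reuses direction_magnitude_sim,
# velocity_sim and curvature_sim from the previous sketch.
def social_sim_velocity(ti, tj, beta=0.4, alpha=0.5, sigma_x=1.0, sigma_y=1.0):
    """Definition 1: direction-magnitude similarity blended with velocity similarity."""
    return beta * direction_magnitude_sim(ti, tj, alpha) + \
           (1 - beta) * velocity_sim(ti, tj, sigma_x, sigma_y)

def social_sim_curvature(ti, tj, gamma=0.8, alpha=0.5, sigma_k=1.0):
    """Definition 2: direction-magnitude similarity blended with curvature similarity."""
    return gamma * direction_magnitude_sim(ti, tj, alpha) + \
           (1 - gamma) * curvature_sim(ti, tj, sigma_k)
```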

Figure 4: The procedure for producing an LSN per cuboid. (a) We partition the current time window into cuboids using the spatio-temporal partitioning approach, and then determine the tracklets within each processed cuboid; the tracklets in the cuboid are colored differently for clarity. (b) Symmetric adjacency matrix of tracklet-node similarity weights. (c) Connected tracklet nodes make up a local social network component, represented by its average-feature centroid (black dot).

The above two measures capture different behaviors of the scene. As we shall show, one of these measures might be more appropriate for a certain crowd scene than the other, depending on the application and the scene dynamics. Our social similarity measures are thus flexible and work with different features depending on the nature of the video.

3.2 Spatio-Temporal Partitioning

Inspired by multi-resolution approaches, we sub-divide the input videos into smaller regions. This spatio-temporal partitioning is performed at various spatial and temporal scales, producing a unique set of spatio-temporal volumes. We refer to an individual spatio-temporal volume as a cuboid; the numbers of rows and columns of the spatio-temporal partitions within a window Ω depend on the chosen scale. Each 3D spatio-temporal cuboid in a video is of size n_x × n_y × n_f, where n_x × n_y is the spatial dimension of the cuboid and n_f is the depth (the number of frames). Each cuboid consists of the tracklets found within its extent; therefore, a tracklet may belong to one or more cuboids within a time window Ω. Depending on the dataset and the crowd dynamics, the spatial partitioning may range from 2 × 2 to m × m blocks, with a temporal window of f frames. We observed that a shorter duration (< 50 frames) yields erroneous tracklets due to motion blur and self-occlusions; therefore, in our experiments we set f to 50. Figure 3 illustrates the construction of the video hierarchy, forming spatio-temporal cuboids at various levels, i.e. 2 × 2, 4 × 4 or 8 × 8: the higher the density of the crowd, the finer the granularity of the partitioning needed to capture the details of the scene dynamics (illustrated by the right-to-left arrows in Figure 3).
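A minimal sketch of the spatial part of this partitioning is shown below, assuming tracklets are given as (x, y) position arrays for one time window; the grid size m is a parameter, and a tracklet is assigned to every cuboid that any of its points falls into.

```python
# Illustrative sketch: assigning tracklets to an m x m grid of spatial cells
# (one time window, i.e. one temporal slice of cuboids).
from collections import defaultdict

def partition_into_cuboids(tracklets, frame_w, frame_h, m=8):
    """Return {(row, col): [tracklet indices]} for an m x m spatial grid."""
    cell_w, cell_h = frame_w / m, frame_h / m
    cuboids = defaultdict(set)
    for idx, t in enumerate(tracklets):
        for x, y in t:
            col = min(int(x // cell_w), m - 1)
            row = min(int(y // cell_h), m - 1)
            cuboids[(row, col)].add(idx)        # a tracklet may span several cells
    return {cell: sorted(ids) for cell, ids in cuboids.items()}
```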

3.3 Building Social Networks

A social network is represented as a graph [30] in which nodes represent objects and edges represent social interactions between people [29]. That is, each tracklet is represented by a node in the social network model, and the edge between two nodes represents the social interaction between them. The social interaction weights are based on our social similarity measure, Equation (7) or Equation (8). On a graph, the geodesic between two nodes is a path connecting the nodes with the smallest number of edges. Since similarly behaving tracklets also need to be spatially close to each other, in addition to the social similarity measure we use the closeness centrality among connected tracklet nodes, for pruning only. The closeness centrality is defined as (the inverse of) the average distance to all other nodes [44]. If similar nodes are spatially distant (greater than a threshold), their connecting edge is deleted. This is followed by applying the connected component algorithm [35] to the whole network to find the connected components of the social network in each cuboid. Each extracted connected component is considered a cluster and denoted a local social network (LSN). The aim is to identify the different dynamics of the scene, represented by the clusters in the network.

3.3.1 Building Local Social Networks (LSN)

Each cluster obtained above is denoted by its centroid, computed as the mean of the spatial (x, y), direction (θ), magnitude (m), velocity (v) and/or curvature (κ) features of the tracklets belonging to the cluster:

c_k = (\bar{x}, \bar{y}, \bar{\theta}, \bar{m}, \bar{v}, \bar{\kappa})    (9)

By finding the connected components, as defined above, we end up with a number of clusters within each cuboid, each referred to as a local social network (LSN). Figure 4 shows an example of processing one cuboid (shaded in red); the six extracted tracklets are colored differently for clarity, and each node is colored by its tracklet's color for ease of reference. Algorithm 1 uses a threshold on the computed social similarity measure between two tracklets t_i and t_j, and a threshold on the computed closeness centrality measure between tracklets. The results are stored in a symmetric similarity adjacency matrix A, in which a non-zero value represents the weight of similarity between two tracklet nodes: zero indicates dissimilarity and one indicates the highest similarity. Finally, for each component of a local social network, its centroid is computed (represented by the black dot in Figure 4, bottom right). Hence, the algorithm outputs the local social network component(s) of each cuboid together with the corresponding centroid(s).
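The sketch below mirrors this LSN construction: pairwise social similarities are thresholded into an adjacency matrix, edges between spatially distant tracklets are pruned, and connected components become LSN components with centroids as in Equation (9). For simplicity the pruning here uses the distance between tracklet starting points instead of closeness centrality, and the threshold names are assumptions.

```python
# Illustrative sketch of building local social networks (LSNs) within one cuboid.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def tracklet_features(t):
    """Eq. (9) features: mean position, direction angle, magnitude, mean speed."""
    d = t[-1] - t[0]
    v = np.diff(t, axis=0)
    return np.array([t[:, 0].mean(), t[:, 1].mean(),
                     np.arctan2(d[1], d[0]), np.linalg.norm(d),
                     np.linalg.norm(v, axis=1).mean()])

def build_lsns(tracklets, sim_fn, t_sim=0.7, t_dist=30.0):
    """Threshold pairwise similarities into an adjacency matrix, prune spatially
    distant pairs, and return connected components with their centroids."""
    n = len(tracklets)
    A = np.zeros((n, n))
    starts = np.array([t[0] for t in tracklets])
    for i in range(n):
        for j in range(i + 1, n):
            s = sim_fn(tracklets[i], tracklets[j])
            if s >= t_sim and np.linalg.norm(starts[i] - starts[j]) <= t_dist:
                A[i, j] = A[j, i] = s
    n_comp, labels = connected_components(csr_matrix(A), directed=False)
    lsns = []
    for c in range(n_comp):
        members = np.where(labels == c)[0]
        feats = np.array([tracklet_features(tracklets[k]) for k in members])
        lsns.append({"members": members, "centroid": feats.mean(axis=0)})
    return lsns
```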

3.3.2 Building Global Social Networks (GSN)

The GSN gives a general view of the activities occurring in a time window Ω. Hierarchical Agglomerative Clustering (HAC) [34] is applied to merge similar LSNs from different cuboids in a time window into a global social network, GSN, in a hierarchical fashion. Merging two components of the local social networks (say LSN_i and LSN_j) is based on the social similarity, Equation (7) or Equation (8), between their centroids c_i and c_j. That is, if the social similarity value between c_i and c_j is above the threshold, then LSN_i and LSN_j are merged together into a bigger social network and its new centroid is computed. This process continues up the hierarchy until no more merging is possible. The resulting global social network, GSN, may consist of one or more components. This bottom-up approach, as shown in Figure 5, aims to merge similar LSNs from different cuboids and finally discover the global social network within the time window Ω. This is performed by the GlobalSocialNetwork algorithm, which takes as input the local social networks from all cuboids within time window Ω, including the representative centroid of each LSN. The results of the LSN-to-LSN comparisons are stored in a symmetric similarity adjacency matrix, representing an undirected graph, in which a non-zero value represents the weight of similarity between two LSN components and zero indicates dissimilarity.
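A simplified sketch of this bottom-up construction is given below: LSN components (as produced by the previous sketch) are merged greedily whenever the social similarity between their centroids exceeds a threshold, and merged components receive a size-weighted centroid. This is an illustrative agglomerative pass, not the authors' exact HAC implementation.

```python
# Illustrative sketch of merging LSN components from all cuboids into a GSN.
def build_gsn(lsns, centroid_sim, t_lsn=0.7):
    """Greedy agglomerative merging of LSN components by centroid similarity."""
    comps = [dict(l) for l in lsns]
    merged = True
    while merged:
        merged = False
        for i in range(len(comps)):
            for j in range(i + 1, len(comps)):
                if centroid_sim(comps[i]["centroid"], comps[j]["centroid"]) >= t_lsn:
                    a, b = comps[i], comps.pop(j)
                    na, nb = len(a["members"]), len(b["members"])
                    a["members"] = list(a["members"]) + list(b["members"])
                    a["centroid"] = (na * a["centroid"] + nb * b["centroid"]) / (na + nb)
                    merged = True
                    break
            if merged:
                break
    return comps
```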

3.4 Anomaly Detection

The social similarity measure and the size of a social network are essential for detecting abnormal behavior. The social similarity measure separates the rare actions from the dominant ones: a resulting social network with very few nodes is regarded as deviating from the other, dominant social network(s). Thus, isolated and small (few-node) social networks are marked as anomalies. If the relative local size of a tested local social network component LSN_i, i.e. the ratio of the size of LSN_i to that of the largest local social network component in the cuboid, is less than t_s, then LSN_i is classified as an anomaly. t_s is set to 0.5 in our experiments:

RS(LSN_i) = \frac{|LSN_i|}{\max_{1 \le k \le n} |LSN_k|} < t_s    (10)

Table 1 shows an example of a window Ω consisting of 50 frames partitioned into 2 × 2 spatio-temporal cuboids. Processing the cuboid enclosed in red, for instance, produces seven LSN components, of which four are normal and the other three are abnormal. As shown in the Anomaly Detection Algorithm, the abnormality classification is based on the social similarity measure, Equation (7) or Equation (8), followed by the size of an LSN relative to the largest local social network component within the cuboid, Equation (10). Once an anomalous LSN is identified, it is localized simply by using the spatial features of the tracklet members in the anomalous LSN. In other words, the social similarity measure isolates the anomalous components from the normal ones, and the size feature is then used to identify those anomalous components (see the Anomaly Detection Algorithm). Within each window Ω, global anomalies at the top level of the hierarchy are identified using Equation (11): if the relative global size of a target GSN component GSN_j, i.e. the ratio of the size of GSN_j to that of the largest GSN component in Ω, is less than t_g, then GSN_j is classified as anomalous:

RS(GSN_j) = \frac{|GSN_j|}{\max_{1 \le k \le m} |GSN_k|} < t_g    (11)
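The size-based rules of Equations (10) and (11) reduce to the same test applied at the cuboid level and at the window level, as in the sketch below; t_s = 0.5 follows the paper, while reusing the same default for t_g is an assumption.

```python
# Illustrative sketch of the relative-size anomaly rule (Eqs. (10) and (11)).
import numpy as np

def flag_anomalous(components, t_ratio=0.5):
    """components: list of dicts with a 'members' list; returns a boolean flag
    per component, True when its size relative to the largest component in the
    same cuboid (local) or window (global) falls below the threshold."""
    sizes = np.array([len(c["members"]) for c in components], dtype=float)
    if len(sizes) == 0:
        return np.array([], dtype=bool)
    return (sizes / sizes.max()) < t_ratio

# Localization: the spatial features of the tracklets inside a flagged LSN
# directly give the anomalous image region.
```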

An example of global anomaly detection is shown in Table 2. By using the hierarchical partitioning scheme, we can zoom in to finer details of the crowd behavior, which increases the efficiency of detecting and localizing anomalies, especially local anomalies. Also, as we move up the hierarchy, certain tracklet nodes classified as abnormal in a lower-level LSN component might be merged with other nodes into a normal higher-level LSN component, and vice versa.

3.5 GSN-Update

The proposed hierarchical model maintains a link between the local and global social networks. In this phase, we seek to learn any newly observed events and, in turn, update the global social network, implicitly incorporating any changes at the bottom level of the hierarchy, i.e. in the local social networks. The GSN-Update process works as follows (illustrated in Figure 6; a sketch of this update appears after the list):

1. For every two successive windows, the cluster centroids of Equation (9) are compared, instead of performing tracklet-to-tracklet comparisons, which would demand a high computational time.
2. GSN components of the current window are merged with GSN components of the previous window, i.e. the corresponding GSNs of the two windows are compared. Two GSN components are merged only if their centroids exhibit similar features.
3. Non-matching GSN components from the two windows are dealt with as follows:
   a. Non-matching global social network component(s) that belong to the previously processed time window are destroyed.
   b. Non-matching global social network component(s) that belong to the current time window are preserved.
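A sketch of the GSN-Update step between two consecutive windows: components are matched by centroid similarity, matched pairs are merged, unmatched components of the previous window are dropped, and unmatched components of the current window are kept. The matching threshold and the exact merge rule are assumptions.

```python
# Illustrative sketch of GSN-Update between the previous and current windows.
def gsn_update(prev_gsn, curr_gsn, centroid_sim, t_match=0.7):
    """Return the updated GSN components for the current window."""
    updated, matched_prev = [], set()
    for comp in curr_gsn:
        best, best_sim = None, t_match
        for i, prev in enumerate(prev_gsn):
            sim = centroid_sim(comp["centroid"], prev["centroid"])
            if i not in matched_prev and sim >= best_sim:
                best, best_sim = i, sim
        if best is not None:
            prev = prev_gsn[best]
            matched_prev.add(best)
            n_c, n_p = len(comp["members"]), len(prev["members"])
            # merge: keep the current window's tracklets, blend the centroids
            comp = {"members": comp["members"],
                    "centroid": (n_c * comp["centroid"] + n_p * prev["centroid"]) / (n_c + n_p)}
        updated.append(comp)     # unmatched current components are preserved
    # unmatched previous components are destroyed (simply not carried over)
    return updated
```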

Figure 5: Constructing the GSN by hierarchically grouping similar LSN components from different cuboids. (a) Once the local social network components are obtained, the hierarchical clustering algorithm is employed to obtain a coarser view of the scene. (b) The bottom-up approach merges similar local social network components from different cuboids towards discovering the global social network within the time window.

As an example, Figure 6 shows three successive windows. GSN-Update on the first two windows merges their similar GSNs, i.e. the red components and the blue components, respectively. The component satisfying condition (a) above is destroyed, while the component satisfying condition (b) above is preserved. The result of GSN-Update between the first and second windows is then used as the input for the GSN-Update with the successive window, and so on.

Table 1: Example of local anomaly detection. The time window is partitioned into 2 × 2 cuboids; the cuboid enclosed in red produces 7 local social network components, of which 4 are classified as normal and 3 exhibit abnormal behaviors.

LSN Component   Dominant Feature(s)                 No. of Tracklets   Local Anomaly
LSN1            Direction & Magnitude               1                  Yes
LSN2            Magnitude                           24                 No
LSN3            Direction & Magnitude               29                 No
LSN4            Direction & Magnitude               15                 No
LSN5            Direction & Magnitude               14                 No
LSN6            Direction & Magnitude               8                  Yes
LSN7            Direction & Magnitude & Velocity    7                  Yes

Table 2: Example of global anomaly detection. Window 1 produces 3 global social network components, colored red, green and yellow; of these, 2 are classified as normal social network components and 1 exhibits abnormal activity.

GSN Component   Dominant Feature(s)      No. of Tracklets   Global Anomaly
GSN1            Direction                345                No
GSN2            Direction & Velocity     21                 Yes
GSN3            Direction                274                No

Figure 6: GSN-Update on three successive windows. Global social network components are labeled with the index of their window. Similar global social network components from different windows share the same shape and color and are merged together, as shown in level 1: the red and the blue global social network components are merged, respectively. A global social network component that no longer matches anything from the previously processed window is destroyed, while newly appearing, non-matching global social network components of the current window are preserved. The result of GSN-Update between the first and second windows is used as the input for the GSN-Update with the successive window.

4. Experiments & Results

All experiments were run on a PC with an Intel(R) Core(TM) i5 3.10 GHz CPU and 4 GB of RAM, using a MATLAB implementation. We have used the following publicly available datasets.

UCSD Dataset: The UCSD anomaly detection dataset uses an elevated stationary camera overlooking pedestrian walkways on the UCSD campus. The dataset represents a real scene in which the abnormalities occur naturally, and it contains videos of two different pedestrian scenes: UCSD Ped1, containing groups of people walking towards and away from the camera with some amount of perspective distortion; and UCSD Ped2, containing groups of people walking parallel to the camera plane. The crowd density in the walkways is variable, ranging from sparse to crowded. The normal events contain only pedestrians. The abnormal events are due to either (1) the appearance of non-pedestrian entities in the walkways and/or (2) anomalous pedestrian motion patterns. Commonly occurring anomalies include small carts, skaters, bikes, and people in wheelchairs. The UCSD dataset provides both frame-level and pixel-level ground truth.

UCD Dataset: The UCD dataset contains two outdoor videos of students moving between two buildings, lasting 12 and 5 minutes, respectively. Each sequence is segmented into two different subsequences, with people mainly moving in a horizontal direction in the scene. This dataset defines an anomaly as a deviation from what has been observed beforehand. The ground truth consists of the frame numbers at which someone starts moving against the dominant crowd motion.

In our experiments, the parameter β of Equation (7) and the parameter γ of Equation (8) are determined experimentally to be 0.4 and 0.8, respectively.

4.1 Performance Evaluation

For both local and global scene understanding and anomaly detection we use [16]: a frame-level criterion, in which a frame is considered anomalous, and denoted as positive, if it contains at least one abnormal pixel; and a pixel-level criterion, in which a frame is considered anomalous if (i) it is positive and (ii) at least 40 percent of its anomalous pixels are truly identified. For GSN evaluation, the Receiver Operating Characteristic (ROC) curve is computed and the Area Under the Curve (AUC) is used for comparison. In addition, we measure [16]: the Equal Error Rate (EER), the percentage of misclassified frames when the false positive rate (FPR) equals the false negative rate (miss rate), i.e. FPR = 1 - true positive rate (TPR), calculated for both pixel- and frame-level analyses; and the Rate of Detection (RD), which reports the detection rate at the equal-error point of the anomaly localization component, i.e. RD = 1 - EER [16].

A. UCSD dataset: Local Social Network Evaluation

We use 200 frames at a resolution of 158 × 238. Features are detected with a minimum distance of 3 pixels to obtain complete coverage of the scene. We partition the dataset into 4 time windows; each time window of 50 frames is partitioned into 8 × 8 spatio-temporal cuboids. The dataset contains the biker anomaly. The results are compared against the ground truth in terms of frame accuracy and pixel accuracy, and the average frame accuracy and average pixel accuracy are computed for each time window. As shown in Table 3, green cuboids represent false positive abnormal behavior in some of their LSNs, while orange cuboids represent true positive abnormal activity; the abnormal region is enclosed by a red border. For instance, the first time window, starting at frame 1 and ending at frame 50, contains 7 false positive cuboids. The false abnormality is due to the existence of tracklets that exhibit rare features relative to the surrounding neighborhood; cuboid 12, for example, produces an abnormal LSN whose tracklets exhibit a short magnitude compared to the dominant, longer tracklets in the surroundings. In the next time window, which starts at frame 51 and ends at frame 100, two abnormal cuboids out of four are detected correctly; the correctly detected abnormal LSN components are in cuboid 47 and cuboid 55. However, the abnormal LSN components in cuboids 46 and 54 are wrongly classified as normal because: (i) in cuboid 46, the incomplete abnormal tracklet(s) exhibit features similar to their surrounding neighbors and were thus assigned to the normal LSN component, and (ii) cuboid 54 contains only one tracklet. The tracklets in cuboid 54 in the third window, which starts at frame 101 and ends at frame 150, were removed.
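For reference, the frame-level metrics defined in Section 4.1 (AUC, EER and RD = 1 - EER) can be computed from per-frame anomaly scores as in the sketch below; the use of scikit-learn here is an assumption, not the authors' MATLAB implementation.

```python
# Illustrative sketch of the frame-level evaluation metrics.
import numpy as np
from sklearn.metrics import roc_curve, auc

def frame_level_metrics(scores, labels):
    """scores: per-frame anomaly scores; labels: 1 for anomalous frames, 0 otherwise."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr                               # miss rate
    eer_idx = np.argmin(np.abs(fpr - fnr))        # point where FPR ~= miss rate
    eer = (fpr[eer_idx] + fnr[eer_idx]) / 2.0
    return {"AUC": auc(fpr, tpr), "EER": eer, "RD": 1.0 - eer}
```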

Table 3: Average frame accuracy and average pixel accuracy of the LSN algorithm. The dataset is partitioned into 4 time windows, each consisting of 50 frames and partitioned into 8 × 8 spatio-temporal cuboids. Green-colored blocks represent false positive abnormal behavior while orange-colored blocks represent true positive abnormal activity. For each time window, the frame accuracy and pixel accuracy of the abnormal local social network components are averaged. (Columns: frame sequence, anomalous spatio-temporal cuboids, frame accuracy, pixel accuracy; the numeric entries are not reproduced here.)

Moreover, cuboid 46 later gained more feature information regarding the abnormal tracklets, which were then distinguished from their neighborhood. The same applies to cuboid 27 in the fourth window, which starts at frame 151 and ends at frame 200. Further analysis of Table 3 shows that, as more tracklet information is gained or new behavior is captured by the newly produced tracklets, both the average frame accuracy and the average pixel accuracy increase over time. Moreover, each time window reflects the changing environment in the scene.

Figure 7: Frame-level ROC curves on the UCSD Ped1 dataset. Left: our proposed approach SNM. Right: the state-of-the-art methods from [23].

Figure 8: Frame-level ROC curve of our proposed approach SNM on the UCSD Ped2 dataset.

As an example, cuboid 55 in the third window was removed from the abnormal behaviors, reflecting the ongoing crowd scenario in the video. Another example is cuboid 36, which was wrongly classified as normal in the first window, but whose truly abnormal detections later increased to an average frame accuracy of 78.5 and an average rate of detection of 83 over the next two windows.

B. UCSD dataset: Global Social Network Evaluation

For performance comparison, we choose five state-of-the-art methods, namely: the Mixture of Dynamic Textures (DTM) [16], the Social Force model (SF) [10], MPPCA [45], the Social Force model with MPPCA (MPPCA+SF) [16], and the optical-flow monitoring method of Adam et al. [46]. The quantitative results of these five methods are obtained from [16]. In addition, we also include the Sparse Reconstruction Cost (Sparse) method of [23]. Our proposed method is abbreviated SNM (Social Network Model). Two frame-level ROC curves are produced, for UCSD Ped1 and UCSD Ped2, as shown in Figure 7 and Figure 8, respectively. As UCSD Ped2 does not provide pixel-level ground truth, we only present the pixel-level ROC curve of UCSD Ped1, in Figure 9. In addition, Figure 10 shows the Equal Error Rate (EER) of our approach and the state-of-the-art methods.

Figure 9: Pixel-level ROC curves on the UCSD Ped1 dataset. Left: our proposed approach SNM. Right: the state-of-the-art methods from [23].

We also report the Area Under the Curve (AUC) values (Table 4) and the Rate of Detection (RD) values (Table 5); missing entries indicate unavailable results. Examples of frames with anomalies detected by the proposed approach and by some state-of-the-art methods are shown in Figure 11. Our frame-level ROC curve on UCSD Ped2 shows a higher anomaly detection rate than the existing methods, and on UCSD Ped1 it is only slightly lower than Sparse [23]. Our pixel-level results on UCSD Ped1 (see Table 5) outperform all state-of-the-art methods. For the EER, our frame-level EER on UCSD Ped1 (about 20%) outperforms all methods except the Sparse method [23] (about 19%), see Figure 10. However, for the more precise pixel-level criterion (RD) on UCSD Ped1 (see Table 5), our rate of detection of 48.5% exceeds the 46% of [23] and significantly outperforms all the state-of-the-art methods. For the AUC on the UCSD Ped1 and UCSD Ped2 datasets, we obtain 86.7% on average, which also outperforms all the other methods including [23], whose average AUC is 86.1% (see Table 4). This indicates that the remaining approaches may be enjoying good detection rates in the anomaly detection task due to lucky hits in terms of the frame-level criterion. Some image results are shown in Figure 11 (the abnormal events are labeled by red masks), in which the first column is generated by the DTM method [16], the second column by the MPPCA+SF method [16], and the third and fourth columns by our SNM method. The MPPCA+SF method completely misses the biker in Figure 11-(b). The DTM method does detect nearly all of the abnormal events, but its foreground mask is too large and hence inaccurate, as shown in the first column of Figure 11. Our method detects (third column of Figure 11) and tracks (fourth column of Figure 11) the abnormal objects, such as bikers, skaters and small carts, robustly and with more accurate masks. The proposed SNM method thus outperforms the other state-of-the-art methods. Our approach achieves a high anomaly localization rate due to the efficiency of the hierarchical construction of spatio-temporal cuboids at different spatial and temporal scales.

Figure 10: Frame-level Equal Error Rate on the UCSD Ped1 and UCSD Ped2 datasets.

Table 4: Quantitative comparison of the abnormality detection algorithms tested: AUC over the UCSD Ped1 and UCSD Ped2 datasets, and the average over the two datasets. A missing entry indicates an unavailable result.

Anomaly Detection Experiment: AUC
Algorithm    DTM [16]   SF [10]   MPPCA [45]   MPPCA+SF [16]   Adam et al. [46]   Sparse [23]   SNM
UCSD Ped1    81.8%      67.5%     59.0%        66.8%           -                  86.0%         85.5%
UCSD Ped2    84.8%      62.3%     77.4%        71.0%           63.4%              86.1%         87.9%
Average      83.3%      64.9%     68.2%        68.9%           63.4%              86.1%         86.7%

Table 5: Quantitative comparison of the rate of detection (RD) at the equal-error point for the anomaly localization task on UCSD Ped1, for DTM [16], SF [10], MPPCA [45], MPPCA+SF [16], Adam et al. [46], Sparse [23] and SNM. Our SNM approach achieves the highest detection rate among the compared methods (48.5%, versus 46% for Sparse [23]); the remaining numeric entries are not reproduced here.

Figure 11: Examples of abnormal detections using (i) the DTM approach [16], (ii) the MPPCA+SF approach [16], (iii) our detection approach and (iv) our tracking approach. The abnormal-detection foreground mask of DTM is too large, so its results are not accurate. MPPCA+SF inaccurately detects the small cart in (a), completely misses the bike in (b), completely misses the skater in (c), and produces a spurious abnormality at the near end of the camera view in (c). In contrast, our social network model approach outperforms the above approaches with a high detection accuracy.

Figure 12: The ROC curves of different spatio-temporal scales (2 × 2, 4 × 4 and 8 × 8) on the UCSD Ped2 dataset.

Table 6: The AUC of different spatio-temporal scales (2 × 2, 4 × 4 and 8 × 8) on the UCSD Ped2 dataset (the numeric AUC entries are not reproduced here).

Figure 13: Anomaly detection in the UCD dataset. The first row shows frames from video sequences representing normal crowd behavior. Examples of frames containing anomalies are shown in the second row for the GMM method and in the third row for the proposed method (SNM).

Spatio-temporal partitioning at different scales: In order to evaluate the impact of different spatio-temporal scales, we experiment with 2 × 2, 4 × 4 and 8 × 8 spatio-temporal scales on the UCSD Ped2 dataset. The comparative AUC values are shown in Table 6 and the ROC curves in Figure 12. The 8 × 8 spatio-temporal partitioning achieves the best result, the 4 × 4 partitioning degrades slightly, and the 2 × 2 partitioning produces the worst result, because it provides only a coarse view of the scene.

C. Comparison

From the above experiments we note the following. (i) SNM is general: it covers both local and global anomalous events. In contrast, the work of Adam et al. [46] detects only local abnormal events using Gaussian Mixture Models; furthermore, SF [10] is a spatial abnormality technique while MPPCA [45] is a temporal abnormality technique. (ii) SNM is an unsupervised method, whereas Sparse [23] requires a pre-learnt dictionary and the MPPCA+SF approach [16] requires a large training dataset; moreover, the performance of MPPCA+SF [16] degrades if the training dataset is small, and the social force model relies on offline learning. (iii) SNM extends to online event detection via its incremental update mechanism. Although Sparse [23] also supports online event detection, its training is completely offline, and DTM [16] is likewise an offline approach.

D. UCD Dataset

For this dataset, we compare SNM with the Gaussian mixture model (GMM) [21] and the crowd segmentation model (CSM) [4], based on the anomaly detection ground truth. The performance is measured by the detection accuracy rate. A quantitative comparison of SNM against the ground truth is shown in Table 7. SNM achieves the higher anomaly detection accuracy in all four video segments. For anomaly localization, we compare our results to the GMM [21] method. Figure 13 shows the results obtained on the UCD dataset: frames from video scenes where the crowd exhibits normal behavior are shown in the first row, the second row shows results from GMM [21], and the third row shows results from SNM. The anomalous behaviors are a student running from bottom left to top right and a group of four students running from left to right. Both events are identified as anomalous since they deviate from the dominant crowd motion. Although GMM [21] correctly identifies the anomalous behavior, highlighted by the red dots, SNM highlights the anomalous region of interest more comprehensively. Such dense coverage of the region of interest leads to better tracking and better performance in identifying the anomalous frames, as shown in Table 7. As mentioned before, the compositional information of a video enables the method to handle illumination variations, as shown in Figure 14: a person leaving a shop and moving in a direction opposite to the crowd motion (first row of Figure 14) is detected (second row of Figure 14), tracked, and successfully identified as an anomalous motion pattern (last row of Figure 14).

Table 7: Comparison of our method with the CSM method based on the UCD ground truth for anomaly detection (percent accuracy). Columns: segment number, ground-truth frames, SNM detection results (frames), SNM accuracy, CSM accuracy [4]. The per-segment frame counts and SNM accuracies are not reproduced here; the CSM accuracies for segments 1-4 are 93.7%, 88.5%, 82.6% and 84.9%, respectively.

E. Spatio-temporal partitioning at different scales

Similarly, we test the influence of different spatio-temporal scales (2 × 2, 4 × 4 and 8 × 8) on video segment 1 of the UCD dataset. The percent-accuracy results are tabulated in Table 8. The 4 × 4 and 8 × 8 spatio-temporal partitionings produce similar results; however, the 8 × 8 partitioning consumes more computational time than the 4 × 4 partitioning. For the UCD dataset, a higher scale gives better accuracy than a lower scale since the video is crowded; that is, higher resolutions tend to capture the details of a crowded video better than coarser resolutions.

Figure 14: An anomalous object detected successfully under illumination variation. The first row shows a sample frame with the object present; the second row shows the detection of the anomalous object, which is then tracked in the third row.

Table 8: Accuracy results of different spatio-temporal scales (2 × 2, 4 × 4 and 8 × 8) on segment 1 of the UCD dataset (the numeric entries are not reproduced here).

5. Conclusion

The proposed social network model, SNM, captures the scene dynamics and crowd interactions spatially and temporally by modeling crowd scenes as a social network. SNM has been shown to outperform the state-of-the-art methods in detecting and localizing anomalies in crowd scenes. Moreover, SNM allows for adaptive partitioning of crowd scenes to capture the details of the scene dynamics and thus detect fine anomalous events in the scene, as required by an application. Using a set of benchmark crowd analysis video sequences, our experiments show that the detection accuracy of SNM is higher than that of the other methods.

References:

[1] V. Chandola, A. Banerjee, and V. Kumar, "Anomaly detection: A survey," ACM Computing Surveys (CSUR), no. 3, September.
[2] J. Junior, S. Mussef, C. Jung, "Crowd Analysis using Computer Vision Techniques," IEEE Signal Processing Magazine, vol. 27, no. 5.
[3] S. Saxena, F. Brémond, M. Thonnat, and R. Ma, "Crowd behavior recognition for video surveillance," Advanced Concepts for Intelligent Vision Systems, pp. 1-12.
[4] H. Ullah and N. Conci, "Crowd motion segmentation and anomaly detection via multi-label optimization," ICPR Workshop on Pattern Recognition and Crowd Analysis.
[5] A. Basharat, A. Gritai, and M. Shah, "Learning object motion patterns for anomaly detection and improved object detection," 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, Jun. 2008.
[6] H. Ullah and N. Conci, "Structured Learning for Crowd Motion Segmentation," IEEE International Conference on Image Processing (ICIP).
[7] R. Mazzon, S. F. Tahir, and A. Cavallaro, "Person re-identification in crowd," Pattern Recognition Letters, vol. 33, no. 14, Oct.
[8] W. Ge and R. T. Collins, "Marked point processes for crowd counting," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2009.
[9] Z. Wang, H. Liu, Y. Qian, and T. Xu, "Crowd Density Estimation Based on Local Binary Pattern Co-Occurrence Matrix," 2012 IEEE International Conference on Multimedia and Expo Workshops, Jul. 2012.
[10] R. Mehran, A. Oyama, and M. Shah, "Abnormal crowd behavior detection using social force model," IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] R. Raghavendra, A. D. Bue and M. Cristani, "Optimizing interaction force for global anomaly detection in crowded scenes," 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011.
[12] S. Ali and M. Shah, "Floor fields for tracking in high density crowd scenes," Computer Vision - ECCV, pp. 1-14.
[13] S. Ali and M. Shah, "A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis," IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-7.
[14] J. Feng, C. Zhang, and P. Hao, "Online Learning with Self-Organizing Maps for Anomaly Detection in Crowd Scenes," International Conference on Pattern Recognition, Aug.
[15] F. Jiang, Y. Wu and A. K. Katsaggelos, "Detecting contextual anomalies of crowd motion in surveillance video," IEEE International Conference on Image Processing (ICIP).
[16] V. Mahadevan, W. Li and V. Bhalodia, "Anomaly detection in crowded scenes," 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[17] V. Reddy, C. Sanderson and B. C. Lovell, "Improved anomaly detection in crowded scenes via cell-based analysis of foreground speed, size and texture," 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2011.
[18] M. J. V. Leach, E. P. Sparks and N. M. Robertson, "Contextual anomaly detection in crowded surveillance scenes," Pattern Recognition Letters, vol. 44, Jul.
[19] V. Saligrama and Z. Chen, "Video Anomaly Detection Based on Local Statistical Aggregates," 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[20] H. Ullah, M. Ullah, and N. Conci, "Real-time anomaly detection in dense crowded scenes," SPIE Video Surveillance and Transportation Imaging Applications, vol. 9026, Mar.
[21] H. Ullah, L. Tenuti, and N. Conci, "Gaussian mixtures for anomaly detection in crowded scenes," IS&T/SPIE Electronic Imaging, Mar.
[22] X. Cui, Q. Liu, M. Gao, and D. N. Metaxas, "Abnormal detection using interaction energy potentials," 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2011.
[23] Y. Cong, J. Yuan, and J. Liu, "Sparse reconstruction cost for abnormal event detection," 2011 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2011.
[24] O. Ozturk, T. Yamasaki, and K. Aizawa, "Detecting Dominant Motion Flows in Unstructured/Structured Crowd Scenes," International Conference on Pattern Recognition (ICPR), Aug.


More information

CROWD MOTION ANALYSIS: SEGMENTATION, ANOMALY DETECTION, AND BEHAVIOR CLASSIFICATION. Habib Ullah

CROWD MOTION ANALYSIS: SEGMENTATION, ANOMALY DETECTION, AND BEHAVIOR CLASSIFICATION. Habib Ullah CROWD MOTION ANALYSIS: SEGMENTATION, ANOMALY DETECTION, AND BEHAVIOR CLASSIFICATION Habib Ullah Advisor: Nicola Conci, PhD February 2015 Abstract The objective of this doctoral study is to develop efficient

More information

Realtime Anomaly Detection using Trajectory-level Crowd Behavior Learning

Realtime Anomaly Detection using Trajectory-level Crowd Behavior Learning Realtime Anomaly Detection using Trajectory-level Crowd Behavior Learning Aniket Bera University of North Carolina Chapel Hill, NC, USA ab@cs.unc.edu Sujeong Kim SRI International Princeton, NJ, USA sujeong.kim@sri.com

More information

Abnormal Event Detection at 150 FPS in MATLAB

Abnormal Event Detection at 150 FPS in MATLAB Abnormal Event Detection at 15 FPS in MATLAB Cewu Lu Jianping Shi Jiaya Jia The Chinese University of Hong Kong {cwlu, jpshi, leojia}@cse.cuhk.edu.hk Abstract Speedy abnormal event detection meets the

More information

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

A TRAJECTORY CLUSTERING APPROACH TO CROWD FLOW SEGMENTATION IN VIDEOS. Rahul Sharma, Tanaya Guha

A TRAJECTORY CLUSTERING APPROACH TO CROWD FLOW SEGMENTATION IN VIDEOS. Rahul Sharma, Tanaya Guha A TRAJECTORY CLUSTERING APPROACH TO CROWD FLOW SEGMENTATION IN VIDEOS Rahul Sharma, Tanaya Guha Electrical Engineering, Indian Institute of Technology Kanpur, India ABSTRACT This work proposes a trajectory

More information

A Robust Wipe Detection Algorithm

A Robust Wipe Detection Algorithm A Robust Wipe Detection Algorithm C. W. Ngo, T. C. Pong & R. T. Chin Department of Computer Science The Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong Email: fcwngo, tcpong,

More information

Real-time Detection of Illegally Parked Vehicles Using 1-D Transformation

Real-time Detection of Illegally Parked Vehicles Using 1-D Transformation Real-time Detection of Illegally Parked Vehicles Using 1-D Transformation Jong Taek Lee, M. S. Ryoo, Matthew Riley, and J. K. Aggarwal Computer & Vision Research Center Dept. of Electrical & Computer Engineering,

More information

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation For Intelligent GPS Navigation and Traffic Interpretation Tianshi Gao Stanford University tianshig@stanford.edu 1. Introduction Imagine that you are driving on the highway at 70 mph and trying to figure

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

QMUL-ACTIVA: Person Runs detection for the TRECVID Surveillance Event Detection task

QMUL-ACTIVA: Person Runs detection for the TRECVID Surveillance Event Detection task QMUL-ACTIVA: Person Runs detection for the TRECVID Surveillance Event Detection task Fahad Daniyal and Andrea Cavallaro Queen Mary University of London Mile End Road, London E1 4NS (United Kingdom) {fahad.daniyal,andrea.cavallaro}@eecs.qmul.ac.uk

More information

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi hrazvi@stanford.edu 1 Introduction: We present a method for discovering visual hierarchy in a set of images. Automatically grouping

More information

PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE

PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE Hongyu Liang, Jinchen Wu, and Kaiqi Huang National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Science

More information

Beyond Bags of Features

Beyond Bags of Features : for Recognizing Natural Scene Categories Matching and Modeling Seminar Instructed by Prof. Haim J. Wolfson School of Computer Science Tel Aviv University December 9 th, 2015

More information

An Approach for Real Time Moving Object Extraction based on Edge Region Determination

An Approach for Real Time Moving Object Extraction based on Edge Region Determination An Approach for Real Time Moving Object Extraction based on Edge Region Determination Sabrina Hoque Tuli Department of Computer Science and Engineering, Chittagong University of Engineering and Technology,

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

CS229: Action Recognition in Tennis

CS229: Action Recognition in Tennis CS229: Action Recognition in Tennis Aman Sikka Stanford University Stanford, CA 94305 Rajbir Kataria Stanford University Stanford, CA 94305 asikka@stanford.edu rkataria@stanford.edu 1. Motivation As active

More information

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS Kirthiga, M.E-Communication system, PREC, Thanjavur R.Kannan,Assistant professor,prec Abstract: Face Recognition is important

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Texture Segmentation by Windowed Projection

Texture Segmentation by Windowed Projection Texture Segmentation by Windowed Projection 1, 2 Fan-Chen Tseng, 2 Ching-Chi Hsu, 2 Chiou-Shann Fuh 1 Department of Electronic Engineering National I-Lan Institute of Technology e-mail : fctseng@ccmail.ilantech.edu.tw

More information

Selection of Scale-Invariant Parts for Object Class Recognition

Selection of Scale-Invariant Parts for Object Class Recognition Selection of Scale-Invariant Parts for Object Class Recognition Gy. Dorkó and C. Schmid INRIA Rhône-Alpes, GRAVIR-CNRS 655, av. de l Europe, 3833 Montbonnot, France fdorko,schmidg@inrialpes.fr Abstract

More information

High Dense Crowd Pattern and Anomaly Detection Using Statistical Model

High Dense Crowd Pattern and Anomaly Detection Using Statistical Model High Dense Crowd Pattern and Anomaly Detection Using Statistical Model Muhammad Aatif, Amanullah Yasin CASE Pakistan atifmaju@gmail.com amanyasin@gmail.com ABSTRACT: Human crowd behavior analysis is a

More information

Color Local Texture Features Based Face Recognition

Color Local Texture Features Based Face Recognition Color Local Texture Features Based Face Recognition Priyanka V. Bankar Department of Electronics and Communication Engineering SKN Sinhgad College of Engineering, Korti, Pandharpur, Maharashtra, India

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim

IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION. Maral Mesmakhosroshahi, Joohee Kim IMPROVING SPATIO-TEMPORAL FEATURE EXTRACTION TECHNIQUES AND THEIR APPLICATIONS IN ACTION CLASSIFICATION Maral Mesmakhosroshahi, Joohee Kim Department of Electrical and Computer Engineering Illinois Institute

More information

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions

Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions Extracting Spatio-temporal Local Features Considering Consecutiveness of Motions Akitsugu Noguchi and Keiji Yanai Department of Computer Science, The University of Electro-Communications, 1-5-1 Chofugaoka,

More information

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011 Previously Part-based and local feature models for generic object recognition Wed, April 20 UT-Austin Discriminative classifiers Boosting Nearest neighbors Support vector machines Useful for object recognition

More information

Automatic visual recognition for metro surveillance

Automatic visual recognition for metro surveillance Automatic visual recognition for metro surveillance F. Cupillard, M. Thonnat, F. Brémond Orion Research Group, INRIA, Sophia Antipolis, France Abstract We propose in this paper an approach for recognizing

More information

DYNAMIC BACKGROUND SUBTRACTION BASED ON SPATIAL EXTENDED CENTER-SYMMETRIC LOCAL BINARY PATTERN. Gengjian Xue, Jun Sun, Li Song

DYNAMIC BACKGROUND SUBTRACTION BASED ON SPATIAL EXTENDED CENTER-SYMMETRIC LOCAL BINARY PATTERN. Gengjian Xue, Jun Sun, Li Song DYNAMIC BACKGROUND SUBTRACTION BASED ON SPATIAL EXTENDED CENTER-SYMMETRIC LOCAL BINARY PATTERN Gengjian Xue, Jun Sun, Li Song Institute of Image Communication and Information Processing, Shanghai Jiao

More information

2 Proposed Methodology

2 Proposed Methodology 3rd International Conference on Multimedia Technology(ICMT 2013) Object Detection in Image with Complex Background Dong Li, Yali Li, Fei He, Shengjin Wang 1 State Key Laboratory of Intelligent Technology

More information

Research on Recognition and Classification of Moving Objects in Mixed Traffic Based on Video Detection

Research on Recognition and Classification of Moving Objects in Mixed Traffic Based on Video Detection Hu, Qu, Li and Wang 1 Research on Recognition and Classification of Moving Objects in Mixed Traffic Based on Video Detection Hongyu Hu (corresponding author) College of Transportation, Jilin University,

More information

Preliminary Local Feature Selection by Support Vector Machine for Bag of Features

Preliminary Local Feature Selection by Support Vector Machine for Bag of Features Preliminary Local Feature Selection by Support Vector Machine for Bag of Features Tetsu Matsukawa Koji Suzuki Takio Kurita :University of Tsukuba :National Institute of Advanced Industrial Science and

More information

Detecting motion by means of 2D and 3D information

Detecting motion by means of 2D and 3D information Detecting motion by means of 2D and 3D information Federico Tombari Stefano Mattoccia Luigi Di Stefano Fabio Tonelli Department of Electronics Computer Science and Systems (DEIS) Viale Risorgimento 2,

More information

Street Scene: A new dataset and evaluation protocol for video anomaly detection

Street Scene: A new dataset and evaluation protocol for video anomaly detection MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Street Scene: A new dataset and evaluation protocol for video anomaly detection Jones, M.J.; Ramachandra, B. TR2018-188 January 19, 2019 Abstract

More information

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at

International Journal of Research in Advent Technology, Vol.7, No.3, March 2019 E-ISSN: Available online at Performance Evaluation of Ensemble Method Based Outlier Detection Algorithm Priya. M 1, M. Karthikeyan 2 Department of Computer and Information Science, Annamalai University, Annamalai Nagar, Tamil Nadu,

More information

Video Alignment. Literature Survey. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin

Video Alignment. Literature Survey. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Literature Survey Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Omer Shakil Abstract This literature survey compares various methods

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Color Image Segmentation

Color Image Segmentation Color Image Segmentation Yining Deng, B. S. Manjunath and Hyundoo Shin* Department of Electrical and Computer Engineering University of California, Santa Barbara, CA 93106-9560 *Samsung Electronics Inc.

More information

Textural Features for Image Database Retrieval

Textural Features for Image Database Retrieval Textural Features for Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington Seattle, WA 98195-2500 {aksoy,haralick}@@isl.ee.washington.edu

More information

Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601

Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601 Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network Nathan Sun CIS601 Introduction Face ID is complicated by alterations to an individual s appearance Beard,

More information

Abnormal Event Detection in Crowded Scenes using Sparse Representation

Abnormal Event Detection in Crowded Scenes using Sparse Representation Abnormal Event Detection in Crowded Scenes using Sparse Representation Yang Cong,, Junsong Yuan and Ji Liu State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences,

More information

Mean shift based object tracking with accurate centroid estimation and adaptive Kernel bandwidth

Mean shift based object tracking with accurate centroid estimation and adaptive Kernel bandwidth Mean shift based object tracking with accurate centroid estimation and adaptive Kernel bandwidth ShilpaWakode 1, Dr. Krishna Warhade 2, Dr. Vijay Wadhai 3, Dr. Nitin Choudhari 4 1234 Electronics department

More information

Robotics Programming Laboratory

Robotics Programming Laboratory Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car

More information

Graph-Based Superpixel Labeling for Enhancement of Online Video Segmentation

Graph-Based Superpixel Labeling for Enhancement of Online Video Segmentation Graph-Based Superpixel Labeling for Enhancement of Online Video Segmentation Alaa E. Abdel-Hakim Electrical Engineering Department Assiut University Assiut, Egypt alaa.aly@eng.au.edu.eg Mostafa Izz Cairo

More information

Scanner Parameter Estimation Using Bilevel Scans of Star Charts

Scanner Parameter Estimation Using Bilevel Scans of Star Charts ICDAR, Seattle WA September Scanner Parameter Estimation Using Bilevel Scans of Star Charts Elisa H. Barney Smith Electrical and Computer Engineering Department Boise State University, Boise, Idaho 8375

More information

ABNORMAL GROUP BEHAVIOUR DETECTION FOR OUTDOOR ENVIRONMENT

ABNORMAL GROUP BEHAVIOUR DETECTION FOR OUTDOOR ENVIRONMENT ABNORMAL GROUP BEHAVIOUR DETECTION FOR OUTDOOR ENVIRONMENT Pooja N S 1, Suketha 2 1 Department of CSE, SCEM, Karnataka, India 2 Department of CSE, SCEM, Karnataka, India ABSTRACT The main objective of

More information

Crowd Scene Understanding with Coherent Recurrent Neural Networks

Crowd Scene Understanding with Coherent Recurrent Neural Networks Crowd Scene Understanding with Coherent Recurrent Neural Networks Hang Su, Yinpeng Dong, Jun Zhu May 22, 2016 Hang Su, Yinpeng Dong, Jun Zhu IJCAI 2016 May 22, 2016 1 / 26 Outline 1 Introduction 2 LSTM

More information

An ICA based Approach for Complex Color Scene Text Binarization

An ICA based Approach for Complex Color Scene Text Binarization An ICA based Approach for Complex Color Scene Text Binarization Siddharth Kherada IIIT-Hyderabad, India siddharth.kherada@research.iiit.ac.in Anoop M. Namboodiri IIIT-Hyderabad, India anoop@iiit.ac.in

More information

Detecting and Tracking a Moving Object in a Dynamic Background using Color-Based Optical Flow

Detecting and Tracking a Moving Object in a Dynamic Background using Color-Based Optical Flow www.ijarcet.org 1758 International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) Detecting and Tracking a Moving Object in a Dynamic Background using Color-Based Optical Flow

More information

CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR)

CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR) 63 CHAPTER 4 SEMANTIC REGION-BASED IMAGE RETRIEVAL (SRBIR) 4.1 INTRODUCTION The Semantic Region Based Image Retrieval (SRBIR) system automatically segments the dominant foreground region and retrieves

More information

Tracking Pedestrians using Local Spatio-temporal Motion Patterns in Extremely Crowded Scenes

Tracking Pedestrians using Local Spatio-temporal Motion Patterns in Extremely Crowded Scenes 1 Submitted to IEEE Trans. on Pattern Analysis and Machine Intelligence Regular Paper Tracking Pedestrians using Local Spatio-temporal Motion Patterns in Extremely Crowded Scenes Louis Kratz and Ko Nishino

More information

Motion Estimation. There are three main types (or applications) of motion estimation:

Motion Estimation. There are three main types (or applications) of motion estimation: Members: D91922016 朱威達 R93922010 林聖凱 R93922044 謝俊瑋 Motion Estimation There are three main types (or applications) of motion estimation: Parametric motion (image alignment) The main idea of parametric motion

More information

Local Features: Detection, Description & Matching

Local Features: Detection, Description & Matching Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British

More information

Tri-modal Human Body Segmentation

Tri-modal Human Body Segmentation Tri-modal Human Body Segmentation Master of Science Thesis Cristina Palmero Cantariño Advisor: Sergio Escalera Guerrero February 6, 2014 Outline 1 Introduction 2 Tri-modal dataset 3 Proposed baseline 4

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

Automatic Tracking of Moving Objects in Video for Surveillance Applications

Automatic Tracking of Moving Objects in Video for Surveillance Applications Automatic Tracking of Moving Objects in Video for Surveillance Applications Manjunath Narayana Committee: Dr. Donna Haverkamp (Chair) Dr. Arvin Agah Dr. James Miller Department of Electrical Engineering

More information

Efficient Acquisition of Human Existence Priors from Motion Trajectories

Efficient Acquisition of Human Existence Priors from Motion Trajectories Efficient Acquisition of Human Existence Priors from Motion Trajectories Hitoshi Habe Hidehito Nakagawa Masatsugu Kidode Graduate School of Information Science, Nara Institute of Science and Technology

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision What Happened Last Time? Human 3D perception (3D cinema) Computational stereo Intuitive explanation of what is meant by disparity Stereo matching

More information

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped

More information

CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM

CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM 1 PHYO THET KHIN, 2 LAI LAI WIN KYI 1,2 Department of Information Technology, Mandalay Technological University The Republic of the Union of Myanmar

More information

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK 1 Po-Jen Lai ( 賴柏任 ), 2 Chiou-Shann Fuh ( 傅楸善 ) 1 Dept. of Electrical Engineering, National Taiwan University, Taiwan 2 Dept.

More information

PROBLEM FORMULATION AND RESEARCH METHODOLOGY

PROBLEM FORMULATION AND RESEARCH METHODOLOGY PROBLEM FORMULATION AND RESEARCH METHODOLOGY ON THE SOFT COMPUTING BASED APPROACHES FOR OBJECT DETECTION AND TRACKING IN VIDEOS CHAPTER 3 PROBLEM FORMULATION AND RESEARCH METHODOLOGY The foregoing chapter

More information

3D Face and Hand Tracking for American Sign Language Recognition

3D Face and Hand Tracking for American Sign Language Recognition 3D Face and Hand Tracking for American Sign Language Recognition NSF-ITR (2004-2008) D. Metaxas, A. Elgammal, V. Pavlovic (Rutgers Univ.) C. Neidle (Boston Univ.) C. Vogler (Gallaudet) The need for automated

More information

Critique: Efficient Iris Recognition by Characterizing Key Local Variations

Critique: Efficient Iris Recognition by Characterizing Key Local Variations Critique: Efficient Iris Recognition by Characterizing Key Local Variations Authors: L. Ma, T. Tan, Y. Wang, D. Zhang Published: IEEE Transactions on Image Processing, Vol. 13, No. 6 Critique By: Christopher

More information

Short Survey on Static Hand Gesture Recognition

Short Survey on Static Hand Gesture Recognition Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of

More information

Robust and accurate change detection under sudden illumination variations

Robust and accurate change detection under sudden illumination variations Robust and accurate change detection under sudden illumination variations Luigi Di Stefano Federico Tombari Stefano Mattoccia Errico De Lisi Department of Electronics Computer Science and Systems (DEIS)

More information

On-line Real-time Crowd Behavior Detection in Video Sequences

On-line Real-time Crowd Behavior Detection in Video Sequences On-line Real-time Crowd Behavior Detection in Video Sequences Andrea Pennisi a, Domenico D. Bloisi a,, Luca Iocchi a a Department of Computer, Control, and Management Engineering Sapienza University of

More information

Feature Tracking and Optical Flow

Feature Tracking and Optical Flow Feature Tracking and Optical Flow Prof. D. Stricker Doz. G. Bleser Many slides adapted from James Hays, Derek Hoeim, Lana Lazebnik, Silvio Saverse, who 1 in turn adapted slides from Steve Seitz, Rick Szeliski,

More information

Object detection using non-redundant local Binary Patterns

Object detection using non-redundant local Binary Patterns University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2010 Object detection using non-redundant local Binary Patterns Duc Thanh

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information