Multi-person tracking system for complex outdoor environments


Examensarbete 30 hp, Februari 2015

Multi-person tracking system for complex outdoor environments

Cristina-Madalina Tanase

Institutionen för informationsteknologi / Department of Information Technology


Abstract

Multi-person tracking system for complex outdoor environments
Cristina-Madalina Tanase

This thesis presents research in the domain of modern video tracking systems and the details of the implementation of such a system. Video surveillance is an area of high interest, and it relies on robust systems that interconnect several critical modules: data acquisition, data processing, background modeling, foreground detection and multiple object tracking. The present work analyzes different state-of-the-art methods suitable for each module. The emphasis of the thesis is on the background subtraction stage, as the final accuracy and performance of the person tracking depend dramatically on it. The experimental results show the performance of four different foreground detection algorithms, including two variations of self-organizing feature maps for background modeling, a machine learning technique. The undertaken work provides a comprehensive view of the current state of research in foreground detection and multiple object tracking, and offers solutions for common problems that occur when tracking in complex scenes. The chosen data set covers extremely varied and complex outdoor scenes, which allows a detailed study of the appropriate approaches and emphasizes the weaknesses and strengths of each algorithm. The proposed system handles problems such as dynamic backgrounds, illumination changes, camouflage, cast shadows, frequent occlusions and crowded scenes. The tracking obtains a maximum Multiple Object Tracking Accuracy of 92.5% for the standard video sequence MWT and a minimum of 32.3% for an extremely difficult sequence that challenges every method.

Supervisor (Handledare): Hongyu Li
Reviewer (Ämnesgranskare): Cris Luengo
Examiner (Examinator): Ivan Christoff
Printed by (Tryckt av): Reprocentralen ITC


Contents

1 Introduction
  1.1 Overview
  1.2 Background
    1.2.1 Existing systems
    1.2.2 Existing methods
  1.3 Problem formulation and proposed solution
2 Background Modeling
  2.1 Overview
  2.2 Gaussian mixture model (GMM)
  2.3 Block-based classifier (BBC)
  2.4 Adaptive self-organizing background (SOM)
3 Pre and Post Processing Methods
  3.1 Noise reduction
  3.2 Morphological operations
  3.3 Shadow removal
4 Tracking Objects
  4.1 Optical flow tracking methods
  4.2 Lucas-Kanade-Tomasi method
    4.2.1 Feature Tracking
    4.2.2 Feature Detection
    4.2.3 Feature Selection
  4.3 Tracking for surveillance systems
5 Experimental Results
  5.1 Data set
  5.2 Key technologies
  5.3 Method and implementation
    5.3.1 Video Handler Module
    5.3.2 Image Processing Module
    5.3.3 Background Subtraction Module
    5.3.4 Feature Tracking Module
  5.4 Background segmentation
    5.4.1 Accuracy metrics for foreground detection
    5.4.2 Qualitative results
    5.4.3 Method 1: GMM
    5.4.4 Method 2: BCC
    5.4.5 Method 3: SOM
    5.4.6 Accuracy results
  5.5 Pre and post processing
    5.5.1 Noise reduction and morphological operations results
    5.5.2 Shadow detection results
  5.6 Tracking
    5.6.1 Accuracy metrics for tracking
    5.6.2 Qualitative results
    5.6.3 Accuracy results
6 Conclusion
References

1. Introduction

This chapter gives a short overview of current methods used in the process of multiple object tracking, including references to existing systems and popular algorithms. Moreover, the chapter introduces the problem of foreground detection and tracking, and explains the chosen solution for reaching this goal.

1.1 Overview

Video tracking is an active field of research that centers on the process of locating objects in video footage taken with a camera. This procedure is of great interest in fields like surveillance, security, human-computer interaction, video communication and compression, augmented reality, traffic control, medical imaging and video editing. The purpose of video tracking is to identify target objects in consecutive video frames. The association between detected foreground regions can be difficult when the objects are moving fast relative to the frame rate. Video tracking is an expensive process due to the amount of data contained in the video, and the complexity increases when techniques of object recognition are essential to tracking.

Multiple object tracking is a subfield of video tracking, and it deals with more subtle problems, like following more than one object in time and space even when occlusions occur or the relative positions of the objects change. It is important in surveillance applications to register the trajectories of humans, their collisions, their stagnant phases, and the moments when they enter or leave the frame. Outdoor conditions increase the difficulty of the problem by making the foreground and background harder to segment, due to variations in light intensity and in the sizes and orientations of shadows.

The most important subcomponent of a tracking system is foreground detection. The results of this stage greatly influence the ability of the system to track objects; an accurate segmentation of the foreground is a challenge in itself. Depending on the complexity of the scene and the background's dynamics, this procedure can vary from simple to extremely complex. The study of the existing methods and their possible improvements for different sets of data is an interesting purpose and is detailed in Chapter 2.

The background of the present thesis consists of several methods that have been successfully applied in projects with similar goals. Some of the methods are specific to image analysis, such as filtering, morphological transformations, edge detection, segmentation and classification. All these methods

are adapted for video frames and are compiled into more advanced intelligent systems that are able to identify and track objects in different ways: adaptive modeling of the background [40, 7], feature tracking [27], kernel-based tracking [11], contour tracking [19, 25], visual feature matching [37].

1.2 Background

This section presents the background of the video tracking field and shows examples of existing systems and the most popular algorithms and methods.

1.2.1 Existing systems

Both commercial and open source human tracking applications exist. The current trend around the world is to combine different methods of tracking to suit the custom problems of each setting. In general, a tracking system is composed of several components, as shown in Figure 1.1.

Figure 1.1. General description of multiple object tracking systems

This is a brief list of existing systems that perform video tracking by employing various techniques and technologies:
- POM (developed by the Computer Vision Laboratory, CVLAB) uses a generative model of background subtraction to estimate the positions of people in an individual time frame. [13]
- World-Z map (developed by the Nara Institute of Science and Technology, NAIST) performs 3D people detection and tracking with a World-Z Map from a single stereo camera. [45]
- HumanEva: synchronized video and motion capture dataset and baseline algorithm for the evaluation of articulated human motion. [39]

- The Reading People Tracker (developed by Nils T. Siebel in the European Framework V research project ADVISOR) is an automatic visual surveillance system for crime detection and prevention, using state-of-the-art image processing algorithms. [38]

Current applications use multiple cameras [13] under different angles and positions to obtain more information on the tracked persons. With such a setup, occlusion is no longer a major problem, but higher complexity is added, because an object has to be identified as the same across frames taken from different points of view of the scene. The current trend is not only tracking people, but also detecting their body position or their accessories (e.g. a backpack) [24, 17]. Research has also reached the point where systems can detect an individual in different settings and recognize the person if it has been detected before. Systems today can reconstruct the position of a person in 3D space and reproduce it using simplified models [2]. Still, the problems of long-term occlusions, stationary persons and poor lighting remain important and hard to solve in single-camera situations.

1.2.2 Existing methods

The undertaken research in the area of video analysis, and multiple object tracking in particular, has resulted in a comprehensive overview of the available methods and technologies. This section presents a walk-through of the most important methods employed in modern systems or under current research. The goal of the study is to find out how to manipulate the different kinds of information present in the video samples in order to successfully track the objects, in our case all persons present in the frame.

There are different tracking modalities in the literature, the main processed data being visual or sound. We will focus on the visual aspect, which comprises color, contours, motion and faces. These categories further expand into face detection, face recognition, foreground detection, background estimation, optical flow, frame-by-frame difference and feature points.

A very intuitive and simple approach is to track persons using their colors, as some studies suggest. Different methods of filtering and tracking based on color have been tested. Usually a color mixture model based on Gaussian distributions is used [40]. A metric used in color-based tracking is the color coherence, which measures the average inter-frame histogram distance of a tracked object [10, 30]. It is assumed that the object histogram should remain constant between image frames. This metric has low values if the segmented object has similar color attributes, and higher values when the color attributes differ.

Finding and tracking contours is another approach to detecting objects in a video [44]. The contour of a person can be found using a standard edge

detector [14]. The main problem is that persons' contours often become partly occluded when they pass behind other objects.

Motion is another clue that can be interpreted as the presence of a person in the environment, and it is commonly used in tracking systems. The foreground method has been successfully used in several studies [2, 15, ?, 29]. In most cases a frame-by-frame difference provides a foreground, and these objects constitute the tracked persons. The drawback of this approach is that it cannot detect stationary persons in the frames. A way of solving this problem is the background subtraction method, which differentiates between a clean/neutral background and a frame with people in it. The foreground method opens a new research path on ways of modeling the background accurately when there is no "clean" scene to use for subtraction. There are pixel-wise and region-wise methods that estimate the background efficiently and produce models capable of adapting to small dynamics of the scene. Classical methods such as classifiers or mixture models compete nowadays with neural networks that are able to successfully identify background pixels even in the most troublesome situations.

Another successful method of tracking objects is following their optical flow [3]. The optical flow is constructed as the motion of all points in the scene relative to the camera, projected onto a 2D plane. By determining the optical flow of the entire image, moving objects can be separated from the background. One way to use optical flow for tracking is to identify a few particular points and then track these features. Finding relevant features is a nontrivial problem.

Based on the above extracted data, certain tracking algorithms are applied. One of the most common is the Kalman filter [43], considered to be the most important method for state estimation; in our case, state estimation is needed for tracking objects. The Kalman filter is an efficient recursive algorithm that estimates the internal state of a linear dynamic system from a series of noisy measurements. Research shows that this filtering technique for tracking single and multiple moving objects in video sequences is very efficient, especially for real-time systems. The filter uses different features such as color, shape, motion and edges.

Another tracking method is the particle filtering technique with multiple cues [8][Bra05], such as color, texture and edges as observation features; it is a powerful technique for tracking deformable objects in image sequences with complex backgrounds. The number of active objects and track management are handled by means of probabilities of the number of active objects in a given frame. The probabilities are estimated using a Monte Carlo data association algorithm.
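To make the Kalman filter paragraph concrete, the following minimal sketch sets up a constant-velocity filter with OpenCV's cv::KalmanFilter for a single tracked object. The state layout (x, y, vx, vy) and the noise covariances are illustrative assumptions, not values taken from any of the cited systems.

```cpp
#include <opencv2/video/tracking.hpp>

// Minimal constant-velocity Kalman filter for one tracked object.
// State: (x, y, vx, vy); measurement: (x, y). All values illustrative.
cv::KalmanFilter makeTracker(float x0, float y0) {
    cv::KalmanFilter kf(4, 2, 0);
    kf.transitionMatrix = (cv::Mat_<float>(4, 4) <<
        1, 0, 1, 0,   // x += vx
        0, 1, 0, 1,   // y += vy
        0, 0, 1, 0,
        0, 0, 0, 1);
    cv::setIdentity(kf.measurementMatrix);                  // we observe (x, y)
    cv::setIdentity(kf.processNoiseCov, cv::Scalar(1e-4));  // assumed noise levels
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar(1e-1));
    kf.statePost = (cv::Mat_<float>(4, 1) << x0, y0, 0, 0);
    return kf;
}

// Per frame: predict, then correct with the detected blob centroid.
cv::Point2f track(cv::KalmanFilter& kf, cv::Point2f detection) {
    kf.predict();
    cv::Mat m = (cv::Mat_<float>(2, 1) << detection.x, detection.y);
    cv::Mat est = kf.correct(m);
    return { est.at<float>(0), est.at<float>(1) };
}
```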

1.3 Problem formulation and proposed solution

The problem can be formulated simply: create a system that can automatically identify the individuals in a surveillance area and track them from the entrance to the exit of the scene. This formulation represents a need that many surveillance companies have. The research part of this thesis is finding the optimal approach to background modeling, foreground detection and tracking for multiple persons in a given setting, making it possible to uniquely identify all humans in the scene with minimal computational effort and maximum accuracy. By comparing different approaches, we search for simplified models and customized methods that obtain a useful result.

The application consists in implementing the methods found during the research on a set of surveillance footage taken from an outdoor setting, with emphasis on the foreground detection stage. For this purpose three main outdoor sequences will be used. The experiments aim to detect and track all humans entering or exiting the view, under different lighting and motion conditions. The framework consists of a Visual Studio project, where several methods relying on mathematical and image analysis libraries are implemented into a final People Tracker System. Several features from a distinct project were used for experimental purposes. The main modules of the People Tracker are: pre and post processing, including a shadow detection subsystem; a background modeling and foreground detection module; and finally a tracking module that relies on feature detection. The framework is open for extension and provides good support for running relevant experiments.

The original contribution to the field of multiple object tracking consists in the comparative evaluation of a large set of methods at different stages of processing, and in finding which ones impact the tracking accuracy most. Even though the complexity of some methods is high, their impact on the final tracking result might be less significant than that of other, simpler, less time-expensive methods. These results, presented in the thesis, are essential to the fine-tuning of a real-time tracking system. Moreover, an intensive study with remarkably good results has been undertaken on novel machine learning techniques for background modeling and foreground detection. These algorithms have the advantage of producing extremely good results even in corner-case situations, but they raise extreme difficulties in setting up the parameters. The present report presents solutions for solving the parameter problems and easily setting up the neural networks for self-organizing backgrounds. Finally, and possibly most importantly, a fine analysis of extremely complex and diverse scenes has been pursued. Most benchmarks and algorithm results consist of relatively simple scenes with high-quality footage. The current project tests methods for different types of environments that raise various, unexplored problems. The study shows the advantages of each method, but

most importantly the weaknesses and flaws under non-standard conditions of the scenes.

2. Background Modeling

Adaptive background estimation is a commonly used method in computer vision, and it can be found under different variations. This chapter presents a short overview and three different algorithms for modeling the background: Gaussian mixture modeling, the block-based background classifier and the adaptive self-organizing background.

2.1 Overview

An attempt at classifying background modeling and foreground segmentation techniques based on the employed algorithms is presented in Figure 2.1.

Figure 2.1. Classification of background modeling methods

Some methods [20, 32] suppose that the background has a predefined distribution, which implies making risky assumptions about the data. The parameter

choice can be a sensitive issue, but the complexity of parametric algorithms is lower than that of non-parametric ones; also, the resources used for storing and computing are restricted to a few parameters per pixel or block. On the other hand, the non-parametric methods [28, 15] can require a large amount of resources. These methods compensate with great flexibility, though, and can model arbitrary distributions without any a priori information or assumptions about the data.

A recursive approach [29] evaluates one frame at a time and updates the background model recursively. A non-recursive approach [11] maintains a buffer of frames, making it possible for the background to adapt better to temporal changes. While the space complexity is significantly higher in this case, the method avoids the persistence of early errors in the background model. Depending on how many distributions are needed to estimate the background within a reasonable error interval, methods may employ unimodal or multimodal distributions; the difference in performance is that the latter can cope with more complex, dynamic backgrounds, whereas the former cannot. The last distinction is made between pixel-based methods, which consider each pixel's information as an independent input, and block-based methods, which make use of the spatial information of a region, including the proximity of the pixels, in the modeling decision process.

In the vast majority of video surveillance systems, real-time methods for background subtraction and foreground modeling are used, so space and time complexity are considered important factors. The methods employed should be chosen according to the setting and the complexity of the scene. The basic methods include static background subtraction, thresholding and chromatic filtering, but they have limited applicability. More advanced techniques include adaptive mixtures of Gaussians for each pixel [21, 20], block-based classifiers with multiple decision factors [32], the kernel density estimation background model [15], the codebook background model [?] and supervised or unsupervised learning of the foreground in neural networks [28, 29]. These advanced techniques can cope with dynamic backgrounds, changes in illumination, background bootstrapping, foreground objects similar to the background (camouflage), and resource limitations.

The chosen methods will be assessed by their capacity to achieve a set of goals. By analyzing the common situations in the data set, a list of desirable features for the results of the tested algorithms was created. For this stage the goals are:
- Mobility: the method should be able to model a background in the presence of occlusions, such as persons traversing the scene.
- Entering and leaving: the method should be able to detect whether a new object has entered or left the scene.

- Adaptability: the method should adapt to changes in the background, such as when new static objects are introduced and stay still for a reasonable amount of time.
- Dynamics of the scene: the method should be able to incorporate small variations of the scene into the model: branches moving, slow illumination changes, small shifts of the camera.
- Large objects: the method should create an estimation of the background even in the presence of large objects in the scene (Sequence 1 and Sequence 2 contain frames with persons of significantly different sizes). Occlusions from 3% to 20% of the scene should be reasonably handled.
- Noise handling: the method should not be susceptible to noise.

Due to the diversity of situations encountered in the proposed video sequences, all these goals can be assessed. The following sections discuss different approaches to background modeling and indicate their ability to reach these goals. A more detailed evaluation follows in Chapter 5.

2.2 Gaussian mixture model (GMM)

The Gaussian mixture background modeling method [40] is a parametric, recursive, pixel-based method that uses a variable mixture of Gaussian distributions. In statistics, a mixture model is a probabilistic model for representing subpopulations within an overall population. Formally, a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. In our case the population is represented by the entire range of pixel values, and the subpopulation is the set of values that a certain pixel p(x,y) takes in a sequence of n frames or observations. The Gaussian mixture distribution of each pixel is described by

p(\theta) = \sum_{i=1}^{K} \phi_i \, \mathcal{N}(\mu_i, \Sigma_i)    (2.1)

where \theta is the parameter set of the distribution of observations, \mathcal{N} is the normal distribution with weight \phi_i, mean \mu_i and covariance matrix \Sigma_i, and K is the number of mixture components.

The technique compares each pixel value with an existing distribution in order to classify it as foreground or background. The initial method creates a model for each pixel as an average of all incoming frames. This turns out to be insufficient for modeling a dynamic background that deals with sudden illumination changes or moving fragments of the scene. Improved

versions of the classical Gaussian representation of the background have therefore been developed. The benefit of the improved method consists in using a mixture of three to five Gaussian distributions for each pixel color. By considering the persistence and the variance of each Gaussian in the mixture, the Gaussians that may correspond to background colors are determined. Pixel values that do not fit the background distributions are ignored. The generic algorithm is presented in Figure 2.2.

Figure 2.2. Foreground detection with Gaussian mixture model

The parameters \theta of the mixture represent how long a specific color is present in a frame. In this approach, the background colors are assumed to be the ones that stay longer and move little in the scene. To allow the model to adapt to changes in illumination, an update scheme based upon selective updating is applied. Every new pixel value is checked against the existing model components in order of fitness. The first matched model component is updated. If no match is found, a new Gaussian component is added, with the mean equal to the point's value, a large covariance matrix and a small weighting parameter.

Further on, P. KaewTraKulPong and R. Bowden [20] prove that in a crowded environment this approach is insufficient, since the background component might become dominant only very late. They used an Expectation Maximization (EM) algorithm to determine the Gaussian mixture models. The method uses a window of L frames and L-recent window update equations, where the most recent frames have greater priority. In the initialization phase, while waiting for sufficient statistics, another set of equations is used. The algorithm takes four parameters: the length of the history window of the learned background, the number of Gaussian mixtures, the background ratio and the noise strength, and it can be easily configured.
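The KaewTraKulPong-Bowden variant described above is available in OpenCV's opencv_contrib bgsegm module; the sketch below wires up its four parameters. This assumes the contrib module is built in; the input file name is hypothetical, and the parameter values shown are the library defaults, used here only as a starting point.

```cpp
#include <opencv2/bgsegm.hpp>
#include <opencv2/videoio.hpp>

// Foreground detection with the KaewTraKulPong-Bowden GMM (OpenCV bgsegm).
// Parameters: history window, number of mixtures, background ratio, noise sigma.
int main() {
    cv::VideoCapture cap("sequence1.mpg");            // hypothetical input file
    auto mog = cv::bgsegm::createBackgroundSubtractorMOG(
        /*history=*/200, /*nmixtures=*/5,
        /*backgroundRatio=*/0.7, /*noiseSigma=*/0);
    cv::Mat frame, fgMask;
    while (cap.read(frame)) {
        mog->apply(frame, fgMask);                    // 0 = background, 255 = foreground
    }
    return 0;
}
```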

2.3 Block-based classifier (BBC)

The block-based background classifier is a parametric, recursive, block-based method that uses a unimodal distribution and a simple decision chain [VRe13]. The technique works as a binary classifier distinguishing between background blocks and foreground blocks. As opposed to other techniques, which use pixel information, this one uses contextual information about blocks of pixels. The algorithm consists of four stages: division of a frame into overlapping blocks, classification of each block (foreground or background), background model re-initialization and, finally, probabilistic generation of the foreground mask, as shown in Figure 2.3.

Figure 2.3. Stages of the "Foreground Detection via Block-based Classifier Cascade with Probabilistic Decision Integration" algorithm

The first stage divides the image into blocks and creates a low-dimensional descriptor for each block. The important parameters in this stage are the block size and the block overlap; their effect is discussed further in the experimental results section. Stage two consists of a cascade of three classifiers: a probability measurement, a cosine distance check and a temporal correlation check, as shown in Figure 2.4.

Figure 2.4. Cascade classifier of the block-based background classifier

The first classifier deals with dynamic backgrounds: objects in the background that move but are present most of the time (such as small shadows or small orientation changes of the camera). It uses a likelihood

function of the low-dimensional descriptor d_{i,j} of each block to decide whether the block is part of the background, based on a Gaussian model of each selected block:

p(d_{i,j}) = \frac{\exp\left\{-\tfrac{1}{2}\,[d_{i,j}-\mu_{i,j}]^T\,\Sigma_{i,j}^{-1}\,[d_{i,j}-\mu_{i,j}]\right\}}{(2\pi)^{D/2}\,|\Sigma_{i,j}|^{1/2}}    (2.2)

where \mu_{i,j} and \Sigma_{i,j} are the mean and the covariance matrix for location (i,j), and D is the dimensionality of the descriptors.

The classifier decides based on the condition

p(d_{i,j}) \geq p\left(\mu_{i,j} + 2\,\mathrm{diag}(\Sigma_{i,j})^{1/2}\right)    (2.3)

If the condition is satisfied, the block is classified as background.

The second classifier deals with illumination changes and employs a distance measure between the two vectors \mu_{i,j} and d_{i,j} defined previously:

\mathrm{cosdist}(d_{i,j}, \mu_{i,j}) = 1 - \frac{d_{i,j}^T\,\mu_{i,j}}{\|d_{i,j}\|\,\|\mu_{i,j}\|}    (2.4)

The classifier decides based on the condition

\mathrm{cosdist}(d_{i,j}, \mu_{i,j}) \leq C_1    (2.5)

where C_1 is an empirical value that the authors suggest should be 0.1, so as to ensure a higher probability of classifying background pixels as foreground rather than the other way around. If the condition is satisfied, the block is classified as background.

The third classifier handles the temporal correlations between successive frames, thus eliminating false positives, and it consists of two conditions:

d^{prev}_{i,j} \text{ was classified as background}    (2.6)

\mathrm{cosdist}(d^{prev}_{i,j},\, d_{i,j}) \leq 0.5\,C_1    (2.7)

If both conditions are satisfied, the block is classified as background.
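The cosine-distance tests of the second and third classifiers (Equations 2.4, 2.5 and 2.7) translate directly into OpenCV matrix operations; a sketch, assuming the block descriptors are stored as cv::Mat column vectors:

```cpp
#include <opencv2/core.hpp>

// Cosine distance between a block descriptor and a reference vector
// (Equation 2.4).
double cosDist(const cv::Mat& a, const cv::Mat& b) {
    return 1.0 - a.dot(b) / (cv::norm(a) * cv::norm(b) + 1e-12);
}

// Second classifier (Equation 2.5) and temporal check (Equation 2.7).
// C1 = 0.1 is the empirical value suggested by the authors.
bool passesIlluminationCheck(const cv::Mat& d, const cv::Mat& mu, double C1 = 0.1) {
    return cosDist(d, mu) <= C1;
}
bool passesTemporalCheck(const cv::Mat& dPrev, const cv::Mat& d, double C1 = 0.1) {
    return cosDist(dPrev, d) <= 0.5 * C1;
}
```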

Model re-initialization is the third stage of the algorithm; it is triggered if about 70% of the image is consistently classified as foreground for a reasonable period of time (e.g., 15 frames). The last step is a correcting procedure that produces a probabilistic foreground mask. Given that one block can contain both foreground and background pixels, after the classification of the blocks a pixel-level classification is made as well. In this stage a pixel contained in several overlapping blocks is classified as foreground only if the majority of the blocks that contain it have been classified as foreground.

2.4 Adaptive self-organizing background (SOM)

A new and very effective method with a completely different approach uses a neural network to decide which pixels belong to the background model.

Self-organizing feature maps

Artificial neural networks are motivated by cognitive science and model what happens in the cognitive capabilities of natural (human) systems. Much research exists in the subfield of competitive learning. The most well-known algorithm is the Kohonen network [23], also known as the Self-Organizing Map or SOM, which is typically a single layer of inputs completely connected to a single layer of outputs. The general algorithm is presented in Figure 2.5.

Figure 2.5. Algorithm for self-organizing maps

The expected effect is that over time the weight vectors move towards the centers of clusters of input vectors. The final state (convergence) finds one weight vector over the center of each cluster of the input vectors. Kohonen maps or

SOMs define a topographic map and a notion of neighborhood for each output unit. In the basic algorithm, all units in the neighborhood of the winner are modified. Output units are typically organized in a grid; the neighborhood then consists of those output units within a given (Euclidean) distance.

Initial algorithm

Self-organizing maps have been successfully used to classify pixels into background or foreground regions by creating a competitive ANN configured as a 2D grid [28]. Each node computes a weighted linear sum of the inputs, represented by the pixel values, and holds a weight vector corresponding to all weights connected to it. The entire set of weight vectors represents the background model. The foreground is consequently calculated as the difference between the frame and the background model.

Each pixel is represented by a map of 3x3 weight vectors, each element containing its representation in RGB or HSV. At the start, the weight vectors for each pixel are set equal to its initial value, thus constructing an initial background model. After this first stage, subsequent frames are fed to the network. Each pixel is compared with the 9 values from the model in order to determine the weight vector that best describes it. Finding the best matching unit is accomplished by calculating the Euclidean distance (Equation 2.8) between the pixel in HSV representation and each weight vector; the choice of the HSV representation follows the established practice of previous studies.

d(p_i, p_j) = \left\| \big(v_i s_i \cos(h_i),\; v_i s_i \sin(h_i),\; v_i\big) - \big(v_j s_j \cos(h_j),\; v_j s_j \sin(h_j),\; v_j\big) \right\|    (2.8)

d(c_m, p_t) = \min_i\, d(c_i, p_t) \leq \varepsilon    (2.9)

where v_i, s_i, h_i are the components of pixel i in HSV representation, c_i is the i-th component of the model, c_m is the best matching unit of the model, p_t is the current sample, and \varepsilon is a fixed threshold that separates the foreground from the background.

If such a matching unit c_m is found, the best matching unit (BMU) and its neighborhood (defined by the neighborhood function) are reinforced as background; otherwise the pixel is considered foreground and is not included in

the model. This behaviour can be observed in the pseudocode presented in Figure 2.6.

Figure 2.6. Algorithm for self-organizing background

An improved method

This method is a modified version of the adaptive SOM that uses a fuzzy rule to update the neural network background model [29]. The fuzzy spatial coherence-based self-organizing map for background subtraction (FSOM) uses a fuzzy update of the background that improves the model's robustness to illumination changes in the scene. In more detail, this technique is an enhanced SOM algorithm that includes spatial coherence in the background subtraction. The authors define spatial coherence as the intensity difference between locally contiguous pixels; for instance, neighboring pixels showing small intensity differences are coherent and neighboring pixels with high intensity differences are incoherent. The research shows that including the spatial coherence of tracked objects when comparing with background pixels ensures robustness against false positives.

Moreover, changes have been made to the update phase of the background model. The method introduces an automatic and data-dependent mechanism for reinforcing the background model in future steps. The decision function is now a fuzzy rule-based procedure. Fuzzy set theory offers an appropriate way of representing knowledge and uncertainty (the threshold in the background model), which creates a very flexible decision system for the existing neural network. Concretely, the learning factors are calculated on the run and then included in the update rule of the system. This addition to the algorithm is presented in Figure 2.7.

Figure 2.7. The algorithm for self-organizing background with fuzzy update rule
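To make the decision step of the basic SOM concrete, the following simplified sketch keeps a 3x3 grid of weight vectors per pixel and applies Equations 2.8-2.9. The neighborhood reinforcement and the fuzzy update are omitted for brevity, and the learning rate and threshold are assumed parameters.

```cpp
#include <opencv2/core.hpp>
#include <array>
#include <cmath>

// Map an HSV pixel to the 3D point used by the distance in Equation 2.8.
// h in radians, s and v in [0, 1].
static cv::Vec3f hsvToCone(float h, float s, float v) {
    return { v * s * std::cos(h), v * s * std::sin(h), v };
}

// One pixel's background model: a 3x3 grid of weight vectors (9 units).
struct PixelModel {
    std::array<cv::Vec3f, 9> weights;

    // Returns the index of the best matching unit if it lies within epsilon
    // (pixel is background), else -1 (pixel is foreground). alpha is the
    // learning rate used to reinforce the BMU.
    int classifyAndUpdate(const cv::Vec3f& p, float epsilon, float alpha) {
        int best = -1;
        float bestDist = epsilon;
        for (int i = 0; i < 9; ++i) {
            float dist = static_cast<float>(cv::norm(weights[i] - p));
            if (dist <= bestDist) { bestDist = dist; best = i; }
        }
        if (best >= 0)  // background: move the BMU towards the sample
            weights[best] += alpha * (p - weights[best]);
        return best;
    }
};
```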

3. Pre and Post Processing Methods

In order to detect the foreground we previously calculated an adaptive background model. In the ideal case the background would be stationary and the model would reflect any change happening in the scene. After the first stage of foreground detection, a mask is obtained which contains fragments of all moving objects. Not all these objects represent actual silhouettes of persons; there are also shadows and "ghosts". Ghosts are falsely detected objects that appear when the extracted background is not accurate and reactive [12]. We can observe real objects, their shadows and ghosts in the results of foreground detection. The aim is to improve this result by reducing the noise and detecting the shadows and ghosts, thus eliminating the false positives. To achieve this, and to create a more accurate mask for extracting the foreground, we apply morphological operations and shadow removal techniques. The final goal is to determine which information resulting from these methods can be usefully combined to accurately detect the foreground. The goals at this stage are:
- creating a smooth input image for the background modeling stage
- creating low-noise results
- enhancing the foreground masks by creating smoother edges and more compact object masks
- reducing the artifacts in the foreground masks

The pre and post processing stages can be identified in Figure 3.1.

Figure 3.1. Preprocessing and post-processing stages in the video tracking system

3.1 Noise reduction

The initial frames are smoothened in a preprocessing phase by employing two methods.

Clip

Clipping [31] is the simplest segmentation method. Unfortunately, our images are far too complex to be segmented with this technique. However, we can use a type of clipping in the preprocessing phase to reduce the very dark colors in the scene and create a smoother transition between the color levels.

For this purpose a truncate-type clipping is used, which keeps only the upper two thirds of the color spectrum, according to the following formula:

dst(i,j) = \begin{cases} \text{threshold}, & \text{if } src(i,j) < \text{threshold} \\ src(i,j), & \text{otherwise} \end{cases}    (3.1)

This transformation is important because it reduces the difference between background shadows and the background itself.

Blur

The blur operation is a simple filtering procedure performed with the purpose of reducing the noise level. The function smoothens an image using the kernel h(k,l) according to the formula:

\text{new\_image}(i,j) = \sum_{k,l} src(i+k,\, j+l)\, h(k,l)    (3.2)

The normalized box filter outputs pixels equal to the mean of their kernel neighborhood (all neighbors contribute with equal weights):

K = \frac{1}{K_{width} \cdot K_{height}} \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{bmatrix}    (3.3)

The result is a smoothened image with less noise. The normalized box filter does not use the best possible kernel (a Gaussian kernel gives better results), but it is the fastest and produces acceptable results for our purposes.
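A minimal sketch of this preprocessing stage in OpenCV: the clamp of Equation 3.1 can be written as a per-pixel maximum, followed by the normalized box filter of Equations 3.2-3.3. The threshold (roughly one third of the 8-bit range) and the kernel size are assumed values, not the tuned settings used in the experiments.

```cpp
#include <opencv2/imgproc.hpp>

// Preprocessing sketch: clamp very dark values up to `thresh` (Equation 3.1)
// and smooth with a normalized box filter (Equations 3.2-3.3).
void preprocess(const cv::Mat& src, cv::Mat& dst, double thresh = 85) {
    cv::Mat clipped = cv::max(src, thresh);   // src(i,j) < thresh  ->  thresh
    cv::blur(clipped, dst, cv::Size(3, 3));   // mean of each 3x3 neighborhood
}
```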

3.2 Morphological operations

As part of the post-processing of the grayscale foreground, morphological operations such as erosion and dilation are applied. These morphological operations [36], although very simple, become very important when we deal with noise. The technique is used especially when creating a foreground mask. The problem is that the foreground mask obtained for the test images contains only fractions of the objects, due to a variety of factors: the top-down view, shadows, a low frame rate that affects the update of the background model, and the fact that the floor has a dark color very similar to that of the shadows. In this case, we can extend the foreground fractions with a morphological operation, so that the mask contains the entire surface of each object.

Erosion

Erosion probes an image with a predefined structuring element and draws conclusions on how this element fits or misses the shapes in the original image. Let E be a Euclidean space and A a binary image in E. The erosion of the binary image A by the structuring element B is defined by

A \ominus B = \{ z \in E \mid B_z \subseteq A \}    (3.4)

where B_z is the translation of B by the vector z. We erode the entire image with an elliptical structuring element of size 5x5 pixels. This operation removes noise and small objects that are not important for our detection, since we are looking for much bigger objects in the scene.

Dilation

The dilation operation also uses a structuring element, for probing and expanding the shapes contained in the input image. Let E be a Euclidean space or an integer grid, A a binary image in E, and B a structuring element. The dilation of A by B is defined by

A \oplus B = \bigcup_{b \in B} A_b    (3.5)

We use the same structuring element, an ellipse of the same 5x5 size, to dilate objects, so that the identified structures appear clearer and with smoother edges.

Morphological opening

The succession of the two steps, erosion and dilation, one performed with a structuring element and the other with a mirrored version of the same structuring element, is called morphological opening in the literature, and it is defined formally as

A \circ B = (A \ominus B) \oplus B^T    (3.6)

The aim of the opening operation is to remove small objects (usually noise) from the foreground of an image, placing them in the background; closing, by contrast, removes small holes in the foreground, changing small islands of background into foreground.
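In OpenCV the whole opening of Equation 3.6 is available as a single call; a minimal sketch with the 5x5 elliptical structuring element used above:

```cpp
#include <opencv2/imgproc.hpp>

// Morphological opening of the foreground mask (Equation 3.6): erosion
// followed by dilation with the same 5x5 elliptical structuring element.
void cleanMask(const cv::Mat& fgMask, cv::Mat& cleaned) {
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::morphologyEx(fgMask, cleaned, cv::MORPH_OPEN, kernel);
}
```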

3.3 Shadow removal

Shadow removal is an important step for improving object detection and tracking. Different methods exploit different characteristics of the image, including chromacity, physical properties, geometry and the textures of large or small areas [35, 9] (see Figure 3.2).

Figure 3.2. An overview of useful characteristics used for shadow detection

The targeted shadows are of two types:
- Low-dynamics shadows are attached to elements of the background (e.g. shadows of trees). These shadow elements behave differently from both the foreground and the background, having lower motion than the foreground but higher motion than the background, due to the composite dynamics of the object that produces them and of the illumination changes.
- High-dynamics shadows are attached to elements of the foreground (e.g. the cast shadow of a walking person). In this case the shadow can be as large as the foreground object itself and is usually classified as foreground.

Chromacity-based method

Chromacity is a measure of color that is independent of intensity. The shadow detection methods [12] that use this measure assume that shadow regions preserve their chromacity at the transition between frames. This is why these methods usually use color models that differentiate better between color and intensity, like HSV, YUV or normalized RGB. This method analyses pixels in the HSV (Hue, Saturation, Value) color space. Three characteristics of a shadow region have been observed:
- pixels in the shadow have a lower intensity value than the other background pixels (V)
- the hue is not changed in a background region that is covered by shadow (H)
- empirically, the saturation is decreased by shadow (S)

If a pixel value changes in a manner that respects these three conditions, it is classified as a shadow pixel. The evaluation is not made pixel-wise, though, but rather in a 5x5 observation window; this reduces the noise problem.
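A per-pixel sketch of this test, comparing a foreground pixel against the corresponding background-model pixel in OpenCV's 8-bit HSV encoding (H in 0..179). The numeric bounds are not given in the text above; they are illustrative assumptions to be tuned per scene.

```cpp
#include <opencv2/core.hpp>
#include <algorithm>

// Chromacity-based shadow test for one pixel (fg = current frame, bg =
// background model), following the three HSV conditions above. The bounds
// alpha, beta, tauS, tauH are assumed values, not taken from the thesis.
bool isShadowPixel(cv::Vec3b fg, cv::Vec3b bg) {
    const double alpha = 0.4, beta = 0.9;   // allowed darkening range for V
    const int tauS = 60, tauH = 50;         // saturation / hue tolerances
    double vRatio = bg[2] > 0 ? double(fg[2]) / bg[2] : 1.0;
    int dH = fg[0] > bg[0] ? fg[0] - bg[0] : bg[0] - fg[0];
    dH = std::min(dH, 180 - dH);            // hue is circular in OpenCV
    return vRatio >= alpha && vRatio <= beta      // (V) intensity is lower
        && int(fg[1]) - int(bg[1]) <= tauS        // (S) saturation not increased
        && dH <= tauH;                            // (H) hue roughly preserved
}
```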

Geometrical method

The geometrical method [34] employs physical characteristics of the shadow and has two main steps. The first step detects the object, consisting of the contour of the person and its shadow, and determines its orientation. Bearing in mind that the target objects are persons, a high peak and a low peak of each individual blob are found, representing the head of the person and the projection of the head on the ground, which belongs to the shadow. In this particular implementation the orientation of an object is estimated from the properties of the object moments. The second step separates the shadow region from the actual silhouette region by finding the gravity center of each shadow-person pair, which is considered the point where the shadow begins. This is just an initial approximation of the shadow pixels, based on their geometrical features. Every suspected shadow pixel (lower than the gravity point) is included in a Gaussian model that is updated with every new shadow candidate. Each pixel initially classified as shadow is then checked against the Gaussian model and finally classified as foreground or background.

Large region texture

The large region texture shadow detection method [34] is based on the assumption that the regions under the shadow preserve the same texture as the surface they are projected on. The algorithm consists of two steps. First, candidate regions for shadow classification are selected. Second, the texture of each selected region is compared, in turn, with that of the foreground and that of the background, eliminating the regions that fall in the second category. For the first step, the implemented method uses chromaticity and intensity to detect candidate pixels for the shadow regions; all connected shadow-pixel candidates are then considered to form a shadow region. In the second step the candidate texture is correlated with the object texture and also with the background; shadow regions are expected to have a high correlation with the background. For each shadow candidate region the gradient magnitude and the gradient direction are calculated at each point. If the correlation value is greater than an empirical threshold, the region is considered shadow and removed from the foreground. Although more computationally expensive, it is preferable to use large regions, because they are more likely to contain significant textures than small areas, which can be easily affected by noise.
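A rough sketch of the second step, correlating gradient magnitudes over the candidate region (the gradient direction component is omitted for brevity); the correlation threshold is an assumed empirical value.

```cpp
#include <opencv2/imgproc.hpp>

// Gradient magnitude of a grayscale image (Sobel derivatives).
static cv::Mat gradientMagnitude(const cv::Mat& gray) {
    cv::Mat gx, gy, mag;
    cv::Sobel(gray, gx, CV_32F, 1, 0);
    cv::Sobel(gray, gy, CV_32F, 0, 1);
    cv::magnitude(gx, gy, mag);
    return mag;
}

// Texture test for a candidate shadow region: high correlation between the
// frame's and the background model's gradients means the underlying texture
// is preserved, i.e. the region is likely shadow. corrThresh is assumed.
bool isShadowRegion(const cv::Mat& frameGray, const cv::Mat& bgGray,
                    const cv::Mat& regionMask, double corrThresh = 0.8) {
    cv::Mat a = gradientMagnitude(frameGray);
    cv::Mat b = gradientMagnitude(bgGray);
    cv::Scalar ma, sa, mb, sb;
    cv::meanStdDev(a, ma, sa, regionMask);
    cv::meanStdDev(b, mb, sb, regionMask);
    cv::Mat da = a - ma[0], db = b - mb[0];          // zero-mean within region
    double corr = cv::mean(da.mul(db), regionMask)[0] / (sa[0] * sb[0] + 1e-9);
    return corr > corrThresh;                        // shadow if texture matches
}
```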

4. Tracking Objects

This chapter presents the chosen method for tracking objects in video streams, starting from a general description of optical flow tracking methods and ending with suitable variations of the algorithms for surveillance applications.

4.1 Optical flow tracking methods

Optical flow is defined as the motion of scene points relative to the camera, projected onto a 2D image plane. By determining the optical flow of the entire image, moving objects can be identified and tracked. Sequences of consecutive frames allow the estimation of motion as instantaneous image velocities or discrete image displacements. There are four major directions of research in the field of optical flow [3], as presented in Figure 4.1.

Figure 4.1. Modalities of detecting optical flow

Even though the methods are diverse, they share a three-step strategy, shown in Figure 4.2: preprocessing the image, usually with a low-pass filter; extraction of features such as spatio-temporal derivatives; and finally the correlation of these features into a two-dimensional flow field.

Figure 4.2. General steps for identifying optical flow

Probably the most explored approach is using differential methods for estimating the optical flow. Based on partial derivatives of the image signal or of the

sought flow field, and on higher-order partial derivatives, these methods create a trajectory of objects in motion, which is essential for surveillance systems [4]. One requirement for this type of algorithm is a differentiable velocity function. The first-order derivatives represent the image translation, while second-order derivatives give indications about the 2D velocity of the object.

Region-based matching techniques define velocity as a shift function that best describes the correlation between two regions at different points in time. This approach requires finding an adequate similarity measure and a function that maximizes it for the two regions. It can be employed even when the differential methods fail, due to high rates of noise or because a differentiable function cannot be defined over a space that lacks smoothness.

Energy- or frequency-based methods are a different category of optical flow detectors; they use a Fourier transform capable of handling translating 2D patterns. It can be shown that all translating objects (2D patterns) produce a plane in frequency space that intersects the origin, so they can be identified with the use of a Fourier transform.

The phase-based techniques define velocity through the phase behavior of the signal produced after applying a band-pass filter to the frames. Band-pass filters can decompose the input signal with respect to features relevant for optical flow detection, such as scale, orientation and speed. This approach is very similar to the first one, since it uses first- and second-order derivatives applied to the phase, as opposed to the intensity, of the signal. According to the above-mentioned study [3], the most efficient methods over a wide variety of input

conditions are the first-order local differential technique proposed by Lucas and Kanade [5] and the local phase-based method proposed by Fleet and Jepson [16].

The concept of feature point tracking and its application to tracking moving objects or persons is essential for surveillance systems, and it represents the core of the tracking system. The goal is to identify relevant local features in the video sequence, also referred to in the literature as corners or points of interest. The features that are ultimately selected are those which are easily recognizable and which do not change with movement and rotation.

Figure 4.3. The general steps for the feature detection stage

The overall process can be generalized as in Figure 4.3: a recursive procedure that updates the set of detected interest points and connects them into a meaningful feature set for tracking the target objects.

4.2 Lucas-Kanade-Tomasi method

The KLT tracker was introduced in 1991 by Tomasi and Kanade, and it is based on a feature-matching differential algorithm developed in 1981 by Lucas and Kanade. The method is still one of the most popular in the field, due to its robustness and its capability of both detecting important features and producing an optical flow. Shi and Tomasi significantly improved the method by including an algorithm for detecting good features to track, such as the corners of objects. We describe the algorithms employed in the KLT tracker in the current section.

The method assumes that the optical flow in a three-by-three window of pixels is constant, meaning that the center pixel and all its neighbors have the same motion. Spatial intensity information along with time information is included in a system of 9 equations, called the basic optical flow equations. These equations are solved according to the Lucas-Kanade method to satisfy the least squares criterion. The least squares solution gives the same importance to all 8 neighboring pixels in the window; in practice it is better to give more weight to the pixels that are closer to the central pixel.

An important part is finding good features to track [37], which makes the optical flow detection robust to noise and spatial deformations. A widely used approach is finding the most prominent corners in the image or in a specified image region, as described in the above-mentioned paper. "Corner", "interest point" and "feature" describe the same concept, namely a well-defined position that can be robustly detected, where two dominant and different edge directions exist in a local neighborhood of the point.

4.2.1 Feature Tracking

The initial algorithm proposed by Lucas and Kanade does not include a feature detection stage: its only goal is to track a template frame by frame, a procedure known as registration. The understanding of the algorithm relies entirely on the formulas that model the complex changes of the image intensities. The tracking algorithm and the formulas presented in this section are derived from [41, 4, 5].

An image is defined in the present context as a function with spatial and temporal variables. By defining a displacement function d = (\xi, \eta) as the motion of the point at X = (x, y) between time instants t and t+\tau, we can formally state a formula that correlates two consecutive frames:

I(x, y, t+\tau) = I(x - \xi,\, y - \eta,\, t)    (4.1)

To be more concise, the local image model used from here on does not contain the time variable, but introduces a noise component n(x):

J(x) = I(x - d) + n(x)    (4.2)

The aim is to find an adequate displacement vector d, so that the model is robust across frame transitions. Thus, the algorithm becomes a minimization problem for the error \varepsilon:

\varepsilon = \int_W [J(x + d) - I(x)]^2\, \omega(x)\, dx    (4.3)

The integral is taken over a window W, J is the displaced location from the original image I, and \omega is a weighting function. The desired minimization can be done by the Newton-Raphson method [39], by differentiating with respect to d and then searching for a zero. When J(x+d) is approximated by its first-order Taylor expansion J(x) + g^T(x) d, this differentiation can be determined:

\varepsilon \approx \int_W [J(x) + g^T(x)\, d - I(x)]^2\, \omega(x)\, dx    (4.4)

where g represents the gradient of J(x):

g(x) = \left( \frac{\partial J(x)}{\partial x},\; \frac{\partial J(x)}{\partial y} \right)^T    (4.5)

The limitation of this approach is that the first-order Taylor expansion used to approximate J(x+d) does not always provide a better approximation of d at each iteration than the previous one, especially when the displacements are large relative to the size of the window. The solution would be choosing a larger window, at the cost of losing precision. As mentioned, the algorithm's aim is to minimize the error, which is equivalent to setting the derivative of \varepsilon to zero. This yields:

\int_W [J(x) - I(x) + g^T(x)\, d]\; g(x)\, \omega(x)\, dx = 0    (4.6)

If we consider the following notations:

G = \int_W g(x)\, g^T(x)\, \omega(x)\, dx    (4.7)

e = \int_W [I(x) - J(x)]\; g(x)\, \omega(x)\, dx    (4.8)

Equation 4.6 becomes the tracking equation:

G d = e    (4.9)

The tracking equation must be solved at each iteration; it is a system of two equations with two unknowns, the two entries of d, where

G = \begin{pmatrix} \int_W g_x^2\, dx & \int_W g_x g_y\, dx \\ \int_W g_x g_y\, dx & \int_W g_y^2\, dx \end{pmatrix}    (4.10)

In order to implement the algorithm, we need a discretization of the equation. The solution in the discrete space for the two unknowns, by the least squares fit criterion, is:

\begin{pmatrix} d_1 \\ d_2 \end{pmatrix} = \begin{pmatrix} \sum_W g_x^2 & \sum_W g_x g_y \\ \sum_W g_x g_y & \sum_W g_y^2 \end{pmatrix}^{-1} \begin{pmatrix} -\sum_W g_x g_t \\ -\sum_W g_y g_t \end{pmatrix}    (4.11)

The last formula also incorporates the gradient along time, which was omitted in the initial expression for simplicity. In practical implementations the parameters are set iteratively, either by varying the window size or by varying the resolution of the images. The latter solution fixes the window size and builds resolution pyramids of the frames [6].

4.2.2 Feature Detection

The above tracking algorithm does not include feature detection; instead it is applied and iterated on all regions of the image, which makes it a weak tool for practical purposes. An optimization emerges from detecting the image areas that contain motion information. A successful strategy has proven to be the selection of image regions with a rich texture that provides motion components in two directions. To achieve this goal, several methods seek trackable features such as corners, regions with high spatial frequency content, and regions where second-order derivatives are present. The fitness of a feature can be hard to predict, which is why a robust mathematical model has been created that extends the feature tracking equation presented in Section 4.2.1. Observations show that for a feature to be easy to track, the matrix G must have large eigenvalues. Therefore, features are chosen at the locations where the eigenvalues are largest, and at least above a predefined threshold \lambda:

\min(\lambda_1, \lambda_2) > \lambda    (4.12)

The threshold \lambda is chosen arbitrarily from the interval [\lambda_{min}, \lambda_{max}]. The lower bound \lambda_{min} is calculated from the eigenvalues of image regions of uniform brightness. The upper bound is calculated from regions with various types of features, such as corners and highly textured regions.

4.2.3 Feature Selection

Not all detected features are relevant to the tracking system, so a thorough selection takes place at this stage. A first strategy for deciding whether a feature qualifies as adequate is to calculate its dissimilarity measure. This measure can be calculated either as the intensity difference between the corresponding windows that contain the feature in consecutive frames, or by using an affine transformation model [7]. Instead of comparing the first and the current frame by using expression 4.3, the following equation is considered:

\varepsilon = \int_W [J(Ax + d) - I(x)]^2\, \omega(x)\, dx    (4.13)

This expression is much more general, because it compares the window from the original frame with the window of the transformed feature in the current frame, by employing a transformation matrix A. This improvement makes it easier to exclude unuseful features from the tracking system, such as features that have left the view of the camera or objects that have changed due to rotation or other deformations.

Other ways of selecting features from the list of tracked interest points that do not imply heavy computations have been proposed. A first method is imposing a stopping criterion on the iterative algorithm; this is motivated

by the fact that if the algorithm does not converge within a certain number of iterations, the feature has most likely been occluded. Another method is assessing the value of the determinant of G: small values indicate that the system of equations cannot be reliably solved, which means the feature point is lost. Finally, if the window exceeds a set bound while reaching towards the edges of the frame, the feature point has left the scene, so the feature is dropped.

Figure 4.4. Tracking results for Sequence 1 (set distance 10)

The Shi-Tomasi algorithm calculates the corner quality measure at every pixel using the minimal eigenvalue of gradient matrices representing intensity regions of the image. Then a non-maximum suppression (only the local maxima in a 3x3 neighborhood are retained) is performed. The corners whose eigenvalue passes a threshold are kept, while the weak ones are discarded. In the last phase, the corners are sorted according to their quality measure, and a final decision keeps only the strongest corners within a defined area.

4.3 Tracking for surveillance systems

The goal is to apply the tracking algorithm to surveillance footage, so a discussion of the practical implementation and setup of the system is in order. Videos for tracking persons have several characteristics:
- videos contain more than one trackable object
- persons in videos don't have monotonous trajectories
- persons might occlude each other while passing through the scene
- persons frequently enter and exit the scene
- objects rotate or change their shape in the 2D plane

What is expected of this system, ideally, is to identify each person entering, exiting or already present in the scene with sufficient points of interest, and to mark it. The mark should then follow the person throughout their activity in the scene. No feature points should be detected on the background or on areas that do not represent persons.

In essence, there are two critical stages in the tracking sub-system:
1. initializing feature points correctly for all persons
2. removing incorrect or redundant feature points from the scene

The generic logical steps undertaken for these purposes can be seen in Figure 4.5.

Figure 4.5. Pseudocode for the feature point tracking algorithm

We observe that at each iteration we need the previously tracked targets and a foreground mask of the frame. Foreground extraction and processing were discussed in Chapter 2 and Chapter 3. The KLT tracking algorithm and Shi-Tomasi good features to track have been implemented in various libraries, including OpenCV. In our experiments we employed the goodFeaturesToTrack function, which implements the algorithm previously described in order to obtain strong corners. This step is preliminary to the actual tracking session and is used for initializing the start points; a sketch follows this list. The important parameters of this method are:
- maxCorners: depending on the setting, the algorithm can return a varying number of strong feature points. If we have an intuition about the number of tracked objects in the scene, we can set this parameter accordingly; only the strongest points within this limit will be returned.
- qualityLevel: determines the minimal accepted quality of image corners. The parameter value is multiplied by the best corner quality measure, which is the minimal eigenvalue; corners with a quality measure less than the product are rejected.
- minDistance: also depends on the size and density of the objects to be tracked. If the objects are fairly small or cluttered in the scene, the distance should be small; for large, wide-apart objects it should be large.
- blockSize: size of the average block for computing the derivative covariation matrix over each pixel neighborhood.
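A sketch of the initialization step with the parameters just described; the mask restricts corner detection to the foreground, and the parameter values are illustrative rather than the tuned settings used in the experiments.

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Initialize feature points on the current grayscale frame, restricted by
// the foreground mask so that no corners land on the background.
std::vector<cv::Point2f> initFeatures(const cv::Mat& gray, const cv::Mat& fgMask) {
    std::vector<cv::Point2f> corners;
    cv::goodFeaturesToTrack(gray, corners,
                            /*maxCorners=*/100,
                            /*qualityLevel=*/0.01,
                            /*minDistance=*/10,
                            /*mask=*/fgMask,
                            /*blockSize=*/3);
    // Refine the detected corners to sub-pixel accuracy (see cornerSubPix below).
    if (!corners.empty()) {
        cv::TermCriteria crit(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 20, 0.03);
        cv::cornerSubPix(gray, corners, cv::Size(5, 5), cv::Size(-1, -1), crit);
    }
    return corners;
}
```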

The locations of the feature points and of the future targets are refined using cornerSubPix(), a function which iterates to find the sub-pixel accurate location of corners or radial saddle points. The tracking itself is performed in OpenCV by the calcOpticalFlowPyrLK() method, which is an implementation of the pyramidal Lucas-Kanade algorithm. The most important parameters are:

- prevPts - vector of 2D points for which the flow needs to be found; point coordinates must be single-precision floating-point numbers.
- nextPts - output vector of 2D points containing the calculated new positions of the input features in the following frame.
- winSize - size of the search window at each pyramid level.
- maxLevel - maximal pyramid level number.
- criteria - the termination criteria of the iterative search algorithm (stop after the specified maximum number of iterations or when the search window moves by less than a minimum value).

We will analyze the impact of the most important parameters in Chapter 5: Experimental results.
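Under the same assumptions, one tracking iteration built from these two calls could be sketched as below; trackStep is a hypothetical helper name and the numeric values are illustrative defaults rather than the thesis settings:

#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/video/tracking.hpp>
#include <vector>

void trackStep(const cv::Mat& prevGray, const cv::Mat& gray,
               std::vector<cv::Point2f>& prevPts,
               std::vector<cv::Point2f>& nextPts,
               std::vector<uchar>& status)
{
    if (prevPts.empty()) return;

    // Refine the current targets to sub-pixel accuracy before tracking.
    cv::cornerSubPix(prevGray, prevPts, cv::Size(5, 5), cv::Size(-1, -1),
                     cv::TermCriteria(cv::TermCriteria::COUNT | cv::TermCriteria::EPS,
                                      20, 0.03));

    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, gray, prevPts, nextPts, status, err,
                             cv::Size(21, 21),   // winSize per pyramid level
                             3,                  // maxLevel of the pyramid
                             cv::TermCriteria(cv::TermCriteria::COUNT | cv::TermCriteria::EPS,
                                              30, 0.01)); // stop after 30 iterations
                                                          // or < 0.01 px movement
}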

5. Experimental Results

This chapter presents the technologies used in developing the system and, more importantly, discusses the data set and the results obtained for each set of algorithms. For each experiment, relevant quality measures are defined. The chapter contains individual results for the background segmentation, pre and post processing methods and tracking methods, along with combined results from employing compatible methods for the best tracking.

5.1 Data set

In order to observe the behavior and performance of each implemented method, we chose to use three different data sets. These video sequences emphasize the weaknesses and strengths of both the classical methods and the neural network approaches. The experimental results are relevant for qualitative and quantitative observations about the employed algorithms and also provide a good base of knowledge about setting up the SOMs. The choice of data sets enables a detailed study of the proposed methods and gives a valuable indication of the sensitivity of each parameter and the ranges in which these should be initialized in different circumstances: type of scene, type of tracked blobs, noise tolerance, etc.

For experimental purposes we used the CAVIAR Test Case Scenarios, which consist of video clips created on July 11, 2003 and January 20. The CAVIAR project offers a collection of video clips of different scenarios and scenes: people walking alone, meeting with others, window shopping, entering and exiting shops, fighting and passing out.

Sequence 1

The first set of video clips was filmed for the CAVIAR project with a wide angle camera lens at half-resolution PAL standard (384 x 288 pixels, 25 frames per second) and compressed using MPEG2. The file used for our experiments is 12 MB in size. For the first comparison we used MeetWalkTogether2.mpg (MWT2) from the data set, an outdoor setting with bright natural illumination, in which two persons meet and walk together.

We will analyze the detection accuracy of the proposed methods under several sets of parameters.

Sequence 2

The second set of data consists of footage taken by a surveillance camera at the entrance of a pool in Jiading, Shanghai. The video clips are 720x576 pixels, 15 frames per second, stored in AVI format. The file used for our experiments is 3,5 MB in size. For the second comparison we used a sequence of an outdoor setting, with both bright and dim illumination, in which several persons enter and exit the scene. The sequence is quite complex, since some of the persons present hold large objects in their hands, run, stop or change trajectories. In addition, the illumination changes on several occasions and the footage is quite noisy. This sequence is good input data for showing the weaknesses and strengths of each method.

Sequence 3

The third sequence was obtained from the TUIO Scene [18] project and represents a public space in Portugal. The video clip is 320x240 pixels, 29 frames per second, stored in AVI format. The file used for the experiments was 1,4 MB in size. Due to the intense daylight, the sequence contains large regions of shadow, which represent a good case study for the shadow detection techniques in particular. The sequence contains a varying number of persons walking and skating, as well as cars in motion.

5.2 Key technologies

The key technologies used are the C++ programming language and OpenCV (Open Source Computer Vision Library) [26]. OpenCV provides high-level functions for capturing video, working with images, and performing computer vision related calculations. The following modules are available:

- core - a compact module defining basic data structures, including the dense multi-dimensional array Mat and basic functions used by all other modules.

- imgproc - an image processing module that includes linear and non-linear image filtering, geometrical image transformations (resize, affine and perspective warping, generic table-based remapping), color space conversion, histograms, and so on.
- video - a video analysis module that includes motion estimation, background subtraction, and object tracking algorithms.
- calib3d - basic multiple-view geometry algorithms, single and stereo camera calibration, object pose estimation, stereo correspondence algorithms, and elements of 3D reconstruction.
- features2d - salient feature detectors, descriptors, and descriptor matchers.
- objdetect - detection of objects and instances of predefined classes (for example, faces, eyes, mugs, people, cars, and so on).
- highgui - an easy-to-use interface to video capturing, image and video codecs, as well as simple UI capabilities.

Another library used in one of the implementations of the background subtraction methods is Armadillo [1]. Armadillo is a C++ linear algebra library aiming towards a good balance between speed and ease of use. The syntax is similar to Matlab. The following features are available:

- Integer, floating point and complex numbers are supported
- Various matrix decompositions are provided through optional integration with LAPACK
- A delayed evaluation approach is employed (at compile time) to combine several operations and to reduce (or eliminate) temporaries; this is accomplished automatically through template meta-programming
- The library is open-source software, distributed under a license that is useful in both open-source and proprietary contexts

5.3 Method and implementation

The software application was developed for the purpose of testing the methods detailed in Chapter 2. It was implemented in C++ under the Visual Studio 2010 environment with the aid of several image processing and mathematical libraries such as OpenCV and Armadillo. The general architecture of the tracking system consists of several interconnected modules, as in Figure 5.1:

Figure 5.1. Main modules of the tracking system

5.3.1 Video Handler Module

The Video Handler Module is responsible for reading the video footage, which can be compressed with various technologies, and stores it in a cv::VideoCapture object that is used for extracting the frames and passing them to the other modules in cv::Mat format. The display of intermediate and final results of each method is also performed by this component.

5.3.2 Image Processing Module

The Image Processing Module uses OpenCV methods for performing blur filtering, low pass filtering and morphological operations. The shadow detectors used for the comparative experiments are adapted from [33] and [43], which also use the OpenCV library for basic operations. Each shadow detector is implemented in a different class, so the removeShadow method instantiates one object for each method.
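As an illustration of this module's role, a minimal read-and-dispatch loop of the kind described might look as follows (the file name is taken from the Sequence 1 description; the loop body is a placeholder, not the thesis code):

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>

int main()
{
    // File name from the Sequence 1 description; adjust to the local path.
    cv::VideoCapture capture("MeetWalkTogether2.mpg");
    if (!capture.isOpened())
        return -1;                          // file missing or codec problem

    cv::Mat frame;
    while (capture.read(frame)) {           // grab and decode the next frame
        // ... hand "frame" to the Image Processing and Background
        //     Subtraction modules here ...
        cv::imshow("input", frame);         // display of intermediate results
        if (cv::waitKey(30) >= 0) break;    // stop on any key press
    }
    return 0;
}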

Figure 5.2. Class diagram for the Processing Module

5.3.3 Background Subtraction Module

The most extensive work has been done in the development of the Background Subtraction Module, since it provides three options for obtaining the background model. In practice, we use a BackgroundSubtractor object that is extended by three different classes, each overriding the perform method. By default, the type of the BackgroundSubtractor is set to 0 and the "perform" method has a basic implementation that takes the initial frame of the video sequence as the background model. As in all cases, this background model is subtracted from the original frame, resulting in a foreground mask. If the type of the BackgroundSubtractor is 1, the object performs as defined in the GMM class. The implementation uses the OpenCV mixture of Gaussians method cv::BackgroundSubtractorMOG, which is described in Section 2.2. If the type is 2, the background subtractor uses an adaptation of the code [42] provided by the authors of the article [32]. This module uses Armadillo-defined matrices and employs the mathematical operations provided by this specialized library. The third type is the self-organizing background model, which was implemented with the aid of the open source code of the Scene project [18]. All the implementations are adapted to the C++ implementation of the current application.
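A compact sketch of this polymorphic design, under the assumption that perform takes a frame and outputs a foreground mask (the exact signatures in the thesis code may differ), is given below for the default and GMM types:

#include <opencv2/core/core.hpp>
#include <opencv2/video/background_segm.hpp>

// The class and method names follow the text ("BackgroundSubtractor",
// "perform"); signatures are illustrative.
class BackgroundSubtractorBase {
public:
    virtual ~BackgroundSubtractorBase() {}
    // Default behavior (type 0): the initial frame is kept as the background
    // model and subtracted from every subsequent frame.
    virtual void perform(const cv::Mat& frame, cv::Mat& foregroundMask)
    {
        if (background.empty()) frame.copyTo(background);
        cv::absdiff(frame, background, foregroundMask);
        // (a threshold would normally follow to binarize the mask)
    }
protected:
    cv::Mat background;
};

// Type 1: wraps the OpenCV mixture-of-Gaussians model from Section 2.2
// (cv::BackgroundSubtractorMOG, OpenCV 2.x API).
class GMMSubtractor : public BackgroundSubtractorBase {
public:
    virtual void perform(const cv::Mat& frame, cv::Mat& foregroundMask)
    {
        mog(frame, foregroundMask);  // updates the model and outputs the mask
    }
private:
    cv::BackgroundSubtractorMOG mog;
};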

Figure 5.3. Class diagram for the Background Subtraction Module

5.3.4 Feature Tracking Module

The Feature Tracking Module contains a single class, OpticalFlowDetector, that performs KLT according to the OpenCV implementation of the pyramidal algorithm, but it is extendable to other implementations as well.

5.4 Background segmentation

This section presents the qualitative and quantitative results of the evaluated foreground detection methods, based on a set of accuracy metrics.

5.4.1 Accuracy metrics for foreground detection

For measuring accuracy we adopted classic metrics: precision, recall and F-measure. Recall is the detection rate, which gives the percentage of detected true positives as compared to the total number of true positives in the ground truth. For the experiments done on Sequence 1, we calculate the Recall frame by frame: we count the identified foreground blobs and compare them with the observations made in the original frames.

Recall = TruePositives / (TruePositives + FalseNegatives)    (5.1)

Sometimes a method detects not only foreground objects, but also background regions, which are counted as false positives. In order to measure the frequency of this situation, we use another accuracy metric: precision.

Precision = TruePositives / (TruePositives + FalsePositives)    (5.2)

Also known as positive prediction, this metric gives the percentage of detected true positives as compared to the total number of items detected by the method. A composed metric is the F-measure, which is defined as:

F = 2 * (Recall * Precision) / (Recall + Precision)    (5.3)

5.4.2 Qualitative results

Experimental results for moving person detection using four different approaches have been produced for several image sequences. We analyze and describe two different sequences that represent typical situations critical for video surveillance systems, and present qualitative results obtained with all methods and different parameters.

5.4.3 Method 1: GMM

The most important parameters of this particular method are the number of Gaussian mixtures used, which determines the sensitivity of the detector, the background threshold, and the noise variance. Sensitivity determines the responsiveness to changes in the background: low values enhance the detection of objects in the scene, but also make the model more sensitive to noise. The background threshold determines which of the distributions correspond to background pixels; high values are adequate for simple backgrounds, while lower values are recommended for complex backgrounds with moving objects. Noise variance sets the minimum value of the variance for the Gaussian models; higher values are recommended for noisy videos. While the parameters are important for understanding the method's workflow, their variation has little impact in critical situations such as sudden illumination changes or camouflage.

As we can observe in Figure 5.4, varying the background threshold produces little or no difference in the result. The background model in any case includes artifacts caused by the small dynamics of the scene objects. It is important to observe that the foreground mask is strictly dependent on the chromatics of the scene, so only fragments of a person's image (those that differ significantly from the background) are classified as foreground.

Figure 5.4. GMM performed on Sequence 1: (a) initial scene (b) GMM performed with a high threshold of 0.8 (c) GMM performed with a low threshold

Figure 5.5. GMM performed on three consecutive frames of Sequence 2: (a,b) blob similar to the background (c,d) background artifacts (e,f) partial foreground detection

Sequence 2 was chosen for these experiments because of its high rate of noise. The mixture of Gaussians deals especially well with this issue, approximating the background accurately. As we can observe in Figure 5.5, we obtain poor results in two situations of major importance: slow illumination changes and camouflage. In the three consecutive frames the light changes slightly, so progressively more false foreground pixels are detected on the floor, the shadows and the umbrella, even though the changes are not sudden. This problem could not be fixed by fine-tuning the parameters, because the method is not robust to complex light changes. Another, even greater problem that many foreground detection techniques still encounter is the camouflage situation: objects in the foreground share the same chromatic appearance as the background. In this situation, a foreground pixel will find a Gaussian distribution that samples it without exceeding the threshold, which falsely classifies it as a background pixel.

5.4.4 Method 2: BCC

The entire set of parameters and their impact on the results is discussed in detail in [32]. We mention the most important ones: the number of classifiers, the block advancement and the block size. For the best results, all three classifiers must be used, as the authors suggest. We ran experiments in order to determine, for the given data sets, how the block size and advancement impact the quality of the results.

The advancement of the blocks determines how much two adjacent blocks overlap. This parameter influences the first and last steps of the algorithm: the division of the image into (foreground and background) blocks, and the elimination of false positives for pixels that are included in an overlap area and are classified as both background and foreground. For instance, a pixel contained in several overlapping blocks will be classified as foreground only if the majority of the blocks which contain it have been classified as foreground. We can observe how sensitive the results are to this parameter, and how the behavior of the method depends on the input data, in Figure 5.6 and Figure 5.7. For a scene not affected by noise, the overlap acts rather like a morphological opening of the blobs, making them contiguous shapes, which might be useful. On noisy inputs, on the other hand, the overlap might introduce a significant amount of false positives into the foreground, as in Figure 5.7. The accuracy of the algorithm also suffers, mostly because boundaries become thicker and harder to classify. Note that the sitting man in the scene is correctly classified as background, as he does not change his position in this particular sequence of frames. The experiments show that the block size should be proportional to the blobs of the foreground objects; for all input data, a size of 8x8 pixels proved to be best.
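The overlap-voting step just described can be sketched as follows; this is an illustrative reconstruction, not the code of [32], and it assumes the per-block classification is already available:

#include <opencv2/core/core.hpp>

// Every block votes on all pixels it covers; a pixel becomes foreground
// only if the majority of the blocks containing it were classified as
// foreground.
cv::Mat majorityVote(const cv::Mat& blockIsFg,   // CV_8U, one entry per block
                     cv::Size imageSize, int blockSize, int overlap)
{
    const int step = blockSize - overlap;        // distance between block origins
    cv::Mat votesFg  = cv::Mat::zeros(imageSize, CV_32S);
    cv::Mat votesAll = cv::Mat::zeros(imageSize, CV_32S);

    for (int by = 0; by < blockIsFg.rows; ++by)
        for (int bx = 0; bx < blockIsFg.cols; ++bx) {
            cv::Rect r(bx * step, by * step, blockSize, blockSize);
            r &= cv::Rect(cv::Point(0, 0), imageSize);  // clip to the image
            votesAll(r) += 1;
            if (blockIsFg.at<uchar>(by, bx))
                votesFg(r) += 1;
        }

    cv::Mat fgTwice = 2 * votesFg;               // strict majority test
    return fgTwice > votesAll;                   // CV_8U mask, 255 = foreground
}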

Figure 5.6. BCC for Sequence 1, frame 217: the variation of the overlap. (a) original frame (b) block size = 8, overlap = 2 (c) block size = 8, overlap = 4 (d) block size = 8, overlap = 6

Figure 5.7. BCC for Sequence 3, frame 80: the variation of the overlap. (a) original frame (b) block size = 8, overlap = 2 (c) block size = 8, overlap = 4 (d) block size = 8, overlap = 6

Figure 5.8. BCC for Sequence 3, frame 80: the variation of the block size. (a) original frame

5.4.5 Method 3: SOM

Setting up a self-organizing map can be a difficult process. In the present case, the authors of the initial paper propose a basic configuration based on empirical results. The number of weight vectors can vary without any major impact on quality, so all the experiments were executed with a 3x3 neural network, meaning 9 weight vectors. The distance thresholds are parameters that indicate how easily the algorithm should decide whether a pixel is included in the background model. The method uses two such values: one for the calibration phase and one for the online phase. The former should be higher, since we assume that the first frames are a good approximation of the background, while the latter should be more restrictive. The number of sequence frames in the calibration phase is an important parameter for modeling the background. If the initial frames correspond to a clean scene, without moving foreground objects, a small number should be enough, such as 5. On the contrary, if the clean background appears later in the video, a higher value should be set. In the case of Sequence 1, we observed that no such clean sub-sequence is available, so the calibration phase may include some false background regions that will persist even in the online phase. If the initial scene is crowded with people, a higher value for the distance threshold should be employed in the calibration phase. The learning factor should ideally be set according to the dynamics of the scene. If such information cannot be set a priori, a standard value of 1 can be used. A simplified per-pixel sketch of this update is given after Figure 5.9.

Figure 5.9. SOM on Sequence 1, frame 950: (a) original image (b) background model (c) foreground mask
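To make the role of the distance threshold and learning factor concrete, here is a heavily simplified per-pixel sketch of such a self-organizing update, assuming a 3x3 map of weight vectors per pixel, a plain Euclidean distance, and omitting the neighborhood update of the winning node; the names and details are illustrative, not the exact method of the thesis:

#include <opencv2/core/core.hpp>

struct PixelModel {
    cv::Vec3f w[9];          // 3x3 map = 9 weight vectors in color space
};

// Returns true if the pixel is classified as foreground. "threshold" is the
// distance threshold (calibration or online value) and "alpha" the learning
// factor.
bool classifyAndUpdate(PixelModel& m, const cv::Vec3f& pixel,
                       float threshold, float alpha)
{
    int best = 0;
    double bestDist = cv::norm(m.w[0] - pixel);
    for (int i = 1; i < 9; ++i) {
        double d = cv::norm(m.w[i] - pixel);
        if (d < bestDist) { bestDist = d; best = i; }
    }
    if (bestDist <= threshold) {
        // Background match: pull the winning weight vector toward the sample
        // (the full method also updates the winner's map neighbors).
        m.w[best] += alpha * (pixel - m.w[best]);
        return false;
    }
    return true;             // no weight vector close enough: foreground
}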

The difference between a small and a high sensitivity of the distance threshold parameter (in the online phase) can be observed in Figure 5.10. The SOM method is robust against noise, and also succeeds in dealing with slow illumination changes in sequences where other methods fail.

Figure 5.10. SOM on Sequence 1, frame 950: (a) original image (b) background model (c) foreground mask

Figure 5.11. Results for SOM: (a) initial image (b) background model (c) foreground mask

5.4.6 Accuracy results

The experiments executed on Sequence 1 show that all methods have a high recall rate (Table 5.1). The difference appears in the speed with which each method updates its model. The false negatives appear in this case only in the transition frames, when someone enters or exits the scene. Sequence 1 has few such situations, so errors related to this matter are also few. The recall is very high for the SOM algorithms, since the experiments were run, according to the explanations above, under suitable parameter configurations. The false positives impact the precision metric, which we can observe has lower values for all studied methods. We considered any detected blob larger than 4x4 pixels to be a false positive if that area was not contained in a foreground object. That is why any artifacts produced by sudden illumination or background dynamics count as false positives to the detriment of the precision value. The F-measure, as expected, is affected by both false positives and false negatives, yielding smaller values. In the case of SOM and FSOM no visible differences occurred, so we can conclude that both have remarkable results in this case.

Table 5.1. Accuracy results for all studied methods on Sequence 1

Method/Metric   Recall   Precision   F-measure
GMM             0,90     0,83        0,86
BCC             0,92     0,80        0,85
SOM             0,95     0,86        0,90
FSOM            0,95     0,86        0,90

Sequence 2 is more challenging for all methods, since all use an adaptive background model. As opposed to Sequence 1, in this video there are many entrances and exits, and the foreground objects occupy a larger fraction of the scene, which makes transitional background models prone to errors. The illumination varies from bright to very dim, which makes the presence of shadows a real problem. We used the composed metric, the F-measure, to compare the results of all methods on both sequences in Table 5.2. The results are considerably worse on the second sequence, but we observe that the SOM algorithms are more robust against noise and sudden changes. The F-measure value is significantly better for the SOM approaches than for the other two methods, producing steadily more accurate foreground masks in spite of the noisy frames and the high density of foreground objects.

Table 5.2. F-measure results for all studied methods on Sequence 1 and Sequence 2

Method/Data   Sequence 1   Sequence 2
GMM           0,86         0,62
BCC           0,85         0,70
SOM           0,90         0,80
FSOM          0,90         0,80

A detailed set of results for all accuracy metrics for the SOM algorithm is presented in Table 5.3.

Table 5.3. Accuracy results for the SOM method on Sequence 1 and Sequence 2

Sequence/Metric   Recall   Precision   F-measure
Sequence 1        0,95     0,86        0,90
Sequence 2        0,83     0,78        0,80

5.5 Pre and post processing

This section presents the experimental results for the pre and post processing methods, respectively.

5.5.1 Noise reduction and morphological operations results

Noise reduction is the simplest subsystem of the tracking ensemble; nonetheless, it is an important stage that impacts the precision of the tracking. In this section we present the improvements that noise elimination and post-processing of the foreground mask bring, based on the best configuration we have found. In fact there are two separate stages: pre-processing the initial frame, and post-processing the result, namely the foreground mask obtained by the methods explained in Section 2.1. The experiment was applied to the Sequence 1 and Sequence 2 data sets and their respective foreground masks obtained with the most representative foreground detector: GMM. The choice is fairly obvious, since this detector produces foreground masks with defects, so the benefits of this stage become visible. The initial stage consists of pre-processing the frame by smoothing it with a blur and a low-pass filter. The more interesting stage to analyze is the post-processing of the foreground mask.

Figure 5.12. KLT on Sequence 1 with distance parameter 10: (a) GMM foreground mask without processing; (b) post-processed GMM foreground mask with blur filter and morphological opening; (c) the tracking result when using foreground mask (a); (d) the tracking result when using foreground mask (b)

Figure 5.12 presents the impact of the foreground mask post-processing on the tracking system. The KLT algorithm receives as input at each iteration the existing targets along with a foreground mask, which is used for discriminating between good feature points and occluded ones.

The aim of the tracking stage is to find sufficient interest points to describe the persons present in the scene and no feature points on the background. In picture (c) of Figure 5.12 we see an erroneously detected corner belonging to the background. Due to the noise reduction and the more accurate background model obtained in Figure 5.12 (b), this error is avoided. The other important aspect is that the foreground objects are now denser, with clearer contours, which yields a better identification of the interest points. We observe that for the left person we improved from 3 feature points to 4, while for the right person the improvement is even more significant: from only 2 grouped corners on the feet to 5 descriptive corners along the entire body.

Figure 5.13. Sequence 2, distance 10: (a) GMM foreground mask without processing; (b) post-processed GMM foreground mask with blur filter and morphological opening; (c) the tracking result when using foreground mask (a); (d) the tracking result when using foreground mask (b)

The process of tracking in the case of Sequence 2 raises numerous problems, first because of the noisy scene and the low frame rate. Nevertheless, more accurate tracking results are produced when pre-processing is used: significantly fewer background targets are registered and the main feature points of the foreground are kept, so no object is lost from tracking due to the erosion. Experimental results are shown in Figure 5.13.
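The two stages discussed in this section can be summarized by the following sketch; the kernel sizes and function names are illustrative, not the exact configuration used in the experiments:

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

void preprocessFrame(const cv::Mat& frame, cv::Mat& smoothed)
{
    // Pre-processing: smooth the input frame to suppress acquisition noise.
    cv::GaussianBlur(frame, smoothed, cv::Size(5, 5), 0);
}

void postprocessMask(cv::Mat& foregroundMask)
{
    // Post-processing: a morphological opening (erosion followed by dilation)
    // removes small speckles from the mask while keeping the person-sized
    // blobs contiguous.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(3, 3));
    cv::morphologyEx(foregroundMask, foregroundMask, cv::MORPH_OPEN, kernel);
}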

5.5.2 Shadow detection results

The shadow detection experiment was undertaken on all three video sequences, in order to observe which method provides the better result for use in the tracking stage. All methods use as foreground the mask resulting from the GMM method. As a detailed study [35] shows, shadow detectors are firstly highly dependent on the scene and secondly dependent on object size and shape. We analyze in turn the results obtained with all three studied methods on the three input data sets.

For the first sequence, the geometrical shadow detector fails, since it relies mostly on geometrical features that do not fit these particular objects. The detector splits any found blob into two regions based on the gravity point of the figure, but the angle of the camera and of the light source does not match the expectations of this method in terms of shape and object orientation. The chromaticity method has a positive effect in that it smooths the foreground mask and eliminates small shadows around the moving objects, but it fails to detect the small shadows that arise from background illumination variations. The method that uses the large region texture feature for detecting shadows performs best in this situation, because it is more independent of the scene and of the color or shape of the objects. The texture of the background is correlated with the texture of the shadow candidates, and in most cases this results in a high value that correctly indicates a shadow region. The results of all methods applied to Sequence 1 can be observed in Figure 5.14.

The results on Sequence 2 are shown in Figure 5.15. Both the chromaticity method and the geometrical one have little impact on the foreground mask, detecting just small regions of shadow and classifying valid foreground as shadow. On the other hand, the texture method has good results, being capable of eliminating a large amount of background shadows and giving a cleaner mask for the foreground. Moreover, the cast shadow of the person entering the scene is significantly diminished. The large regions of texture are in this case relevant for the classification, and the algorithm is successful. The results are nevertheless far from perfect, since the scene itself raises many problems such as high noise, camouflage, a dynamic background and sudden changes of illumination.

The results on Sequence 3 (Figure 5.16) are somewhat more relevant, since the frames in this set contain many shadows from different objects, with different sizes and orientations. The geometrical method attempts to eliminate shadows, but in most cases it erases the lower half of the objects, because it considers each blob to be half person, half shadow. The orientation of the shadows in this case, again, does not match the assumption of the algorithm.
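For reference, a chromaticity-based shadow test of the general kind evaluated here (HSV-based, in the spirit of the detectors adapted from [33] and [43]) can be sketched as follows; the thresholds are illustrative assumptions and hue wrap-around is ignored for brevity:

#include <cstdlib>
#include <opencv2/core/core.hpp>

// A foreground pixel is relabeled as shadow when it is darker than the
// background by a bounded factor while keeping roughly the same chromaticity.
bool isShadowPixel(const cv::Vec3b& frameHsv, const cv::Vec3b& bgHsv)
{
    float vRatio = bgHsv[2] > 0 ? frameHsv[2] / (float)bgHsv[2] : 1.0f;
    int dH = std::abs(frameHsv[0] - bgHsv[0]);
    int dS = std::abs(frameHsv[1] - bgHsv[1]);
    return vRatio > 0.5f && vRatio < 0.95f   // darker, but not black
        && dH < 10                           // hue nearly unchanged
        && dS < 40;                          // saturation nearly unchanged
}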

Figure 5.14. Performance of the shadow detectors on Sequence 1, applied to the foreground mask obtained with GMM: (a) GMM foreground mask (b) geometrical shadow detector (c) chromaticity-based shadow detector (d) large region texture shadow detector

The large region texture-based shadow detector successfully detects most of the shadows; however, it produces many shadow false positives, which leads to the elimination of some valid foreground parts. The objects in this video are quite small, so the texture of a region can easily be wrongly classified.

Figure 5.15. Performance of the shadow detectors on Sequence 2, applied to the foreground mask obtained with GMM: (a) GMM foreground mask (b) geometrical shadow detector (c) chromaticity-based shadow detector (d) large region texture shadow detector

Figure 5.16. Performance of the shadow detectors on Sequence 3, frame 250, applied to the foreground mask obtained with GMM: (a) GMM foreground mask (b) geometrical shadow detector (c) chromaticity-based shadow detector (d) large region texture shadow detector

In conclusion, the best shadow detector of the three methods that we studied is the one using the texture of large regions as its main feature. The most visible results were obtained on Sequence 2, where the objects are fairly large compared with the entire scene, and both background and foreground shadows are visible. The algorithm eliminated a large fraction of both types of shadows.

5.6 Tracking

5.6.1 Accuracy metrics for tracking

For measuring accuracy we adopted the measures proposed in the CLEAR Evaluation Workshop [22], which combine four different measurements:

- Misses (m): an object is considered missed if it was not tracked within 50 cm accuracy.
- False positives (fp): an object is a false positive if it has feature points associated with it even though it does not correspond to a real tracked object.
- Mismatches (mme): when a track belonging to an object switches to another object, this is counted as one mismatch.
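For reference, in the CLEAR framework these per-frame counts are typically combined into the Multiple Object Tracking Accuracy (MOTA), the figure quoted for this system in the abstract. A standard formulation of this combination, stated here as general background rather than as the thesis's exact definition, is:

MOTA = 1 - ( Σ_t (m_t + fp_t + mme_t) ) / ( Σ_t g_t )

where g_t is the number of ground-truth objects present in frame t.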


More information

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects

Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Intelligent Control Systems Visual Tracking (1) Tracking of Feature Points and Planar Rigid Objects Shingo Kagami Graduate School of Information Sciences, Tohoku University swk(at)ic.is.tohoku.ac.jp http://www.ic.is.tohoku.ac.jp/ja/swk/

More information

Segmentation of Images

Segmentation of Images Segmentation of Images SEGMENTATION If an image has been preprocessed appropriately to remove noise and artifacts, segmentation is often the key step in interpreting the image. Image segmentation is a

More information

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation , pp.162-167 http://dx.doi.org/10.14257/astl.2016.138.33 A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation Liqiang Hu, Chaofeng He Shijiazhuang Tiedao University,

More information

IMAGE SEGMENTATION. Václav Hlaváč

IMAGE SEGMENTATION. Václav Hlaváč IMAGE SEGMENTATION Václav Hlaváč Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception http://cmp.felk.cvut.cz/ hlavac, hlavac@fel.cvut.cz

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

Image processing and features

Image processing and features Image processing and features Gabriele Bleser gabriele.bleser@dfki.de Thanks to Harald Wuest, Folker Wientapper and Marc Pollefeys Introduction Previous lectures: geometry Pose estimation Epipolar geometry

More information

CS443: Digital Imaging and Multimedia Binary Image Analysis. Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University

CS443: Digital Imaging and Multimedia Binary Image Analysis. Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University CS443: Digital Imaging and Multimedia Binary Image Analysis Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University Outlines A Simple Machine Vision System Image segmentation by thresholding

More information

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision Object Recognition Using Pictorial Structures Daniel Huttenlocher Computer Science Department Joint work with Pedro Felzenszwalb, MIT AI Lab In This Talk Object recognition in computer vision Brief definition

More information

Video Surveillance System for Object Detection and Tracking Methods R.Aarthi, K.Kiruthikadevi

Video Surveillance System for Object Detection and Tracking Methods R.Aarthi, K.Kiruthikadevi IJISET - International Journal of Innovative Science, Engineering & Technology, Vol. 2 Issue 11, November 2015. Video Surveillance System for Object Detection and Tracking Methods R.Aarthi, K.Kiruthikadevi

More information