DEALING WITH GRADUAL LIGHTING CHANGES IN VIDEO SURVEILLANCE FOR INDOOR ENVIRONMENTS


Chris Poppe, Gaëtan Martens, Sarah De Bruyne, Peter Lambert, and Rik Van de Walle
Ghent University - IBBT, Department of Electronics and Information Systems, Multimedia Lab
Gaston Crommenlaan 8, bus 201, B-9050 Ledeberg-Ghent, Belgium
+32-93314959, {chris.poppe, gaetan.martens, sarah.debruyne, peter.lambert, rik.vandewalle}@ugent.be
http://multimedialab.elis.ugent.be

Abstract

Background subtraction is a commonly used method to segment moving objects in image sequences. By comparing new frames to a background model, regions of interest can be found. To cope with highly dynamic and complex environments, the Mixture of Gaussian Models technique has been proposed and has gained tremendous popularity. This paper analyzes its performance in typical indoor environments, such as subways, and shows the inability of this technique to cope with gradual illumination changes. Furthermore, the computational expensiveness of maintaining the Gaussian models is shown. Consequently, we propose different updating and matching mechanisms to improve the general robustness and speed. Finally, experimental results are presented to show the gain of our proposed system compared to the standard technique.

Keywords: Moving Object Detection, Background Subtraction, Video Surveillance.

Introduction

The detection and extraction of moving objects in image sequences is a typical first step in a wide range of computer vision applications, such as visual surveillance, traffic monitoring, and semantic annotation. Consequently, it is desirable to achieve very high accuracy with the lowest possible false alarm rates. The recent rapid increase in the number of surveillance cameras has led to a strong demand for automatic methods to process their outputs. Nowadays, researchers are focusing on activity and behavior analysis, to be able to make automated intelligent decisions. Detection of loitering, luggage abandoned by its owner, trespassing, and theft of items are typical examples of the high-level actions the computer vision community wants to tackle (1, 2). All these actions require an initial detection of moving objects before any further analysis can be done.

The detection of moving objects in dynamic scenes has been the subject of research for several years and different approaches exist (3). A common approach for extracting moving objects is background subtraction. During the surveillance of a scene, a reference background model is built and dynamically updated. For each pixel in new images, deviations of the pixel values from the background model are detected and used to classify the observations as

belonging to background or foreground. Many different models have been proposed for background subtraction, of which the Mixture of Gaussian Models (MGM) is one of the most popular (4). Stauffer and Grimson propose to model the value of each pixel as a mixture of Gaussians and use an approximation technique to update the model. The mixture of Gaussians method has recently gained tremendous popularity thanks to its dynamic and multimodal behaviour. Unfortunately, there are a number of important problems when using background subtraction algorithms, such as quick illumination changes, initialization with moving objects, ghosts, and shadows, as reported in (5). Another drawback MGM suffers from is the high computational cost of maintaining the different Gaussian models.

In this paper we present a solution to the problem of quick illumination changes by altering the matching step in MGM. We show how it is applicable in indoor environments such as subways or airports and present results on different challenging sequences. Additionally, we propose a new mixture of models which is similar to MGM but achieves higher speeds with the same accuracy. To obtain even higher speeds we propose the use of an analysis mask to decide which pixels should be analyzed fully; the other pixels are interpolated from the results of the surrounding pixels. The next section elaborates on a number of related techniques which improve the traditional MGM and try to deal with the above-mentioned problems. Subsequently, the original mixture technique and its observed shortcomings are discussed, our proposed changes to the original scheme are presented, and experimental results are provided. Finally, we end with the conclusions.

Related Work

When using background subtraction techniques, a number of known problems occur. Toyama et al. discussed several of these problems in detail (6). Javed et al. adopted this list and selected a number of important problems which have not been addressed by most background subtraction algorithms (5). Several researchers have based their systems on MGM or compared against it. Wu et al. gave a concise overview of background subtraction algorithms and chose MGM to compare with their own technique (7). They present a system which is better at localization and contour preservation, but more sensitive to complex environmental movements (such as waving trees). Since the background models have to be learned from observations of the scene, constructing a reliable model is hard when many moving objects are present. Lee et al. proposed an online expectation-maximization learning algorithm for training adaptive Gaussian mixtures (8). Their system allows the mixture models to be initialized much faster than with the original approach. Related to this topic, Zhang et al. presented an online background reconstruction method to cope with the initialization problem (9). They make use of a change history map to control the foreground mergence time and make it independent of the learning rate. Although they deal with the initialization problem, they cannot deal with quick illumination changes. Javed et al. proposed a solution to the problem of quick illumination changes, but their technique is based on a complex gradient-based algorithm using pixel-, region-, and frame-level processing (5). Unfortunately, the paper does not provide any information about the

additional processing times needed for this technique. Likewise, Tian et al. presented a texture similarity measure based on gradient vectors, obtained by the Sobel operator (10). By analyzing the texture they try to cope with illumination changes. However, a fixed window is used for the retrieval of the gradient vectors, which largely determines the performance (both in processing time and accuracy) of their system. We present a conceptually simpler approach by extending the MGM algorithm. The experimental results section shows our success in coping with quick illumination changes.

Background Subtraction using Mixture of Gaussian Models

MGM was first proposed by Stauffer and Grimson in (4). It is a time-adaptive per-pixel subtraction technique in which every pixel is represented by a vector, called I_p, consisting of three color components (red, green, and blue). For every pixel a mixture of Gaussian distributions, which are the actual models, is maintained and each of these models is assigned a weight. Formula (1) depicts a Gaussian distribution G, with parameters µ_p and Σ_p, the mean and covariance matrix of the distribution respectively:

    G(I_p, µ_p, Σ_p) = (2π)^(-3/2) |Σ_p|^(-1/2) exp(-(1/2) (I_p - µ_p)^T Σ_p^(-1) (I_p - µ_p))    (1)

For computational simplicity, the covariance matrix is assumed to be diagonal. For every new pixel a matching, an update, and a decision step are executed. The new pixel is checked for a match against the models of the mixture: a pixel is matched if its value occurs inside a confidence interval within 2.5 standard deviations from the mean of the model. In that case, the parameters of the corresponding distribution are updated according to formulas (2), (3), and (4):

    µ_t = (1 - ρ) µ_(t-1) + ρ I_t    (2)
    σ²_t = (1 - ρ) σ²_(t-1) + ρ (I_t - µ_t)^T (I_t - µ_t)    (3)
    ρ = α G(I_t, µ_(t-1), Σ_(t-1))    (4)

The learning rate, denoted by α, is a global parameter and introduces a trade-off between fast adaptation and detection of slow-moving objects. Each model has a weight, w_t, which is updated for every new image according to (5):

    w_t = (1 - α) w_(t-1) + α M_t    (5)
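To make the per-pixel procedure concrete, the matching and update steps described above could be sketched as follows. This is a minimal grayscale sketch (with a diagonal covariance, each model reduces to a mean, a variance, and a weight); the class, parameter names, and default values are illustrative, not code from the original system.

```python
import numpy as np

class PixelMixture:
    """Grayscale sketch of the per-pixel MGM matching and update steps."""

    def __init__(self, k=3, alpha=0.01, init_var=225.0):
        self.k, self.alpha, self.init_var = k, alpha, init_var
        self.means = np.zeros(k)
        self.vars = np.full(k, init_var)
        self.weights = np.full(k, 1.0 / k)

    def observe(self, value):
        """Process one new pixel value; returns True if a model matched."""
        dist = np.abs(value - self.means)
        # Matching step: within 2.5 standard deviations of a model's mean.
        matches = dist < 2.5 * np.sqrt(self.vars)
        matched = bool(matches.any())
        if matched:
            m = int(np.argmax(matches))  # first matching model
            # rho = alpha * G(I_t), as in formula (4).
            g = np.exp(-dist[m] ** 2 / (2 * self.vars[m])) / np.sqrt(2 * np.pi * self.vars[m])
            rho = self.alpha * g
            self.means[m] = (1 - rho) * self.means[m] + rho * value              # (2)
            self.vars[m] = (1 - rho) * self.vars[m] + rho * (value - self.means[m]) ** 2  # (3)
        else:
            # No match: replace the lowest-weight model by a distribution with
            # a small weight, a mean equal to the value, and a large variance.
            m = int(np.argmin(self.weights))
            self.means[m], self.vars[m], self.weights[m] = value, self.init_var, 0.05
        # Weight update (5): M_t is 1 for the matched model, 0 otherwise.
        m_t = np.zeros(self.k)
        if matched:
            m_t[m] = 1.0
        self.weights = (1 - self.alpha) * self.weights + self.alpha * m_t
        self.weights /= self.weights.sum()
        return matched
```

A recurring pixel value thus raises the weight of its model over time, which is what lets the decision step treat it as background.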
If the corresponding model introduced a match, M_t is 1; otherwise it is 0. Formulas (2) to (5) represent the update step. Finally, in the decision step, the models are sorted according to their

weights. A threshold T is used to define which of the sorted models depict background or foreground. More specifically, the models for which the sum of their weights exceeds this threshold are regarded as background. MGM assumes that background pixels occur more frequently than actual foreground pixels: if a pixel value occurs recurrently, the weight of the corresponding model increases and it is assumed to be background. If no match is found with the current pixel value, the model with the lowest weight is discarded and replaced by a normal distribution with a small weight, a mean equal to the current pixel value, and a large covariance. In this case the pixel is assumed to be a foreground pixel.

Problem Description

[Figure 1. An example input image from the PetsD2TeC2 sequence (a); the output of MGM for frame nr 2090 (b), 2210 (c), and 2343 (d)]

The negative effect that lighting changes have on background subtraction techniques has already been reported in (5, 6). In our previous work, we described the problems that MGM suffers from during changing lighting conditions in outdoor environments (11). More specifically, the problem of consistent gradual lighting changes was discussed. Since a human observer would in this case not even notice the change, we must address this problem to make sure that surveillance systems act in agreement with what human operators expect. Figure 1 shows the detection results of MGM for the PetsD2TeC2 sequence (with a resolution of 384x288 and a frame rate of 30 fps) provided by IBM (12). In these images, the black pixels represent background; the white pixels are assumed to be foreground. The first image shows the actual scene, a typical outdoor environment surveilled with a static camera. The three output images show the output of MGM at several points in time, illustrating the effect of changing illumination circumstances. These changes were caused by a repetitive increase of certain pixel values in a relatively short time period. As can be seen, the effect of the changes ranges from small regions of misclassified pixels to regions

encompassing almost half of the image. Therefore, global frame processing (e.g., detection of lighting changes when a certain number of pixels changes) is not able to deal with the entire problem. The illumination change results in relatively small differences between the pixel values of consecutive frames, so the change is hardly visible to the human eye. The consistent nature of these differences, however, causes the new pixel values to eventually exceed the acceptance range of the mixture models. This is because the acceptance decision is based on the difference with the mean of the model, regardless of the difference with the previous pixel value. The learning speed of MGM is typically very slow (α is usually less than 0.01), so gradual changes spread over long periods (e.g., day turning into night) can be taken into account by the models. However, the small learning rate makes the adaptation of the current background models too slow to encompass the short consistent gradual changes described here. Increasing the learning rate would result in too many misdetections of actual foreground pixels, since a high learning rate causes slow-moving or stopped objects to be considered background after a very short time, which results in false negatives. When analyzing sequences of indoor surveilled environments, we noticed that the same lighting problems occur. Reflections of light, smooth shadows, and objects blocking light sources are just a few of the origins of gradually changing lighting conditions. Dealing with shadows and lights is one of the major remaining issues in indoor video surveillance. Consequently, in this paper we present our changes to MGM to deal with gradual lighting changes. We apply them to a number of challenging indoor surveillance sequences and show our results.

Improved Background Mixture Models

Simple Mixture Models

In this section we present adjustments to the standard MGM system. First, we show our solution for the problem of gradual lighting changes. Secondly, we present changes to increase the processing speed, so the system becomes practically usable. Since moving object detection is typically one of the first steps in surveillance, it is important to make this step as fast as possible.

Dealing with Consistent Gradual Changes

MGM uses only the current pixel value and the mixture model in the matching, update, and decision steps. The pixel values of the previous image are not stored; they are only used to update the models. We propose to make the technique aware of the immediate past by storing the previous pixel values in case background pixels were detected. The assumption we make here is that if a pixel was considered background in the previous frame, and there is only a minor change with the next frame, then the pixel should be regarded as background in the current frame. The matching step is altered according to the pseudocode shown in Figure 2.

    For each background model m do {
        If (|I - prevI[m]| < gr * Thresh[m]) {
            Match = true;
            Decision = background;
            diff = |I - prevI[m]|;
            If (diff < minDiff) {
                minDiff = diff;
                matchedModel = m;
            }
        }
    }
    If (!Match) checkMatch(I);
    update(I);
    decide(I);
    If ((Match == true) and (Decision == background))
        prevI[matchedModel] = I;

Figure 2. Pseudocode of the new matching procedure used in SMM

For each background model in the mixture, the last pixel value that was actually matched is stored (denoted prevI). All the background models need to be considered in this step, since we might be dealing with a multimodal environment. When processing a new pixel value (denoted I), it is compared to these stored background values. If the difference is small enough, a match is immediately effectuated. Since we were dealing with a background model and we have a match, the decision is again that the pixel represents background. Given that we might be dealing with multiple background models, we search for the background model which yields the smallest difference (minDiff). If the difference is larger than the gradual threshold, we proceed with the normal matching step (checkMatch). Regardless of the difference, the regular update and decision steps are executed (update and decide, respectively). To prevent changes smaller than the normal threshold (2.5 standard deviations from the mean) from resulting in false detections, the gradual threshold needs to be smaller than the normal one. As such, this step does not affect the matching in case of larger differences between pixel values. The difference between MGM and the proposed system will consequently only be noticeable if a new pixel value differs slightly from the previously matched one, but falls outside the matching range of the model. Since the threshold for the normal matching process is dependent on the specific model, more specifically on its standard deviation, it is better to enforce this for the gradual threshold as well.
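As an illustration, the gradual matching step of Figure 2 could be implemented as follows for a single grayscale pixel. Here prev_vals stores the last value matched to each background model and thresh[m] is that model's normal matching threshold (2.5 standard deviations); the function and variable names are ours.

```python
def gradual_match(value, prev_vals, thresh, gr=0.7):
    """Gradual matching step of SMM: return the index of the matched
    background model, or None if the normal matching step should run."""
    matched, min_diff = None, float("inf")
    for m, prev in enumerate(prev_vals):
        diff = abs(value - prev)
        # Accept if the change relative to the previously matched background
        # value stays below a fraction gr of the model's normal threshold,
        # keeping the model with the smallest difference.
        if diff < gr * thresh[m] and diff < min_diff:
            min_diff, matched = diff, m
    if matched is not None:
        prev_vals[matched] = value  # remember the latest background value
    return matched
```

If this step returns None, the regular matching (checkMatch) is performed; in either case the regular update and decision steps follow, as in Figure 2.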
Furthermore, this threshold should not exceed the normal threshold. Therefore, we have

chosen a per-pixel threshold dependent on the normal threshold. We introduce a new parameter, gr, which specifies a percentage of the normal threshold. Figure 3 records the number of detection failures and false alarms for several values of gr on the PetsD2TeC2 sequence. A manually created ground truth has been used to calculate the false positives and negatives. The average values over the entire sequence are plotted in the curve to find the optimal value for the parameter (close to the center). A gr of about 0.7 gives the best results. If we use lower values, we are not able to cope with some of the gradual lighting changes, resulting in many false alarms. Higher values for gr result in more detection failures: if we use a high gradual threshold and a new background pixel differs much from the mean of the model, the threshold for the next frame would be very large, resulting in mismatches. Consequently, gr = 0.7 is chosen and is used in all further experiments.

[Figure 3. ROC graph (detection failures versus false alarms) for the PetsD2TeC2 sequence, for values of gr ranging from 0.05 to 1.2]

Increasing Processing Speed

One of the major drawbacks of MGM is its slow processing speed. Since it is a pixel-wise background subtraction technique, it has to perform the same operations for every pixel in every frame. Moreover, updating the parameters of the models (as in equations (2) and (3)) relies on the calculation of the probability distribution function (equation (4)). This is a heavy processing step, and it has already been noticed that the resulting values tend to be very small. To speed the system up, we use a constant update parameter (ρ). Since we no longer use the probability distribution function, we can step away from the assumption that the pixel values have a Gaussian distribution. In fact, we can now regard the models as feature vectors consisting of a mean and a threshold, regardless of the underlying distribution. Therefore, we call our system Simple Mixture of Models (SMM).
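With a constant ρ, the update of a matched model reduces to two multiply-accumulate operations instead of a per-pixel evaluation of the Gaussian pdf of formula (4). A minimal sketch, with illustrative names and a default rho of our choosing:

```python
def smm_update(mean, var, value, rho=0.01):
    """Update a matched model with a constant learning factor rho,
    avoiding the per-pixel Gaussian pdf evaluation of MGM. The model's
    matching threshold can still be kept proportional to sqrt(var)."""
    mean = (1 - rho) * mean + rho * value
    var = (1 - rho) * var + rho * (value - mean) ** 2
    return mean, var
```

The exponential and square-root of the pdf disappear from the inner loop, which is where the speed gain over MGM's update step comes from.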

In the decision step, MGM sorts the models according to their weights and then adds the weights in this order until the sum exceeds a threshold T. The models that account for this sum are consequently considered to represent background; the other models are considered to represent foreground. This sorting of the models introduces additional complexity in MGM. In most cases there is only one model which represents the background, and new foreground models tend to have a small weight. Therefore, in SMM, the first step is to check whether the weight of the matched model is larger than the threshold T. The next step is to check whether its weight is smaller than the weights of all other models; in the latter case the model surely depicts foreground. Since these two cases occur frequently, a costly sorting step can often be avoided. If neither of the initial conditions is fulfilled, the normal sorting is applied.

To increase the processing speed further, we have chosen to only regard part of the pixels per frame. We use a mask of the same size as the image to decide which pixels to evaluate when processing a new image. The mask is constructed from 2x2 patches, which are repeated over the entire image. We use 4 different masks to make sure that every pixel in an image is evaluated at least once in 4 consecutive images; for every new image the mask is switched. Figure 4 shows the patches which constitute the different masks. The positions marked with an X are the pixels for which the normal background subtraction is executed. The results are then used to make a decision about the surrounding pixels. Positions denoted with H apply horizontal interpolation: for the pixel at that position, the results of the pixels to the left and right are evaluated. If both pixels were considered foreground, the decision for the current pixel is foreground; otherwise it is background.
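The interpolation rules for the H, V, and D positions could be sketched as follows. Here `analyzed` is the boolean foreground map produced by running the full background subtraction on the X positions of the current mask, and `labels` assigns one of 'X', 'H', 'V', 'D' to each pixel; the function name, data layout, and border handling are our own assumptions, not code from the paper.

```python
import numpy as np

def interpolate_mask(analyzed, labels):
    """Fill in H/V/D positions from their analyzed neighbours.
    Border pixels are left as-is for simplicity in this sketch."""
    h, w = analyzed.shape
    out = analyzed.copy()
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if labels[y, x] == "H":      # horizontal: left and right
                out[y, x] = analyzed[y, x - 1] and analyzed[y, x + 1]
            elif labels[y, x] == "V":    # vertical: upper and lower
                out[y, x] = analyzed[y - 1, x] and analyzed[y + 1, x]
            elif labels[y, x] == "D":    # diagonal: at least 3 of 4
                diag = [analyzed[y - 1, x - 1], analyzed[y - 1, x + 1],
                        analyzed[y + 1, x - 1], analyzed[y + 1, x + 1]]
                out[y, x] = sum(diag) >= 3
    return out
```

Only the X positions pay the full per-pixel cost; the other three quarters of the pixels are decided by these cheap neighbourhood tests.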
For the positions in the patch denoted by V we take the upper and lower pixels into account. Finally, for the positions denoted by D, the upper-left, upper-right, lower-left, and lower-right pixels are used; if three of them are considered foreground, the current pixel is foreground.

    (a)  X H     (b)  H X     (c)  V D     (d)  D V
         D V          V D          H X          X H

Figure 4. 2x2 patches of the analysis mask for every 1st (a), 2nd (b), 3rd (c), and 4th (d) frame

An alternative to the proposed system is simple downscaling of the image. However, for surveillance cameras which monitor wide environments, a single pixel might constitute a large part of the scene, so important information might be missed. By making the mask itself dynamic, we make sure that we analyze the entire scene at least once over a number of frames.

Experimental Results

We compare our system with the traditional MGM, an evaluation approach which occurs frequently in related work (5, 10). To show that our system is generally applicable, we have adjusted our parameters based on the PETS2005 sequences and used those settings for all our experiments. The PETS2005 sequences form a generally accepted benchmark and contain several frames which suffer from gradual lighting changes.

[Figure 5. Left: current frame; center: output of MGM; right: output of the proposed system. The first row shows the s00-thirdview sequence of the PETS2007 dataset, image 170 (a), (b), and (c); the second row shows image 300 (d), (e), and (f). The third and fourth rows show the AVSS AB easy sequence of the AVSS2007 dataset, image 1290 (g), (h), and (i), and image 3790 (j), (k), and (l)]

Figure 5 gives subjective results for our system on 2 images from each of 2 different sequences. The first sequence is from the PETS2007 benchmark dataset, showing a terminal in a British airport (1). The second sequence is part of the i-LIDS bag and vehicle detection challenge, issued during AVSS2007 (2); it shows a subway station with moving trains and people. The left column of the figure shows the original images, the center one shows the outputs of MGM, and the right column shows the results of SMM. As can be seen, our system succeeds in dealing with small lighting changes and is still able to detect actual regions of interest (e.g., moving people). No morphological post-processing has been applied, so further refinements are possible.

[Figure 6. ROC graph for the output of MGM and SMM on the s00-thirdview sequence]

[Figure 7. ROC graph for the output of MGM and SMM on the AVSS AB easy sequence]

An objective evaluation is given in Figure 6 and Figure 7, which show a quantitative comparison of MGM and SMM for the 2 sequences shown in Figure 5. To obtain these curves, a manual ground-truth annotation was made for every 50th frame of each sequence. For each image the numbers of false positives and false negatives are recorded and then averaged over the entire sequence. The x-axis represents the False Positive Rate (FPR), the fraction of incorrect positive results among all negative samples available during a test; in our case, this is the percentage of real background pixels which are incorrectly regarded as foreground. The True Positive Rate (TPR) denotes the percentage of actual foreground pixels which were correctly classified as foreground. An ideal system would yield points in the upper left corner of the ROC graph. The FPR is typically very small, since in surveillance scenarios there is more background than foreground. To obtain the curves we have used several values of α, both for MGM and SMM, to get different values for the FPR and TPR. As can be seen, our system yields lower false positive rates while achieving similar true positive rates. Table 1 shows the execution times per frame for the discussed systems on the 2 sequences; it clearly shows the gains in processing speed for both sequences.

Table 1. Execution times in milliseconds per frame

                                    MGM            SMM
                                 Avg  Stdev     Avg  Stdev
    AVSS AB easy (360x288, RGB)   167    5       52     3
    s00-thirdview (720x576, RGB) 1227   39      411    31

Conclusions

This paper presents an object detection technique using a simple mixture of models based on the well-known Mixture of Gaussian Models. The original scheme has been discussed in depth and its inability to deal with quick illumination changes has been shown. Consequently, an update of the matching mechanism has been presented. Furthermore, changes have been proposed to increase the processing speed. Experiments have been done on challenging indoor environments (such as subways) and show that our algorithm yields significant improvements, both in detection accuracy and execution times.

Acknowledgments

The research activities described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.

References

(1) The 10th International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2007)
(2) IEEE International Conference on Advanced Video and Signal based Surveillance 2007 (AVSS2007)
(3) A. Dick, M.J. Brooks: Issues in automated visual surveillance. Proceedings of the International Conference on Digital Image Computing: Techniques and Applications. (2003) 195-204
(4) C. Stauffer, W.E.L. Grimson: Learning Patterns of Activity Using Real-Time Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 22. (2000) 747-757
(5) O. Javed, K. Shafique, M. Shah: A Hierarchical Approach to Robust Background Subtraction using Color and Gradient Information. Proceedings of the Workshop on Motion and Video Computing. (2002) 22-27
(6) K. Toyama, J. Krumm, B. Brumitt, B. Meyers: Wallflower: Principles and Practice of Background Maintenance. Proceedings of the IEEE International Conference on Computer Vision. (1999) 255-261
(7) J. Wu, M. Trivedi: Performance Characterization for Gaussian Mixture Model Based Motion Detection Algorithms. Proceedings of the IEEE International Conference on Image Processing. (2005) 97-100
(8) D. Lee: Online Adaptive Gaussian Mixture Learning for Video Applications. Lecture Notes in Computer Science, Statistical Methods in Video Processing. (2004) 105-116
(9) Y. Zhang, Z. Liang, Z. Hou, H. Wang, M. Tan: An Adaptive Mixture Gaussian Background Model with Online Background Reconstruction and Adjustable Foreground Mergence Time for Motion Segmentation. Proceedings of the IEEE International Conference on Industrial Technology. (2005) 23-27
(10) Y. Tian, M. Lu, A. Hampapur: Robust and Efficient Foreground Analysis for Real-time Video Surveillance. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. (2005) 1182-1187
(11) C. Poppe, G. Martens, P. Lambert, R. Van de Walle: Improved Background Mixture Models for Video Surveillance Applications. Asian Conference on Computer Vision. (2007)
(12) L.M. Brown, A.W. Senior, Y. Tian, J. Connell, A. Hampapur, C. Shu, H. Merkl, M. Lu: Performance Evaluation of Surveillance Systems Under Varying Conditions. Proceedings of the IEEE International Workshop on Performance Evaluation of Tracking and Surveillance. (2005)
(13) A. Prati, I. Mikic, M.M. Trivedi, R. Cucchiara: Detecting Moving Shadows: Algorithms and Evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 25. (2003) 918-923
(14) R. Cucchiara, C. Grana, G. Neri, M. Piccardi, A. Prati: The Sakbot System for Moving Object Detection and Tracking. Video-Based Surveillance Systems - Computer Vision and Distributed Processing. (2001) 145-157