Human detection solution for a retail store environment


FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Human detection solution for a retail store environment

Vítor Araújo

PREPARATION OF THE MSC DISSERTATION
Mestrado Integrado em Engenharia Eletrotécnica e de Computadores

Supervisor: Jaime Cardoso, PhD (FEUP)
Co-Supervisor: Pedro Carvalho, PhD (INESC Porto)

July 5, 2013

© Vítor Araújo, 2013

Abstract

This document provides an overview of the current state of the art in the field of human detection by automated systems, as part of the "Preparation of the MSc Dissertation" course unit. It will serve as a starting point for the Dissertation work, which will take place in the 1st semester of the 2013/2014 academic year. To that end, a work plan and a proposed work methodology are also included.


Contents

Abstract
Symbols and Abbreviations
1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Document Structure
2 State of the Art
  2.1 HOG
    Detection Results
    Conclusion
  2.2 CENTRIST
    Detection Results
    Conclusion
  2.3 Viola-Jones
    Detection Results
    Conclusion
  2.4 C⁴
    Detection Results
    Conclusion
  2.5 HOG-LBP
    Detection Results
    Conclusion
3 Planning
  3.1 Work plan
  3.2 Challenges
  3.3 Expected results
  3.4 Tools
4 Conclusion
References


List of Figures

2.1 Detection performance of the HOG descriptor (from [1])
2.2 Comparison between C⁴ and HOG on the INRIA data set (from [2])
3.1 Work plan estimate


List of Tables

2.1 Detection rates for certain numbers of false positives


Symbols and Abbreviations

Acronyms

1D        One dimension
CCTV      Closed-circuit television
CENTRIST  CENsus TRansform histogram
CMU       Carnegie Mellon University
CPU       Central Processing Unit
FBI       Federal Bureau of Investigation
FPPW      False positives per window
FPS       Frames per second
GPU       Graphics Processing Unit
HOG       Histogram of Oriented Gradients
IDE       Integrated Development Environment
LBP       Local Binary Pattern
MIT       Massachusetts Institute of Technology
SVM       Support Vector Machine

Chapter 1 Introduction

This chapter contextualizes the current importance of automatic human detection systems and their major advantages and weaknesses. It also describes the main objective of this project and the structure of the rest of the document.

1.1 Motivation

In recent years, the adoption of security camera systems has grown sharply, not only in the enterprise environment but also for home and small store surveillance. These systems are usually installed in a closed environment (CCTV), transmitting and/or recording only to a local endpoint. The quality of the cameras deployed has also improved considerably over the years, going from black and white video with a low resolution and frame rate to color high-definition video at 60 FPS or more. Fortunately for consumers, these improvements in video quality were accompanied by increasingly affordable high-capacity data storage media, which has allowed surveillance video to be recorded and archived for longer periods of time at little cost.

Unfortunately, these technological advances come at a price. The amount of data captured by surveillance systems has made it difficult to find the exact information needed, be it the moment an intruder is recorded by the camera, or the unpredictable but potentially helpful recording of a missing person. This is where automated recognition systems come into play.

Image recognition software has been the subject of many improvements over the years, but still has a long way to go before it achieves the same quality of detection that a human is capable of. Currently, the biggest advantage of automated systems is their speed. With the right off-the-shelf hardware to support it, some of these systems can scan through hours of high-definition video in minutes. Nevertheless, even the most powerful hardware clusters are useless if they are not backed by a good detection algorithm.

For instance, in the 2013 Boston Marathon bombings in the United States of America, an unprecedented amount of footage was recorded. This footage was then processed by the authorities, but the results came up empty, despite the fact that both suspects were already in the FBI database [3]. The suspects ended up being detected by a human who was looking through the video. The lack of effectiveness of the system was attributed to several causes, such as the low resolution of the cameras and the long range of the recordings, some of which were badly focused and captured from angles that fell within the software's weaknesses.

1.2 Objectives

The primary objective of this project is to improve and adapt an existing algorithm to work in an object-dense area. In this area, which is based on a small store environment, there will be objects blocking the people in the image. The majority of these objects will be static, but some moving objects can also be present and should be taken into account.

1.3 Document Structure

Apart from this introductory chapter, this document has 3 more chapters. Chapter 2 presents a state of the art analysis and related work. Chapter 3 describes the development planning. Chapter 4 closes with a brief conclusion that reflects on future developments.

Chapter 2 State of the Art

The field of computer vision technologies is currently the subject of many academic research projects, yet improvements still come in small increments, as most research activity builds on one of the previously available frameworks. Therefore, this chapter covers the frameworks that are currently regarded as the best in the field, and then highlights some of the recent improvements that have been made.

2.1 HOG

This descriptor was first proposed in 2005 in the paper Histograms of Oriented Gradients for Human Detection [1]. It is based on the concept that the distribution of intensity gradients or edge directions can define an object within an image. The practical implementation divides the image into spatial regions, called cells, each of which holds a local 1D histogram of gradient directions or edge orientations over the pixels of the cell. The combination of histograms from all cells forms the image descriptor.

In order to diminish the effect of illumination and shadowing variance, the local results should be contrast-normalized. This can be achieved by measuring the intensity across a larger area, called a block, and using this value to normalize all cells within the block.

Detection Results

The paper provides results from two different data sets: the MIT pedestrian database, which contains only front or back views, with a limited range of poses; and the INRIA database, which was developed by the paper's authors and provides a bigger challenge for the descriptor, as it contains images of people standing in any orientation, with a wide range of backgrounds, including crowds.

For the MIT data set, the descriptor performed near-perfectly, with a miss rate of less than 1% at $10^{-4}$ FPPW. It was due to these results that the INRIA data set was developed. On that data set, the miss rate at the same FPPW is 10%. While these results are worse than for the MIT data set, they are still a big improvement over other descriptors. All these results can be seen in Figure 2.1.

Figure 2.1: Detection performance of the HOG descriptor (from [1])

Conclusion

The HOG descriptor provides a big improvement over previously used methods, which has led to its popularity as a starting point for recent projects in this field. Nevertheless, it lacks a few key features, such as accounting for image orientation. Its detection speed is also a concern, especially when applied to high-resolution images and video.

2.2 CENTRIST

The CENsus TRansform histogram descriptor first appeared in the paper CENTRIST: A Visual Descriptor for Scene Categorization [4]. As defined in this paper, "Census Transform (CT) is a non-parametric local transform originally designed for establishing correspondence between local patches. Census transform compares the intensity value of a pixel with its eight neighboring pixels (...). If the center pixel is bigger than (or equal to) one of its neighbors, a bit 1 is set in the corresponding location. Otherwise a bit 0 is set." [4]

The result is a 3x3 binary matrix, with the central position empty, which can be translated into a base-10 number in the [0, 255] range. This number represents the Census Transform value for the central pixel. Repeating the process for each pixel in the image, the resulting set of values can then be used as input to the classifier.

The initial processing of the image using the Census Transform method makes the scene easier for the classifier to recognize, as it can ignore distracting elements, like textures and color, and focus on the more important geometric features and structural properties.
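The Census Transform described in section 2.2 can be illustrated with a minimal sketch. It is not the implementation from [4]: the function names are hypothetical, the image is assumed to be a plain list of rows of grey levels, and the bit order (neighbours read left to right, top to bottom, first neighbour as the most significant bit) is an assumption made for this example.

```python
def census_transform(image):
    """Return the Census Transform value of every interior pixel.

    `image` is a list of equal-length rows of grey-level intensities.
    Border pixels lack a full 3x3 neighbourhood, so the output is two
    rows and two columns smaller than the input.
    """
    # Neighbour offsets in row-major order, skipping the centre (0, 0).
    # Assumption: the first neighbour becomes the most significant bit.
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    height, width = len(image), len(image[0])
    result = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            centre = image[r][c]
            value = 0
            for dr, dc in offsets:
                # Bit is 1 when the centre is bigger than or equal to the neighbour.
                bit = 1 if centre >= image[r + dr][c + dc] else 0
                value = (value << 1) | bit
            row.append(value)
        result.append(row)
    return result


def ct_histogram(ct_values):
    """Count how often each of the 256 possible CT values occurs."""
    hist = [0] * 256
    for row in ct_values:
        for value in row:
            hist[value] += 1
    return hist


patch = [[32, 64, 96],
         [32, 64, 96],
         [32, 32, 96]]
print(census_transform(patch))  # [[214]], i.e. binary 11010110
```

The histogram helper reflects the descriptor's name (a histogram of Census Transform values); how that histogram is fed to a classifier is outside the scope of this sketch.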

Detection Results

CENTRIST was compared with 2 other visual descriptors, SIFT [5] and Gist [6]. In one test, with both outdoor and indoor environments, CENTRIST's accuracy was 83.88±0.76%, while Gist reached 73.28±0.67%. As for SIFT, its high discriminative power makes it inefficient for images with high variation. It produced 57.24% false negative results, against 35.83% for CENTRIST.

Conclusion

This method allows for a good evaluation of the type of scene. However, it does not implement object detection.

2.3 Viola-Jones

This framework describes a method of detecting objects in a scene. Its most widely used implementation focuses on face detection [7], which, in the context of this project, makes it useful as a first detection algorithm.

This procedure works by classifying images based on simple features. Using features instead of working directly on the pixels makes processing much faster, allowing the focus to be put on the quality of the results. By also implementing a cascade of classifiers, this method can achieve good detection rates (>85%) while keeping false positive rates low ($<10^{-5}$).

As an example, a target detection rate of 0.9 can be obtained by using a 10-stage classifier. Each stage needs to have a detection rate of 0.99, which may seem difficult to achieve, but it becomes much easier because each stage is allowed a large margin for error: a false positive rate of 0.3. The end result is:

Detection rate: $0.99^{10} \approx 0.9$
False positive rate: $0.3^{10} \approx 6 \times 10^{-6}$

Detection Results

Table 2.1 presents the detection accuracy of the Viola-Jones framework, compared to the Rowley-Baluja-Kanade [8] results, using the MIT-CMU test set, which contains 130 images and 507 faces.

                        False positives
Detector
Viola-Jones             78.3%   85.2%   90.8%   91.8%   93.7%
Rowley-Baluja-Kanade    83.2%   86.0%   89.2%   90.1%   89.9%

Table 2.1: Detection rates for certain numbers of false positives

As for its speed, on a 700 MHz Pentium III processor, using 384 by 288 images, each one took seconds to process, which is about 15 times faster than the Rowley-Baluja-Kanade detector.
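The cascade example in section 2.3 multiplies per-stage rates. A minimal sketch of that arithmetic follows, assuming every stage has the same rates and that a window is accepted only if it passes all stages; the function name is hypothetical.

```python
def cascade_rates(stage_detection_rate, stage_false_positive_rate, stages):
    """Overall rates of a cascade of identical, independent stages.

    A window is accepted only if every stage accepts it, so the
    per-stage rates multiply.
    """
    detection = stage_detection_rate ** stages
    false_positive = stage_false_positive_rate ** stages
    return detection, false_positive


detection, false_positive = cascade_rates(0.99, 0.3, 10)
print(f"detection rate: {detection:.3f}")            # detection rate: 0.904
print(f"false positive rate: {false_positive:.2e}")  # false positive rate: 5.90e-06
```

The asymmetry is the point of the design: a stage may wrongly accept almost a third of the negative windows and the cascade as a whole still rejects nearly all of them, while the detection rate only drops to about 0.9.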

Conclusion

The Viola-Jones framework provides good and fast results for face detection, which can be useful as a first step for this project. By detecting the face first, the solution developed could then more easily detect the rest of the person, achieving better results in less time.

2.4 C⁴

C⁴ [2] is a detector based on CENTRIST which focuses on contour cues for its detection. The method works by first computing the Sobel gradients of the image, then computing the Census Transform values and creating a single integral image. This image is then resized and the brute-force scan is performed. The major performance advantage comes from using only one integral image, and from the fact that CENTRIST does not require normalization, unlike HOG.

Detection Results

A comparison with HOG can be seen in Figure 2.2.

Figure 2.2: Comparison between C⁴ and HOG on the INRIA data set (from [2])

When comparing detection speed, C⁴ can process a 640 by 480 video at 20 FPS using 1 core of a 2.8 GHz CPU. The nearest comparable solution ran at 10 FPS, while also using parallel processing on a GPU.

Conclusion

This implementation seems to be fast and accurate, improving on previous work. However, it is still not as accurate as other methods, and accuracy is the most important factor, as CPU performance is constantly improving.
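Section 2.4 attributes part of the speed of C⁴ to its use of a single integral image. The sketch below shows only the general integral image (summed-area table) technique, not the detector from [2]: the function names are hypothetical, and the input is assumed to be any grid of numbers rather than the values the real detector stores.

```python
def integral_image(image):
    """Summed-area table with an extra zero row and column at the top-left.

    table[r][c] holds the sum of all image values above and to the
    left of position (r, c), exclusive.
    """
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        row_sum = 0
        for c in range(width):
            row_sum += image[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def box_sum(table, top, left, bottom, right):
    """Sum of rows top..bottom-1 and columns left..right-1 in four lookups."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
table = integral_image(grid)
print(box_sum(table, 1, 1, 3, 3))  # 28, i.e. 5 + 6 + 8 + 9
```

Once the table is built, the sum over any rectangular window costs four lookups regardless of the window size, which is what makes a brute-force sliding-window scan affordable.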

2.5 HOG-LBP

This method combines the framework described in 2.1 with a Local Binary Pattern, in a detector that also handles partial occlusion [9]. It takes advantage of the high discriminative power of LBP, along with the edge and local shape information captured by HOG.

After computing the Histogram of Oriented Gradients and the Local Binary Pattern integral images, these are combined in an augmented feature vector. It is on this vector that the sliding window acts. By feeding the sliding window results to an SVM, each block can be scored. If the SVM scores the block with an ambiguous classification, an image segmentation algorithm is run, which segments the possible occlusion regions.

Detection Results

The HOG-LBP method achieved a detection rate of 91.3% at $10^{-6}$ FPPW and 94.7% at $10^{-5}$ FPPW using the INRIA data set. This compares with HOG's rate of 90% at $10^{-4}$ FPPW. With a custom upper body data set, the improvement over HOG was 20%.

Conclusion

Despite increasing the detector complexity, this method proved effective in detecting partially occluded subjects. This kind of situation will be the most common in this project, and this type of approach may be one of the more effective ones for a store environment.
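The augmented feature vector of section 2.5 can be sketched for a single cell as a gradient-orientation histogram concatenated with an LBP histogram. This is an illustration under stated assumptions, not the detector from [9]: all function names are hypothetical, 9 unsigned orientation bins and a plain 256-bin 8-neighbour LBP are assumed, each histogram is L2-normalized on its own instead of per block, and the integral images, SVM scoring and occlusion segmentation are left out.

```python
import math


def orientation_histogram(cell, bins=9):
    """Unsigned gradient-orientation histogram of a cell, weighted by magnitude."""
    hist = [0.0] * bins
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]  # centred differences
            gy = cell[r + 1][c] - cell[r - 1][c]
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # fold to [0, 180)
            hist[int(angle / (180.0 / bins)) % bins] += math.hypot(gx, gy)
    return hist


def lbp_histogram(cell):
    """Histogram of basic 8-neighbour LBP codes over the interior pixels."""
    # Assumption: neighbours are read clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0.0] * 256
    for r in range(1, len(cell) - 1):
        for c in range(1, len(cell[0]) - 1):
            code = 0
            for dr, dc in offsets:
                # Bit is 1 when the neighbour is at least as bright as the centre.
                code = (code << 1) | (1 if cell[r + dr][c + dc] >= cell[r][c] else 0)
            hist[code] += 1.0
    return hist


def l2_normalize(vector, eps=1e-9):
    """Scale a vector to unit length; eps avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in vector)) + eps
    return [v / norm for v in vector]


def hog_lbp_feature(cell):
    """Concatenate the two normalized histograms into one feature vector."""
    return l2_normalize(orientation_histogram(cell)) + l2_normalize(lbp_histogram(cell))


edge = [[0, 0, 10, 10] for _ in range(4)]  # a vertical edge
print(len(hog_lbp_feature(edge)))  # 265 = 9 orientation bins + 256 LBP codes
```

A full detector would compute such vectors for every block of every sliding window and pass them to a trained classifier; the sketch only shows how the two kinds of information end up side by side in one vector.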


Chapter 3 Planning

This chapter presents an overview of how the project development will take place. It includes some of the expected challenges and results, an estimate of the development schedule and the tools that will be used.

3.1 Work plan

Figure 3.1: Work plan estimate

Note: Dates for the Dissertation document delivery and presentation are not yet available. These estimates were based on previous years' dates.

3.2 Challenges

The biggest challenge perceived at this point will be understanding the initial code, how it works, and how each function affects the end result. Once this has been accomplished, the development of the algorithms needed for detection to work in a complex environment will start.

3.3 Expected results

At the end of this project, the solution provided is expected to allow accurate human detection in a complex store environment. This solution should account for partial subject occlusion, either by objects or by other subjects, and its results should fall within a reasonable false positive and false negative range, based on similar solutions.

3.4 Tools

The development will be done on the Windows and Linux operating systems, using an appropriate Integrated Development Environment. The IDE will be chosen based on the project that will serve as a basis for this work, thus minimizing the risk of errors and avoiding the need to configure a different IDE from scratch.

Chapter 4 Conclusion

Effective and reliable human detection on still images or video is presently one of the most challenging aspects in the field of computer vision. The state of the art presented in this document shows some of the most relevant work currently being developed in this area that relates to the subject of this project. By focusing on these approaches to the problem, a new solution is expected to be developed and implemented in the following months, bringing forward another small contribution to help improve human detection systems.


References

[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Conference on, volume 1, 2005.

[2] Jianxin Wu, C. Geyer, and J.M. Rehg. Real-time human detection using contour cues. In Robotics and Automation (ICRA), 2011 IEEE International Conference on.

[3] Douglas McCormick. Face recognition failed to find Boston bombers, April. URL: face-recognition-failed-to-find-boston-bombers.

[4] Jianxin Wu and J.M. Rehg. CENTRIST: A visual descriptor for scene categorization. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(8).

[5] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110.

[6] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42.

[7] Paul Viola and Michael Jones. Robust real-time object detection. In International Journal of Computer Vision.

[8] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Neural network-based face detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(1):23-38.

[9] Xiaoyu Wang, Tony X. Han, and Shuicheng Yan. An HOG-LBP human detector with partial occlusion handling. In Computer Vision, 2009 IEEE 12th International Conference on. IEEE.
