Visual object detection for animal behavior research


Visual object detection for animal behavior research

BY

ISLAM ISMAILOV
B.S., Taurida National University, 2010

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Chicago, 2012

Chicago, Illinois

Defense Committee:
Tanya Berger-Wolf, Chair and Advisor
Charles Stewart
Brian Ziebart

Dedicated to my family

Whatever the [visual] cortex is doing, the analysis must be local.

David Hubel, in Eye, Brain, and Vision

TABLE OF CONTENTS

1 INTRODUCTION
    Motivation
    Wildlife animal detection and its applications
    Challenges
    Background on object detection
    Overview of our approach

2 STANDARD APPROACH TO OBJECT DETECTION
    Image features
        Sparse representation of image regions
            Key point detectors
            Part detectors
        Dense representation of image regions
            Intensity based detectors
            Edge and gradient based detectors
            Wavelet based detectors
    Classification methods
        Discriminative approaches
        Bayesian and graphical models
    Fusion of multiple detections

3 OVERVIEW OF METHODOLOGY AND RESULTS
    Overall architecture
    Learning phase
        Training image set normalization
        Overview of feature sets
            Rectangular DL (R-DL)
            Other descriptors
            Generalized Haar Wavelets
            Histogram of Oriented Gradients
            ADR-HOG
        The learning process
    Detection phase
        Multiscale object localization
        Object localization through classification
        Non-maximum suppression
        Detector window and classifier

4 EXPERIMENTS AND IMPLEMENTATION
    Implementation and Performance Study
        Gamma normalization
        Edge computation
        Lattice and block overlap
        Descriptor blocks and block normalization schemes
        Overview of the results
    Data set details
    Performance evaluation
        Detector performance
            R-DL
            HOG
            Haar wavelets
            ADR-HOG
        Results
    Automating zebra identification with detection

5 CONCLUSIONS

CITED LITERATURE

VITA

LIST OF TABLES

I DETECTOR SPEED

LIST OF FIGURES

1 Learning phase
2 Detection phase
3 Tagging program interface
4 Aspect ratio distributions in two different data sets
5 Normalized positive training data
6 Negative training data
7 Hard negatives
8 PCA of HOG [1]
9 Non-maximum suppression for fusion of overlapping detections
10 Image filtering
11 Edge computation
12 Lattice and block overlap
13 Descriptor blocks and normalization schemes
14 Performance of the detectors
15 Linear regression derived ROIs
16 Examples of automatically derived ROIs
17 Performance of the automatically derived ROIs

LIST OF ABBREVIATIONS

SVM    Support Vector Machines
NMS    Non-Maximal Suppression
CPU    Central Processing Unit
HOG    Histogram of Oriented Gradients
SIFT   Scale Invariant Feature Transformation
ROI    Region of Interest

SUMMARY

An increasing focus of biology and biocomplexity research has been on developing new methods and computational strategies to model and manage complex biological systems [2]. Understanding these systems is important, for example, in order to study how biological and physical interactions across many scales of resolution [3] could be represented, or how weather patterns affect species distribution across space and time [4]. Learning such aspects requires spatio-temporal and behavioral data at relevant scales [2, 5].

Despite its importance, current data collection technology cannot satisfy the need for detailed data on animal movements and their relationships to other factors and patterns. Until recently, wildlife tracking studies relied on very simple and limited technology, such as collaring a sample subset of animals with simple VHF transmitters, GPS trackers, or custom sensor-network based systems [6]. These systems, however, apart from being expensive, high-maintenance methods, require tranquilizing the animal to install the device, which is a serious impediment to studying endangered or numerous species. To overcome these difficulties, non-invasive methods such as camera traps are used. Camera traps and other photo-capturing devices have been gaining considerable popularity for both research and general purposes. These ubiquitous devices are easy to operate, and they produce significant amounts of data in the form of digital photographs. Mining biologically relevant data from the digital photographs, however, is still based on manual techniques, which require substantial human effort, consume considerable amounts of time, and are generally expensive and slow.

Detection frameworks use computer vision techniques to provide a more efficient way to collect and comprehend large amounts of ecological data. This will improve the speed at which changes in ecological conditions can be detected and quantified. Detection and identification frameworks will allow larger data sets, which in turn might facilitate better understanding of such large-scale ecological phenomena as species migration and extinction. They will also allow ecological studies at larger and more detailed scales, leading to better understanding of ecological systems. From these premises it follows that we need robust methods for the detection and identification of individual animals. This need has provided the main motivation for the work described in this thesis.

A viable way to reduce the amount of human work associated with ecological data retrieval is to use computer vision to extract objects of interest from visual data, which is usually available in abundance. This technology can achieve high precision and therefore requires little human attention and implies low maintenance costs. This thesis is thus concerned with building object detectors from a computer vision point of view, where the detectors search given images for objects of interest and localize them.

The main contribution of this thesis is the development and implementation of an object recognition framework for zebras. The proposed image descriptors are particularly easy to understand, and provide an example of how simple yet effective specialized features can be devised for a given object category. In general, designing image descriptors is a challenging task, because categorizing objects in images is an inherently ambiguous process, with the image formation process, compression, intra-class variation, and occlusion all contributing to ambiguity on different levels. However, we can take advantage of some domain-specific features in order to provide compact and efficient image descriptors. According to the experimental results, our recognition framework achieves performance comparable to the state-of-the-art techniques generally used in computer vision, while maintaining very fast training times and an easily parallelizable classification implementation. We have incorporated the image descriptor scheme into an existing identification algorithm to automate the zebra identification method, which previously required human intervention. The modular framework developed in this thesis is integrated with an existing identification tool to derive a system that achieves performance comparable to that of a manual approach.

The thesis is composed of five chapters. Chapter 1 provides an introduction presenting the motivation for this work, background on object recognition, challenges, and an overview of our approach. Chapter 2 describes the state of the art in object detection. It first describes previous work on image description, then summarizes the key contributions on detection models. It also presents our motivation for using dense feature sets for object detection. Chapter 3 presents an overview of our approach to object detection. It does not give implementation-level details, but it describes the overall detection framework. Chapter 4 discusses the influence of each stage of the computation on performance, concluding that fine-scale Haar filters, a fine lattice step, relatively coarse block sizing, and a fine step for overlapping descriptor blocks are all important for good results. We also present brief guidelines on how to choose the descriptor parameters for a given object class. The chapter concludes by presenting the integration of the designed system with an identification algorithm, which could find application in ecological data collection. Chapter 5 summarizes our approach and results, and provides some suggested directions for future research in this area.

CHAPTER 1

INTRODUCTION

With computers becoming more pervasive and powerful, the number of their applications has risen quite drastically during recent decades. Their increased computational ability has enabled them to be used for problems that previously were solvable only by human intelligence. Even though it is only natural for researchers to try to extend their applications to fields like speech recognition, visual sequence analysis, and logical inference, computers are still far behind humans in these high-level tasks.

The human visual system demonstrates how the complex problem of object recognition may be solved efficiently. Though often taken for granted, the human ability to analyze complex scenes in very little time is incomparable to the current state of computer vision algorithms. Our lives are literally filled with objects of different classes that we encounter and recognize every day. The brain deals with classes that have huge intra-class variation, such as cars, which consist of entire subcategories such as sedans, SUVs, hatchbacks, etc. Intra-class variation is worsened by differences in object appearance such as differences in color, viewpoint, scale, and others. These discrepancies are irrelevant to the decision that the object is an instance of a car. Similarly, we are able to detect animals irrespective of their position, occlusions, illumination, or background clutter. Thus one goal of researchers working in the field of computer vision has been to give computers the ability to perceive visual information with the help of object recognition. Detection in its

turn may advance other applications such as robotics, human-computer interaction, and many others. This chapter introduces the problem of object detection, in particular the detection of zebras, discusses the challenges involved, and briefly presents the proposed approach, highlighting the contributions of the thesis.

1.1 Motivation

This thesis targets the problem of visual recognition in images. In particular, it addresses the issue of building object detectors from a computer vision point of view, where the detectors search given images for objects and localize them. For a more precise definition of our goal, we can view an object detector as a combination of two key building blocks: a feature extraction algorithm that encodes image regions as feature vectors, and a detection framework that uses the computed features to decide whether a given region contains the object of interest. The main contributions of this thesis relate to the first part: transforming visual regions into a feature space. This is fundamental in creating robust detectors, and it differs from, for example, the text categorization problem, where two words either match exactly or not. Categorizing objects in images is inherently ambiguous, with the image formation process, compression, intra-class variation, and occlusion all contributing to ambiguity on different levels. Another problem in constructing features for zebra detection is designing vectors class-specific enough to differentiate between a zebra and a lion, yet general enough to treat two zebras with different stripes as objects of the same class. Different stripe contours become irrelevant, and the matching criterion is

changed according to the context. This thesis focuses on general purpose object detectors that do not make strong contextual assumptions.

1.2 Wildlife animal detection and its applications

Our experiments focus on the detection of zebras in images and videos. Animal detection is a challenging task with many applications that has attracted a lot of attention recently. Consider the case of zebra tracking for ecology research, with large numbers of pictures being collected in short periods of time (on the scale of a 50-camera setup running for 3 weeks). At the point when it becomes too tedious to go through the images by hand, automatic detection becomes necessary. Intelligent tagging software will facilitate research and lower costs, and is thus an important research goal. In conjunction with individual recognition, this may facilitate a wider spread of information technology into ecology, a field that is only starting to leverage recent advances in computer science in general, and vision in particular. In this thesis we mainly study the detection of fully visible zebras in more or less upright poses.

1.2.1 Challenges

The main difficulty in building a robust object detection framework is the great amount of variation in object appearance, its surroundings, and visual region characteristics.

3D information is not preserved in images and videos. Depth and other 3D information is ultimately lost during the image formation process, which creates a strong dependency on the camera viewpoint. Even a small change in location or direction relative to the camera significantly affects the object's appearance. The issue of variation in object scale arises from the same origin.

Large intra-class variations. Most visual classes contain objects that differ from each other both in appearance and in pose. For example, one can define several visual subclasses for zebras, such as standing, lying, browsing, and grazing.

Different background context. Background clutter is a widespread problem even in natural classes. In fact, background variance usually exceeds the previously mentioned intra-class variance. For example, zebras can be spotted against a wide variety of backgrounds such as grass, gravel, other animals, etc.

Variance in object color and illumination. Although color- and illumination-invariant models have greatly advanced in recent years, direct sunlight and shadows are still considered challenging obstacles to the construction of a robust object recognition system. For example, sunlight can make the dark stripes on a zebra's back appear lighter than light stripes that lie in shadowed areas of its body. This challenge is especially indicative of the fact that computer vision models are still far from matching the effectiveness of mammalian visual systems, c.f. [7, 8, 9, 10].

Occlusions. Partial occlusions create further complications, as they prevent certain parts of an object from being analyzed by the object recognition framework. This affects both rigid template-like and part-based models.

A robust object detector must be able to deal with the challenges mentioned above. It must be invariant to viewpoint and scale changes, capable of discriminating between the object class and the background, and it must handle color changes and provide invariance to a wide range of illumination changes. A robust detector must also resolve the conflicting demands of background clutter and intra-class variance: it must be neither too specific to a particular object instance, nor too generic. A very object-specific detector will produce fewer false detections on background regions, but will also miss many other object instances. An overly general detector may handle large intra-class variations, but will generate large numbers of false detections on background regions. Another challenge is presented by the fact that current object detection systems do not use the high-level context information that humans and other mammals employ to successfully prune regions likely to produce false positives, and at the same time discover subtle details of appearance to find objects of interest. Moreover, humans are able to find the regions most likely to contain objects of interest even in images that contain only part of an object's silhouette.

1.3 Background on object detection

As mentioned earlier, detection frameworks are combinations of a module that transforms visual regions into feature vectors and a module that provides classification decisions. An optional post-processing module that filters some of the overlapping detections is usually included as well.

Feature extraction usually captures intensity patterns, contours, texture details, sub-regions with strong edge patterns, etc. Two main approaches to feature extraction exist in modern computer vision.

Sparse features extracted from salient image regions. The motivation behind this approach is that not all regions in a given image are equally important: some are too cluttered, others contain no texture or are simply too dark. Support for this approach comes from biology research, which suggests that the human eye spends much more time scanning salient regions than the rest of the picture.

Dense feature generation over image regions. In this approach, features are densely computed over the image region in a manner aimed at preserving all small details. Support for this approach comes from studies of the mammalian visual system, whose early stages of visual encoding involve the computation of dense and overlapping centre-surround receptive fields at different scales.

Note that the differences between the above approaches are not as great as they may seem at first, since the detection of salient regions requires a dense scan of the image. However, the criterion used to scan images is usually different from the information encoded in the sparse feature vector. For example, interest-point based approaches treat blob-like structures as salient regions, whereas the vectors are usually encoded using SIFT-like techniques, which utilize gradient and contour information. Regarding the decision unit in a detection framework, several approaches have been studied in the literature. They can be divided into two main categories.

In so-called parts-based approaches, one needs to learn how to detect different parts of an object. Details include how the parts are detected, and the graph topology that interconnects them. For example, for zebras the parts could be the head, legs, and body frame connected in a star network. The notion of a part is not always clear: while some approaches treat parts as physical parts of the object (head, legs, torso, etc.), other approaches simply find salient image regions that are detected in a sparse image representation. Another approach is somewhat simpler, and just encodes spatial information in the form of a rigid template. This scheme is usually based on densely computed feature vectors, but could also be applied to sparse representations. The overall detection framework is then provided by explicit template matching or by learning algorithms such as SVM or KNN. Both generative and discriminative approaches can be used in the detection phase.

1.4 Overview of our approach

Our approach is based on scanning a detection window over the image at different positions and scales, transforming each window into a feature vector, and getting a decision from a classification unit. We thus adopt a bottom-up, fully data-driven approach that uses low-level information and does not deal with general context or prior information. A dense feature based representation is computed for each region of the sliding window and passed to a classifier, with spatial information coded implicitly. Being data driven, this approach fully depends on a good learning set of images representing any invariance that may occur in the real test data. Image regions are thus encoded as densely computed high-dimensional feature vectors, and classification is provided by state-of-the-art machine learning algorithms. The approach performs implicit feature selection, and the

expectation is that if the encoding of images into feature vectors is sufficiently discriminative, then the machine learning algorithm will be able to learn to recognize the object class. We adopt a relatively simple architecture in order to concentrate on one of the fundamental problems of object detection: building robust visual descriptors. Support vector machines as the classifiers offer state-of-the-art performance and are fast to run. We developed a flexible framework which allows different classification back-ends to be integrated quite easily for further experiments. For zebra detection in static images we propose fixed Haar-like wavelets, popularized by Viola and Jones, which are designed to capture a map of strong dark-to-light and light-to-dark edges. They are used for the dense lattice (DL) computation of normalized stripe maps over small spatial neighborhoods. The horizontal and vertical counts retrieved from the DLs are then clipped, normalized, and used as the values of the feature vector constructed for a given block (spatial neighborhood). Once we have the window-level classifier, to obtain detections over a test image we scan the detection window across the image at all locations and scales, computing a detection score at each point. This typically results in multiple overlapping detections around each true object instance in the image. These need to be fused into the final detection results. We use a general solution developed by Dalal and Triggs [11] as the fusion strategy. It is based on kernel density estimation and takes into account the number of detections, their confidence scores, and the scales of the detections. Basically, negative scores from the linear SVM are zeroed and a

3D position-scale mean shift process [12] is run to identify significant local peaks in the resulting score. Peaks above a threshold are declared positive detections.
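The multiscale scanning procedure described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the window size, lattice step, and pyramid scale factor are assumed placeholder values, `score_fn` stands in for the feature-extraction-plus-SVM pipeline, and the pyramid uses a crude nearest-neighbour resize to stay dependency-free.

```python
import numpy as np

def sliding_window_detect(image, score_fn, win=(64, 128), step=8, scale_factor=1.2):
    """Scan a fixed-size detection window over an image pyramid.

    score_fn(window) -> float plays the role of the window-level classifier
    (e.g. a linear SVM decision value); a positive score is treated as a
    candidate detection. Returns (x, y, w, h, score) tuples in
    original-image coordinates. All parameter defaults are illustrative.
    """
    detections = []
    scale = 1.0
    current = image
    while current.shape[0] >= win[1] and current.shape[1] >= win[0]:
        for y in range(0, current.shape[0] - win[1] + 1, step):
            for x in range(0, current.shape[1] - win[0] + 1, step):
                window = current[y:y + win[1], x:x + win[0]]
                s = score_fn(window)
                if s > 0:  # positive margin -> candidate detection
                    detections.append((int(x * scale), int(y * scale),
                                       int(win[0] * scale), int(win[1] * scale), s))
        # downscale the image instead of growing the window
        scale *= scale_factor
        new_h = int(image.shape[0] / scale)
        new_w = int(image.shape[1] / scale)
        if new_h < win[1] or new_w < win[0]:
            break
        # nearest-neighbour resize with numpy only (keeps the sketch self-contained)
        rows = (np.arange(new_h) * scale).astype(int)
        cols = (np.arange(new_w) * scale).astype(int)
        current = image[rows][:, cols]
    return detections
```

In a real detector, `score_fn` would compute the dense descriptor for the window and evaluate the trained SVM on it; the overlapping detections returned here are exactly the input to the fusion stage described above.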
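The fusion step just described can likewise be sketched as a toy weighted mean shift in (x, y, log-scale) space. The bandwidths, iteration count, and mode-merging tolerance below are illustrative assumptions, not the values used by Dalal and Triggs or in this thesis; only the overall shape of the computation (clip negative scores, run mean shift, keep the surviving modes) follows the text.

```python
import numpy as np

def mean_shift_fusion(dets, bandwidth=(16.0, 16.0, 0.4), n_iter=20, tol=1e-3):
    """Fuse window detections by weighted mean shift in (x, y, log-scale) space.

    dets: iterable of (x, y, scale, score) tuples; negative scores are
    clipped to zero before being used as kernel weights. Returns a list of
    (x, y, scale, support_count) modes.
    """
    dets = np.asarray(dets, dtype=float)
    pts = np.column_stack([dets[:, 0], dets[:, 1], np.log(dets[:, 2])])
    w = np.clip(dets[:, 3], 0.0, None)          # zero out negative SVM scores
    h = np.asarray(bandwidth, dtype=float)
    modes = pts.copy()
    for _ in range(n_iter):
        shifted = np.empty_like(modes)
        for i, m in enumerate(modes):
            d2 = (((pts - m) / h) ** 2).sum(axis=1)
            k = w * np.exp(-0.5 * d2)           # Gaussian kernel, score-weighted
            s = k.sum()
            shifted[i] = m if s == 0.0 else (k[:, None] * pts).sum(axis=0) / s
        done = np.abs(shifted - modes).max() < tol
        modes = shifted
        if done:
            break
    # merge modes that converged to (nearly) the same peak
    fused = []
    for m in modes:
        for f in fused:
            if np.abs(m - f[0]).max() < 1.0:
                f[1] += 1
                break
        else:
            fused.append([m, 1])
    return [(float(m[0]), float(m[1]), float(np.exp(m[2])), c) for m, c in fused]
```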

CHAPTER 2

STANDARD APPROACH TO OBJECT DETECTION

Object recognition has been an area of intensive research during recent years, and has received a lot of attention from the pattern recognition and computer vision communities. As described in Chapter 1, it is a challenging problem with many important applications. This chapter concentrates on modern techniques and algorithms for object recognition and localization. Most of the described work in object detection can be categorized into two rough classes: image descriptors or feature vectors in use by object recognition frameworks, and decision and localization engines, which are built around the image descriptors.

2.1 Image features

Image descriptors have to be able to extract crucial information about the object of interest while maintaining invariance to changes in illumination, pose, viewpoint, etc. Using raw pixel data from images results in feature vectors that are not normally usable for obtaining good results in object recognition. Therefore, more sophisticated descriptors have been proposed, based on points [13, 14], Laplacian of Gaussian [15] or Difference of Gaussians [16] blobs, intensities [17, 18], gradients [19, 20], color, texture, or combinations of all of these. These descriptors have to exhibit good generalizing abilities as well as characterize visual regions sufficiently well for the detection and classification task at hand. We will roughly divide the different approaches to designing image

descriptors into two categories: sparse representations, which make use of a few visual key points, and dense representations, which encode the whole visual region.

2.1.1 Sparse representation of image regions

Sparse representation of image regions has received increased attention since the mid-1990s [21, 22, 16, 23, 24, 25, 26]. Approaches using sparse representations operate on a usually dispersed set of salient image regions. These image regions are used to build local descriptors, which are later combined to derive a final feature vector. The regions can be selected using key point or part detectors.

2.1.1.1 Key point detectors

The motivation behind using sparse key points is that they hypothetically extract maximally stable regions invariant to a given set of transformations. The stable regions are common within objects of a given visual class, and are therefore highly informative about the content of the region. The overall detector framework is thus dependent on the quality of the selected key points. Commonly used key point extractors include Forstner [27, 28], Harris [13], Laplacian [15] or Difference of Gaussians (DoG) [24], and the scale invariant Harris-Laplace [20]. Compactness is one of the bigger advantages of sparse image representations: the number of pixels in a region usually vastly exceeds the number of extracted key points. However, key points may have lower generalization ability and therefore weaker performance when generalizing to object classes or categories. Different key point extraction techniques can be used separately or in combination.
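As a concrete illustration of one of the extractors listed above, the Harris corner response can be computed from the local structure tensor of image gradients. This is a minimal sketch: the flat box window (radius 1, implemented with wrapping `np.roll`, which is fine away from borders) and the constant k = 0.04 are common textbook choices, not parameters from the cited works.

```python
import numpy as np

def harris_response(img, k=0.04, radius=1):
    """Harris corner response map R = det(M) - k * trace(M)^2, where M is
    the structure tensor of the image gradients summed over a small window.
    Large positive R indicates a corner, negative R an edge, ~0 a flat area."""
    img = img.astype(float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0   # centred differences
    gy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0

    def box_sum(a, r):
        # sum over a (2r+1) x (2r+1) neighbourhood (wraps at image borders)
        out = np.zeros_like(a)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
        return out

    ixx = box_sum(gx * gx, radius)
    iyy = box_sum(gy * gy, radius)
    ixy = box_sum(gx * gy, radius)
    det = ixx * iyy - ixy ** 2
    trace = ixx + iyy
    return det - k * trace ** 2
```

Key points would then be taken as local maxima of R above a threshold; the scale-adapted variants (Harris-Laplace, DoG) replace the fixed window with a scale-space search.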

Once key points are detected within a region, many approaches can be taken to compute feature vectors from the local regions surrounding the key points. Among the most popular are local histogram computation techniques such as the Scale Invariant Feature Transformation (SIFT) [29, 24] and shape contexts [30, 31]. SIFT computes scale- and rotation-invariant feature vectors from the local scale and dominant orientation given by the key point detector, with gradients voting into orientation histograms weighted by gradient magnitude. The scale information is also used to define an appropriate smoothing scale when computing image gradients. SIFT computes histograms over rectangular grids, whereas shape contexts use log-polar grids. The initial shape context method [31] used edges to vote into 2-D spatial histograms, but this was later extended to generalized shape contexts [32], where gradient orientations vote into 3-D spatial and orientation histograms with gradient magnitude weighting similar to SIFT. Despite many strong points, sparse representation of image regions also has a number of disadvantages. First, interest points do not necessarily capture all relevant image features. Second, for image categories where invariance is not important, sparse descriptors lose their advantage. Dense image descriptors are used to deal with these shortcomings.

2.1.1.2 Part detectors

Local parts-based detectors are widely used in object recognition systems [33, 34, 35, 36, 37]. For instance, [35, 36] use human body limbs and parts, which are assumed to be well approximated by simple geometric models. Parts are extracted with the help of parallel edge detectors and later combined in graphical or articulation-constraint models. The rather simplistic assumption that limbs can be represented by parallel lines suggests that the scalability of these

approaches to real-world data sets is questionable. Another parts-based approach [37] describes a system that builds appearance models of animals automatically from a video sequence of the relevant animal, with no explicit supervisory information. Animals are modeled as a 2D kinematic chain of rectangular segments, where the number of segments and the topology of the chain are unknown. The system detects possible segments, clusters segments whose appearance is coherent over time, and then builds a spatial model of such segment clusters.

2.1.2 Dense representation of image regions

For dense representations, the image region is usually the whole detection window, which is often analyzed pixel by pixel to produce a high-dimensional feature vector used for image labeling or classification. Typically the representation is based on image intensities, gradients, different convolution masks, or higher-order differential operators. One of the problems of densely encoding the whole image region is background noise. This problem can be treated with a Gaussian window approach [38], which accentuates the object, usually located in a certain consistent position within the detection window.

2.1.2.1 Intensity based detectors

Early examples of densely encoded fixed-size image regions are the works by Sirovich and Kirby, and Turk and Pentland [39, 38]. They used fixed-size face images to produce high-dimensional feature vectors based on pixel intensities. Principal Component Analysis (PCA) was employed to derive the face-space eigenvectors, called eigenfaces, that characterize the main variations.
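The eigenface construction just described reduces to PCA on vectorized images. A minimal SVD-based sketch follows; the function names and the choice of k components are illustrative, not taken from the cited works.

```python
import numpy as np

def eigenfaces(images, k=5):
    """PCA on a stack of fixed-size images. images: (n, h, w) array.
    Returns (mean, components), where components holds the top-k principal
    directions ("eigenfaces") as rows of length h*w."""
    n = images.shape[0]
    X = images.reshape(n, -1).astype(float)
    mean = X.mean(axis=0)
    Xc = X - mean                          # centre the data
    # economy SVD: rows of vt are the orthonormal principal directions
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return mean, vt[:k]

def project(image, mean, components):
    """Coordinates of one image in the k-dimensional face space."""
    return components @ (image.ravel().astype(float) - mean)
```

Classification in the eigenface scheme then operates on these low-dimensional projections rather than on raw pixels.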

Intensity based densely encoded detectors can be quite sensitive to differences in lighting conditions. Histogram normalization, among other preprocessing techniques, has been used to overcome this problem in a neural network based face detector [40]. Using pixel intensities to explicitly maximize the information content of the descriptors with respect to the class, in order to obtain feature spaces where simple linear classification is sufficient, has been suggested by Vidal-Naquet and Ullman [18]. This framework has been demonstrated to handle relatively rigid classes such as cars and faces well.

2.1.2.2 Edge and gradient based detectors

Image edges and gradients are extensively used in object recognition, and some descriptors are based solely on these techniques to generate feature vectors. For example, a relatively simple detection system [41], in which extracted edge images are matched to a set of learned exemplars using a chamfer distance, was used in a practical pedestrian detection system [42]. An articulated body detector was built by Ronfard et al. by incorporating SVM based limb classifiers over first- and second-order Gaussian filters in a dynamic programming framework. They propose 7 part-based detectors: front and profile views of the head, upper body, and legs. Gradient images are used to compute multi-scale descriptors based on histogramming dominant orientations at each position, similar in spirit to SIFT.

2.1.2.3 Wavelet based detectors

A well-known approach to object detection is described by Papageorgiou and Poggio, Mohan et al., and Viola and Jones. These approaches use dense encodings of image regions based on operators similar to Haar wavelets.

Papageorgiou and Poggio use the absolute values of Haar wavelet coefficients and scales as their local descriptors. Images are mapped from pixel space to an overcomplete dictionary of Haar wavelets that is rich enough to describe patterns. Horizontal, vertical, and diagonal wavelets are used. To obtain an overcomplete basis, the wavelets are computed with overlapping supports. Haar wavelets can be calculated efficiently, while still remaining rich enough to describe patterns and providing a reasonable degree of translation invariance. Haar wavelets have been used successfully with kernelized and linear SVMs, and with holistic and part-based classifiers. Rather than using a single strong classifier, Viola and Jones build a more efficient cascade of weak classifiers that use differences of rectangular regions in several Haar-like arrangements as features. Classifiers become progressively more complex with depth in the cascade. Each stage is designed to reject as many negative sub-windows as possible, while retaining all but a negligible fraction of the positive cases. In this fashion, the first stage of the cascade, called the attentional operator, rejects 40% of the sub-windows while keeping almost all the positive cases; for each rejected sub-window, only one weak classifier needs to be evaluated. To train each stage, features with all possible rectangle dimensions are tested, and the sample reweighting procedure AdaBoost is used as a greedy feature selection method. The efficiency of this approach allowed the construction of a real-time face detector.

2.2 Classification methods

Classification methods can be divided into discriminative approaches, such as Support Vector Machines, and generative ones, such as Naive Bayes.
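The rectangle-difference features used by Viola and Jones are cheap to evaluate because any rectangle sum can be read from a summed-area (integral) image in four lookups, independent of rectangle size. A minimal sketch follows; the two-rectangle vertical-edge feature shown is just one of the several Haar-like arrangements mentioned above.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended, so that
    rect_sum below needs no boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of img[y:y+h, x:x+w] in four lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect_vertical(ii, y, x, h, w):
    """Two-rectangle Haar-like feature: left half minus right half of a
    h-by-w region, responding to vertical dark/light transitions."""
    half = w // 2
    return rect_sum(ii, y, x, h, half) - rect_sum(ii, y, x + half, h, half)
```

The cascade evaluates thousands of such features per window, which is only feasible because each one costs a handful of array lookups after the single integral-image pass.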

2.2.1 Discriminative approaches

Support Vector Machines (SVMs) have been widely used for object recognition tasks. An SVM is essentially a binary classifier that finds a hyperplane with the maximum separation gap between the positive and negative training cases in the feature space. Different kernels can be applied to simulate non-linear classification boundaries. SVMs can be used as part detectors, combined into cascades to derive the final decision.

Cascaded AdaBoost is a boosting technique that combines a collection of weak classifiers to derive a strong one. AdaBoost is adaptive in the sense that subsequently built classifiers are tweaked in favor of those instances misclassified by previous classifiers. In computer vision, AdaBoost is usually used to speed up an existing detection framework, or to build detectors capable of producing results in real time. Viola and Jones used AdaBoost to train cascades for face detection using difference-of-rectangles based descriptors. Opelt et al. used AdaBoost within a key point based framework. Recently, Zhu et al. used the histograms of oriented gradients proposed by Dalal and Triggs with a rejection-cascade approach. They use an integral array representation and AdaBoost to achieve a significant improvement in run time compared to the original Dalal and Triggs approach, while maintaining similar performance levels.

K nearest neighbors (KNN) is a method for classifying objects based on the closest training examples in a feature space. KNN is a type of instance-based learning where the function is only approximated locally and all computation is deferred until classification. In computer vision, KNN has been used in handwriting recognition for digits and letters. A hybrid SVM-KNN framework for robust multi-class object labeling was proposed by Zhang et al.
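The KNN scheme described above can be sketched in a few lines; the choice of k and of the Euclidean metric are illustrative defaults. Note how all the work happens at classification time, as the text points out.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify one feature vector by majority vote among its k nearest
    training examples under the Euclidean distance."""
    d = np.linalg.norm(X_train - x, axis=1)   # distances to every training point
    nearest = np.argsort(d)[:k]               # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]          # majority label among neighbours
```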

Bayesian and graphical models

It was shown that image fragments selected by maximizing the mutual information between the fragment and the class label provide an informative and independent representation. Naive Bayes classification was successfully used to derive a detection framework with these features. Weber et al. use Bayesian generative models learned with EM to characterize classes. Ferguson also used likelihood ratios, but with a more elaborate model of conditional probabilities that includes the position and scale of the features as well as their appearance.

Fusion of multiple detections

When a dense feature based detector is used with a binary classifier, the detection window is tested at all locations and scales. This usually produces multiple overlapping detections near true objects, as well as some weak non-overlapping detections at non-object locations (false positives). Rowley et al. [1998] proposed a heuristic for fusing multiple overlapping detections. The number of detections within a specified neighbourhood is computed, and if it exceeds a threshold, the centroid of the neighbourhood is accepted as the location of the detection result. Centroids are aggregated in 3-D position and scale space, overlapping centroids with lower scores are removed, and the remaining centroids constitute the final result. Viola and Jones proposed a simpler way of tackling this problem, where the set of detections is partitioned into disjoint partitions, and each partition results in one final detection. Two detections are put in the same set if their bounding boxes overlap. The final detection location and region is the average of all the bounding boxes.
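The Viola-Jones partition-and-average scheme can be sketched in a few lines: overlap chains are grouped transitively and each group's corners are averaged (a simplified illustration, not the authors' code):

```python
def overlaps(a, b):
    # Boxes as (x0, y0, x1, y1); True if the bounding regions intersect.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def fuse_detections(boxes):
    """Partition detections into disjoint groups of transitively
    overlapping boxes and average each group's corners."""
    groups = []                          # each group is a list of boxes
    for box in boxes:
        hits = [g for g in groups if any(overlaps(box, b) for b in g)]
        merged = [box]
        for g in hits:                   # box links its overlapping groups
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
```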

CHAPTER 3

OVERVIEW OF METHODOLOGY AND RESULTS

3.1 Overall architecture

The overall object detection framework is built around a method for classifying individual image regions. The framework is divided into two phases: a learning phase and a detection phase. The learning phase (see Figure 1) creates a binary classifier that provides object/non-object decisions for fixed size image regions. The detection phase (see Figure 2) uses the classifier to perform a dense multi-scale image scan, reporting preliminary object decisions for each location of the test image. The preliminary decisions are then fused to obtain the final object decisions. Both phases consist of three stages. For the learning phase, first, images are normalized to construct a training data set. Then the training set is encoded into a feature space. Finally, the classifier that uses the feature space is trained on the data from stage one. For the detection phase, the first stage consists of an input image being scanned at all scales and locations, resulting in a set of scan windows. In the second stage, the classifier trained in the learning phase provides object/non-object decisions for every window. Finally, multiple overlapping detections are fused into one strong detection, whereas weaker ones are pruned. Overall, this defines a relatively simple and flexible system for object detection. The final performance of the detector depends on the accuracy and reliability of the trained binary classifier (which, in turn, depends on the quality of the training data set and descriptors). The

Figure 1. Learning phase

Figure 2. Detection phase

system's performance also relies on the techniques for fusing multiple detections in the final stage of the detection phase. The described system has a lot of advantages. First, its deceptively simple structure is very modular, as its different parts are largely independent. The modularity of the system allows trying and using different strategies for each step. Second, the system allows existing state-of-the-art classifiers to be used without being altered. Third, the strategy for fusing overlapping detections is completely independent from the decision engine, thus allowing for extra flexibility. Moreover, the classifier does not have to deal with different scales, since this issue is addressed explicitly by the dense image scan. In addition, the classifier works in relative rather

than absolute coordinates, which allows fixed template-like features to be used. Another great advantage of the system is that one of its most computationally expensive parts (the dense scan) can be easily parallelized to leverage the potential speedups achievable with modern GPUs. This framework is not ideal, and there are challenges remaining to be solved. First, scanning an input image at all scales and locations results in a large number of classification tests, which may be computationally expensive. Second, the large number of windows to be classified introduces extreme sensitivity to the false positive rate of the classifier used. In fact, for a 640x480 pixel image, a dense scan results in a very large number of windows, so an acceptable false positives per window (FPPW) rate is well below the rates usually tested in ROC plots [11]. The first stage of the learning phase is the creation of the training data. Training image set normalization is required for constructing the fixed template-like features that are used in the second step for the transformation of the images to a feature space. Then, the constructed features are fed into a training module to derive the decision engine used in the dense image scan. The first stage of the detection phase uses a rigid image template to scan an input image at all scales and locations. In the second stage, each window derived from the dense scan gets a score from the classification engine. And finally, multiple detections with significant overlaps get merged into final strong detections, which are then output to the user.

Learning phase

In this phase, the training image set is prepared and the decision engine is trained for later use during the detection phase. In this section we discuss the learning phase in our framework.

Training image set normalization

The first stage of the learning phase is the creation of the training data. Raw image data is not usually well suited for the training of detection algorithms, although some success in unsupervised computer vision learning has recently been achieved [43]. For different objectives in computer vision, different types of annotations are used. For detection tasks, the initial dataset is usually constructed by annotating the objects of interest with bounding boxes, as well as some metadata about position, view angle, etc. To capture biologically important features of the animals, such as behavior, a number of special visual classes were defined, which correspond to different behaviors. We employ six different visual classes: left, right, frontal, rear, perfect left, and perfect right. The first four classes correspond to the respective classes in the PASCAL challenge style of annotations [44]. The latter two are used for testing algorithms that may be sensitive to an object's viewpoint. Additional behavioral-visual classes were introduced to account for the animals' activity patterns: browsing, grazing, standing, and lying. All of the visual behavior classes in theory carry enough visual descriptive information to be detected by model based techniques. These visual behavior classes are important because they allow for automatic mining of animal behavior and might help with automatic behavioral data collection and interpretation, which in turn might speed up many

ecological research projects carried out in the wild. For these special annotation purposes, a tagging program with a clean interface was designed and implemented (see Figure 3).

Figure 3. Tagging program interface

As we are using a rigid template-like scheme for feature construction, it is essential for us to have both positive and negative images of the same fixed size. The positive training examples are retrieved directly from the set of annotated images. Objects contained in the tagged set

of the images are, however, not only dispersed across different resolutions and scales but, more importantly, across different aspect ratios. The most common approach to deal with this is to force these images to the same resolution and aspect ratio. The resolution is chosen rather arbitrarily; however, it is usually chosen to match the minimum size of an object that could be found in an input image. This means that, given an input image, we will not be able to detect objects smaller than the chosen normalized resolution even if they are present. The criterion for choosing this minimal size is rather subjective, and it can affect both the scanning speed (a smaller size will take more time to scan) and feature quality (very small templates can look blurry and distorted, thus corrupting the training set). Ideally, each positive example contains a single instance of an object. A far harder problem is deciding on the aspect ratio of the minimum template. Aspect ratios can tell more about objects than one might suspect. In animal visual classes, the aspect ratio may be able to distinguish between lying and standing animals, as well as between side-facing and rear/front-facing animals. As aspect ratios might contain such important information as object orientation and behavior, they have been used for clustering the training image set [45]. Clustering of the image set based on the aspect ratio and/or the object orientation is used to construct composite object models. Composite models consist of several object models, one per aspect ratio. During the detection phase, each object model evaluates the scanning window, with the motivation that better specialized models will detect objects of interest with higher precision. Even though a composite object model might increase classifier performance, a naive implementation using multiple dense scans increases the computational cost quite significantly.

Figure 4. Aspect ratio distributions in two different data sets

Using a single dense scan still involves the construction of multiple feature vectors (one per model). This makes composite models desirable, although computationally expensive. To tackle the aspect ratio problem in a simple and efficient fashion, we have implemented an approach that chooses the most common aspect ratio among the positive set of images. The rest of the images are forced to the chosen aspect ratio. Images with aspect ratios that differ significantly are pruned to avoid noise in the feature space. Once the aspect ratio is fixed, all tagged image regions with objects of interest are cut and forced to the fixed aspect ratio. Tagged images are forced to a single aspect ratio using a single boundary expansion operation. If a boundary cannot be expanded any further, the pixel values of its last row or column are copied multiple times to satisfy the aspect ratio requirement.
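A simplified sketch of this normalization step, padding by replicating the last row or column (NumPy; grayscale only for brevity, and always padding rather than first expanding the crop boundary as the real pipeline would):

```python
import numpy as np

def force_aspect_ratio(img, target_ratio):
    """Pad a cropped region until width/height == target_ratio by
    replicating its last column (too narrow) or last row (too wide)."""
    h, w = img.shape
    if w / h < target_ratio:                       # too narrow: widen
        extra = int(round(h * target_ratio)) - w
        img = np.concatenate([img, np.repeat(img[:, -1:], extra, axis=1)], axis=1)
    elif w / h > target_ratio:                     # too wide: heighten
        extra = int(round(w / target_ratio)) - h
        img = np.concatenate([img, np.repeat(img[-1:, :], extra, axis=0)], axis=0)
    return img
```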

The last step of the image set normalization is the specification of the minimum object size that we desire to detect. During the dense scan, this will be the initial window size, and all smaller objects will be ignored by the dense scan. In the composite model, multiple minimum sizes are specified (one per model), or only one of the dimensions is specified for all models (the other one is calculated using the aspect ratio formula).

Figure 5. Normalized positive training data

Negative example collection consists of three phases in our algorithm. In the first phase, similar to the positive example collection, visual regions are collected from the training image data set. The first phase, compared to positive data normalization, is usually an easier process because the requirement for tagged data is relieved. Negative examples are usually subsampled from the images or image regions that do not contain any instances of an object of interest (Figure 6). Negative data is collected at the fixed size of the minimal window, with the ratio and the dimensions specified at the positive data normalization step.

Figure 6. Negative training data

The automatic nature of the first phase of negative example extraction generates amounts of data that greatly exceed those of the manually tagged positive examples. This imbalance in the training data might seriously decrease the performance of the constructed classifier. We therefore need a method for negative example selection. The problem of selecting a suitable subset of the negative examples has to be solved in the majority of computer vision frameworks. The second phase of negative example collection is performed after the initial set has been extracted from the training data. Due to the nature of the image descriptors constructed and used in this thesis, we were able to devise an effective process for reducing the initial negative image set to a reasonable size while maintaining variance. This approach consists of two steps. In the first step, the initial negative image set is transformed into a feature vector set. In the second step, duplicate feature vectors and feature vectors with high cosine similarity are pruned. If the resulting set is still much larger than the manually labeled positive image set, it is further sampled in a random fashion. This results in a great reduction in training time, as well as an improvement in the classifier's quality.
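The duplicate and near-duplicate pruning step might look like the following greedy filter (a sketch; the threshold value is illustrative, not taken from the experiments):

```python
import numpy as np

def prune_similar(features, threshold=0.99):
    """Greedily keep feature vectors whose cosine similarity to every
    already-kept vector is below `threshold` (exact duplicates have
    similarity 1 and are always dropped)."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept = []
    for i, v in enumerate(normed):
        if all(np.dot(v, normed[j]) < threshold for j in kept):
            kept.append(i)
    return features[kept]
```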

After the filtering performed in the second phase, the vast majority of training examples are still negative. It is infeasible to consider all negative examples at once. Instead, it is common to construct training data consisting of the positive instances and hard negative instances (Figure 7), where the hard negatives are mined from the large set of possible negative examples [45]. The third phase of negative data collection performs hard negative mining to further refine the set of negative image instances. To accomplish this, the test data in our experiments is divided into validation and test subsets. The initial classifier is trained with the negative set produced by the second phase. This classifier is then evaluated on the validation set, and false positives are added to the hard negative set. The need to adjust the decision engine, get rid of weak false positives, and reduce training time is the main motivation behind hard negative mining. The process of hard negative mining is covered in more detail in a later section.

Figure 7. Hard negatives
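One round of this third phase can be sketched as follows, with `classifier` standing in for the SVM trained on the phase-two negatives (the names and the cap parameter are illustrative, not from the thesis implementation):

```python
def mine_hard_negatives(classifier, windows, labels, max_hard=1000):
    """One round of hard negative mining: run the current classifier over
    candidate windows and collect its false positives for retraining.
    `classifier` is any callable returning True for a predicted positive."""
    hard = [w for w, y in zip(windows, labels)
            if y == 0 and classifier(w)]      # scored positive but labeled negative
    return hard[:max_hard]                    # cap so retraining fits in RAM
```

The returned windows are appended to the negative set and the classifier is retrained.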

After the described normalization steps we get the final data set used to train a decision engine for our detection framework. The training set consists of positive and [hard] negative examples of the same fixed size, as required by our relatively rigid template-like feature construction scheme.

Overview of feature sets

Rectangular DL (R-DL)

Our proposed image feature sets are based on a dense and non-overlapping encoding of image regions using Dense Lattice (DL) descriptors. The general scheme for DL encoding is given later in this section. Dense Lattice descriptors provide a dense description of image regions. This section describes variants of the DL encoding and presents the key parameters involved. DL is computed on a dense grid of uniformly spaced cells. The descriptors capture local edge information by encoding hard edges in visual regions, and they achieve a small amount of spatial invariance by locally pooling the edge counts over the spatial image regions. Rectangular DL (R-DL) descriptors are descriptor blocks that use rectangular grids of cells. They are the default descriptor for all the experiments undertaken in this thesis. The descriptor blocks are computed over dense, non-overlapping, uniformly sampled grids. Each block is normalized independently. We use square R-DLs and compute n × n pixel cells each containing β horizontal and vertical trenches, where n and β are parameters. The trenches are used to sample the edge counts in a cell.

R-DLs are a combination of Haar-like wavelets [46] and R-HOGs. R-HOGs are similar to SIFT descriptors [16] but are used rather differently. SIFT descriptors are computed at a sparse set of scale-invariant key points, rotated to align their dominant orientations, and used individually, whereas the R-HOGs and R-DLs are computed in dense grids at a single scale without the dominant orientation alignment. The grid position of the block implicitly encodes spatial position relative to the detection window in the final code vector. SIFT is optimized for sparse wide baseline matching, whereas R-HOGs and R-DLs are optimized for dense robust coding of spatial form. R-HOGs and SIFT apply a Gaussian spatial window to down-weight pixels near the edges of the block, whereas this feature was left unimplemented in the original version of the R-DL. R-DLs use basic Haar-like wavelets for optimal edge search, whereas R-HOGs and SIFT use an approach based on convolution-computed gradients and histogram binning.

Other descriptors

In the evaluations below we compare R-DL descriptors to several other image descriptors used in object recognition.

Generalized Haar Wavelets

Haar wavelets [47] have been used to build multiple general frameworks for object detection. They have been used in conjunction with the integral image data structure [48], which enables the wavelets to be computed in constant time. This construct allowed a prominent real-time face recognition system to be designed [46]. It has also been used to design several systems that have been successfully applied to pedestrian detection [49, 50, 51]. It has been shown that performance can be improved significantly by using an augmented set of first and second

order wavelets [52], where an extended set of oriented first and second derivative box filters at 45 degree intervals and the corresponding second derivative xy filter have been used.

Histogram of Oriented Gradients

Histogram of Oriented Gradients (HOG) [11, 52] is based on evaluating normalized local histograms of image gradient orientations. The motivation behind this approach is that local object appearance and shape can often be characterized rather accurately by the distribution of local intensity gradients or edge directions. In practice, this is implemented by partitioning the image window into small spatial cells, accumulating a local histogram of gradient directions for each cell; the combination of these histograms forms the representation. For better invariance to the various distortions found in real-life data sets, the local responses are normalized. This is done by accumulating a measure of local histogram values over larger spatial regions and using the results to normalize all of the cells in the block. The HOGs were inspired by another descriptor based on gradient orientation binning: the Scale-Invariant Feature Transform (SIFT) [16]. One of the most important differences between these two approaches is that SIFT was designed to be used in a sparse detection framework, whereas HOG employs a very dense and, in fact, overlapping approach.

ADR-HOG

Analytic Dimensionality Reduction for Histogram of Oriented Gradients (ADR-HOG) was proposed by Felzenszwalb et al. [1] as an improvement to the original HOG features. PCA performed on a large number of HOG features from different resolutions of a large number of images led to some interesting discoveries. The principal components are shown in

Figure 8. The eigenvalues indicate that the linear subspace spanned by the top 11 eigenvectors captures essentially all the information encoded in the original HOG. It has been shown that the performance of the 11-dimensional features matched the performance of the original HOG features used in the PASCAL 2007 challenge. Using lower dimensional features leads to models with fewer parameters and speeds up the computation. However, some of the gain is lost because of the need to perform a relatively costly projection operation. The costliness of the projection operation motivated an analytic dimensionality reduction of the HOGs. This is possible because the top eigenvectors in Figure 8 have a very special structure: they are each approximately constant along each row or column of their matrix representation. Thus the top eigenvectors lie approximately in a linear subspace defined by sparse vectors that have ones along a single row or column of their matrix representation. A 13-dimensional vector can be defined by summing over the 4 normalizations for a fixed orientation, and summing over the 9 orientations for a fixed normalization (the original HOG feature is a 4 × 9 matrix). The new 13-dimensional features carry approximately the same amount of information as the original 36-dimensional HOGs, and their computation is much less costly than performing projections onto the top eigenvectors obtained via PCA.

The learning process

We use Support Vector Machines (SVM) as our object/non-object decision provider for a window being tested. We run a standard SVM with a linear kernel and a quadratic programming solver. Different kernels can be incorporated and used easily within the proposed framework, although they take significantly more time to train. We train a classifier in one-stage fashion

Figure 8. PCA of HOG [1]

skipping the data mining of hard negative examples, as results are largely unaffected by this. To create positive examples for classifier training, images containing objects of interest are tagged by the user, then a histogram of bounding box aspect ratios is constructed. All positive examples are normalized to have the same aspect ratio, and are subsequently downscaled to the smallest window size, which defines the scanning window size and the minimum size for an object to be detected. Next, negative examples are produced from images that do not contain objects of interest by extracting every non-overlapping window of the minimum positive size. The number of hard negatives depends on the initial detector's performance and its type. If too many hard negative instances are generated, the set of hard examples is subsampled so that the descriptors of the resulting training set fit into RAM for SVM retraining. This retraining process significantly and consistently improves the performance of all tested detectors by an order of magnitude, and is employed in most modern computer vision frameworks [11].
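Returning to the ADR-HOG reduction described above: since the top eigenvectors are nearly constant along rows or columns, the analytic 36-to-13 projection reduces to row and column sums of the 4 × 9 feature matrix, sketched as:

```python
import numpy as np

def adr_hog(hog_4x9):
    """Analytic dimensionality reduction of a 4 x 9 HOG feature
    (4 block normalizations x 9 orientations) down to 13 dimensions:
    9 sums over normalizations plus 4 sums over orientations."""
    f = np.asarray(hog_4x9).reshape(4, 9)
    return np.concatenate([f.sum(axis=0),    # 9 values, one per orientation
                           f.sum(axis=1)])   # 4 values, one per normalization
```

No PCA projection matrix is needed at run time, which is exactly the cost saving the analytic reduction buys.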

Detection phase

During detection, the input image is scanned over all scales and locations. For each location at each scale, a feature vector is computed for the detection window, and the classifier is run to produce a monolithic object/non-object decision. Regions containing objects usually produce multiple overlapping firings, and it is necessary to fuse those into a single coherent detection bounding box. The overall detection score depends on how finely the image is scanned and how detections are fused.

Multiscale object localization

The final goal of the detection framework is to localize the objects that appear in the test image. For detectors using an object/non-object decision engine (classifier), an exhaustive search is performed on the test image. During the search over all scales and locations, each window is evaluated with the trained classifier. Each set of overlapping windows with positive scores is merged to derive a final detection. This process is a crucial part of the detection framework and greatly affects the performance of the final system. Many systems omit the issue of multiple overlapping windows entirely [50, 51], whilst others use rather simple approaches. Viola and Jones describe a very simple strategy to combine multiple detections [53]. They partition the set of detections into disjoint subsets. Two detections are in the same subset if their bounding regions overlap. Each partition yields a single final detection. The corners of the final bounding region are the average of the corners of all detections in the set.
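The dense scale-space scan described above can be sketched as follows (nearest-neighbour downscaling keeps the example dependency-free; `score_fn` stands in for the trained classifier, and all names are illustrative):

```python
import numpy as np

def scan_image(img, window=(32, 32), stride=8, scale=1.25, score_fn=None):
    """Dense multiscale scan: slide a fixed-size window over an image
    pyramid, recording (x, y, scale, score) for windows with a positive
    classifier score, mapped back to original-image coordinates."""
    detections, s = [], 1.0
    wh, ww = window
    while img.shape[0] >= wh and img.shape[1] >= ww:
        for y in range(0, img.shape[0] - wh + 1, stride):
            for x in range(0, img.shape[1] - ww + 1, stride):
                score = score_fn(img[y:y + wh, x:x + ww])
                if score > 0:
                    detections.append((x * s, y * s, s, score))
        s *= scale
        new_h = int(img.shape[0] / scale)
        new_w = int(img.shape[1] / scale)
        # nearest-neighbour downscale to the next pyramid level
        img = img[np.linspace(0, img.shape[0] - 1, new_h).astype(int)][
              :, np.linspace(0, img.shape[1] - 1, new_w).astype(int)]
    return detections
```

A real pipeline would use proper anti-aliased resampling and would feed descriptor vectors, not raw pixels, to the classifier.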

We use a generalized approach for the fusion of multiple overlapping windows. We used the simple approach from the Viola-Jones paper as a post-processing step performed after the main non-maximum suppression method, to reduce the number of detections for better prediction in single zebra photographs. The problem of merging multiple overlapping windows is posed as a kernel density estimation problem. It is solved by locating the modes of the density estimate and suppressing non-maximum responses. A fine grained scan of position and scale is necessary for good performance of this method.

Object localization through classification

Object localization is usually implemented by training a classifier able to discriminate between an image with an object of interest and a negative image containing no relevant objects. Images fed into the classifier are of fixed size and scale. The scanning window approach efficiently tackles the problem of object localization on arbitrarily sized images with a fine grained scan, running a fixed size detection window through all locations on the test image. To scan through all scales, the test image is scaled down after each iteration of the location search until it becomes smaller than the scanning window size. The finer the search, the larger the number of overlapping windows surrounding a candidate detection. For each candidate detection, the overlapping windows surrounding it must be merged. The non-maximum suppression approach is based on two hypotheses: A robust classifier produces high non-maximum responses if the detection window is slightly off-center or off-scale on the object

A reliable classifier does not reproduce the same behavior around false positive object detections

The first hypothesis assumes that during the fine image scan the classifier response gradually increases up to a maximum value achieved at the object's center, and then gradually decreases. The second implies that the probability of overlapping false positive window responses is low. Figure 9 illustrates how an effective fusion strategy works. The classifier produces multiple overlapping windows with positive scores near true zebras, as well as some non-overlapping false positives for a couple of gazelles and a bush in 9(a). The classifier responses for true objects are also higher than the responses for the false positives. The goal is to achieve results similar to 9(b) (blue for ground truth, green for true positives).

Non-maximum suppression

An ideal method for merging overlapping detections satisfies the following requirements: 1. The higher the classifier's response for the center of an image region, the higher the probability for the region to be included in the final set of detections. 2. The more overlapping windows in the vicinity of a region with high response within a small range of scales, the higher the probability for this image region to be a true positive. 3. Overlapping regions should be fused in a careful manner, with overlaps at very different scales and positions not affecting the given region's final detection output window. The animal detection problem introduces additional difficulties for overlapping window fusion. In particular, zebras are social animals and tend to stay in groups [54]. Young zebras also

Figure 9. Non-maximum suppression for fusion of overlapping detections

usually stay in close proximity to their mothers or a group of females, and most of the ground truth bounding boxes capturing mothers overlap with some other bounding boxes. This issue of strong detection overlap makes the third requirement especially difficult to satisfy. It implies that the ideal detection fusion method would be able to separate two strong detections in the case when the detection window sets overlap for nearby objects. Most simple heuristic methods will not work for these cases. For example, the simple heuristic described by Viola and

Jones [53] aggregates detections into disjoint subsets of overlapping positive windows, each subset resulting in a final strong detection. This method will merge the two objects in 9(a) into one final detection instead of treating them as two separate detections. Simple heuristic methods fail even if the overlapping windows are of very different scales. Detections are represented using Kernel Density Estimation (KDE) in 3-D position and scale space. KDE is used to evaluate continuous densities by applying a smoothing kernel to the observed overlapping windows. The observed detection points are weighted by their computed density values. KDE satisfies the requirements for the merging strategy because detections at different scales or locations are far apart in 3-D space and, therefore, will not be smoothed together. The computed maxima of the density estimate correspond to the locations and scales of the final strong detections. The kernel width is chosen to be greater than the stride of the exhaustive search phase, but not wider than the fixed size object itself, so that nearby objects are not merged.

Detector window and classifier

As visual context has been shown to improve performance, our window includes about an 8 pixel margin around the zebra on all four sides. We use a fixed window size for all positive examples to be able to easily use conventional machine learning algorithms. By default we use a linear SVM classifier trained with the SVMlight software [55]. We attained much better performance using a standard SVM with a quadratic optimizer rather than the faster soft (C=0.01) linear one.
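The KDE-based fusion can be illustrated with a small mean-shift style mode finder over (x, y, log-scale) detection points (a sketch with a single isotropic bandwidth; the thesis method and its parameters may differ):

```python
import numpy as np

def kde_modes(points, scores, bandwidth, iters=30):
    """Fuse detections given as rows (x, y, log_scale): each point climbs
    the score-weighted Gaussian kernel density estimate, and points whose
    climbs converge to the same mode are merged into one final detection."""
    pts = np.asarray(points, dtype=float)
    w = np.asarray(scores, dtype=float)
    modes = pts.copy()
    for _ in range(iters):
        for i in range(len(modes)):
            d2 = ((pts - modes[i]) ** 2).sum(axis=1) / bandwidth ** 2
            k = w * np.exp(-0.5 * d2)              # score-weighted kernel
            modes[i] = (k[:, None] * pts).sum(axis=0) / k.sum()
    final = []                                     # merge coincident modes
    for m in modes:
        if all(np.linalg.norm(m - f) > bandwidth / 2 for f in final):
            final.append(m)
    return np.array(final)
```

Because the kernel decays with distance in position-scale space, two nearby zebras whose window sets overlap can still yield two distinct modes, which is what the disjoint-subset heuristic cannot do.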

CHAPTER 4

EXPERIMENTS AND IMPLEMENTATION

In this chapter we report the experimental results concerning the performance of the Rectangular Dense Lattice (R-DL) features developed in this thesis. Subsequently, we compare our features to the Rectangular Histogram of Oriented Gradients (R-HOG), the Histogram of Oriented Gradients with Analytically performed Dimensionality Reduction (ADR-HOG), and Haar-like features within the Viola-Jones framework.

4.1 Implementation and Performance Study

The image feature extraction process maps the fixed size visual regions into feature vectors of fixed dimension. These feature vectors are then fed into a pattern recognition style classifier. We use Support Vector Machines (SVM) as our baseline binary classifier, as it proved to be the most accurate, reliable, and scalable of the classifiers tested in our initial experiments. As the SVM works directly in the feature space, it requires the feature set to be as linearly separable as possible, so improvements in performance imply better encoding. The resulting system gives monolithic object/non-object decisions over the detection window: it does not provide part-based detection and typically is not very robust to poor visibility or partial occlusion. Nevertheless, the monolithic approach gives a high quality detector that is difficult to beat for fully visible objects.

We now give details of our different R-DL implementations and systematically study the effects of various choices on detector performance. We discuss every step of the descriptor processing chain and provide experimental results to support our conclusions. For all of the experiments in this section we restrict ourselves to measuring the performance of the window-level classifier.

Gamma normalization

We evaluated two input pixel representations, grayscale and RGB color spaces, with optional gamma normalization. When color information is available, it consistently increases the performance of the detection frameworks that make use of it. We experimented with square root and log compression for gamma normalization. Assuming that image formation is a multiplicative process and that illumination varies slowly, log compression becomes justifiable. A similar conclusion can be made about square root compression, assuming that photon noise is proportional to the square root of the intensity [52]. This assumption also implies that square root gamma normalization will make the effective noise uniform. Square root compression has been shown to increase performance for certain classes of man-made objects (bicycles, autos, buses, etc.), as well as for some animal classes (cows and sheep) [52]. In our experiments, using square root compression resulted in a slight detection performance improvement. However, log compression proved to be too strong, decreasing the performance of the detector in our experiments.
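The two compressions are simple pixel-wise maps (a sketch; the log variant's rescaling constant is an illustrative choice, not from the thesis):

```python
import numpy as np

def gamma_compress(img, method="sqrt"):
    """Pixel-wise gamma compression of an intensity image in [0, 255].
    Square-root compression makes photon (shot) noise roughly uniform;
    log compression matches a multiplicative image-formation model."""
    x = img.astype(float) / 255.0
    if method == "sqrt":
        out = np.sqrt(x)
    elif method == "log":
        out = np.log1p(x) / np.log(2.0)   # log(1 + x), rescaled back to [0, 1]
    else:
        raise ValueError(method)
    return out * 255.0
```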

Edge computation

Haar-like wavelets, simple thresholding, and adaptive thresholding were applied to find the best way to elicit zebra stripes in a visual region. Thresholding techniques are simple filters based on absolute color information, and thus are not invariant to illumination changes. This causes them to produce substantial noise and to fail to distinguish between the dark and light stripes on a zebra's body in direct sunlight and in shadow. Moreover, thresholding techniques do not generalize well, and depend strongly on their threshold parameter. As overall detector performance is sensitive to the way in which stripe edges are computed, the simplest thresholding techniques do not seem to be a viable solution for edge detection. We envisioned our detection framework being used with zebra identification techniques, so we made a design choice to devise features that underline the strong stripe edges. We found that horizontal and vertical Haar-like wavelet operators were best suited for filtering out most of the naturally occurring noise and bringing out the stripes (Figure 10). The advantage of these operators is their use of edge differences instead of absolute color values. We computed image edges using basic horizontal and vertical Haar-like wavelets. We used grayscale images for simplicity, but the method could easily be extended to handle the RGB or HSV color spaces. For color images consisting of more than one channel, the operation is repeated on each channel and the channel with the maximum values is taken as the value for a block. Color images are handled in a similar way when computing HOG features.

(a) Adaptive thresholding (b) Haar-like filtering
Figure 10. Image filtering

We do not use traditional convolution when applying the Haar wavelets. For the centralized wavelets, we collect the score into the center of the matrix and write it to a new image feature map. For wavelets with even dimensions, we collect scores in the center 2×2 square. After the initial score computation we scale and normalize it, and clip values that are lower than a chosen threshold; we tested several clipping thresholds to find the optimal one. In our experiments we used 3×3 and 5×5 filters (Figure 11(a)). The smaller 3×3 filters give substantially better results because the fixed-size templates used for the learning and exhaustive search phases are too small for the bigger Haar-like wavelets. Due to their size, the bigger wavelets merge neighboring stripes, and ultimately lose important information during encoding.
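A simplified sketch of this scoring scheme, assuming basic 3×3 top-minus-bottom and left-minus-right wavelets (the exact kernels, scaling, and normalization in the thesis implementation may differ):

```python
import numpy as np

# Assumed 3x3 basic Haar-like wavelets: top-minus-bottom (responds to
# horizontal edges) and left-minus-right (responds to vertical edges).
HORIZ = np.array([[ 1,  1,  1],
                  [ 0,  0,  0],
                  [-1, -1, -1]], dtype=np.float64)
VERT = HORIZ.T

def haar_edge_map(image, kernel, threshold=0.0):
    """Slide the wavelet over the image, writing each absolute score to
    the center pixel of the window, then clip weak responses to zero."""
    img = image.astype(np.float64)
    h, w = img.shape
    r = kernel.shape[0] // 2
    out = np.zeros_like(img)
    for y in range(r, h - r):
        for x in range(r, w - r):
            window = img[y - r:y + r + 1, x - r:x + r + 1]
            out[y, x] = np.abs(np.sum(window * kernel))
    out[out < threshold] = 0.0   # clip values below the noise threshold
    return out

# A dark/light horizontal stripe pattern: rows 0-2 dark, rows 3-5 light.
# It responds strongly to HORIZ and not at all to VERT.
stripes = np.tile(np.repeat([0.0, 1.0], 3), (6, 1)).T
print(haar_edge_map(stripes, HORIZ).max(), haar_edge_map(stripes, VERT).max())
```

The difference-of-sums form is what makes the response depend on local contrast rather than absolute intensity, which is the property exploited above.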

(a) Wavelet size (b) Wavelet threshold
Figure 11. Edge computation

To filter out noise (edges produced by the background and by non-object instances), we clip values below a certain threshold. Several threshold values were tested (Figure 11(b)), and the optimal one was selected. Low thresholds produce a substantial amount of noise, whereas high thresholds do not allow the zebra's stripes to be captured effectively.

Lattice and block overlap

In the next step, a visual region is split into a grid of blocks of a certain size. Horizontal and vertical edge counts are computed for each block, and are later used for image descriptor construction. Due to block overlap, a single edge can contribute to different blocks. Block overlapping accounts for the non-linear nature of the image descriptor scheme, and is essential

(a) Lattice step (b) Block overlap
Figure 12. Lattice and block overlap

to good performance. In our experiments, we compared different block overlaps, all relative to the block size. The block step was defined through the overlap:

block step = block dim / block overlap

We used the values 1, 2, 4, and 8 in the overlap experiments. No overlap yields the poorest results in our benchmark. Higher overlap significantly increases detector performance. However, higher overlap comes at a price: it increases the dimensionality of the descriptor, and decreases learning and detection speeds. For the final version of the descriptor we chose an overlap value of 4, which is a good compromise between detector performance and speed.
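The block layout implied by this formula can be sketched as follows (function and variable names are ours):

```python
def block_grid(region_w, region_h, block_dim, overlap):
    """Top-left corners of overlapping square blocks covering a region.
    With overlap k, consecutive blocks are shifted by block_dim // k,
    so each pixel falls into up to k blocks along each axis."""
    step = block_dim // overlap
    xs = range(0, region_w - block_dim + 1, step)
    ys = range(0, region_h - block_dim + 1, step)
    return [(x, y) for y in ys for x in xs]

# With 32-pixel blocks and overlap 4 the step is 32/4 = 8 pixels,
# matching the final descriptor configuration described in the text.
corners = block_grid(64, 64, 32, 4)
print(len(corners))  # 5 positions per axis -> 25 blocks
```

With overlap 1 the grid degenerates to non-overlapping tiles, which is the "no overlap" setting that performed worst in the benchmark.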

After the horizontal and vertical Haar wavelets have been convolved with the visual region, and the region has been split into a grid of blocks, a lattice structure is used to count horizontal and vertical edges in each block. We use uniformly distributed lines along the X and Y axes in the rectangular cells to measure the edge counts, which are accumulated along the lattice lines. The lattice density is defined by the lattice step: the smaller the step, the denser the lattice. Lattice density does not affect performance or detection speed as much as the overlap does, but large steps tend to reduce performance. In the final version of the descriptor we use a lattice step of 6 pixels.

Descriptor blocks and block normalization schemes

Gradient strengths vary over a wide range owing to local variations in illumination and foreground-background contrast [52]. This means that effective local contrast normalization is essential for good performance. We used a normalization scheme based on edge count normalization in each block. In fact, the blocks are typically overlapped so that each edge count response contributes several components to the final descriptor vector, each normalized with respect to a different block. This may seem redundant, but good normalization is important, and including overlap significantly improves performance (Figure 12(b)). For lattice edge count normalization we used the median and the arithmetic average of the horizontal and vertical counts accumulated along the lattice, with the median count used in the final version of the descriptor. For count normalization we calculate the normalization denominator by dividing the block dimension by the Haar-like wavelet dimension:

norm = block dim / wavelet dim

In this way we get an estimate of the maximum number of edges per block. We then use the following formula to normalize the horizontal and vertical edge counts:

(a) Normalization (b) Block size
Figure 13. Descriptor blocks and normalization schemes

edge count = edge count / norm

We present the normalization effects in Figure 13. In our experiments we obtained slightly better results using unnormalized counts. This might be because the counts are already implicitly normalized through the block overlap. We also review the effects of different block sizes (Figure 13(b)). The negative effect of large blocks on the detector's performance is evident. For the final version of the image descriptor scheme we use a square block size of 32×32 pixels.
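A hedged sketch of the per-block lattice counting and normalization described above (names are ours; the thesis implementation counts edges along the lattice lines, approximated here by counting nonzero responses of the edge map):

```python
import numpy as np

def lattice_edge_counts(edge_map_block, lattice_step, wavelet_dim):
    """Count nonzero edge responses along uniformly spaced horizontal
    and vertical lattice lines of a square block of the edge map, take
    the median over the lines, and normalize by block_dim / wavelet_dim
    (an estimate of the maximum edge count per block)."""
    block_dim = edge_map_block.shape[0]
    norm = block_dim / wavelet_dim
    lines = range(0, block_dim, lattice_step)
    h_counts = [np.count_nonzero(edge_map_block[r, :]) for r in lines]
    v_counts = [np.count_nonzero(edge_map_block[:, c]) for c in lines]
    return float(np.median(h_counts)) / norm, float(np.median(v_counts)) / norm

# Edge map with two vertical stripe edges in a 12x12 block; with a
# 3-wide wavelet the normalization denominator is 12 / 3 = 4.
edge_map = np.zeros((12, 12))
edge_map[:, 3] = 1.0
edge_map[:, 9] = 1.0
print(lattice_edge_counts(edge_map, lattice_step=6, wavelet_dim=3))  # (0.5, 0.0)
```

The median over lattice lines is the aggregation used in the final descriptor; swapping `np.median` for `np.mean` gives the arithmetic-average variant that was also evaluated.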

4.2 Overview of the results

In this section we report experimental results on the performance of the Rectangular Dense Matrix (R-DL) features developed in this thesis, and compare them to the Rectangular Histogram of Oriented Gradients (R-HOG), the Histogram of Oriented Gradients with Analytically performed Dimensionality Reduction (ADR-HOG), and Haar-like features within the Viola-Jones framework.

Data set details

The dataset used for the experiments in this thesis was collected specifically for the task of zebra studies. The images were annotated with rich metadata containing spatio-temporal, behavioral, and visual information. For the experiments undertaken in this thesis, we use the visual metadata in the form of bounding box annotations around the zebra objects. The annotated data contains six different visual classes: left, right, frontal, rear, perfect left, and perfect right. The first four classes correspond to the respective classes in the PASCAL-challenge style of annotation [44]. As we mainly concentrate on the task of detecting side-facing animals for the sake of future identification, we limit ourselves to only four classes: left, right, perfect left, and perfect right.

Performance evaluation

We use the usual output format for a detection framework, returning a list of bounding boxes with associated confidences (ranks) as the output for a given test image. Detections are assigned to ground truth objects and judged to be true or false positives by measuring the bounding box overlap. To be considered a correct detection, the overlap ratio a_0 between the

predicted box B_p and the ground truth bounding box B_gt has to exceed the 50% rate according to the following formula:

a_0 = area(B_p ∩ B_gt) / area(B_p ∪ B_gt)

where B_p ∩ B_gt denotes the intersection of the predicted and ground truth bounding boxes, and B_p ∪ B_gt is their union. The performance evaluation metric, as well as the 50% threshold, is borrowed from the peer-reviewed and highly acclaimed PASCAL visual object classes challenge [44]. The threshold was originally chosen to account for inaccuracies in the bounding boxes of the ground truth data; for example, defining the bounding box for a highly non-convex object, e.g. an animal with its limbs spread, is somewhat subjective. The detection framework's output is assigned to the ground truth objects satisfying the overlap criterion in order of decreasing confidence. Multiple detections of the same object in an image are considered false detections (e.g. 5 detections of a single object count as 1 correct detection and 4 false detections); it is the detection system's responsibility to filter multiple detections from its output.

Detector performance

In this section we compare the performance of our detector to a set of different state-of-the-art algorithms for object recognition.
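The overlap criterion can be computed directly; a minimal sketch with boxes represented as (x1, y1, x2, y2) tuples (our convention):

```python
def iou(box_p, box_gt):
    """PASCAL-style overlap ratio a_0 = area(intersection) / area(union).
    Boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ix1 = max(box_p[0], box_gt[0])
    iy1 = max(box_p[1], box_gt[1])
    ix2 = min(box_p[2], box_gt[2])
    iy2 = min(box_p[3], box_gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_p) + area(box_gt) - inter
    return inter / union

def is_correct(box_p, box_gt, threshold=0.5):
    """A detection counts as correct when a_0 exceeds the 50% threshold."""
    return iou(box_p, box_gt) > threshold

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150
```

Note that a prediction covering exactly half of the ground truth box scores only 1/3, not 1/2, because the union grows as the intersection shrinks; this is why the 50% threshold is stricter than it may appear.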

R-DL

In the final version of our descriptor scheme we use a fixed-size detection window. We run the 3×3 centralized Haar wavelets, then break the fixed template into a grid of overlapping 32×32 pixel rectangular blocks with a block step of 32/4 = 8 pixels. We superimpose an 8×8 uniform lattice on each block to count horizontal and vertical edge transitions. We normalize each count by dividing it by the factor block dim / wavelet dim, which approximates the maximum number of horizontal and vertical edges per block. We use the standard versions of the other evaluated descriptors, as described in the publications where they were presented.

HOG

We use the default version of the Histogram of Oriented Gradients (HOG) descriptor: the image gradient computed by applying a [-1, 0, 1] filter along the x- and y-axes with no smoothing; linear gradient voting into 9 orientation bins over 0°-180°; 16×16 pixel blocks containing 2×2 cells of 8×8 pixels; Gaussian block windowing with σ = 8 pixels; L2-Hys (Lowe-style clipped L2 norm) block normalization; blocks spaced with a stride of 8 pixels (hence 4-fold coverage of each cell); a fixed detection window; and a linear SVM classifier.

Haar wavelets

We use the standard implementation of the Viola-Jones [46] detector within the OpenCV framework, with the following parameters: a non-symmetric visual class, which is a departure from the standard face detection setting the framework was initially designed for; a 14-stage AdaBoost cascade classifier; a minimum per-stage hit rate and a 0.5 maximum false alarm rate;

non-equal weights with 0.95 weight trimming per weak classifier; the original non-extended set of basic Haar-like wavelets; a minimum of 500 positive samples per cluster; and a fixed-size window.

ADR-HOG

We use the latest version of the discriminatively trained, multiscale, deformable part models with their image descriptors. The image descriptors in use are 31-dimensional histograms of oriented gradients with analytically reduced dimensionality. Other parameters include left-right pose clustering, with each example assigned to the cluster whose center has the minimum Euclidean distance to the example; the default part filters in each deformable model, at twice the spatial resolution of the root filter; and a latent SVM to train the model parameters. Finally, we use the recommended 3-component model for the task of zebra recognition.

Results

In this section we compare the above-mentioned frameworks in their detection performance, measuring both precision and recall, visually represented by the precision-recall curves in Figure 14. The performance of the detector framework described in this thesis matches that of state-of-the-art detection algorithms. The Viola-Jones framework demonstrates the greatest detection speed, as well as a fair precision rate when recall is below 0.5. The HOG image descriptors within the Dalal-Triggs framework show a slow detection rate and poor performance. The ADR-HOG framework is capable of achieving high recall at the expense of precision. Its precision rate is the highest in this benchmark at recall rates between 0.0 and 0.9. However, at

recall rates between 0.9 and 0.95 ADR-HOG is dominated by the system developed in this thesis. The R-DL features demonstrate both fast detection speeds and strong performance. At lower recall rates, between 0.0 and 0.6, their performance essentially matches that of the much more complicated ADR-HOG system. This indicates the efficacy of R-DL features for the task of zebra detection.

Figure 14. Performance of the detectors
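The ranked assignment rule used in our performance evaluation, where duplicate detections of an already-matched object count as false positives, can be sketched as follows (all names are ours):

```python
def overlap_ratio(a, b):
    """a_0 = area(intersection) / area(union) for (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def evaluate_detections(detections, ground_truths, threshold=0.5):
    """Greedy PASCAL-style matching: process detections in order of
    decreasing confidence; each ground truth may be matched at most
    once, so repeated detections of one object count as false positives."""
    matched = set()
    tp, fp = 0, 0
    for conf, box in sorted(detections, key=lambda d: -d[0]):
        best_iou, best_gt = 0.0, None
        for i, gt in enumerate(ground_truths):
            if i not in matched:
                overlap = overlap_ratio(box, gt)
                if overlap > best_iou:
                    best_iou, best_gt = overlap, i
        if best_iou > threshold:
            matched.add(best_gt)
            tp += 1
        else:
            fp += 1
    return tp, fp

gts = [(0, 0, 10, 10)]
dets = [(0.9, (0, 0, 10, 10)),    # correct detection
        (0.8, (1, 1, 11, 11)),    # duplicate of the same object -> false
        (0.3, (50, 50, 60, 60))]  # background -> false
print(evaluate_detections(dets, gts))  # (1, 2)
```

Precision and recall at a given confidence cutoff then follow directly from the true/false positive counts over the test set.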

R-DL    HOG    VJ    ADR-HOG

TABLE I
DETECTOR SPEED

4.3 Automating zebra identification with detection

Collecting behavioral data about a species in wild animal populations often requires identifying individual animals between sightings. This is an important tool in ecological analysis that underlies broader aspects of animal behavior research [56, 54]. Electronic tracking devices embedded in animals can be prohibitively expensive for use in field conditions. They also involve significant cost and risk for large animals [57, 58]. Researchers are therefore left with no choice other than time-consuming techniques such as manual visual identification from photographs or video [59, 60], genetic markers in excrement [61], or capture-recapture techniques [62]. Advances in hardware and the corresponding drop in prices of digital cameras have increased the availability of high-resolution, high-quality digital photographs of wild animal sightings, making fully automatic or computer-assisted animal identification an attractive approach [63]. The demand for automated data collection for ecology research and the abundance of digital photographs triggered the development of techniques able to automatically identify individual animals from their coat markings. These techniques are applicable to animals with prominent

morphological characteristics like stripes or large patches, and are intended to be part of a cost-effective, computer-assisted individual identification system for animals. StripeSpotter [63] is an example of such a system. It approaches the problem of zebra identification by extracting from an image of an animal a set of discriminative features that are tolerant to noise from variations in scale and exposure, occlusion, partial deformation, and mild shear. These features allow it to efficiently and robustly compare images in a database by their appearance. It uses a distance measure between a pair of feature sets taken from two images, together with an efficient algorithm for computing it, which makes it possible to judge how different the coat markings depicted in two pictures are. A lower distance between images of two animals signifies a higher chance that they are the same individual. This measure is used to determine whether an animal just photographed in the wild exists in a database of prior sightings. StripeSpotter and similar individual identification tools have greatly improved the speed and accuracy of ecological data collection. However, these systems are not designed to be fully independent, automated information retrieval tools. They only augment the efforts of professionally trained field assistants, due to their need for human input. That need comes from the systems' inability to detect and label region-of-interest (ROI) bounding boxes, which, depending on the identification algorithm, may cover different specific areas of the animal's body. We show how the zebra detection algorithm developed in this thesis can be applied to augment zebra identification software toward a fully automated ecological information retrieval system by localizing ROIs. For our experiments, we use the R-DL features developed in this thesis as the

base for our detection algorithm, and StripeSpotter as the identification front-end and performance benchmark tool.

(a) (b) (c) (d)
Figure 15. Linear regression derived ROIs

No detection framework can be used directly as a ROI-providing back-end for StripeSpotter. Detection frameworks are generally trained to output bounding boxes that envelop the whole area of the detected object. StripeSpotter, however, requires certain regions within the zebra's body, cropped as consistently as possible, and containing regions of each color accurately separable by a color segmentation algorithm. This requirement means that the inclusion of background in the output bounding boxes is not desirable.

Figure 16. Examples of automatically derived ROIs

Our detection framework outputs bounding boxes for zebras that contain considerable background regions (in the area between the legs, between the head and the neck, etc.). Therefore,


More information

Fast Human Detection Using a Cascade of Histograms of Oriented Gradients

Fast Human Detection Using a Cascade of Histograms of Oriented Gradients MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Fast Human Detection Using a Cascade of Histograms of Oriented Gradients Qiang Zhu, Shai Avidan, Mei-Chen Yeh, Kwang-Ting Cheng TR26-68 June

More information

Object recognition (part 1)

Object recognition (part 1) Recognition Object recognition (part 1) CSE P 576 Larry Zitnick (larryz@microsoft.com) The Margaret Thatcher Illusion, by Peter Thompson Readings Szeliski Chapter 14 Recognition What do we mean by object

More information

Face detection and recognition. Detection Recognition Sally

Face detection and recognition. Detection Recognition Sally Face detection and recognition Detection Recognition Sally Face detection & recognition Viola & Jones detector Available in open CV Face recognition Eigenfaces for face recognition Metric learning identification

More information

A New Strategy of Pedestrian Detection Based on Pseudo- Wavelet Transform and SVM

A New Strategy of Pedestrian Detection Based on Pseudo- Wavelet Transform and SVM A New Strategy of Pedestrian Detection Based on Pseudo- Wavelet Transform and SVM M.Ranjbarikoohi, M.Menhaj and M.Sarikhani Abstract: Pedestrian detection has great importance in automotive vision systems

More information

Skin and Face Detection

Skin and Face Detection Skin and Face Detection Linda Shapiro EE/CSE 576 1 What s Coming 1. Review of Bakic flesh detector 2. Fleck and Forsyth flesh detector 3. Details of Rowley face detector 4. Review of the basic AdaBoost

More information

Face Detection and Alignment. Prof. Xin Yang HUST

Face Detection and Alignment. Prof. Xin Yang HUST Face Detection and Alignment Prof. Xin Yang HUST Many slides adapted from P. Viola Face detection Face detection Basic idea: slide a window across image and evaluate a face model at every location Challenges

More information

MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION

MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION Panca Mudjirahardjo, Rahmadwati, Nanang Sulistiyanto and R. Arief Setyawan Department of Electrical Engineering, Faculty of

More information

Deformable Part Models

Deformable Part Models CS 1674: Intro to Computer Vision Deformable Part Models Prof. Adriana Kovashka University of Pittsburgh November 9, 2016 Today: Object category detection Window-based approaches: Last time: Viola-Jones

More information

CS4670: Computer Vision

CS4670: Computer Vision CS4670: Computer Vision Noah Snavely Lecture 6: Feature matching and alignment Szeliski: Chapter 6.1 Reading Last time: Corners and blobs Scale-space blob detector: Example Feature descriptors We know

More information

Study of Viola-Jones Real Time Face Detector

Study of Viola-Jones Real Time Face Detector Study of Viola-Jones Real Time Face Detector Kaiqi Cen cenkaiqi@gmail.com Abstract Face detection has been one of the most studied topics in computer vision literature. Given an arbitrary image the goal

More information

Local Features: Detection, Description & Matching

Local Features: Detection, Description & Matching Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British

More information

Bayes Risk. Classifiers for Recognition Reading: Chapter 22 (skip 22.3) Discriminative vs Generative Models. Loss functions in classifiers

Bayes Risk. Classifiers for Recognition Reading: Chapter 22 (skip 22.3) Discriminative vs Generative Models. Loss functions in classifiers Classifiers for Recognition Reading: Chapter 22 (skip 22.3) Examine each window of an image Classify object class within each window based on a training set images Example: A Classification Problem Categorize

More information

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped

More information

Parallel Tracking. Henry Spang Ethan Peters

Parallel Tracking. Henry Spang Ethan Peters Parallel Tracking Henry Spang Ethan Peters Contents Introduction HAAR Cascades Viola Jones Descriptors FREAK Descriptor Parallel Tracking GPU Detection Conclusions Questions Introduction Tracking is a

More information

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision

University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision report University of Cambridge Engineering Part IIB Module 4F12 - Computer Vision and Robotics Mobile Computer Vision Web Server master database User Interface Images + labels image feature algorithm Extract

More information

Tri-modal Human Body Segmentation

Tri-modal Human Body Segmentation Tri-modal Human Body Segmentation Master of Science Thesis Cristina Palmero Cantariño Advisor: Sergio Escalera Guerrero February 6, 2014 Outline 1 Introduction 2 Tri-modal dataset 3 Proposed baseline 4

More information

SURF. Lecture6: SURF and HOG. Integral Image. Feature Evaluation with Integral Image

SURF. Lecture6: SURF and HOG. Integral Image. Feature Evaluation with Integral Image SURF CSED441:Introduction to Computer Vision (2015S) Lecture6: SURF and HOG Bohyung Han CSE, POSTECH bhhan@postech.ac.kr Speed Up Robust Features (SURF) Simplified version of SIFT Faster computation but

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

Categorization by Learning and Combining Object Parts

Categorization by Learning and Combining Object Parts Categorization by Learning and Combining Object Parts Bernd Heisele yz Thomas Serre y Massimiliano Pontil x Thomas Vetter Λ Tomaso Poggio y y Center for Biological and Computational Learning, M.I.T., Cambridge,

More information

An Object Detection System using Image Reconstruction with PCA

An Object Detection System using Image Reconstruction with PCA An Object Detection System using Image Reconstruction with PCA Luis Malagón-Borja and Olac Fuentes Instituto Nacional de Astrofísica Óptica y Electrónica, Puebla, 72840 Mexico jmb@ccc.inaoep.mx, fuentes@inaoep.mx

More information

Classifiers for Recognition Reading: Chapter 22 (skip 22.3)

Classifiers for Recognition Reading: Chapter 22 (skip 22.3) Classifiers for Recognition Reading: Chapter 22 (skip 22.3) Examine each window of an image Classify object class within each window based on a training set images Slide credits for this chapter: Frank

More information

Histograms of Oriented Gradients

Histograms of Oriented Gradients Histograms of Oriented Gradients Carlo Tomasi September 18, 2017 A useful question to ask of an image is whether it contains one or more instances of a certain object: a person, a face, a car, and so forth.

More information

Human detection using histogram of oriented gradients. Srikumar Ramalingam School of Computing University of Utah

Human detection using histogram of oriented gradients. Srikumar Ramalingam School of Computing University of Utah Human detection using histogram of oriented gradients Srikumar Ramalingam School of Computing University of Utah Reference Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection,

More information

Histogram of Oriented Gradients (HOG) for Object Detection

Histogram of Oriented Gradients (HOG) for Object Detection Histogram of Oriented Gradients (HOG) for Object Detection Navneet DALAL Joint work with Bill TRIGGS and Cordelia SCHMID Goal & Challenges Goal: Detect and localise people in images and videos n Wide variety

More information

A novel template matching method for human detection

A novel template matching method for human detection University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2009 A novel template matching method for human detection Duc Thanh Nguyen

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Breaking it Down: The World as Legos Benjamin Savage, Eric Chu

Breaking it Down: The World as Legos Benjamin Savage, Eric Chu Breaking it Down: The World as Legos Benjamin Savage, Eric Chu To devise a general formalization for identifying objects via image processing, we suggest a two-pronged approach of identifying principal

More information

Object and Class Recognition I:

Object and Class Recognition I: Object and Class Recognition I: Object Recognition Lectures 10 Sources ICCV 2005 short courses Li Fei-Fei (UIUC), Rob Fergus (Oxford-MIT), Antonio Torralba (MIT) http://people.csail.mit.edu/torralba/iccv2005

More information

Outline 7/2/201011/6/

Outline 7/2/201011/6/ Outline Pattern recognition in computer vision Background on the development of SIFT SIFT algorithm and some of its variations Computational considerations (SURF) Potential improvement Summary 01 2 Pattern

More information

Object detection using non-redundant local Binary Patterns

Object detection using non-redundant local Binary Patterns University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2010 Object detection using non-redundant local Binary Patterns Duc Thanh

More information

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011 Previously Part-based and local feature models for generic object recognition Wed, April 20 UT-Austin Discriminative classifiers Boosting Nearest neighbors Support vector machines Useful for object recognition

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

Segmentation and Grouping

Segmentation and Grouping Segmentation and Grouping How and what do we see? Fundamental Problems ' Focus of attention, or grouping ' What subsets of pixels do we consider as possible objects? ' All connected subsets? ' Representation

More information

Machine Learning for Signal Processing Detecting faces (& other objects) in images

Machine Learning for Signal Processing Detecting faces (& other objects) in images Machine Learning for Signal Processing Detecting faces (& other objects) in images Class 8. 27 Sep 2016 11755/18979 1 Last Lecture: How to describe a face The typical face A typical face that captures

More information

Face detection and recognition. Many slides adapted from K. Grauman and D. Lowe

Face detection and recognition. Many slides adapted from K. Grauman and D. Lowe Face detection and recognition Many slides adapted from K. Grauman and D. Lowe Face detection and recognition Detection Recognition Sally History Early face recognition systems: based on features and distances

More information

Postprint.

Postprint. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at 14th International Conference of the Biometrics Special Interest Group, BIOSIG, Darmstadt, Germany, 9-11 September,

More information

Local Image preprocessing (cont d)

Local Image preprocessing (cont d) Local Image preprocessing (cont d) 1 Outline - Edge detectors - Corner detectors - Reading: textbook 5.3.1-5.3.5 and 5.3.10 2 What are edges? Edges correspond to relevant features in the image. An edge

More information

Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds

Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds 9 1th International Conference on Document Analysis and Recognition Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds Weihan Sun, Koichi Kise Graduate School

More information

Computer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction

Computer vision: models, learning and inference. Chapter 13 Image preprocessing and feature extraction Computer vision: models, learning and inference Chapter 13 Image preprocessing and feature extraction Preprocessing The goal of pre-processing is to try to reduce unwanted variation in image due to lighting,

More information

Graph Matching Iris Image Blocks with Local Binary Pattern

Graph Matching Iris Image Blocks with Local Binary Pattern Graph Matching Iris Image Blocs with Local Binary Pattern Zhenan Sun, Tieniu Tan, and Xianchao Qiu Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of

More information

FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU

FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU 1. Introduction Face detection of human beings has garnered a lot of interest and research in recent years. There are quite a few relatively

More information

Edge and corner detection

Edge and corner detection Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements

More information

SIFT - scale-invariant feature transform Konrad Schindler

SIFT - scale-invariant feature transform Konrad Schindler SIFT - scale-invariant feature transform Konrad Schindler Institute of Geodesy and Photogrammetry Invariant interest points Goal match points between images with very different scale, orientation, projective

More information

Generic Object-Face detection

Generic Object-Face detection Generic Object-Face detection Jana Kosecka Many slides adapted from P. Viola, K. Grauman, S. Lazebnik and many others Today Window-based generic object detection basic pipeline boosting classifiers face

More information

Harder case. Image matching. Even harder case. Harder still? by Diva Sian. by swashford

Harder case. Image matching. Even harder case. Harder still? by Diva Sian. by swashford Image matching Harder case by Diva Sian by Diva Sian by scgbt by swashford Even harder case Harder still? How the Afghan Girl was Identified by Her Iris Patterns Read the story NASA Mars Rover images Answer

More information

IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES

IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES Pin-Syuan Huang, Jing-Yi Tsai, Yu-Fang Wang, and Chun-Yi Tsai Department of Computer Science and Information Engineering, National Taitung University,

More information

Exploiting scene constraints to improve object detection algorithms for industrial applications

Exploiting scene constraints to improve object detection algorithms for industrial applications Exploiting scene constraints to improve object detection algorithms for industrial applications PhD Public Defense Steven Puttemans Promotor: Toon Goedemé 2 A general introduction Object detection? Help

More information

Scene Text Detection Using Machine Learning Classifiers

Scene Text Detection Using Machine Learning Classifiers 601 Scene Text Detection Using Machine Learning Classifiers Nafla C.N. 1, Sneha K. 2, Divya K.P. 3 1 (Department of CSE, RCET, Akkikkvu, Thrissur) 2 (Department of CSE, RCET, Akkikkvu, Thrissur) 3 (Department

More information

Feature Descriptors. CS 510 Lecture #21 April 29 th, 2013

Feature Descriptors. CS 510 Lecture #21 April 29 th, 2013 Feature Descriptors CS 510 Lecture #21 April 29 th, 2013 Programming Assignment #4 Due two weeks from today Any questions? How is it going? Where are we? We have two umbrella schemes for object recognition

More information

Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition

Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition Selection of Location, Frequency and Orientation Parameters of 2D Gabor Wavelets for Face Recognition Berk Gökberk, M.O. İrfanoğlu, Lale Akarun, and Ethem Alpaydın Boğaziçi University, Department of Computer

More information

Detecting Object Instances Without Discriminative Features

Detecting Object Instances Without Discriminative Features Detecting Object Instances Without Discriminative Features Edward Hsiao June 19, 2013 Thesis Committee: Martial Hebert, Chair Alexei Efros Takeo Kanade Andrew Zisserman, University of Oxford 1 Object Instance

More information

Classification of objects from Video Data (Group 30)

Classification of objects from Video Data (Group 30) Classification of objects from Video Data (Group 30) Sheallika Singh 12665 Vibhuti Mahajan 12792 Aahitagni Mukherjee 12001 M Arvind 12385 1 Motivation Video surveillance has been employed for a long time

More information