Detecting License Plates


Bachelor Informatica, Universiteit van Amsterdam

Detecting License Plates

Willem F.K. Fibbe

October 26, 2008

Supervisor(s): Rein van den Boomgaard (UvA) and Leo Dorst (UvA)
Signed: Leo Dorst (UvA)


Abstract

This paper describes how a computer program detects if, and if so, where, a license plate is located inside an image. It detects such a license plate regardless of its scale or color within the image. The algorithm used to achieve this was developed by Viola and Jones. It scans a full input image of, say, 800 x 600 pixels with a small detection window, where each window position might contain a license plate. After the whole image has been scanned with this window, the window is resized and the scanning process starts over with a bigger window. The process is repeated until the detection window gets bigger than the image itself. This method makes the detection task invariant to the size of the license plate within the image (e.g. the license plate may be 11 x 52 pixels, but it may also be 77 x 364 pixels). Normalising the input image makes the task invariant to color as well. The Viola and Jones algorithm makes decisions about the presence of a license plate using data that is prepared in advance: the so-called weak classifiers. When the algorithm decides there is a license plate present in the window, it bases this decision on many of them. The training process takes care of preparing these classifiers: it generates many weak classifiers, trains each of them, and afterwards selects the weak classifier that performs best. It does this by analysing many input examples: positive examples, where each image contains a license plate, and negative examples (or backgrounds), where each image may contain anything but a license plate. Traditional methods train each generated weak classifier on every input example, but Pham and Cham recently published an algorithm that uses statistics of the input examples. This way, not every classifier has to be trained on every input example to obtain the desired data; instead, the data can be calculated directly from the statistics.
This method reduces the training time significantly, so the training task benefits from great speedups. Ultimately, the detection process uses many weak classifiers. Once the training process has generated the first best-performing classifier, the process is started over repeatedly with (possibly only slightly) different training data (i.e. input examples). This way, other weak classifiers are generated to use for detection. This paper discusses how to train the first weak classifier that performs best. For ease of reading, a concrete detection example is used throughout: license plates. It should be noted, however, that the same theory described in this paper can be used to detect other kinds of objects (like faces); the only thing that has to be changed is the set of input examples used for the training process (positive examples would then contain faces and negative examples certainly not).


Contents

1 Introduction
   1.1 The project
      1.1.1 Description
      1.1.2 Approach
      1.1.3 Used software
   1.2 Layout
2 Detecting
   2.1 Haar-features
      2.1.1 Detection window
   2.2 Integral Image
   2.3 Cascade of classifiers
      2.3.1 The weak classifier
      2.3.2 The cascade
3 Training
   3.1 Preparation
      3.1.1 Generating training-set of images
      3.1.2 Gathering training-data
      3.1.3 Linear relationships
      3.1.4 Generating Haar-features
   3.2 Using statistics to train a weak classifier
      3.2.1 Global and local statistics
      3.2.2 Choosing the best weak classifier
4 Results
   Issues
   Improving while implementing
   Results
   Measurements
   The ultimately selected weak classifier
5 Conclusions
Appendices
A Structure of B
B OpenCV customizations
   B.1 Calculating the integral image
   B.2 Calculating sparse matrix multiplications
C Generating the g(t) vector directly
Bibliography


List of Figures

1.1 A schematic overview of the implementation
2.1 A Haar-feature applied at an image, containing a license plate
2.2 The basic Haar-feature types, as presented by Viola and Jones
2.3 Detecting with a Haar-feature, visualized on a license plate
2.4 The Integral Image
2.5 The cascade of classifiers
3.1 Making a vector of a Haar-feature
3.2 The extended set of Haar-features as proposed by Lienhart and Maydt
3.3 The probability density functions for the positive and for the negative input set
3.4 Determining the error from the probability density functions
- Example of an ignored Haar-feature with too much error
- The probability density functions for the best weak classifier
- The best Haar-feature, applied on a license plate
- An example of an image out of the positive input-set
A.1 Construction of the B-matrix for features with size height x width = 3 x
C.1 A horizontal Haar-feature and making its corresponding g(t) vector


CHAPTER 1

Introduction

Nowadays, the analysis of images is increasingly common practice. This holds for manipulating images, but also for detecting objects that appear within them. An example of the latter is a photo camera that automatically detects the position of faces, even while the photographer is still focusing to take a nice group photo. Another example is a security system that decides whether a barrier should open for a car by automatically detecting (and recognizing) its license plate, to verify that the car's owner is authorized to enter. This paper describes the process of implementing a (computer) system that automatically detects license plates inside images, using techniques developed by Viola and Jones [9] and optimizing it, i.e. making it faster, with a technique developed, and recently published, by Pham and Cham [6].

1.1 The project

1.1.1 Description

The goal of the project is to use the system of Viola and Jones (described in [9, 7]), which was built around detecting faces at different scales and colors, to detect license plates. Systems that detect (and recognize) license plates (so-called LPR 1 systems) may have already been built, but not yet in a scientific setting, as this project is, where all the details of each step are fully described and the whole process is published and thus publicly available.

1.1.2 Approach

A schematic overview of how the system works and what is to be implemented is given in Figure 1.1. As can be seen, it consists of two parts: training and detecting itself, both of which will be discussed in this paper.

1 License Plate Recognition

Figure 1.1: The implementation of detecting license plates consists of two parts: the training process (input images to weak classifiers) and the detection process (image to "license plate present/not present", possibly starting over with another image).

1.1.3 Used software

For my project, i.e. for the detection part as well as the training part, I used an open source library called OpenCV 2, which is available for Python and (Visual) C++ and was originally developed by Intel. Because working with images goes much faster in C++, I chose that language. The library provides many tools for reading image files and making calculations with them. It offers lots of mathematical and graphical functions (e.g. showing the derivative of an image together with a new red rectangle inside it, in a window), so the user doesn't have to worry about implementing that functionality. For the development environment I used the Eclipse IDE together with a plugin (CDT) that made developing in C++ possible and pleasant. 3 Furthermore, I used the Linux distribution Ubuntu as the operating system to work on.

1.2 Layout

In this paper, I will first describe how a computer program detects license plates (the detection process). Among other things, it does this at different scales: whether one takes a close-up photograph of a license plate or a far-away shot of it, the program detects it and knows its location in both cases, regardless of its measurements within the image. The three concepts that, when combined, make this possible, introduced by Viola and Jones, are discussed in chapter 2. To detect these plates, the program uses data from which it decides whether a license plate is present or not (the weak classifiers from the figure above).
How this data is generated (using Pham and Cham's method), and why this is faster than the algorithm used by Viola and Jones, is discussed in the next chapter, chapter 3, about the training process. After the theory has been discussed, the results of the implementation are presented in chapter 4. Finally, I will end the paper with some conclusions about my project in the last chapter: chapter 5.

2 Open Computer Vision library
3 Eclipse homepage; CDT plugin

CHAPTER 2

Detecting

How does a computer program know whether a given input image has a license plate in it, and where it is located in the image? This chapter explains how the detection process works. The process relies on the three key concepts described by Viola and Jones in [9, 8, 7], all of which are described in the following sections.

2.1 Haar-features

The building blocks of the detection process are the Haar-features (the name originates from the Haar-wavelets 1, though that is not relevant for this paper). To detect certain objects inside an image, we must find a way for a computer program to know what to look for. The Haar-features help us achieve that goal.

Figure 2.1: One particular Haar-feature applied at an image with a license plate. Sum of pixel-values in the white area: 3840. Sum of pixel-values in the black area: 7050. Outcome of this Haar-feature (the feature-value): 7050 - 3840 = 3210.

Such a feature consists of a simple rectangle, which itself consists of two or three subrectangles. Consider Figure 2.1, where we apply one such Haar-feature, consisting of two subrectangles, to a license plate. The definition of a Haar-feature is that we take the sum of the pixel-values 2 lying in the white area and subtract that sum from the sum of the pixel-values lying in the black area. The feature-value in Figure 2.1, for instance, is equal to 3210.

1 As proposed in 1909 by Alfréd Haar, a Hungarian mathematician
2 A pixel-value is simply its gray-value, e.g. ranging from 0-1 for double values or from 0-255 for 8-bit values

Figure 2.2: The basic Haar-features (as presented in [9, 7]) out of which many features can be generated, such as the one applied in Figure 2.1.

A feature can be generated from the basic set of the 4 features shown in Figure 2.2. We can generate many features (as discussed in subsection 3.1.4), just by changing a basic feature's shape (e.g. doubling the width but leaving the height intact, or tripling both the width and the height) and location (keeping the shape, but only changing the top-left location). For example, suppose we scaled the image (Figure 2.1) to a smaller size of 11 x 52 (more about this in subsection 2.1.1); this particular Haar-feature then has its (top-left) location at coordinate (44,1) and its size is 9 x 8. This feature is generated by scaling (9 times its height and 4 times its width) and moving feature A in Figure 2.2.

2.1.1 Detection window

Suppose we have the necessary generated Haar-features (more about this in the next chapter); now we are going to use them in a logical way. We want to detect license plates, and because (Dutch) license plates have a size (ratio) of 11 x 52, we can use this fact to our advantage during detection. We are going to scan the (input) image horizontally and vertically with a detection window of size (ratio) 11 x 52. At the end of this scanning, we have gone over every pixel in the image with our window. At each location of the detection window in the input image, we apply all Haar-features to this (temporary) sub-image after it is normalized 3, so that we calculate all feature-values for this sub-image. We do this to check whether a license plate lies in this sub-image, so that we not only know there is a license plate present in the image, but also where. How we do this is discussed in section 2.3. If no plate is found, we move on with our detection window, as explained, to check whether a license plate is present in the next sub-image.
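To make the feature-value computation from section 2.1 concrete, here is a minimal sketch in C++ (the language used for this project). The `Image` struct and function names are illustrative, not part of the actual implementation; the naive `rectSum` is exactly the expensive per-pixel summation that the Integral Image of section 2.2 later replaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch only: a grayscale image as a row-major pixel grid.
struct Image {
    std::size_t width, height;
    std::vector<int> pixels;  // pixels[y * width + x]
    int at(std::size_t x, std::size_t y) const { return pixels[y * width + x]; }
};

// Naive sum of the pixel-values in the rectangle [x, x+w) x [y, y+h):
// this is the per-pixel loop that the Integral Image later replaces.
long long rectSum(const Image& img, std::size_t x, std::size_t y,
                  std::size_t w, std::size_t h) {
    long long sum = 0;
    for (std::size_t dy = 0; dy < h; ++dy)
        for (std::size_t dx = 0; dx < w; ++dx)
            sum += img.at(x + dx, y + dy);
    return sum;
}

// Two-rectangle Haar-feature: black-area sum minus white-area sum.
long long featureValue(const Image& img,
                       std::size_t wx, std::size_t wy, std::size_t ww, std::size_t wh,
                       std::size_t bx, std::size_t by, std::size_t bw, std::size_t bh) {
    return rectSum(img, bx, by, bw, bh) - rectSum(img, wx, wy, ww, wh);
}
```

For a two-rectangle feature the value is the black-area sum minus the white-area sum, matching the definition above.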
However, a license plate may be bigger than 11 x 52 pixels in the input image (e.g. a close-up or high-resolution photo of a license plate may contain a plate covering 1100 x 5200 pixels). To cover this situation as well, we scan the image again, but this time we increase the size of the detection window by 10% and start the whole scanning process again. We repeat this process, increasing the size by 10% each iteration, until the detection window's size gets bigger than the input image. Together with the increase in size of the detection window, we also increase the size of the Haar-features being applied by the same ratio, as though we were still calculating with an area of 11 x 52 pixels. To make this clearer: suppose we have the input image from Figure 2.1, but note that this image's original size is 319 x 1508 pixels. In the first round or iteration, we scan the image with a detection window of size 11 x 52 pixels. When completed, we start the next round, this time with the detection window having size 12 x 57 pixels. At a certain point, we are in the thirty-second round with the detection window having size 211 x 998 pixels (the original size multiplied by 1.1^31) and we are somewhere in the middle of the scanning process, with the window located at pixel-coordinate (100,60). We apply all (generated) Haar-features and, at some point, we apply the same one as applied in Figure 2.1.

3 So the detection works invariant of the lighting conditions
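The growth of the detection window can be sketched as follows; the names are hypothetical, but the numbers reproduce the example above: with a 319 x 1508 input, round 2 uses a 12 x 57 window and round 32 a 211 x 998 window.

```cpp
#include <cassert>
#include <vector>

struct WindowSize { int height, width; };

// Enumerate the detection-window sizes for one input image: start at the
// base size (11 x 52 for Dutch plates) and grow by 10% per round until the
// window no longer fits inside the image.  Names are illustrative.
std::vector<WindowSize> windowSizes(int imgH, int imgW,
                                    int baseH = 11, int baseW = 52,
                                    double growth = 1.1) {
    std::vector<WindowSize> sizes;
    double scale = 1.0;
    while (static_cast<int>(baseH * scale) <= imgH &&
           static_cast<int>(baseW * scale) <= imgW) {
        sizes.push_back({static_cast<int>(baseH * scale),
                         static_cast<int>(baseW * scale)});
        scale *= growth;  // next round: 10% bigger window
    }
    return sizes;
}
```

At each of these sizes the image is scanned in full, and the Haar-features are scaled along with the window.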

The step just described, i.e. the combination of the detection window's size and location and the particular Haar-feature (with, again, its own size and location, but inside the detection window) being applied, is shown in Figure 2.3. Note that the only difference between Figure 2.1 and Figure 2.3 is the detection window's size (and consequently its location): in Figure 2.1 the size of the detection window is the same as the whole image's size.

Figure 2.3: The detection step almost halfway through the scanning process at the 32nd round: the detection window (green) has size 211 x 998 at coordinate (100,60) and the Haar-feature (red) being applied is the same as in Figure 2.1. The image's size is 319 x 1508 pixels.

2.2 Integral Image

Calculating the sum of pixel-values in an area can be an expensive operation. The easiest way of calculating such a sum is of course to loop through the involved pixels, adding each pixel-value to some variable that holds the sum. This means that we need to access 24 pixels to calculate the sum for an area of 6 x 4 pixels, and the operation gets even more expensive as the area gets bigger. To make this operation much faster and thus more efficient, Viola and Jones introduced the Integral Image representation of images. This is a representation in which each pixel's value is equal to the sum over the rectangle formed from the image's origin (i.e. the top-left pixel, or coordinate (0,0)) to the pixel itself. More formally, with ii(x, y) in the integral image and i(x, y) in the original image:

ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')

To calculate the sum of any rectangular area, we now only need to access the four corner pixels in the integral image, which means we calculate an area's sum in constant time. We can create the integral image in one pass over the original image. With this representation, we can calculate feature-values in constant time and with 9 pixel-value lookups at most (see Figure 2.2, feature D).
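A minimal sketch of this idea in C++ (illustrative names, not the OpenCV implementation): the integral image is built in a single pass, after which any rectangle sum costs four lookups.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch, not the OpenCV implementation.  The table is padded
// with a zero row and column so no boundary checks are needed.
class IntegralImage {
public:
    explicit IntegralImage(const std::vector<std::vector<int>>& img)
        : h_(img.size()), w_(img[0].size()),
          ii_(h_ + 1, std::vector<long long>(w_ + 1, 0)) {
        // One pass: ii(x, y) = sum of all pixels (x', y') with x' <= x, y' <= y.
        for (std::size_t y = 0; y < h_; ++y)
            for (std::size_t x = 0; x < w_; ++x)
                ii_[y + 1][x + 1] = img[y][x] + ii_[y][x + 1]
                                  + ii_[y + 1][x] - ii_[y][x];
    }
    // Sum of the rectangle [x, x+w) x [y, y+h): four lookups, constant time.
    long long rectSum(std::size_t x, std::size_t y,
                      std::size_t w, std::size_t h) const {
        return ii_[y + h][x + w] - ii_[y][x + w] - ii_[y + h][x] + ii_[y][x];
    }
private:
    std::size_t h_, w_;
    std::vector<std::vector<long long>> ii_;
};
```

The cost of `rectSum` no longer depends on the area's size, which is the whole point of the representation.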
Figure 2.4 shows the concept of the Integral Image. For a discussion of the implementation of calculating the integral image in OpenCV, see Appendix B.1.

2.3 Cascade of classifiers

2.3.1 The weak classifier

Now that we have the concept of Haar-features and know how to calculate their outcome, we introduce the weak classifier to give meaning and purpose to this outcome. Such a classifier simply consists of a Haar-feature together with a threshold and a parity. The threshold states a boundary for the feature-value, and the parity decides whether the feature-value should be higher (parity = 1) or lower (parity = -1) than this threshold or boundary. The outcome of a weak classifier is either true (license plate present in the current detection window) or false (license plate not present in the current detection window).
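As an illustration, such a weak classifier can be captured in a few lines of C++ (a hypothetical struct, not the project's actual code):

```cpp
#include <cassert>

// Hypothetical sketch of a weak classifier: a threshold and a parity
// attached to one Haar-feature.  parity = +1 demands a feature-value above
// the threshold, parity = -1 demands one below it.
struct WeakClassifier {
    double threshold;
    int parity;  // +1 or -1
    bool classify(double featureValue) const {
        return parity * featureValue > parity * threshold;
    }
};
```

With the feature-value 3210 from Figure 2.1, a classifier with threshold 3000 and parity +1 accepts; with a letter-free value such as 1550, a parity -1 classifier with threshold 3500 would still (wrongly) accept, which anticipates the false-positive discussion below.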

Figure 2.4: The Integral Image: the value of pixel (x,y) in the Integral Image is equal to the sum of all the pixel-values (up to and including (x,y)) in the area it covers in the original image.

For example, consider again Figure 2.1, where we want to use a weak classifier that uses that Haar-feature. One example of such a weak classifier has a threshold of 3000 and a parity of +1, because we want the outcome of the Haar-feature to be bigger than the threshold of 3000, which is (correctly) the case here. Another example might have a threshold of 3500 and a parity of -1, in which case the classifier would also evaluate to true here. This second classifier is less logical, though: the purpose of this particular classifier is to make sure there is a letter present at the rightmost part of a license plate (the full license plate would lie in the detection window to make the weak classifier's evaluation true). When there is indeed a letter present at this Haar-feature's location, the left side of the feature contains many black pixels. If it contains many white pixels instead, the outcome of the Haar-feature would be, for instance, 1550, in which case the probability that a letter is present in that left part is very small. In that case we want the classifier to evaluate to false. That is why a minimum value (parity of +1) is more suitable here. Stated differently, the first example generates fewer false positives than the second. These weak classifiers will be generated automatically, however, in the training process. This subject is discussed in the next chapter.

2.3.2 The cascade

Following Viola and Jones, we will use many weak classifiers to decide whether a license plate is present in the window, and we will do this with a cascade. Viola and Jones' algorithm solely uses (or evaluates) this cascade to detect a license plate.
The cascade consists of several steps, and each step returns either true or false. When a step returns false, we immediately stop looking for a plate in the current detection window and start over in the next detection window, running the cascade again. When a step returns true, we continue the detection process with the next step. We keep moving forward in the cascade until we have finished the last step. If this step returns true, then every step in the cascade returned true, and we conclude there must be a license plate present in this detection window (or: sub-window). This process is shown in Figure 2.5. One way to implement such a cascade with steps or stages is to divide each stage into several weak classifiers. We must keep in mind that every detection window/sub-image is subjected to at least the cascade's first stage, so we want this stage to reject as many negatives as possible (while maintaining accuracy) while at the same time using as few classifiers as possible. This way, we spend the least CPU time on the negatives, because they get rejected very early, reserving more processing time for further inspecting the sub-images that reach deeper into the cascade and are more likely to contain license plates.

Figure 2.5: Cascade of classifiers: we move on in the cascade until either one step returns false or we have finished processing the last step.

However, we cannot expect all weak classifiers in a stage to return true, even when the stage is processing a positive sub-image. To solve this issue, each stage has its own threshold and every classifier has a weighted coefficient, making it possible (but not required) to make some classifiers more important than others. For example, the first stage may consist of only three weak classifiers, each with the same weighted coefficient of 0.2. To make the first stage require only at least two of the three classifiers to pass, the stage's threshold is set to 0.4. In general, of course, most classifiers should return true in the end. For the interested reader: Viola and Jones suggest several methods to assign weights and thresholds to classifiers and stages. This paper, however, only addresses the issue of training a weak classifier, because of time constraints concerning the Bachelor's project.
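The weighted stages and the early-rejecting cascade can be sketched like this (illustrative C++ with hypothetical names; the example in the test mirrors the 3-classifier, 0.2-weight, 0.4-threshold stage above):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative names throughout; this is a sketch, not the actual code.
struct WeakClassifier {
    double threshold;
    int parity;  // +1 or -1
    bool classify(double v) const { return parity * v > parity * threshold; }
};

// One stage: a weighted vote of its weak classifiers against a stage threshold.
struct Stage {
    std::vector<WeakClassifier> classifiers;
    std::vector<double> weights;  // one coefficient per classifier
    double stageThreshold;
    bool pass(const std::vector<double>& featureValues) const {
        double score = 0.0;
        for (std::size_t i = 0; i < classifiers.size(); ++i)
            if (classifiers[i].classify(featureValues[i]))
                score += weights[i];
        return score >= stageThreshold;
    }
};

// The cascade: reject as soon as one stage fails, accept only if all pass.
bool runCascade(const std::vector<Stage>& stages,
                const std::vector<std::vector<double>>& valuesPerStage) {
    for (std::size_t s = 0; s < stages.size(); ++s)
        if (!stages[s].pass(valuesPerStage[s]))
            return false;  // early rejection of this sub-window
    return true;           // every stage returned true
}
```

The early `return false` is what keeps negatives cheap: most sub-windows never get past the first stage.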


CHAPTER 3

Training

Before we can use the detection mechanism just explained, we must prepare the data for it. We must know which of the many Haar-features to be generated we should use, and which thresholds and parities we must choose to build the best weak classifiers. Ultimately we must also know how to divide the weak classifiers over the different stages of the cascade, although not all of these steps are covered in this paper. Choosing the threshold and parity for a Haar-feature, so that we can make the best (i.e. lowest error-rate) weak classifier out of it, is called training. This chapter explains this process using statistics, as described by Pham and Cham in [6].

3.1 Preparation

Before we can actually apply the theory in [6], we first need to prepare the data we will be working with and analyse some linear relationships. These points are discussed in this section.

3.1.1 Generating training-set of images

To train the classifiers, we first need a training-set that contains both positive and negative images. For the positive images, I was able to use a dataset from DACOLIAN B.V., a Dutch company specialized in the recognition of license plates. This dataset contained photographs taken by cameras hanging above the highway in The Netherlands. However, to use this dataset for training, I needed metadata that specified the exact size and location of the license plate in each image; otherwise, the program would not know which pixels to analyse, so that it can later do the detection job properly. To generate this metadata, I used a tool called mr. Tag, built by the authors of [3] (one of whom provided it to me), to tag 1000 images. For the negative images, I found a data source 1 from which I filtered out the pictures containing license plates.
When I processed this dataset, I extracted from each negative image the biggest rectangle of ratio 11:52 that still fits, centered in the image, so that I could directly use this set for generating statistics, as explained in the upcoming sections.

3.1.2 Gathering training-data

First, by a classifier's error-rate we mean how many false negatives (rejecting a license plate) and false positives (accepting a sub-image with no license plate) it generates. To minimize this rate, we must carefully choose the right threshold (and parity) for a Haar-feature. We do this by using statistics of the training-set. We start by gathering the training-data, or: examples. What we do is take our training-set

that consists of positive and negative images and process them. For each image, after normalizing it 2, we get a matrix of pixel-values back from OpenCV. We vectorise this matrix for calculation purposes, by walking over every row in the matrix and appending its pixels to a vector. Because we will be using a detection window of size 11 x 52 (or: a found license plate will lie in a rectangle of this size, proportionally), the matrix from the input image has this same size, which leads to a size of (11 * 52 =) 572 x 1 for the vector. We store all these vectors in two arrays: one for the positive training-set and one for the negative training-set. Once we have these two arrays, we have made enough preparation to start calculating the statistics, but first let us analyse some linear relationships.

3.1.3 Linear relationships

We know that Haar-features differ in location and size (and of course in which basic feature-type they are constructed from). Recall the Haar-feature already considered in Figure 2.1 and Figure 2.3. We now consider a Haar-feature as a rectangle that has the same size as the detection window. The pixels in the white subrectangle have a value of -1, the pixels in the black subrectangle have a value of +1, and the remaining pixels, which do not belong to a subrectangle, have a value of 0. With h as the vectorised Haar-feature (or: Haar-vector), the construction of h for the mentioned (and downscaled) Haar-feature is shown in Figure 3.1.

Figure 3.1: Making a vector of a Haar-feature: (a) the Haar-feature, (b) its matrix form, (c) its vectorised form h.

During the detection process, we can apply a Haar-vector h to a vectorised (sub-)image x to calculate the feature-value v, with:

v = h^T x    (3.1)

But we will be working with integral (sub-)images for speedup, so we must rewrite h in order to calculate v for such an integral (sub-)image.
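A small C++ sketch of Figure 3.1 and Equation 3.1 (illustrative names): build h with -1/+1/0 entries and take the inner product with the vectorised image.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of Figure 3.1 (illustrative names): the Haar-vector h has -1 on the
// white subrectangle, +1 on the black one and 0 elsewhere, flattened row by
// row exactly like the vectorised image.
std::vector<int> haarVector(std::size_t winH, std::size_t winW,
                            std::size_t wx, std::size_t wy, std::size_t ww, std::size_t wh,
                            std::size_t bx, std::size_t by, std::size_t bw, std::size_t bh) {
    std::vector<int> h(winH * winW, 0);
    for (std::size_t y = wy; y < wy + wh; ++y)
        for (std::size_t x = wx; x < wx + ww; ++x)
            h[y * winW + x] = -1;  // white area
    for (std::size_t y = by; y < by + bh; ++y)
        for (std::size_t x = bx; x < bx + bw; ++x)
            h[y * winW + x] = +1;  // black area
    return h;
}

// Equation 3.1: the feature-value as an inner product, v = h^T x.
long long featureValue(const std::vector<int>& h, const std::vector<int>& x) {
    long long v = 0;
    for (std::size_t i = 0; i < h.size(); ++i)
        v += static_cast<long long>(h[i]) * x[i];
    return v;
}
```

For an 11 x 52 window both vectors have 572 entries, matching the vectorisation described above.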
To make an integral image of an original image, we use a transformation matrix B, which is constant. If x is the original image, we make the integral image y of it with:

y = Bx    (3.2)

Following from Equation 3.1 and Equation 3.2, we can now make a Haar-vector g that can be applied to an integral image y to calculate the feature-value v, as shown in Equation 3.3.

2 Again, to be invariant of lighting conditions

v = h^T x = h^T (B^-1 y) = g^T y,  with  g = (B^-1)^T h    (3.3)

For the exact structure of B, see Appendix A. OpenCV, however, didn't provide enough functionality for sparse matrix or vector multiplication to benefit from using these kinds of matrices or vectors. See Appendix B.2 for more information about the adjustments I made to OpenCV.

3.1.4 Generating Haar-features

To improve accuracy, we will use more features than the original 4 basic features introduced by Viola and Jones. Lienhart and Maydt proposed an extended set of Haar-features in [5], shown in Figure 3.2.

Figure 3.2: The extended set of Haar-features as proposed by Lienhart and Maydt

For my project, I only used the 7 feature-types 1a-b, 2a-d and 3a. To use the other feature-types, I'd have to use diagonal integral images, which are not discussed in this paper. To generate all possible Haar-features, for each basic feature-type I simply looped over all possible sizes, and for each new size I also looped over all possible locations inside a window of 11 x 52. Looping over all combinations of sizes and locations for these 7 basic features yielded a large number of different Haar-features from which to make a weak classifier and ultimately choose the best one.

3.2 Using statistics to train a weak classifier

Conventional methods, as used in [7, 8, 9, 5, 4], have a complexity of O(NT log N), where N is the number of examples and T is the number of Haar-features (in our case the number generated above). In short, the traditional method applies each of the T Haar-features to all N input examples, so that for each feature we get back N different feature-values. The best feature-threshold for every feature is then determined from all these different outcomes, and its parity from whether the outcome should be true (for a positive example) or false (for a negative example).
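As an aside, the relationship v = h^T x = g^T y can be checked numerically without ever forming B: each weighted subrectangle of the feature contributes its weight at up to four corner positions of g, which is the idea behind generating g directly (cf. Appendix C). A hedged C++ sketch with hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Integral image flattened like the input: y[r * w + c] is the sum of all
// pixels (c', r') with c' <= c and r' <= r.
std::vector<long long> integralVector(const std::vector<int>& x,
                                      std::size_t h, std::size_t w) {
    std::vector<long long> y(h * w, 0);
    for (std::size_t r = 0; r < h; ++r)
        for (std::size_t c = 0; c < w; ++c) {
            long long up   = r > 0 ? y[(r - 1) * w + c] : 0;
            long long left = c > 0 ? y[r * w + (c - 1)] : 0;
            long long diag = (r > 0 && c > 0) ? y[(r - 1) * w + (c - 1)] : 0;
            y[r * w + c] = x[r * w + c] + up + left - diag;
        }
    return y;
}

// Each weighted subrectangle of a Haar-feature contributes its weight at up
// to four corner positions of g, so g is very sparse.
void addRect(std::vector<long long>& g, std::size_t w, long long weight,
             std::size_t x0, std::size_t y0, std::size_t rw, std::size_t rh) {
    std::size_t x1 = x0 + rw - 1, y1 = y0 + rh - 1;
    g[y1 * w + x1] += weight;
    if (x0 > 0) g[y1 * w + (x0 - 1)] -= weight;
    if (y0 > 0) g[(y0 - 1) * w + x1] -= weight;
    if (x0 > 0 && y0 > 0) g[(y0 - 1) * w + (x0 - 1)] += weight;
}
```

With the white rectangle added with weight -1 and the black one with weight +1, the inner product g^T y equals h^T x on the original image, while g has only a handful of nonzero entries.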
This traditional training process can go a lot faster using statistics of v and y over the examples, because then we break up the NT factor, achieving a complexity of O(Nd^2 + T) 3, where d is the number of pixels in the detection window (in our case: 11 * 52 = 572).

3 According to [6], achieving a training time of 5.74s instead of >10m for a weak classifier
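The statistics-based shortcut can be sketched as follows (illustrative C++; the exact per-feature formulas are derived in the next section): gather the mean vector and covariance matrix once, then obtain each feature's mean and variance from them without touching the examples again.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative names; dense O(d^2) matrices for clarity.
struct Stats {
    std::vector<double> mean;              // m_y
    std::vector<std::vector<double>> cov;  // Sigma_y
};

// One scan over the examples: accumulate <y> and <y y^T>, then subtract
// m_y m_y^T to obtain the covariance (one input set of Algorithm 3.1).
Stats gatherStats(const std::vector<std::vector<double>>& ys) {
    const std::size_t n = ys.size(), d = ys[0].size();
    Stats s{std::vector<double>(d, 0.0),
            std::vector<std::vector<double>>(d, std::vector<double>(d, 0.0))};
    for (const auto& y : ys)
        for (std::size_t i = 0; i < d; ++i) {
            s.mean[i] += y[i] / n;
            for (std::size_t j = 0; j < d; ++j)
                s.cov[i][j] += y[i] * y[j] / n;  // still <y y^T> here
        }
    for (std::size_t i = 0; i < d; ++i)
        for (std::size_t j = 0; j < d; ++j)
            s.cov[i][j] -= s.mean[i] * s.mean[j];
    return s;
}

// Per-feature statistics without revisiting the examples:
// mu = m_y^T g  and  sigma^2 = g^T Sigma_y g.
double featureMean(const Stats& s, const std::vector<double>& g) {
    double mu = 0.0;
    for (std::size_t i = 0; i < g.size(); ++i) mu += s.mean[i] * g[i];
    return mu;
}

double featureVariance(const Stats& s, const std::vector<double>& g) {
    double var = 0.0;
    for (std::size_t i = 0; i < g.size(); ++i)
        for (std::size_t j = 0; j < g.size(); ++j)
            var += g[i] * s.cov[i][j] * g[j];
    return var;
}
```

The mean and variance obtained this way equal the mean and (population) variance of the feature-values g^T y computed directly over the examples, which is exactly why the per-example training loop can be dropped.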

Suppose µ and σ² are the mean and variance of v. Furthermore, write <·> for the expected value of a random variable or vector. Then, deriving from the linear relationships above, together with the understanding that we are considering one Haar-feature (namely the t-th), we get (citing the formulas from [6]):

µ^(t) = <v^(t)> = <y>^T g^(t) = m_y^T g^(t)

where m_y = <y> is the mean vector of y, and:

σ²^(t) = <v^(t)²> - <v^(t)>²
       = g^(t)T <y y^T> g^(t) - g^(t)T m_y m_y^T g^(t)
       = g^(t)T (<y y^T> - <y><y>^T) g^(t)
       = g^(t)T Σ_y g^(t)

where Σ_y = <y y^T> - <y><y>^T is the covariance matrix of y. This means that for every feature, instead of applying it to all N examples to determine which threshold (and parity) causes the lowest error-rate, we can simply determine this from µ and σ². This is clarified further below.

3.2.1 Global and local statistics

In the implemented algorithm, we first need to gather the global statistics over our examples, as just discussed. However, we need these statistics separately for the positive and for the negative examples. So, for our test-data, we create m_p and m_n, and Σ_yp and Σ_yn, for the positives and negatives respectively. This can be done in one scan over the input-set, as shown in Algorithm 3.1.

Algorithm 3.1 Calculating positive and negative global statistics over input examples in O(Nd²)
Require: y^(n) for n = 1..N, where y^(n) is the integral image of the n-th input example.
// For the set of i positive y^(n)'s:
Compute m_p = (1/i) Σ_n y^(n)
Compute Σ_yp = ((1/i) Σ_n y^(n) y^(n)T) - m_p m_p^T
// For the set of j negative y^(n)'s:
Compute m_n = (1/j) Σ_n y^(n)
Compute Σ_yn = ((1/j) Σ_n y^(n) y^(n)T) - m_n m_n^T

Now, while looping over all the generated Haar-features, what we do in each step is calculate the expected feature-value and the variance of this feature-value over the test-set, again for the positive and the negative examples. These steps are shown in Algorithm 3.2.
Algorithm 3.2 Calculating local statistics for each Haar-feature in O(1) each (O(T) in total)
Require: the positive and negative global statistics m_p, m_n and Σ_yp, Σ_yn
foreach feature t do
   Compute µ_p^(t) = m_p^T g^(t)
   Compute µ_n^(t) = m_n^T g^(t)
   Compute σ²_p^(t) = g^(t)T Σ_yp g^(t)
   Compute σ²_n^(t) = g^(t)T Σ_yn g^(t)
   ...
end

We assume that the feature-values over the examples follow a Gaussian or normal distribution. This means that for each feature we can make a probability density function (PDF)

with µ and σ² for the distribution over the positive and over the negative examples. With these two functions we can choose the threshold and parity that lead to the lowest error-rate. Figure 3.3 shows this clearly for an imaginary Haar-feature.

Figure 3.3: Two probability density functions: the left/blue one, for the negatives, is created with parameters µ_n and σ²_n; the right/pink one, for the positives, is created with parameters µ_p and σ²_p.

As we can see from the picture, we choose the best threshold for our classifier at the value where the two probability density functions cross. Since a weak classifier accepts or denies a sub-image based on the feature-value obtained from it, we must also know whether the feature-value should be greater or smaller than this threshold. The picture shows that in this specific case the feature-value should be greater than the threshold: our test-set pointed out that when we apply this Haar-feature to all the examples, by far most of the positives lead to a feature-value higher than the threshold (although not all of them, as discussed below). That the feature-value should be bigger than this threshold gives a parity of +1 for our threshold, which, together with the threshold itself, makes our newly generated weak classifier complete. One might ask why we don't just choose the leftmost point of the right "mountain" in the picture as the threshold; then we would accept all of the positives! The reason is that we not only want to accept as many positives as possible, but at the same time want to reject as many negatives as possible. Rejecting negatives is an important part of the detection process, since the speed of the detection process relies on it: if we did not reject many negatives early, we would unnecessarily spend much time processing those sub-images (referring to subsection 2.3.2). In Algorithm 3.2, we determine the threshold (the line that reads ...
), by calculating where the two probability density functions cross. A probability density function is given by

    f(x) = 1 / (σ √(2π)) · e^( −(x − µ)² / (2σ²) )    (3.4)

so if we fill in the values for the positives and the negatives and equate the two functions, we obtain the desired threshold. Pham and Cham mention a closed-form solution to this problem,

referring to [1]. However, I have not found it, so I implemented the solution simply by using the Newton-Raphson method. That this is not a closed-form solution, but iterates until the found value suffices, thankfully does not incur much slowdown (a maximum of only 5, maybe 6 iterations is needed). Now that we can determine the best classifier for each Haar-feature, we still need to choose the one with the lowest error-rate.

Choosing the best weak classifier

Once we have generated as many weak classifiers as there are Haar-features, each with its own threshold and parity, we need to choose the one with the lowest error-rate. In the picture above, the error-rate is directly visible as the full area where the two functions overlap: the complete purple, middle area. Therefore, the classifier for which this area is smallest must be chosen as the ("best") weak classifier. To compute this value we use the integrals of the two functions. We first need to calculate where the functions reach the x-axis, the zero-points, which comes down to solving the equations shown in Figure 3.4, derived from the PDF equation (Equation 3.4). In theory the PDFs never reach 0, so to evaluate these equations we do not set the formulas equal to 0, but to a value close to it (hence the ≈-sign). With these results we can determine the false positive and the false negative rate, which, summed, form the error-rate. For instance, the area that lies left of the threshold and under the positive PDF (the pink "mountain") covers the false negatives: it represents the fraction of sub-images that will be rejected by this classifier even though they contain a license plate. Likewise, the area right of the threshold that is still under the negative PDF (the blue "mountain") represents the false positives.
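The Newton-Raphson iteration described above, finding where the two Gaussian PDFs cross, can be sketched as follows. This is a minimal illustration, not the project's code: the function names and the starting point halfway between the means are my own choices.

```python
import math

def gauss_pdf(x, mu, var):
    """Gaussian PDF with mean mu and variance var (Equation 3.4)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def find_threshold(mu_p, var_p, mu_n, var_n, tol=1e-9, max_iter=50):
    """Newton-Raphson on d(x) = pdf_p(x) - pdf_n(x), whose root is the
    PDF intersection point used as the weak classifier's threshold.

    d'(x) follows from pdf'(x) = -pdf(x) * (x - mu) / var.
    We start midway between the two means, where the curves typically cross.
    """
    x = 0.5 * (mu_p + mu_n)
    for _ in range(max_iter):
        fp = gauss_pdf(x, mu_p, var_p)
        fn = gauss_pdf(x, mu_n, var_n)
        d = fp - fn
        dprime = -fp * (x - mu_p) / var_p + fn * (x - mu_n) / var_n
        if abs(dprime) < 1e-300:          # flat spot: give up safely
            break
        step = d / dprime
        x -= step
        if abs(step) < tol:               # converged
            break
    return x
```

With equal variances the intersection lies exactly midway between the means, which is a handy sanity check; the few iterations needed in practice match the 5 or 6 mentioned above.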
If we calculate the zero-points z_p and z_n for the positives and negatives respectively, and we write the threshold-value as θ, we can calculate the error-rate ǫ as the sum of the false positive and false negative rate, as follows:

    ǫ = ∫_θ^{z_n} 1/(σ_n √(2π)) · e^( −(x − µ_n)² / (2σ_n²) ) dx  +  ∫_{z_p}^{θ} 1/(σ_p √(2π)) · e^( −(x − µ_p)² / (2σ_p²) ) dx

Figure 3.4: Determining the error by summing the false positives (right of the threshold, in the purple intersection) and the false negatives (left of the threshold, in the purple intersection)
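Because each term in the formula above is Gaussian probability mass between two points, the error can also be evaluated with the error function instead of numerical integration. The sketch below is illustrative (the function names are mine); it uses the 3σ zero-points that Section 4.2 adopts.

```python
import math

def gauss_cdf(x, mu, var):
    """P(X <= x) for X ~ N(mu, var), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2 * var)))

def error_rate(theta, mu_p, var_p, mu_n, var_n):
    """epsilon from Figure 3.4 as two CDF differences.

    False positives: negative-PDF mass between theta and z_n = mu_n + 3*sigma_n.
    False negatives: positive-PDF mass between z_p = mu_p - 3*sigma_p and theta.
    """
    z_n = mu_n + 3 * math.sqrt(var_n)
    z_p = mu_p - 3 * math.sqrt(var_p)
    fp = gauss_cdf(z_n, mu_n, var_n) - gauss_cdf(theta, mu_n, var_n)
    fn = gauss_cdf(theta, mu_p, var_p) - gauss_cdf(z_p, mu_p, var_p)
    return fp + fn
```

For two unit-variance Gaussians at 0 and 4 with θ = 2, both tails contribute equally and the error comes out at roughly 4%, a useful symmetry check.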

After we have constructed every possible weak classifier, each belonging to one Haar-feature, and determined all their error-rates, we simply choose the weak classifier with the lowest error-rate.

Final Note

Although not implemented, because it lay outside the time available for this Bachelor's project, for the sake of completeness I will briefly describe how one would build each stage in the cascade on top of weak classifiers (see subsection 2.3.2). The general idea is that a stage, also called a strong classifier, performs better than a weak classifier. This is achieved by using the AdaBoost⁴ algorithm. The key idea is that when a weak classifier has been trained, it is appended to the strong classifier under construction. Then a new weak classifier is trained, but the training-set is now weighted with different values, i.e. Algorithms 3.1 and 3.2 now involve a weighting-variable in each calculation. After each weak classifier is trained, the weights are redistributed as follows: the input examples that were misclassified (e.g. false positives) carry a heavier weight than the correctly classified examples, with emphasis on the false negatives. So each new weak classifier learns from its predecessor, and the resulting strong classifier performs relatively very well. The training of new weak classifiers for a strong classifier continues until a certain (low) false acceptance rate and a certain (also low) false rejection rate are achieved, just as the training of strong classifiers for the cascade continues until the FAR can no longer be reduced by a certain amount while maintaining some (very high, e.g. 99.9%) detection rate. With this in mind, note that this paper only addresses the creation of a weak classifier.
Implementing the theory above, on top of the theory implemented in this project, would give us a fully functional cascade of classifiers with which we can detect license plates (or other objects, as long as the input examples contain the desired objects).

⁴ from "Adaptive Boosting"; there are various variants (not discussed): Gentle, Real and Discrete AdaBoost
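The re-weighting step sketched in the Final Note can be written down compactly. Below is an illustrative, symmetric Discrete-AdaBoost update (the ±1 label convention and all names are my assumptions, not the project's code; the extra emphasis on false negatives mentioned above is omitted here). Misclassified examples end up with larger weights.

```python
import math

def adaboost_round(weights, labels, predictions):
    """One Discrete-AdaBoost re-weighting step (assumes 0 < error < 1).

    weights     : current example weights (summing to 1)
    labels      : true classes, +1 (license plate) or -1 (background)
    predictions : the freshly trained weak classifier's outputs, +1 or -1
    Returns (alpha, new_weights): the classifier's vote weight in the stage
    and the re-normalised example weights, heavier where misclassified.
    """
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1 - err) / err)
    # e^{-alpha*y*h} shrinks correct examples (y*h = +1), grows wrong ones.
    new = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, labels, predictions)]
    total = sum(new)
    return alpha, [w / total for w in new]
```

With four equally weighted examples and one mistake, the mistake alone ends up carrying half the total weight, which is exactly how the next weak classifier is forced to "learn from its predecessor".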


CHAPTER 4

Results

Now that we know the theory behind generating the best weak classifier, it is time to implement it and see what the resulting classifier looks like. In this chapter, the results of implementing the theory, and the pitfalls encountered while doing so, are discussed.

4.1 Issues

The most important feature of Pham and Cham's algorithm is its speedup in training classifiers, achieved by using statistics. However, to get the statistics for each Haar-feature that we generate, we must also generate the g^(t) vector. We calculate this vector with g^(t) = B^(−1)T h^(t) (Equation 3.3). But B is a big matrix of size 11·52 x 11·52 = 572 x 572! This means we would multiply this big, dense matrix with the dense vector h^(t), and do so tens of thousands of times. With that, we would lose exactly the advantage we want from using statistics: speed! I overcame this issue by generating g^(t) directly, instead of generating h^(t) each iteration and calculating the desired g^(t) from it. Because we do this directly, we do not even need the B matrix anymore. The speedups I saw from directly generating this vector went up to a factor of 300 (from 6000 µs to 20 µs for processing one Haar-feature)! I wonder how Pham and Cham computed this vector; they may not have had this speed issue because they used another (highly optimized) package for their linear algebra tasks: GotoBLAS [2]. How we generate the g^(t) vector directly is discussed in Appendix C.

4.2 Improving while implementing

When implementing the task of finding the best threshold for a Haar-feature (using the Newton-Raphson method), I inspected the computed statistics of some of the generated features. One thing I noticed was that in many cases the positive and negative PDFs have a large overlapping area. This means that the error would be very high, so we can simply ignore the feature in such a case, because it certainly would not yield the weak classifier we are looking for.
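Appendix C describes how this project builds g^(t) directly; the underlying observation is a standard one: applied to an integral image, a rectangle sum needs only its (up to) four corner coefficients, so each rectangle of a Haar-feature contributes at most four non-zero entries to g^(t). The sketch below illustrates that idea; the row-major vectorisation with W columns is an assumption about the layout, not taken from the project's code.

```python
def add_rectangle(g, x, y, w, h, weight, W):
    """Accumulate one weighted rectangle of a Haar-feature into g.

    g acts on a vectorised integral image (row-major, W columns).
    The rectangle spans columns x..x+w-1 and rows y..y+h-1; its pixel sum is
      ii(x+w-1, y+h-1) - ii(x-1, y+h-1) - ii(x+w-1, y-1) + ii(x-1, y-1)
    so only up to four entries of g are touched: O(1) per rectangle.
    """
    def idx(col, row):
        return row * W + col
    g[idx(x + w - 1, y + h - 1)] += weight
    if x > 0:
        g[idx(x - 1, y + h - 1)] -= weight
    if y > 0:
        g[idx(x + w - 1, y - 1)] -= weight
    if x > 0 and y > 0:
        g[idx(x - 1, y - 1)] += weight

# Usage: g = [0.0] * (width * height); call add_rectangle once per
# subrectangle of the Haar-feature, with its +1 or -1 weight.
```

Since a Haar-feature has only a handful of rectangles, the resulting g^(t) is extremely sparse, which is why constructing it directly beats the dense 572 x 572 multiplication.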
A simple check whether the found optimal threshold (i.e. the intersection point) lies between the two means (µ_p^(t) and µ_n^(t)) suffices to decide whether we can ignore the feature. A case in which we can ignore the feature is shown in Figure 4.1. The advantage of this check is that we skip the error calculation, i.e. the integral computation and the determination of the zero-points shown in Figure 3.4, and so speed the process up a little further.

Another small improvement lies in the way the program calculates the zero-points of the two probability density functions. The empirical rule states that 99.7% of the values lie within 3 standard deviations of the mean. The program now simply calculates z_p and z_n as

Figure 4.1: The intersection point, i.e. our found threshold, does not lie between the two means. We can then simply skip this Haar-feature, which is justified by noticing that the purple intersection area, i.e. the error-rate, is very large.

the value that lies 3 σ_p from µ_p and 3 σ_n from µ_n, respectively. This way we do not have to solve the equations given in Figure 3.4.

4.3 Results

Measurements

Timing¹ the different processes during training gave the following measurements (averaged over 10 successive runs):

Task                                                  Time
Calculating global statistics                         9.02 seconds
Calculating Σ_y{p,n}                                  8.98 seconds
Training all classifiers and selecting the best one   3.08 seconds
Total training time for a weak classifier             … seconds

Table 4.1: Time measurements for the implemented algorithms

These measurements were taken on an AMD Athlon 64 X2 processor running at 2.2 GHz, using only one of its two cores, with 2 GB of memory. I used a total of 1000 positive and 750 negative samples to train on. As can be seen, calculating the covariance matrix costs a significant amount of time. This can be explained by the fact that this matrix needs 1000 matrix-matrix calculations for the positive variant and 750 for the negative variant of Σ_y, where each calculation multiplies sizes 572 x 1 by 1 x 572. The benefit of this computationally intensive preparation work, however, is that we then process tens of thousands of Haar-classifiers in just about 3 seconds! This means that we can strongly extend the set of Haar-feature types to achieve better accuracy, while suffering minimally from an increase in training time.

¹ using OpenCV's own cvGetTickCount() and cvGetTickFrequency(), from which the number of microseconds between two ticks can be measured
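To make the cost pattern in Table 4.1 concrete, here is a small sketch of the two phases, in NumPy rather than the project's OpenCV code and with illustrative sizes: the expensive part builds Σ_y from one D x 1 times 1 x D outer product per example, after which every feature's µ and σ² come from cheap projections (Algorithm 3.2).

```python
import numpy as np

def global_statistics(Y):
    """Y: one vectorised integral image per row (N x D).  Returns the mean
    vector m and covariance Sigma_y, built from one D x 1 times 1 x D outer
    product per example: the expensive step from Table 4.1."""
    m = Y.mean(axis=0)
    sigma = np.zeros((Y.shape[1], Y.shape[1]))
    for y in Y:
        d = y - m
        sigma += np.outer(d, d)     # the 572 x 1 times 1 x 572 product
    return m, sigma / len(Y)

def feature_statistics(g, m, sigma):
    """Algorithm 3.2: project the global statistics onto one feature vector."""
    return m @ g, g @ sigma @ g     # mu^(t) and sigma^2(t)

# Illustrative check against direct per-example evaluation (D = 12, not 572):
rng = np.random.default_rng(1)
Y = rng.normal(size=(50, 12))
g = rng.normal(size=12)
m, sigma = global_statistics(Y)
mu, var = feature_statistics(g, m, sigma)
vals = Y @ g                        # feature value on every example
assert np.isclose(mu, vals.mean()) and np.isclose(var, vals.var())
```

The check at the end is exactly why the statistics pay off: the projection gives the same µ and σ² as evaluating the feature on every example, but without touching the examples again.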

In comparison with the original algorithm of Viola and Jones: it took them more than 10 minutes to train a weak classifier, even though they were using only … Haar-features². Although this was not tested on my dataset, the difference nonetheless speaks for itself.

The ultimately selected weak classifier

Statistics

When the implemented training program finished, it had gone over all generated features, constructed the best classifier out of each of them, and selected the best one. First, let us look at the probability density functions belonging to it:

Figure 4.2: The PDFs for the best weak classifier. Note that the purple intersection area is quite small.

And the error that belongs to it:

    ǫ = ∫_θ^{z_n} 1/(σ_n √(2π)) · e^( −(x + 332.8)² / (2σ_n²) ) dx  +  ∫_{z_p}^{θ} 1/(σ_p √(2π)) · e^( −(x − µ_p)² / (2σ_p²) ) dx = 0.0424

where σ_n² = … and σ_p² = …. So the lowest possible percentage misclassified by a weak classifier is 4.24%.

² as reported in [6]

Haar-feature

The Haar-feature itself, as part of the best weak classifier, is shown below in Figure 4.3. It consists of two subrectangles of size 2 x 37: the first at location (2,0) with weight −1, the second at location (2,2) with weight +1.

Figure 4.3: The best Haar-feature, applied to a license plate.

Recall that the feature-value is calculated as black minus white. Looking at the threshold of 4034 and the parity of +1, the classifier apparently focuses on the fact that the top border must contain more black pixels than the area beneath it, despite the upper parts of the letters. To be exact: the difference must be at least the equivalent of almost 16 fully white pixels (4034/255 ≈ 15.8). To understand this classifier better, however, let us look at a positive example the program was trained on: Figure 4.4. As we see, the input-set does not contain perfect examples like the one shown above. Now the selected classifier makes a little more sense, since the top border really does contain many black pixels, and it can also be applied in practical cases (e.g. low-quality photographs taken on a highway, as in this example).

Figure 4.4: An example of an image from the positive input-set

As might be expected, the second-best classifier looks very much like the best weak classifier: it has the same location (2,0), also consists of two subrectangles with the same weights, and has almost the same error: 4.29%. The only difference is its size: 2 x 38, where the best classifier is 2 x 37. So the first few best classifiers probably all focus on the black top border of the license plate.
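Tying the numbers together: with feature-value f(x) ("black minus white"), threshold θ = 4034 and parity p = +1, the weak classifier's decision is a single comparison. A sketch (the function name is mine; whether the boundary value itself is accepted is a convention choice):

```python
def weak_classify(feature_value, theta=4034.0, parity=1):
    """h(x) = 1 (license plate) if parity * f(x) > parity * theta, else 0.

    For the selected classifier (theta = 4034, parity = +1) this accepts a
    sub-image when "black minus white" exceeds 4034, i.e. roughly 16 fully
    white pixels' worth of difference (4034 / 255 ~ 15.8).  A parity of -1
    would flip the comparison for features where positives score low.
    """
    return 1 if parity * feature_value > parity * theta else 0
```

So for the best Haar-feature, a sub-image whose top-border difference comes out at, say, 5000 is accepted, while one at 1000 is rejected.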

CHAPTER 5

Conclusions

The implemented algorithm by Pham and Cham brings some important improvements. First, because of the significant reduction in the training time of a weak classifier, it becomes more attractive for others to experiment with solutions they come up with: they no longer have to wait for days or weeks before they have fully built a cascade of classifiers and can test their ideas. Second, because a very large increase in the number of Haar-features causes only a very small increase in training time, one can extend the already extended set of Haar-features by Lienhart even further, to improve accuracy.

It should be noted that this paper only discusses the detection of (Dutch) license plates. It is possible, with very little change, to detect other objects, e.g. faces; the only thing that needs to change is the training-set. Positive images should then contain faces, while negative images certainly should not. For ease of reading, however, this paper addresses one concrete example: license plates.

Another point worth noting is that the images used in the detection process should be similar to the images used during the training process. This applies to the quality of the images, but also, for instance, to their skewness. In this paper, for example, the classifiers are trained on photographs taken by cameras that hang above a highway. This causes a certain skew in these photographs, which makes them dissimilar to photographs taken straight in front of a license plate. So these two kinds of photographs should not be mixed between the detection and the training process. In short: consistency should be kept in mind during both detection and training.

Furthermore, this paper addresses the issue of creating a weak classifier.
To obtain a fully functional cascade of classifiers, the next step would be to form a stage out of different weak classifiers. Within a stage each classifier has its own weight, and each classifier may be linked to another classifier. The final cascade is then constructed from different stages, where each stage has its own threshold. Among others, Viola and Jones present methods to achieve this. My project did not address this subject because of the time constraints of the Bachelor's project.

Lastly, the results showed that the best weak classifier looks much like the second-best weak classifier. An interesting line of research might be whether there is room for improvement in grouping similar classifiers (i.e. those with similar Haar-features) together. A stage formed this way might pay attention to more distinct features of a license plate (or any other object), and in the best case the number of classifiers in the final cascade would be reduced because of it.

The detection task was already optimized greatly by Viola and Jones, to the point that nowadays even digital cameras use it to automatically detect faces while focusing. Now that the training task is being optimized further as well, I hope to see interesting developments both in research and in practical applications.

Appendices


APPENDIX A

Structure of B

The structure of the transformation matrix B, which transforms a vectorised original image into its integral variant, depends on the feature-size we work with. In this paper we work with a size of 11 x 52, but for simplicity we assume here a feature-size of 3 x 4.

B always consists of small matrices m and n of the same size, where n contains only 0's and m is the lower-triangular matrix of 1's. For the feature-size of 3 x 4, the m and n matrices have size feature-width x feature-width = 4 x 4:

    m = [ 1 0 0 0 ]        n = [ 0 0 0 0 ]
        [ 1 1 0 0 ]            [ 0 0 0 0 ]
        [ 1 1 1 0 ]            [ 0 0 0 0 ]
        [ 1 1 1 1 ]            [ 0 0 0 0 ]

and B itself consists of feature-height x feature-height = 3 x 3 of these little matrices, repeated in the same lower-triangular pattern:

    B = [ m n n ]
        [ m m n ]
        [ m m m ]

Figure A.1: The B-matrix for features with size height x width = 3 x 4

As an exercise, the reader may verify that the B matrix is correct by making a small Haar-feature of size 3 x 4, vectorising it, constructing its integral image by hand, and checking that the result equals the integral image obtained by multiplying the original vectorised Haar-feature by B, as given above.
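The exercise above can also be checked mechanically. The block structure of B is exactly a Kronecker product of lower-triangular all-ones matrices; that kron formulation is my own phrasing of the figure, and NumPy's row-major ravel() is assumed to match the vectorisation used here.

```python
import numpy as np

def make_B(height, width):
    """B for height x width features: maps a row-major vectorised image to
    its vectorised integral image.  m = L_width (lower-triangular 1's) and
    n = 0, repeated in the lower-triangular block pattern of Figure A.1,
    which is exactly the Kronecker product L_height (x) L_width."""
    L_h = np.tril(np.ones((height, height)))
    L_w = np.tril(np.ones((width, width)))
    return np.kron(L_h, L_w)

# The exercise: integral image "by hand" (cumulative sums along both axes)
# versus B times the vectorised image, for a 3 x 4 feature-sized example.
img = np.arange(12.0).reshape(3, 4)
ii = img.cumsum(axis=0).cumsum(axis=1)
assert np.allclose(make_B(3, 4) @ img.ravel(), ii.ravel())
```

The kron identity (A ⊗ C) vec(X) = vec(A X Cᵀ) for row-major vec is what makes this work: L_height sums down the rows and L_widthᵀ sums along the columns, which together is the integral image.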


More information

Area and Perimeter EXPERIMENT. How are the area and perimeter of a rectangle related? You probably know the formulas by heart:

Area and Perimeter EXPERIMENT. How are the area and perimeter of a rectangle related? You probably know the formulas by heart: Area and Perimeter How are the area and perimeter of a rectangle related? You probably know the formulas by heart: Area Length Width Perimeter (Length Width) But if you look at data for many different

More information

Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm

Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm Dirk W. Wagener, Ben Herbst Department of Applied Mathematics, University of Stellenbosch, Private Bag X1, Matieland 762,

More information

10.4 Linear interpolation method Newton s method

10.4 Linear interpolation method Newton s method 10.4 Linear interpolation method The next best thing one can do is the linear interpolation method, also known as the double false position method. This method works similarly to the bisection method by

More information

Object recognition (part 1)

Object recognition (part 1) Recognition Object recognition (part 1) CSE P 576 Larry Zitnick (larryz@microsoft.com) The Margaret Thatcher Illusion, by Peter Thompson Readings Szeliski Chapter 14 Recognition What do we mean by object

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

UV Mapping to avoid texture flaws and enable proper shading

UV Mapping to avoid texture flaws and enable proper shading UV Mapping to avoid texture flaws and enable proper shading Foreword: Throughout this tutorial I am going to be using Maya s built in UV Mapping utility, which I am going to base my projections on individual

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Detecting Faces in Images. Detecting Faces in Images. Finding faces in an image. Finding faces in an image. Finding faces in an image

Detecting Faces in Images. Detecting Faces in Images. Finding faces in an image. Finding faces in an image. Finding faces in an image Detecting Faces in Images Detecting Faces in Images 37 Finding face like patterns How do we find if a picture has faces in it Where are the faces? A simple solution: Define a typical face Find the typical

More information

1

1 Zeros&asymptotes Example 1 In an early version of this activity I began with a sequence of simple examples (parabolas and cubics) working gradually up to the main idea. But now I think the best strategy

More information

Name: Tutor s

Name: Tutor s Name: Tutor s Email: Bring a couple, just in case! Necessary Equipment: Black Pen Pencil Rubber Pencil Sharpener Scientific Calculator Ruler Protractor (Pair of) Compasses 018 AQA Exam Dates Paper 1 4

More information

Image Registration for Volume Measurement in 3D Range Data

Image Registration for Volume Measurement in 3D Range Data Image Registration for Volume Measurement in 3D Range Data Fernando Indurain Gaspar Matthew Thurley Lulea May 2011 2 Abstract At this report we will explain how we have designed an application based on

More information

ii(i,j) = k i,l j efficiently computed in a single image pass using recurrences

ii(i,j) = k i,l j efficiently computed in a single image pass using recurrences 4.2 Traditional image data structures 9 INTEGRAL IMAGE is a matrix representation that holds global image information [Viola and Jones 01] valuesii(i,j)... sums of all original image pixel-values left

More information

Lecture 1 Contracts : Principles of Imperative Computation (Fall 2018) Frank Pfenning

Lecture 1 Contracts : Principles of Imperative Computation (Fall 2018) Frank Pfenning Lecture 1 Contracts 15-122: Principles of Imperative Computation (Fall 2018) Frank Pfenning In these notes we review contracts, which we use to collectively denote function contracts, loop invariants,

More information

[2006] IEEE. Reprinted, with permission, from [Wenjing Jia, Huaifeng Zhang, Xiangjian He, and Qiang Wu, A Comparison on Histogram Based Image

[2006] IEEE. Reprinted, with permission, from [Wenjing Jia, Huaifeng Zhang, Xiangjian He, and Qiang Wu, A Comparison on Histogram Based Image [6] IEEE. Reprinted, with permission, from [Wenjing Jia, Huaifeng Zhang, Xiangjian He, and Qiang Wu, A Comparison on Histogram Based Image Matching Methods, Video and Signal Based Surveillance, 6. AVSS

More information

Vector Calculus: Understanding the Cross Product

Vector Calculus: Understanding the Cross Product University of Babylon College of Engineering Mechanical Engineering Dept. Subject : Mathematics III Class : 2 nd year - first semester Date: / 10 / 2016 2016 \ 2017 Vector Calculus: Understanding the Cross

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 WRI C225 Lecture 04 130131 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Histogram Equalization Image Filtering Linear

More information

Face/Flesh Detection and Face Recognition

Face/Flesh Detection and Face Recognition Face/Flesh Detection and Face Recognition Linda Shapiro EE/CSE 576 1 What s Coming 1. Review of Bakic flesh detector 2. Fleck and Forsyth flesh detector 3. Details of Rowley face detector 4. The Viola

More information

Filtering Images. Contents

Filtering Images. Contents Image Processing and Data Visualization with MATLAB Filtering Images Hansrudi Noser June 8-9, 010 UZH, Multimedia and Robotics Summer School Noise Smoothing Filters Sigmoid Filters Gradient Filters Contents

More information

A robust method for automatic player detection in sport videos

A robust method for automatic player detection in sport videos A robust method for automatic player detection in sport videos A. Lehuger 1 S. Duffner 1 C. Garcia 1 1 Orange Labs 4, rue du clos courtel, 35512 Cesson-Sévigné {antoine.lehuger, stefan.duffner, christophe.garcia}@orange-ftgroup.com

More information

6.001 Notes: Section 4.1

6.001 Notes: Section 4.1 6.001 Notes: Section 4.1 Slide 4.1.1 In this lecture, we are going to take a careful look at the kinds of procedures we can build. We will first go back to look very carefully at the substitution model,

More information

CSC411 Fall 2014 Machine Learning & Data Mining. Ensemble Methods. Slides by Rich Zemel

CSC411 Fall 2014 Machine Learning & Data Mining. Ensemble Methods. Slides by Rich Zemel CSC411 Fall 2014 Machine Learning & Data Mining Ensemble Methods Slides by Rich Zemel Ensemble methods Typical application: classi.ication Ensemble of classi.iers is a set of classi.iers whose individual

More information

6-1 THE STANDARD NORMAL DISTRIBUTION

6-1 THE STANDARD NORMAL DISTRIBUTION 6-1 THE STANDARD NORMAL DISTRIBUTION The major focus of this chapter is the concept of a normal probability distribution, but we begin with a uniform distribution so that we can see the following two very

More information

In this section you will learn some simple data entry, editing, formatting techniques and some simple formulae. Contents

In this section you will learn some simple data entry, editing, formatting techniques and some simple formulae. Contents In this section you will learn some simple data entry, editing, formatting techniques and some simple formulae. Contents Section Topic Sub-topic Pages Section 2 Spreadsheets Layout and Design S2: 2 3 Formulae

More information

Distributed Partitioning Algorithm with application to video-surveillance

Distributed Partitioning Algorithm with application to video-surveillance Università degli Studi di Padova Department of Information Engineering Distributed Partitioning Algorithm with application to video-surveillance Authors: Inés Paloma Carlos Saiz Supervisor: Ruggero Carli

More information

Color Space Invariance for Various Edge Types in Simple Images. Geoffrey Hollinger and Dr. Bruce Maxwell Swarthmore College Summer 2003

Color Space Invariance for Various Edge Types in Simple Images. Geoffrey Hollinger and Dr. Bruce Maxwell Swarthmore College Summer 2003 Color Space Invariance for Various Edge Types in Simple Images Geoffrey Hollinger and Dr. Bruce Maxwell Swarthmore College Summer 2003 Abstract This paper describes a study done to determine the color

More information

Lastly, in case you don t already know this, and don t have Excel on your computers, you can get it for free through IT s website under software.

Lastly, in case you don t already know this, and don t have Excel on your computers, you can get it for free through IT s website under software. Welcome to Basic Excel, presented by STEM Gateway as part of the Essential Academic Skills Enhancement, or EASE, workshop series. Before we begin, I want to make sure we are clear that this is by no means

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

Detecting People in Images: An Edge Density Approach

Detecting People in Images: An Edge Density Approach University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 27 Detecting People in Images: An Edge Density Approach Son Lam Phung

More information

Stacked Integral Image

Stacked Integral Image 2010 IEEE International Conference on Robotics and Automation Anchorage Convention District May 3-8, 2010, Anchorage, Alaska, USA Stacked Integral Image Amit Bhatia, Wesley E. Snyder and Griff Bilbro Abstract

More information

1.1 The Real Number System

1.1 The Real Number System 1.1 The Real Number System Contents: Number Lines Absolute Value Definition of a Number Real Numbers Natural Numbers Whole Numbers Integers Rational Numbers Decimals as Fractions Repeating Decimals Rewriting

More information

5 R1 The one green in the same place so either of these could be green.

5 R1 The one green in the same place so either of these could be green. Page: 1 of 20 1 R1 Now. Maybe what we should do is write out the cases that work. We wrote out one of them really very clearly here. [R1 takes out some papers.] Right? You did the one here um where you

More information

Structures of Expressions

Structures of Expressions SECONDARY MATH TWO An Integrated Approach MODULE 2 Structures of Expressions The Scott Hendrickson, Joleigh Honey, Barbara Kuehl, Travis Lemon, Janet Sutorius 2017 Original work 2013 in partnership with

More information

Lecture 1 Contracts. 1 A Mysterious Program : Principles of Imperative Computation (Spring 2018) Frank Pfenning

Lecture 1 Contracts. 1 A Mysterious Program : Principles of Imperative Computation (Spring 2018) Frank Pfenning Lecture 1 Contracts 15-122: Principles of Imperative Computation (Spring 2018) Frank Pfenning In these notes we review contracts, which we use to collectively denote function contracts, loop invariants,

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

Fundamentals of Operations Research. Prof. G. Srinivasan. Department of Management Studies. Indian Institute of Technology, Madras. Lecture No.

Fundamentals of Operations Research. Prof. G. Srinivasan. Department of Management Studies. Indian Institute of Technology, Madras. Lecture No. Fundamentals of Operations Research Prof. G. Srinivasan Department of Management Studies Indian Institute of Technology, Madras Lecture No. # 13 Transportation Problem, Methods for Initial Basic Feasible

More information

Face Detection CUDA Accelerating

Face Detection CUDA Accelerating Face Detection CUDA Accelerating Jaromír Krpec Department of Computer Science VŠB Technical University Ostrava Ostrava, Czech Republic krpec.jaromir@seznam.cz Martin Němec Department of Computer Science

More information

Generic Object-Face detection

Generic Object-Face detection Generic Object-Face detection Jana Kosecka Many slides adapted from P. Viola, K. Grauman, S. Lazebnik and many others Today Window-based generic object detection basic pipeline boosting classifiers face

More information

JMC 2015 Teacher s notes Recap table

JMC 2015 Teacher s notes Recap table JMC 2015 Teacher s notes Recap table JMC 2015 1 Number / Adding and subtracting integers Number / Negative numbers JMC 2015 2 Measuring / Time units JMC 2015 3 Number / Estimating Number / Properties of

More information

Progress Report of Final Year Project

Progress Report of Final Year Project Progress Report of Final Year Project Project Title: Design and implement a face-tracking engine for video William O Grady 08339937 Electronic and Computer Engineering, College of Engineering and Informatics,

More information

COUNTING PERFECT MATCHINGS

COUNTING PERFECT MATCHINGS COUNTING PERFECT MATCHINGS JOHN WILTSHIRE-GORDON Abstract. Let G be a graph on n vertices. A perfect matching of the vertices of G is a collection of n/ edges whose union is the entire graph. This definition

More information

Project Report Number Plate Recognition

Project Report Number Plate Recognition Project Report Number Plate Recognition Ribemont Francois Supervisor: Nigel Whyte April 17, 2012 Contents 1 Introduction............................... 2 2 Description of Submitted Project...................

More information

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22)

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22) Digital Image Processing Prof. P. K. Biswas Department of Electronics and Electrical Communications Engineering Indian Institute of Technology, Kharagpur Module Number 01 Lecture Number 02 Application

More information

I will illustrate the concepts using the example below.

I will illustrate the concepts using the example below. Linear Programming Notes More Tutorials at www.littledumbdoctor.com Linear Programming Notes I will illustrate the concepts using the example below. A farmer plants two crops, oats and corn, on 100 acres.

More information

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 24 Solid Modelling

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 24 Solid Modelling Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 24 Solid Modelling Welcome to the lectures on computer graphics. We have

More information

EC121 Mathematical Techniques A Revision Notes

EC121 Mathematical Techniques A Revision Notes EC Mathematical Techniques A Revision Notes EC Mathematical Techniques A Revision Notes Mathematical Techniques A begins with two weeks of intensive revision of basic arithmetic and algebra, to the level

More information

(Refer Slide Time: 00:03:51)

(Refer Slide Time: 00:03:51) Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture 17 Scan Converting Lines, Circles and Ellipses Hello and welcome everybody

More information

CPSC 4040/6040 Computer Graphics Images. Joshua Levine

CPSC 4040/6040 Computer Graphics Images. Joshua Levine CPSC 4040/6040 Computer Graphics Images Joshua Levine levinej@clemson.edu Lecture 19 Projective Warping and Bilinear Warping Nov. 3, 2015 Agenda EC Quiz Review PA06 out Refresher from Lec18 https://en.wikipedia.org/wiki/affine_transformation

More information

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14 Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14 Scan Converting Lines, Circles and Ellipses Hello everybody, welcome again

More information

Coarse-to-fine image registration

Coarse-to-fine image registration Today we will look at a few important topics in scale space in computer vision, in particular, coarseto-fine approaches, and the SIFT feature descriptor. I will present only the main ideas here to give

More information