EE368 Project Report: CD Cover Recognition Using a Modified SIFT Algorithm
Group 1: Mina A. Makar, Stanford University, mamakar@stanford.edu

Abstract

In this report, we investigate the application of the Scale-Invariant Feature Transform (SIFT) to the problem of CD cover recognition. The algorithm uses a modified SIFT approach to match keypoints between a query image taken by a camera phone and a database of original CD covers. The extracted features are highly distinctive: they are invariant to shift, scale and rotation, and partially invariant to illumination and affine transformations. All of these properties make them well suited to the problem at hand. Experimental results show the efficiency of the developed algorithm: it correctly recognizes all 99 images provided in the training set.

I. Introduction

As the number of mobile phones with built-in cameras increases, applications based on Mobile Augmented Reality become more and more attractive. The problem described in this report is CD cover recognition from a picture taken by a camera phone. This idea can be used in marketing: the CD producer can interact with the user by recognizing an image sent from the user's mobile phone and sending back advertising material. In this report, we apply a slightly modified version of the Scale-Invariant Feature Transform (SIFT) to solve the problem.

In Section II, we discuss initial experiments that suggested using local robust features to solve the problem, and then present the whole algorithm developed to suit the given image database and training set. Section III describes the SIFT algorithm in more detail and explains the modifications made to increase simplicity. In Section IV, the procedure for matching an image to the database is discussed. Section V presents the results of applying our algorithm to the training set of 99 images, of which 90 have matching CDs in the database and 9 have no matches. We conclude in Section VI by reviewing the benefits of the whole technique.

II. Algorithm Selection & Overview

A. Comparing Two Solutions

We surveyed possible techniques for solving the CD cover recognition problem and found [1] two solutions that appeared well suited to it. The first is to use eigenimages to project the images onto a lower-dimensional space and then identify new images by their distances, in this space, to the images in the database. The second is to use local robust features as a representative of each image and to match images based on these features.

Studying the images in the training set, we found that the main deformations present are:
1. Rotation that may reach up to 40°, and affine or projective transformation.
2. Noise, motion blur, defocus, and spurious glare/reflection from the CD cover.
3. Occasional occlusion of the CD edge by the hand holding it.

There are also some helpful characteristics common to all the images:
1. The CD is always in the middle of the image.
2. The images are high resolution (1280 x 960) compared to typical camera-phone images.
3. There is no big variation in scale between the areas containing CD covers across the images.

Using eigenimages is much simpler and faster, but we would have to warp the image, undoing any affine or projective transformation, before projecting it onto the lower-dimensional space. So, the first step would be to accurately segment the CD cover from the whole image.
Some experiments were made using different edge detection algorithms [2,3] followed by line detection with the Hough transform [1,3]. The best of these worked well for most CD covers, but some covers always failed the segmentation process because the background clutter was too high. Since the background in our problem can be arbitrary, the segmentation approach was rejected as potentially unreliable. Fig. 1 shows images of the same CD from the training set; the segmentation algorithm we developed gives very good results for the first and unacceptable results for the second.

Fig. 1. Results of segmentation algorithm: (a) good performance, (b) bad performance
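For concreteness, the following is a minimal MATLAB sketch of the kind of edge-plus-Hough segmentation experiment described above. The report does not specify which edge detector or parameters were used, so the Canny detector, the file name, and the choice of 4 Hough peaks are illustrative assumptions.

% Hedged sketch of the rejected segmentation approach: detect edges,
% then look for strong straight lines that could bound the CD cover.
I  = im2double(rgb2gray(imread('query.jpg')));  % hypothetical file name
BW = edge(I, 'canny');                          % edge map (detector assumed)

[H, theta, rho] = hough(BW);                    % Hough transform [1,3]
peaks = houghpeaks(H, 4);                       % 4 strongest lines (assumed)
lines = houghlines(BW, theta, rho, peaks);      % candidate cover boundaries

% With little background clutter, the 4 lines bound the cover well; with
% heavy clutter, spurious background lines dominate and segmentation fails,
% which is why this approach was abandoned in favor of local features.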

This suggested turning to the other solution: local robust features. Of course, correct segmentation would make the task of the feature detector much easier, since it would not detect features from the background; with local features, however, segmentation is an optional step. We can still identify the CD cover even when some wrong features come from the background.

To reduce the number of background features, we crop the image, exploiting the fact that the CD cover is always at the center. Too small a cropping area is a problem for large CDs, because many features near the edges are deleted; too large a cropping area is a problem for small CDs, because many wrong features are still extracted from the background. By experiment, we settled on cropping the center 750 x 750 pixels of the query image. We then resize the database images and the cropped query image to 400 x 400 pixels; this is the size for which we designed the feature detection algorithm (more detail on this choice in Section V). We also convert the image to grayscale, because we make no further use of color information. After that, we apply the SIFT algorithm [4,5] to the image: we detect the robust keypoints, generate a feature descriptor for each keypoint, and compare the feature descriptors of the query image with those of the database images to find the closest match. Fig. 2 is a block diagram showing an overview of the whole algorithm.

Fig. 2. Algorithm block diagram

III. Modified SIFT

In this section, we describe the SIFT algorithm [4,5] in more detail and state the modifications that were made to increase simplicity. Since we have a small database of only 30 CDs, these simplifications are quite reasonable and do not affect the accuracy of the algorithm.

A. Detection of the Keypoints

The first step in the SIFT algorithm is the detection of keypoints, which should be well localized and robust to image deformations. A Gaussian image pyramid L(x, y, σ) is generated by successively filtering the image I(x, y) with a Gaussian filter G(x, y, σ) according to equation (1), where * denotes convolution. At every octave, which corresponds to a factor of 2 change in σ, the image is downsampled by 2. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian (DoG) images (equation (2)), which approximate Laplacian-of-Gaussian filtering. This process is shown in Fig. 3.

G(x, y, σ) = (1 / (2πσ²)) e^(−(x² + y²) / (2σ²))
L(x, y, σ) = G(x, y, σ) * I(x, y)    (1)
D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)    (2)

Fig. 3. Gaussian & DoG pyramids (Source: Reference [5])

Because our image database is small, we do not need a large number of keypoints per image; also, the difference in scale between large and small CDs is not very big. So, we decided to use only 2 octaves and to detect keypoints from only one interval per octave. This requires generating 4 images per octave in the Gaussian pyramid, so that the DoG pyramid has 3 images per octave and we can detect local maxima and minima. Fig. 4 shows the generated pyramids for the database image 'Sheryl_Crow_Light_Eyes'.

Fig. 4. (a) Gaussian pyramid and (b) DoG pyramid for CD number 3
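To make the preprocessing and pyramid construction concrete, here is a minimal MATLAB sketch under the settings above (center crop of 750 x 750, resize to 400 x 400, 2 octaves, 4 Gaussian images per octave). The base scale sigma0 = 1.6 and the scale step k = sqrt(2) are assumptions borrowed from [5]; the report does not state its exact values.

% Preprocessing (Section II): center crop, resize, grayscale.
% Assumes the query image is at least 750 x 750 pixels.
I = im2double(rgb2gray(imread('query.jpg')));      % hypothetical file name
[h, w] = size(I);
rows = round(h/2) + (-374:375);  cols = round(w/2) + (-374:375);
I = imresize(I(rows, cols), [400 400]);            % 750x750 crop -> 400x400

% Gaussian and DoG pyramids: 2 octaves, 4 Gaussian images per octave,
% giving 3 DoG images per octave, eqs. (1) and (2).
k = sqrt(2);  sigma0 = 1.6;                        % assumed values (as in [5])
nOct = 2;  nGauss = 4;
gauss = cell(nOct, nGauss);  dog = cell(nOct, nGauss-1);
for o = 1:nOct
    for s = 1:nGauss
        sigma = sigma0 * k^(s-1);
        g = fspecial('gaussian', 2*ceil(3*sigma)+1, sigma);
        gauss{o,s} = imfilter(I, g, 'replicate');  % L(x,y,sigma), eq. (1)
        if s > 1
            dog{o,s-1} = gauss{o,s} - gauss{o,s-1};    % D(x,y,sigma), eq. (2)
        end
    end
    I = gauss{o,3}(1:2:end, 1:2:end);  % image at 2*sigma0, downsampled by 2
end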

To be accepted, a candidate keypoint has to pass three tests (a MATLAB sketch of all three is given at the end of this subsection, after Section III-B). The main test is that it must be a local maximum or minimum with respect to its 26 neighbors in 3x3 regions at the current and adjacent scales in the DoG pyramid. For local extrema detection, we followed the idea described in [6]: we create dilated and eroded versions of the DoG pyramid using a 3x3 flat structuring element. A pixel is then a local maximum if its value equals the pixel at the same location in the dilated DoG pyramid and is greater than the values immediately above and below it (at the adjacent scales) in that pyramid. Local minima are detected in the same way from the eroded DoG pyramid.

Second, instead of performing the scale-space interpolation described in [5], which localizes the keypoint more accurately and removes keypoints with low contrast, we simply discard keypoints for which the absolute value of the DoG pyramid, at the interval where they are detected, is smaller than a threshold. We follow the same threshold used in [5], namely 0.03.

The final test makes sure that the keypoint does not lie on a strong edge. We use discrete differences between neighboring pixels around the keypoint under study to calculate the Hessian matrix:

H = | Dxx  Dxy |
    | Dxy  Dyy |    (3.a)

and we discard keypoints that do not satisfy the condition

Tr(H) = Dxx + Dyy,  Det(H) = Dxx·Dyy − (Dxy)²,  Tr(H)² / Det(H) < (r + 1)² / r    (3.b)

If the ratio in (3.b) is small, the eigenvalues of the Hessian matrix are close to each other, which means the keypoint lies on a corner and not on an edge. We used the value r = 10 suggested in the SIFT paper [5]. The keypoints detected on the previous CD image are shown in Fig. 5.

Fig. 5. Detected keypoints for CD number 3

B. Orientation Assignment

For the feature descriptors to be rotation invariant, an orientation is assigned to each keypoint, and all subsequent operations are done relative to the orientation of the keypoint. This allows matching even if the query image is rotated by any angle. To simplify the algorithm, we first tried to skip this step and assume zero orientation for all keypoints; this gave wrong results for nearly all images in which the CD cover is rotated by an angle of 15° to 20° or more, so we realized that this step cannot be eliminated.

The scale of the keypoint is used to select the Gaussian-smoothed image, L, with the closest scale. The gradient magnitude and orientation are then calculated using equation (4):

m(x, y) = sqrt( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² )
θ(x, y) = tan⁻¹( (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) )    (4)

An orientation histogram with 36 bins is formed from the gradient orientations of sample points within a region around the keypoint. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window. The SIFT paper then suggests locating the highest peak in the histogram together with any other local peak within 80 percent of the highest peak. To decrease the number of keypoints without affecting the accuracy much, we assign only one orientation to each keypoint, corresponding to the single peak of the histogram. A parabola is fit to the 3 histogram values closest to the peak to interpolate the peak position for better accuracy.
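The three keypoint tests of Section III-A can be sketched in MATLAB as follows, using the morphological trick from [6]. Here D is the DoG interval used for detection, and Dup, Ddn are the DoG images at the adjacent scales of the same octave (e.g. D = dog{o,2}, Dup = dog{o,3}, Ddn = dog{o,1} from the pyramid sketch above); the variable names are illustrative.

% Test 1: local extrema via dilated/eroded DoG images (3x3 flat SE) [6].
se = strel('square', 3);
isMax = (D == imdilate(D, se)) & (D > imdilate(Dup, se)) & (D > imdilate(Ddn, se));
isMin = (D == imerode(D, se))  & (D < imerode(Dup, se))  & (D < imerode(Ddn, se));

% Test 2: discard low-contrast keypoints, |D| <= 0.03 (threshold from [5]).
cand = (isMax | isMin) & (abs(D) > 0.03);

% Test 3: edge test with the Hessian from discrete differences, eq. (3).
rEdge = 10;                                   % r = 10, as in [5]
[yy, xx] = find(cand);
keep = false(size(yy));
for i = 1:numel(yy)
    y = yy(i);  x = xx(i);
    if y < 2 || x < 2 || y > size(D,1)-1 || x > size(D,2)-1, continue; end
    Dxx = D(y, x+1) - 2*D(y, x) + D(y, x-1);
    Dyy = D(y+1, x) - 2*D(y, x) + D(y-1, x);
    Dxy = (D(y+1, x+1) - D(y+1, x-1) - D(y-1, x+1) + D(y-1, x-1)) / 4;
    tr = Dxx + Dyy;  dt = Dxx*Dyy - Dxy^2;
    keep(i) = dt > 0 && tr^2/dt < (rEdge + 1)^2/rEdge;   % condition (3.b)
end
keypoints = [yy(keep), xx(keep)];             % (row, col) of accepted keypoints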

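Similarly, the single-orientation assignment just described can be sketched as below: gradients from eq. (4), a 36-bin histogram with Gaussian weighting, and a parabolic fit around the single highest peak. L is the Gaussian image closest in scale to the keypoint at (y, x), and sigma is the keypoint's scale; the window radius and the weighting width of 1.5*sigma are assumptions, since the report does not state them.

radius = 8;  sigmaW = 1.5 * sigma;          % assumed window parameters
hist36 = zeros(1, 36);
for dy = -radius:radius
    for dx = -radius:radius
        yy = y + dy;  xx = x + dx;
        if yy < 2 || xx < 2 || yy > size(L,1)-1 || xx > size(L,2)-1, continue; end
        gx = L(yy, xx+1) - L(yy, xx-1);                 % eq. (4)
        gy = L(yy+1, xx) - L(yy-1, xx);
        m  = sqrt(gx^2 + gy^2);                         % gradient magnitude
        th = atan2(gy, gx);                             % orientation in [-pi, pi]
        w  = exp(-(dx^2 + dy^2) / (2*sigmaW^2));        % Gaussian circular window
        b  = mod(floor((th + pi)/(2*pi) * 36), 36) + 1; % one of 36 bins
        hist36(b) = hist36(b) + w * m;
    end
end
[~, p] = max(hist36);                                   % keep only the single peak
lft = hist36(mod(p-2, 36) + 1);  rgt = hist36(mod(p, 36) + 1);
offset = 0.5*(lft - rgt) / (lft - 2*hist36(p) + rgt);   % parabolic interpolation
orientation = (p - 1 + offset) / 36 * 2*pi - pi;        % keypoint orientation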
C. The Feature Descriptor

First, the image gradient magnitudes and orientations are calculated around the keypoint, using the scale of the keypoint to select the level of Gaussian blur for the image. The coordinates of the descriptor and the gradient orientations are rotated relative to the keypoint orientation. Note that after the grid around the keypoint is rotated, the Gaussian-blurred image must be interpolated at non-integer pixel positions. We found that 2D interpolation in MATLAB takes much time, so for simplicity we always round the rotated grid around the keypoint to the nearest integer pixel positions. By experiment, this increased the speed considerably while having only a minor effect on the accuracy of the whole algorithm.

The gradient magnitudes are weighted by a Gaussian weighting function with σ equal to one half of the descriptor window width, to give less credit to gradients far from the center of the descriptor. These magnitude samples are then accumulated into orientation histograms summarizing the contents of 4x4 subregions; Fig. 6 illustrates the whole operation. Trilinear interpolation is used to distribute the value of each gradient sample into adjacent bins. The descriptor is formed as a vector containing the values of all the orientation histogram entries. The algorithm uses a 4x4 array of histograms with 8 orientation bins each, resulting in a feature vector of 128 elements. The feature vector is then normalized to unit length to reduce the effect of illumination change. To handle nonlinear illumination changes, the values of the unit-length vector are thresholded to 0.2 and the vector is renormalized to unit length.

Fig. 6. 2x2 descriptor array computed from 8x8 samples (Source: Reference [5])

D. Simplifications to the SIFT Algorithm

In summary, the main modifications that we made to the original SIFT algorithm are:
1. We do not perform the scale-space interpolation that is used for more accurate localization of the keypoints.
2. We assign only one orientation to each keypoint, corresponding to the peak of the histogram; we do not search for other local peaks within 80 percent of the highest peak.
3. After rotating the grid around the keypoint by the orientation angle, we do not perform 2D interpolation; we round the grid to integer pixel values.

IV. Image Matching

For image matching, we saved the feature vectors of the original CD covers. When a query image is applied to the algorithm, the preprocessing steps discussed in Section II are performed first; then our modified SIFT algorithm calculates the feature vectors of the query image. For each feature vector of the query image, we find the minimum Euclidean distance to all the feature vectors in the database; the CD owning the closest database feature vector receives one vote. After going over all the feature vectors of the query image and giving votes to CDs in the database, we observed that the right CD is always the one with the highest number of votes.

The remaining problem is how to identify a 'No Match'. We observed that 'No Match' query images are in many cases confused with the CDs that have a large number of feature vectors in the database. We therefore decided to compare the highest vote (corresponding to the right CD) with the second highest vote (corresponding to the most conflicting CD).
If the difference between them is larger than a threshold, there is a match, and it corresponds to the highest vote; if the difference is smaller than the threshold, we declare a 'No Match'. For CDs with a large number of feature vectors we use a larger threshold. The rule we followed is (a MATLAB sketch of this rule is given below):
1. Detect the highest and the second highest votes.
2. If the difference between them is greater than THRESHOLD, the output is the CD with the highest vote.
3. Otherwise, the output is 'No Match'.
4. THRESHOLD equals 30 for CD numbers 2, 19 and 24; 15 for CD numbers 8 and 14; and 7 for all other CDs.

The values of THRESHOLD were chosen by experiment on training set images both with and without matches. We also tried the ratio test described in [5], but for some images the number of votes that passed the ratio test was too small to be above the threshold, so we did not use the ratio test for image matching.

V. Results

First, we applied the modified SIFT algorithm to the dataset of original CD covers and saved the feature descriptors in a database for the purpose of image matching. CD 10 has the smallest number of keypoints (9); CD 2 has the largest (363). The average number of keypoints per CD is 189, which is reasonable for our matching rule. Fig. 7 shows the number of keypoints for every CD in the original CD images dataset.

We should mention that the experiments and the matching rule were also tried on images with resolution 300 x 300; the highest match still corresponded to the right CD, with fewer keypoints and much less processing time. The problem was that the margin between the highest and second highest votes was small, so we could not design an efficient rule to detect a 'No Match' condition.
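Before turning to the figures, here is a minimal MATLAB sketch of the voting and 'No Match' rule from Section IV. It assumes dbDesc{c} is a matrix whose rows are the 128-element descriptors of database CD c, and qDesc holds the query descriptors; these names, and the threshold assignments for the reconstructed CD numbers, are illustrative.

% Voting: each query descriptor votes for the CD owning its nearest
% database descriptor (minimum Euclidean distance).
nCD = 30;
votes = zeros(1, nCD);
for i = 1:size(qDesc, 1)
    best = inf;  bestCD = 0;
    for c = 1:nCD
        diffs = dbDesc{c} - repmat(qDesc(i,:), size(dbDesc{c},1), 1);
        d = min(sum(diffs.^2, 2));          % squared distances to CD c
        if d < best, best = d;  bestCD = c; end
    end
    votes(bestCD) = votes(bestCD) + 1;
end

% 'No Match' rule: compare highest and second highest votes against a
% per-CD THRESHOLD (values as reconstructed in the rule above).
thr = 7 * ones(1, nCD);
thr([2 19 24]) = 30;  thr([8 14]) = 15;
[v, order] = sort(votes, 'descend');
if v(1) - v(2) > thr(order(1))
    result = order(1);                      % matched CD number
else
    result = 0;                             % 0 denotes 'No Match'
end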

Fig. 7. Number of keypoints for every CD in the original dataset

We applied the image matching procedure described in the previous section to the 99 images of the training set, of which 90 have matches in the database (3 for every CD) and 9 have no matches. Fig. 8 shows the difference between the highest vote, which corresponds to the right CD, and the second highest vote; the values in the figure are averages over the 3 images of each CD. The efficiency of the developed algorithm shows in the big difference in the number of votes between the highest and second highest votes, which made setting a threshold for 'No Match' easy.

Fig. 8. Average number of right (highest) and conflicting (second highest) votes per CD in the training set

Fig. 9 shows the results of the algorithm, in terms of the highest and second highest votes, when applied to the 'No Match' CDs; the CD number corresponding to the highest vote is shown on the figure. For CDs with no matches, the difference between the highest and second highest votes is much smaller than the design threshold. We can see from figures 8 and 9 that our matching rule works very well on the training set and that all 99 images are identified correctly using this rule.

Fig. 9. Number of highest and second highest votes for CDs with no matches; the CD number receiving the highest vote is shown

As a final check of our matching rule and its behavior on non-matching images, the algorithm was tested on 10 new CD cover images not present in the training set. The algorithm worked well and always declared a 'No Match'. An example of the 10 new no-match CD covers is shown in Fig. 10.

Fig. 10. Example of a no-match CD cover not in the training set

VI. Conclusion

An algorithm was developed to identify a CD cover image taken by a camera phone. It is based mainly on SIFT features, which are used to match the image to a database of original CD covers. Some modifications were made to increase the simplicity of the SIFT algorithm, and a rule for image matching was designed. Applied to the training set, the algorithm was always able (100% accuracy) to identify the right CD, or to declare a 'No Match' when no matching CD exists. The algorithm is highly robust to scale difference, to rotation by any angle, and to other artifacts such as noise, motion blur, defocus and reflection from the CD cover. With more tuning of the matching rule and other parameters of the algorithm, it is likely that it could work on lower-resolution images, such as 300 x 300 pixels or perhaps even smaller, which would make it much faster and simpler.

References
[1] B. Girod, Lecture Notes for EE 368: Digital Image Processing, Stanford University, Spring 2008.
[2] R.C. Gonzalez and R.E. Woods, Digital Image Processing, Prentice Hall, 2008.
[3] R.C. Gonzalez, R.E. Woods and S.L. Eddins, Digital Image Processing Using MATLAB, Prentice Hall, 2004.
[4] D. Lowe, "Object Recognition from Local Scale-Invariant Features," Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, 1999.
[5] D. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[6] G. Hoffmann, P. Kimball and S. Russell, "Identification of Paintings in Camera-Phone Images," EE368 Project Report, Spring 2007.