Fast Image Matching Using Multi-level Texture Descriptor

Hui-Fuang Ng *, Chih-Yang Lin #, and Tatenda Muindisi *

* Department of Computer Science, Universiti Tunku Abdul Rahman, Malaysia. E-mail: nghf@utar.edu.my, Tel: +60-5-4688888
Department of Computer Science and Information Engineering, Asia University, Taiwan
# Department of Medical Research, China Medical University Hospital, China Medical University, Taichung, Taiwan
Corresponding author: Chih-Yang Lin, e-mail: andrewlin@asia.edu.tw, Tel: +886-4-23323456

Abstract: Image and video descriptors are widely used in many computer vision applications. This paper introduces a new hierarchical multiscale texture-based image descriptor for efficient image matching. The proposed descriptor uses the mean values at multiple scale levels of an image region to convert the region into binary bitmaps, and then applies binary operations to reduce computational time and suppress noise, achieving stable and fast image matching. Experimental results show that the proposed method performs well and is robust compared with existing descriptors for image matching under varying illumination conditions and noise.

I. INTRODUCTION

The detection and description of local image features is a fundamental step in computer vision applications such as object recognition, image content retrieval, and motion estimation. The difficulty of image description stems from the ever-changing conditions of the scene environment (illumination, rotation, blur, scale, clutter, etc.), which alter the appearance of the image. Image features must therefore satisfy invariance properties, namely invariance to changes in scale, rotation, illumination, and viewpoint.

Harris and Stephens [3, 9] developed a corner detector that is robust to changes in rotation and illumination because it relies on geometric image properties. It attempts to relate regions in the model image to all possible regions in the matching image, and computation time is reduced by matching only regions centered at corner points in each image. Its limitation is that it examines an image at only a single scale: when the change in scale becomes significant, the detector responds to different image points. It is therefore very sensitive to changes in image scale and does not provide a good basis for matching images of different sizes.

Lindeberg devoted considerable attention to the scale-invariance problem [5]. Scale-space theory represents a given image at different scales, so a multiscale approach is vital when extracting information such as features from image data, and this topic has received a great deal of attention. The scale-invariant feature transform (SIFT) algorithm, proposed in [6, 7], is prominent because of its invariance to common image transformations such as scaling and rotation, and it made a landmark contribution to keypoint detection and description. The SIFT descriptor is a 128-dimensional vector based on the magnitudes and orientations of the image gradients around a keypoint: it describes each keypoint using histograms of image gradients computed in its neighborhood, and it relies on extracting scale-invariant keypoints with the difference-of-Gaussians (DoG) operator.
SIFT descriptors have proved to be among the most robust feature-point extraction techniques, owing to their invariance to image scaling and rotation and their partial invariance to changes in illumination and camera viewpoint [7]. However, SIFT is not only computationally expensive but also ill-suited to color images, as it is designed mainly for grayscale images. Color is a powerful information component for object recognition in everyday life, as it helps distinguish objects and reduces misclassification.

Research has been ongoing to improve SIFT [7] and reduce its computational time. Some notable examples follow. PCA-SIFT [4] reduced the length of the descriptor vector from 128 to 36 dimensions to improve efficiency; the smaller number of components results in faster matching, but the descriptor proved less distinctive on feature points. GLOH [8] uses the same dimensionality as PCA-SIFT and is more distinctive, but it is more computationally expensive. The Speeded-Up Robust Features (SURF) detector [1] is based on an approximation of the Hessian matrix and relies on integral images to reduce computation time; its descriptor captures the distribution of Haar-wavelet responses within the interest-point neighborhood. SURF uses only 64 dimensions, thereby reducing the time for feature computation and matching while simultaneously increasing robustness. SURF has shown its qualities in computer vision applications by being fast, robust, and distinctive, but it also has limitations: it does not work well when the rotation is large or when there is a strong difference in view angle when comparing 2D or 3D objects.

SIFT and similar descriptors have shown state-of-the-art performance on a variety of problems.

We were especially interested in whether the gradient orientation and magnitude based feature used in the SIFT algorithm could be replaced by a different feature that offers better or comparable performance, and in extending it to color images. Multiscale implementations have been widely investigated in the context of texture analysis and, owing to their advantages, these approaches have been generalized to the color texture domain [10]. In this paper, we propose a computationally efficient alternative to SIFT that has similar matching performance and is less affected by image noise. The proposed interest-region descriptor uses a block-based multiscale feature instead of the original gradient feature. The new descriptor allows several steps of the algorithm to be simplified, which makes it computationally simpler than SIFT, and it also appears to be more robust to illumination changes than the SIFT descriptor. Our experimental results show that keypoint description using the multiscale method is achieved at low computational cost and is effective for object recognition.

The paper is organized as follows: Section 2 discusses the proposed multi-level texture descriptor in detail; Section 3 presents the experimental results; and Section 4 provides the concluding remarks.

II. PROPOSED METHOD

A. Binary Bitmap (BM) Generation

A given image region is first normalized to 160×160 pixels and then divided into M×M non-overlapping blocks, as shown in Fig. 1; each block contains n×n pixels. In the following, unless otherwise stated, all processing steps are performed separately on each color channel. Next, the mean value m of each block is calculated and compared with each pixel value x_{ij} in the block using Eq. (1): if x_{ij} is smaller than m, the corresponding bit is marked 0; if x_{ij} is greater than or equal to m, it is marked 1. This step converts an image region into a binary bitmap (BM) [10]:

b_{ij} = \begin{cases} 0, & \text{if } x_{ij} < m \\ 1, & \text{otherwise} \end{cases}    (1)

Fig. 1 Schematic of non-overlapping block generation.

Fig. 2 Example of binary bitmap (BM) generation, with the block size set to 4×4 pixels.

Since the BM reveals the profile of the given block, it is used as the texture descriptor in our method. We aim to process each block in line with the multiscale method for a robust description. However, the 0/1 bits of a block are unstable when pixel values are close to the mean, so a threshold TH is added to the mean value m before the comparison, as shown in Eq. (2) [10]; the value of TH is determined experimentally:

b_{ij} = \begin{cases} 0, & \text{if } x_{ij} < m + TH \\ 1, & \text{otherwise} \end{cases}    (2)
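As a concrete illustration of Eqs. (1) and (2), the following minimal NumPy sketch converts a single block into its 1-bit BM. This is our reading of the step, not the authors' code; the function name, threshold value, and sample block values are hypothetical.

```python
import numpy as np

def bitmap_1bit(block, th=0.0):
    """1-bit binary bitmap (BM) of one block, per Eq. (2).

    A bit is 1 where the pixel reaches the block mean plus the
    threshold TH, and 0 otherwise; th=0 reduces to Eq. (1)."""
    block = np.asarray(block, dtype=float)
    return (block >= block.mean() + th).astype(np.uint8)

# Hypothetical 4x4 block, in the spirit of Fig. 2.
block = [[52, 60, 61, 58],
         [49, 63, 70, 64],
         [41, 40, 72, 71],
         [38, 36, 75, 73]]
print(bitmap_1bit(block, th=3.0))  # 4x4 array of 0/1 bits
```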
B. Multiscale Texture Representation

The texture description above is a 1-bit (0/1) representation of the pixels: the BM reveals the texture features of an image region, and coarse-to-fine structures for the texture description can easily be derived from this 1-bit mode. If a block is not smooth, the 1-bit mode can be extended to a 2-bit mode by producing two finer means, the low mean (lm) and the high mean (hm): the '0' region of the 1-bit mode produces lm, and the '1' region produces hm. The BM for the 2-bit mode is then generated by Eq. (3):

b_{ij} = \begin{cases} 11, & \text{if } x_{ij} \geq hm + TH \\ 10, & \text{if } m + TH \leq x_{ij} < hm + TH \\ 01, & \text{if } lm + TH \leq x_{ij} < m + TH \\ 00, & \text{if } x_{ij} < lm + TH \end{cases}    (3)

The 3-bit mode is generated by a further transformation of the 2-bit mode's two means, lm and hm, into four finer means: llm (low-low mean), lhm (low-high mean), hlm (high-low mean), and hhm (high-high mean). This coarse-to-fine multiscale transformation of the binary pattern of the image region provides a simple way to extract texture features from an image. Fig. 3 shows an example of the transformation from the 1-bit mode to the 2-bit mode and then to the 3-bit mode. The BM for the 3-bit mode is generated by Eq. (4):

b_{ij} = \begin{cases} 111, & \text{if } x_{ij} \geq hhm + TH \\ 110, & \text{if } hm + TH \leq x_{ij} < hhm + TH \\ 101, & \text{if } hlm + TH \leq x_{ij} < hm + TH \\ 100, & \text{if } m + TH \leq x_{ij} < hlm + TH \\ 011, & \text{if } lhm + TH \leq x_{ij} < m + TH \\ 010, & \text{if } lm + TH \leq x_{ij} < lhm + TH \\ 001, & \text{if } llm + TH \leq x_{ij} < lm + TH \\ 000, & \text{if } x_{ij} < llm + TH \end{cases}    (4)
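The refinement in Eqs. (2)-(4) can be folded into one routine: the case tables partition the intensity axis by the ascending thresholds lm + TH, m + TH, hm + TH, and so on, so a pixel's k-bit code equals the number of thresholds it reaches. The sketch below is our reading of the scheme (the four finer means are taken as the means of the four 2-bit regions), and it assumes TH is small enough that the refined means stay in ascending order.

```python
import numpy as np

def _mean_or(block, mask, fallback):
    # Mean over the masked region, or the parent mean if the region is empty.
    return float(block[mask].mean()) if mask.any() else fallback

def bitmap_kbit(block, k, th=0.0):
    """k-bit BM for k in {1, 2, 3}, following Eqs. (2)-(4)."""
    block = np.asarray(block, dtype=float)
    m = float(block.mean())
    means = [m]
    if k >= 2:
        # lm / hm: means of the '0' and '1' regions of the 1-bit mode.
        lm = _mean_or(block, block < m + th, m)
        hm = _mean_or(block, block >= m + th, m)
        means = [lm, m, hm]
    if k == 3:
        # llm / lhm / hlm / hhm: means of the four 2-bit regions.
        c2 = np.searchsorted(np.array(means) + th, block, side='right')
        llm = _mean_or(block, c2 == 0, lm)
        lhm = _mean_or(block, c2 == 1, lm)
        hlm = _mean_or(block, c2 == 2, hm)
        hhm = _mean_or(block, c2 == 3, hm)
        means = [llm, lm, lhm, m, hlm, hm, hhm]  # assumed ascending
    # Code = number of (mean + TH) thresholds the pixel reaches.
    return np.searchsorted(np.array(means) + th, block, side='right').astype(np.uint8)
```

For example, with k = 2 a pixel below lm + TH gets code 0 (pattern 00) and a pixel at or above hm + TH gets code 3 (pattern 11), exactly as in Eq. (3).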

Fig. 3 The process of generating a binary bitmap from the 1-bit mode to the 2-bit mode and then to the 3-bit mode.

Blocks have different complexities; hence each block may end up with a different k-bit mode. In our method, the 1-bit mode represents the texture feature of a smooth block, while the higher-bit modes (2-bit or 3-bit in our example) are used to deal with highly textured, complicated blocks. For efficiency, block complexity is measured by counting bitwise transitions rather than by other possible measures such as variance. In our example, the numbers of 0/1 or 1/0 bitwise transitions in the rows of the 1-bit mode are 2, 2, 3, and 2, respectively, for a total of 9. The appropriate threshold value for the number of bitwise transitions is determined in the experiments.

C. Descriptor Construction

In the proposed method, a smooth block is represented by the 1-bit mode; otherwise, the block is represented by the 2- or 3-bit mode to fit the block's features more accurately. After generating the multiscale representation, an 8-bin histogram is built to accumulate the counts of the bit patterns in each block, and this histogram is used as the final texture descriptor. For the 1-bit mode there are only two bit patterns, 0 and 1, so the count of 0 goes into the first four bins and the count of 1 goes into the last four bins of the histogram. For the example in Fig. 3, the histogram for the 1-bit mode is [9, 9, 9, 9, 7, 7, 7, 7]. For the 2-bit mode there are four bit patterns (00, 01, 10, 11): the count of the first pattern (00) goes into the first two bins, the count of the second pattern (01) into the next two bins, and so on; the 2-bit example in Fig. 3 gives the histogram [4, 4, 5, 5, 4, 4, 3, 3]. Finally, for the 3-bit mode there are eight bit patterns, so each bin of the histogram corresponds to one bit pattern; the 3-bit example in Fig. 3 gives the histogram [2, 2, 3, 2, 2, 2, 1, 2].

After a histogram has been constructed for each block, the final descriptor is obtained by concatenating the histograms of all the blocks in the image region. For an image region divided into M×M blocks, the dimension of the descriptor is M×M×8. Fig. 4 shows an example of descriptor construction for 4×4 blocks, where the descriptor has 128 bins (8 bins × 16 blocks). Color images consist of three color channels (RGB), and the descriptor is computed separately for each channel, so the final descriptor for the example in Fig. 4 has 384 bins (3 channels × 8 bins × 16 blocks).

Fig. 4 Schematic of descriptor construction.
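Continuing the sketch above, the following hypothetical helpers pick the bit mode from the row-transition count and build the 8-bin histogram and the concatenated descriptor. The two transition thresholds t_low and t_high are placeholders of our own: the paper tunes a transition threshold experimentally and does not state how the 2- and 3-bit modes are separated.

```python
import numpy as np

def row_transitions(bm1):
    # Total 0/1 and 1/0 transitions over the rows of a 1-bit BM
    # (2 + 2 + 3 + 2 = 9 for the Fig. 3 example).
    return int(np.abs(np.diff(bm1.astype(int), axis=1)).sum())

def block_histogram(block, th=0.0, t_low=4, t_high=8):
    # Smooth blocks use the 1-bit mode; complex blocks the 2- or 3-bit mode.
    t = row_transitions(bitmap_kbit(block, 1, th))
    k = 1 if t < t_low else 2 if t < t_high else 3
    counts = np.bincount(bitmap_kbit(block, k, th).ravel(), minlength=2 ** k)
    # Spread each k-bit pattern over 8 / 2^k adjacent bins, so the 1-bit
    # counts [9, 7] become [9, 9, 9, 9, 7, 7, 7, 7] as in the text.
    return np.repeat(counts, 8 // 2 ** k)

def region_descriptor(region, m_blocks=4, th=0.0):
    # Concatenated block histograms of an M x M grid: M * M * 8 dimensions.
    n = region.shape[0] // m_blocks
    hists = [block_histogram(region[i*n:(i+1)*n, j*n:(j+1)*n], th)
             for i in range(m_blocks) for j in range(m_blocks)]
    return np.concatenate(hists)

def color_descriptor(rgb, m_blocks=4, th=0.0):
    # One descriptor per RGB channel: 3 * M * M * 8 dimensions
    # (384 bins for the 4x4 example of Fig. 4).
    return np.concatenate([region_descriptor(rgb[..., c], m_blocks, th)
                           for c in range(3)])
```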

III. EXPERIMENTS

A. Experimental Setup

The performance of the proposed method is compared with that of the state-of-the-art approach [7] using image data taken from the Amsterdam Library of Object Images (ALOI) data set [2], which contains images of 1,000 objects taken under various illumination conditions and noise.

Fig. 5 shows images of a sample object taken under four different illumination colors. Illumination color was controlled by changing the illumination temperature, yielding objects illuminated under reddish to white light. The image taken under illumination color I110 was used as the reference image, and images of the same object taken under the other illumination colors were treated as test images. As shown in Fig. 5, illumination color I250 differs most from the reference color I110.

Fig. 5 Sample ALOI images of a colored object under different illumination colors: (a) I110 (reference image), (b) I150, (c) I190, (d) I250.

Fig. 6 shows images of another object taken under four different illumination directions, controlled by turning on only the light from the left (L5C1), from the center (L3C1), from the right (L1C1), or all lights (L8C1); refer to [2] for a detailed description of the imaging setup. The image taken under illumination direction L8C1 was used as the reference image, and images of the same object taken under the other illumination directions were treated as test images.

Fig. 6 Sample ALOI images of a colored object under different illumination directions: (a) L8C1 (reference image), (b) L5C1, (c) L3C1, (d) L1C1.

Fig. 7 shows images of a sample object with different levels of Gaussian noise. The noisy images were generated with the MATLAB function imnoise(im, 'gaussian', m, v), with m set to zero and v = 0 (a), 0.025 (b), 0.075 (c), and 0.125 (d).

Fig. 7 Sample ALOI images of a colored object under different levels of Gaussian noise: (a) N000 (reference image), (b) N025, (c) N075, (d) N125.

For each image, histograms for the RGB descriptor, the Opponent descriptor, the SIFT descriptor, and the proposed descriptor were constructed, following the construction approach suggested in SIFT [7]. First, the object in the image was segmented from the dark background and normalized to a size of 160×160 pixels. The normalized image was then divided equally into a 4×4 grid (16 cells), with each cell containing 40×40 pixels. Next, a histogram of the image descriptor was computed for each block, and the final descriptor was constructed by concatenating the histograms of all blocks. For all the descriptors, each color channel was quantized into 8 bins, so the final descriptor had 384 dimensions (3 channels × 8 bins × 16 cells) [9].

Image matching was performed by comparing the histograms generated by the color descriptors, using the L1 norm of their difference:

L_1(H_1, H_2) = \sum_{i=1}^{K} \left| H_1(i) - H_2(i) \right|    (5)

where K is the number of histogram bins. A small value indicates that the two histograms are similar.
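A minimal sketch of the matching step in Eq. (5), assuming the stored reference descriptors are stacked row-wise in an array; the function names are ours.

```python
import numpy as np

def l1_distance(h1, h2):
    # Eq. (5): sum of absolute bin differences; smaller means more similar.
    return float(np.abs(np.asarray(h1, float) - np.asarray(h2, float)).sum())

def best_match(test_desc, reference_descs):
    # Index of the most similar reference descriptor under the L1 norm;
    # the match is correct when that reference shows the same object.
    dists = np.abs(np.asarray(reference_descs, float)
                   - np.asarray(test_desc, float)).sum(axis=1)
    return int(np.argmin(dists))
```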

B. Experimental Results

The following experimental results were obtained on the ALOI image database. Three sets of tests were conducted to evaluate the effects of illumination color, illumination direction, and noise level on the performance of the proposed color descriptor and several commonly used descriptors. For each set of experiments, the color image descriptors of the reference images were constructed and stored. The descriptor of each test image was then matched against the descriptors of the 1,000 reference images using Eq. (5), and a correct match was declared if the test image and the most similar reference image belong to the same object.

Table 1 shows the average percentage of correct matches for the image descriptors under the different illumination conditions and noise levels. We can see from Table 1 that for the ALOI data set, the SIFT descriptor and the proposed descriptor are unaffected by changes in illumination color, producing perfect matches, and the RGB descriptor is nearly unaffected; the Opponent descriptor is somewhat sensitive to changes in illumination color.

TABLE I. AVERAGE MATCHING ACCURACY OF COLOR DESCRIPTORS UNDER DIFFERENT ILLUMINATION CONDITIONS AND NOISE.

Average percentage of correct matches (%):

Descriptor   Illumination Color   Illumination Direction   Noise Level   Average
RGB                 99.0                  77.5                 99.6         92.0
Opponent            86.9                  82.4                 99.8         89.7
SIFT               100                    93.8                 98.8         97.5
Proposed           100                    94.4                 97.2         97.2

As for the illumination direction factor, the results show that the RGB descriptor has the worst overall performance under illumination direction variations, since it possesses the fewest invariance properties, followed by the Opponent descriptor. The SIFT descriptor and the proposed descriptor are less sensitive to these changes, with the proposed descriptor achieving the best matching results. Lastly, for the noise factor, all descriptors are rather tolerant of slight amounts of noise in the images. Overall, the SIFT descriptor and the proposed descriptor perform well under all conditions.

In terms of processing time, the average time taken in each set of experiments to match all 1,000 objects was 1228 s for the RGB descriptor, 1411 s for the Opponent descriptor, 1960 s for the SIFT descriptor, and 1302 s for the proposed descriptor. All algorithms were implemented in MATLAB running on Windows 7 with a 2.93 GHz Intel Core 2 Duo processor and 2 GB of memory. The proposed method is thus almost as efficient as the RGB descriptor and takes about 34% less time to match the images than the SIFT descriptor.

IV. CONCLUSIONS

This paper proposed a hierarchical coarse-to-fine texture-based image descriptor for image matching. Instead of accumulating gradient orientation histograms as in SIFT, which can be time consuming and susceptible to noise, the proposed method uses the mean values at multiple scale levels together with binary operations to improve computational time, thereby achieving stable and fast image matching. In addition, the proposed image descriptor is insensitive to changes in lighting geometry and illumination color. We have tested the new image descriptor on matching objects under different illumination conditions and noise levels; it outperforms the RGB and Opponent descriptors and performs comparably to the SIFT descriptor at a substantially lower computational cost. Future research will include keypoint detection and an evaluation of the proposed image descriptor for image matching under different object viewpoints.

ACKNOWLEDGMENT

This work was supported by the Ministry of Science and Technology, Taiwan, under Grant NSC 102-2221-E-468-017.

REFERENCES

[1] H. Bay, T. Tuytelaars, and L. J. V. Gool, "Speeded-Up Robust Features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.

[2] J. Geusebroek, G. Burghouts, and A. Smeulders, "The Amsterdam library of object images," International Journal of Computer Vision, vol. 61, no. 1, pp. 103-112, 2005.

[3] C. Harris and M. Stephens, "A combined corner and edge detector," in Fourth Alvey Vision Conference, Manchester, UK, pp. 147-151, 1988.
[4] Y. Ke and R. Sukthankar, "PCA-SIFT: a more distinctive representation for local image descriptors," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 506-513, 2004.

[5] T. Lindeberg, "Scale-space theory: a basic tool for analyzing structures at different scales," Journal of Applied Statistics, vol. 21, no. 1-2, pp. 225-270, 1994.

[6] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, pp. 1150-1157, 1999.

[7] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[8] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1615-1630, 2005.

[9] H. F. Ng, I. C. Chen, and H. Y. Liao, "An illumination invariant image descriptor for color image matching," Scientometrics, vol. 25, no. 1, pp. 306-311, 2013.

[10] C. H. Yeh, C. Y. Lin, K. Muchtar, and L. W. Kang, "Real-time background modeling based on a multi-level texture description," Information Sciences, vol. 269, no. 10, pp. 106-127, 2014.