Image Processing & Communication, vol. 18,no. 1, pp.31-36 DOI: 10.2478/v10248-012-0088-x 31 OBJECT RECOGNITION ALGORITHM FOR MOBILE DEVICES RAFAŁ KOZIK ADAM MARCHEWKA Institute of Telecommunications, University of Technology & Life Sciences in Bydgoszcz, ul. Kaliskiego 7, 85-789 Bydgoszcz, Poland rkozik@utp.edu.pl Abstract. In this paper an object recognition algorithm for mobile devices is presented. The algorithm is based on a hierarchical approach for visual information coding proposed by Riesenhuber and Poggio [1] and later extended by Serre et al. [2]. The proposed method adapts an efficient algorithm to extract the information about local gradients. This allows the algorithm to approximate the behaviour of simple cells layer of Riesenhuber and Poggio model. Moreover, it is also demonstrated that the proposed algorithm can be successfully deployed on a low-cost Android smartphone. 1 Introduction This paper is a continuation of author s recent research on object detection algorithms dedicate for mobile devices. The tentative results have been presented in [4]. In contrast to previous work, modifications to overall method architecture have been proposed. Moreover, the experiments described in this paper have been conducted for extended database of images. The proposed approach follows the idea of HMAX model proposed by Riesenhuber and Poggio [1]. It adapts the hierarchical fed-forward processing approach. However, modifications are introduced in order to reduce computational complexity of the original algorithm. Moreover, in contrast to Mutch and Lowe [10] model an additional layer that mimics the "retina codding" mechanism is added. The results showed that this step increases the robustness of the proposed method. However, it is considered as an optional step of visual information processing, because it introduces additional computational burden that can increase the response time (the frame rate drops significantly) of at low-cost mobile devices. In contrast to original HMAX model, a different method for calculating the S 1 layer response is proposed. The responses of 4 Gabor filters are approximated with algorithm that mimic the behaviour of HOG descriptors. The HOG [3] descriptors are commonly use for object recognition problems. The image is decomposed into small cells. In each cell an histogram of oriented gradients is used. Typically for HOG features the result is normalised and the descriptor is returned for each cell. However, in contrast to original HOG descriptor proposed by Dalal and Triggs, in this method (in order to simplify the computations) the block of cells are not overlapping, and
32 R. Kozik, A. Marchewka the input image is not normalised. Moreover, in previous work [7] the S 2 and C 2 layers were replaced with machine-learned classifier. Such approach significantly simplifies the learning process and decreases the amount of computations. However, this impacts the invariance to shape changes of a recognised object and the feature vectors presented to classifier are relatively of grater dimensionality. 2 The HMAX model overview The HMAX Visual Cortex model proposed by Riesenhuber and Poggio [1] exploits a hierarchical structure for the image processing and coding. It is arranged in several layers that process information in a bottom-up manner. The lowest layer is fed with a grayscale image. The higher layers of the model are either called "S" or "C". These names correspond to simple the (S) and complex (C) cells discovered by Hubel and Wiesel [9]. Both type of cells are located in the striate cortex (called V1), which is the part of visual cortex that lies in the most posterior area of the occipital lobe. The simple cells located in "S" layers apply local filters that responses form a vector of texture features. As noted by Hubel and Wiesel the individual cell in the cortex respond to the presence of edges. They also discovered that these cells are sensitive to edge orientation (some of cells fire only when a given orientation of an edge is observed). The complex cells located in "C" layers calculate in a limited range of a previous layer the strongest responses of a given type (orientation). That way more complex combination of simple features are obtained combined from three simple features. The hierarchical HMAX model proposed by Mutch and Lowe [10] is a modification of the model presented by Serre et al. in [2]. It introduces two layers of simple cells (S 1 and S 2 ) and two layers of complex cells (see Fig. 1). It uses set of filters designed to emulate V1 simple cells. The Fig. 1: The structure of a hierarchical model proposed by Mutch and Lowe [10]. (used symbols: I - image, S - simple cells, C - complex cells, GF - Gabor Filters, F - prototype features vectors, - convolution operation) simple layers are computed using a hard max filter [11]. It means that the "C" cells responses are the maximum values of the associated "S" cells. As it is shown in the Fig. 1 the images are processed by the subsequent simple and complex cells layers and reduced to feature vectors, which are further used in the classification process. The set of features (F ) is shared across all images and object categories. Features are computed hierarchically in subsequent layers built from the previous one by alternating the template matching and the max pooling operations. The S 1 layer in the Mutch and Lowe [10] model adapts the 2D Gabor filters computed for four orientations (horizontal, vertical, and two diagonals) at each possible position and scale. The Gabor filters are 11 11 in size, and are described by: G(x, y) = e (X2 +γy 2 )/(2σ 2) cos( 2π λ ) (1) where X = x cos φ y sin φ and Y = x sin φ+y cos φ; x, y < 5; 5 >, and φ < 0; π >. The aspect ration (γ), effective width (σ), and wavelength (λ) are set to 0.3, 4.5 and 5.6 respectively. The response of a patch of pixels X to a particular filter G is computed using the formula (2). Xi G i R(X, G) = abs( ) (2) X 2 i The complex cells located in C 1 layer pool associated units in the S 1 layer. For each orientation, the S 1 re-
Image Processing & Communication, vol. 18, no. 1, pp. 31-36 33 sponses are convolved with a max filter, that is 10 10 of size in x, y dimension (position) and has 2 units of deep in scale. As it is shown in the Fig. 1, the intermediate S 2 layer is formed by convolving the C 1 layer response with a set of intermediate-level features (depicted as F in Fig. 1). The set of intermediate-level features is established during the learning phase. For a given set of learning images C 1 responses are computed. The most meaningful features are selected using SVM weighting. Mutch and Lowe [10] suggested to sub-sample the C 1 responses before feature selection. Therefore, authors select at random positions and scales of the C 1 layer patches of size 4 4, 8 8, 12 12, and 16 16. Selected and weighted features compose so called prototypes that are used in the Mutch and Lowe model as filters which responses create the S 2 layer. The C 2 layer composes a global feature vector which particular element corresponds to the maximum response to a given prototype patch. In order to identify the visual object on the basis of feature vector a classifier is learnt (e.g. SVM). 3 The proposed method In order to approximate the S 1 layer behaviour of the HMAX model the algorithms that mimics the HOG descriptors is used. However, to make this step as simple as possible (due to the low computation power of mobile devices) only gradient orientation binning schema is adapted. In fact, the intention here is no to compute HOG descriptor, but to approximate the response of four Gabor filters used in original HMAX model. In the S 1 layer there are N M 4 simple cells arranged in a grid of size N M blocks. In each block there are 4 cells. Each cell is assigned a receptive field (pixels inside the block). Each cell activates depend on a stimuli. In this case there are four possible stimulus, namely vertical, horizontal, left diagonal, and right diagonal edges. As a result the S 1 simple cells layer output has dimensionality of a size 4 (x, y, scale and 4 cells). In order to compute responses of all four cells inside a given block (receptive field), an Algorithm 1 is applied. Algorithm 1 Algorithm for calculating S 1 response Require: Grayscaled image I of W H size Ensure: S 1 layer of size N M 4 Assign G min the low-threshold for gradient magnitude. for each pixel (x, y) in image I do Compute horizontal G x and vertical G y gradients using Prewitt operator Compute gradient magnitude G x,y in point (x, y) n x N W ; m y M W if G < G min then go to next pixel else active get_cell_type(g x, G y ) S 1 [n, m, active] S 1 [n, m, active] + G x,y end if end for The algorithm computes the responses of all cells using only one iteration over the whole input image I. For each pixel at position (x, y) a vertical and horizontal gradients are computed (G x and G y ). Given the pixel position (x, y) and gradient vector [G x, G y ] the algorithm indicates the block position (n, m) and type of cell (active) that response has to be incremented by G. In order to classify given gradient vector [G x, G y ] as horizontal, diagonal or vertical the get_cell_type(, ) function uses simple schema for voting. If a point ( G x, G y ) is located between line y = 0.3 x and y = 3.3 x it is classified as a diagonal. If G y is positive then the vector is classified as a right diagonal (otherwise it is a left diagonal). In case the point ( G x, G y ) is located above line y = 3.3 x the gradient vector is classified as vertical and as horizontal when it lies below line y = 0.3 x. The processing in S 1 layer is applied for three scales (an original image, and two images scaled by a factor of 0.7 and 1.5). This is a necessary operation to achieve the response of C 1 layer. The C 1 complex layer response is
34 R. Kozik, A. Marchewka computed using max-pooling filter, that is applied to S 1 layer. This layer allows the algorithm to achieve scale invariance. The output of the C 1 layer cells is used in order to select features that are meaningful for a given task of an object detection. Firstly, randomly selected images of an object are reduced to a features vectors by a consecutive layers of the proposed model. Afterwards, the PCA is applied to calculate the eigenvectors. Each eigenvector is used as a single feature in a features vector. The response of S 2 layer is obtained using eigenvectors to compute the dot product with the C 1 layer. Finally, an classifier is learned to recognise an object using S 2 response layer. For this paper different machinelearnt algorithms are investigated. 4 Experiments and Results was possible to achieve 95,6% of detection rate while having 2.3% of false positives. It can be noticed that adding additional trees does not improve the effectiveness. The early prototype of proposed algorithm was deployed on an Android device. The code was written in pure Java and tested on Samsung Galaxy Ace device. This device is equipped with 800 MHz CPU, 278 MB of RAM and Android 2.3 operating system. Current version implements brute force Nearest Neighbour classifier, operates only for one scale (PC scans three scale - an original image, and two images scaled by a factor of 0.7 and 1.5), and achieves about 10 FPS when less than 10 training samples are provided. Some examples of object detection are shown in Fig. 3. During the testing it was noticed that the algorithm is able to recognise an object on a cluttered background even if only few learning samples are provided. In contrast to the previous work [4], the data base of images has been extended and new experiments have been conducted, which aimed at evaluating the effectiveness of different classifiers. In this work tree-based and rule-based classification approaches have been compared, namely N (5,10,15) Random Trees, J48 and PART (previously used by author in [5, 6, 4]) algorithms. The comparison has been presented in Fig.2. For the experiments the dataset was randomly split for training and testing samples (in proportion 70% to 30% respectively). For training samples the C 1 layers responses were calculated and eigenvectors were extracted. The eigenvectors were used to calculate the S 2 layers responses that were further used to learn classifier. During the testing phase the eigenvectors were again used in order to extract S 2 layers responses for testing samples. The whole procedure was repeated 10 times ( the dataset was randomly reshuffled before split) and the results were averaged. The best results were obtained for 10 Random Trees. It Fig. 3: Example of object detection (mug and doors) with proposed algorithm deployed on an Android device 5 Conclusions In this paper an algorithm for object recognition dedicated form mobile devices was presented. The algorithm is based on a hierarchical approach for visual information coding proposed by Riesenhuber and Poggio [1] and later
False Positive Rate Image Processing & Communication, vol. 18, no. 1, pp. 31-36 35 0,04 0,035 0,03 0,025 0,02 0,015 0,01 0,005 10 random trees 5 random trees 15 random trees J48 PART Expon. (10 random trees) Expon. (5 random trees) Expon. (15 random trees) Expon. (J48) Expon. (PART) 0 0,78 0,8 0,82 0,84 0,86 0,88 0,9 0,92 0,94 0,96 0,98 Detection Rate Fig. 2: Different classifiers effectiveness extended by Serre et al. [2]. The experiments show that the introduced algorithm allows for efficient feature extraction and a visual information coding. Moreover, it was shown that it is also possible to deploy proposed approach on a low-cost mobile device. The results are promising and show that described in this paper algorithm can be further investigated as one of assistive solutions dedicated for visually impaired people. References [1] M. Riesenhuber, T. Poggio, Hierarchical models of object recognition in cortex, Nat Neurosci, Vol. 2, pp.1019 1025, 1999 [2] T. Serre, G. Kreiman, M. Kouh, C. Cadieu, U. Knoblich T. Poggio, A quantitative theory of immediate visual recognition, In:Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, Vol. 165, pp. 33 56, 2007 [3] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, Computer Vision and Pattern Recognition, CVPR 2005, IEEE Computer Society Conference on, Vol.1, pp. 886 893, June 2005 [4] R. Kozik, A Simplified Visual Cortex Model for Efficient Image Codding and Object Recognition, Image Processing and Communications Challenges 5, Advances in Intelligent Systems and Computing, pp. 271 278, 2013 [5] R. Kozik, Rapid Threat Detection for Stereovision Mobility Aid System, In: T. Czachorski et al. (Eds.): Man-Machine Interactions 2, AISC 103, pp. 115 123, 2011 [6] R. Kozik, Stereovision system for visually impaired. Burduk, Robert (ed.) et al., Computer recognition systems 4, Advances in Intelligent and Soft Computing 95, pp. 459 468, 2011 [7] R. Kozik, A Proposal of Biologically Inspired Hi-
36 R. Kozik, A. Marchewka erarchical Approach to Object Recognition, Journal of Medical Informatics & Technologies, Vol. 22/2013, ISSN 1642-6037, pp. 171 176, 2013 [8] R. Kozik, A Simplified Visual Cortex Model for Efficient Image Codding and Object Recognition, Image Processing and Communications Challenges 5, Advances in Intelligent Systems and Computing S. Choras, Ryszard, pp. 271-278, 2013 [9] D.H. Hubel, T.N. Wiesel Receptive fields, binocular interaction and functional architecture in the cat s visual cortex, J Physiol, Vol. 160, pp. 106 154, 1962 [10] J. Mutch, D.G. Lowe, Object class recognition and localization using sparse features with limited receptive fields,international Journal of Computer Vision (IJCV), Vol. 80, No. 1, pp. 45 57, October 2008 [11] MAX pooling, http://ufldl.stanford.edu/wiki/index. php/pooling