QR CODE LOCALIZATION USING DEEP NEURAL NETWORKS


2014 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 21-24, 2014, REIMS, FRANCE

Tamás Grósz, Péter Bodnár, László Tóth, László G. Nyúl
MTA-SZTE Research Group on Artificial Intelligence, Hungarian Academy of Sciences and University of Szeged
Department of Image Processing and Computer Graphics, University of Szeged
{groszt, bodnaar, tothl, nyul}@inf.u-szeged.hu

ABSTRACT

The use of computer-readable visual codes has become common in everyday life, both in industrial environments and in private use. The reading process of visual codes consists of two steps, localization and data decoding. This paper introduces a new method for QR code localization using conventional and deep rectifier neural networks. The structure of the neural networks, regularization, and training parameters, such as input vector properties, the amount of overlap between samples, and the effect of different block sizes, are evaluated and discussed. Results are compared to localization algorithms from the literature.

Index Terms: QR code, object detection, pattern recognition, neural networks, machine learning

1. INTRODUCTION

The QR code is a common type of visual code format used in various industrial setups and private projects alike. Its structure is well defined, which makes automatic reading by computers and embedded systems possible. QR codes have seen a greater increase in usage over the last couple of years than other patented code types, such as Aztec codes or MaxiCodes. This is due to a well-constructed error correction scheme that allows the recovery of heavily damaged codes. Image acquisition techniques and computer hardware have also improved significantly, making automatic reading of QR codes feasible. State-of-the-art algorithms no longer require human presence or assumptions about code orientation, position and coverage rate in the image.
However, image quality and acquisition circumstances vary considerably, and each application has its own requirements for detection speed and accuracy, which makes the task more complex. The recognition process consists of two steps, localization and decoding. The literature already offers a wide selection of papers proposing algorithms for efficient QR code localization [1-5]; however, each has its own strengths and weaknesses. For example, while these methods are proven to be accurate, morphological operations, convolutions, corner detection and the calculation of convexity defects can be a bottleneck for processing performance. Applications using machine learning techniques can overcome this issue and produce solutions that are efficient with respect to both speed and accuracy.

In the last few years, there has been a renewed interest in applying neural networks, especially deep neural networks, to various tasks. As the name suggests, deep neural networks (DNN) differ from conventional ones in that they consist of several hidden layers. However, to train these deep networks properly, the training method requires modifications, as the conventional backpropagation algorithm encounters difficulties (the vanishing gradient and explaining away effects). Here, the vanishing gradient effect means that the error may vanish as it gets propagated back through the hidden layers [6]. As a result, some hidden layers, in particular those close to the input layer, may fail to learn during training. At the same time, in fully connected deep networks, the explaining away effect makes inference extremely difficult in practice [7]. Several solutions have been proposed to counter these problems; they modify either the training algorithm, by extending it with a pre-training phase [7, 8], or the architecture of the neural networks [9].

2. THE PROPOSED METHOD

The first step of the localization process is the uniform partitioning of the image into square blocks.
Each block is processed individually by the neural network, and a measure is assigned that reflects the probability that a QR code part is present in that block. After evaluating all image blocks, a matrix is formed from the probability values (Fig. 1). The next step is to find clusters in this matrix that have sufficient size, compactness and probability values high enough to form a QR code. The final step is to return the cluster centers that satisfy these conditions and to give bounding boxes for the probable QR code areas. For each block the image is divided into, a one-dimensional vector is formed by reading pixels in a circular pattern, as suggested in [10]. This pattern provides prominent features while keeping the number of necessary pixels low, typically 6-10 % of the image pixels. Furthermore, it can indicate the presence of a QR code in any orientation.

Fig. 1. Image captured by a phone camera (a) and the feature image (b) visualized according to the output of the neural network.

2.1. The Neural Network

The vectors are passed to the core of this approach, the neural network. In this paper, both conventional and deep rectifier neural networks (DRN) have been evaluated. DRNs alter the hidden neurons in the network rather than the training algorithm, by using rectified linear units. These rectified units differ from standard neurons only in their activation function: they apply the rectifier function (max(0, x)) instead of the sigmoid or hyperbolic tangent activation. Owing to their properties, DRNs do not require any pre-training to achieve good results [9]. The rectifier function has two important properties, namely its hard saturation at 0 and its linear behaviour for positive input. The first property means that only a subset of neurons will be active in each hidden layer. For example, when the weights are initialized uniformly, around half of the hidden units' outputs are zeros. In theory, this hard saturation at 0 could harm optimization by blocking gradient backpropagation. Fortunately, experimental results do not support this, showing that the hard non-linearities do no harm as long as the gradient can propagate along some path [9]. Owing to the other property of the rectified units, namely the linear behaviour of the active units, there is no vanishing gradient effect [9]. This linear behaviour also means that the computational cost is smaller, as there is no need to compute the exponential function during the activation calculation, and the sparsity can also be exploited.
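The block-scanning step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `predict_block` is a hypothetical stand-in for the trained neural network (here it simply scores the fraction of dark pixels), and the loop builds the probability matrix from overlapping or non-overlapping square blocks.

```python
def predict_block(block):
    # Hypothetical stand-in for the trained NN: scores the fraction of
    # dark pixels in the block (illustration only, not the real model).
    flat = [p for row in block for p in row]
    return sum(1 for p in flat if p < 128) / len(flat)

def probability_matrix(image, block_size, offset):
    """Scan the image with square blocks (spaced `offset` px apart) and
    collect one QR-presence probability per block position."""
    h, w = len(image), len(image[0])
    matrix = []
    for y in range(0, h - block_size + 1, offset):
        row = []
        for x in range(0, w - block_size + 1, offset):
            block = [r[x:x + block_size] for r in image[y:y + block_size]]
            row.append(predict_block(block))
        matrix.append(row)
    return matrix

# Tiny 4x4 "image": dark top-left quadrant, light elsewhere.
img = [[0, 0, 255, 255],
       [0, 0, 255, 255],
       [255, 255, 255, 255],
       [255, 255, 255, 255]]
m = probability_matrix(img, block_size=2, offset=2)  # 2x2 probability matrix
```

Cluster finding and bounding-box extraction would then operate on `m` exactly as the text describes.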
Unfortunately, this linearity also has a disadvantage, the exploding gradient effect, which means that the gradients can grow without limit. To prevent this, we applied L1 normalization by scaling the weights such that the L1 norm of each layer's weights remained the same as it was after initialization. What makes this possible is that, for a given input, the subset of active neurons behaves linearly, so a scaling of the weights is equivalent to a scaling of the activations.

Our deep networks consisted of three hidden layers, each with 1000 rectified neurons, as DRNs with this structure yielded the best results on the development sets. The shallow neural net was a sigmoid net with only one hidden layer, containing the same number of hidden neurons (1000) as each layer of the deep one. The output layer of the neural networks consisted of two softmax neurons, one for the positive and one for the negative label, allowing the networks to output not only classification decisions but also posterior probabilities. As the error function we applied the cross-entropy function. In our study, two regularization methods were utilized to prevent overfitting, namely early stopping and weight decay. Early stopping was achieved by halting the training when there was no improvement in two subsequent iterations on the validation set. For weight decay regularization, the weights were scaled back after each iteration, forcing them to converge to smaller absolute values than they otherwise would. The neural networks were trained using semi-batch backpropagation with a batch size of 100. The initial learning rate was set to 0.001 and held fixed while the error on the development set kept decreasing. Afterwards, if the error rate did not decrease in a given iteration, the learning rate was halved. We used our custom implementation for neural networks, which was implemented to run on GPU [11].
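The L1-norm rescaling used against exploding gradients can be illustrated as follows. The helper names and the toy weight values are ours; the key point is that scaling all weights by one factor restores the layer's L1 norm to its post-initialization value without changing which rectified units are active.

```python
def l1_norm(weights):
    # Sum of absolute values over a layer's weight matrix.
    return sum(abs(w) for row in weights for w in row)

def rescale_to_l1(weights, target_norm):
    """Scale a layer's weights so their L1 norm equals `target_norm`.
    Because the active rectified units behave linearly for a given input,
    this only rescales the activations; it does not change the active set."""
    factor = target_norm / l1_norm(weights)
    return [[w * factor for w in row] for row in weights]

layer = [[0.5, -1.5], [2.0, -1.0]]         # current L1 norm = 5.0
norm_at_init = 2.5                          # norm recorded after initialization
layer = rescale_to_l1(layer, norm_at_init)  # L1 norm restored to 2.5
```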
The training of a DRN on the synthetic dataset took less than 8 minutes using an NVIDIA GTX-770 graphics card.

2.2. Input Data and Block Size

For the input data, various options were available, and a separate training was performed with each type. The first choice was to train a neural network on raw pixel data, read in the discussed pattern for each block. A binary version of the vectors was also evaluated, to compare the efficiency of the neural networks on grayscale and binary images. Furthermore, since the QR code has a well-defined, strict structure, blocks of QR code parts probably have very specific components in the frequency domain. This assumption motivated us to perform experiments with vectors in that domain: we transformed the training vectors to the DCT and DFT domains and evaluated neural networks on both sets. Finally, a neural network was trained on the edge map, since the structure of QR codes also suggests very specific edge layouts that can probably be fed to a neural network and result in efficient training. Training vectors in that case consisted of pixels of the unified magnitude map of the Sobel-X and Sobel-Y gradients (Fig. 2), read in the circular pattern.

For making a decision about block size, two conditions have to be fulfilled. The block size should be small enough that many blocks build up a QR code, so that a threshold can be chosen easily for dropping or keeping clusters in the probability matrix. On the other hand, the block size has to be large enough to provide prominent characteristics in the input vectors of the neural network. Papers on visual code localization suggest, empirically [12] or using a geometrical approach [10], that the optimal tile size is about 1/3 of the smaller dimension of the examined visual code in the case of features based on image partitioning. This choice of block size was our initial value; however, we also evaluated neural networks with input vectors of different block sizes. After deciding on a suitable block size, the amount of overlap also has to be decided. Different offsets for blocks of the optimal size are also evaluated empirically, because papers on this topic do not categorically state whether overlapping increases localization performance.

Fig. 2. Example synthetic image (a) and Sobel edge magnitude map (b). Edge maps seem to be a reliable input for the neural network training, which is confirmed in the results section.

3. EVALUATION AND RESULTS

The test database consists of 10 000 synthetic and 100 arbitrarily acquired images containing QR codes. The synthetic examples are built with a computer-generated QR code containing all of the lower- and uppercase letters of the alphabet in random order. This QR code was placed on a random negative image with a perspective transformation. After that, Gaussian smoothing and noise were gradually added to the images, with the σ of the Gaussian kernel ranging over [0, 3]. For noise, a noise image (I_n) was generated with intensities following a normal distribution over [-127, 127] and added gradually to the original 8-bit image (I_o) as I = α·I_n + (1 − α)·I_o, with α ranging over [0, 0.5]. Samples with parameters in the discussed ranges are shown in Fig. 3 ((a) σ = 0.07, α = 0.39; (b) σ = 0.65, α = 0.77; (c) σ = 1.32, α = 0.05; (d) σ = 1.88, α = 0.95; (e) σ = 2.86, α = 0.65; (f) σ = 2.99, α = 0.69). A total of 1.8 million vectors were extracted from those images, about 0 000 of them labeled as positive.
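The noise-blending step I = α·I_n + (1 − α)·I_o can be sketched as follows. This is an illustrative implementation, not the authors' code: the noise standard deviation is an assumption chosen so that the generated values mostly stay within the stated [-127, 127] range, and the result is clipped to valid 8-bit intensities.

```python
import random

def blend_noise(original, alpha, seed=0):
    """Blend a zero-mean noise image I_n into an 8-bit image I_o:
    I = alpha * I_n + (1 - alpha) * I_o, clipped to [0, 255]."""
    rng = random.Random(seed)
    out = []
    for row in original:
        out_row = []
        for p in row:
            n = rng.gauss(0, 42)  # assumed sigma; values mostly in [-127, 127]
            v = alpha * n + (1 - alpha) * p
            out_row.append(min(255, max(0, round(v))))
        out.append(out_row)
    return out

img = [[0, 128, 255]]
noisy = blend_noise(img, alpha=0.25)  # mild noise, as in the lower alpha range
```

With α = 0 the original image is returned unchanged; α = 0.5 (the upper end of the range used) gives an equal mix of noise and image.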
Real images were taken with a 3.2 Mpx Huawei handheld phone camera. Significant smoothing is present on those images due to the absence of flash and of precise focusing capability. These images contain a printed QR code, mostly suffering from minor bending, placed on various textured backgrounds, such as creased cloth, carpet, marble or the cover page of a book (Fig. 1(a) and 4(d)), in order to create high background variability and make the classification task more complex.

Fig. 3. Samples of the training database with different amounts of smoothing and noise.

The arbitrarily acquired image set had QR codes of roughly the same size as the artificial set, and its extracted vector set consisted of about 3 000 vectors, about 80 000 of them being positive. Images were converted to grayscale before vector extraction.

3.1. Input Types

The first group of neural network trainings was performed using different input data types and ranges, in order to determine the best available option for the input vectors. This first test set did not contain full images, only the vectors of blocks with 100 % coverage of a QR code part, and an equal number of random negative blocks, about 0 000 vectors in total. Both conventional and deep rectifier neural networks were trained on the data, as shown in Table 1. For this first test, only two portions of the database were separated, for

training and testing purposes, in 8:2 proportion. For later tests, the smaller set was further divided into test and validation parts. Results show that deep neural networks outperform regular ones in general, and that 8-bit input data facilitates more accurate training. Applying the DCT or DFT to the vectors does not influence the training significantly; even though those transforms are easy to compute on GPUs, they do not seem to be worth the computational cost. However, vectors of Sobel magnitude images slightly improve training efficiency, which could be expected due to the QR code edge structure (Fig. 2). The neural networks of this first evaluation used a block size of roughly 1/3 of the expected QR code size. The block offset for overlapping blocks was set to 10 px, since the element size of the generated QR code was also that large. Each vector consisted of 128 points sampled from the circular pattern within the block.

Table 1. Training results for different input data types and ranges.

  NN type  Input data  Input range  Precision  Hit rate  F-measure
  ANN      Raw pixels  8-bit        0.9892     0.9953    0.9922
  ANN      Raw pixels  binary       0.9705     0.9846    0.9774
  ANN      DCT         8-bit        0.9889     0.9951    0.9920
  ANN      DCT         binary       0.9693     0.98      0.9776
  ANN      DFT         8-bit        0.9904     0.9981    0.9943
  ANN      DFT         binary       0.9711     0.9808    0.97
  ANN      Sobel-XY    8-bit        0.9979     0.9990    0.9984
  DRN      Raw pixels  8-bit        0.9947     0.9972    0.9959
  DRN      Raw pixels  binary       0.9704     0.9862    0.9782
  DRN      DCT         8-bit        0.9941     0.9967    0.9954
  DRN      DCT         binary       0.9686     0.9873    0.9778
  DRN      DFT         8-bit        0.9933     0.9958    0.9945
  DRN      DFT         binary       0.9621     0.98      0.9734
  DRN      Sobel-XY    8-bit        0.9978     0.9991    0.9984

Table 2. Evaluation of the neural networks on arbitrarily acquired images. Input vectors were 8-bit raw pixels. Inferior DRN precision indicates that these neural nets need more training samples compared to ANNs.

  NN type  Precision  Hit rate  F-measure
  ANN      0.58       0.9895    0.7077
  DRN      0.4395     0.9979    0.6103
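The three reported metrics are related: the F-measure is the harmonic mean of precision and hit rate (recall). A small sketch with illustrative counts (not taken from the paper's tables) shows the computation:

```python
def precision_hitrate_f(tp, fp, fn):
    """Precision, hit rate (recall) and their harmonic mean (F-measure),
    from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    hit_rate = tp / (tp + fn)
    f_measure = 2 * precision * hit_rate / (precision + hit_rate)
    return precision, hit_rate, f_measure

# Illustrative counts only:
p, h, f = precision_hitrate_f(tp=9900, fp=100, fn=50)
```

Because it is a harmonic mean, the F-measure always lies between precision and hit rate, which is why the low-precision rows in Table 2 have low F-measures despite near-perfect hit rates.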
After training these neural networks on the synthetic database, the networks trained on the raw 8-bit pixel data were also evaluated on the arbitrarily acquired image set. Precision dropped to about 0.4, but the hit rate is still acceptable. Both shallow and deep rectifier neural networks were evaluated, as shown in Table 2. The precision drop is probably due to the various textures, such as tablecloth, wood and marble, present in the real images.

3.2. Partially Covered Blocks

In the first case, training was only performed using vectors of background blocks and blocks fully covered with QR code parts. As the next step of this experiment, the input data was extended with partially covered blocks, thresholded at 0.1 and 0.5 coverage ratios for positive labeling. No filtering was applied to the amount of negatively labeled vectors. Both ANNs and DRNs were trained on vectors of both synthetic and real images, separately. The database was divided in an 8:1:1 ratio into training, testing and validation parts, respectively. The results (Table 3) show that allowing partially covered samples into the training significantly drops the precision of the NNs. Furthermore, a threshold of 0.1 for the positive labels keeps the hit rate at a satisfactory level, while 0.5 decreases it considerably (Fig. 4). Still, the low hit rate is not a problem, since enough positively predicted blocks remain in the probability matrix to indicate a QR code candidate even in those cases (Fig. 4(e)). It can be concluded that NNs trained on full images and those trained on a reduced subset of vectors, in which positively and negatively labeled samples occur in roughly equal amounts, do not differ significantly, while excluding vectors that come from partially covered blocks drops the performance noticeably. Table 4 summarizes the results measured on the discussed training sets.

Table 3. Evaluation of the neural networks also having vectors of partially covered blocks as input, with different thresholds for the input labels.

  T+ = 0.1:
  NN type  Data type  Precision  Hit rate  F-measure
  ANN      Real       0.6454     0.9518    0.7692
  DRN      Real       0.6699     0.9419    0.7829
  ANN      Synthetic  0.5654     0.9901    0.7198
  DRN      Synthetic  0.56       0.9865    0.7169

  T+ = 0.5:
  NN type  Data type  Precision  Hit rate  F-measure
  ANN      Real       0.9175     0.5994    0.7251
  DRN      Real       0.8962     0.8414    0.8679
  ANN      Synthetic  0.8703     0.8947    0.8823
  DRN      Synthetic  0.9059     0.9347    0.9201

Table 4. DRN results on different subsets of the input vectors of synthetic images. N+ and N− denote positive and negative samples, while N+_b and N−_b are subsets of N+ and N− with partially covered blocks excluded.

  Subset       Opt. thresh.  max(F1)  AUC
  All vectors  0.62          0.9343   0.9957
  N+ ∪ N−      0.63          0.9270   0.9949
  N+_b ∪ N−_b  0.82          0.8312   0.98
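The coverage-ratio labeling rule behind Table 3 can be stated compactly. `label_block` is our illustrative helper, not code from the paper: a block is labeled positive when the fraction of its area covered by QR code reaches the threshold T+.

```python
def label_block(coverage, t_pos):
    """Label a training block positive (1) when the fraction of its area
    covered by QR code reaches the threshold T+, otherwise negative (0)."""
    return 1 if coverage >= t_pos else 0

coverages = [0.0, 0.05, 0.3, 0.6, 1.0]
labels_loose = [label_block(c, 0.1) for c in coverages]   # T+ = 0.1
labels_strict = [label_block(c, 0.5) for c in coverages]  # T+ = 0.5
```

The looser threshold admits many partially covered blocks as positives, which is exactly why it preserves hit rate at the cost of precision in Table 3.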

Fig. 4. Original and feature images built from the output of the neural networks ((a) synthetic image, (b) ANN feature image, (c) DRN feature image, (d) real image, (e) ANN feature image, (f) DRN feature image). Even though ANNs have weaker classification power and miss several blocks, they can indicate the presence of QR code candidates strongly enough.

3.3. Parameters and Run Times

The computational power of our GPU setup allowed us to process about 4 000 vectors per second, which means real-time processing of 800 0 px images at cca. 5 FPS, using the chosen block size and a 10 px offset. The QR codes on which the evaluation was performed had a 10 px element size. Block sizes of the input vectors up to 90 px were evaluated. Block sizes below the evaluated range would lead the training to learn solid black and white blocks as positive, heavily raising the false positive rate. Block sizes larger than 90 px would also decrease performance, since a large block size drastically decreases the number of positive samples available for training, while introducing a vast amount of partially covered blocks that are harder to train on. The numbers in Fig. 5 also show that the proposed method is robust with respect to the expected QR code size, since the block size range with good performance is large relative to the code dimensions. Fig. 6 shows results for DRNs using different rates of overlapping, from a 10 px offset up to no overlap at all. It is notable that the F-score differs only slightly over this range, while the amount of input vectors is drastically smaller.
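The trade-off between block offset and vector count can be quantified with a quick sketch; the 800×600 image size and 30 px block size used here are assumptions for illustration, since the exact values are garbled in the transcription.

```python
def block_count(image_w, image_h, block_size, offset):
    """Number of positions a sliding square window generates over an image;
    smaller offsets (more overlap) increase the count roughly quadratically."""
    nx = (image_w - block_size) // offset + 1
    ny = (image_h - block_size) // offset + 1
    return nx * ny

# Hypothetical 800x600 image with a 30 px block size:
dense = block_count(800, 600, 30, 10)   # 10 px offset, heavy overlap
sparse = block_count(800, 600, 30, 30)  # offset equal to block size, no overlap
```

Dense overlap yields nearly an order of magnitude more vectors to classify per image, which is why a nearly unchanged F-score at larger offsets matters for throughput.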
After all these experiments, a DRN was trained with the best experimental training setup, in order to compare it with other localization methods from the literature. According to Table 5, NN training is proven to be a viable option for efficient real-time QR code localization. As the first approach to compare, an algorithm based on mathematical morphology was selected [2], since it is also a general-purpose toolset of image processing, as NNs are for machine learning tasks. This reference method is reliable, but due to the morphological operations it requires the most computational capacity and gives the slowest processing speed, from 900 to 10 ms per image. Another work on QR code localization uses specific, image-based classifier training: it proposes training a cascade of classifiers using Haar-like features [3].

Fig. 5. DRN results with respect to input block size (max(F) and AUC).

Fig. 6. DRN results with respect to input block offset (max(F) and AUC).

Table 5. Performance comparison of the best NN to other algorithms.

  Algorithm      Precision  Recall  F-measure
  Proposed       0.93       0.9327  0.9343
  REF-HAAR [3]   0.1535     0.9436  0.26
  REF-MORPH [2]  0.6989     0.99    0.8042

4. CONCLUDING REMARKS

The usage of neural networks is a robust, reliable approach for fast QR code localization. We have examined the performance of shallow and deep rectifier neural networks for the task on various input types, and have demonstrated the efficiency of NNs on a large number of input images, both synthetic and real. It is observed that DRNs perform better than ANNs in general. The best choice for the input data is the edge magnitude map. Block size does not interfere with DRN accuracy significantly and has a wide working range relative to the expected QR code dimensions. Even with dense overlapping of blocks, real-time image processing is possible. Our future work includes experimenting with other patterns and features for the neural network training.

Acknowledgement. This publication is supported by the European Union and co-funded by the European Social Fund. Project title: Telemedicine-oriented research activities in the fields of mathematics, informatics and medical sciences. Project number: TÁMOP-4.2.2.A-11/1/KONV-2012-0073.

5. REFERENCES

[1] Chung-Hua Chu, De-Nian Yang, Ya-Lan Pan, and Ming-Syan Chen, "Stabilization and extraction of 2D barcodes for camera phones," Multimedia Systems, pp. 113-133.

[2] Eisaku Ohbuchi, Hiroshi Hanaizumi, and Lim Ah Hock, "Barcode readers using the camera device in mobile phones," in Cyberworlds, 2004 International Conference on, 2004, pp. 2 265.

[3] Luiz F. F. Belussi and Nina S. T. Hirata, "Fast QR code detection in arbitrarily acquired images," in Graphics, Patterns and Images (Sibgrapi), 2011 24th SIBGRAPI Conference on, 2011, pp. 281-288.

[4] Gábor Sörös and Christian Flörkemeier, "Blur-resistant joint 1D and 2D barcode localization for smartphones," in Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia (MUM '13), New York, NY, USA, 2013, pp. 11:1-11:8, ACM.

[5] István Szentandrási, Adam Herout, and Markéta Dubská, "Fast detection and recognition of QR codes in high-resolution images," in Proceedings of the 28th Spring Conference on Computer Graphics (SCCG '12), New York, NY, USA, 2013, pp. 129-136, ACM.

[6] Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. AISTATS, 2010, pp. 249-256.

[7] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.

[8] Frank Seide, Gang Li, Xie Chen, and Dong Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, 2011, pp. 24-29.

[9] Xavier Glorot, Antoine Bordes, and Yoshua Bengio, "Deep sparse rectifier networks," in Proc. AISTATS, 2011, pp. 315-323.

[10] Péter Bodnár and László G. Nyúl, "A novel method for barcode localization in image domain," in Image Analysis and Recognition, vol. 79 of Lecture Notes in Computer Science, pp. 189-196, Springer Berlin Heidelberg, 2013.

[11] László Tóth and Tamás Grósz, "A comparison of deep neural network training methods for large vocabulary speech recognition," in Proceedings of TSD, 2013, pp. 36-43.

[12] Péter Bodnár and László G. Nyúl, "Improving barcode detection with combination of simple detectors," in The 8th International Conference on Signal Image Technology (SITIS 2012), 2012, pp. 0 6.