Human-Readable Fiducial Marker Classification using Convolutional Neural Networks


Yanfeng Liu, Eric T. Psota, and Lance C. Pérez
Department of Electrical and Computer Engineering, University of Nebraska-Lincoln, Lincoln, United States
yanfengliux@gmail.com, epsota@unl.edu, lperez@unl.edu

Abstract — Many applications require both the location and identity of objects in images and video. Most existing solutions, such as QR codes, AprilTags, and ARTags, use complex machine-readable fiducial markers with heuristically derived methods for detection and classification. However, in applications where humans are integral to the system and need to be capable of locating objects in the environment, fiducial markers must be human-readable. An obvious and convenient choice for human-readable fiducial markers is the set of alphanumeric characters (Arabic numerals and English letters). Here, a method for classifying characters using a convolutional neural network (CNN) is presented. The network is trained with a large set of computer-generated images of characters, each subjected to a carefully designed set of augmentations that simulate the conditions inherent in video capture. These augmentations include rotation, scaling, shearing, and blur. Results demonstrate that training on large numbers of synthetic images produces a system that works on real images captured by a video camera. The results also reveal that certain characters are generally more reliable and easier to recognize than others; thus the results can be used to intelligently design a human-readable fiducial marker system that avoids easily confused characters.

Keywords — computer vision, convolutional neural network, machine learning, fiducial marker.

I. INTRODUCTION

Fiducial markers play an important role in systems that need to track the location and identity of multiple objects in a scene. Several methods have been presented over the past decade to solve the fiducial marker problem, such as QR codes [14], AprilTags [11], ARTags [2], and circular dot patterns [10]. These markers usually feature black and white patterns that encode binary information, as shown in Figure 1. Regarding tag design, researchers generally focus on properties like minimum tag size, minimum distance from tag to camera, maximum viewing angle, and optimal shapes for detection [3, 7]. Specially designed fiducial markers have the advantage of low false positive rates, low false negative rates, and low inter-marker confusion rates, but these markers are not human-readable and require highly specialized detector/decoder algorithms [13]. In addition, they often require a relatively large portion of the overall resolution in the image. While these solutions work well under a variety of circumstances, they are not suitable for applications that require humans to identify markers, or for situations where a relatively low-resolution crop is assigned to each marker.

Figure 1. Examples of machine-readable fiducial markers.

Figure 2. Synthesized training image examples. The distortions include rotation, shearing, translation, scaling, contrast adjustment, motion blur, and Gaussian noise. From top to bottom: 6, A, C, Q.

Alternatively, alphanumeric characters are ubiquitous as human-readable fiducial markers. For example, they are already being used to identify athletes, livestock, and automobiles. However, the existing analyses of machine-readable markers are design-dependent, and their conclusions cannot be applied to other markers.
There are no guidelines for how to choose an optimal set of human-readable fiducial markers that is robust to variations in lighting, orientation, and size. One method of classification that is particularly well-suited to handling these variations is the convolutional neural network (CNN). CNNs have achieved significant breakthroughs in recent years [4, 5, 8]. Compared to traditional classification methods, CNNs do not rely on heuristically designed algorithms tailored to the target objects. With enough training data and sufficient model complexity, CNNs can be taught to extract features at many levels and have even been demonstrated to exceed the ability of humans to recognize objects in images [5].

In this paper, a CNN is designed and trained to recognize human-readable alphanumeric characters as fiducial markers. The training uses a large set of synthetically generated images of distorted characters. The results demonstrate that, while training was performed on synthetically generated data, the CNN can recognize a highly challenging set of characters cropped from real images with more than 50% accuracy. The results also reveal inherent confusion between characters. Thus, for applications where only a subset of the characters is needed for identification, a set of easily differentiable characters can be chosen in order to maximize classification accuracy. An analysis and categorization of the main causes of confusion is provided, demonstrating that certain characters are intrinsically difficult to differentiate regardless of the classification algorithm being used.

II. RELATED WORK

Previous research has explored methods for character recognition using convolutional neural networks. In [6], the authors proposed to treat each English word as an individual pattern and trained a CNN on a data set of 90k words. Each word is synthetically generated by the computer, adding variations in view angle, orientation, blur, font, and noise. The locations of the words are hypothesized using a region proposal method inspired by [4].

Liu and Huang trained a CNN to recognize Chinese car plate characters in realistic scenes [9]. Chinese characters are morphologically different from alphanumeric characters, so the authors trained a separate softmax layer while sharing all the hidden layers with the softmax layer for alphanumeric characters. The authors also created their own database of Chinese characters due to the shortage of such data. Each image was captured in real scenes on the street and then hand-labeled for training and testing.

Radzi and Khalil-Hani implemented the CNN method to recognize Malaysian car plate characters and made speed and accuracy improvements in several stages of the technique [12]. The training images of characters were extracted from license plates viewed from various angles. They were then binarized, resized, centered, and labeled.

This paper differs from previous work on character/word recognition in two important ways. First, to the best of our knowledge, no research paper has studied the reliability of each individual alphanumeric fiducial marker as compared to the other markers. We thoroughly compared all markers without leaving any characters out, whereas the study presented in [9] intentionally left out I and O, and [12] left out I, O, and Z. Second, while [6] achieves impressive accuracy with text detection and recognition, its network is trained by treating each word as a whole. We propose to look at each character separately and measure its features and reliability under full scale variation and distortion. Moreover, the data augmentation methods used by [6], [9], and [12] are limited in terms of rotation and translation, making these methods poorly suited to generalized fiducial marker detection and tracking: none of those data sets contained upside-down characters, and [9] and [12] centered the characters before training and testing.

III. DATA AUGMENTATION

Properly training a deep convolutional neural network requires a tremendous amount of highly variable data to prevent over-fitting. Collecting the data manually can be tedious and impractical. Therefore, data augmentation is often used to procedurally augment the training data.
For each number (0 to 9) and English letter (A to Z) considered, we generate 5000 images at 400×400 resolution and apply a total of seven types of randomized distortion to each image: rotation, shearing, translation, scaling, contrast adjustment, motion blur, and Gaussian noise. The first four categories of distortion are combined into a single affine transformation given by

T = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix}
    \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ t_x & t_y & 1 \end{bmatrix}
    \begin{bmatrix} 1 & \alpha_y & 0 \\ \alpha_x & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
    \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix},

where s_x and s_y are used to adjust scale in the horizontal and vertical directions, t_x and t_y are used to shift the center point, α_x and α_y allow for shearing, and θ rotates the image.

The contrast adjustment modifies the lowest intensity value and the highest intensity value, effectively linearly mapping the intensity values in the original image to the new range. Motion blur is simulated by convolving the image with an oriented, uniform line filter; the kernel is generated by assigning a random movement distance and movement angle. Finally, Gaussian noise is sampled from an additive, independent noise source that follows the distribution

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \mu = 0.

To avoid the effects of aliasing, the augmentations are applied to the larger 400×400 image. After the distortions, the images are resized to 32×32 and fed into the convolutional neural network. Figure 2 illustrates some examples of the augmented images.

Table 1. Data augmentation parameter ranges

Parameter                      Range
Rotation angle θ               0 ~ 360
Horizontal shearing α_x        0 ~ 0.5
Vertical shearing α_y          0 ~ 0.5
Horizontal translation t_x     −80 ~ 80
Vertical translation t_y       −80 ~ 80
Horizontal scaling s_x         0.3 ~ 1
Vertical scaling s_y           0.3 ~ 1
Contrast lower bound           0 ~ 0.45
Contrast upper bound           0.55 ~ 1
Motion blur distance           3 ~ 7
Motion blur angle              0 ~ 360
Gaussian noise mean            0
Gaussian noise variance        0 ~ 0.05
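To make the composition of these distortions concrete, the following Python sketch (not the authors' code) applies one randomized augmentation, with parameter ranges taken from Table 1, to a 400×400 grayscale character image and downsamples it to 32×32. It assumes NumPy and OpenCV are available, uses OpenCV's column-vector convention for the affine matrices (the paper's matrices use the row-vector convention), and pivots the transform on the image center, which the paper does not specify.

```python
import numpy as np
import cv2


def random_augment(img_400, rng=np.random.default_rng()):
    """Illustrative augmentation in the spirit of Table 1 (not the original implementation).

    img_400: 400x400 grayscale character image with values in [0, 1].
    Returns a 32x32 float32 image suitable as CNN input.
    """
    h, w = img_400.shape

    # --- Affine transform: scale, translate, shear, rotate ---
    sx, sy = rng.uniform(0.3, 1.0, size=2)
    tx, ty = rng.uniform(-80, 80, size=2)
    ax, ay = rng.uniform(0.0, 0.5, size=2)
    theta = rng.uniform(0.0, 2 * np.pi)

    S = np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1]], dtype=np.float64)
    Tr = np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=np.float64)
    Sh = np.array([[1, ax, 0], [ay, 1, 0], [0, 0, 1]], dtype=np.float64)
    R = np.array([[np.cos(theta), -np.sin(theta), 0],
                  [np.sin(theta),  np.cos(theta), 0],
                  [0, 0, 1]], dtype=np.float64)

    # Compose around the image center so scaling/rotation pivot on the character (assumption).
    C = np.array([[1, 0, w / 2], [0, 1, h / 2], [0, 0, 1]], dtype=np.float64)
    T = C @ Tr @ S @ Sh @ R @ np.linalg.inv(C)
    out = cv2.warpAffine(img_400, T[:2, :], (w, h), borderValue=1.0)

    # --- Contrast adjustment: linearly remap intensities into a random sub-range ---
    lo = rng.uniform(0.0, 0.45)
    hi = rng.uniform(0.55, 1.0)
    out = lo + out * (hi - lo)

    # --- Motion blur: convolve with an oriented, uniform line kernel ---
    length = int(rng.integers(3, 8))        # blur distance in pixels (3..7)
    angle = rng.uniform(0.0, 360.0)         # blur direction in degrees
    kernel = np.zeros((length, length), dtype=np.float64)
    kernel[length // 2, :] = 1.0 / length
    rot = cv2.getRotationMatrix2D((length / 2 - 0.5, length / 2 - 0.5), angle, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (length, length))
    kernel /= max(kernel.sum(), 1e-8)
    out = cv2.filter2D(out, -1, kernel)

    # --- Additive, zero-mean Gaussian noise with random variance ---
    var = rng.uniform(0.0, 0.05)
    out = out + rng.normal(0.0, np.sqrt(var), size=out.shape)

    # Downsample to the 32x32 network input only after all distortions (avoids aliasing).
    out = cv2.resize(np.clip(out, 0.0, 1.0), (32, 32), interpolation=cv2.INTER_AREA)
    return out.astype(np.float32)
```

Generating the training set then amounts to calling this function 5000 times per character image and recording the character label for each output crop.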

Figure 3. The architecture of the convolutional neural network trained to detect alphanumeric markers.

Figure 4. Manually cropped images from real testing videos. Top to bottom: 6, A, C, G.

IV. CONVOLUTIONAL NEURAL NETWORK ARCHITECTURE

The convolutional neural network trained to recognize digits and letters has 15 layers: 1 input layer, 3 groups of convolution, rectifier, and max-pooling layers, 1 fully connected layer, 1 rectifier layer, another fully connected layer, 1 softmax layer, and 1 classification layer. This architecture was empirically found to provide a suitable balance between accuracy and overfitting. Figure 3 illustrates the convolutional neural network architecture.

V. TRAINING PARAMETERS

The convolutional neural network is trained using stochastic gradient descent with momentum. The initial learning rate was set to 0.01 and dropped by a factor of 10 every 20 epochs. The maximum number of epochs is set to 100. The data set was randomly divided into a training set and a testing set so that, on average, the training set contains roughly an equal number of training images for each category. To train the neural network in a reasonable amount of time, we used parallel computing on an NVIDIA Titan Black GPU.
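The paper does not specify filter counts, kernel sizes, or the momentum value, so the following tf.keras sketch should be read as one plausible instantiation of the described layer stack and training schedule rather than the original implementation; the widths (32/64/128 filters, 128 hidden units) and the 0.9 momentum are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers


def build_marker_cnn(num_classes=36):
    """Input, 3 x (conv + ReLU + max-pool), FC + ReLU, FC + softmax (filter counts assumed)."""
    return models.Sequential([
        layers.Input(shape=(32, 32, 1)),                       # 32x32 grayscale crops
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),                  # fully connected + rectifier
        layers.Dense(num_classes, activation="softmax"),       # fully connected + softmax
    ])


def lr_schedule(epoch, lr):
    """Section V schedule: start at 0.01 and drop by a factor of 10 every 20 epochs."""
    return 0.01 * (0.1 ** (epoch // 20))


model = build_marker_cnn()
model.compile(
    optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),  # momentum value assumed
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# Training call, assuming (x_train, y_train) and (x_test, y_test) hold the augmented crops:
# model.fit(x_train, y_train, epochs=100, validation_data=(x_test, y_test),
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```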

VI. RESULTS

The training set and testing set are randomly divided at a ratio of 1:1 from a total of 36 × 5000 = 180,000 images. The ranges of parameters used to generate these images are provided in Table 1. The training set and testing set used during the training stage are very similar to each other because both are generated by the computer. Thus, an additional testing set was built to analyze classification performance in real-life situations. A series of videos of alphanumeric characters was captured in natural conditions, and 150 character images were manually cropped out of each character video for testing purposes. Thus, a total of 5,400 image crops were used for the analysis. They were then converted to black and white and down-sampled to 32×32 to fit the neural network input. Characters that were partially occluded were not selected, since occlusion was not considered by the original augmentations. Figure 4 shows examples of cropped images from real videos.

To find the data augmentation settings that give the best accuracy during both the training stage and real video testing, two of the distortion options, Gaussian noise and contrast adjustment, were switched on and off. The affine transformation distortion is kept in all training settings because it simulates realistic image scale and angle variation. The setting that gives the highest accuracy (92.39%) on the computer-generated testing set is no Gaussian noise and no contrast adjustment. In the video testing set, there was not as much Gaussian noise as was simulated in the training set, and there are also effects such as background reflection and bright glare that are not represented in the training set. Despite these unaccounted-for image distortions, accuracy on the highly challenging manually cropped images is 59.18%.

Table 2. Success rate and top confusion rates in alphanumeric order. The most accurate marker (X) and the least accurate marker (9) are highlighted.

Marker  Success rate (%)  Confused with #1 (%)  #2 (%)   #3 (%)
0       90.92             O 8.28                D 0.28   8 0.20
1       97.32             Z 0.60                J 0.44   2 0.36
2       96.72             Z 1.80                1 0.56   3 0.40
3       94.16             E 3.92                J 0.48   C 0.28
4       96.12             P 0.96                A 0.64   F 0.44
5       87.00             S 11.28               8 0.68   6 0.36
6       51.28             9 45.28               G 2.24   5 0.36
7       84.32             L 12.36               V 1.48   A 0.44
8       93.00             B 5.68                I 0.48   5 0.20
9       51.24             6 46.04               G 1.64   5 0.32
A       96.20             V 1.72                Y 0.52   F 0.40
B       93.28             8 5.08                D 0.36   R 0.32
C       98.92             U 0.56                3 0.12   E 0.08
D       94.56             Q 3.20                G 0.60   0 0.56
E       95.68             3 3.20                F 0.44   V 0.16
F       96.52             4 0.60                J 0.60   4 0.60
G       93.24             6 2.28                9 1.44   D 1.40
H       97.84             M 0.80                N 0.68   1 0.32
I       99.52             8 0.20                Z 0.08   1 0.04
J       98.12             I 0.60                F 0.48   3 0.20
K       99.40             J 0.16                Y 0.12   1 0.08
L       84.44             7 14.76               E 0.16   J 0.16
M       95.00             N 1.44                W 1.32   H 0.92
N       96.20             H 1.12                Z 1.12   H 1.12
O       87.00             0 11.52               D 0.76   G 0.52
P       97.48             4 1.64                Q 0.36   V 0.12
Q       94.40             D 3.00                O 1.52   G 0.28
R       97.84             H 0.44                W 0.24   P 0.20
S       87.60             5 11.88               8 0.16   J 0.12
T       97.60             Y 1.40                L 0.56   1 0.12
U       98.20             C 0.80                9 0.24   D 0.16
V       95.00             7 1.72                A 1.12   W 0.72
W       95.48             M 2.28                4 0.64   V 0.40
X       99.72             Y 0.12                H 0.08   1 0.04
Y       98.60             T 0.88                A 0.24   1 0.12
Z       96.24             2 1.52                N 1.20   1 0.56

Figure 6. Success rate by marker. The horizontal threshold line is drawn at 95%.

Figure 5 illustrates the major causes of confusion for the testing set by providing examples of each.

Figure 5. Common confusions during the testing stage, with the confusion type labeled.

The first type of confusion comes from scaling. Some alphanumeric markers are scale-sensitive. For example, 0, Q, and O all have an elliptical shape, and the main difference is simply that 0 is smaller and skinnier than Q and O. Scaling together with shearing can cause them to look very similar.

The second type of confusion comes from rotation. For example, 9 and 6 are identical when rotated by 180 degrees. Since the random rotation ranges from 0 to 360 degrees in the data augmentation stage, the success rates for 9 and 6 would be expected to be near 50%, which is supported by the results.

The third type of confusion comes from shearing. For example, 7 and L are not identical by rotation alone, but when stretched unevenly in the horizontal and vertical directions (which happens when viewed from an off angle), they are difficult to differentiate from one another.

The fourth type of confusion comes from image overexposure during the capture process. The exposure of an image is controlled by the aperture, shutter speed, and gain of the camera. If those settings are not carefully chosen, the image sensor saturates and the image loses a large amount of detail due to high average pixel intensity. This causes the marker to lose its exact shape and confuses the neural network. While it might be possible to explicitly train the network to handle overexposure, this variable was not considered in this work in order to limit the number of data augmentation types. In general, the problem of overexposure can be mitigated by purposely underexposing the image during capture and maximizing local contrast in post-processing.

The fifth type of confusion comes from low contrast. When the contrast ratio is low, there is little difference among the pixels of an image, causing them to have all low or all high values and resulting in a situation where all the neurons in the neural network are activated to similar extents. This causes the network to give unpredictable results that are not bound to a particular marker.

The sixth type of confusion comes from motion blur. When an object moves quickly relative to the camera during capture, its image is dragged along its path, which blurs the image.
While blurring was considered by the augmentations of the training data, a uniform line filter was used to simplify the process. In practice, if the object does not move with a constant velocity relative to the camera, non-uniform blurring can occur.

The seventh type of confusion comes from noise in the background, mainly in the form of reflections of other objects in the scene. This type of noise is different from Gaussian noise: Gaussian noise is mainly created by the amplification of the sensor signal, whereas the noise here appears as reflections and glare.
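The per-marker statistics in Table 2 follow directly from a confusion matrix over the test predictions. The sketch below (illustrative only; the function name, array shapes, and the toy data are assumptions, not part of the original work) shows one way to compute each marker's success rate and its top confusions with NumPy.

```python
import numpy as np

CLASSES = list("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")  # the 36 alphanumeric markers


def per_marker_report(conf, top_k=3):
    """Summarize a 36x36 confusion matrix in the style of Table 2.

    conf[i, j] counts test images whose true class is i and predicted class is j.
    Returns, per marker, its success rate (%) and its top_k confusions as (marker, %) pairs.
    """
    conf = np.asarray(conf, dtype=np.float64)
    totals = conf.sum(axis=1, keepdims=True)           # number of test images per true class
    rates = 100.0 * conf / np.maximum(totals, 1.0)     # row-normalized percentages

    report = {}
    for i, marker in enumerate(CLASSES):
        success = round(float(rates[i, i]), 2)
        off_diag = rates[i].copy()
        off_diag[i] = -1.0                             # exclude the correct class
        order = np.argsort(off_diag)[::-1][:top_k]     # largest off-diagonal rates
        confusions = [(CLASSES[j], round(float(off_diag[j]), 2)) for j in order]
        report[marker] = (success, confusions)
    return report


# Example usage with a toy confusion matrix (real counts would come from the test set):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy = rng.integers(0, 3, size=(36, 36)) + np.diag(rng.integers(80, 150, size=36))
    summary = per_marker_report(toy)
    print(summary["6"])  # (success rate, [(most confused marker, %), ...])
```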

In many cases, the confusions of the convolutional neural network are due to transformations that make it nearly impossible to differentiate between alphanumeric characters. For example, it is impossible to differentiate a rotated 6 from a rotated 9 when using the Arial font. In contrast, even though some pairs are abstractly similar, the convolutional neural network is able to recognize subtle differences between them, such as C and U; Z, N, and 2; W and M; and 1 and I. However, this result is font-dependent. If the markers are presented in a different font that increases the similarity between them, the results would likely vary.

Based on the success rates shown in Figure 6, the top three most common confusion cases for each marker shown in Table 2, and the confusion type analysis provided above, we suggest the following rules when selecting alphanumeric fiducial markers:

1. Avoid using pairs of markers that are morphologically similar after a certain affine transformation or overexposure effect (6/9, L/7, S/5, 0/O/Q, and 8/B). However, if only one marker in a pair is used, then there will be significantly less confusion.
2. Where possible, use markers that are not easily confused with others (with a success rate threshold set at 95%, these markers are X, I, K, C, Y, U, J, H, R, T, P, 1, 2, F, Z, A, N, 4, E, and W).

VII. CONCLUSION AND FUTURE RESEARCH AGENDA

In this study, a convolutional neural network is trained to classify human-readable alphanumeric fiducial markers, and the classification accuracy for each marker is evaluated. It is demonstrated that some characters are more reliable and easier to classify than others, and advice is provided for selecting markers in future applications. We also demonstrated and categorized the major types of confusion and provided a rationale for the observed error rates. In future research, it is worthwhile to explore the effect of different fonts and distortion simulations on the accuracy rates.

In the context of human-machine interaction, convolutional neural networks provide a substantial improvement to machine detection and recognition success rates even in challenging environments, as demonstrated in this paper. As the Industry 4.0 revolution progresses, it can be expected that, in the future, workers will rarely be required to manually label and detect objects themselves; they will instead be required to make high-level decisions to ensure that the application has the optimal settings for a specific scenario [1].
REFERENCES

[1] F. Ansari and U. Seidenberg, "A portfolio for optimal collaboration of human and cyber physical production systems in problem-solving," CELDA: 311.
[2] M. Fiala, "ARTag, a fiducial marker system using digital techniques," Computer Vision and Pattern Recognition (CVPR 2005), IEEE Computer Society Conference on, Vol. 2, IEEE, 2005.
[3] M. Fiala, "Designing highly reliable fiducial markers," IEEE Trans. Pattern Anal. Mach. Intell. 32 (7) (2010): 1317-1324.
[4] R. Girshick, "Fast R-CNN," Proceedings of the IEEE International Conference on Computer Vision, 2015.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[6] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Reading text in the wild with convolutional neural networks," International Journal of Computer Vision 116.1 (2016): 1-20.
[7] J. Köhler, A. Pagani, and D. Stricker, "Detection and identification techniques for markers used in computer vision," OASIcs-OpenAccess Series in Informatics, Vol. 19, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2011.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, 2012.
[9] Y. Liu and H. Huang, "Car plate character recognition using a convolutional neural network with shared hidden layers," Chinese Automation Congress (CAC), IEEE, 2015.
[10] L. Naimark and E. Foxlin, "Circular data matrix fiducial system and robust image processing for a wearable vision-inertial self-tracker," in ISMAR '02: Proceedings of the 1st International Symposium on Mixed and Augmented Reality, p. 27, IEEE Computer Society, 2002.
[11] E. Olson, "AprilTag: a robust and flexible visual fiducial system," Robotics and Automation (ICRA), 2011 IEEE International Conference on, IEEE, 2011.
[12] S. Radzi and M. Khalil-Hani, "Character recognition of license plate number using convolutional neural network," Visual Informatics: Sustaining Research and Innovations (2011): 45-55.
[13] A. C. Rice, R. Harle, and A. R. Beresford, "Analysing fundamental properties of marker-based vision system designs," Pervasive and Mobile Computing 2.4 (2006): 453-471.
[14] D. Wave, "Quick response specification," http://www.densowave.com/qrcode/indexe.html, 1994.