Rare Chinese Character Recognition by Radical Extraction Network

2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff Center, Banff, Canada, October 5-8, 2017

Rare Chinese Character Recognition by Radical Extraction Network

Ziang Yan, Chengzhe Yan, Changshui Zhang
Department of Automation, Tsinghua University
State Key Lab of Intelligent Technologies and Systems
Tsinghua National Laboratory for Information Science and Technology (TNList)
Beijing, P. R. China
{yza15,yancz12}@mails.tsinghua.edu.cn, zcs@mail.tsinghua.edu.cn

Abstract—Building a modern Optical Character Recognition (OCR) system for Chinese is hard due to the large Chinese vocabulary. Training images for rare Chinese characters are extremely expensive to obtain. Radical-based OCR systems tackle this problem by first extracting and recognizing the basic graphical components (i.e., radicals) of a Chinese character. However, how to reliably recognize radicals remains an open challenge. In this paper, we propose a novel Radical Extraction Network (REN) that extracts and recognizes radicals using a deep Convolutional Neural Network (CNN). REN is end-to-end trainable and needs less hand-tuning than previous segmentation-based approaches. Deep appearance models for radicals are learned from data in a weakly supervised fashion, and no radical-level annotations are required. We learn to recognize different radicals on commonly used Chinese characters and transfer the learned deep appearance models to rarely used Chinese characters. Experimental results show that the proposed method helps the classifier recognize rare Chinese characters.

Index Terms—Radical-based Chinese OCR, Convolutional Neural Network, Weakly Supervised Learning

I. INTRODUCTION

Modern Optical Character Recognition (OCR) systems often use a feed-forward pipeline to detect and recognize text regions in an image. At the heart of an OCR system is a character-level image classifier with high accuracy. Chinese OCR is intricate because it involves a large number of characters and significant similarity between different characters [1]. There are about 50,000 different Chinese characters altogether, and only 6,000 of them are frequently used. Unfortunately, the rare characters often have a great influence on the meaning of a sentence, especially in scientific literature (e.g., chemical literature). For rarely used Chinese characters, collecting enough training examples is costly, especially if a data-hungry deep Convolutional Neural Network (CNN) [2] is used.

Chinese characters are formed by combinations of radicals (i.e., graphical components, or basic shapes) [3]. Many Chinese characters are visually similar because they share the same radicals. According to [4], the number of radicals is significantly smaller than the number of different Chinese characters. Thus, Chinese character recognition can be simplified by recognizing radicals and their relative positions. Radical-based Chinese character recognition approaches [3], [5]-[9] recognize a Chinese character by first extracting and recognizing its radicals. The performance of radical recognition is crucial for a radical-based Chinese character recognition system. Despite this progress, learning to recognize radicals from Chinese character images remains a challenging task.

Fig. 1. Radical-based Chinese character recognition. An unseen Chinese character image is first processed by several convolutional layers to generate convolutional feature maps. Then, a radical extractor takes the feature maps as input and recognizes individual radicals. A global feature extractor produces global feature maps based on the convolutional feature maps. Finally, we recognize the whole character by combining radical-level scores and the global feature maps.
Recently, CNNs have boosted many computer vision fields at a dramatic pace, such as image classification [10], [11], object detection [12], [13], and semantic segmentation [14]. CNNs have an impressive ability to learn better hierarchical visual representations than traditional hand-crafted features such as SIFT [15], SURF [16], Haar-like features [17], and Fisher Vectors [18], since all parameters in a CNN are learned from data. Recent studies on OCR [19], [20] show that CNN representations can also improve the performance of OCR systems. We use a CNN to recognize radicals, since it can learn powerful representations from data.

In this paper, we use a CNN to learn robust appearance models of radicals. Our radical-based OCR system is shown in Figure 1. Aligned radical training images, which are often not readily available for practical Chinese OCR tasks, are time-consuming and expensive to obtain for large Chinese character datasets. Compared with radical-level annotations, character-level annotations, which indicate which character an image contains, are much easier to collect. Unlike traditional approaches, which often need aligned radical-level training images to recognize different radicals [3], [21], we learn to localize

radicals in a weakly supervised fashion: only character-level images are required in the training process. Because a radical can appear at many different positions and scales in a Chinese character, learning radicals from characters without any radical-level annotation is a highly challenging task. We address this task by incorporating recent progress in the field of weakly supervised object detection (WSD) [22]-[25]. These methods typically start from a set of candidate bounding boxes which may potentially contain objects tightly, and mine positive bounding boxes (i.e., bounding boxes that contain objects) from this set. Edge Boxes [26] and Selective Search [27] are often used to extract candidate bounding boxes from an image. WSDDN [22] by Bilen et al., a state-of-the-art WSD method, learns to select positive bounding boxes and classify different objects simultaneously.

For radical recognition, we first extract a set of candidate bounding boxes and then use REN to recognize radicals. We build REN upon WSDDN. REN has three data streams: 1) a radical-level classification stream to classify different radicals, 2) a radical-level detection stream to select the positive candidate bounding boxes that tightly contain a particular radical, and 3) a character-level classification stream to classify different Chinese characters based on the radical-level recognition results. The whole network is end-to-end trainable. In the training process, only character-level images are needed, and REN learns to extract and recognize different radicals from character-level annotations automatically. Since we need less annotation effort per image than traditional approaches [3], [21], REN scales better to larger datasets. Moreover, REN does not rely on radical templates or hand-crafted segmentation strategies, and all radical appearance models are learned from data. Experiments on rare Chinese characters show that REN can recognize radicals with high accuracy and improve recognition performance on rare Chinese characters.

Our main contributions are:
- We propose the Radical Extraction Network (REN), an end-to-end trainable deep convolutional neural network that extracts and recognizes radicals in a Chinese character image.
- REN does not need radical-level annotations of training images: we learn to decompose Chinese characters and discover radicals from character-level annotations only.

II. METHOD

In this section, we introduce the technical details of the Radical Extraction Network (REN). The architecture of REN is shown in Figure 2. REN takes as input a Chinese character image and a set of candidate bounding boxes. REN has three streams that perform radical-level classification, radical-level detection, and character-level classification, respectively (Section II-B). REN is trained to localize radicals in an end-to-end fashion under character-level supervision (Section II-C).

Fig. 2. Architecture of Radical Extraction Network.

A. Notation

We have C different Chinese characters to recognize. Among the C categories, C_com categories are frequently used characters, for which we have a comparatively large number of training examples. The remaining C_rare = C − C_com categories are rarely used characters, for which we have only a few training images. In our setting, an example is an image of a single Chinese character.

Let x denote an image. We extract B bounding boxes from x and denote the set of these bounding boxes by R. Let C_rad denote the number of different radicals. We denote by b a bounding box in R, and by r a single radical in image x. We denote by φ the feature map generated by a particular layer, and by θ all the weights of REN, including the parameters of all filters and biases.
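To keep these quantities straight in what follows, the short sketch below (our own summary with dummy arrays, not the authors' code; d_ROI and the box-coordinate convention are illustrative assumptions) records each symbol together with its shape, using the sizes reported later in Section III:

```python
import numpy as np

# Sizes from Sections II-III: C = 479 characters, C_rad = 10 radicals,
# roughly B = 200 Edge Boxes proposals per image; d_ROI = 128 * 5 * 5
# follows from the model S column of Table I.
C, C_rad, B, d_ROI = 479, 10, 200, 3200

x = np.zeros((60, 60), np.float32)   # one character image (model S input size)
R = np.zeros((B, 4), np.float32)     # B candidate boxes, e.g. (x1, y1, x2, y2) each

phi_ROI = np.zeros((B, d_ROI))       # pooled representation of each bounding box
phi_c   = np.zeros((B, C_rad))       # radical-level classification scores (stream 1)
phi_d   = np.zeros((B, C_rad))       # radical-level detection scores (stream 2)
phi_rad = np.zeros(C_rad)            # per-radical confidence for the whole image
phi_glo = np.zeros(2 * C_rad)        # global context vector (C_glo = 2 * C_rad)
phi_cha = np.zeros(C)                # final character-level classification score
```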
B. Radical Extraction Network

Extracting radicals from Chinese characters under character-level supervision is essentially a weakly supervised learning problem: for each training image, we know which radicals it contains, but we do not know what these radicals look like or where they are. WSDDN by Bilen et al. [22] is a state-of-the-art weakly supervised object detection method, and it learns to localize and classify objects simultaneously under image-level (i.e., character-level in the Chinese OCR task) supervision. Unlike the object detection task, which usually focuses on object-level (i.e., radical-level) accuracy, we focus on character-level accuracy in the rare Chinese character recognition task. Thus, REN has one more stream than WSDDN, which performs character-level classification, as shown in Figure 2.

REN is constructed from standard CNN components: convolutional layers, nonlinear layers, fully connected layers,

and dropout layers [10]. We insert a ReLU [10] nonlinearity after each convolutional layer and fully connected layer. The inputs of REN consist of two parts: 1) a Chinese character image x, and 2) a set of bounding boxes R extracted from x. We use Edge Boxes [26] to extract around B candidate bounding boxes for each image, as in WSDDN [22]. When B is large enough (e.g., B ≈ 200 in our experiments), we assume that for each radical r in character x, at least one bounding box b ∈ R contains r tightly. In REN, a character image x is first processed by several convolutional layers, producing a convolutional feature map φ_conv(x; θ). Then we branch into three data streams, described next.

a) Radical-level classification data stream: The first and second data streams operate at the radical level. To achieve this, an ROI pooling layer [28] is inserted in the middle of REN, taking as input the convolutional feature map φ_conv(x; θ) and the region set R, as shown in Figure 2. The ROI pooling layer performs pooling on each region and outputs a matrix φ_ROI(x, R; θ) ∈ R^(B × d_ROI), where d_ROI is the dimension of the pooled representation of each bounding box.

The first data stream performs radical-level classification. In this stream, the matrix φ_ROI(x, R; θ) is processed by several fully connected layers, and each region is mapped to a C_rad-dimensional vector. These fully connected layers output a score matrix φ_c(x, R; θ) ∈ R^(B × C_rad), to which a row-wise softmax operator is applied. The final output of this data stream is

$$[\phi_c^{\mathrm{sm}}(x, R; \theta)]_{ij} = \frac{\exp [\phi_c(x, R; \theta)]_{ij}}{\sum_{k=1}^{C_{\mathrm{rad}}} \exp [\phi_c(x, R; \theta)]_{ik}}. \qquad (1)$$

b) Radical-level detection data stream: The second data stream performs radical-level detection. As mentioned before, we learn to recognize radicals in a weakly supervised fashion: we do not know which bounding box contains a specific radical tightly. The aim of this data stream is to select the best bounding box for every radical. This stream also starts from the pooled representation matrix φ_ROI(x, R; θ). We map each region to a C_rad-dimensional vector using several fully connected layers, whose weights are not shared with the fully connected layers of the first data stream. These layers output a score matrix φ_d(x, R; θ) ∈ R^(B × C_rad), to which a column-wise softmax operator is applied. The final output of this data stream is

$$[\phi_d^{\mathrm{sm}}(x, R; \theta)]_{ij} = \frac{\exp [\phi_d(x, R; \theta)]_{ij}}{\sum_{k=1}^{B} \exp [\phi_d(x, R; \theta)]_{kj}}. \qquad (2)$$

The radical score φ_rad(x, R; θ) ∈ R^(C_rad), obtained by combining φ_c^sm(x, R; θ) and φ_d^sm(x, R; θ), is

$$[\phi_{\mathrm{rad}}(x, R; \theta)]_{j} = \sum_{k=1}^{B} [\phi_c^{\mathrm{sm}}(x, R; \theta) \odot \phi_d^{\mathrm{sm}}(x, R; \theta)]_{kj}, \qquad (3)$$

where ⊙ is the element-wise product operator. Note that each element of φ_rad lies in the range (0, 1). We interpret [φ_rad]_j as the confidence that character x contains the j-th radical somewhere.
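Because Eqs. (1)-(3) apply two different softmax normalizations to score matrices of the same shape, the combination is easy to misread; the NumPy sketch below (our own illustration with random scores, not the authors' MATLAB code) computes φ_c^sm, φ_d^sm, and φ_rad exactly as written:

```python
import numpy as np

def radical_scores(phi_c: np.ndarray, phi_d: np.ndarray) -> np.ndarray:
    """Combine classification and detection scores per Eqs. (1)-(3).

    phi_c, phi_d: raw fully connected outputs of shape (B, C_rad),
    one row per candidate bounding box. Returns phi_rad of shape
    (C_rad,), with every entry in (0, 1).
    """
    # Eq. (1): row-wise softmax -- for each box, a distribution over radicals.
    e_c = np.exp(phi_c - phi_c.max(axis=1, keepdims=True))
    sm_c = e_c / e_c.sum(axis=1, keepdims=True)

    # Eq. (2): column-wise softmax -- for each radical, a distribution over
    # boxes, which lets this stream select the box containing the radical.
    e_d = np.exp(phi_d - phi_d.max(axis=0, keepdims=True))
    sm_d = e_d / e_d.sum(axis=0, keepdims=True)

    # Eq. (3): element-wise product, summed over the B boxes.
    return (sm_c * sm_d).sum(axis=0)

B, C_rad = 200, 10
rng = np.random.default_rng(0)
phi_rad = radical_scores(rng.normal(size=(B, C_rad)), rng.normal(size=(B, C_rad)))
assert phi_rad.shape == (C_rad,) and np.all((phi_rad > 0) & (phi_rad < 1))
```

The max-subtraction inside each softmax is only for numerical stability and does not change the resulting values.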
c) Character-level classification data stream: The third data stream operates at the character level. The aim of this stream is to obtain the final character-level classification score. We classify a Chinese character image based on two kinds of information: 1) the character image itself, and 2) the radicals recognized in it. The character image provides necessary global context, and the recognized radicals capture the substructure of the character image. This data stream fuses these two kinds of information. It starts from the convolutional feature map φ_conv(x; θ) and maps it to a C_glo-dimensional global context vector φ_glo(x; θ) using several fully connected layers. Then we apply a linear map followed by a softmax operator:

$$\phi_{\mathrm{cha}} = \mathrm{Softmax}(W_1 \phi_{\mathrm{glo}} + W_2 \phi_{\mathrm{rad}}), \qquad (4)$$

where φ_cha ∈ R^C is the final character-level classification score, W_1 and W_2 are weights to be learned, and W_1 φ_glo + W_2 φ_rad ∈ R^C.

C. Training REN

In this section, we explain how to train the model. The training data consists of N Chinese character images {x_1, ..., x_N} with character-level labels {y_1, ..., y_N}, where y_i ∈ {1, 2, ..., C}. We extract around B bounding boxes from x_i using Edge Boxes, and denote the set of bounding boxes by R_i. Moreover, we have a character-radical correspondence matrix T ∈ {0, 1}^(C × C_rad), indicating whether a character contains a particular radical. Note that this matrix is independent of the size of the training set, and thus easy to obtain. From the matrix T, we construct a radical-level label y_i^rad ∈ {0, 1}^(C_rad) for image x_i, indicating whether a particular radical is present in image x_i. We do not need the locations of radicals during training.

We have two goals: 1) recognize characters, and 2) recognize radicals. Thus we define two energy functions:

$$J_{\mathrm{cha}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} \mathbf{1}\{y_i = j\} \log [\phi_{\mathrm{cha}}(x_i, R_i; \theta)]_j \qquad (5)$$

and

$$J_{\mathrm{rad}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C_{\mathrm{rad}}} \Big( \mathbf{1}\{[y_i^{\mathrm{rad}}]_j = 1\} \log [\phi_{\mathrm{rad}}(x_i, R_i; \theta)]_j + \mathbf{1}\{[y_i^{\mathrm{rad}}]_j = 0\} \log \big(1 - [\phi_{\mathrm{rad}}(x_i, R_i; \theta)]_j\big) \Big). \qquad (6)$$

The character-level loss J_cha(θ) is a cross-entropy loss, and the radical-level loss J_rad(θ) is a sum of C_rad binary log-loss terms. We use stochastic gradient descent to optimize the following multi-task loss:

$$J(\theta) = J_{\mathrm{cha}}(\theta) + \lambda_1 J_{\mathrm{rad}}(\theta) + \frac{\lambda_2}{2} \|\theta\|^2. \qquad (7)$$
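The following sketch (again our own illustration; the epsilon guard and 0-based labels are implementation conveniences, not part of the paper) evaluates the data terms of the multi-task loss in Eqs. (5)-(7) for a mini-batch of precomputed network outputs, leaving the weight-decay term of Eq. (7) to the optimizer:

```python
import numpy as np

def ren_loss(phi_cha, phi_rad, y, y_rad, lam1=0.1):
    """Multi-task loss of Eqs. (5)-(7), without the L2 weight-decay term
    (the paper folds it into SGD with lambda_2 = 0.0005).

    phi_cha: (N, C) character-level softmax outputs, Eq. (4)
    phi_rad: (N, C_rad) radical confidences in (0, 1), Eq. (3)
    y:       (N,) integer character labels in {0, ..., C-1}
    y_rad:   (N, C_rad) binary radical labels derived from the matrix T
    """
    eps = 1e-12  # numerical guard against log(0)
    N = phi_cha.shape[0]

    # Eq. (5): cross-entropy over the C characters.
    J_cha = -np.log(phi_cha[np.arange(N), y] + eps).mean()

    # Eq. (6): sum of C_rad binary log-loss terms per image, averaged over i.
    J_rad = -(y_rad * np.log(phi_rad + eps)
              + (1 - y_rad) * np.log(1 - phi_rad + eps)).sum(axis=1).mean()

    # Eq. (7), minus the weight-decay term.
    return J_cha + lam1 * J_rad
```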

III. EXPERIMENTS

In this section, we evaluate REN on a challenging real-world dataset. Experimental results show that the proposed architecture provides a significant performance improvement on the rare Chinese character recognition task.

A. Dataset

Existing Chinese character datasets such as CASIA [29] usually consist of frequently used Chinese characters and are therefore inappropriate for our task. In order to evaluate REN, we collected a rare Chinese character database called RCC. RCC is collected from Chinese official identity card images and Chinese official invoice images. Example images from RCC are shown in Figure 3.

Fig. 3. Example images in RCC.

RCC consists of C = 479 different characters and 378,562 character images. Among the 479 character categories, C_com = 452 characters are frequently used and C_rare = 27 are rarely used. For each frequently used character category, we have 831.3 images on average; for each rarely used character category, we have 104.2 images on average. The 27 rare character categories contain C_rad = 10 different radicals in total. The names of these 10 radicals are listed in Table II. For each of the 479 characters, we manually labeled a 10-dimensional vector indicating whether a specific radical is present in that character. We guarantee that for each of the 479 character categories, at least one of these 10 radicals is present. For frequently used characters, we use 60% of the images for training, 10% for validation, and 30% for testing. For rarely used characters, we use 20% for training, 10% for validation, and 70% for testing.

B. Evaluation metrics

We evaluate both radical-level and character-level accuracy. For a character image x, we have a radical-level prediction score φ_rad(x, R; θ) ∈ R^(C_rad) and a character-level prediction score φ_cha(x, R; θ) ∈ R^C. We evaluate Average Precision (AP) on both φ_rad(x, R; θ) and φ_cha(x, R; θ), reporting VOC-style 11-point AP [30] on the test set in all experiments.

C. Experimental setup

d) Network architectures: Very deep neural networks such as VGG [31], GoogLeNet [32], or ResNet-101 [11] are very time-consuming. In this paper, we therefore evaluate three shallow network architectures: a small model S, a medium model M, and a large model L. We set stride = 1 for all convolutional layers. Detailed architectures of these three models are shown in Table I. We set C_glo = 2 C_rad in all experiments.

e) Training: We generate around 200 bounding boxes for each image. We set λ_1 = 0.1 in all experiments. For SGD, we use a momentum of 0.9 and a weight decay of λ_2 = 0.0005. A mini-batch is composed of 10 images, and we run SGD for 10 epochs. We set the learning rate to 0.001 for the first 5 epochs, then lower it to 0.0001 for the next 5 epochs. Our implementation is based on the publicly available MATLAB implementation of WSDDN by Bilen et al. [22]. It takes about 25 hours to train a model L on a Pascal TITAN X GPU.
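To make Table I concrete, here is a PyTorch re-creation of the shared trunk of model S up to its ROI pooling layer (a sketch only: the authors' implementation is MATLAB-based, and the grayscale input, the lack of padding, and the exact spatial_scale are our assumptions, since the paper specifies only stride = 1):

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class ModelSTrunk(nn.Module):
    """Shared trunk of model S from Table I: three conv blocks, then ROI pooling."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, stride=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, stride=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 128, kernel_size=3, stride=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, boxes):
        # x: (1, 1, 60, 60) character image; boxes: list with one (B, 4) tensor
        # of (x1, y1, x2, y2) box coordinates in image space.
        phi_conv = self.features(x)
        # ROI pool 5x5 per Table I; spatial_scale maps image coordinates to
        # feature-map coordinates (approximately 1/4 after two 2x2 max pools).
        phi_roi = roi_pool(phi_conv, boxes, output_size=(5, 5), spatial_scale=0.25)
        return phi_roi.flatten(1)  # (B, d_ROI) with d_ROI = 128 * 5 * 5

# Each stream of Table I then attaches its fully connected layers on top of
# this output, e.g. fc 512 -> dropout 0.5 -> fc C_rad for the radical streams.
```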
D. Radical recognition results

Radical-level recognition results on the test set are summarized in Table II. We evaluate the different CNN architectures on both frequently used characters ("com" in Table II) and rarely used characters ("rare" in Table II), and report the average precision for each radical. The last column of Table II reports the mean average precision (mAP) over the 10 radicals. Our best model, REN-L, obtains 93.5% radical-level classification mAP on frequently used characters and 90.5% mAP on rarely used characters.

The quantitative results show that larger CNN models provide better performance than smaller ones. Moreover, Table II shows that classification performance on frequently used and rarely used characters is roughly the same, indicating that our models generalize well to unseen rarely used characters, even though we have only a small number of training images for them. This can be explained by the fact that radicals are shared between frequently used and rarely used characters, so our model can learn accurate appearance models of radicals from the frequently used characters. Figure 4 shows response maps from the radical detection stream. We predict accurate radical locations on both unseen frequently used and rarely used Chinese character images. Note that we do not use any templates or other prior knowledge about the shapes of radicals; all appearance models are learned automatically from data.

E. Character recognition results

Character-level recognition results on the test set are summarized in Table III. Since the data is highly imbalanced, we use mAP to evaluate our models. In order to highlight the effectiveness of our three-data-stream architecture, we also remove the radical-level classification and radical-level detection data streams and train end-to-end character-level classification CNN models (the "CNN-" models in Table III). We evaluate the average precision for each character category. As in Section III-D, we report mAP on both frequently used characters ("com" in Table III) and rarely used characters ("rare" in Table III). The last column of Table III reports mAP over all categories.

TABLE I
CONVNET CONFIGURATIONS

                  model S         model M         model L
Input             60 x 60         150 x 150       250 x 250
Before branch     conv 5x5-64     conv 7x7-64     conv 7x7-96
                  max pool 2x2    max pool 2x2    max pool 2x2
                  conv 3x3-128    conv 5x5-128    conv 5x5-256
                  max pool 2x2    max pool 2x2    max pool 2x2
                  conv 3x3-128    conv 3x3-256    conv 3x3-384
                  ROI pool 5x5    conv 3x3-256    max pool 2x2
                                  ROI pool 6x6    conv 3x3-256
                                                  conv 3x3-256
                                                  ROI pool 6x6
Radical cls       fc 512          fc 1024         fc 2048
stream            dropout 0.5     dropout 0.5     dropout 0.5
                  fc C_rad        fc C_rad        fc C_rad
Radical det       fc 512          fc 1024         fc 2048
stream            dropout 0.5     dropout 0.5     dropout 0.5
                  fc C_rad        fc C_rad        fc C_rad
Character cls     dropout 0.5     dropout 0.5     dropout 0.5
stream            fc C            fc C            fc C

TABLE II
COMPARISON OF RADICAL-LEVEL CLASSIFICATION AVERAGE PRECISION

dataset  model   mu    you   ma    yue   dao   ren   qie   tu    cun   xin   mAP
com      REN-S   87.3  92.1  89.0  94.7  79.5  81.9  91.8  92.3  87.1  83.4  87.9
         REN-M   89.2  94.6  91.9  97.1  86.6  85.7  96.2  91.8  90.5  85.6  90.9
         REN-L   92.4  95.0  92.1  98.9  90.4  92.0  95.9  95.2  93.8  89.8  93.5
rare     REN-S   83.6  87.0  89.4  89.1  72.7  75.6  83.2  87.9  81.4  74.0  82.4
         REN-M   82.0  90.3  91.8  91.2  73.4  78.9  86.1  90.9  85.2  76.9  84.6
         REN-L   87.1  94.7  91.5  95.2  84.8  91.2  92.4  92.9  87.6  87.2  90.5

TABLE III
COMPARISON OF CHARACTER-LEVEL CLASSIFICATION MEAN AVERAGE PRECISION

model   mAP (com)  mAP (rare)  mAP (all)
CNN-S   92.4       51.9        90.1
REN-S   91.1       87.6        90.9
CNN-M   94.7       57.8        92.6
REN-M   94.8       89.0        94.5
CNN-L   96.4       65.1        94.6
REN-L   96.4       92.5        96.2

Our best model, REN-L, obtains 92.5% character-level classification mAP on rare characters, which is much better than CNN-L (65.1% mAP). Table III shows that the REN models with three data streams always outperform the character-level classification CNN models on rarely used characters, while achieving comparable performance on frequently used characters. This indicates that the proposed architecture improves character-level recognition performance, which is of great importance to an OCR system.

IV. RELATED WORK

A. Radical-based Chinese character recognition

Since classification on radicals is much easier than classification on raw Chinese characters, many studies focus on radical extraction and recognition. Wang and Fan [9] propose a hierarchical matching approach for radical extraction. Chung and Ip [5] use a deformable model to decompose Chinese characters. Shi et al. [3] use Active Shape Models (ASMs) to extract radicals; kernel PCA is used to learn the appearance models of radicals, and radical-level training images are required. Ni et al. [21] use a cascade classifier [33] to detect radicals, with Haar-like features and AdaBoost [34] as the core components of their detectors. Many radical-based approaches are segmentation-based, and important components of them are hand-crafted; since these components are not learnable, they are usually not robust to noise. Several approaches [3], [21] achieve excellent performance by learning to recognize radicals from radical-level training images. Our method differs from existing radical extraction approaches in two aspects: 1) we do not rely on prior knowledge about the shapes of radicals, and all appearance models are learned from data; and 2) we only need character-level annotations during training, which are much cheaper to collect than radical-level annotations.

B. Weakly supervised object detection

Weakly supervised object detection is an important problem in computer vision, since bounding box annotations are costly to obtain.
Song et al. [25] propose a multiple instance learning framework to mine positive bounding boxes from a noisy collection of object proposals. Li et al. [23] propose a progressive domain adaptation approach for bounding box mining. Bilen et al. [22] and Kantorov et al. [24] use an end-to-end trainable WSDDN to perform region selection and classification simultaneously. WSDDN has two data streams: a classification stream and a detection stream (both operating at the radical level in REN). We add a character-level classification data stream to WSDDN, since the final goal of REN is to classify different characters.

V. CONCLUSION

We present the Radical Extraction Network (REN) for rare Chinese character recognition. REN learns to recognize radicals under character-level supervision, and no prior knowledge about the shapes of radicals is needed during training. Experimental results show that the proposed method can recognize rare Chinese characters with high accuracy.

ACKNOWLEDGMENT

This work is funded by NSFC (Grants No. 61473167, No. 91420203, and No. 61621136008) and by the German Research Foundation (DFG) within Project Crossmodal Learning, DFG TRR-169.

Fig. 4. Examples of detection response maps for the radical tu. (a): A standard radical tu; note this template is only used for visualization, not training. (b)(d)(f): Test images. (c)(e)(g): Visualizations of φ_d^sm corresponding to the test images. The radical extractor localizes the radical tu in the test images. Best viewed in color.

REFERENCES

[1] F. Kimura, K. Takashina, S. Tsuruoka, and Y. Miyake, "Modified quadratic discriminant functions and the application to Chinese character recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 1, pp. 149-153, 1987.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[3] D. Shi, S. R. Gunn, and R. I. Damper, "Handwritten Chinese radical recognition using nonlinear active shape models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 277-280, 2003.
[4] C.-W. Liao and J. S. Huang, "A transformation invariant matching algorithm for handwritten Chinese character recognition," Pattern Recognition, vol. 23, no. 11, pp. 1167-1188, 1990.
[5] F.-L. Chung and W. W. Ip, "Complex character decomposition using deformable model," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 31, no. 1, pp. 126-132, 2001.
[6] K. Chellapilla and P. Simard, "A new radical based approach to offline handwritten East-Asian character recognition," in Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft, 2006.
[7] L.-L. Ma and C.-L. Liu, "A new radical-based approach to online handwritten Chinese character recognition," in 19th International Conference on Pattern Recognition (ICPR 2008). IEEE, 2008, pp. 1-4.
[8] D. Shi, S. R. Gunn, and R. I. Damper, "Handwritten Chinese character recognition using nonlinear active shape models and the Viterbi algorithm," Pattern Recognition Letters, vol. 23, no. 14, pp. 1853-1862, 2002.
[9] A.-B. Wang and K.-C. Fan, "Optical recognition of handwritten Chinese characters by hierarchical radical matching method," Pattern Recognition, vol. 34, no. 1, pp. 15-35, 2001.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[12] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[13] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[14] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440.
[15] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[16] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in European Conference on Computer Vision. Springer, 2006, pp. 404-417.
[17] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[18] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, "Image classification with the Fisher vector: Theory and practice," International Journal of Computer Vision, vol. 105, no. 3, pp. 222-245, 2013.
[19] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Synthetic data and artificial neural networks for natural scene text recognition," arXiv preprint arXiv:1406.2227, 2014.
[20] R. Messina and J. Louradour, "Segmentation-free handwritten Chinese text recognition with LSTM-RNN," in 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 171-175.
[21] E. Ni, M. Jiang, and C. Zhou, "Radical extraction for handwritten Chinese character recognition by using radical cascade classifier," in Electrical, Information Engineering and Mechatronics 2011. Springer, 2012, pp. 419-426.
[22] H. Bilen and A. Vedaldi, "Weakly supervised deep detection networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2846-2854.
[23] D. Li, J.-B. Huang, Y. Li, S. Wang, and M.-H. Yang, "Weakly supervised object localization with progressive domain adaptation," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[24] V. Kantorov, M. Oquab, M. Cho, and I. Laptev, "ContextLocNet: Context-aware deep network models for weakly supervised localization," in European Conference on Computer Vision. Springer, 2016, pp. 350-365.
[25] H. O. Song, R. B. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell, "On learning to localize objects with minimal supervision," in ICML, 2014, pp. 1611-1619.
[26] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in European Conference on Computer Vision. Springer, 2014, pp. 391-405.
[27] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154-171, 2013.
[28] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
[29] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "CASIA online and offline Chinese handwriting databases," in 2011 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2011, pp. 37-41.
[30] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[31] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[33] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1. IEEE, 2001, pp. I-511-I-518.
[34] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in European Conference on Computational Learning Theory. Springer, 1995, pp. 23-37.