2016 15th International Conference on Frontiers in Handwriting Recognition

Handwritten Chinese Character Recognition by Joint Classification and Similarity Ranking

Cheng Cheng, Xu-Yao Zhang, Xiao-Hu Shao and Xiang-Dong Zhou
Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences
Institute of Automation, Chinese Academy of Sciences
Email: {chengcheng, shaoxiaohu, zhouxiangdong}@cigit.ac.cn, xyz@nlpr.ia.ac.cn

Abstract: Deep convolutional neural networks (DCNN) have recently achieved state-of-the-art performance on handwritten Chinese character recognition (HCCR). However, most DCNN models employ the softmax activation function and minimize the cross-entropy loss, which may lose some inter-class information. To cope with this problem, we demonstrate a small but consistent advantage of using both classification and similarity ranking signals as supervision. Specifically, the presented method learns a DCNN model by maximizing the inter-class variations and minimizing the intra-class variations, while simultaneously minimizing the cross-entropy loss. In addition, we review several loss functions for similarity ranking and evaluate their performance. Our experiments demonstrate that the presented method achieves state-of-the-art accuracy on the well-known ICDAR 2013 offline HCCR competition dataset.

Index Terms: Similarity Ranking, Character Recognition, Deep Convolutional Neural Networks

I. INTRODUCTION

Handwritten Chinese character recognition (HCCR) has been intensively studied over the past forty years and is of practical importance for bank check reading, tax-form processing, and the transcription of books and handwritten notes. Although many studies have been conducted, it remains a challenging problem due to the diversity of writing styles, the large character set, and the presence of many confusing character pairs. Samples with different writing styles and confusing character pairs are shown in Fig. 1 and Fig. 2, respectively.

Fig. 1: Characters with different writing styles.

Fig. 2: Examples of confusing character pairs.

Existing HCCR methods can be mainly classified into two categories: traditional methods and DCNN-based methods. In the first category, there are typically four basic steps: shape normalization, feature extraction, dimensionality reduction, and classifier construction. To improve recognition performance, many effective methods have been proposed for these steps, including nonlinear normalization [17], pseudo two-dimensional normalization [14], gradient direction feature extraction [13], the modified quadratic discriminant function [12], and the discriminative learning quadratic discriminant function [15]. In the second category, a DCNN model composed of layers of convolution, rectification and pooling is trained via back propagation. Unlike traditional methods, DCNNs substitute a single deep architecture for the separate steps of feature extraction, dimensionality reduction and classifier construction, requiring only shape normalization from the four steps. This expressivity, together with robust training algorithms, allows powerful object representations to be learned without handcrafted features. However, most DCNN-based methods use the softmax activation function (also known as multinomial logistic regression) for classification, which we find may lose some inter-class information.

In this paper, we contribute to the second category and present a deep triplet network (DTN) method, whose basic idea is illustrated in Fig. 3. Unlike most existing methods, the presented method learns a DCNN model using both classification and similarity ranking signals as supervision. Classification assigns an input image to one of a large number of identity classes, while similarity ranking minimizes the intra-class distance and maximizes the inter-class distance.
In addition, we investigate the loss functions of similarity ranking algorithms and aim to improve performance with a new form of loss function.

Fig. 3: The structure of the proposed model (CNN, fc layer, then softmax and triplet ranking branches).

The rest of this paper is organized as follows: Section II briefly introduces related previous work; Section III reviews the softmax and similarity ranking; Section IV presents the loss functions for similarity ranking; Section V presents our experimental results; and the last section concludes the paper.

2167-6445/16 $31.00 © 2016 IEEE. DOI 10.1109/ICFHR.2016.92

II. RELATED WORK

A. HCCR by DCNN

In recent years, DCNN has received increasing interest in computer vision and machine learning, and a number of DCNN methods have been proposed for HCCR in the literature [2], [3], [4], [21], continuing their success by winning both the online and offline HCCR competitions at ICDAR 2013 [23]. Generally, a DCNN aims to learn hierarchical feature representations by building high-level features from low-level ones. There are two notable breakthroughs. The first is large-scale character classification with DCNN [25], [26], where domain-specific knowledge, such as Gabor features or normalization-cooperated direction-decomposed feature maps, is used to enhance DCNN performance. The second is a supervised DCNN with both character reconstruction and verification tasks [2]; the reconstruction task minimizes the distance between features of the same category. In this paper, we extend the DCNN model using classification and similarity ranking signals as supervision.

B. Similarity Ranking

The presented method falls under the broad umbrella of similarity ranking. Similarity ranking based DCNN methods have proven effective in a wide range of tasks, such as face recognition [10], [18], person re-identification [5], [24] and image retrieval [20]. The common framework of these works is to organize the training images into batches of triplet samples, each containing two images with the same label and one with a different label. With these triplets, they minimize the intra-class distance while maximizing the inter-class distance for each triplet unit using the Euclidean distance metric. In the field of character recognition, the closest method to our approach is the discriminative DCNN by Kim et al. [11].
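The triplet organization used by these methods (two images sharing a label plus one with a different label) can be sketched as follows. The sampler below is purely illustrative, with a hypothetical dataset layout of (image, label) tuples; it is not taken from any of the cited works:

```python
import random
from collections import defaultdict

def sample_triplets(dataset, num_triplets):
    """Organize (image, label) pairs into triplet units: an anchor and a
    positive sharing the same label, plus a negative with a different label."""
    by_label = defaultdict(list)
    for image, label in dataset:
        by_label[label].append(image)
    # only labels with at least two images can supply an (anchor, positive) pair
    pair_labels = [l for l, imgs in by_label.items() if len(imgs) >= 2]
    triplets = []
    for _ in range(num_triplets):
        pos_label = random.choice(pair_labels)
        neg_label = random.choice([l for l in by_label if l != pos_label])
        anchor, positive = random.sample(by_label[pos_label], 2)  # two distinct images
        negative = random.choice(by_label[neg_label])
        triplets.append((anchor, positive, negative))
    return triplets
```

Each returned unit then feeds one step of triplet-based training, where the anchor-positive distance is pushed below the anchor-negative distance.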
The discriminative DCNN focuses on the differences among similar classes, and thereby improves the discrimination ability of the DCNN.

III. METHODOLOGY

We aim to train a DCNN model using both classification and similarity ranking signals as supervision. The first is the character classification signal, which classifies each character image into n (e.g., n = 3755) different categories. It is achieved by following the fully connected layer with an n-way softmax layer, which outputs a probability distribution over the n classes. The network is trained to minimize the cross-entropy loss:

L(f_i, k, \theta_{cls}) = -\sum_{t=1}^{n} \hat{p}_t \log p_t,   (1)

in which k is the true class label, L(f_i, k, θ_cls) is the standard cross-entropy (log) loss, and θ_cls denotes the softmax layer parameters. \hat{p} is the target probability distribution, where \hat{p}_t = 0 for all t except \hat{p}_k = 1 for the target class k, and p_t is the predicted probability of class t.

The second is the similarity ranking signal, which projects character pairs into the same feature subspace, such that the distance of each positive character pair falls below a smaller threshold and that of each negative pair rises above a larger threshold. We adopt the following loss function, originally proposed by Wang et al. [20] and widely used in face recognition [18], person re-identification [5] and image retrieval [24]:

L(f_i, f_j, f_k, \theta_{tri}) = \max(\|f_i - f_j\|_2^2 - \|f_i - f_k\|_2^2 + \Delta, 0),   (2)

in which f_i, f_j are the features of two character images with the same label, f_i, f_k are the features of two mismatched character images, Δ is a margin enforced between positive and negative image pairs, and θ_tri denotes the parameters of the similarity ranking loss. Both loss functions are evaluated and compared in our experiments. Our goal is to learn the parameters θ_con of the DCNN model, while θ_tri and θ_cls are parameters introduced to propagate the classification and similarity ranking signals during training.
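As a concrete illustration of the two supervisory signals, the NumPy sketch below evaluates the cross-entropy loss of Eq. (1) and the triplet ranking loss of Eq. (2) on toy inputs. The feature values, logits, margin and weight λ are all illustrative choices; this is a minimal sketch, not the paper's implementation:

```python
import numpy as np

def softmax_cross_entropy(logits, k):
    """Eq. (1): cross-entropy between the softmax of `logits` and the
    one-hot target distribution for the true class k."""
    z = logits - logits.max()            # numerical stabilization
    p = np.exp(z) / np.exp(z).sum()      # predicted class probabilities
    return -np.log(p[k])                 # only the target-class term is non-zero

def triplet_ranking_loss(f_i, f_j, f_k, margin=1.0):
    """Eq. (2): hinge on the squared-distance gap between the matched
    pair (f_i, f_j) and the mismatched pair (f_i, f_k)."""
    d_pos = np.sum((f_i - f_j) ** 2)     # same-label pair
    d_neg = np.sum((f_i - f_k) ** 2)     # mismatched pair
    return max(d_pos - d_neg + margin, 0.0)

# Toy triplet: f_i and f_j share a label, f_k does not.
f_i, f_j, f_k = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])
lam = 0.1                                # illustrative weighting hyperparameter λ
joint = softmax_cross_entropy(np.array([2.0, 0.5]), 0) \
        + lam * triplet_ranking_loss(f_i, f_j, f_k)
```

For this well-separated toy triplet the hinge term is inactive, so the joint objective reduces to the classification term alone.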
In the testing stage, only θ_cls and θ_con are used for classification. The parameters are updated by stochastic gradient descent on each triplet unit, with the gradients of the two supervisory signals weighted by a hyperparameter λ. Our learning algorithm is summarized in Algorithm 1.

Algorithm 1 The learning algorithm
Require: training set χ = {x_i, y_i}, initialized parameters θ_cls, θ_con and θ_tri, hyperparameter λ
Ensure: parameters θ_cls, θ_con and θ_tri
1: for m = 1 to iter do
2:   sample a triplet unit (x_i, y_i), (x_j, y_j) and (x_k, y_k) from χ, in which x_i, x_j have the same label and x_k has a different label
3:   f_i = Conv(x_i, θ_con), f_j = Conv(x_j, θ_con), f_k = Conv(x_k, θ_con)
4:   ∇θ_cls = ∂L(f_i, y_i, θ_cls)/∂θ_cls + ∂L(f_j, y_j, θ_cls)/∂θ_cls + ∂L(f_k, y_k, θ_cls)/∂θ_cls
5:   ∇θ_tri = λ · ∂L(f_i, f_j, f_k, θ_tri)/∂θ_tri
6:   ∇f_i = ∂L(f_i, y_i, θ_cls)/∂f_i + λ · ∂L(f_i, f_j, f_k, θ_tri)/∂f_i
7:   ∇f_j = ∂L(f_j, y_j, θ_cls)/∂f_j + λ · ∂L(f_i, f_j, f_k, θ_tri)/∂f_j
8:   ∇f_k = ∂L(f_k, y_k, θ_cls)/∂f_k + λ · ∂L(f_i, f_j, f_k, θ_tri)/∂f_k
9:   ∇θ_con = ∇f_i · ∂Conv(x_i, θ_con)/∂θ_con + ∇f_j · ∂Conv(x_j, θ_con)/∂θ_con + ∇f_k · ∂Conv(x_k, θ_con)/∂θ_con
10: end for

IV. LOSS FUNCTIONS FOR SIMILARITY RANKING

In this section, we describe and compare three different loss functions that can be used in the proposed framework.
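The three variants detailed below can be sketched in NumPy as follows. This is an illustrative reimplementation rather than the authors' code; in particular, the sign convention d = d_n − d_p for the logistic variant and the value of α are assumptions:

```python
import numpy as np

def hinge_triplet(f_i, f_j, f_k, margin=1.0):
    """Euclidean-distance variant (Eq. 3): hinge on the squared-distance gap."""
    d_p = np.sum((f_i - f_j) ** 2)          # matched-pair distance
    d_n = np.sum((f_i - f_k) ** 2)          # mismatched-pair distance
    return max(d_p - d_n + margin, 0.0)

def logistic_triplet(f_i, f_j, f_k, t=1.0):
    """Logistic-discriminant variant (Eqs. 5-6): negative log-likelihood that
    the triplet is correctly ordered. Assumes d = d_n - d_p so a satisfied
    triplet gives sigma(d) > 0.5."""
    d = np.sum((f_i - f_k) ** 2) - np.sum((f_i - f_j) ** 2)
    p = 1.0 / (1.0 + np.exp(-d))            # sigma(d), probability triplet is positive
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

def cll_triplet(f_i, f_j, f_k, t=1.0, alpha=0.1):
    """Conditional log-likelihood variant (Eq. 7, as interpreted here): the
    logistic loss plus a regularizer on the intra-class distance; alpha is
    an illustrative regularization coefficient."""
    return logistic_triplet(f_i, f_j, f_k, t) + alpha * np.sum((f_i - f_j) ** 2)
```

For a well-separated triplet the hinge term vanishes exactly, while the logistic and CLL losses decay smoothly toward zero, and the CLL regularizer keeps penalizing any residual intra-class distance.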
A. Euclidean Distance

In the absence of prior knowledge, most similarity ranking methods use the simple Euclidean distance to measure the dissimilarity between examples represented as vectors. The cost function parameterized by Eq. 2 has two competing terms: the first penalizes large distances between each input and its target neighbors, while the second penalizes small distances between each input and all other inputs that do not share its label. Specifically, the cost function is given by

L = \sum_{n=1}^{N} [\|f_i - f_j\|_2^2 - \|f_i - f_k\|_2^2 + \Delta]_+,   (3)

where N is the number of triplets and [·]_+ = max(·, 0). For triplets with a non-zero loss, the derivatives with respect to the features are

∂L/∂f_i = 2(f_i − f_j) − 2(f_i − f_k),
∂L/∂f_j = −2(f_i − f_j),   (4)
∂L/∂f_k = 2(f_i − f_k).

B. Logistic Discriminant Based

The Euclidean distance is sensitive to scale and blind to correlations across dimensions. To overcome these deficiencies, we use a standard linear logistic discriminant function to model the triplet units:

p_n = σ(d) = 1 / (1 + e^{−d}),   (5)

in which d_p = ‖f_i − f_j‖²₂, d_n = ‖f_i − f_k‖²₂ and d = d_n − d_p. We model the probability p_n that triplet n = (i, j, k) is positive (fulfills the constraint in Eq. 2); if d < 0, the triplet is misclassified. We use maximum log-likelihood to optimize the parameters of the model. The log-likelihood L can be written as

L = \sum_{n=1}^{N} [t_n \ln p_n + (1 − t_n) \ln(1 − p_n)],   (6)

where t_n is the label of triplet n.

C. Conditional Log-likelihood Loss

In [9], the generalization limitations of the above loss are discussed, and a regularization term is added to the loss function to avoid over-fitting during training as well as to maximize the hypothesis margin. Following [9], we rewrite the loss function as

p_n = σ(d) + α‖f_i − f_j‖²₂,   (7)

where α is the regularization coefficient. Intuitively, the regularizer pays more attention to the intra-class variations.

V. EXPERIMENTS

To verify the effectiveness of the presented method, we conduct experiments on the offline HCCR databases [16], including DB1.0, DB1.1 and the test set of the ICDAR 2013 Chinese handwriting recognition competition [23] (denoted as ICDAR-2013).

Fig. 4: The network architecture of the presented method.

A. Implementation Details

We implement the presented methods using Caffe [8] with our modifications. All experiments are run on four GPUs, and all models are trained with the same settings, as follows.

1) Data Augmentation: During training, each 128×128 image is perturbed by the single model or the combined model, as in [1]. Half of the samples, chosen at random, are flipped horizontally. We also adopt augmentations proposed in previous work [22], such as adding a random integer in the range −20 to +20 to the image, grey-level shifting, Gaussian blur, and so on.

2) Network and Settings: Deep residual networks and the Inception architecture were independently proposed in [6] and [19], and both achieved high performance in the ImageNet challenges. Integrating the techniques of these two papers, we design the architecture shown in Fig. 4. It consists of 2 convolutional layers, 9 Inception layers, 5 pooling layers, 2 fully connected layers, 1 similarity loss layer and 1 softmax loss layer. The first four pooling layers use the max operator and the last uses average pooling. The outputs of the 9 Inception layers are added to the
outputs of the last fc layer. Following [7], batch normalization is applied after each convolution and before the ReLU activation. We train the DCNN models using SGD with a mini-batch size of 360. The learning rate is set to 5e-2 initially and is gradually reduced to 1e-5. The models are initialized randomly from zero-mean Gaussian distributions and trained on four Titan X GPUs for 300 hours.

TABLE I: Recognition rates (%) on DB1.1 and ICDAR-2013, trained with DB1.1

loss function                      | DB1.1 top-1 | DB1.1 top-10 | ICDAR-2013 top-1 | ICDAR-2013 top-10
softmax                            | 95.72       | 99.43        | 96.07            | 99.59
softmax + similarity ranking (ED)  | 95.76       | 99.54        | 96.27            | 99.71
softmax + similarity ranking (LD)  | 95.85       | 99.55        | 96.22            | 99.69
softmax + similarity ranking (CLL) | 96.13       | 99.58        | 96.44            | 99.71

TABLE II: Recognition rates (%) on ICDAR-2013, trained with DB1.0 and DB1.1

loss function | Ensemble | method         | top-1 | top-10 | Dic. size
softmax       | no       | GoogleNet [26] | 96.35 | 99.80  | 27.77MB
softmax       | no       | directmap [25] | 97.37 | N/A    | 23.50MB
softmax       | no       | ours           | 96.50 | 99.78  | 36.20MB
softmax + CLL | no       | ours           | 97.07 | 99.85  | 36.20MB
softmax + CLL | yes (4)  | ours           | 97.64 | 99.91  | 144.80MB

B. Results on DB1.1 and ICDAR-2013

In this experiment, we used the samples of 240 writers (nos. 1001-1240) of the DB1.1 database for training, and tested on the samples of the remaining 40 writers and on the ICDAR 2013 Chinese handwriting recognition competition set, respectively. In Section III, we introduced the method for training a DCNN model using both classification and similarity ranking signals as supervision, together with three loss functions for similarity ranking, namely Euclidean distance (ED), logistic discriminant (LD) and conditional log-likelihood loss (CLL). The recognition results on the DB1.1 and ICDAR-2013 test sets are shown in Table I. First, compared to the baseline DCNN method using only the softmax function in Table I, we can see that recognition accuracy is further improved by combining the two types of signals, especially similarity ranking with the CLL loss function.
This demonstrates the benefit of the proposed method, improving the top-1 recognition rate from 96.07% to 96.44%. Next, we compare the results of the three loss functions for similarity ranking. Table I shows that the results of the CLL method are better than those of ED and LD.

C. Comparison with other State-of-the-art Methods

In this experiment, we used the DB1.0 and DB1.1 [16] databases for training, and ICDAR-2013 for testing. To show the performance of the proposed method, we compare it with the HCCR method of [26], which reports very good results on the ICDAR-2013 dataset. The recognition results on the ICDAR-2013 test set are shown in Table II. First, comparing the recognition results of GoogleNet and the presented network architecture under the same loss function in Table II, we observe that the presented architecture yields a higher recognition rate than the baseline GoogleNet architecture. Next, compared with the state-of-the-art result of previous work [25], our method achieves a significant improvement, with a relative error rate reduction of 19.45%. It is worth noting that the memory cost of the presented model is 36.20MB, which is larger than that of the baseline GoogleNet model (27.77MB) and the directmap model (23.50MB). That is because the proposed model combines all Inception layers, with their output filter banks concatenated into a fully connected layer.

VI. CONCLUSION

This paper shows that the character classification and similarity ranking supervisory signals are complementary to each other: together they increase inter-class variations and reduce intra-class variations, and therefore much better classification performance can be achieved. Combining the two supervisory signals leads to significantly better results than softmax-based character classification alone. Experiments on the ICDAR 2013 offline HCCR competition dataset show that our best result is superior to all previous works.
The best testing error rate we achieved is 2.36%, which is a new state-of-the-art record to our knowledge.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China under Grants Nos. 61472386 and 61403380, and the Chongqing Research Program of Basic Research and Frontier Technology (No. cstc2016jcyja0011). The two Titan X GPUs used for this research were donated by the NVIDIA Corporation.

REFERENCES

[1] B. Chen, B. Zhu, and M. Nakagawa. Training of an on-line handwritten Japanese character recognizer by artificial patterns. Pattern Recognition Letters, 35:178-185, 2014.
[2] L. Chen, S. Wang, S. Wang, J. Sun, and J. Sun. Reconstruction combined training for convolutional neural networks on character recognition. In International Conference on Document Analysis and Recognition, pages 431-435, 2015.
[3] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In International Conference on Computer Vision and Pattern Recognition, pages 3642-3649, 2012.
[4] D. Ciresan and J. Schmidhuber. Multi-column deep neural networks for offline handwritten Chinese character classification. arXiv preprint, 2013.
[5] S. Ding, L. Lin, G. Wang, and H. Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 48(1):2993-3003, 2015.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint, 2015.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, pages 675-678, 2014.
[9] X.-B. Jin, C.-L. Liu, and X. Hou. Regularized margin-based conditional log-likelihood loss for prototype learning. Pattern Recognition, 43(7):2428-2438, 2010.
[10] J. Liu, Y. Deng, and C. Huang. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv preprint, 2015.
[11] I.-J. Kim, C. Choi, and C. Choi. Improving discrimination ability of convolutional neural networks by hybrid learning. IJDAR, 19(1):1-9, 2016.
[12] F. Kimura, K. Takashina, S. Tsuruoka, and Y. Miyake. Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(1):149-153, 1987.
[13] C.-L. Liu. Normalization-cooperated gradient feature extraction for handwritten character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1465-1469, 2007.
[14] C.-L. Liu and K. Marukawa. Pseudo two-dimensional shape normalization methods for handwritten Chinese character recognition. Pattern Recognition, 38(12):2242-2255, 2005.
[15] C.-L. Liu, H. Sako, and H. Fujisawa. Discriminative learning quadratic discriminant function for handwriting recognition.
IEEE Transactions on Neural Networks, 15(2):430-444, 2004.
[16] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang. Online and offline handwritten Chinese character recognition: Benchmarking on new databases. Pattern Recognition, 46(1):155-162, 2012.
[17] T. V. Phan, J. Gao, B. Zhu, and M. Nakagawa. Effects of line densities on nonlinear normalization for online handwritten Japanese character recognition. In International Conference on Document Analysis and Recognition, pages 834-838, 2011.
[18] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In International Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.
[19] C. Szegedy, V. Vanhoucke, S. Ioffe, and J. Shlens. Rethinking the Inception architecture for computer vision. arXiv preprint, 2015.
[20] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In International Conference on Computer Vision and Pattern Recognition, pages 1386-1393, 2014.
[21] C. Wu, W. Fan, Y. He, J. Sun, and S. Naoi. Handwritten character recognition by alternately trained relaxation convolutional neural network. In International Conference on Frontiers in Handwriting Recognition, pages 291-296, 2014.
[22] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint, 2015.
[23] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu. ICDAR 2013 Chinese handwriting recognition competition. In International Conference on Document Analysis and Recognition, pages 1464-1470, 2013.
[24] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Transactions on Image Processing, 24(12):4766-4779, 2015.
[25] X.-Y. Zhang, Y. Bengio, and C.-L. Liu. Online and offline handwritten Chinese character recognition: A comprehensive study and new benchmark.
Accepted by Pattern Recognition, 2016.
[26] Z. Zhong, L. Jin, and Z. Xie. High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps. In International Conference on Document Analysis and Recognition, pages 846-850, 2015.