END-TO-END CHINESE TEXT RECOGNITION


Jie Hu (1), Tszhang Guo (1), Ji Cao (2), Changshui Zhang (1)
(1) Department of Automation, Tsinghua University
(2) Beijing SinoVoice Technology
November 15, 2017
Presentation at IEEE Global Conference on Signal and Information Processing (GlobalSIP)

Introduction: Importance
Applications of Chinese character recognition: business cards, bank cards, ID cards, car numbers

Introduction: English Text Recognition
English text recognition [1][2]
[1] Xinghua Lou et al., Generative Shape Models: Joint Text Recognition and Segmentation with Very Little Training Data, NIPS 2016.
[2] Chen-Yu Lee, Simon Osindero, Recursive Recurrent Nets with Attention Modeling for OCR in the Wild, CVPR 2016.

Introduction: English vs Chinese
English
- Alphabet size: 52
- Normal structure
- Arabic numerals: 10
- Punctuation ...
Chinese
- Common words: 6k
- More complicated structure
- Arabic numerals: 10
- Punctuation ...

Introduction: Traditional and deep approaches
Traditional methods
- Handcrafted features
- Traditional classifiers
- ...
Deep learning approaches
- Learn features automatically
- Highly nonlinear transformations
- Success in various domains
- Need massive labeled data
Our approach
- Create a large Chinese text dataset from real business cards and use deep neural networks to perform text recognition

Method

Method: Create Chinese text dataset
Human collection: 260k business card images
Synthetic generation: 390k images
- Font rendering
- Noising
- Background colors
[Figure: example images showing noising, background, and font variations]

Method: Whole Structure
[Figure: network structure]

Method: Convolutional Neural Network
Network structure
- Block 1: 2 conv layers (kernel size 2 × 2, 16 kernels), then 1 max pooling layer (kernel size 2 × 2)
- Block 2: 2 conv layers (kernel size 2 × 2, 64 kernels), then 1 max pooling layer (kernel size 2 × 2)
- Block 3: 2 conv layers (kernel size 2 × 2, 128 kernels), then 1 max pooling layer (kernel size 2 × 2), producing the feature map
- Zero-padding for short images
- ReLU activation for conv layers [3]
- Feature map split as word embedding [4]
[3] Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.
[4] Mikolov et al., Efficient Estimation of Word Representations in Vector Space, Computer Science 2013.
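To make the layer list concrete, the following minimal sketch traces how the feature map size changes through the three conv/pool stages. The slides do not state strides or per-layer padding, so the sketch assumes stride 1 and no padding for the convolutions, stride 2 for max pooling, and a column-wise split of the final feature map into a sequence; the 32 × 128 input size and all function names are hypothetical.

```python
# Shape bookkeeping only; no weights are involved.
# Assumptions (not stated on the slides): conv stride 1 without padding,
# max pooling with stride 2, column-wise split of the final feature map.

KERNELS_PER_BLOCK = [16, 64, 128]  # from the slide: 16, 64, 128 kernels


def conv_out(size, kernel=2, stride=1, padding=0):
    """Output length of one convolution along a single axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Output length of one max pooling step along a single axis."""
    return (size - kernel) // stride + 1


def feature_map_shape(height, width):
    """Return (channels, height, width) after the three blocks."""
    channels = 1  # assumed grayscale input
    for kernels in KERNELS_PER_BLOCK:
        for _ in range(2):  # two conv layers per block
            height, width = conv_out(height), conv_out(width)
        height, width = pool_out(height), pool_out(width)
        channels = kernels
    return channels, height, width


def sequence_shape(height, width):
    """Split the feature map by column: (time steps, features per step)."""
    channels, h, w = feature_map_shape(height, width)
    return w, channels * h


if __name__ == "__main__":
    print(feature_map_shape(32, 128))  # hypothetical 32 x 128 line image
    print(sequence_shape(32, 128))
```

Under these assumptions each column of the final feature map becomes one input vector for the recurrent layers, so wider line images yield longer sequences.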

Method: Long Short Term Memory
Long Short-Term Memory (LSTM) [5]
[Figure: network structure with LSTM cells and fully connected layers]
- LSTM cell type: BiLSTM
- All LSTM cells share weights
- Prediction results may contain "-"
- LSTM hidden size: [value missing in source]
[5] Hochreiter, S., Schmidhuber, J., Long Short-Term Memory, Neural Computation, 1997.

Method: Connectionist Temporal Classification
Connectionist Temporal Classification (CTC) [6]
CTC rule:
$$p(z_t^k) = \frac{\exp(y_t^k)}{\sum_j \exp(y_t^j)}, \qquad p(\pi \mid x) = \prod_{t=1}^{T} z_t^{\pi_t}, \qquad p(l \mid x) = \sum_{\pi : F(\pi) = l} p(\pi \mid x)$$
$$F(a\text{-}ab\text{-}) = F(\text{-}aa\text{-}\text{-}abb) = aab$$
where $p(z_t^k)$ is the probability of word $k$ at time step $t$ (written $z_t^{\pi_t}$ in the product for the symbol $\pi_t$), $\pi$ is the word sequence, and $l$ is the real label.
[6] Alex Graves et al., Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, ICML 2006.
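The mapping F in the CTC rule can be illustrated with a few lines of Python. This is a minimal sketch, assuming the blank symbol is written "-" and that F first merges consecutive repeated symbols and then removes blanks, as in the standard CTC formulation; the function name is hypothetical.

```python
from itertools import groupby

BLANK = "-"  # assumed notation for the CTC blank symbol


def ctc_collapse(path, blank=BLANK):
    """Merge consecutive repeated symbols, then drop blanks."""
    merged = (symbol for symbol, _ in groupby(path))
    return "".join(symbol for symbol in merged if symbol != blank)


if __name__ == "__main__":
    print(ctc_collapse("a-ab-"))
    print(ctc_collapse("-aa--abb"))
```

The blank is what lets a label contain a doubled character: "aab" survives only because a blank separates the two a's in the path, while "aab" as a raw path would collapse to "ab".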

Method: Loss function
The final loss for training our model is
$$\mathrm{Loss} = -\log p(l \mid x), \qquad p(l \mid x) = \sum_{\pi : F(\pi) = l} p(\pi \mid x),$$
where $p(l \mid x)$ can be computed efficiently with a dynamic programming algorithm, so the loss can be optimized end to end.
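The dynamic programming step can be sketched as the standard CTC forward recursion over the label interleaved with blanks. The code below is an illustrative sketch, not the authors' implementation: symbols are integer indices, index 0 is assumed to be the blank, and the probability table is made up. A brute-force sum over all paths is included only to check the recursion on a tiny input.

```python
import math
from itertools import product


def ctc_label_prob(probs, label, blank=0):
    """p(l|x) via the CTC forward recursion.

    probs[t][k] is the softmax probability of symbol k at time step t.
    """
    ext = [blank]  # label with blanks inserted around every symbol
    for symbol in label:
        ext += [symbol, blank]
    n_steps, n_states = len(probs), len(ext)
    alpha = [0.0] * n_states
    alpha[0] = probs[0][ext[0]]
    if n_states > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, n_steps):
        new = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)


def collapse(path, blank=0):
    """Merge repeats, then drop blanks (the mapping F)."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return out


def brute_force_prob(probs, label, blank=0):
    """Sum p(pi|x) over every path pi with F(pi) = l (exponential cost)."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path, blank) == list(label):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return total


if __name__ == "__main__":
    # Hypothetical softmax outputs: 4 time steps, 3 symbols (0 = blank).
    probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.4, 0.1]]
    p = ctc_label_prob(probs, [1, 2])
    print("p(l|x) =", p, "loss =", -math.log(p))
    print("brute force =", brute_force_prob(probs, [1, 2]))
```

The recursion costs time proportional to the number of time steps times the label length, whereas the brute-force sum grows exponentially with the number of time steps.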

Experiments

Experiments: Dataset and Evaluation
Dataset: business card images
- Real world: 260k
- Synthetic: 390k
- Training set: 80%, validation set: 10%, test set: 10%
Evaluation: Line Recognition Accuracy (LRA), where an image counts as correct only if every character is recognized correctly
$$\mathrm{LRA} = \frac{C}{N}$$
- $C$: number of correctly recognized images
- $N$: total number of images
Comparison methods: Google Tesseract v3.03, ABBYY
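As a small illustration of the metric, LRA can be computed by exact string comparison between each predicted line and its ground truth. This is a sketch with a hypothetical function name and made-up example strings; it is not taken from the presentation.

```python
def line_recognition_accuracy(predictions, labels):
    """LRA = C / N, counting a line as correct only on an exact match."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one image is required")
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Hypothetical recognition results for three text-line images.
    gold = ["清华大学", "北京市海淀区", "13800000000"]
    pred = ["清华大学", "北京市海淀区", "13800000008"]
    print(line_recognition_accuracy(pred, gold))
```

Because a single wrong character makes the whole line count as an error, LRA is stricter than character-level accuracy.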

Experiments: LRA comparison

Method   Tesseract   ABBYY    Our-1    Our-2
LRA      28.51%      74.57%   94.12%   98.22%

Table: Comparison with other software.
Note: Our-1 is trained only on synthetic data for a fair comparison. Our-2 is trained on both synthetic data and real-world data. All experiments are tested on real-world business card data.

Experiments: LRA comparison

Experiment   Training data            Test data    LRA
1            real world               real world   96.27%
2            synthetic                real world   92.99%
3            real world               synthetic    94.12%
4            real world + synthetic   real world   98.22%

Table: Comparison with networks trained on different datasets

Thank you!
