A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling For Handwriting Recognition


A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling for Handwriting Recognition. Théodore Bluche, Hermann Ney, Christopher Kermorvant. SLSP 2014, Grenoble, October 16, 2014.

Outline: Handwriting Recognition with Hybrid NN/HMM (offline handwriting recognition, experimental setup); Deep Neural Networks (DNN): DNN training, sequence-discriminative training, results; Recurrent Neural Networks (RNN): LSTM, CTC, depth, dropout, results; Conclusions; System Combination and Final Results; Future Work.

Offline Handwriting Recognition. Example line transcripts: "Gay nodded. I know that..." (English); "Je me permets de vous écrire..." ("Allow me to write to you...", French).

Handwriting Recognition. Preprocessing: slope and slant correction, contrast enhancement, region-dependent height normalization (72px). Feature extraction: sliding window, extraction of vectors of handcrafted features, or extraction of raw pixel intensities. Optical modeling (HMM): state emission probabilities, with GMMs (generative) or transformed posteriors from neural networks (discriminative). Language modeling (LM): lexicon (sequences of characters → sequences of words) + statistical n-gram language model (probability of a word given the history of the n-1 previous words).

Hybrid NN / HMM. Hidden Markov Model (HMM): characters are modeled by a state sequence (first-order Markov chain), with 6 states per character for IAM and 5 for Rimes, associated with a transition model and an emission model; the emission model is generative, usually mixtures of Gaussians (GMMs). Hybrid Neural Network (NN) / HMM: the NN computes the state posterior probability given the input vector (usually a concatenation of several consecutive frames); rescaling by the state priors yields a discriminative NN emission model that replaces the GMMs.
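
A minimal sketch of the posterior-to-likelihood rescaling used in hybrid NN/HMM decoding (NumPy; function and variable names are illustrative, not the talk's implementation):

```python
import numpy as np

def hybrid_emission_log_likelihoods(state_posteriors, state_priors, eps=1e-10):
    """Convert NN state posteriors into scaled likelihoods for HMM decoding.

    state_posteriors: (num_frames, num_states) softmax outputs p(s | x_t)
    state_priors:     (num_states,) state frequencies p(s), e.g. from forced alignments
    Returns log [ p(s | x_t) / p(s) ], proportional to log p(x_t | s).
    """
    log_post = np.log(state_posteriors + eps)
    log_prior = np.log(state_priors + eps)
    return log_post - log_prior[np.newaxis, :]
```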

Experimental Setup: Databases.

IAM (English):
Split      | Pages  | Lines  | Words (7,843) | Characters (79)
Train      |    747 |  6,482 | 55,081        | 287,727
Validation |    116 |    976 |  8,895        |  43,050
Test       |    336 |  2,915 | 25,920        | 128,531

Rimes (French):
Split      | Pages  | Lines  | Words (8,061) | Characters (99)
Train      |  1,351 | 10,203 | 73,822        | 460,201
Validation |    149 |  1,130 |  8,380        |  51,924
Test       |    100 |    778 |  5,639        |  35,286

Experimental Setup: LM. The optical model recognizes text lines, but the LM is incorporated at the paragraph level (which makes more sense w.r.t. sentence boundaries); this brings a 1-3% absolute WER improvement.

Database | Voc. size | Corpus                    | LM                  | Validation OOV% / PPL | Test OOV% / PPL
IAM      | 50,000    | LOB* + Brown + Wellington | 3-gram with mod. KN | 4.3% / 298            | 3.7% / 329
Rimes    | 12,000    | Training set annotations  | 4-gram with mod. KN | 2.9% / 18             | 2.6% / 18

* the lines of the LOB corpus present in the validation and evaluation data have been ignored in the training of the language model

Experimental Setup: Features.
Handcrafted features: sliding window of 3px, with 3px step; 56 handcrafted features extracted from each frame:
- 3 pixel density measures in the frame and in different horizontal regions
- 2 measures of the center of gravity
- 12 pixel configuration relative counts (6 from the whole frame and 6 in the core region)
- 3 pixel densities in vertical regions
- HoG in 8 directions + deltas (= 28 + 28)
Pixels: sliding window of 45px, with 3px step, rescaled to 20 x 32px (keeps the aspect ratio); extraction of the 640 gray-level pixel intensities per frame (a sliding-window extraction sketch follows below).
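
A minimal sketch of the pixel-based frame extraction (NumPy + Pillow; the 72px line height, 45px window, 3px step and 20 x 32 rescaling come from the slides, while the function name and input conventions are assumptions):

```python
import numpy as np
from PIL import Image

def extract_pixel_frames(line_image, window=45, step=3, out_w=20, out_h=32):
    """Slide a 45px-wide window over a height-normalized (72px) gray-level line image
    and return one 640-dim vector of pixel intensities per frame.

    line_image: 2D uint8 array of shape (height, width)."""
    height, width = line_image.shape
    frames = []
    for x in range(0, max(width - window, 1), step):
        patch = line_image[:, x:x + window]
        # rescale the patch to 20 x 32 pixels (aspect ratio of 45 x 72 is preserved)
        patch_img = Image.fromarray(patch).resize((out_w, out_h))
        frames.append(np.asarray(patch_img, dtype=np.float32).reshape(-1) / 255.0)
    return np.stack(frames)  # (num_frames, 640)
```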

Deep Multi-Layer Perceptrons. Deep Multi-Layer Perceptrons are MLPs with many (3+) hidden layers. They are state-of-the-art and now standard in HMM-based speech recognition, and widely used in computer vision, e.g. for isolated character or digit recognition, but not yet applied to HMM-based unconstrained handwritten text-line recognition.

DNN Training. Forced alignments with a bootstrapping system give a training set of frames labeled with the correct state. The DNN is trained to classify each frame among all the different states, using Stochastic Gradient Descent with a classification cost (e.g. Negative Log-Likelihood / Cross-Entropy).
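
A minimal sketch of one framewise training step (PyTorch; layer widths, the 9-frame context window and the state count are illustrative assumptions, not the talk's exact configuration):

```python
import torch
import torch.nn as nn

num_inputs = 56 * 9        # e.g. 9-frame context of 56-dim handcrafted features (assumption)
num_states = 79 * 6        # e.g. 79 characters x 6 HMM states for IAM (illustrative)

# Illustrative deep MLP over a context window of frames.
model = nn.Sequential(
    nn.Linear(num_inputs, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, 1024), nn.Sigmoid(),
    nn.Linear(1024, num_states),          # logits; softmax is folded into the loss
)
criterion = nn.CrossEntropyLoss()         # negative log-likelihood of the correct state
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(frames, state_labels):
    """frames: (batch, num_inputs) float tensor; state_labels: (batch,) long tensor
    of forced-alignment state ids."""
    optimizer.zero_grad()
    loss = criterion(model(frames), state_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```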

DNN Training. Weight initialization with 1 epoch of Contrastive Divergence: unsupervised training of a Restricted Boltzmann Machine, layer by layer (Hinton, 2006). Finally, standard supervised training of the whole network (cross-entropy, SGD). Hinton, G., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527-1554.
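
A minimal sketch of one CD-1 update for a binary RBM, the building block of the layer-wise pretraining mentioned above (NumPy; learning rate, batching and function names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.01):
    """One Contrastive Divergence (CD-1) step for a binary RBM.

    v0: batch of visible vectors, shape (batch, n_vis); W: weights (n_vis, n_hid)."""
    # positive phase: hidden probabilities and a sample given the data
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(v0.dtype)
    # negative phase: one step of Gibbs sampling (reconstruction)
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # CD-1 approximation of the log-likelihood gradient
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid
```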

DNN Depth and Context: WER (%) as a function of the number of hidden layers (the slide reports results for both pixel and handcrafted-feature inputs).

Depth | Rimes | IAM
1     | 17.0% | 16.7%
2     | 15.1% | 15.5%
3     | 14.9% | 15.4%
4     | 14.7% | 15.4%
5     | 14.6% | 15.7%
6     | 15.0% | 15.4%
7     | 14.8% | 15.6%

DNN Sequence-Discriminative Training. Goal: improve the network's response for the correct state sequence while decreasing the predictions of competing hypotheses (sMBR: state-level Minimum Bayes Risk). Procedure: compute forced alignments; extract lattices with a unigram LM; compute the cost, where the accuracy A is 1 if the considered state is in the forced alignments at this position and 0 otherwise; gradients are obtained with the forward-backward algorithm.

WER (%)          | Framewise | + sMBR
Rimes, features  | 14.1      | 13.5 (-4.2%)
Rimes, pixels    | 13.6      | 13.1 (-3.7%)
IAM, features    | 12.4      | 11.7 (-5.6%)
IAM, pixels      | 12.4      | 11.8 (-4.8%)

Kingsbury, Brian (2009). Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3761-3764.
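
As a toy illustration only of the sMBR criterion (the expected state-level accuracy to be maximized): the sketch below enumerates a handful of competing state sequences explicitly, whereas the actual training sums over lattices with the forward-backward algorithm; all names are hypothetical.

```python
import numpy as np

def smbr_objective(hyp_log_liks, hyp_state_seqs, ref_state_seq):
    """Expected frame-level state accuracy over an enumerated set of hypotheses.

    hyp_log_liks:   array of joint log-scores log p(O, W) for each hypothesis W
    hyp_state_seqs: list of state-id sequences, same length as the reference
    ref_state_seq:  forced-alignment state sequence (the 'correct' states)."""
    posteriors = np.exp(hyp_log_liks - np.logaddexp.reduce(hyp_log_liks))  # P(W | O)
    accuracies = np.array([np.mean(np.array(seq) == np.array(ref_state_seq))
                           for seq in hyp_state_seqs])                      # A(W, W_ref)
    return float(np.sum(posteriors * accuracies))  # criterion to maximize
```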

Recurrent Neural Networks. Recurrent = the hidden layer at time t receives input from the previous layer and from the hidden layer at t-1. The network may also go through the sequence of inputs backwards: Bidirectional RNNs (BRNNs). Special recurrent units avoid training problems (vanishing gradient) and learn arbitrarily long dependencies: Long Short-Term Memory units (LSTM), giving BLSTM-RNNs. Instead of a sequence of feature vectors, one may use the image itself as input: Multi-Dimensional (MD)LSTM-RNNs.

RNNs for Handwriting Recognition. RNNs in handwritten text recognition contests: winner of ICDAR 09 (recognition of French and Arabic words), ICDAR 11 (French text lines), OpenHaRT 13 (Arabic paragraphs), Maurdor 14 (multilingual paragraphs), HTRtS 14 (English lines), ... RNNs used: MDLSTM-RNNs (image inputs, few features (2x4) at the bottom, an increasing number of features as the maps are subsampled) and BLSTM-RNNs (sequences of feature vectors as input, few LSTM layers with many features). This work: deep BLSTM-RNNs with many features.

BLSTM-RNNs Architecture: Input → LSTM (x2, forward and backward) → Feed-Fwd → LSTM (x2) → Feed-Fwd → LSTM (x2) → Softmax. (The slide also depicts the internal structure of an LSTM cell.)
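
A minimal sketch of such a stacked bidirectional LSTM over a sequence of feature vectors (PyTorch; the layer widths and the 56-dim input are illustrative assumptions, while the alternation of BLSTM and feed-forward layers follows the diagram above):

```python
import torch
import torch.nn as nn

class DeepBLSTM(nn.Module):
    """Input -> BLSTM -> feed-forward -> BLSTM -> feed-forward -> BLSTM -> softmax."""
    def __init__(self, num_features=56, hidden=100, num_outputs=80):
        super().__init__()
        self.blstm1 = nn.LSTM(num_features, hidden, bidirectional=True, batch_first=True)
        self.ff1 = nn.Linear(2 * hidden, 2 * hidden)
        self.blstm2 = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.ff2 = nn.Linear(2 * hidden, 2 * hidden)
        self.blstm3 = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_outputs)   # e.g. 79 characters + 1 blank (IAM)

    def forward(self, x):
        """x: (batch, time, num_features); returns per-frame log-probabilities."""
        h, _ = self.blstm1(x)
        h = torch.tanh(self.ff1(h))
        h, _ = self.blstm2(h)
        h = torch.tanh(self.ff2(h))
        h, _ = self.blstm3(h)
        return torch.log_softmax(self.out(h), dim=-1)
```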

RNNs: CTC Training. Connectionist Temporal Classification (CTC; Graves, 2006). The RNN has one output for each character, plus one blank (Ø) output. The blank is optional between two successive, different characters, and mandatory if the two characters are the same: [Ø] T [Ø] E [Ø] A → TEA. During training, all possible labelings/segmentations of the input sequence are considered, and the Negative Log-Likelihood of the correct label sequence (without blanks) is minimized; it is computed efficiently with the forward-backward algorithm. (The slide shows the ground-truth label, here TEA, and the RNN outputs.)
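
A minimal sketch of CTC training, assuming a model like the stacked BLSTM sketched earlier that outputs per-frame log-probabilities (PyTorch's built-in CTC loss; tensor shapes and names are illustrative):

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)   # output index 0 = blank (Ø)

def ctc_train_step(model, optimizer, features, targets, target_lengths):
    """features: (batch, time, num_features); targets: 1-D tensor of concatenated
    character ids (no blanks); target_lengths: (batch,) transcript lengths."""
    optimizer.zero_grad()
    log_probs = model(features)                        # (batch, time, num_outputs)
    log_probs = log_probs.transpose(0, 1)              # CTCLoss expects (time, batch, classes)
    input_lengths = torch.full((features.size(0),), log_probs.size(0), dtype=torch.long)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    loss.backward()                                    # gradients via forward-backward
    optimizer.step()
    return loss.item()
```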

RNNs: Depth. More than one LSTM hidden layer is generally better. For pixel-intensity inputs, one hidden LSTM layer is largely insufficient, but many hidden layers yield RNNs competitive with those trained on features, which supports the idea that deep networks learn useful representations of the input in their early layers.

RNNs: Dropout. Regularization technique that prevents co-adaptation: it makes units useful on their own, not only in combination with the outputs of others. Training: randomly drop hidden activations with probability p, which amounts to sampling from 2^N architectures sharing weights. Decoding: keep all activations but scale them by 1-p to compensate, approximating the geometric mean of the 2^N networks. In RNNs, dropout is applied to the outputs of the LSTM units (a sketch follows below).

WER (%)          | 7x200 | + dropout
Rimes, features  | 14.1  | 12.7 (-9.9%)
Rimes, pixels    | 14.7  | 13.6 (-7.5%)
IAM, features    | 12.9  | 11.9 (-7.8%)
IAM, pixels      | 13.1  | 11.8 (-9.9%)

Vu Pham, Théodore Bluche, Christopher Kermorvant, Jérôme Louradour (2014). Dropout improves recurrent neural networks for handwriting recognition. International Conference on Frontiers in Handwriting Recognition (ICFHR), 285-290.
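
A minimal sketch of dropout on LSTM outputs in the classic (non-inverted) formulation described above: zero activations with probability p during training, scale by 1-p at decoding time (PyTorch; the function is a hypothetical helper one would call between the BLSTM and feed-forward layers of the earlier sketch):

```python
import torch

def lstm_output_dropout(h, p=0.5, training=True):
    """h: LSTM output activations, e.g. (batch, time, 2*hidden).

    Training: each activation is dropped with probability p (one of 2^N thinned
    networks sharing weights). Decoding: keep everything, scaled by (1 - p)."""
    if training:
        keep_mask = (torch.rand_like(h) >= p).to(h.dtype)
        return h * keep_mask
    return h * (1.0 - p)
```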

Results: IAM.

System                   | WER (%) | CER (%)
GMM-HMM                  | 19.6    | 9.0
DNN, features            | 14.7    | 5.8
DNN, pixels              | 14.7    | 5.9
RNN, features            | 14.3    | 5.3
RNN, pixels              | 14.8    | 5.6
ROVER combination        | 11.9    | 4.9
Doetsch et al., 2014 *   | 12.2    | 4.7
Kozielski et al., 2013 * | 13.3    | 5.1
Pham et al., 2014        | 13.6    | 5.1
Messina et al., 2014 *   | 19.1    | -

* open-vocabulary

Results: Rimes.

System                 | WER (%) | CER (%)
GMM-HMM                | 15.8    | 6.0
DNN, features          | 13.5    | 4.1
DNN, pixels            | 12.9    | 3.8
RNN, features          | 12.7    | 4.0
RNN, pixels            | 13.8    | 4.3
ROVER combination      | 11.8    | 3.7
Pham et al., 2014      | 12.3    | 3.3
Doetsch et al., 2014   | 12.9    | 4.3
Messina et al., 2014   | 13.3    | -
Kozielski et al., 2013 | 13.7    | 4.6

Conclusions. With deep (MLP or recurrent) neural networks, handcrafted features do not seem essential: mere pixel intensities yield competitive, if not better, results. Deep MLPs are also very good for handwriting recognition; RNNs are not the only option. Even for BLSTM-RNNs, depth improves the final performance: don't stop at one or two LSTM layers. Both approaches are complementary (ROVER combination). Key aspects: context, depth, sequence training, dropout, LM applied on paragraphs.

Future Work. Deep MLPs: dropout, CTC training. Recurrent Neural Networks: sequence-discriminative training. Going further: include other types of NN, such as Convolutional Neural Networks (also very popular in computer vision) and MDLSTM-RNNs; tandem combination: extract features from NNs.

Thank you! Théodore Bluche (tb@a2ia.com). Thanks to A.-L. Bernard, V. Pham and J. Louradour for some of the illustrations in this presentation.