DEEP STRUCTURED OUTPUT LEARNING FOR UNCONSTRAINED TEXT RECOGNITION

Transcription:

DEEP STRUCTURED OUTPUT LEARNING FOR UNCONSTRAINED TEXT RECOGNITION
Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford, UK

TEXT RECOGNITION
Localized text image as input, character string as output. (Example word images: COSTA, DENIM, DISTRIBUTED, FOCAL.)

TEXT RECOGNITION
State of the art constrained text recognition:
- word classification [Jaderberg, NIPS DLW 2014]
- static N-gram and word language model [Bissacco, ICCV 2013]
(Example word image: APARTMENTS.)

TEXT RECOGNITION
State of the art constrained text recognition:
- word classification [Jaderberg, NIPS DLW 2014]
- static N-gram and word language model [Bissacco, ICCV 2013]
But what about a random string? Or a new, unmodeled word?

TEXT RECOGNITION
Unconstrained text recognition, e.g. for house numbers [Goodfellow, ICLR 2014], business names, phone numbers, emails, etc.
- Random string: RGQGAN323
- New, unmodeled word: TWERK

OVERVIEW
- Two models for text recognition [Jaderberg, NIPS DLW 2014]: Character Sequence Model and Bag-of-N-grams Model
- Joint formulation: CRF to construct a graph, structured output loss, back-propagation for joint optimization
- Experiments: generalization to zero-shot recognition; performance recovered when constrained

CHARACTER SEQUENCE MODEL
Deep CNN to encode the image, with a per-character decoder.
- Feature map sizes: 32x100x1 -> 32x100x64 -> 16x50x128 -> 8x25x256 -> 8x25x512 -> 4x13x512 -> 1x1x4096 -> 1x1x4096
- 5 convolutional layers, 2 FC layers, ReLU, max-pooling
- 23 output classifiers over 37 classes (0-9, a-z, null)
- Fixed 32x100 input size distorts the aspect ratio

CHARACTER SEQUENCE MODEL
Deep CNN to encode the image, with a per-character decoder. (Figure: the 32x100 input x passes through the CHAR CNN; each of the 23 character positions gets a 37-way classifier giving P(c_1(x)) ... P(c_23(x)), with a null class Ø for unused positions.)
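For concreteness, a minimal PyTorch sketch of a character sequence model with this layout. The channel widths and the 23 independent 37-way classifiers follow the slides; kernel sizes, padding and pooling placement are assumptions chosen to roughly reproduce the quoted feature-map sizes.

```python
import torch
import torch.nn as nn

class CharSequenceModel(nn.Module):
    """Sketch of the CHAR model: a shared conv encoder followed by 23
    independent 37-way classifiers (digits 0-9, letters a-z, null)."""

    def __init__(self, num_positions=23, num_classes=37):
        super().__init__()
        # 5 conv layers with ReLU and max-pooling; channel widths follow the
        # slide (64, 128, 256, 512, 512); kernel sizes/strides are assumptions.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 32x100 -> 16x50
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 16x50 -> 8x25
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),       # 8x25 -> 4x13
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 4 * 13, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        # One 37-way classifier per character position.
        self.heads = nn.ModuleList(
            [nn.Linear(4096, num_classes) for _ in range(num_positions)]
        )

    def forward(self, x):                          # x: (B, 1, 32, 100), grayscale
        h = self.fc(self.encoder(x))
        # Returns (B, 23, 37): per-position class scores.
        return torch.stack([head(h) for head in self.heads], dim=1)

if __name__ == "__main__":
    scores = CharSequenceModel()(torch.randn(2, 1, 32, 100))
    print(scores.shape)  # torch.Size([2, 23, 37])
```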

BAG-OF-N-GRAMS MODEL
Represent a string by the character N-grams contained within it. Example for "spires":
- 1-grams: s, p, i, r, e
- 2-grams: sp, pi, ir, re, es
- 3-grams: spi, pir, ire, res
- 4-grams: spir, pire, ires

BAG-OF-N-GRAMS MODEL
Deep CNN to encode the image, with an N-gram detection vector as output. Limited set of 10k modeled N-grams.
- Same encoder layout as the CHAR CNN (32x100x1 -> 32x100x64 -> 16x50x128 -> 8x25x256 -> 8x25x512 -> 4x13x512 -> 1x1x4096 -> 1x1x4096), followed by a 1x1x10000 N-gram detection vector
- (Example modeled N-grams: a, b, ak, ke, ra, aba, rake, raze)
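A small Python sketch of the bag-of-N-grams encoding that the NGRAM CNN is trained to detect; the tiny vocabulary here is only a stand-in for the 10k modeled N-grams.

```python
def char_ngrams(word, max_n=4):
    """All character N-grams (N = 1..max_n) contained in a word,
    e.g. "spires" -> {"s", "p", ..., "sp", ..., "spir", "pire", "ires"}."""
    word = word.lower()
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

def bag_of_ngrams(word, vocab):
    """Binary N-gram detection target over a fixed list of modeled N-grams
    (10k in the slides; `vocab` here is a tiny stand-in)."""
    present = char_ngrams(word)
    return [1.0 if g in present else 0.0 for g in vocab]

if __name__ == "__main__":
    print(sorted(char_ngrams("spires")))
    vocab = ["a", "b", "ak", "ke", "ra", "aba", "rake", "raze", "ires"]
    print(bag_of_ngrams("spires", vocab))   # only "ires" is present
```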

JOINT MODEL
Can we combine these two representations? (Figure: the 32x100 input feeds both the CHAR CNN, with 23 per-character 37-way outputs, and the NGRAM CNN, with a 1x1x10000 N-gram detection output.)

JOINT MODEL
(Figure, built up over three slides: the CHAR CNN outputs per-character scores f(x) over the maximum number of characters, the NGRAM CNN outputs N-gram scores g(x), and both feed a graph over candidate characters such as a, e, k, q, r. The word prediction is w = arg max_w S(w, x), found with beam search.)
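The slides do not spell out the form of S(w, x). Below is a rough sketch in which a word is scored by summing the per-position character scores f(x) with the detection scores g(x) of the N-grams it contains; the unweighted sum and the toy inputs are illustrative assumptions, not the paper's exact formulation.

```python
def joint_score(word, char_scores, ngram_scores,
                alphabet="0123456789abcdefghijklmnopqrstuvwxyz", null_index=36):
    """Rough sketch of a joint score S(w, x).

    char_scores:  list of 23 per-position score vectors of length 37 (CHAR CNN)
    ngram_scores: dict mapping each modeled N-gram to its detection score (NGRAM CNN)
    """
    word = word.lower()
    score = 0.0
    # Character term: the chosen character at each filled position, null afterwards.
    for i, pos_scores in enumerate(char_scores):
        score += pos_scores[alphabet.index(word[i])] if i < len(word) else pos_scores[null_index]
    # N-gram term: only N-grams that are actually modeled contribute.
    grams = {word[i:i + n] for n in range(1, 5) for i in range(len(word) - n + 1)}
    score += sum(ngram_scores.get(g, 0.0) for g in grams)
    return score

if __name__ == "__main__":
    # Toy inputs: uniform character scores and two detected N-grams.
    char_scores = [[0.1] * 37 for _ in range(23)]
    print(joint_score("rake", char_scores, {"ra": 0.9, "ke": 0.8}))
```

Beam search over candidate words then keeps the partial strings with the highest such scores rather than enumerating all words exhaustively.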

STRUCTURED OUTPUT LOSS
The score of the ground-truth word should be greater than or equal to the score of the highest-scoring incorrect word plus a margin:
  S(w_gt, x) >= mu + max_{w != w_gt} S(w, x)
Enforcing this as a soft constraint leads to a hinge loss:
  L(x, w_gt) = max(0, mu + max_{w != w_gt} S(w, x) - S(w_gt, x))
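A minimal PyTorch sketch of that hinge loss over candidate scores; the margin value and the use of beam-search candidates as the incorrect words are illustrative assumptions.

```python
import torch

def structured_hinge_loss(score_gt, scores_incorrect, margin=1.0):
    """Hinge loss from the soft constraint
        S(w_gt, x) >= margin + max over incorrect w of S(w, x).

    score_gt:         scalar tensor, score of the ground-truth word
    scores_incorrect: 1-D tensor of scores of candidate incorrect words
                      (e.g. the highest-scoring beam-search paths,
                      excluding the ground truth)
    """
    violation = margin + scores_incorrect.max() - score_gt
    return torch.clamp(violation, min=0.0)     # zero loss once the margin holds

if __name__ == "__main__":
    loss = structured_hinge_loss(torch.tensor(5.2), torch.tensor([4.9, 4.1, 3.7]))
    print(loss)   # tensor(0.7000): best incorrect word is still within the margin
```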


EXPERIMENTS

DATASETS
All models trained purely on synthetic data [Jaderberg, NIPS DLW 2014]:
- Font rendering
- Border/shadow & color
- Composition
- Projective distortion
- Natural image blending
Realistic enough to transfer to testing on real-world images.
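A very rough Python/PIL stand-in for such a rendering pipeline, assuming the default PIL font, placeholder colours, and a small rotation in place of the projective distortion; it is only meant to illustrate the render-distort-blend idea, not the actual generator.

```python
from PIL import Image, ImageDraw, ImageFont

def render_word(word, size=(100, 32)):
    """Render a word, colour it, apply a mild rotation (standing in for the
    projective distortion) and blend it with a background patch (standing in
    for natural-image blending)."""
    canvas = Image.new("RGB", size, (40, 40, 40))            # background colour
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()                          # stand-in font
    draw.text((5, 10), word, fill=(230, 230, 230), font=font)
    canvas = canvas.rotate(3, fillcolor=(40, 40, 40))        # mild distortion
    background = Image.new("RGB", size, (90, 120, 70))       # flat "natural" patch
    return Image.blend(canvas, background, alpha=0.25)

if __name__ == "__main__":
    render_word("apartments").save("synthetic_word.png")
```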

DATASETS
Synth90k:
- Lexicon of 90k words
- 9 million images, training + test splits
- Download from http://www.robots.ox.ac.uk/~vgg/data/text/

DATASETS
Real-world test sets: ICDAR 2003, ICDAR 2013, Street View Text, IIIT 5k-word.

TRAINING
Pre-train the CHAR and NGRAM models independently, then use them to initialize the joint model and continue training jointly.

EXPERIMENTS - JOINT IMPROVEMENT
Accuracy (%), all models trained on Synth90k:

  Test Data | CHAR | JOINT
  Synth90k  | 87.3 | 91.0
  IC03      | 85.9 | 89.6
  SVT       | 68.0 | 71.7
  IC13      | 79.5 | 81.8

The joint model outperforms the character sequence model alone. Example corrections:
- CHAR: grahaws, JOINT: grahams, GT: grahams
- CHAR: mediaal, JOINT: medical, GT: medical
- CHAR: chocoma_, JOINT: chocomel, GT: chocomel
- CHAR: iustralia, JOINT: australia, GT: australia

JOINT MODEL CORRECTIONS
(Figure: examples where an edge is down-weighted in the graph and where edges are up-weighted in the graph.)

EXPERIMENTS - ZERO-SHOT RECOGNITION
Accuracy (%):

  Train Data | Test Data    | CHAR | JOINT
  Synth90k   | Synth90k     | 87.3 | 91.0
  Synth90k   | Synth72k-90k | 87.3 | -
  Synth90k   | Synth45k-90k | 87.3 | -
  Synth90k   | IC03         | 85.9 | 89.6
  Synth90k   | SVT          | 68.0 | 71.7
  Synth90k   | IC13         | 79.5 | 81.8
  Synth1-72k | Synth72k-90k | 82.4 | 89.7
  Synth1-45k | Synth45k-90k | 80.3 | 89.1

There is a large difference for the CHAR model when it is not trained on the test words; the joint model recovers the performance.

EXPERIMENTS - COMPARISON
Accuracy (%) with no lexicon:

  Model Type           | Model                 | IC03 | SVT  | IC13
  Unconstrained        | Baseline (ABBYY)      | -    | -    | -
                       | Wang, ICCV 11         | -    | -    | -
                       | Bissacco, ICCV 13     | -    | 78.0 | 87.6
  Language Constrained | Yao, CVPR 14          | -    | -    | -
                       | Jaderberg, ECCV 14    | -    | -    | -
                       | Gordo, arxiv 14       | -    | -    | -
                       | Jaderberg, NIPSDLW 14 | 98.6 | 80.7 | 90.8
  Unconstrained        | CHAR                  | 85.9 | 68.0 | 79.5
                       | JOINT                 | 89.6 | 71.7 | 81.8

EXPERIMENTS - COMPARISON
Accuracy (%). The first three result columns are with no lexicon (IC03, SVT, IC13); the last four are with a fixed lexicon (IC03-Full, SVT-50, IIIT5k-50, IIIT5k-1k).

  Model Type           | Model                 | IC03 | SVT  | IC13 | IC03-Full | SVT-50 | IIIT5k-50 | IIIT5k-1k
  Unconstrained        | Baseline (ABBYY)      | -    | -    | -    | 55.0      | 35.0   | 24.3      | -
                       | Wang, ICCV 11         | -    | -    | -    | 62.0      | 57.0   | -         | -
                       | Bissacco, ICCV 13     | -    | 78.0 | 87.6 | -         | 90.4   | -         | -
  Language Constrained | Yao, CVPR 14          | -    | -    | -    | 80.3      | 75.9   | 80.2      | 69.3
                       | Jaderberg, ECCV 14    | -    | -    | -    | 91.5      | 86.1   | -         | -
                       | Gordo, arxiv 14       | -    | -    | -    | -         | 90.7   | 93.3      | 86.6
                       | Jaderberg, NIPSDLW 14 | 98.6 | 80.7 | 90.8 | 98.6      | 95.4   | 97.1      | 92.7
  Unconstrained        | CHAR                  | 85.9 | 68.0 | 79.5 | 96.7      | 93.5   | 95.0      | 89.3
                       | JOINT                 | 89.6 | 71.7 | 81.8 | 97.0      | 93.2   | 95.5      | 89.6

SUMMARY
- Two models for text recognition
- Joint formulation: structured output loss, back-propagation for joint optimization
- Experiments: the joint model improves accuracy on language-based data and degrades gracefully when the text does not come from a language (the N-gram model doesn't contribute much); it sets a benchmark for unconstrained accuracy and competes with purely constrained models

jaderberg@google.com