Andrei Polzounov (Universitat Politecnica de Catalunya, Barcelona, Spain), Artsiom Ablavatski (A*STAR Institute for Infocomm Research, Singapore),

Similar documents
arxiv: v1 [cs.cv] 15 May 2017

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Lecture 5: Object Detection

Object detection with CNNs

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Object Detection. Part1. Presenter: Dae-Yong

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

Scene Text Recognition for Augmented Reality. Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science

Spatial Localization and Detection. Lecture 8-1

Object Detection Based on Deep Learning

Yiqi Yan. May 10, 2017

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

WeText: Scene Text Detection under Weak Supervision

Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Mask R-CNN. Kaiming He, Georgia, Gkioxari, Piotr Dollar, Ross Girshick Presenters: Xiaokang Wang, Mengyao Shi Feb. 13, 2018

arxiv: v1 [cs.cv] 1 Sep 2017

arxiv: v1 [cs.cv] 23 Apr 2016

Feature-Fused SSD: Fast Detection for Small Objects

arxiv: v2 [cs.cv] 27 Feb 2018

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA

An Anchor-Free Region Proposal Network for Faster R-CNN based Text Detection Approaches

YOLO9000: Better, Faster, Stronger

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Real-time Object Detection CS 229 Course Project

arxiv: v1 [cs.cv] 4 Jan 2018

arxiv: v1 [cs.cv] 31 Mar 2016

A pooling based scene text proposal technique for scene text reading in the wild

Feature Fusion for Scene Text Detection

Fuzzy Set Theory in Computer Vision: Example 3

VISION & LANGUAGE From Captions to Visual Concepts and Back

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

arxiv: v1 [cs.cv] 12 Sep 2016

Mask R-CNN. By Kaiming He, Georgia Gkioxari, Piotr Dollar and Ross Girshick Presented By Aditya Sanghi

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Text Extraction from Natural Scene Images and Conversion to Audio in Smart Phone Applications

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

Aggregating Local Context for Accurate Scene Text Detection

Deep Neural Networks for Scene Text Reading

arxiv: v1 [cs.cv] 15 Oct 2018

Recognizing Text in the Wild

Paper Motivation. Fixed geometric structures of CNN models. CNNs are inherently limited to model geometric transformations

TextBoxes++: A Single-Shot Oriented Scene Text Detector

CAP 6412 Advanced Computer Vision

EFFICIENT SEGMENTATION-AIDED TEXT DETECTION FOR INTELLIGENT ROBOTS

SEMANTIC SEGMENTATION AVIRAM BAR HAIM & IRIS TAL

arxiv: v1 [cs.cv] 14 Dec 2017

Conditional Random Fields as Recurrent Neural Networks

Deep Learning for Object detection & localization

Geometry-aware Traffic Flow Analysis by Detection and Tracking

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Detection and Recognition of Text from Image using Contrast and Edge Enhanced MSER Segmentation and OCR

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

VEHICLE CLASSIFICATION And License Plate Recognition

Predicting ground-level scene Layout from Aerial imagery. Muhammad Hasan Maqbool

Classifying a specific image region using convolutional nets with an ROI mask as input

Photo OCR ( )

Chinese text-line detection from web videos with fully convolutional networks

DEEP NEURAL NETWORKS FOR OBJECT DETECTION

Amodal and Panoptic Segmentation. Stephanie Liu, Andrew Zhou

Detecting and Recognizing Text in Natural Images using Convolutional Networks

Available online at ScienceDirect. Procedia Computer Science 96 (2016 )

Classification of objects from Video Data (Group 30)

Hand Detection For Grab-and-Go Groceries

arxiv: v2 [cs.cv] 22 Nov 2017

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection

Visual features detection based on deep neural network in autonomous driving tasks

arxiv: v1 [cs.cv] 27 Jul 2017

Bus Detection and recognition for visually impaired people

DEEP STRUCTURED OUTPUT LEARNING FOR UNCONSTRAINED TEXT RECOGNITION

arxiv: v3 [cs.cv] 1 Feb 2017

CS 1674: Intro to Computer Vision. Object Recognition. Prof. Adriana Kovashka University of Pittsburgh April 3, 5, 2018

Scene text recognition: no country for old men?

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Arbitrary-Oriented Scene Text Detection via Rotation Proposals

SAFE: Scale Aware Feature Encoder for Scene Text Recognition

Object Detection in Sports Videos

Machine Learning 13. week

arxiv: v2 [cs.cv] 10 Oct 2018

arxiv: v2 [cs.cv] 28 May 2017

(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard

Using Object Information for Spotting Text

2017 2nd International Conference on Software, Multimedia and Communication Engineering (SMCE 2017) ISBN:

RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN

A process for text recognition of generic identification documents over cloud computing

Improving Small Object Detection

Efficient indexing for Query By String text retrieval

A Novel Framework for Text Recognition in Street View Images

Learning to Segment Object Candidates

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors

Reading Text in the Wild from Compressed Images

Multi-Oriented Text Detection with Fully Convolutional Networks

Transcription:

WordFences: Text Localization and Recognition ICIP 2017 Andrei Polzounov (Universitat Politecnica de Catalunya, Barcelona, Spain), Artsiom Ablavatski (A*STAR Institute for Infocomm Research, Singapore), Dr Sergio Escalera (Universitat de Barcelona, Barcelona, Spain), Dr Shijian Lu (A*STAR Institute for Infocomm Research, Singapore), Dr Jianfei Cai (Nanyang Technological University, Singapore) 1

Sponsors Institute for Infocomm Research (I2R), at Singapore s Agency for Science, Technology and Research (A*STAR) CERCA Program, Government of Catalonia 2

Problem Description Text detection and recognition in natural scene imagery. Good test case problem for AI research and for uses in industry: mapping business from StreetView, translating menus or billboards, etc. 3

Motivation OCR can be used on scanned text. Natural images have a ton of variety in fonts, scales, kerning and features. 4

Proposed Solution 2-stage deep learning network Find locations/rois (text localization) with CNN. Detect characters (end-to-end text recognition) with RNN. 1st stage is the more difcult one. It is related to object recognition and semantic segmentation problems in Computer Vision. 5

Contributions Two main contributions: Use a word separator a WordFence for deliniating individual words. Using a novel weighted pixelwise softmax function for training the semantic segmentation. 6

Sample Detections 7

Related Work (CV) Maximally Stable Extremal Regions by Huang et al.(2014) works by frst using an MSER transform and then training a CNN. Stroke Width Transform by Epshtein et al. (2010) an edge detection method, that relies on the fact that a given text font should have similar width/thickness for each stroke in a character. Edge Boxes by Zitnick and Dollár (2014) simple object score based on the number of edges within a given sliding window. Sparse and fast to evaluate, but results could be better. 8

Related Deep Learning Single shot ROI detection: YOLO (2015) by Redmon et al. and SSD: Multibox (2015) by Liu et al. - both of these work by splitting image into grid and calculating probabilities Fully convolutional networks (2015) by Long and Shelhamer replace fully connected layers with deconv. DeepLab (2015) Liang-Chieh et al. and ResNet101 (2016) by He et al. are the bases for this work. 9

Related Text Detection SynthText (2016) by Gupta, Vedaldi, and Zisserman. Synthetic dataset for text recognition and one-shot box approach to learning. Jaderberg et al. (2014) used CNN to generate high number of proposals and flter with Random Forest (HoG). He et al. (2016) cascaded CNN with false positive rejection to detect textlines (no split). 10

Model Overview 11

Word Localization as Semantic Segmentation Semantic segmentation is a well known problem. Able to handle diferent scales using wide felds of view and multi-scale inference. Ground truth word separators created by dilation. 12

ResNet of Exponential Receptive Fields ResNets help to train huge networks without a vanishing gradient. Receptive felds can be enlarged using convolutional dilations in the ConvNets. Deep network + exponential receptive felds efective multi-scale detection of diferent sized text. 13

Weighted Pixelwise Softmax Loss Background pixels are the majority. More emphasis for text and WordFence pixels is needed. Algorithm allows us to rebalance per image weights on the fy. 14

WordFence vs No-WordFence WordFences act as penalization for merged words. 15

Text Datasets ICDAR 2011 and ICDAR 2013 International Conference on Document Analysis and Recognition COCO-Text subset of the popular MS-COCO dataset for object recognition (21 classes) SVT Google Street View data SynthText synthetic text mixed with scene images from Gupta et al. 16

Localization Results Methods marked with * use multi-stage false-positive detectors. 17

Recognition Results The recognition stage uses proposals given from the detection network. Recognition stage is based on the CRNN network by Shi et al. (2016). 18

ICDAR 2011 19

ICDAR 2013 20

Problems Encountered Going from segmentations to bounding boxes. Noisy detections and many false positives: Many small detected regions. Camera artifacts such as glare. Need for balancing precision vs recall. 21

Conclusion Text recognition as semantic segmentation WordFences as penalization SOTA recall on detection, which provides high quality samples to the recognition stage (which in itself is able to throw away false positives). SOTA F-scores on recognition 22

Future Work WDN relies on visual information to split words. Humans also use word semantics and memory. Raeding wrods with jubmled letetrs. A smarter system would be able to read text directly from images without a midpoint CV representation. Learn directly from neural net to a dictionary word output? 23

Questions? 24