Scene Text Recognition for Augmented Reality. Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science

Similar documents
Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Spatial Localization and Detection. Lecture 8-1

YOLO9000: Better, Faster, Stronger

Object detection with CNNs

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

arxiv: v1 [cs.cv] 12 Sep 2016

VEHICLE CLASSIFICATION And License Plate Recognition

Detecting and Recognizing Text in Natural Images using Convolutional Networks

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Yiqi Yan. May 10, 2017

Andrei Polzounov (Universitat Politecnica de Catalunya, Barcelona, Spain), Artsiom Ablavatski (A*STAR Institute for Infocomm Research, Singapore),

AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Efficient Segmentation-Aided Text Detection For Intelligent Robots

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Lecture 5: Object Detection

Return of the Devil in the Details: Delving Deep into Convolutional Nets

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. By Joa õ Carreira and Andrew Zisserman Presenter: Zhisheng Huang 03/02/2018

Object Detection Design challenges

Rich feature hierarchies for accurate object detection and semantic segmentation

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Object Detection. Part1. Presenter: Dae-Yong

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

arxiv: v1 [cs.cv] 1 Sep 2017

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Deformable Part Models

Detecting and Parsing of Visual Objects: Humans and Animals. Alan Yuille (UCLA)

Semantic Segmentation

MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection

OBJECT DETECTION HYUNG IL KOO

CS231N Section. Video Understanding 6/1/2018

WeText: Scene Text Detection under Weak Supervision

Aggregating Local Context for Accurate Scene Text Detection

Practical Methodology. Lecture slides for Chapter 11 of Deep Learning Ian Goodfellow

Learning to Detect Activity in Untrimmed Video. Prof. Bernard Ghanem

Deep Automatic Licence Plate Recognition system

Fully Convolutional Networks for Semantic Segmentation

Yield Estimation using faster R-CNN

Dynamic Routing Between Capsules

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

MULTI-LEVEL 3D CONVOLUTIONAL NEURAL NETWORK FOR OBJECT RECOGNITION SAMBIT GHADAI XIAN LEE ADITYA BALU SOUMIK SARKAR ADARSH KRISHNAMURTHY

Large-Scale Traffic Sign Recognition based on Local Features and Color Segmentation

Deep Learning for Virtual Shopping. Dr. Jürgen Sturm Group Leader RGB-D

CAP 6412 Advanced Computer Vision

Photo OCR ( )

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA

Efficient indexing for Query By String text retrieval

DeepIndex for Accurate and Efficient Image Retrieval

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Deep Neural Networks for Scene Text Reading

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia

Ing. Lukáš Neumann Prof. Dr. Ing. Jiří Matas

Unsupervised Feature Learning for Optical Character Recognition

Detection of a Single Hand Shape in the Foreground of Still Images

How to Build Optimized ML Applications with Arm Software

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

DETECTING TEXTUAL INFORMATION IN IMAGES FROM ONION DOMAINS USING TEXT SPOTTING

Lecture: Deep Convolutional Neural Networks

Final Report: Smart Trash Net: Waste Localization and Classification

Asynchronous Parallel Learning for Neural Networks and Structured Models with Dense Features

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

Bus Detection and recognition for visually impaired people

Exploiting scene constraints to improve object detection algorithms for industrial applications

Scene text recognition: no country for old men?

Convolutional Neural Networks: Applications and a short timeline. 7th Deep Learning Meetup Kornel Kis Vienna,

Object Category Detection: Sliding Windows

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY

arxiv: v1 [cs.cv] 4 Jan 2018

DEEP NEURAL NETWORKS FOR OBJECT DETECTION

Visual features detection based on deep neural network in autonomous driving tasks

Xiaowei Hu* Lei Zhu* Chi-Wing Fu Jing Qin Pheng-Ann Heng

Lecture 7: Semantic Segmentation

Object Detection Based on Deep Learning

Category vs. instance recognition

Arbitrary-Oriented Scene Text Detection via Rotation Proposals

UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss

Convolutional Neural Network Layer Reordering for Acceleration

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

Internet of things that video

TRAFFIC SIGN RECOGNITION USING A MULTI-TASK CONVOLUTIONAL NEURAL NETWORK

Recognizing people. Deva Ramanan

DEEP STRUCTURED OUTPUT LEARNING FOR UNCONSTRAINED TEXT RECOGNITION

Introduction to Deep Learning for Facial Understanding Part III: Regional CNNs

Bilinear Models for Fine-Grained Visual Recognition

Is 2D Information Enough For Viewpoint Estimation? Amir Ghodrati, Marco Pedersoli, Tinne Tuytelaars BMVC 2014

RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Towards Visual Words to Words

arxiv: v3 [cs.cv] 2 Jun 2017

Deconvolution Networks

Intro to Deep Learning. Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn

Bidirectional Recurrent Convolutional Networks for Video Super-Resolution

Word Hypotheses for Segmentation-free Word Spotting in Historic Document Images


Kaggle Data Science Bowl 2017 Technical Report

Category-level localization

Transcription:

Scene Text Recognition for Augmented Reality Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science

Outline Research area and motivation Finding text in natural scenes Prior art Improving the bottomup detector Endtoend text detection Future work Acknowledgements Conclusion

Research area and motivation Augmented reality now viable thanks to advances in display technology and compute power Computer vision now actually works What APIs could developers use to build rich AR applications? Desktop : Browser :: AR :?

Finding text in natural scenes Lots of objects are labelled with text Useful API for Augmented Reality devs Number plate, ID card scanning Reduces need for QR codes

Prior art Automatic Number Plate Recognition (ANPR) Extremal Region (ER) based character detector M. Jaderberg, A. Vedaldi, A. Zisserman, Deep Features for Text Spotting, ECCV 2014. Endtoend convolutional neural network to localize text L. Neumann, J. Matas, "Realtime Lexiconfree Scene Text Localization and Recognition", TPAMI 2015. Bottomup convolutional network to detect characters OpenALPR: http://www.openalpr.com/ A. Gupta, A. Vedaldi, A. Zisserman, Synthetic Data for Text Localisation in Natural Images, CVPR 2016. Recurrent neural network to localize text Z. Tian et. al, Detecting text in natural image with connectionist text proposal network, ECCV 2016.

Automatic Number Plate Recognition Region proposal Binarization Character analysis for early rejection Edge detection Deskewing Character segmentation OCR Post processing

ER based text detection Explore all connected regions at all thresholds efficiently using unionfind Text/notext classifier (cascade) looks at region

ER based text detection First stage of text/notext classifier cascade uses only incrementally computable features: area, perimeter, aspect ratio, euler number. Tuned for high recall Second stage uses an SVM

ER based text detection Detector is scale invariant ERs provide segmentation as well Can find tiny low contrast watermarks too! Difficult to accelerate on GPUs Sensitive to blur, occlusion

Bottomup CNN based text detector Slide over 24x24 windows Convolutional layers share compute with adjacent windows Attach 3 different heads after conv layers

Bottomup CNN based text detector The detector is fast Easy to accelerate on GPU High false positive rate

Endtoend CNN for text localization Generate training dataset from natural images Composite text with background after affine transformation and adding shadow Segment the image and have text respect segment boundaries

Endtoend CNN for text localization Deep CNN to directly predict word bounding boxes CR645x5, MP2, CR1285x5, MP2, CR1283x3, CR1283x3, MP2, CR2563x3, CR2563x3, MP2, CR5123x3, CR5123x3, CR5125x5, tensor 7x1 object presence and Network is fully convolutional (i.e. no FC layer that looks at the whole image for each prediction) Moderately fast (70 ms per image on a desktop GPU).

Recurrent neural network for text localization Generic object detector has trouble getting word boundaries right They don t take advantage of the fact that text usually appears in lines Look for text lines and fine vertical text pieces

Recurrent neural network for text localization Densely slide 3x3 window over VGG16 conv map Feed the row to a BLSTM Predict k text/notext scores, yaxis coordinates, and side refinements 15 / 22

Recap: Sliding window detector CNN based sliding window detector for characters is fast But false positive rate is high because there is too little context Lots of background patches look like characters!

Improving the sliding window detector Proposal: Look for bigrams (character pairs) rather than single characters Naive network has twice the number of inputs at the first FC layer 64x32x1 30x14x16 13x5x16 1x1x64 conv5x5pool2 conv5x5pool2 conv26x10 1x1x256 1x1x2

Improving the sliding window detector Proposal: Look for bigrams (character pairs) rather than single characters Have a layer that looks at the output of conv map having character features Shares more computation between adjacent windows

Improving the sliding window detector Bigram detector has ~25% more MAC ops/px than character detector False positive rate reduced from 28.4% to 20.4% at 90% precision on the ICDAR 2015 dataset

Number plate recognition using character detector Do false positives matter so much if a language model is available? A regex match can eliminate all false positive number plates in most images!

Encoderdecoder network Encoderdecoder recurrent networks have performed well for machine translation The entire source sentence is captured in the state of a recurrent network which then generates the translated sequence

Endtoend text recognition Encode an entire image into a vector that contains all the information about text in the image. 128x64x1 25x9x32 62x30x8 conv5x5pool2 225x32 conv3x3pool2 30x14x16 conv6x6

Future work Explore possibility of endtoend text detection Simultaneous detection and localization Incorporate language model (constructed from nonimage text corpus) Extend detection to videos

Acknowledgements Thanks to NVidia for providing the GPU used in this work

Conclusion Finding text in natural scenes is a useful tool to build AR apps Performance of sliding window detector can be improved by looking for bigrams rather than individual characters Careful choice of network architecture can reduce the overhead of detecting characters pairs Endtoend detection has the benefit of more context and can perform better than a multistage pipeline with separate cost per stage

Extra slides LSTM Illustration: http://colah.github.io/posts/201508understandinglstms/

Extra slides GRU Illustration: http://colah.github.io/posts/201508understandinglstms/