arxiv: v1 [cs.cv] 16 Nov 2018

Improving Rotated Text Detection with Rotation Region Proposal Networks

Jing Huang 1, Viswanath Sivakumar 1, Mher Mnatsakanyan 1,2 and Guan Pang 1
1 Facebook Inc.
2 University of California, Berkeley

November 20, 2018

arXiv:1811.07031v1 [cs.CV] 16 Nov 2018

Abstract

A significant number of images shared on social media platforms such as Facebook and Instagram contain text in various forms. It's increasingly becoming commonplace for bad actors to share misinformation, hate speech or other kinds of harmful content as text overlaid on images on such platforms. A scene-text understanding system should hence be able to handle text in various orientations that the adversary might use. Moreover, such a system can be incorporated into screen readers used to aid the visually impaired. In this work, we extend the scene-text extraction system at Facebook, Rosetta [1], to efficiently handle text in various orientations. Specifically, we incorporate the Rotation Region Proposal Networks (RRPN) [2] in our text extraction pipeline and offer practical suggestions for building and deploying a model for detecting and recognizing text in arbitrary orientations efficiently. Experimental results show a significant improvement on detecting rotated text.

1 Introduction

Understanding text in images shared on platforms such as Facebook and Instagram, along with the context in which it appears, makes it possible to proactively identify inappropriate or harmful content and keep our community safe. While over the years we had gotten good at handling policy-violating text composed in posts and captions, we were exposed to hate speech, clickbait, policy-violating ads, and other low-quality content that manifested as part of an image. Motivated by this problem, Rosetta [1] was built to extract overlaid and scene text from images and video frames and identify policy-violating content.
Rosetta extracts text from more than a billion public Facebook and Instagram images and video frames (in a wide variety of languages), daily and in real time, and feeds the output to upstream classifiers to understand the context of the text and the image together. Rosetta employs a two-step approach. The text detection model, based on Faster-RCNN [3] with the ResNet convolutional body replaced with a ShuffleNet-based [4] architecture, is responsible for detecting regions of the image that contain words. Each detected region is cropped and fed to a fully-convolutional character-based text recognition model to extract the word.

Widely used object detection models such as Faster-RCNN are designed for generic cases and thus output rectangular bounding boxes. For scene-text extraction, this is insufficient, as text may come in arbitrary orientations while the second-stage text recognition model typically expects an image patch with horizontal text. According to an error analysis on Rosetta soon after it was deployed, oriented text was found to be the most common source of mistakes, with rotations as small as 20° failing to be recognized correctly. Therefore, it's important to enable the system to correctly handle rotated text amongst all the adversarial cases we might face.

In traditional Faster-RCNN detection, while most words are detected with a bounding box, we cannot correctly infer the actual words due to the lack of textual orientation information (Figure 1). One solution is to apply a trained Spatial Transformer Network on the rectangular patch, which benefits from being a standalone module. Another approach is to predict an oriented bounding box in the detection stage, which benefits from being trainable end-to-end. The two approaches can be combined if needed. We follow the idea of [2] and replace the Region Proposal Network (RPN) in our Faster-RCNN-based detection pipeline with RRPN.
There are a few differences in our implementation compared to the original RRPN:

RRoI transformation: We use Rotated Region of Interest Align (RRoI-Align), which applies bilinear interpolation, instead of Rotated Region of Interest Pooling (RRoI-Pooling) as the RoI transformation, to avoid misaligned results on the boundaries due to rounding.

Boundary-breaking anchors: The original RRPN filters out boundary-breaking R-anchors during training, and had to use a border padding of 0.25 times each side to preserve more positive proposals. Our approach automatically performs on-demand padding in the RRoI-Align operator as well as in the bounding box transformation between the detection and recognition pipelines.

Orientation coverage: Due to trade-offs between orientation coverage and computational efficiency, the original RRPN crops the angle range to [-45, 135) degrees. However, this is not a symmetrical range, as text rotated by (-90, -45) degrees would be treated as text rotated by (90, 135) degrees. Therefore, we use a more natural orientation coverage of (-90, 90) degrees, so that any text that is not rotated by more than 90 degrees can be correctly identified. The RPN anchors are chosen accordingly.

Figure 1: Comparison of text extraction between non-rotated model and rotated models.

2 Rotational Region Proposal Network

2.1 Rotated box Representation

Traditionally, a non-rotated bounding box is represented as (x, y, w, h), where (x, y) is the top-left corner of the box, or as (x1, y1, x2, y2), where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner of the box. RRPN [2] uses (xc, yc, h, w, θ) as the representation, where (xc, yc) represents the geometric center of the bounding box. The orientation θ is the angle from the positive direction of the x-axis to the direction parallel to the width of the rotated box. We follow a similar representation.

2.2 Rotated Anchors

To fit objects of different sizes, RPN uses two parameters to control the size and shape of anchors, i.e., scale and aspect ratio.
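As a concrete reading of the (xc, yc, h, w, θ) representation from Section 2.1, the sketch below converts a rotated box into its four corner points. It is an illustration only: the function name, the corner ordering, and the assumption that θ is given in degrees in image coordinates (x to the right, y downward) are ours and are not taken from the Rosetta or RRPN code.

```python
import math


def rotated_box_corners(xc, yc, h, w, theta_deg):
    """Return the four corners of a rotated box (xc, yc, h, w, theta).

    theta_deg is the angle, in degrees, from the positive x-axis to the
    direction parallel to the box width (assumed convention, see lead-in).
    """
    t = math.radians(theta_deg)
    # Unit vectors along the width and height directions of the box.
    ux, uy = math.cos(t), math.sin(t)
    vx, vy = -math.sin(t), math.cos(t)
    corners = []
    # Walk around the box: each corner is the center plus/minus half the
    # width along u and plus/minus half the height along v.
    for sw, sh in ((-1, -1), (1, -1), (1, 1), (-1, 1)):
        dx = sw * w / 2.0
        dy = sh * h / 2.0
        corners.append((xc + dx * ux + dy * vx, yc + dx * uy + dy * vy))
    return corners


# Usage: a 30 x 10 box centered at (50, 40), unrotated and rotated by 90 degrees.
for angle in (0, 90):
    pts = rotated_box_corners(50, 40, h=10, w=30, theta_deg=angle)
    print(angle, [(round(x, 3), round(y, 3)) for x, y in pts])
```

With θ = 0 the corners reduce to the usual (x1, y1, x2, y2) rectangle, which is a quick sanity check on the convention chosen here.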
For rotated text detection, anchor generation should include orientation as well. The original RRPN uses (8, 16, 32) as scale anchors (essentially equivalent to (128, 256, 512) in our pipeline with a scaling multiplier of 16), (0.125, 0.2, 0.5) as aspect ratio anchors, and (-30, 0, 30, 60, 90, 120) degrees as angle anchors. Other than the trade-off between orientation coverage and computational efficiency that leads to using half of the angle space (180 degrees), the RRPN paper didn't give an explanation for the specific cropping range of (-45, 135) degrees, which means boxes in the range of (-90, -45) degrees would be treated as text rotated by (90, 135) degrees. From a theoretical and empirical perspective, we believe this is because the convergence of angles would become unstable around the cropping angles, so the authors tried to place the boundary away from right angles.

On the other hand, in our latest experiments we use the same scales and aspect ratios as our non-rotated model (5 scale anchors (32, 64, 128, 256, 512) and 7 aspect ratio anchors (0.03125, 0.0625, 0.125, 0.25, 0.5, 1, 2)), while the angle anchors are between -90 and +90 degrees: (-90, -60, -30, 0, 30, 60, 90). While the range of (-90, 90) degrees is used for now, our pipeline also supports detecting arbitrary angles (covering all 360 degrees) with a config change. The extra dimension for angle anchors multiplies the number of anchors by 7 and thus increases the memory usage during training by a non-negligible amount, which we handle by using in-place operators.

2.3 Rotated RoI Transformation

Given the region proposals, we'd like to obtain a fixed-size feature map (e.g. 7 × 7). Operators like RoI Pooling allow us to achieve this goal. In the RRPN paper [2], the authors extended RoI Pooling to RRoI Pooling by applying a rotation transformation. On the other hand, quantization in RoI and RRoI pooling creates misaligned results on the boundaries. To handle this, we implemented a Rotated RoI Align operator that applies bilinear interpolation.

3 Proximity Estimation for Rotated Box

Assuming we have two proposals, how should we evaluate which one is closer to the ground truth? This is the key question for proposal selection as well as non-maximum suppression (NMS). For horizontal boxes, it's sufficient to use the IoU (intersection over union) between two boxes. However, IoU alone is not enough to estimate the proximity of rotated boxes. Given a horizontal box, we can have up to 4 ways of representing it: a box rotated by 0, 90, 180 or 270 degrees. They occupy exactly the same pixels of the image, but they mean different things in OCR when processed by the text-recognition model. For example, the letter V might be recognized as < if the predicted box has a 90° angle, or > if the box has a -90° angle. Therefore, the angle difference between the rotated boxes needs to be taken into account.
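To make the role of the angle concrete, the following sketch labels a single R-anchor from a precomputed IoU and the two box angles, using the thresholds from [2] that this section adopts (IoU 0.7 and 0.3, angle 30 degrees). The function names, the `is_best_match` flag standing in for the "highest IoU overlap" condition, and the choice to leave an angle difference of exactly 30 degrees unlabeled are assumptions made for illustration; the rotated IoU itself is assumed to be computed elsewhere.

```python
def angle_diff(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def label_anchor(iou, anchor_angle, gt_angle, is_best_match=False,
                 pos_iou=0.7, neg_iou=0.3, max_angle=30.0):
    """Return 'positive', 'negative' or 'ignored' for one R-anchor."""
    d = angle_diff(anchor_angle, gt_angle)
    # Positive: good overlap (or best available overlap) AND a close angle.
    if (is_best_match or iou > pos_iou) and d < max_angle:
        return "positive"
    # Negative: poor overlap, or good overlap with a badly wrong angle.
    if iou < neg_iou or (iou > pos_iou and d > max_angle):
        return "negative"
    # Everything else is not used during training.
    return "ignored"


# A box at 0 degrees and the same pixels described as a box at 90 degrees
# have IoU 1.0, yet the anchor is still rejected because of the angle.
print(label_anchor(1.0, anchor_angle=90, gt_angle=0))   # negative
print(label_anchor(0.8, anchor_angle=10, gt_angle=0))   # positive
print(label_anchor(0.5, anchor_angle=0, gt_angle=0))    # ignored
```

The first call is the case described above: identical pixels, so IoU alone would accept the anchor, while the angle check rejects it.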
We apply the same strategy as [2]: positive R-anchors are those with the highest IoU overlap or an IoU larger than 0.7 with respect to the ground truth, and an intersection angle with respect to the ground truth of less than 30 degrees. Negative R-anchors are those with an IoU lower than 0.3, or an IoU larger than 0.7 but with an intersection angle with the ground truth larger than 30 degrees. Regions that are neither positive nor negative are not used during training.

3.1 Training

While the key concept is not complicated, the change to support the 5-dimensional bounding box representation without breaking the current pipeline turned out to be pretty invasive. Besides the core changes mentioned above, various efforts were made to support training with mixed non-rotated and rotated datasets (to bias the model towards the predominant scenario of non-rotated text without sacrificing too much accuracy), on-the-fly rotated data augmentation, and evaluation of rotated boxes.

3.2 Inference

For efficient inference, we perform int8 quantization on the model to reduce runtime and memory-bandwidth usage. We also implemented efficient CPU versions of the Caffe2 operators that handle rotated boxes, including RotatedRoIAlign, BBoxTransform, Non-Maximum Suppression, GenerateProposals and BoxWithNMSLimit, leading to no increase in inference time compared to the baseline non-rotated model. These Caffe2 operators have been open-sourced at https://github.com/pytorch/pytorch/blob/master/caffe2/operators.

Finally, we need to perform a rotated box transformation in order to generate the image patch based on the predicted rotated box parameters and feed it to the recognition pipeline. Traditional Faster-RCNN clips any boxes that might overflow image boundaries, but doing so for rotated boxes isn't straightforward and would cut off some characters, since the detection output is still a rectangle as opposed to a polygon.
To handle this, we pad the original image until the boxes don't overflow the boundaries anymore, in the following steps: First, we use the horizontal bounding rectangle of the predicted rotated box to crop out the region from the original image. Then, we add zero-padding so that the center of the image patch coincides with the center of the predicted rotated box. Finally, we use the warp-affine transformation in OpenCV to simultaneously rotate the patch and crop out the region of interest with the correct width and height from the prediction.
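The geometry of these steps can be sketched without any dependencies. The code below is not the OpenCV-based implementation described above: it folds cropping, zero-padding and rotation into a single inverse mapping with nearest-neighbor sampling, and fills samples that fall outside the image with 0. The function name, the list-of-rows image format, and the angle convention (degrees, image coordinates with y downward) are assumptions made for illustration.

```python
import math


def crop_rotated_box(image, xc, yc, h, w, theta_deg):
    """Extract an upright h x w patch for the rotated box (xc, yc, h, w, theta).

    image is a list of rows of grayscale values. Samples that fall outside
    the image are set to 0, which plays the role of on-demand zero-padding.
    """
    rows, cols = len(image), len(image[0])
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    patch = []
    for v in range(h):
        dy = v - (h - 1) / 2.0          # offset from patch center, height axis
        out_row = []
        for u in range(w):
            dx = u - (w - 1) / 2.0      # offset from patch center, width axis
            # Inverse warp: rotate the offset by theta and shift to the box center.
            x = xc + dx * cos_t - dy * sin_t
            y = yc + dx * sin_t + dy * cos_t
            c = math.floor(x + 0.5)     # nearest-neighbor sampling
            r = math.floor(y + 0.5)
            inside = 0 <= r < rows and 0 <= c < cols
            out_row.append(image[r][c] if inside else 0)
        patch.append(out_row)
    return patch


# Usage on a 5 x 5 test image whose pixel at (row r, col c) is 10*r + c + 1.
image = [[10 * r + c + 1 for c in range(5)] for r in range(5)]
print(crop_rotated_box(image, 2, 2, h=1, w=3, theta_deg=0))
print(crop_rotated_box(image, 2, 2, h=1, w=3, theta_deg=90))
print(crop_rotated_box(image, 0, 0, h=1, w=3, theta_deg=0))  # overflows the left edge
```

Because out-of-image samples are simply zero-filled, a box that overflows the boundary keeps its full width and height instead of being clipped, which is the property the padding steps are meant to guarantee.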

Figure 2: Bounding box transformation on predicted rotated boxes between detection and recognition.

4 Experiments

Figure 3: Quantitative evaluation.

4.1 Qualitative Results

Using the RRPN approach, we can correctly predict the rotated bounding box for the example in the beginning (the image and the text detection and recognition result on the right of Figure 1).

4.2 Quantitative Evaluation

We performed end-to-end evaluation on two datasets, Rotated (with uniform random rotation in [-90, 90] degrees) and Non-rotated, on five models: (1) baseline model without RRPN; (2) RRPN (Ratio = 3) trained on a 3:1 rotation-augmented multilingual dataset; (3) RRPN (Ratio = 3) model after int8 quantization; (4) RRPN (Ratio = 5) trained on a 5:1 rotation-augmented multilingual dataset; (5) RRPN (Ratio = 5) model after int8 quantization. All of them are trained for around 675k iterations. Results are summarized in Figure 3.

The RRPN models perform significantly better on the Rotated dataset, while slightly worse on the Non-rotated dataset. It would be more interesting to train for as many iterations as possible (for example, RRPN with ratio = 3 should be trained for > 2M iterations for it to see as many non-rotated examples as the baseline non-rotated model) and evaluate the results. If the result shows a significant trade-off between the models, it might be evidence of hitting the model capacity. We also compared the latency of the int8-quantized rotated model with the baseline non-rotated model during inference and found no significant increase, which means we can process images at a similar throughput as before.

5 Conclusions

We adapted the Rotation Region Proposal Network for detecting oriented text, and made important improvements based on it.
This further improves the quality of Rosetta, the OCR system at Facebook, in protecting against adversarial text in images, including engagement bait, policy-violating ads and images with profanity and hate speech, and improves the accuracy of screen readers to make Facebook more accessible for the visually impaired. In the future, this framework could be extended to more adversarial scenarios such as text with general affine transformations.

Acknowledgement

The authors would like to thank Albert Gordo, Peizhao Zhang, Manohar Paluri, Tyler Matthews, Mahalia Miller and others who contributed, supported and collaborated with us during the development and deployment of our system.

References

[1] F. Borisyuk, A. Gordo, and V. Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71-79. ACM, 2018.
[2] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia, 2018.
[3] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[4] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.