TEXT SEGMENTATION ON PHOTOREALISTIC IMAGES

Valery Grishkin (a), Alexander Ebral (b), Nikolai Stepenko (c), Jean Sene (d)

Saint Petersburg State University, 7-9 Universitetskaya nab., Saint Petersburg, 199034, Russia

E-mail: (a) valery-grishkin@yandex.ru, (b) aleksandr.ebr@gmail.com, (c) n.stepenko@spbu.ru, (d) senejeanvalery@yahoo.fr

The paper proposes an algorithm for segmenting text that is applied to, or present in, photorealistic images with a complex background. The algorithm determines the exact location of the image regions containing text. It implements a method for semantic segmentation of images in which the text symbols serve as the detectable objects. The original images are preprocessed and fed to the input of a pre-trained convolutional neural network. The paper proposes a network architecture for text segmentation, describes the procedure for building the training set, and presents an image preprocessing algorithm that reduces the amount of processed data and simplifies the segmentation of the "background" object. The network architecture is a modification of the well-known ResNet network that takes into account the specifics of text character images. The convolutional neural network is implemented with the CUDA parallel computing technology on the GPU. Experimental evaluation of text segmentation quality with the IoU (Intersection over Union) criterion confirms the effectiveness of the proposed method.

Keywords: text segmentation, semantic segmentation, convolutional neural network

© 2018 Valery Grishkin, Alexander Ebral, Nikolai Stepenko, Jean Sene

1. Introduction

Photorealistic images containing text make up a large part of the multimedia content on the Internet: real pictures with text applied to them later, video frames with titles, pictures with inscriptions, signs, advertisements, etc. For high-quality text recognition in such images, the text must be separated from a rather complex background, in other words, the image must be segmented. Two approaches to segmentation are possible. The first is based on finding the image areas containing text and determining, for these areas, the parameters of the minimum bounding rectangles. Within this approach, classifiers using heuristic features are constructed [1-3]; commonly, classifiers such as SVM and Random Forest are used. The second approach is based on semantic segmentation methods that localize the regions containing the objects present in the image. These methods use classifiers based on convolutional neural networks [4-6]. In this paper, we adopt the second approach for text segmentation, since it achieves high-quality segmentation of various objects on a complex background.

Semantic segmentation methods are good at localizing objects that form closed and sufficiently large areas of the image. Text characters, however, mostly consist of relatively thin lines and occupy a small area. It is therefore not possible to apply neural networks designed for segmenting large objects directly to text segmentation. It is, however, possible to modify the architecture of well-proven deep convolutional neural networks to work with images containing text characters.

2. Convolutional neural network for text segmentation

The quality of object segmentation is estimated by the IoU (Intersection over Union) criterion. Currently, the DeepLabv3 neural network [7] shows the best segmentation results: its IoU reaches 77% on the PASCAL VOC 2012 dataset [8]. We therefore chose the architecture of this network as the basis for developing a new network for symbol segmentation.

2.1. Network topology

Figure 1 shows the proposed topology of the network for text segmentation. It is similar to the topology of the DeepLabv3 network: it contains sequential convolutional layers with average pooling, and the network output unit performs atrous spatial pyramid pooling (ASPP).

Figure 1. Topology of the network for text segmentation

The ASPP block uses several parallel dilated convolutions with different spatial steps (atrous rates) between the convolution weights. These dilated convolutions increase the receptive field without reducing the spatial resolution.
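To make the receptive-field effect concrete, here is a minimal illustration in TensorFlow/Keras (the paper reports a TensorFlow implementation; the filter count and channel depth are arbitrary):

```python
import tensorflow as tf

# A dilated (atrous) 3x3 convolution. With dilation rate r, a k x k kernel
# covers k + (k-1)(r-1) input pixels along each axis, so the receptive
# field grows without extra parameters or loss of spatial resolution.
inputs = tf.keras.Input(shape=(None, None, 64))
outputs = tf.keras.layers.Conv2D(
    filters=64, kernel_size=3, dilation_rate=6, padding="same")(inputs)

# For k = 3 and r = 6 the kernel spans 3 + 2 * 5 = 13 pixels.
```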

The ASPP block also computes image-level features. These features are obtained by applying global average pooling to the output of the convolutional block; the pooled result is then processed by a 1x1 convolution. The output features of all five branches are combined and passed through a 1x1 convolution. The resulting image at the output of the ASPP block is the segmentation mask.

The convolutions of the original DeepLabv3 network are configured so that the output stride, i.e. the ratio of the input image resolution to the output image resolution, equals 16. This setting is not suitable for text segmentation, because the geometric structure of text symbols is not preserved. Considering this, as well as the fact that text symbols have a simpler geometric structure than the objects of the PASCAL VOC 2012 dataset, we reduce the output stride to 4.

Each convolutional block of the original DeepLabv3 network contains two paths for the image: a short one, and a long one through several consecutive convolutions. This prevents the loss of data zeroed out by the convolutions, a property useful for text, since losing part of the fine lines that make up the characters can affect the overall segmentation quality. Usually, several blocks of the structure shown in Figure 2 are used for segmentation of complex objects of many classes. To keep the output stride at 4, we use only one block of this type. The proposed block contains two consecutive convolutions, each followed by the ReLU activation function, and then average pooling with a 4x4 window and a stride of two.

Figure 2. Convolutional block of the original DeepLabv3 network

The ASPP block is similar to the corresponding DeepLabv3 block. It consists of four parallel convolutions (one with a 1x1 kernel and three dilated convolutions with 3x3 kernels) and an image-level feature branch. The difference is that the spatial steps between the convolution weights of the dilated convolutions are reduced: we chose atrous rates of 2, 4, and 6, whereas in the original network the values are 6, 12, and 18. We apply this threefold reduction in rate because text consists of relatively thin lines. Note that a convolution with a 3x3 kernel and the maximum rate of 6 covers a line width of 13 pixels, which seems sufficient for text segmentation. As in the original network, we combine the resulting feature maps and pass them through a 1x1 convolution.
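As a sketch of how such a block could look in TensorFlow/Keras: the structure (five branches, rates 2, 4 and 6, image-level features, 1x1 fusion) follows the description above, while the filter count of 256 and all layer settings not stated in the paper are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(inputs, size, filters=256, rates=(2, 4, 6)):
    """ASPP block with the reduced atrous rates 2, 4 and 6.

    `size` is the spatial size of `inputs`, needed to stretch the
    image-level branch back to the feature-map resolution.
    """
    # Branch 1: a plain 1x1 convolution.
    branches = [layers.Conv2D(filters, 1, activation="relu")(inputs)]
    # Branches 2-4: 3x3 dilated convolutions with the reduced rates.
    for rate in rates:
        branches.append(layers.Conv2D(filters, 3, padding="same",
                                      dilation_rate=rate,
                                      activation="relu")(inputs))
    # Branch 5: image-level features (global average pooling, a 1x1
    # convolution, upsampling back to the feature-map size).
    pooled = layers.GlobalAveragePooling2D(keepdims=True)(inputs)
    pooled = layers.Conv2D(filters, 1, activation="relu")(pooled)
    branches.append(
        layers.UpSampling2D(size=size, interpolation="bilinear")(pooled))
    # Concatenate all five branches and fuse them with a 1x1 convolution.
    fused = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, activation="relu")(fused)

# Example: a feature map of static size 128x128 with 64 channels.
features = tf.keras.Input(shape=(128, 128, 64))
mask_logits = aspp(features, size=(128, 128))
```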
2.2. Dataset

Training the proposed network requires an annotated set of photorealistic images containing text. Currently, the only known public dataset suitable for this subject area is Chars74K [9]; however, the quantity and quality of its images, as well as of its text annotations, are quite low. Therefore, we created our own dataset to solve the problem. The set is generated from a collection of a sufficiently large number of existing fonts, which covers the various ways of rendering the characters of an alphabet. During generation, we form text and superimpose it, with a certain level of transparency, on a photorealistic image. The text can be of any size, color, and angle of inclination, be rendered in one or several fonts, and be located anywhere in the image. For text generation, we use lowercase letters of the Russian alphabet, digits, and common punctuation marks ('.', ',', ';', '-', ':', '!'), 49 characters overall.

The algorithm randomly selects characters from this alphabet and also picks Russian nouns from a dictionary. It then applies the text to the image using one of 30 free TrueType fonts; the font type, size, and color are all selected randomly. As background images, 5,000 photorealistic images that contain no text are used. The text is applied to the original image as follows: we divide the image into four blocks and render the pre-formed text into each block; the position and orientation of the text within a block are again selected randomly.

When applying text to a background image, two masks are also formed that annotate the dataset being created. The first mask is a binary image of white text on a black background; the second is a grayscale image of the same text, in which the brightness of a symbol encodes its position in the alphabet. We use the first mask for binary segmentation of text against background, and the second for multi-class segmentation of character classes against background.
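A simplified sketch of this generation step, assuming Pillow. The alphabet string (taken here as the 33 lowercase Russian letters, 10 digits, and the 6 listed punctuation marks, 49 characters in total), the font file, and all layout values are illustrative; rotation and dictionary-based word selection are omitted.

```python
import random
from PIL import Image, ImageDraw, ImageFont

# Assumed 49-character alphabet. In the class mask a symbol is drawn with
# brightness equal to its alphabet index + 1 (0 is the background).
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя0123456789.,;-:!"

def render_text(image, binary_mask, class_mask, text, font, origin, color, alpha):
    """Draw one text string on an RGBA image and update both masks.

    Characters are drawn one at a time so that each symbol can be written
    into the class mask with its own brightness value.
    """
    overlay = Image.new("RGBA", image.size, (0, 0, 0, 0))
    draw_img = ImageDraw.Draw(overlay)
    draw_bin = ImageDraw.Draw(binary_mask)
    draw_cls = ImageDraw.Draw(class_mask)
    x, y = origin
    for ch in text:
        draw_img.text((x, y), ch, font=font, fill=color + (alpha,))
        draw_bin.text((x, y), ch, font=font, fill=255)
        draw_cls.text((x, y), ch, font=font, fill=ALPHABET.index(ch) + 1)
        x += draw_img.textlength(ch, font=font)  # advance by glyph width
    image.alpha_composite(overlay)               # semi-transparent overlay

background = Image.open("background.jpg").convert("RGBA")
binary_mask = Image.new("L", background.size, 0)  # text vs. background
class_mask = Image.new("L", background.size, 0)   # per-character classes
font = ImageFont.truetype("font.ttf", size=random.randint(16, 48))
render_text(background, binary_mask, class_mask, "пример", font,
            origin=(50, 80), color=(255, 0, 0), alpha=200)
```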

2.3. Preprocessing

Before being fed to the input of the neural network, the original color image is converted to grayscale. Then the Canny operator is applied to this grayscale image to find the boundaries of brightness changes. Since the text differs in brightness from the background, this highlights the borders of the characters; at the same time, all other sufficiently sharp brightness changes in the grayscale image are highlighted as well. The resulting binary image of all boundaries is processed with the morphological dilation operation, and the resulting binary mask is then multiplied element-wise by the grayscale image. This processing filters out most of the background, leaving only the brightness information near the borders, which simplifies the task of separating the background borders from the text borders. Figure 3 shows the structure of the preprocessing algorithm (a code sketch of this pipeline is given after Section 3).

Figure 3. The preprocessing algorithm

The proposed preprocessing simplifies the segmentation of the background class, which is usually diverse and therefore requires a large computational cost for network training. Segmenting a background consisting of zero-brightness pixels requires much less computational effort.

3. Experimental results

The proposed neural network is implemented in Python using the TensorFlow machine learning library for the GPU platform. The software for generating the dataset is also implemented in Python. We trained our network model on 4000 training images and evaluated it on 1000 verification images from the generated dataset. The segmentation quality is evaluated by the IoU metric. Figure 4 shows the results of the background vs. text segmentation, and Figure 5 the results of the background vs. symbols segmentation. All of these graphs show how the IoU metric changes during network training.

Figure 4. Background vs. text segmentation. IoU during training: a) average value of the metric; b) metric value for the text class; c) metric value for the background class.

Figure 5. Background vs. symbols segmentation. Average IoU during training

Note that in the case of background vs. symbols segmentation the accuracy falls significantly. However, the two approaches can be combined, so that each refines the results of the other.
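The code sketch of the preprocessing pipeline promised in Section 2.3, assuming OpenCV and NumPy; the Canny thresholds and the structuring-element size are not specified in the paper and are chosen here for illustration:

```python
import cv2
import numpy as np

def preprocess(image_bgr, low=100, high=200, kernel_size=5):
    """Grayscale -> Canny -> morphological dilation -> element-wise product.

    Returns a grayscale image that keeps brightness information only near
    detected boundaries; everything else becomes zero-brightness background.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)              # binary boundary map
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(edges, kernel)             # thicken the borders
    return gray * (dilated > 0).astype(np.uint8)    # mask the background
```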

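For completeness, a sketch of a per-class IoU computation over integer label maps, as could be used for the evaluation above; the label encoding (class 0 as background) is an assumption:

```python
import numpy as np

def iou_per_class(pred, target, num_classes):
    """Per-class Intersection over Union for integer label maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union if union else float("nan"))
    return ious  # the average of these values is the reported metric
```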
4. Conclusion

We propose an algorithm for segmenting text regions in photorealistic images. It consists of a preprocessing step, a recognition step, and a localization step. The second step uses the modified convolutional network DeepLabv3 for recognition; unlike the original network, the modified network preserves the geometric structure of text characters in the feature maps. The third step determines the exact localization of the recognized text characters and finds the areas of the image containing text. Experimental results show the effectiveness of the proposed algorithm: the segmentation quality evaluated by the IoU metric reaches 78%, which is sufficient for further processing of the image text by OCR systems. The use of parallel processing technology significantly reduces the processing time for large series of images.

5. Acknowledgement

The authors acknowledge Saint-Petersburg State University for research grant 11756691.

References

[1] Zhang Jing, Dong Wei, Zhang Youhui. An Algorithm for Scanned Document Image Segmentation Based on Voronoi Diagram // Computer Science and Electronics Engineering (ICCSEE), 2012 International Conference on (Volume 1), 2012, pp. 156-159.

[2] H. S. Baird, M. A. Moll, Chang An. Document Image Content Inventories // Proc. of SPIE/IS&T Document Recognition & Retrieval, 2007.

[3] Grishkin V. Document Image Segmentation Based on Wavelet Features // CSIT 2015, 10th International Conference on Computer Science and Information Technologies, 2015, pp. 82-84. DOI: 10.1109/CSITechnol.2015.7358255

[4] S. Chandra and I. Kokkinos. Fast, Exact and Multi-Scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs. Available at: https://arxiv.org/abs/1603.08368 (accessed 11.05.2018)

[5] L.-C. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille. Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform. In CVPR, 2016.

[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. In ICLR, 2015.

[7] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. Available at: https://arxiv.org/abs/1706.05587 (accessed 18.04.2018)

[8] Pascal VOC dataset mirror. Available at: https://pjreddie.com/projects/pascal-voc-dataset-mirror/ (accessed 19.04.2018)

[9] The Chars74K dataset. Available at: http://www.ee.surrey.ac.uk/cvssp/demos/chars74k/ (accessed 20.03.2018)