An Efficient Method for Text Detection from Indoor Panorama Images Using Extremal Regions


Proceedings of the 2015 IEEE International Conference on Information and Automation, Lijiang, China, August 2015

Yuan Liu, Kao Zhang, Jian Yao, Tong He, Yahui Liu, and Jinge Tu
School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, Hubei, P.R. China

Abstract: Text detection in complex real images, such as panorama images, remains very challenging in computer vision. General methods often focus on small test images with a simple background, which makes detection and recognition easier. In this paper, we propose a novel approach that automatically processes indoor panorama images, which suffer from distortion and illumination problems, to extract multi-scale trademarks. Our method fuses edge information, color probability detection and geometric characteristics to separate text from non-text regions, and exploits Extremal Regions (ERs), which are robust to blur, illumination, color and texture variation, to handle low-contrast text and localize it accurately. The effectiveness of the algorithm is discussed in the experimental results section, where its performance is compared for different numbers of features used.

Index Terms: Text detection, extremal regions, indoor panorama images, multi-scale trademark.

I. INTRODUCTION

Text detection in real-world scene images is a hot topic that has received significant attention in recent decades, since it is a critical step for a number of computer vision applications, such as photo-based translation, content-based web image search, and extracting business information from panorama maps (e.g. Google Street View). Unlike traditional optical character recognition (OCR) systems [1], [2], [3], text detection in real scenes has to cope with complex backgrounds, multiple scales, and various fonts, colors and orientations.
As a result, no method so far detects text in natural scenes both efficiently and accurately. In particular, there is no method that operates directly on panorama images, which would save considerable time when processing large, high-resolution images. Text localization is generally a computationally very expensive task, because any of the $2^N$ subsets of pixels can correspond to text (where $N$ is the number of pixels).

General methods for natural scene text detection can be roughly divided into three categories according to how they address this issue. The first category uses a sliding-window approach to localize individual characters or whole words [4], [5], [6]. Sliding-window methods, also called region-based methods, search for possible text regions in a subset of image rectangles. Because the images must be searched at multiple scales, these methods tend to be slow.

The second category uses classifiers such as Convolutional Neural Networks (CNNs) [7], Support Vector Machines (SVMs) [8], boosting algorithms or Artificial Neural Networks (ANNs) to separate text regions from the rest. All of these classifiers need a complete sample database with enough positive and negative samples (typically more than five thousand training images) to guarantee reliable classification.

The third category is based on connected components [9], [10], [11]. These methods use color, edge density, texture or other information to extract single characters or words. Two state-of-the-art methods belong to this category: the Stroke Width Transform (SWT) [12], [13], [14] and Maximally Stable Extremal Regions (MSERs) [15], [16], [17]. SWT was first put forward by Microsoft Corporation. It uses an edge map to find the boundaries of strokes, searches from each edge pixel along the gradient direction, and then groups strokes of the same width to identify text regions.
However, SWT places strict requirements on the text background and fonts: when the background is complex or the text is not written in a uniform font, the method becomes unstable. MSERs have achieved great success in scene text detection and are stable under affine transformations, gray-level changes and so on. However, their low-level pixel operations inherently limit their ability to handle complex text information efficiently, which makes text hard to distinguish. Large images bring further problems for these methods when detecting text directly and automatically: image distortion and self-illumination both lead to incorrect edge detection and make text extraction very challenging.

In this paper, we propose a new text detection method that combines edge density, gray gradient, geometrical information and extremal regions, aiming to detect text in indoor panorama images directly and automatically. The experimental results show that the method is suitable for images at multiple scales, and the precision can reach 57%. Before detecting the text regions precisely, we eliminate the interference of illumination and background, which makes it possible to improve the precision significantly.

II. ALGORITHM

This section describes the text detection method in detail. The major parts of the algorithm are image preprocessing, candidate region extraction, accurate localization and image post-processing.

A. Image Acquisition

Fig. 1 is one of our test images. It was taken with a Nikon D7100 at four direction angles (0°, 90°, 180° and 270°) and then stitched with Panorama Tools into a single panorama image. Fig. 1 shows that most of the image is useless, so those parts can be cut away in preprocessing.

B. Preprocessing

Fig. 2 shows the overview of our method.

Fig. 2. Overview of text detection method.

The first step in the preprocessing stage is to enhance the image. We propose an illumination compensation algorithm based on the gamma curve, which gives good results both in compensating illumination and in preserving color constancy. The enhancement consists of three steps.

First, compute the histogram of each channel and accumulate it: each bin of the new histogram is the sum of the bins of the old histogram up to that luminance value. The accumulated histograms are denoted $H^{sum}_R$, $H^{sum}_G$ and $H^{sum}_B$. In (1), $i$ ranges over the values from 0 to 255 in each channel.

$$H^{sum}_i = H^{sum}_{i-1} + H_i \qquad (1)$$

Second, determine the one-percent cut-off values $R_{min}$, $G_{min}$ and $B_{min}$ of the three channels and take the smallest of them. Likewise, determine the one-percent cut-off values $R_{max}$, $G_{max}$ and $B_{max}$ at the opposite end of the histograms and take the largest of them.

$$\begin{cases} C_{min} = \min\{R_{min}, G_{min}, B_{min}\} \\ C_{max} = \max\{R_{max}, G_{max}, B_{max}\} \end{cases} \qquad (2)$$

Finally, use the gamma curve to adjust the illumination, and compute the values $C^{g}_{min}$ and $C^{g}_{max}$ that the gamma curve takes at $C_{min}$ and $C_{max}$. A pixel value $i$ that is less than $C_{min}$ is set to 0, and a value greater than $C_{max}$ is set to 255. The remaining pixel values are processed by a selective linear stretch, which enhances the contrast of the brightness range of interest. In (3) and (4), $i$ denotes the brightness, ranging from 0 to 255.
$H^{dst}_i$ denotes the value of the output histogram at brightness $i$.

$$H^{\gamma}_i = 255 \left( \frac{i}{255} \right)^{1/\gamma} \qquad (3)$$

$$H^{dst}_i = \begin{cases} 0, & i \le C_{min} \\ \dfrac{H^{\gamma}_i - C^{g}_{min}}{C^{g}_{max} - C^{g}_{min}} \times 255, & C_{min} < i < C_{max} \\ 255, & i \ge C_{max} \end{cases} \qquad (4)$$

After enhancement, the image passes through a set of filters for further processing. Because of the structure of panorama camera sensors and the complex background, random noise seriously degrades image quality, so a set of filters is used to remove it. First, the input image is smoothed by a Gaussian blur, which scans the whole image with a template: after convolution with the template, each pixel is replaced by the weighted average of its neighboring pixel values. Gaussian blur is fast but blurs edges, so a median filter is also used. It is a nonlinear smoothing algorithm that preserves edge information. To enhance image details, the guided filter [18] can be applied as well, but we did not use it in this paper because too much detail interferes with edge extraction.

C. Candidate Region Extraction

In this part, we use gray-image edge detection, edge density detection and silhouette detection to retain the text candidate regions.
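The illumination compensation of the preprocessing stage, equations (1) to (4), can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function names, the gamma value of 2.2 and the exact reading of the one-percent cut-off are assumptions, since the paper does not specify them. Each channel is passed as a flat, non-empty list of 8-bit values.

```python
def cumulative_histogram(values):
    """Equation (1): H_sum[i] = H_sum[i-1] + H[i] for 8-bit values."""
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    cumulative, running = [], 0
    for count in hist:
        running += count
        cumulative.append(running)
    return cumulative


def tail_levels(values, fraction=0.01):
    """Gray levels cutting off `fraction` of the pixels at each end (assumed reading)."""
    cumulative = cumulative_histogram(values)
    total = cumulative[-1]
    low = next(i for i, c in enumerate(cumulative) if c > fraction * total)
    high = next(i for i, c in enumerate(cumulative) if c >= (1 - fraction) * total)
    return low, high


def gamma_curve(i, gamma):
    """Equation (3): 255 * (i / 255) ** (1 / gamma)."""
    return 255.0 * (i / 255.0) ** (1.0 / gamma)


def build_lookup_table(channels, gamma=2.2, fraction=0.01):
    """Equations (2) and (4): clip outside [C_min, C_max], stretch the gamma curve inside."""
    lows, highs = zip(*(tail_levels(ch, fraction) for ch in channels))
    c_min, c_max = min(lows), max(highs)  # equation (2)
    g_min, g_max = gamma_curve(c_min, gamma), gamma_curve(c_max, gamma)
    table = []
    for i in range(256):
        if i <= c_min:
            table.append(0)
        elif i >= c_max:
            table.append(255)
        else:
            # Only reached when c_min < c_max, so g_max > g_min.
            stretched = (gamma_curve(i, gamma) - g_min) / (g_max - g_min)
            table.append(round(255.0 * stretched))
    return table


def enhance(channels, gamma=2.2, fraction=0.01):
    """Apply the same lookup table to every channel."""
    table = build_lookup_table(channels, gamma, fraction)
    return [[table[v] for v in ch] for ch in channels]
```

Using one lookup table for all three channels is a design choice of this sketch; it keeps the ratio between channels closer to the original than stretching each channel separately would.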

Fig. 1. A sample text image; the useful part is only 0.4 of the image.

Fig. 3. The flowchart of the candidate region extraction stage.

1) Edge Gradient Detection: Edge detection depends on the gray-level difference between adjacent pixels, so we first convert the input image to a gray image. We then convolve it with the Sobel operator, which uses two 3 × 3 kernels (one for horizontal changes and one for vertical changes). To extract more complete edge lines, the gradient between two pixels should be as salient as possible, so at each pixel we keep the larger of the values in the horizontal and vertical gradient maps. Applying a piecewise linear normalization function to the gradient map then helps to speed up the search for the optimal solution and to increase the precision. Next, the gradient map is converted to a binary map that preserves as much edge information as possible, using a higher threshold $T_1$ of 0.75 and a lower threshold $T_2$ of 0.35. Finally, a box blur of size 61 and some morphological operations are applied. At this point, most of the edges in the image have been extracted.

2) Edge Density Detection: Noise regions can also contain pronounced edges, but text regions contain many characters and therefore have higher edge densities. Taking the spatial distribution of edges into account thus suppresses such noise. The process is similar to the previous stage: compute the gradient maps in the horizontal and vertical directions, convert the gradient map to a binary map with a threshold of 0.25, and finally apply Gaussian blur and normalization.

3) Silhouette Detection: Silhouette detection plays an important role in text detection, because characters often consist of short lines whereas the background mostly consists of long lines, which helps to segment the text regions. This stage uses the EDline method [19], which has four steps: first, suppress noise by Gaussian filtering; second, compute the gradient magnitude and edge direction maps; third, extract the anchors; fourth, connect the anchors by smart routing. The details of the EDline algorithm can be found in the reference paper. Fig. 4 shows the result of each stage.

Fig. 4. The result of candidate region extraction. (a) Edge gradient map, (b) edge density map, (c) silhouette map.

After the above process, many non-text regions have been filtered out, but some remain. We remove them with geometric constraints: we draw the bounding box of each line, save its corner coordinates to calculate the length, width, area and aspect ratio, and delete the regions that do not conform to the constraints. Based on experience, the minimum length is set to 20 pixels, the minimum width to 50 pixels and the minimum area to 1500 pixels.

Fig. 5. The mask image after removing some non-text regions.

D. Accurate Localization

An efficient text detection algorithm named Extremal Regions (ERs), which used image moments as features for a monolithic neural network, is stable under blur, illumination, hue and texture changes and low contrast. This method was described by Neumann and Matas; for more details see [5]. In this method, the selection of suitable ERs is carried out by a sequential classifier on the basis of novel features that are specific to character detection. Moreover, the classifier is trained to output a probability and thus extracts several segmentations of a character.

Fig. 6. Situation between adjacent detection regions.

1) Extremal Regions: Consider the color image $I$ as a map with three channels. A channel $C$ of the image $I$ is a mapping whose pixel values form a totally ordered set. In this paper we consider 4-connected pixels, i.e. the pixels with coordinates $(x \pm 1, y)$ and $(x, y \pm 1)$ are adjacent to the pixel $(x, y)$. A region $R$ of an image $I$ is a contiguous subset of the image. The outer region boundary $\partial R$ is the set of pixels that are adjacent to $R$ but do not belong to $R$. An Extremal Region (ER) is a region whose outer boundary pixels have strictly higher values than the region itself, i.e. there is a threshold $\theta$ such that $C(q) > \theta \ge C(p)$ for all pixels $p$ in $R$ and all pixels $q$ in the outer boundary $\partial R$.

2) Incrementally Computable Descriptors: The efficiency of the classifier depends on how quickly the region descriptors can be computed. The method takes advantage of the inclusion relationship between extremal regions. $R_{\theta-1}$ denotes the set of extremal regions at threshold $\theta - 1$. An ER $r$ at threshold $\theta$ is the union of pixels of regions at threshold $\theta - 1$ and pixels of value $\theta$. The authors of [5] design a descriptor

$$\phi(r) = \left( \bigoplus_{u \in R_{\theta-1}} \phi(u) \right) \oplus \left( \bigoplus_{p \in r:\, C(p) = \theta} \psi(p) \right)$$

where $\oplus$ combines the descriptors of adjacent regions and $\psi(p)$ is the initialization function for a single pixel $p$. The incrementally computed descriptors considered here are area, bounding box, perimeter, Euler number and horizontal crossings.

3) Sequential Classifier: To improve computational speed, the classification is broken down into two stages. In the first stage, the probability of each ER being a character is estimated using features calculated with O(1) complexity per region tested.
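The constant-time update behind these first-stage features can be illustrated for two of the descriptors named above, area and bounding box. The sketch below is an illustration under assumptions, not the implementation of [5]: the names `Descriptor`, `psi`, `combine` and `descriptor_at_threshold` are hypothetical, and perimeter, Euler number and horizontal crossings are left out because they also need neighborhood information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Area and bounding box of a region (two of the five descriptors)."""
    area: int
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def psi(x, y):
    """Initialization function: descriptor of the single pixel (x, y)."""
    return Descriptor(1, x, y, x, y)


def combine(a, b):
    """Combine the descriptors of two disjoint regions in constant time."""
    return Descriptor(
        a.area + b.area,
        min(a.x_min, b.x_min),
        min(a.y_min, b.y_min),
        max(a.x_max, b.x_max),
        max(a.y_max, b.y_max),
    )


def descriptor_at_threshold(child_descriptors, new_pixels):
    """Descriptor of a region at threshold theta, built from the regions at
    theta - 1 it contains and the pixels of value theta that join them."""
    parts = list(child_descriptors) + [psi(x, y) for x, y in new_pixels]
    result = parts[0]
    for part in parts[1:]:
        result = combine(result, part)
    return result
```

For example, merging a 3-pixel region spanning x = 0..2 on row 0 with a 2-pixel region spanning x = 4..5 on row 1 through the new pixels (3, 0) and (3, 1) gives area 7 and bounding box (0, 0) to (5, 1), without revisiting any pixel of the child regions.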
Only ERs with locally maximal probability are selected for the second stage, where the classification is improved using more computationally expensive features. A highly efficient exhaustive search with feedback loops is then applied to group ERs into words and to select the most probable character segmentation. Finally, text is recognized in an OCR stage trained using synthetic fonts.

4) Exhaustive Search: Finally, the method uses an efficiently pruned search to exhaustively explore the space of all character sequences. It uses higher-order properties of text such as word text lines, and its robust grouping stage is able to compensate for errors of the character detector.

E. Post-processing

Most of the text regions have been extracted by the previous steps, but some problems remain. For example, broken words whose letters are separated from each other can mislead the detection. In Fig. 6, $x_1$, $x_2$ and $x_3$ denote the distances between adjacent detection boxes. Region 1 and Region 2 are close to each other: when $x_1$ is less than the threshold, we mark them as one region. The relationship between other regions is judged by the same rule. $x_3$ is larger than the threshold, so the corresponding region is not merged. Using these geometrical features improves the quality of the extraction.

After the geometrical constraint, non-maximum suppression (NMS) is applied as a spatial constraint. Two trademarks must have some spare area between them. Consider a rectangle $D$ whose center is the same as that of the detection box and whose width and height are both twice those of the original detection box. If $D$ does not contain any other text region, there is no fake text region in $D$. When $D$ contains other detection boxes, we sort the boxes by energy and keep the one with the highest energy. In this way we can delete most fake text regions, especially the indication patterns beside the major text regions.

III. EXPERIMENTAL RESULTS

In this section, we present the experimental results of the proposed scene text detection method on 180 indoor panorama images taken with a Nikon D7100 professional fisheye camera. This paper has three main contributions: first, it processes panorama images directly; second, it extracts text regions in various fonts, including text in trademarks and signs; third, it extracts self-luminous text regions automatically. The following images display some parts of our work. Fig. 7 shows the detection results for some public signs that are useful in the mall.

TABLE I
THE ACCURACY OF OUR METHOD

              Precision   Recall   F-measure
Our method    0.60        0.54     0.57
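The F-measure in Table I can be checked against the reported precision and recall. The sketch below assumes the standard definitions, precision = Correct / (Correct + False) and recall = Correct / (Correct + Missing), which the paper does not spell out; the function and parameter names are illustrative.

```python
def precision(correct, false_regions):
    """Share of extracted regions that are real text regions."""
    return correct / (correct + false_regions)


def recall(correct, missing):
    """Share of real text regions that were extracted."""
    return correct / (correct + missing)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Precision and recall as reported in Table I.
print(round(f_measure(0.60, 0.54), 2))  # prints 0.57
```

The harmonic mean of 0.60 and 0.54 is 0.648 / 1.14 ≈ 0.568, which rounds to the F-measure of 0.57 given in the table.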

Fig. 7. The results of public signs.

Fig. 8 presents results for various kinds of trademarks. The proposed method overcomes problems such as large inclination angles, self-luminous objects, multi-scale signs and various fonts.

Fig. 8. Detection results of trademarks.

In most instances our method works well, but there are still some situations we cannot handle. Typical examples are given in Fig. 9: (a) is a trademark, (b) is a blurry region, (c) is severely tilted, and (d) has various textures.

Results are summarized in Table I. Correct is the number of text regions correctly extracted, Missing is the number of texts that failed to be detected, and False is the number of non-text regions extracted. A region is counted as correct when its size is more than 2/3 of the correct region; otherwise, it is counted as false.

To evaluate our method more rigorously, Fig. 10 shows the difference among the detection results of the MSER method, the hybrid method and our proposal. MSER is clearly not appropriate for panorama images, as it produces too much noise. The hybrid method combines MSER and CNN; it detects text regions precisely only when the region contains nothing but text and the image quality is high, so it cannot be applied to panorama images directly either.

IV. CONCLUSION

In this paper, we proposed a method that extracts text regions in indoor panorama images using image features and text features. Color continuity, gray-level variation, geometrical relationship and color variance are used as image features, and extremal regions are used as text features. The proposed method was tested on various kinds of indoor panorama images, and the results were confirmed to be better than those of the traditional methods.
However, some problems remain. The trademarks in our images lie in highlighted areas with halos, so double edges are usually detected, which makes it difficult to extract the text regions with traditional methods. The large data volume also limits the speed of the proposed method. Our next goal is to improve the method to handle these problems. Fig. 11 shows the results of our method.

Fig. 9. Unsuccessful detection situation.

ACKNOWLEDGMENT

This work was partially supported by the Natural Science Foundation of Hubei Province of China (Project No. 2013CFB296), the Open Research Fund of The Academy of Satellite Application under grant NO CXJJ-YG 13, and the South Wisdom Valley Innovative Research Team Program.

REFERENCES

[1] Jianhong Xie, "Optical character recognition based on least square support vector machine," in Third International Symposium on Intelligent Information Technology Application.
[2] Alessandro Bissacco, Mark Cummins, Yuval Netzer, and Hartmut Neven, "PhotoOCR: Reading text in uncontrolled conditions," in IEEE International Conference on Computer Vision (ICCV).

[3] Adam Coates, Blake Carpenter, Carl Case, Sanjeev Satheesh, Bipin Suresh, Tao Wang, David J. Wu, and Andrew Y. Ng, "Text detection and character recognition in scene images with unsupervised feature learning," in International Conference on Document Analysis and Recognition (ICDAR), 2011.
[4] Kai Wang, Boris Babenko, and Serge Belongie, "End-to-end scene text recognition," in IEEE International Conference on Computer Vision (ICCV).
[5] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[6] Hao Wang and Jari Kangas, "Character-like region verification for extracting text in scene images," in 12th International Conference on Document Analysis and Recognition.
[7] Weilin Huang, Yu Qiao, and Xiaoou Tang, "Robust scene text detection with convolution neural network induced MSER trees," in Computer Vision ECCV.
[8] Rodrigo Minetto, Nicolas Thome, Matthieu Cord, Jorge Stolfi, Frédéric Precioso, Jonathan Guyomard, and Neucimar J. Leite, "Text detection and recognition in urban scenes," in IEEE International Conference on Computer Vision Workshops (ICCV Workshops).
[9] Zhuowen Tu, Yi Ma, Wenyu Liu, Xiang Bai, and Cong Yao, "Detecting texts of arbitrary orientations in natural images," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Le Kang, Yi Li, and David Doermann, "Orientation robust text line detection in natural images," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Jing Zhang and Rangachar Kasturi, "Character energy and link energy-based text extraction in scene images," in Computer Vision ACCV.
[12] Lukáš Neumann and Jiri Matas, "Scene text localization and recognition with oriented stroke detection," in IEEE International Conference on Computer Vision (ICCV).
[13] Boris Epshtein, Eyal Ofek, and Yonatan Wexler, "Detecting text in natural scenes with stroke width transform," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Cong Yao, Xiang Bai, Baoguang Shi, and Wenyu Liu, "Strokelets: A learned multi-scale representation for scene text recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] David Nistér and Henrik Stewénius, "Linear time maximally stable extremal regions," Lecture Notes in Computer Science.
[16] Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao, "Robust text detection in natural scene images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5.
[17] Jiri Matas, Ondrej Chum, Martin Urban, and Tomáš Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, no. 10.
[18] Kaiming He, Jian Sun, and Xiaoou Tang, "Guided image filtering," in Computer Vision ECCV.
[19] Cihan Topal and Cuneyt Akinlar, "Edge drawing: A combined real-time edge and segment detector," Journal of Visual Communication and Image Representation, vol. 23, no. 6, 2012.

Fig. 10. Comparison among different methods. (a) MSER, (b) MSER+CNN, (c) our method.

Fig. 11. Detection result of trademarks.
