Thai Text Localization in Natural Scene Images using Convolutional Neural Network


Thananop Kobchaisawat and Thanarat H. Chalidabhongse
Department of Computer Engineering, Chulalongkorn University, Bangkok, Thailand
E-mail: Thananop.Ko@student.chula.ac.th, Thanarat.C@chula.ac.th

Abstract

Text detection in natural scene images is a challenging problem because it involves many more variations and uncontrollable factors than text detection in scanned documents. Existing Thai text detection methods localize text using connected component analysis combined with other rule-based techniques. In contrast, our proposed method is based on a well-known automatic feature extractor, the Convolutional Neural Network (CNN). The CNN is first trained with both English and Thai text datasets. Multi-scaled text confidence maps are then constructed to cope with text size variations. Post-processing and Thai text analysis are finally employed to acquire the text locations in the image. Based on our experimental results, the proposed method can detect English and Thai text in natural scene images with promising accuracy compared with state-of-the-art methods.

I. INTRODUCTION

Text information in images can be used in many applications, such as automatic language translation, scene text understanding, and assistive text reading for visually impaired people. Because of this wide range of applications, the problem has received significant attention from many researchers. However, locating text in natural scene images is unlike locating text in scanned documents. Natural scene images contain many unpredictable factors, such as text size, text style, a wide range of backgrounds, and varying lighting conditions. Many methods have been proposed and have reported promising results on English text datasets.
Nevertheless, the existing methods do not work well enough on Thai text because of some specific characteristics of the language.

Scene text localization techniques can be roughly categorized into two groups: connected-component-based and region-based [1]. Connected-component-based methods [2]-[4] use prior knowledge of text characteristics, such as color, stroke width, and geometric properties, combined with post-processing and heuristics to prune non-text areas. Region-based methods typically use a sliding window to find text areas in an image. Features are extracted from each window and passed to a classifier that identifies text regions. Owing to the popularity of machine learning, many region-based text localization methods use machine learning algorithms such as AdaBoost [5][6], Support Vector Machine (SVM) [7][8], and Neural Network [9] as classifiers. These algorithms require feature extractors. Many well-crafted computer vision feature extractors, such as Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP), and Discrete Wavelet Transform (DWT), are used together with handcrafted features to build the feature extractor. Post-processing techniques are then applied to acquire the text locations.

In this paper, instead of searching for a proper feature extractor for Thai text, we employ a Convolutional Neural Network (CNN), which is a kind of neural network with a learnable feature extractor. CNNs have enjoyed great success in related fields such as license plate localization, face detection, handwritten digit classification, and character recognition. We build a CNN text detector and combine it with post-processing techniques to acquire Thai text locations in a natural scene image with promising results.

The rest of this paper is organized as follows. In Section II, we present a survey of region-based text localization methods and of Thai text localization in natural scenes.
Our CNN text detector is detailed in Section III. In Section IV, we present the proposed post-processing techniques, which are designed to improve Thai text localization results. In Section V, we present the experimental results on the test datasets in comparison with other multi-language text localization methods. Finally, we conclude our work in Section VI.

(APSIPA 2014)

II. RELATED WORK

Many region-based methods for text localization in natural scene images have been proposed. J. Lee et al. [10] presented a text detector in which six features were extracted from a multi-scaled input image and classified by a Modest AdaBoost text detector to obtain a result map. Post-processing techniques were then applied to construct the output text regions. A text detector proposed by A. Coates et al. [11] used K-means clustering to learn a feature extractor from the pre-processed ICDAR 2003 dataset. Like a CNN, this method learns its feature extractor from training data, in this case without supervision. However, their paper gives no details about the text localization post-processing method. T. Wang et al. [12] presented an end-to-end scene text recognition system that includes scene text localization using a CNN. The first-layer feature extractor of their CNN text detector was derived from the method proposed in [11]. A post-processing technique was applied to the multi-scaled outputs of the CNN text detector to acquire text locations. However, their CNN text detector and post-processing technique do not work well with Thai text because of some special characteristics of the Thai language.

For Thai text detection in natural scene images, W. Jirattitichareon and T. H. Chalidabhongse [13] proposed a method to detect Thai text in low-quality signs using edge features, connected component labeling, and segmentation with a color model based on a Gaussian Mixture Model (GMM). K. Woraratpanya et al. [14] also introduced Thai text detection in natural scene images using fast boundary clustering and modified connected component analysis with heuristic rules. However, the text analysis techniques in these two methods rely on assumptions about specific characteristics of sign images, so they might not work well on complex scene images that contain much greater variation in text appearance.

III. TEXT DETECTOR

To localize text in natural scene images with a CNN text detector, we first trained the detector on well-known text datasets combined with a synthetic Thai text dataset. The trained text detector was then applied to a multi-scaled input image to acquire text confidence maps for the post-processing stage.

A. Text Detector Learning Architecture

1) Dataset Acquisition and Pre-processing

To train our text detector, we used 4 well-known text datasets: ICDAR 2003 [15], ICDAR 2011 [16], Char74k [17], and SVT (Street View Text) [18]. The dataset images were converted into grayscale to overcome the problem of scene text color variations. A 32x32 sliding window was applied to each image to gather text and non-text patches. A window was counted as a text patch when text, according to the provided ground truth, covered at least 80% of it. A sample of the acquired dataset is shown in Fig. 1.

Figure 1. Example of text and non-text patches from the ICDAR 2003, ICDAR 2011, Char74k, and SVT datasets.

To make the CNN detect Thai text more accurately, we needed to train the text detector with a Thai text dataset. However, to the best of our knowledge, there is no standard Thai text dataset. We therefore synthesized Thai text images from 500 fonts with random sizes and styles, and applied filters to make the dataset more realistic. We randomly created 40,000 Thai text patches, as shown in Fig. 2.

Figure 2. Example of the generated Thai text dataset.

In this work, we use 80,000 text patches (40,000 English and 40,000 Thai) and 80,000 non-text patches.

Next, a local brightness and contrast normalization was applied to each patch to correct non-uniform illumination and contrast differences. It is defined as

$$\hat{I}(x, y) = \frac{I(x, y) - \mu(x, y)}{\sigma(x, y)} \qquad (1)$$

where $\hat{I}(x, y)$, $I(x, y)$, $\mu(x, y)$, and $\sigma(x, y)$ represent the output patch, the input patch, the estimated local mean, and the estimated local standard deviation of the input patch, respectively.

2) Text Detector Training

The proposed CNN text detector has 5 layers: 2 convolution layers, 2 average pooling layers, and a fully-connected layer. The output layer consists of 2 nodes, text and non-text. The overall text detector model and its structure are shown in Fig. 3 and Table I, respectively.
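The local brightness and contrast normalization described above can be illustrated with a minimal sketch. It assumes that the local mean and standard deviation are estimated over the whole 32x32 patch, and it adds a small constant `eps` to avoid division by zero on flat patches; the function name `normalize_patch` and the value of `eps` are illustrative choices, not taken from the paper.

```python
import numpy as np


def normalize_patch(patch, eps=1e-6):
    """Normalize brightness and contrast of one grayscale patch.

    Assumption: mean and standard deviation are estimated over the
    whole patch. eps is a hypothetical guard for flat patches.
    """
    patch = np.asarray(patch, dtype=np.float64)
    mean = patch.mean()   # estimated local mean
    std = patch.std()     # estimated local standard deviation
    return (patch - mean) / (std + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.integers(0, 256, size=(32, 32))
    out = normalize_patch(patch)
    print(out.mean(), out.std())
```

After normalization, a bright low-contrast patch and a dark high-contrast patch of the same character end up on a comparable scale, which is the purpose of this step.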

Figure 3. CNN text detector (input patch, convolution layers, subsampling layers, fully-connected layer, text / non-text output).

TABLE I
TEXT DETECTOR STRUCTURE

Layer  Type  Input Size  Kernel Size  Feature Maps / Hidden Layer Size  Output Size
1      C     32x32       5x5          20                                28x28
2      P     28x28       2x2          20                                14x14
3      C     14x14       3x3          150                               12x12
4      P     12x12       2x2          150                               6x6
5      F     6x6x150

* C: Convolution Layer, P: Average Pooling Layer, F: Fully-Connected Layer

Our text detector was then trained using the back-propagation algorithm, with the variance-normalized version of the hyperbolic tangent sigmoid defined in (2) as the non-linearity and Mean Squared Error (MSE) as the error function.

$$y = a \tanh(b\,x) \qquad (2)$$

where $y$ and $x$ represent the output and input of the variance-normalized hyperbolic tangent sigmoid function, and $a$ and $b$ are its scaling constants.

B. Text Confidence Map

In this part, we used the trained text detector described above to estimate text locations in a given input image. First, a multi-scaled version of the input image was built. Then, each scaled image was passed to the trained text detector to obtain the multi-scaled text confidence maps that were used in the post-processing stage.

1) Image Pre-Processing

The input image was converted into grayscale to deal with text color variations. So that our text detector, with its fixed 32x32 pixel input, could detect text of various sizes, an image pyramid was constructed ranging from 10% to 150% of the original input image size in steps of 10%.

2) Multi-Scaled Text Detector

A 32x32 pixel sliding window was run over each scaled image. A local brightness and contrast normalization was also applied to each window before it was passed to the trained CNN text detector to obtain the multi-scaled text confidence maps shown in Fig. 4.

Figure 4. Original input image and text confidence maps at scales 1.5, 1.2, 1.1, 0.9, 0.8, and 0.7 (from left to right).

IV. TEXT CONFIDENCE MAP POST-PROCESSING

In this section, the text confidence maps from Section III were post-processed to obtain the final text locations.
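The multi-scale scanning of Section III-B can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical stand-in for the trained CNN text detector, the window stride of 4 pixels and the nearest-neighbour resizing are choices made here for brevity, and the paper does not state either.

```python
import numpy as np

WINDOW = 32  # fixed input size of the text detector


def pyramid_scales(start=0.1, stop=1.5, step=0.1):
    """Scales from 10% to 150% of the input size in 10% steps."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]


def resize_nearest(image, scale):
    """Nearest-neighbour resize (assumed here to keep the sketch dependency-free)."""
    h, w = image.shape
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return image[rows][:, cols]


def confidence_map(image, score_fn, stride=4):
    """Slide a 32x32 window over one scaled image and score each window."""
    h, w = image.shape
    if h < WINDOW or w < WINDOW:
        return np.zeros((0, 0))  # image too small for a single window
    ys = range(0, h - WINDOW + 1, stride)
    xs = range(0, w - WINDOW + 1, stride)
    out = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            out[i, j] = score_fn(image[y:y + WINDOW, x:x + WINDOW])
    return out


def multi_scale_maps(image, score_fn, stride=4):
    """One confidence map per pyramid scale, keyed by scale."""
    return {s: confidence_map(resize_nearest(image, s), score_fn, stride)
            for s in pyramid_scales()}


if __name__ == "__main__":
    gray = np.random.default_rng(0).random((64, 64))
    # Placeholder scorer: patch contrast instead of a real CNN output.
    maps = multi_scale_maps(gray, lambda p: float(p.std()))
    print({s: m.shape for s, m in maps.items()})
```

Small text is found at the larger scales, where it grows to fill the 32x32 window, and large text at the smaller scales.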
First, estimated text bounding boxes were acquired from each text confidence map. Then, the estimated text bounding boxes from all scales were filtered by Non-Maximum Suppression (NMS) to suppress overlapping bounding boxes with low scores. Finally, we performed Thai text analysis on each candidate text box to acquire the final text bounding boxes.

A. Estimated Text Bounding Box

A technique similar to the one used in [12] was employed to produce the line response. For each scaled text confidence map, we calculated the line response by applying the line-level sliding window rule defined in (3), where $w$ denotes the line sliding window width.

[Equation (3): line-level sliding window rule]

For each row with a response greater than 0, we constructed a line-level bounding box sized according to the image scale, which could produce overlapping bounding boxes. Once the bounding boxes from all scales had been acquired, NMS was applied to suppress overlapping bounding boxes with low scores and to output the candidate text bounding boxes.

B. Thai Text Characteristic and Analysis

1) Thai Text Characteristics

Thai text consists of consonants, vowels, tone marks, and special characters, as shown in Table II.
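The non-maximum suppression step of Section IV-A can be sketched as a greedy procedure. The paper does not specify the overlap measure or threshold, so this sketch assumes intersection-over-union with a threshold of 0.5; the function names and the `(x1, y1, x2, y2)` box layout are likewise illustrative.

```python
import numpy as np


def iou(box, boxes):
    """Intersection-over-union of one box against an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop lower-scoring overlaps.

    boxes: (x1, y1, x2, y2) rows, already mapped back to the
    original image coordinates. Returns indices of the kept boxes.
    The 0.5 threshold is an assumption, not a value from the paper.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)  # best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


if __name__ == "__main__":
    boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

Because boxes from every pyramid scale are pooled before suppression, the same text line detected at neighbouring scales collapses to its best-scoring box.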

TABLE II
THAI TEXT CLASSIFICATION

Consonants          ก ข ฃ ค ฅ ฆ ง จ ฉ ช ซ ฌ ญ ฎ ฏ ฐ ฑ ฒ ณ ด ต ถ ท ธ น บ ป ผ ฝ พ ฟ ภ ม ย ร ล ว ศ ษ ส ห ฬ อ ฮ
Vowels              ะ า ๅ ำ เ โ ใ ไ
Tone Marks
Special Characters  ๆ ฯ

Unlike English text, in which all consonants and vowels are written on a single line, one line of Thai text is divided into 4 levels, as shown in Fig. 5. Consonants and some vowels are written on the main level, while the others may be written on the levels above or below. These special characteristics prevent the existing state-of-the-art English text detection methods from working well with Thai text.

Figure 5. Unlike English text, the characters in Thai text are distributed over 4 levels.

2) Thai Text Analysis

To refine each candidate text bounding box so that it is more accurate for Thai text, we must first estimate the text line location. We employed the Canny edge detector [19] and applied connected component analysis to each candidate text bounding box, as shown in Fig. 6.

Figure 6. (a) Candidate text bounding box. (b) Canny edge detector result. (c) Connected component analysis result.

We computed the centroid of each connected component and took the mean of the y-positions as the estimated center of the text line. The connected components that lay within a ±30% interval of the estimated center line were considered center line components. The estimated upper and lower line locations were then calculated from the center line components. From this information, we were able to acquire the text line layout shown in Fig. 7 and perform component analysis.

Figure 7. Estimated text line layout, where a, b, and c represent the estimated upper, center, and lower text lines.

Focusing on the center components, we obtained the estimated character height from the estimated upper and lower line locations. In Thai text, vowels and tone marks are usually written above and below the main line characters, with a height of no more than 50% of the character height. Based on this assumption, we padded the candidate text bounding box above and below by 50% of the character height, measured from the estimated upper and lower line locations. On the new padded bounding box, we performed the same process as above to acquire the estimated center of the text line and the upper and lower line locations. We considered the components above the upper line and below the lower line as upper and lower components, respectively.

Figure 8. (a) Padded candidate text bounding box. (b) Text layout analysis result, with upper, center, and lower components.

For each upper and lower component, we found the nearest center line component, that is, the one with the least distance among all center components, as shown in Fig. 8b. Normally, Thai text places the upper and lower components above or below a center line character at an angle between 0 and 45 degrees from the centroid of that character, as shown in Fig. 9. This hypothesis helped us decide whether additional components from the padded text bounding box should be included in the final text bounding box.
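The center line estimation and the 0 to 45 degree hypothesis can be sketched on component centroids alone. This is a minimal illustration with several assumptions that the paper leaves open: the ±30% interval is taken relative to the bounding box height, the angle is measured from the vertical through the nearest center component, and all function names are hypothetical. Centroids are `(x, y)` pairs in image coordinates.

```python
import math


def center_line_components(centroids, box_height, tolerance=0.30):
    """Estimate the center line and select the components near it.

    The center line is the mean y-position of all centroids.
    Assumption: the +/-30% interval is relative to box_height.
    """
    center_y = sum(y for _, y in centroids) / len(centroids)
    band = tolerance * box_height
    center = [c for c in centroids if abs(c[1] - center_y) <= band]
    return center_y, center


def nearest_center(component, center_components):
    """Center line component with the least distance to the given component."""
    return min(center_components, key=lambda c: math.dist(c, component))


def is_properly_aligned(component, center_components, max_angle_deg=45.0):
    """Test the 0 to 45 degree hypothesis for an upper or lower component.

    Assumption: 0 degrees means directly above or below the centroid of
    the nearest center line character.
    """
    base = nearest_center(component, center_components)
    dx = abs(component[0] - base[0])
    dy = abs(component[1] - base[1])
    angle = math.degrees(math.atan2(dx, dy))
    return angle <= max_angle_deg


if __name__ == "__main__":
    centroids = [(10, 50), (30, 52), (50, 48), (30, 20)]
    center_y, center = center_line_components(centroids, box_height=60)
    print(center_y, center)
    print(is_properly_aligned((30, 20), center))  # tone mark above a character
    print(is_properly_aligned((80, 40), center))  # stray component to the side
```

A component that sits almost directly above a character passes the test, while one that lies mostly sideways from its nearest character is treated as unrelated clutter.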

Figure 9. Example of a properly aligned upper component (within 0 to 45 degrees).

After evaluating this hypothesis for all upper and lower components, we built the final text bounding box. We padded the top and bottom of the bounding box only if more than 70% of the upper and lower components were aligned in proper positions. Then, we built the minimal bounding box that included all properly aligned components as the final text bounding box, as shown in Fig. 10.

Figure 10. (a) The estimated text bounding box. (b) The final refined text bounding box.

V. EXPERIMENTAL RESULT

To evaluate our proposed method, we conducted experiments with different methods on a Thai-English text dataset and on English-only text datasets. Our Thai-English text dataset consists of 200 images of 640x480 pixels, and the English datasets are the ICDAR 2003 and ICDAR 2011 standard test datasets. We compared our proposed method with other multi-language text localization methods using the 3 ICDAR standard text localization evaluation criteria [15]: precision, recall, and f-measure. The results on each dataset are shown in Tables 3-5.

TABLE 3
TEXT LOCALIZATION METHODS PERFORMANCE EVALUATION ON THAI-ENGLISH DATASET

Method            Precision  Recall  F-Measure
Epshtein [2]
T.Wang [12]
Proposed method

TABLE 4
TEXT LOCALIZATION METHODS PERFORMANCE EVALUATION ON ICDAR2003 DATASET

Method            Precision  Recall  F-Measure
1st ICDAR [15]
Epshtein [2]
Proposed method
B.Bai [20]
Y.Pan [3]

TABLE 5
TEXT LOCALIZATION METHODS PERFORMANCE EVALUATION ON ICDAR2011 DATASET

Method            Precision  Recall  F-Measure
C.Yi [21]
Proposed method
Epshtein [1]
L. Neumann [1]

In Fig. 11, we compare the results of each text localization method on the Thai-English dataset. The results show that our method detected and localized Thai text more accurately.
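The three evaluation criteria can be computed as in the sketch below. It assumes the rectangle matching used by the ICDAR 2003 robust reading competition [15], where the match between two rectangles is their intersection area divided by the area of the smallest rectangle containing both, and the f-measure weights precision and recall equally (alpha = 0.5). The function names and the `(x1, y1, x2, y2)` box layout are illustrative.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is empty."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def match(a, b):
    """Intersection area over the area of the minimum box containing both."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    hull = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    hull_area = area(hull)
    return area(inter) / hull_area if hull_area > 0 else 0.0


def best_match(box, others):
    """Best match of one box against a set of boxes (0 if the set is empty)."""
    return max((match(box, o) for o in others), default=0.0)


def evaluate(estimates, targets, alpha=0.5):
    """Return (precision, recall, f_measure) for one image or a pooled set."""
    precision = (sum(best_match(e, targets) for e in estimates) / len(estimates)
                 if estimates else 0.0)
    recall = (sum(best_match(t, estimates) for t in targets) / len(targets)
              if targets else 0.0)
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f_measure = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    estimates = [(0, 0, 10, 10)]
    targets = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(evaluate(estimates, targets))
```

Under this matching, a box that covers only the main-level characters of a Thai line and misses the vowels or tone marks above and below scores less than a perfect match, which lowers both precision and recall.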
The results of our proposed method in Fig. 11 (e) show that the vowels and tone marks in each test image were included in the text bounding box, whereas the other multi-language text localization methods detect the center line characters but miss some vowels and tone marks. Fig. 12 shows some sample outputs of our proposed method. Our method can detect both Thai and English natural scene text across variations in text style, size, and color, and even under a small amount of perspective distortion. However, our method failed to localize text in some difficult cases, such as text that is significantly distorted by perspective projection, text on non-planar surfaces, very small text, and text with complex backgrounds or lighting conditions. Failed results are shown in Fig. 13.

VI. CONCLUSIONS

In this paper, we presented a method to localize Thai text in natural scene images. Our system uses a neural network with learned features (CNN) as the text detector, combined with post-processing techniques based on Thai text characteristic analysis. This combination improved Thai text localization results in terms of fewer missed vowels and tone marks compared with other multi-language text localization methods. In our experiments, based on the standard evaluation method, our system shows good results on mixed Thai and English test images.

Figure 11. (a) Input image and text area of interest. (b) Result from T. Wang's method. (c) Result from Epshtein's method. (d) Result from L. Neumann's method (www.textspotter.org). (e) Result of our proposed method.

Figure 12. Example of correct results.

Figure 13. Example of incorrect results.

REFERENCES

[1] L. Neumann and J. Matas, "Scene Text Localization and Recognition with Oriented Stroke Detection," in IEEE International Conference on Computer Vision (ICCV2013), 2013, pp. 97-104.
[2] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR2010), 2010.
[3] Y.-F. Pan, X. Hou, and C.-L. Liu, "A Hybrid Approach to Detect and Localize Texts in Natural Scene Images," IEEE Trans. Image Process., vol. 20, no. 3.
[4] L. Neumann and J. Matas, "A Method for Text Localization and Recognition in Real-world Images," in Computer Vision ACCV 2010, 2010, pp. 770-783.
[5] X. Chen and A. L. Yuille, "Detecting and reading text in natural scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR2004), 2004, vol. 2, pp. II-366 to II-373.
[6] Y.-F. Pan, X. Hou, and C.-L. Liu, "A Robust System to Detect and Localize Texts in Natural Scene Images," in 8th IAPR International Workshop on Document Analysis Systems (DAS2008), 2008, pp. 35-42.
[7] D. Chen, H. Bourlard, and J. Thiran, "Text identification in complex background using SVM," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR2001), 2001, vol. 2, pp. II-621 to II-626.
[8] X. Li, W. Wang, S. Jiang, Q. Huang, and W. Gao, "Fast and effective text detection," in 15th IEEE International Conference on Image Processing (ICIP2008), 2008, pp. 969-972.
[9] S. M. Hanif and L. Prevost, "Text Detection and Localization in Complex Scene Images using Constrained AdaBoost Algorithm," in 10th International Conference on Document Analysis and Recognition (ICDAR2009), 2009, pp. 1-5.
[10] J.-J. Lee, P.-H. Lee, S.-W. Lee, A. Yuille, and C. Koch, "AdaBoost for Text Detection in Natural Scene," in 11th International Conference on Document Analysis and Recognition (ICDAR2011), 2011.
[11] A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, W. Tao, and A. Y. Ng, "Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning," in 11th International Conference on Document Analysis and Recognition (ICDAR2011), 2011, pp. 440-445.
[12] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, "End-to-end text recognition with convolutional neural networks," in 21st International Conference on Pattern Recognition (ICPR2012), 2012.
[13] W. Jirattitichareon and T. H. Chalidabhongse, "Automatic Detection and Segmentation of Text in Low Quality Thai Sign Images," in IEEE Asia-Pacific Conference on Circuits and Systems (APCCAS2006), 2006, pp. 1000-1003.
[14] K. Woraratpanya, P. Boonchukusol, Y. Kuroki, and Y. Kato, "Improved Thai text detection from natural scenes," in International Conference on Information Technology and Electrical Engineering (ICITEE2013), 2013, pp. 137-142.
[15] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in 7th International Conference on Document Analysis and Recognition (ICDAR2003), 2003, vol. 1.
[16] A. Shahab, F. Shafait, and A. Dengel, "ICDAR 2011 Robust Reading Competition Challenge 2: Reading Text in Scene Images," in 11th International Conference on Document Analysis and Recognition (ICDAR2011), 2011.
[17] T. E. de Campos, B. R. Babu, and M. Varma, "Character recognition in natural images," in Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal.
[18] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in 13th International Conference on Computer Vision (ICCV2011), 2011.
[19] J. Canny, "A Computational Approach to Edge Detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6.
[20] B. Bai, F. Yin, and C. L. Liu, "Scene Text Localization Using Gradient Local Correlation," in 12th International Conference on Document Analysis and Recognition (ICDAR2013), 2013.
[21] C. Yi and Y. Tian, "Text String Detection From Natural Scenes by Structure-Based Partition and Grouping," IEEE Trans. Image Process., vol. 20, no. 9.
