Scene Text Detection Using Machine Learning Classifiers

Nafla C.N. 1, Sneha K. 2, Divya K.P. 3
1 (Department of CSE, RCET, Akkikkvu, Thrissur)
2 (Department of CSE, RCET, Akkikkvu, Thrissur)
3 (Department of CSE, RCET, Akkikkvu, Thrissur)

ABSTRACT
In this paper we present an efficient method of scene text detection using two machine learning classifiers: one for generating candidate word regions and the other for classifying components as text or non-text. First, we extract connected components with the maximally stable extremal region algorithm. The resulting components are partitioned into clusters by an AdaBoost classifier based on their adjacency relationship. We then extract features for classification from the clusters. Finally, a support vector machine classifier labels each block as text or non-text.

Keywords - Connected component (CC), maximally stable extremal region (MSER), optical character recognition (OCR), support vector machine (SVM).

I. INTRODUCTION
Due to the wide availability of mobile devices with high quality digital cameras, research areas related to these devices have received growing attention in the last few decades. Text detection and extraction is one of the most important and interesting of these areas. Text present in camera captured images is a strong source of information about the image and about the place or situation in which it was captured, so text detection and extraction from images has many valuable applications. Text present in an image or video can be classified as scene text or caption text. Scene text exists in the image naturally, whereas caption text is added manually by the user. Scene text overlaps with the background, so scene text detection and extraction are more difficult than the detection of caption text.
Compared to scanned document images, text extraction from natural scenes is difficult because the text appears in arbitrary orientations and sizes and suffers from background interference. Examples of scene text include street signs, display boards on shops, text on vehicles and advertisement boards. Fig 1 shows examples of text in natural scene images.

Text string detection and extraction have a variety of useful applications. People who travel abroad often find it difficult to understand the text on display boards in foreign countries. They then rely either on guides or on intelligent handheld devices to translate the information written on the boards, and text detection is an essential part of such devices. Text detection also plays a crucial role in content-based visual information retrieval and content-based image retrieval, which apply computer vision techniques to the problem of image retrieval in huge databases. Another important application of scene text extraction is helping people with visual disabilities: a computerized system that conveys the text present on objects and locations would be a great help to them. License plate detection is another important area where text detection plays a central role; it is crucial for monitoring traffic at customs check points and for tracking stolen cars. Other significant applications of scene text detection and extraction include robotic navigation and automatic geocoding.

Fig 1: Examples of natural images with scene text

OCR is one of the technologies that can extract text characters, by identifying their corners. This works only if the characters are correctly separated from the background. Background interference and degradation in images decrease the performance of OCR, so its performance is comparatively low on natural scene images. Texture analysis and topic based partition are other methods of detection, but they are suited to document images. Text detection and extraction from natural images is not a simple task: text may appear against a complex background, and the chances of degradation are high. As a result, text extraction from natural images involves many complexities.

The paper is organized as follows. Section II surveys existing methods of scene text detection. Section III provides details of the proposed method. Section IV concludes the paper.

II. LITERATURE SURVEY
This section covers the study of existing scene text detection methods. Existing methods can be categorized as texture based methods, connected component based methods and hybrid methods.

2.1 Texture based methods
Texture based methods consider text as a special kind of texture and identify it using properties such as wavelet features, filter responses and local intensities. Angadi et al [1] described a method that uses a high pass filter in the DCT domain to suppress the background and uses texture properties such as homogeneity and contrast to detect text. The method comprises five phases: removal of the background in the DCT domain, derivation of the feature matrix D, block classification, merging of blocks to extract the text area, and refinement of the text region. Kim et al [2] described a method that uses a combination of CAMSHIFT and SVM for detection and extraction of text.
The raw pixel intensities that form the textural pattern are given as input to the SVM. After texture extraction, text identification is performed using CAMSHIFT. Gllavata et al [3] described a method that separates text and non-text areas using the distribution of high frequency wavelet coefficients obtained by applying a wavelet transform to the image. Text areas are then classified by k-means clustering, and text extraction is performed by an OCR engine that takes the segmented binary text image as input.

2.2 Connected component based methods
In connected component based methods, the image is first segmented and candidate text components are extracted. Non-text elements are then eliminated in various ways. These methods make use of geometrical properties and work properly on images that contain text with many variations, such as changes in orientation and font. Epshtein et al [4] described a method that uses stroke width to extract text components. A stroke is a contiguous part of an image that forms a band of approximately constant width. Constant stroke width is one of the important features that separate text from other components of a scene. The method combines a local operator with geometrical reasoning to find places that share the same stroke width and thereby identify regions containing text. Yi et al [5] described a method that uses gradient features and color homogeneity of character components to extract candidate text regions. Character candidates are then grouped to detect text strings, based on structural features of characters in a text string such as differences in character size, distances between neighboring characters, and character alignment.
Gatos et al [6] described a methodology for text detection in natural scene images that is based on an efficient binarization and enhancement technique followed by a connected component analysis procedure. Starting from the original image, the method produces a binary image and an inverted binary image. Connected components are then extracted from these complementary images. Text verification is conducted at character level and word level on the candidate connected components. Finally, the text regions localized in the two images are refined and merged in post-processing.

2.3 Hybrid based methods
Hybrid based methods combine texture based and connected component based methods. Pan et al [7] described a hybrid approach. First, a text region detector generates a text estimation map, which helps segment text components by local binarization. Non-text components are then filtered out by a conditional random field model. Finally, text components are grouped into text lines by a learning based energy minimization method. Liu et al [8] described a hybrid method based on the assumption that characters have closed contours and that a character string consists of characters lying on a straight line. The method extracts the text region by extracting closed contours and searching for their neighbors.

III. PROPOSED METHOD
This section describes the techniques used in the proposed methodology.

3.1 Overview of proposed method
The block diagram of our system is shown in Fig 2.

Fig 2: Overview of proposed system

As shown in the diagram, the method consists mainly of the following steps: connected component extraction, clustering with an AdaBoost classifier, feature extraction for SVM classification, and classification of clusters into text and non-text components. For CC extraction we use the MSER algorithm. An AdaBoost classifier that works on the adjacency relationship between CCs is used for clustering. We then extract features and classify the clusters as text or non-text components with an SVM classifier.

3.2 Connected component extraction
Although there are many CC extraction methods, we use the MSER algorithm because of its low computation cost and high performance. The MSER algorithm extracts the parts of the image where local binarization is stable over a wide range of thresholds; it finds connected components that are brighter or darker than their surroundings. This property helps us extract most of the text components in the image. Fig 4 shows the result of MSER extraction on the input image shown in Fig 3.

Fig 3: Input image

Fig 4: Result of MSER extraction

3.3 Clustering of CCs
Clustering groups CCs based on their adjacency relationship with the help of an AdaBoost classifier.

3.3.1 Building of training sets
Our classifier is based on the pairwise adjacency relationship between connected components extracted using MSER. To build the training set for the classifier, we obtain a collection of CCs by applying MSER extraction to the set of training images. Then, for every pair of extracted CCs, we check whether they are adjacent and whether they belong to the text component set.
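The pair-labelling step of this subsection can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper does not say how ground truth is stored, so the per-CC keys `is_text`, `word_id` and `pos` are hypothetical, and "adjacent" is assumed to mean neighbouring characters of the same word. Pairs are labelled positive when both CCs are text and adjacent, and negative when exactly one of them is text, as this subsection defines the two sets; all other pairs are left out because the text does not assign them to either set.

```python
from itertools import combinations


def label_pairs(ccs):
    """Split CC index pairs into positive and negative training samples.

    Each CC is a dict with hypothetical ground-truth keys:
      'is_text' - True if the CC is part of a text region
      'word_id' - identifier of the word it belongs to (None for non-text)
      'pos'     - position of the character inside that word
    """
    positives, negatives = [], []
    for i, j in combinations(range(len(ccs)), 2):
        a, b = ccs[i], ccs[j]
        if a["is_text"] and b["is_text"]:
            # Assumed adjacency test: neighbouring characters of the same word.
            same_word = a["word_id"] == b["word_id"]
            if same_word and abs(a["pos"] - b["pos"]) == 1:
                positives.append((i, j))
        elif a["is_text"] != b["is_text"]:
            # One text CC paired with one non-text CC.
            negatives.append((i, j))
    return positives, negatives


# Three characters of one word plus one background component.
ccs = [
    {"is_text": True, "word_id": 0, "pos": 0},
    {"is_text": True, "word_id": 0, "pos": 1},
    {"is_text": True, "word_id": 0, "pos": 2},
    {"is_text": False, "word_id": None, "pos": None},
]
positives, negatives = label_pairs(ccs)
print(positives)  # [(0, 1), (1, 2)]
print(negatives)  # [(0, 3), (1, 3), (2, 3)]
```

In a real training set the number of text/non-text pairs is likely to be much larger than the number of adjacent text pairs, so some form of subsampling of the negative set may be needed; the paper does not discuss this.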
We then build a set of positive and negative examples. The positive set contains pairs of CCs that are adjacent and both belong to the text component set. Negative samples are pairs of CCs in which one CC belongs to the text component set and the other belongs to the non-text set.

3.3.2 AdaBoost learning and clustering of CCs
With the collected samples, we train an AdaBoost classifier that tells us whether two given CCs are adjacent. To train the classifier we use one color based property and four geometrical properties of CCs. First we construct a bounding box on each CC and record its height and width. For each pair of CCs, we estimate the vertical overlap, horizontal overlap and horizontal distance between their bounding boxes, denoted by vo_ij, ho_ij and d_ij respectively. The color based property is the color distance between the two CCs. We calculate these features for both positive and negative samples and train the AdaBoost classifier on them. The output of the classifier is +1 for CCs that are adjacent and -1 for CCs that are not. We check this adjacency for all pairs of CCs extracted using MSER and then cluster the CCs with the union-find (disjoint-set) algorithm.

3.3 Feature extraction
After clustering we obtain a set of clusters that includes text as well as non-text regions. To classify them with an SVM classifier, we first have to extract features from the clusters. We divide each cluster into overlapping square blocks and extract features from each block. Each square block is divided into 4 horizontal and 4 vertical sub-blocks. For a horizontal sub-block, the features are a) the number of white pixels, b) the number of vertical white-black transitions and c) the number of vertical black-white transitions; the features for a vertical sub-block are defined similarly.
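The clustering step of Section 3.3.2 and the block features described above can be sketched together in plain Python. This is a minimal sketch under several assumptions, not the authors' implementation. The exact definitions of vo_ij, ho_ij and d_ij are not given here, so `pair_features` uses one plausible form (overlap normalised by the smaller box, distance as the horizontal gap). `toy_adjacent` is a hand-written stand-in for the trained AdaBoost classifier, with arbitrary thresholds. Blocks are assumed to be binary (1 = white, 0 = black) with a side length divisible by 4, "vertical" transitions are assumed to be counted from top to bottom, and the features of vertical sub-blocks are assumed to be the same quantities computed on the transposed block. Cutting a cluster into overlapping square blocks is not shown.

```python
from itertools import combinations

# A bounding box is a tuple (x_min, y_min, x_max, y_max) with positive size.


def pair_features(a, b):
    """Assumed forms of vertical overlap, horizontal overlap and distance."""
    vo = max(0, min(a[3], b[3]) - max(a[1], b[1])) / min(a[3] - a[1], b[3] - b[1])
    ho = max(0, min(a[2], b[2]) - max(a[0], b[0])) / min(a[2] - a[0], b[2] - b[0])
    d = max(0, max(a[0], b[0]) - min(a[2], b[2]))  # horizontal gap, 0 if none
    return vo, ho, d


def toy_adjacent(a, b):
    """Hand-written stand-in for the trained AdaBoost classifier (+1 / -1)."""
    vo, _, d = pair_features(a, b)
    shorter = min(a[3] - a[1], b[3] - b[1])
    return 1 if vo > 0.5 and d < shorter else -1


def cluster(boxes, is_adjacent):
    """Group box indices with union-find over all pairs judged adjacent."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(boxes)), 2):
        if is_adjacent(boxes[i], boxes[j]) == 1:
            parent[find(j)] = find(i)  # union the two sets

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def strip_features(strip):
    """White-pixel count and top-to-bottom transitions in a 0/1 strip."""
    white = sum(sum(row) for row in strip)
    white_black = black_white = 0
    for column in zip(*strip):
        for upper, lower in zip(column, column[1:]):
            if upper == 1 and lower == 0:
                white_black += 1
            elif upper == 0 and lower == 1:
                black_white += 1
    return [white, white_black, black_white]


def block_features(block):
    """24 features: 3 per strip for 4 horizontal and 4 vertical strips."""
    step = len(block) // 4
    transposed = [list(col) for col in zip(*block)]  # vertical strips as rows
    features = []
    for source in (block, transposed):
        for k in range(4):
            features += strip_features(source[k * step:(k + 1) * step])
    return features


# Three boxes on one text line and one far-away box.
boxes = [(0, 0, 10, 20), (12, 1, 22, 21), (24, 0, 34, 20), (200, 100, 210, 120)]
print(cluster(boxes, toy_adjacent))  # [[0, 1, 2], [3]]

# 8x8 block whose rows alternate white (1) and black (0).
block = [[1] * 8 if r % 2 == 0 else [0] * 8 for r in range(8)]
print(block_features(block))
```

Union-find makes adjacency transitive: if A is adjacent to B and B to C, all three end up in one cluster even when the classifier rejects the pair (A, C). That property is what lets pairwise decisions produce word-level groups.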
3.4 SVM classification
To train the SVM, we first apply our connected component extraction, clustering and feature extraction steps to the training images, and then train a support vector machine classifier to classify each square block as a text or non-text component. For a test image, we perform all of the above steps and finally integrate the decisions for all the square blocks of a cluster: if the number of square blocks classified as text is greater than the number classified as non-text, the cluster is classified as a text component.

Fig 5: Result of clustering on input image

Fig 6: Text region detected from input image

IV. CONCLUSION
Due to complicated backgrounds and unpredictable text appearances, scene text detection is still a challenging problem. In this paper we have presented an improved scene text detection method that makes use of two machine learning classifiers: one for identifying the text components and the other for classifying text and non-text

components. Our method is designed to work correctly on images in which text strings are arranged horizontally. Our future work will focus on developing an efficient learning based algorithm that extracts text from complex backgrounds and text of arbitrary orientation.

ACKNOWLEDGEMENTS
Every success stands as a testimony not only to the hardship but also to the hearts behind it. Likewise, the present work has been undertaken and completed with direct and indirect help from many people, and I would like to acknowledge all of them for the same.

REFERENCES
[1] S. A. Angadi and M. M. Kodabagi, Text region extraction from low resolution natural scene images using texture features, 2nd International Advance Computing Conference, IEEE, 2010, pp. 121-128.
[2] K. I. Kim, K. Jung, and J. H. Kim, Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm, IEEE Trans. PAMI, vol. 25, no. 12, pp. 1631-1639, 2003.
[3] J. Gllavata, R. Ewerth, and B. Freisleben, Text Detection in Images Based on Unsupervised Classification of High-Frequency Wavelet Coefficients, Proc. of Int'l Conf. on Pattern Recognition, Cambridge, UK, 2004, pp. 425-428, ICPR.2004.1334146.
[4] B. Epshtein, E. Ofek, and Y. Wexler, Detecting text in natural scenes with stroke width transform, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2963-2970, CVPR.2010.5540041.
[5] Yingli Tian and Chucai Yi, Text string detection from natural scenes by structure based partition and grouping, IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2594-2605, 2011.
[6] B. Gatos, I. Pratikakis, and S. J. Perantonis, Towards text recognition in natural scene images, in Proceedings of Int. Conf. Automation and Technology, 2005, pp. 354-359.
[7] Yi-Feng Pan, Xinwen Hou, and Cheng-Lin Liu, Text Localization in Natural Scene Images Based on Conditional Random Field, ICDAR, 2009, pp. 6-10.
[8] Y. Liu, S. Goto, and T. Ikenaga, A contour-based robust algorithm for text detection in color images, IEICE Trans. Inf. Syst., vol. E89-D, no. 3, pp. 1221-1230, 2006.
[9] H. Koo and D. Kim, Scene text detection via connected component clustering and non-text filtering, IEEE Trans. Image Proc., vol. 22, no. 6, pp. 2296-2305, 2013.
[10] P. Shivakumara, T. Q. Phan, L. Shijian, and C. L. Tan, Gradient Vector Flow and Grouping Based for Arbitrarily-Oriented Scene Text Detection in Video Images, IEEE Trans. CSVT, 2013, pp. 1729-1739.