
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 3, MARCH 2011

A Hybrid Approach to Detect and Localize Texts in Natural Scene Images

Yi-Feng Pan, Xinwen Hou, and Cheng-Lin Liu, Senior Member, IEEE

Abstract: Text detection and localization in natural scene images is important for content-based image analysis. This problem is challenging due to complex backgrounds, non-uniform illumination, and variations in text font, size and line orientation. In this paper, we present a hybrid approach to robustly detect and localize texts in natural scene images. A text region detector is designed to estimate the text confidence and scale information in an image pyramid, which helps segment candidate text components by local binarization. To efficiently filter out the non-text components, a conditional random field (CRF) model considering unary component properties and binary contextual component relationships, with supervised parameter learning, is proposed. Finally, text components are grouped into text lines/words with a learning-based energy minimization method. Since all three stages are learning-based, very few parameters require manual tuning. Experimental results on the ICDAR 2005 competition dataset show that our approach yields higher precision and recall than state-of-the-art methods. We also evaluated our approach on a multilingual image dataset with promising results.

Index Terms: Conditional random field (CRF), connected component analysis (CCA), text detection, text localization.

I. INTRODUCTION

WITH the increasing use of digital image capturing devices, such as digital cameras, mobile phones and PDAs, content-based image analysis techniques have received intensive attention in recent years.
Among all the contents in images, text information has inspired great interest, since it can be easily understood by both humans and computers, and finds wide applications such as license plate reading, sign detection and translation, mobile text recognition, content-based web image search, and so on [19]. Jung et al. [13] define an integrated image text information extraction system (TIE, shown in Fig. 1) with four stages: text detection, text localization, text extraction and enhancement, and recognition. Among these stages, text detection and localization, bounded by the dashed line in Fig. 1, are critical to the overall system performance. In the last decade, many methods (as surveyed in [13], [19], [47]) have been proposed to address image and video text detection and localization problems, and some of them have achieved impressive results for specific applications. However, fast and accurate text detection and localization in natural scene images is still a challenge due to the variations of text font, size, color and alignment orientation, and it is often affected by complex background, illumination changes, image distortion and degradation.

Fig. 1. Architecture of a TIE system [13].

Manuscript received July 06, 2009; revised March 31, 2010; accepted August 07, Date of publication September 02, 2010; date of current version February 18, This work was supported by the National Natural Science Foundation of China (NSFC) under Grant and Grant The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Margaret Cheney. The authors are with the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China (yfpan@nlpr.ia.ac.cn; xwhou@nlpr.ia.ac.cn; liucl@nlpr.ia.ac.cn). Color versions of one or more of the figures in this paper are available online at Digital Object Identifier /TIP
The existing methods of text detection and localization can be roughly categorized into two groups: region-based and connected component (CC)-based. Region-based methods attempt to detect and localize text regions by texture analysis. Generally, a feature vector extracted from each local region is fed into a classifier for estimating the likelihood of text. Then neighboring text regions are merged to generate text blocks. Because text regions have distinct textural properties from non-text ones, these methods can detect and localize texts accurately even when images are noisy. On the other hand, CC-based methods directly segment candidate text components by edge detection or color clustering. The non-text components are then pruned with heuristic rules or classifiers. Since the number of segmented candidate components is relatively small, CC-based methods have lower computation cost and the located text components can be directly used for recognition. Although the existing methods have reported promising localization performance, several problems remain to be solved. For region-based methods, the speed is relatively slow and the performance is sensitive to text alignment orientation. On the other hand, CC-based methods cannot segment text components accurately without prior knowledge of text position and scale. Moreover, designing a fast and reliable connected component analyzer is difficult since there are many non-text components which are easily confused with texts when analyzed individually.

To overcome the above difficulties, we present a hybrid approach to robustly detect and localize texts in natural scene images by taking advantage of both region-based and CC-based methods. Since local region detection can robustly detect scene texts even in noisy images, we design a text region detector to estimate the probabilities of text position and scale, which help segment candidate text components with an efficient local binarization algorithm. To combine unary component properties and binary contextual component relationships, a conditional random field (CRF) model with supervised parameter learning is proposed. Finally, text components are grouped into text lines/words robustly with an energy minimization method. Experimental results on English and multilingual image datasets show that the proposed approach yields higher precision and recall performance compared with state-of-the-art methods. Particularly, the proposed approach reports higher performance on the ICDAR 2005 text locating competition dataset. Some contents of our approach have been presented in our conference papers [29], [30], but this paper extends them in several aspects: 1) more technical details of the system are given; 2) more classifiers and learning criteria are evaluated for training the CRF model; 3) the text line grouping method is extended to word level; and 4) further evaluation on a multilingual image dataset is added.

The rest of the paper is organized as follows. Section II briefly reviews the related works. Section III gives an overview of the proposed system. Section IV describes the preprocessing stage. Section V explains how to formulate the component classification into a CRF model. Section VI describes the learning-based text line/word grouping method. Experimental results and discussions are presented in Section VII, and concluding remarks are provided in Section VIII.

II. RELATED WORKS

Most region-based methods are based on observations that text regions have distinct characteristics from non-text regions, such as the distribution of gradient strength and texture properties. Generally, a region-based method consists of two stages: 1) text detection to estimate text confidence in local image regions by classification, and 2) text localization to cluster local text regions into text blocks, and text verification to remove non-text regions for further processing. An earlier method proposed by Wu et al. [44] uses a set of Gaussian derivative filters to extract texture features from local image regions. With the corresponding filter responses, all image pixels are assigned to one of three classes ("text", "non-text" and "complex background"), then k-means clustering and morphological operators are used to group text pixels into text regions. The methods of Zhong et al. [48] and Gllavata et al. [8] employ image transformations, such as discrete cosine transform (DCT) and wavelet decomposition, to extract features. By thresholding the filter responses, non-text regions are removed and the remaining text regions are grouped according to their spatial relationships. The approach proposed by Chen et al. [3] extracts texts in video images by detecting horizontal and vertical edges with a Canny filter and smoothing the edge maps to extract candidate text lines with morphological operators. Several gray and gradient features are employed to verify candidate text lines with a support vector machine (SVM) classifier.

The above methods can only detect texts of fixed scale. To capture texts of variable scales, Kim et al. [14] extract gray-level texture features from all local regions in each layer of an image pyramid. Then an SVM classifier is used to generate text probability maps, followed by a fast mode seeking process, continuously adaptive mean-shift (CAMSHIFT), to search for the text positions. Similarly, using an image pyramid, Lyu et al.
[24] detect candidate text edges of various scales with a Sobel operator. A local thresholding procedure insensitive to illumination changes is used to filter out non-text edges, and the text regions are then grouped into text lines by recursive profile projection analysis (PPA). To make use of color information for text localization in videos, the method of Lienhart and Wernicke [20] computes the gradient map with color derivative operators. A neural network classifier with multi-scale fusion is then employed to generate candidate text blocks, followed by a recursive PPA process for text localization. Li et al. [18] proposed an algorithm for detecting texts in video by using first- and second-order moments of wavelet decomposition responses as local region features classified by a neural network classifier. Text regions are then merged at each pyramid layer and further projected back to the original image map. Chen et al. [4] proposed a hierarchical sign text detection framework by combining multi-scale Laplacian of Gaussian (LOG) edge detection, adaptive search, color analysis and affine normalization. Finally, a Gaussian mixture model (GMM) is used for layout analysis. More recently, Weinman et al. [42] used a CRF model for patch-based text detection. This method justifies the benefit of adding contextual information to traditional local region-based text detection methods. Their experimental results show that this method can deal with texts of variable scales and alignment orientations. To speed up text detection, Chen and Yuille [5] proposed a fast text detector using a cascade AdaBoost classifier, whose weak learners are selected from a feature pool containing gray-level, gradient and edge features. Detected text regions are then merged into text blocks, from which text components are segmented by local binarization.
Their results on the ICDAR 2005 competition dataset [23] show that this method performs competitively and is more than 10 times faster than the other methods. Unlike region-based methods, CC-based methods are based on observations that texts can be seen as a set of connected components, each of which has distinct geometric features, and neighboring components have close spatial and geometric relationships. These methods normally consist of three stages: 1) CC extraction to segment candidate text components from images; 2) CC analysis to filter out non-text components using heuristic rules or classifiers; and 3) post-processing to group text components into text blocks (e.g., words and lines). The method of Liu et al. [22] extracts candidate CCs based on edge contour features and removes non-text components by wavelet feature analysis. Within each text component region, a GMM is used for binarization by fitting the gray-level distributions of the foreground and background pixel clusters. Zhu et al.


[50] first use a nonlinear local binarization algorithm to segment candidate CCs. Several types of component features, including geometric, edge contrast, shape regularity, stroke statistics and spatial coherence features, are then defined to train an AdaBoost classifier for fast coarse-to-fine pruning of non-text components. The method of Takahashi et al. [39] extracts candidate text components using a Canny edge detector in color images. A region adjacency graph is then built on the extracted CCs, and some heuristic rules based on local component features and adjacent component relationships are designed to prune non-text components. Zhang et al. [46] presented a Markov random field (MRF) method for exploring the neighboring information of components. The candidate text components are initially segmented with a mean-shift process. After building a component adjacency graph, an MRF model integrating a first-order component term and a higher-order contextual term is used for labeling components as "text" or "non-text". For multilingual text localization, Liu et al. proposed a method [21] which employs a GMM to fit third-order neighboring information of components using a specific training criterion: maximum minimum similarity (MMS). Their experiments show good performance on their multilingual image datasets.

Fig. 2. Flowchart of the proposed system.

III. SYSTEM OVERVIEW

As mentioned above, neither region-based methods nor CC-based methods alone can detect and localize texts very well. Motivated by the work combining local and global information [26] to address object detection and recognition tasks, we found that region-based methods and CC-based methods are complementary. Specifically, region-based methods can extract local texture information to accurately segment candidate components while CC-based methods can filter out non-text components and localize text regions accurately.
Thus, we propose a hybrid text detection and localization approach by combining region-based and CC-based information. Our system consists of three stages as shown in Fig. 2. At the pre-processing stage, a text region detector is designed to detect text regions in each layer of the image pyramid and project the text confidence and scale information back to the original image; scale-adaptive local binarization is then applied to generate candidate text components. At the connected component analysis (CCA) stage, a CRF model combining unary component properties (including the text confidence) and binary contextual component relationships is used to filter out non-text components. At the last stage, neighboring text components are linked with a learning-based minimum spanning tree (MST) algorithm and between-line/word edges are cut off with an energy minimization model to group text components into text lines or words.

IV. PRE-PROCESSING

To extract and utilize local text region information, a text region detector is designed to estimate the text confidence and the corresponding scale, based on which candidate text components can be segmented and analyzed accurately.

Fig. 3. Sub-regions for extracting HOG features.

Fig. 4. Schematic depiction of a WaldBoost classifier.

A. Text Region Detector

Initially, the original color image is converted into a gray-level image, on which the image pyramid with scale step 1.2 is built with nearest-neighbor interpolation to capture text information of different scales. Based on our previous work [29], a text region detector is designed by integrating a widely used feature descriptor, histogram of oriented gradients (HOG) [6], and a boosted cascade classifier, WaldBoost [37]. Note that the aim of the text region detector is not to find accurate text positions but to estimate probabilities of the text position and scale information.
In detail, within each window sliding in a layer of the image pyramid, we define several sub-regions by horizontally and vertically partitioning the window as shown in Fig. 3, from where 4-orientation HOG descriptors are extracted and fed to the WaldBoost classifier to estimate the text confidence. The WaldBoost classifier is different from the previously used cascade AdaBoost classifier in that it has only one strong classifier with every weak classifier being able to filter out negative objects. This mechanism, as shown in Fig. 4, leads to the simplification of the training process and the speedup of detection.
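The pyramid construction and the 4-orientation gradient histogram described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_pyramid` and `hog4`, the `min_size` stopping threshold, the use of unsigned orientations, and the L1 normalization are all assumptions made for illustration; only the scale step of 1.2, the nearest-neighbor sampling, and the four orientation bins come from the text.

```python
import numpy as np

def build_pyramid(gray, scale_step=1.2, min_size=16):
    """Return (scale, layer) pairs; each layer shrinks by scale_step.

    min_size is a hypothetical stopping threshold, not taken from the paper.
    """
    h, w = gray.shape
    layers, scale = [], 1.0
    while True:
        new_h, new_w = int(h / scale), int(w / scale)
        if min(new_h, new_w) < min_size:
            break
        # Nearest-neighbour sampling: map each target pixel to one source pixel.
        rows = np.minimum((np.arange(new_h) * scale).astype(int), h - 1)
        cols = np.minimum((np.arange(new_w) * scale).astype(int), w - 1)
        layers.append((scale, gray[np.ix_(rows, cols)]))
        scale *= scale_step
    return layers

def hog4(patch):
    """Gradient-magnitude histogram over 4 unsigned orientation bins (assumed binning)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)           # orientation in [0, pi)
    bins = np.minimum((ang / (np.pi / 4)).astype(int), 3)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=4)
    return hist / (hist.sum() + 1e-9)                 # L1 normalisation (assumption)
```

In a detector of this kind, `hog4` would be evaluated on each sub-region of a sliding window in every pyramid layer, and the concatenated histograms would be passed to the boosted classifier.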

Fig. 5. Example of the pre-processing stage.

B. Text Confidence and Scale Maps

To measure the text confidence for each image patch in a window, whether it is accepted or rejected, we translate the outputs of the WaldBoost classifier into posterior probabilities (confidence values) based on a boosted classifier calibration method [40]: the posterior probability of a label $c$, conditioned on its detection state $s$ (accepted or rejected) at stage $t$, can be estimated with the Bayes formula as

$$P_t(c \mid s) = \frac{P_t(s \mid c)\,P_t(c)}{\sum_{c'} P_t(s \mid c')\,P_t(c')} \tag{1}$$

where $P_t(c)$ is the label prior at stage $t$ and all the stage likelihoods $P_t(s \mid c)$ are calculated on a validation dataset during training. By probability calibration, the text confidence map for each layer of the image pyramid is obtained. Then, all the pixel confidence and scale values at different pyramid layers are projected back to the original image for calculating the final text confidence and scale maps. In the final map, the text confidence value $C(x, y)$ of the pixel $(x, y)$ and the scale value $S(x, y)$ are calculated by the weighted arithmetic and geometric mean of the corresponding pyramid pixel values, respectively:

$$C(x, y) = \frac{1}{Z_a} \sum_{i \in N(x, y)} w_i\, c_i \tag{2}$$

$$S(x, y) = \Big( \prod_{i \in N(x, y)} s_i^{\,w_i} \Big)^{1/Z_g} \tag{3}$$

where $N(x, y)$ is the set of pixels of all pyramid layers which are projected back to the original image pixel $(x, y)$, $c_i$ and $s_i$ are the confidence and scale values of pyramid pixel $i$, $w_i$ is a weighting coefficient defined as the pixel's confidence value at its pyramid layer, $Z_a$ is the arithmetic normalization factor and $Z_g$ is the geometric normalization factor. The geometric mean is taken for estimating pixel scales because of the exponential nature of image pyramid layers. The text scale map is used in local binarization for adaptively segmenting candidate CCs and the confidence map is used later in CCA for component classification.

C. Image Segmentation

To segment candidate CCs from the gray-level image, Niblack's local binarization algorithm [27] is adopted due to its high efficiency and insensitivity to image degradation.
The formula to binarize each pixel $x$ is defined as

$$b(x) = \begin{cases} 0, & \text{if } gray(x) < \mu_r(x) - k\,\sigma_r(x); \\ 255, & \text{if } gray(x) > \mu_r(x) + k\,\sigma_r(x); \\ 100, & \text{otherwise,} \end{cases} \tag{4}$$

where $\mu_r(x)$ and $\sigma_r(x)$ are the intensity mean and standard deviation (STD) of the pixels within an $r$-radius window centered on the pixel $x$, and the smoothing term $k$ is empirically set to 0.4. Different from other methods [5], [50] where the radius of windows is fixed or chosen based on some simple rules, such as the gray-level STD, we calculate the radius from the text scale map, which is more stable under noisy conditions. After local binarization, because we assume that within each local region the gray-level values of foreground pixels are higher or lower than the average intensity, connected components with value 0 or 255 are extracted as candidate text components and those of value 100 are not considered further. An example of the pre-processing stage is shown in Fig. 5. On the text confidence map, red represents high probability of text and blue represents low probability. On the text scale map, each pixel with text confidence greater than 0.9 is drawn with a circle whose radius indicates the estimated text scale.
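The two pre-processing steps described above, fusing per-layer maps and three-level Niblack binarization with a per-pixel radius, can be sketched roughly as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the names `fuse_maps` and `niblack_ternary` are hypothetical, both normalization factors are assumed to equal the sum of the weights, the window is taken as a square of half-width r, and the mapping from the scale map to the radius (not specified in the text) is left to the caller via `radius_map`.

```python
import numpy as np

def fuse_maps(conf_layers, scale_layers, eps=1e-9):
    """Fuse per-layer maps of shape (L, H, W), already projected to image size.

    Weights are the layer confidences; both normalisers are assumed to be
    the sum of the weights (an assumption, not stated in the text).
    """
    w = conf_layers
    z = w.sum(axis=0) + eps
    conf = (w * conf_layers).sum(axis=0) / z                      # weighted arithmetic mean
    scale = np.exp((w * np.log(scale_layers)).sum(axis=0) / z)    # weighted geometric mean
    return conf, scale

def niblack_ternary(gray, radius_map, k=0.4):
    """Label each pixel 0, 255 or 100 using local mean and STD.

    radius_map gives the window half-width per pixel; how it is derived
    from the text scale map is an assumption left to the caller.
    """
    g = gray.astype(float)
    h, w = g.shape
    out = np.full((h, w), 100, dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            r = max(1, int(round(radius_map[y, x])))
            win = g[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            mu, sd = win.mean(), win.std()
            if g[y, x] < mu - k * sd:
                out[y, x] = 0          # darker than local background
            elif g[y, x] > mu + k * sd:
                out[y, x] = 255        # brighter than local background
    return out
```

The double loop keeps the sketch readable; a practical version would likely use integral images so that the per-pixel window statistics cost constant time regardless of radius.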

V. CONNECTED COMPONENT ANALYSIS

For connected component analysis (CCA), we propose a conditional random field (CRF) model to assign candidate components to one of two classes ("text" and "non-text") by considering both unary component properties and binary contextual component relationships.

A. Brief Introduction to CRF

CRF [17] is a probabilistic graphical model which has been widely used in many areas such as natural language processing [34], computer vision [16], handwriting recognition [7] and document analysis [35]. Consider a set of observation variables $X = \{x_i\}$ with state variables (labels) $Y = \{y_i\}$. Let $G = (V, E)$ be a graph such that $Y$ is indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field when the probability of $y_i$ conditioned on $X$ obeys the Markovian property

$$P(y_i \mid X, y_{V \setminus \{i\}}) = P(y_i \mid X, y_{N_i}) \tag{5}$$

where $N_i$ is the neighborhood set of $y_i$. Based on the Hammersley-Clifford theorem [9], the conditional probability can be written in the form

$$P(Y \mid X) = \frac{1}{Z(X)} \prod_{c \in C} \psi_c(Y_c, X) \tag{6}$$

where $C$ is the set of all cliques in the graph, $\psi_c$ represents the real-valued potential function and $Z(X)$, defined as $\sum_{Y} \prod_{c \in C} \psi_c(Y_c, X)$, is a normalization factor (a.k.a. partition function). In implementation, the potential function can be approximated by an arbitrary real-valued energy function $E(X, Y, N, \theta)$ with parameters $\theta$ and clique set $N$, and the conditional probability can be rewritten as

$$P(Y \mid X, N, \theta) = \frac{\exp\big(-E(X, Y, N, \theta)\big)}{Z(X, N, \theta)} \tag{7}$$

The learning process of CRF models is to estimate the energy function parameters by optimizing a specific criterion. With fixed energy function parameters, an inference algorithm is used to find the optimal labels $Y^*$ of a new set of samples jointly conditioned on their observations by maximizing the conditional probability as

$$Y^* = \arg\max_{Y} P(Y \mid X, N, \theta) \tag{8}$$

Based on the definition of CRFs, we formulate CC analysis as a labeling problem: given an image containing candidate text components $X$, a 2-D undirected component neighborhood graph is constructed on them. The objective of the CRF is to find the optimal component labels with the cliques and learned energy functions.
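To make the roles of the energy, the partition function and MAP inference concrete, the following sketch enumerates all labelings of a tiny graph. It only illustrates the definitions; exhaustive enumeration is exponential in the number of nodes and is not the inference method used in the paper (which is graph cuts). The function names, the table-based unary and pairwise energies, and the Potts-style penalty in the usage note are assumptions for illustration.

```python
import itertools
import math

def total_energy(labels, unary, pairwise, edges):
    """Sum of unary energies plus pairwise energies over graph edges."""
    e = sum(unary[i][y] for i, y in enumerate(labels))
    e += sum(pairwise[(i, j)][labels[i]][labels[j]] for i, j in edges)
    return e

def posterior_and_map(unary, pairwise, edges):
    """Brute-force P(Y|X) = exp(-E)/Z over all binary labelings, and its argmax."""
    n = len(unary)
    configs = list(itertools.product((0, 1), repeat=n))   # 0 = non-text, 1 = text
    weights = [math.exp(-total_energy(c, unary, pairwise, edges)) for c in configs]
    z = sum(weights)                                      # partition function
    probs = {c: w / z for c, w in zip(configs, weights)}
    best = max(probs, key=probs.get)                      # MAP labeling
    return probs, best
```

For example, with three components in a chain where the outer two clearly look like text and the middle one is slightly ambiguous, a unit penalty on disagreeing neighbors pulls the middle label to "text": `posterior_and_map([[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]], {(0, 1): [[0, 1], [1, 0]], (1, 2): [[0, 1], [1, 0]]}, [(0, 1), (1, 2)])` returns `(1, 1, 1)` as the MAP labeling, even though the middle node's unary energy alone would favor "non-text".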
The major steps of the CCA process are described in the following.

B. Component Neighborhood Graph

Considering that neighboring text components normally have similar width or height, we build a component neighborhood graph by defining a component linkage rule (9) in terms of the Euclidean distance between two component centroids and the width and height of each component's bounding box. Here, the component centroid is calculated by averaging the x and y coordinates of all component pixels. Any two components whose spatial relationship obeys this rule are linked as neighbors in the graph.

C. Energy Function

Traditional CCA methods only consider individual component properties, and thus are prone to misclassify components when the image is cluttered and noisy. In this paper, we use the CRF model to explore contextual component relationships as well as unary component properties. Unlike the methods in references [21], [46] that use third-order cliques, we only consider second-order cliques due to their high efficiency, and our method performs sufficiently well because of the effective pre-processing stage and supervised learning of CRF parameters. The total energy function is defined as

$$E(X, Y, N, \lambda) = \sum_{i} \Big( E_{una}(x_i, y_i) + \frac{\lambda}{n_i} \sum_{j \in N_i} E_{bin}(x_i, x_j, y_i, y_j) \Big) \tag{10}$$

where $E_{una}$ and $E_{bin}$ denote the two-class ("text" and "non-text") unary energy function and the three-class (both text, both non-text, and different labels) binary energy function, respectively, $\lambda$ is a combining coefficient for balancing the unary and binary terms, and $n_i$ is the number of neighbors of $x_i$, used for normalizing different neighborhood sizes.

1) Base Classifiers: Different types of classifiers have been used for approximating the energy functions of CRFs, such as the generalized linear model [16], Gaussian mixture model [36], Boosting [41] and SVM [32]. In this work, we choose two types of artificial neural network (ANN) classifiers as base classifiers: single-layer perceptron (SLP) and multi-layer perceptron (MLP).
It has been shown that ANN classifiers offer a good trade-off between effectiveness and efficiency [1], and their optimization algorithm, back propagation (BP) based on stochastic gradient descent, is also suitable for CRF models [28]. For comparison, we also use the SVM as base classifier to take advantage of its superior generalization capability, though it has high computational complexity and its parameters cannot be estimated jointly with the CRF parameters.

2) Unary Component Features: To characterize a single component's geometric and textural properties, we use six types of unary component features as follows:

Normalized width and height. To filter out non-text components whose sizes are too large or small, we normalize the component's width and height with the scale value calculated by averaging the corresponding pixel values on the text scale map.

Aspect ratio. This feature, defined as the minimum of the ratios of the component's width to height and height to width, is used to filter out non-text components whose shape is too elongated.

Occupation ratio. This feature is defined as the ratio between the component pixel number and the component bounding box area. It is intended to remove those non-text components with too many or too few pixels within the bounding box.

Confidence. The text region texture property extracted by the HOG-WaldBoost detector in the pre-processing stage is very useful to differentiate between text and non-text components. This feature value is calculated by averaging component pixel values on the text confidence map.

Compactness. To prune non-text components with too complex a contour shape, compactness is defined as the ratio between the bounding box area and the square of the component's perimeter, whose value is the component contour pixel number.

Contour gradient. Since text components have distinct edgeness, we define this feature by averaging the Sobel gradient magnitude values of the component contour pixels.

TABLE I. UNARY FEATURES FOR A COMPONENT (x_i)

TABLE II. BINARY FEATURES FOR TWO COMPONENTS (x_i AND x_j)

3) Binary Component Features: To characterize the spatial relationship and the geometric and textural similarity between two neighboring components, we use six types of binary component features as follows:

Shape difference. Since neighboring text components usually have similar width and height, shape difference is defined as the ratio between the minimal and maximal width and height of two components, respectively.

Spatial distance. Since neighboring components are likely to have the same labels, we utilize the spatial distances between two components as features, calculated as the absolute difference between the x and y coordinates of the component centroids.

Overlap ratio. It is unlikely that two overlapping components (especially when one is contained in another) are both text components.
We measure the overlap degree between two components as the ratio between the intersecting area of the two component bounding boxes and the minimum of the two box areas, and the ratio between the intersecting area and the maximum of the two box areas.

Ordering confidence. Since text components have higher confidence values than non-text ones, the confidence values of two components in descending order are used as features.

Scale ratio. Neighboring text components also have similar scales, which are calculated from the text scale map. The ratio between the minimal and maximal scale of two components is used to capture this property.

Gray-level difference. Neighboring text components usually have similar gray levels. We calculate this feature as the absolute difference of the two components' average gray-level values.

The unary features and binary features are given in Tables I and II, respectively. In total, there are 7-D unary features and 10-D binary features used in the energy functions. The simple computation of these features ensures the efficiency of the CCA process. Some of these features have been used in previous methods [39], [46], [50] and have achieved good performance.

D. Jointly Supervised Training

A major advantage of CRF, compared with other graphical models such as MRF, is that the model parameters can be estimated by supervised training. If the base classifier (for evaluating energy functions) can be trained by gradient descent, its parameters can be estimated jointly with the CRF. For CRF models, a straightforward criterion to learn the parameters is the conditional log-likelihood (CLL), that is, to maximize the conditional probability $P(Y \mid X, N, \theta)$. In implementation, it can be transformed to minimize the negative log-likelihood, defined as

$$L_{CLL}(\theta) = -\log P(Y \mid X, N, \theta) = E(X, Y, N, \theta) + \log Z(X, N, \theta) \tag{11}$$

where $E$ is the total energy and $Z$ is the partition function. The computation of $Z$ is complicated and becomes intractable when the node number and label space are too large. In this work, we instead use two alternative criteria: saddle point approximation and LOG-likelihood of Margin.
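The component features of Section V-C are simple enough to sketch directly. The code below is an illustration, not the authors' implementation: the function names, the 4-neighbor definition of contour pixels, the dictionary layout of a component descriptor, and the way the six binary feature types are split into ten dimensions (2 + 2 + 2 + 2 + 1 + 1) are assumptions chosen to match the stated 7-D and 10-D totals, since the contents of Tables I and II are not available here.

```python
import numpy as np

def unary_features(mask, conf_map, scale_map, grad_mag):
    """7-D unary features of one component given as a boolean mask."""
    ys, xs = np.nonzero(mask)
    w = xs.max() - xs.min() + 1
    h = ys.max() - ys.min() + 1
    s = scale_map[mask].mean()                      # average estimated text scale
    # Contour = component pixels with a 4-neighbour outside the component (assumption).
    p = np.pad(mask, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    contour = mask & ~interior
    perimeter = contour.sum()
    return np.array([
        w / s, h / s,                               # normalized width and height
        min(w / h, h / w),                          # aspect ratio
        mask.sum() / (w * h),                       # occupation ratio
        conf_map[mask].mean(),                      # confidence
        (w * h) / perimeter ** 2,                   # compactness
        grad_mag[contour].mean(),                   # contour gradient
    ])

def binary_features(a, b):
    """10-D binary features; a, b are dicts with x0, y0, x1, y1 (inclusive box),
    cx, cy (centroid), conf, scale, gray. The key names are hypothetical."""
    wa, ha = a["x1"] - a["x0"] + 1, a["y1"] - a["y0"] + 1
    wb, hb = b["x1"] - b["x0"] + 1, b["y1"] - b["y0"] + 1
    iw = max(0, min(a["x1"], b["x1"]) - max(a["x0"], b["x0"]) + 1)
    ih = max(0, min(a["y1"], b["y1"]) - max(a["y0"], b["y0"]) + 1)
    inter, area_a, area_b = iw * ih, wa * ha, wb * hb
    return np.array([
        min(wa, wb) / max(wa, wb), min(ha, hb) / max(ha, hb),   # shape difference
        abs(a["cx"] - b["cx"]), abs(a["cy"] - b["cy"]),         # spatial distance
        inter / min(area_a, area_b), inter / max(area_a, area_b),  # overlap ratio
        max(a["conf"], b["conf"]), min(a["conf"], b["conf"]),   # ordering confidence
        min(a["scale"], b["scale"]) / max(a["scale"], b["scale"]),  # scale ratio
        abs(a["gray"] - b["gray"]),                             # gray-level difference
    ])
```

Because every feature is a ratio, mean, or difference over a bounding box or mask, the cost per component or component pair stays small, which is consistent with the efficiency argument made in the text.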
Saddle Point Approximation (SPA). SPA approximates the probability distribution with the maximum a posteriori (MAP) estimate, and has been used to overcome the computational intractability problem [15]. The SPA criterion has the same form as the CLL, but with the partition function approximated as in (12) and (13). These expressions involve the set of cliques that contain a given graph site, the site observations, and their MAP-estimated labels obtained by the specific inference method (graph cuts). The SPA negative log-likelihood $L_{SPA}$ (14) is then computed by substituting (12) and (13) into (11). In our implementation, the stochastic gradient descent algorithm is used to optimize the model parameters as

$$\theta \leftarrow \theta - \eta\, \frac{\partial L_{SPA}}{\partial \theta} \tag{15}$$

where $\eta$ is the learning rate and the partial derivative of the SPA loss function is given by (16).

LOG-likelihood of Margin (LOGM). Another criterion, LOGM, was recently proposed [10] for learning prototype-based classifiers, and can be generalized to the learning of structural models. The LOGM criterion is the logarithm of the smoothed classification error, differing from the minimum classification error (MCE) criterion [12] only by the logarithm, which makes the loss convex with respect to the hypothesis margin. The LOGM criterion is different from the SPA in that it approximates the normalization factor using the energy of the truth label and the minimal energy among the incorrect labels (17). The negative log-likelihood $L_{LOGM}$ (18) is then computed by substituting (13) and (17) into (11); it is expressed through the sigmoidal function, with a parameter that controls the hardness of the nonlinearity. Similar to the SPA optimization, stochastic gradient descent is used to learn the model parameters as

$$\theta \leftarrow \theta - \eta\, \frac{\partial L_{LOGM}}{\partial \theta} \tag{19}$$

where the partial derivative of the LOGM loss function is given by (20).

The model parameters include the energy function parameters (related to the base classifiers) and the combining coefficient. The optimization process of both criteria can be easily combined with the BP algorithm for training the ANN classifier parameters and the combining coefficient simultaneously. For SVM-based CRFs, we can only optimize the combining coefficient because the parameters of the SVM cannot be trained by gradient descent. Instead, we train the unary feature SVM and binary feature SVM on labeled data of components and pairs of components and then combine the trained SVMs into the CRF.

Fig. 6. Example of the CCA stage. (a) Components passing through unary classification thresholds. (b) Component neighborhood graph. (c) Text components after CRF labeling.

To find the optimal labels on an undirected graphical model with the learned CRF parameters, several inference algorithms have been proposed. We choose the graph cuts (α-expansion) algorithm [2] for its superior efficiency compared to other methods [38]. Although the classifiers cannot be seen as regular functions in graph cuts, the performance degradation is very small in experiments, consistent with the conclusion of [38].

During the training process, we use a coupling strategy to learn CRF parameters by running the learning and inference steps iteratively: at each iteration, the energy function parameters are first fixed and graph cuts are used to find the component labels. Then the energy function values on the label configuration are used to optimize the parameters based on the specific criterion. This iterative process continues until the graph energy values change very little. For LOGM training, we need to find the competing label configuration. If the optimal label configuration found by graph cuts is different from the ground-truth configuration, it is taken as the competing configuration; otherwise, we flip each node label of the ground-truth configuration in turn and take the resulting configuration with minimal energy as the competing configuration. This process does not guarantee that the result is the most competing configuration, but it works at low computation cost.

During the test process, to alleviate the computation overhead of graph inference, some apparent non-text components are first removed by using thresholds on unary component features. The thresholds are set to safely accept almost all text components in the training set. The remaining components are then assigned labels by graph cuts with the learned CRF parameters. An example of component labeling is shown in Fig. 6.

VI.
TEXT GROUPING To group text components into text regions (lines and words), we design a learning-based method by clustering neighboring components into a tree with a minimum spanning tree (MST)

PAN et al.: A HYBRID APPROACH TO DETECT AND LOCALIZE TEXTS IN NATURAL SCENE IMAGES 807 algorithm and cutting off between-line (word) edges with an energy minimization model.
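The two steps named above (tree building, then edge cutting) can be illustrated with a minimal sketch. It is not the authors' implementation. The pairwise features (absolute x and y offsets), the weights, and `toy_energy` are hypothetical stand-ins for the learned metric and the learned line energy. The toy energy assumes horizontal text lines and uses the squared deviation from the mean height instead of a fitted regression line.

```python
from itertools import combinations

def learned_distance(weights, features):
    """Distance as a linear combination of between-component features."""
    return sum(w * f for w, f in zip(weights, features))

def kruskal_mst(num_nodes, edges):
    """Kruskal's algorithm; `edges` holds (distance, i, j) tuples."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for dist, i, j in sorted(edges):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two separate clusters
            parent[root_i] = root_j
            tree.append((dist, i, j))
    return tree

def subtrees(num_nodes, linked_edges):
    """Connected groups of nodes under the edges still labeled 'linked'."""
    adjacent = {n: [] for n in range(num_nodes)}
    for _, i, j in linked_edges:
        adjacent[i].append(j)
        adjacent[j].append(i)
    seen, groups = set(), []
    for start in range(num_nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for other in adjacent[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(group))
    return groups

def greedy_cut(num_nodes, tree, energy):
    """Repeatedly cut the edge that lowers the total energy the most."""
    linked = list(tree)
    best = energy(subtrees(num_nodes, linked))
    while linked:
        trial_energy, k = min(
            (energy(subtrees(num_nodes, linked[:k] + linked[k + 1:])), k)
            for k in range(len(linked))
        )
        if trial_energy >= best:  # no cut decreases the energy: stop
            break
        best = trial_energy
        del linked[k]
    return subtrees(num_nodes, linked)

# Toy data: two horizontal rows of four component centroids each.
centroids = [(x, y) for y in (0.0, 5.0) for x in (0.0, 1.0, 2.0, 3.0)]
weights = (1.0, 1.0)  # hypothetical; the paper learns these from labeled pairs
edges = [
    (learned_distance(weights, (abs(centroids[i][0] - centroids[j][0]),
                                abs(centroids[i][1] - centroids[j][1]))), i, j)
    for i, j in combinations(range(len(centroids)), 2)
]

def toy_energy(lines, penalty=1.0):
    """Hypothetical energy: vertical scatter per line plus a line-count penalty."""
    total = penalty * len(lines)
    for line in lines:
        ys = [centroids[n][1] for n in line]
        mean_y = sum(ys) / len(ys)
        total += sum((y - mean_y) ** 2 for y in ys)
    return total

tree = kruskal_mst(len(centroids), edges)
lines = greedy_cut(len(centroids), tree, toy_energy)
print(lines)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In this toy case the tree contains a single between-row edge. Cutting it removes all vertical scatter at the cost of one extra line, and every further cut only adds to the line-count penalty, so the greedy loop stops with two lines.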


TABLE III
TEXT LINE FEATURES FOR N CUT EDGES

A. MST Building

Motivated by Yin and Liu's work on handwritten text line grouping [45], we cluster text components into a tree with the MST algorithm based on a learned distance metric. The distance between two components is defined as a linear combination of between-component features (21), in which the vector of combining weights is learned by the mean squared error criterion [1] on a dataset of component pairs labeled in two classes: within line/word and extra line/word. With the learned metric, the MST can be constructed efficiently with the Kruskal algorithm [33]. Because components belonging to the same text region are spatially close and have similar shapes, two types of binary component features (Table II), shape difference and spatial distance, are selected as distance metric features.

B. Edge Cuts

With the initial component tree built by the MST algorithm, between-line/word edges need to be cut to partition the tree into subtrees, each of which corresponds to a text unit (line or word).

1) Line Partition: We formulate edge cutting in the tree as a learning-based energy minimization problem. In the component tree, each edge is assigned one of two labels, linked or cut, and the subtrees corresponding to text lines are separated by cutting the cut edges. The objective is to find the edge labels such that the total energy of the separated subtrees is minimal. The total text line energy is defined in (22) from the number of subtrees (text lines), the feature vector of each text line, and the vector of combining weights. In our implementation, the weights are learned with the MCE criterion based on a coupling strategy of Zhou et al.
[49], wherein parameter optimization and label inference are performed alternately to approach the optimal parameter values. To characterize within-line similarity and between-line dissimilarity, we extract six features from a text line, some of which have been used in a work on online handwritten text line grouping [49]. Assuming that the initial component tree is split into candidate text lines by cutting edges, the text line features are defined as follows.

Line regression error. Assuming that each text line is approximately straight, the sum of line regression errors is used to measure this regularity. The base line of each candidate text line is fitted by least squares regression on the component centroids.

Cut score. The value of the learned distance metric is a good measure of the similarity between two components. We thus define the cut score as the sum of the logarithms of the cut edge distances.

Line height. The line height feature is defined as the sum, over the candidate text lines, of the distance between the top and bottom component centroids along the direction orthogonal to the base line, normalized by the line height. This feature is expected to be small for a normal text line.

Spatial distance. The Euclidean distance between the centroids of the two components linked by a cut edge reflects the distance between text lines. The sum of such distances is taken as a feature.

Bounding box distance. The spatial distance between the bounding boxes of the two components linked by a cut edge is another feature for characterizing the between-line spatial relationship. It is defined from the distances between the closest left-right borders and the closest top-bottom borders of the two boxes.

Fig. 7. Example of the text grouping stage. (a) Building component tree with MST. (b) Text line partition. (c) Text word partition. (d) Text localization results.

Line number.
To avoid over-splitting the component tree, the number of text lines is used as a regularization term. The text line features are tabulated in Table III. A greedy strategy similar to the grouping method of [45] is employed to find the edge labels. Initially, all edges are labeled as linked. Then, at each step, an edge is relabeled as cut if cutting it decreases the total text line energy and the decrease is larger than that obtained by cutting any of the other


edges. This step repeats until no cut decreases the total text line energy. Although this greedy strategy does not guarantee optimality, it performs satisfactorily in experiments.

2) Word Partition: To compare our system with previous methods that reported word localization results, we further partition text lines into words using a process similar to line partition. The major difference lies in the word-level features, which are defined as: 1) word number; 2) component centroid distances of cut edges; 3) component bounding box distances of cut edges; 4) bounding box distances between words separated by cut edges; 5) the ratio between the component centroid distance of the cut edge and the average component centroid distance of the edges within separated words; and 6) bounding box distance ratio between the cut edge and the edges within separated words. Features 2) to 6) are summed over the cut edges. Finally, the text words corresponding to the partitioned subtrees are extracted, and those containing components that are too small are removed as noise. An example of the text grouping stage is shown in Fig. 7.

Fig. 8. Examples of evaluation datasets. (a) ICDAR 2005 competition dataset. (b) Our multilingual dataset.

VII. EXPERIMENTS

We evaluated the performance of the proposed approach on a public dataset, the ICDAR 2005 text locating competition dataset [23], and on a multilingual image dataset collected by us.¹ On the ICDAR 2005 dataset, we evaluated the effects of different base classifiers and CRF learning criteria, and compared the performance of connected component classification and text word localization with previous results. On the multilingual dataset, we evaluated the performance of text line localization.

A. Experimental Data and Setting

The images of the ICDAR 2005 text locating competition dataset were captured with digital cameras in indoor and outdoor conditions.
The database has three sets: 1) a trial set including 258 TrialTrain images and 251 TrialTest images; 2) a sample set with 20 images; and 3) a competition set with 501 images. The text characters in these images include English and numeral characters. To evaluate the performance on texts of other languages, we also collected a multilingual image dataset containing 248 training images and 239 test images captured in natural scenes. In addition to English and numeral characters, these images contain a large number of Chinese characters. Fig. 8 shows some example images from the two datasets.

¹ For evaluation by the community, the multilingual dataset, including training and test images and ground truths of text lines, has been made available on the website.

To measure the component classification performance, given the ground-truth text component set, the detected component set, and the detected ground-truth text component set, the precision and recall rates are calculated by (23): precision is the number of detected ground-truth text components divided by the number of detected components, and recall is the same number divided by the number of ground-truth text components.

To measure the text (line or word) localization performance, we adopt the evaluation method used in the ICDAR 2005 text locating competition [23], so that our results can be fairly compared with the competition results. The method defines the precision rate and the recall rate based on the area match ratio between two text region rectangles, i.e., the ratio between the area of their intersecting rectangle and the area of the minimal rectangle containing both (24). Given the ground-truth text region set and the localized text region set, the precision rate of each detected region rectangle and the recall rate of each ground-truth text region rectangle are defined in (25), and the total precision rate and recall rate are defined in (26). Furthermore, the harmonic mean of the total precision and recall rates is used for evaluating the overall performance.
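The measures described above can be sketched in a few lines. This is an illustration under stated assumptions, not the competition's reference code. Rectangles are taken as `(x0, y0, x1, y1)` tuples. The per-rectangle score is assumed to be the best match ratio against the other set, which is how the ICDAR locating competitions are commonly described. All function names are hypothetical.

```python
def component_precision_recall(ground_truth, detected):
    """Precision and recall over sets of component identifiers."""
    hits = ground_truth & detected  # detected ground-truth components
    precision = len(hits) / len(detected) if detected else 0.0
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0
    return precision, recall

def area(rect):
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def match_ratio(a, b):
    """Intersection area over the area of the minimal rectangle containing both."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    hull = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    return area(inter) / area(hull)

def best_match(rect, others):
    """Assumed per-rectangle score: the best match ratio against the other set."""
    return max((match_ratio(rect, other) for other in others), default=0.0)

def localization_scores(ground_truth, detected):
    """Total precision, total recall, and their harmonic mean."""
    precision = (sum(best_match(d, ground_truth) for d in detected) / len(detected)
                 if detected else 0.0)
    recall = (sum(best_match(g, detected) for g in ground_truth) / len(ground_truth)
              if ground_truth else 0.0)
    total = precision + recall
    f_score = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_score

# Component-level example: two of the three detections are true text components.
cp, cr = component_precision_recall({1, 2, 3, 4}, {3, 4, 5})

# Region-level example: one exact match, one box twice as tall as its ground truth.
truth_boxes = [(0, 0, 10, 10), (20, 0, 30, 10)]
found_boxes = [(0, 0, 10, 10), (20, 0, 30, 20)]
p, r, f = localization_scores(truth_boxes, found_boxes)
print(cp, cr, p, r, f)
```

In the region-level example the second detection covers its ground-truth box but is twice as large, so its match ratio is 0.5, and the total precision, recall, and harmonic mean all come out at 0.75.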

Fig. 9. Component classification results. (a) SLP-based classifiers. (b) MLP-based classifiers. (c) SVM-based classifiers. (d) CRF with different base classifiers.

Our system was coded in C++ and run on a Pentium D 3.4 GHz desktop computer with the Windows XP OS. In all the experiments, the window size of the text region detector was fixed at 16 × 16, which can find both individual characters and character clusters. Four-orientation HOG features were extracted from each region for classification by the WaldBoost classifier. The scaling step of the image pyramid was fixed at 1.2, and the number of pyramid layers depends on the image size, since downsizing continues until the downsized image falls below a size limit.

B. Results on ICDAR 2005 Dataset

First, to evaluate the effects of different base classifiers and CRF training criteria on connected component classification, we used the TrialTrain set images for training and the TrialTest set images for testing. For comparison with the previous results of connected component classification [46] under the same condition, we also used randomly split subsets of the Sample set alternately for training and testing. For comparison with the ICDAR 2005 competition results of text word localization, we simulated the condition of the ICDAR 2005 competition: we used the Trial set (TrialTrain and TrialTest) for training classifiers and the competition set for evaluation. In each case, we collected a training set of text and non-text regions for training the WaldBoost classifier, and a disjoint set of regions (also from training images) for confidence calibration (estimating the stagewise accept and reject rates). The WaldBoost classifier is used in pre-processing to generate the text confidence map and the text scale map, which are then used in connected component analysis and text localization.
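As a small illustration of how the number of pyramid layers follows from the image size, the sketch below enumerates the layer sizes for a scaling step of 1.2. The stopping rule is an assumption made for this example only: downsizing stops once the shorter side would drop below the 16-pixel detector window. The function name and the `min_side` parameter are hypothetical.

```python
def pyramid_sizes(width, height, step=1.2, min_side=16):
    """List the (width, height) of each pyramid layer, largest first.

    `min_side` is an assumed stopping threshold, not a value from the paper.
    """
    sizes = []
    scale = 1.0
    while True:
        w, h = int(width / scale), int(height / scale)
        if min(w, h) < min_side:  # assumed stop: window no longer fits
            break
        sizes.append((w, h))
        scale *= step
    return sizes

small = pyramid_sizes(64, 48)
large = pyramid_sizes(640, 480)
print(len(small), len(large))
```

Under this assumed rule a larger input simply yields more layers, which matches the statement that the layer count depends on the image size.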
1) Connected Component Classification: This experiment was performed on the TrialTrain set (for training and validation) and the TrialTest set (for evaluation). For training the WaldBoost classifier (text region detector), we collected 4.8K text region samples and about the same number of non-text region samples from the TrialTrain images, and similar numbers of validation samples were collected for confidence calibration. In connected component analysis, we collected about 5.5K text components and 51K non-text components from the TrialTrain images for training the base classifiers and CRF parameters. For testing, about 5K text components and 75K non-text components were collected from the TrialTest images.

Three base classifiers, SLP, MLP and SVM, were used with the CRF for component classification. Two criteria, SPA and LOGM, were adopted to optimize the CRF parameters. When used as local classifiers (utilizing only unary component features), the SLP and MLP were trained using the back propagation (BP) algorithm. When utilizing both unary and binary features in the CRF, the parameters of the SLP and MLP were jointly trained with the combining coefficient, after initialization by BP. For the SVM, two classifiers were trained, for unary features and binary features, respectively, by the sequential minimal optimization (SMO) algorithm [31], and then the combining coefficient was optimized in CRF training. We used SVM classifiers with the radial basis function (RBF) kernel, whose kernel width was fixed at 2.5 empirically based on validation results. The MLP has one hidden layer, and the number of hidden units was empirically set to four times the number of input features. The component classification results of the three base classifiers and the CRFs are shown in Fig. 9.
Comparing the methods, we make the following observations. 1) Regardless of the base classifier type, the CRF-based classifier (utilizing both unary and binary features) outperforms the corresponding local classifier (utilizing only unary features). This confirms the benefit of component contextual relationships. 2) Of the two criteria for CRF training, SPA and LOGM yield comparable component classification performance. 3) Among the CRFs with different base classifiers, the CRF with MLP and the CRF with SVM outperform the one with SLP, and the CRF with MLP (jointly trained) performs slightly better than that with SVM (not jointly trained).

The comparable performance of the two training criteria can be explained as follows. In training the CRF by SPA, when the MAP label configuration is incorrect, it equals the competing label configuration of LOGM, and the loss functions of SPA and LOGM are related by (27). This monotonic relationship indicates that the parameters are tuned in similar directions. When the MAP label configuration is correct, LOGM still updates the parameters in order to enlarge the hypothesis margin. However, this turns out to make little difference to the classification performance. We nevertheless choose the LOGM criterion for CRF training in subsequent experiments because of its theoretical advantage in convergence. The CRF with SVM does not outperform that with MLP because the MLP parameters were trained jointly with the CRF combining coefficient while the SVM parameters were not. Since the SVM classifier also suffers from high computational complexity (a large number of support vectors), we choose the MLP jointly trained with the CRF by LOGM for connected component classification and the subsequent text localization.
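The monotonic relationship between the two losses can be checked numerically with a short sketch. The forms used here are assumptions reconstructed from the verbal definitions, not copied from the paper's equations. SPA is assumed to replace the normalization factor by the term of the MAP configuration alone, which gives a loss of E(truth) minus E(MAP). LOGM is assumed to be the negative logarithm of a sigmoid of the margin E(competing) minus E(truth), with the hardness parameter multiplying the margin. The function names are hypothetical.

```python
import math

def spa_loss(e_truth, e_map):
    """Assumed SPA loss: the normalizer keeps only the MAP configuration."""
    return e_truth - e_map

def logm_loss(e_truth, e_competing, hardness=1.0):
    """Assumed LOGM loss: -log sigmoid(hardness * margin)."""
    margin = e_competing - e_truth
    return math.log1p(math.exp(-hardness * margin))

# Case 1: the MAP configuration is wrong, so it is also the competing one.
e_truth, e_map = 3.0, 1.0
spa_wrong = spa_loss(e_truth, e_map)
logm_wrong = logm_loss(e_truth, e_map)
# With hardness 1, LOGM is a monotonic function of the SPA loss.
assert abs(logm_wrong - math.log1p(math.exp(spa_wrong))) < 1e-12

# Case 2: the MAP configuration is correct; the competitor has higher energy.
spa_right = spa_loss(1.0, 1.0)
logm_right = logm_loss(1.0, 1.5)
print(spa_wrong, logm_wrong, spa_right, logm_right)
```

Under these assumptions the SPA loss vanishes as soon as the MAP configuration is correct, whereas the LOGM loss stays positive and keeps pushing the margin wider, which is the behavior described in the paragraph above.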

Fig. 10. Component classification results with and without confidence feature. (a) SLP-based classifiers. (b) MLP-based classifiers. (c) SVM-based classifiers.

TABLE IV
CCA RESULTS OF OUR METHOD AND A PREVIOUS MRF-BASED METHOD

To verify the benefit of the text confidence features ("Confidence" in Table I and "Ordering Confidences" in Table II), we compared the component classification performance with and without the confidence features. Fig. 10 shows the component classification results of the three base classifiers with and without confidence features. For each base classifier, either as a local classifier or with the CRF, the confidence features improve the classification performance remarkably. This shows that the text confidence is very useful for differentiating text components from non-text ones.

2) Comparison With Previous Results: Although many works on scene text detection and localization have been reported, only a few of them [22], [46] were evaluated on the standard ICDAR 2005 dataset. Zhang and Chang [46] reported results of connected component classification, and Liu et al. [22] reported results of character detection, which is different from our problem. The method of Zhang and Chang [46] is based on mean shift region segmentation followed by higher-order MRF classification. It was evaluated on the Sample set of the ICDAR 2005 dataset by randomly splitting the 20 images into two subsets and using them alternately for training and testing. For fair comparison, we likewise randomly split the 20 images into two subsets and used them alternately for training and testing.
When using the first subset (Split1) for training the WaldBoost region detector, we collected 2K text region samples and about the same number of non-text region samples; when using the second subset (Split2) for training, the numbers of text and non-text region samples were each about 1.9K. The results of connected component classification on Split1 (trained on Split2), Split2 (trained on Split1) and the average results are shown in Fig. 11. Therein, the "Extraction rate" is defined as the ratio between the number of correctly segmented text components and the number of all text components. The average precision and recall rates of our method are compared with those of the MRF-based method [46] in Table IV. The MRF-based method [46] extracted 67% of text components during region segmentation, and obtained 85% precision and recall rates on the segmented components. Our method extracted 96.1% of text components in connected component segmentation, and obtained over 92% precision and recall rates on the segmented components. The processing speeds (average time per image) of the two methods are similar.

Fig. 11. Component classification results on the Sample set.

3) Text Word Localization: For comparison with the ICDAR 2005 competition results, we trained with all the images in the Trial set (TrialTrain and TrialTest), and evaluated on the competition set images. From the 509 Trial set images, we extracted 5K text region samples for training the WaldBoost classifier, and a similar number of non-text samples were selected from non-text regions during the training process. In total, 620 weak learners were generated to build the WaldBoost classifier. For confidence calibration, we collected a disjoint validation set of 5K text samples and a similar number of non-text samples, also from the Trial images. For training the connected component classifiers (MLP and CRF), all the text and non-text components extracted from the Trial images were used.
The text word localization performance of our method was evaluated using the same measure as the ICDAR 2005 competition on the same set of 501 images. Our results are compared with those of the two best participants of the ICDAR 2005 competition [23] in Table V, where the first one is CC-based and the second is region-based. Our method gives higher precision and recall rates than the best performing systems of the ICDAR 2005 competition. In processing speed, our method is slower than the second best system of the ICDAR 2005 competition but faster than the best system. Fig. 12 shows some examples of text extraction and word localization by the proposed method. The proposed method can detect and localize most scene texts, even when there are individual characters and some texts overlap with complex backgrounds. Nevertheless, some false positives and false negatives still exist.

TABLE V
TEXT WORD LOCALIZATION RESULTS FOR DIFFERENT METHODS

Fig. 12. Text word localization examples on the ICDAR 2005 competition dataset (from top row to bottom: original images, results of CCA and text word localization).

C. Results on Multilingual Dataset

The images in the multilingual dataset contain many Chinese characters, which are aligned in text lines without apparent spaces between words. We hence evaluated the performance of text line localization instead of word localization. From the 248 training images, we collected 5.4K text regions and about the same number of non-text regions for training the WaldBoost region detector. About 4K text components (positive samples) and 33K non-text components (negative samples) were collected for training the CRF model. There are 951 text lines in total in the 239 test images. The localization results given in Table VI show that the proposed method performs fairly well on multilingual text images as well. Fig. 13 shows some examples of text component extraction and text line localization.

Fig. 13. Text line localization examples on the multilingual dataset (from top row to bottom: original images, results of CCA and text line localization).


TABLE VI
TEXT LINE LOCALIZATION RESULTS ON MULTILINGUAL DATASET

Fig. 14. Examples of hard-to-segment images.

VIII. CONCLUDING REMARKS

In this paper, we present a hybrid approach to localize scene texts by integrating region information into a robust CC-based method. The binary contextual component relationships, in addition to the unary component properties, are integrated in a CRF model, whose parameters are jointly optimized by supervised learning. Our experimental results demonstrate that the proposed method is effective in unconstrained scene text localization in several respects: 1) region-based information is very helpful for text component segmentation and analysis; 2) by incorporating contextual component relationships, the CRF model differentiates text components from non-text components better than local classifiers; 3) joint optimization of the base classifier parameters with the CRF is beneficial; and 4) the learning-based energy minimization method can group text components into text lines (words) robustly.

Although the proposed system shows encouraging performance, it still needs further improvement. Our approach fails on some hard-to-segment texts, as shown in Fig. 14. One possible solution is to take color information into account [25]. The speed of the proposed approach (about 1.2 s per image) needs to be improved further. In addition, text recognition should be integrated with text localization to meet the need of text information extraction. Several methods based on graphical models [11], [43] have been proposed for this problem with encouraging results. Text recognition can also help reduce false positives of text extraction.

ACKNOWLEDGMENT

The authors are grateful to Prof. S. Lucas for providing the ICDAR 2005 competition dataset, to X.-D. Zhou and F. Yin for helpful discussions, and to the anonymous reviewers for valuable comments.

REFERENCES

[1] C. M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford University Press.
[2] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11.
[3] D. T. Chen, J.-M. Odobez, and H. Bourlard, "Text detection and recognition in images and video frames," Pattern Recogn., vol. 37, no. 5.
[4] X. L. Chen, J. Yang, J. Zhang, and A. Waibel, "Automatic detection and recognition of signs from natural scenes," IEEE Trans. Image Process., vol. 13, no. 1, Jan.
[5] X. R. Chen and A. L. Yuille, "Detecting and reading text in natural scenes," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'04), Washington, DC, 2004.
[6] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, 2005.
[7] S. L. Feng, R. Manmatha, and A. McCallum, "Exploring the use of conditional random field models and HMMs for historical handwritten document recognition," in Proc. 2nd Int. Conf. Document Image Analysis for Libraries (DIAL'06), Washington, DC, 2006.
[8] J. Gllavata, R. Ewerth, and B. Freisleben, "Text detection in images based on unsupervised classification of high-frequency wavelet coefficients," in Proc. 17th Int. Conf. Pattern Recognition (ICPR'04), Cambridge, U.K., 2004.
[9] J. M. Hammersley and P. Clifford, "Markov field on finite graphs and lattices," 1971, unpublished.
[10] X.-B. Jin, C.-L. Liu, and X. Hou, "Regularized margin-based conditional log-likelihood loss for prototype learning," Pattern Recogn., vol. 43, no. 7.
[11] Y. Jin and S. Geman, "Context and hierarchy in a probabilistic image model," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR'06), New York, NY, 2006.
[12] B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Trans. Signal Process., vol. 40.
[13] K. Jung, K. I. Kim, and A. K. Jain, "Text information extraction in images and video: A survey," Pattern Recogn., vol. 37, no. 5.
[14] K. I. Kim, K. Jung, and J. H. Kim, "Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 12.
[15] S. Kumar, J. August, and M. Hebert, "Exploiting inference for approximate parameter learning in discriminative fields: An empirical study," in Proc. 5th Int. Workshop on Energy Minimization Methods in Computer Vision and Pattern (EMMCVPR'05), St. Augustine, FL, 2005.
[16] S. Kumar and M. Hebert, "Discriminative random fields," Int. J. Comput. Vision, vol. 68, no. 2.
[17] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. 18th Int. Conf. Machine Learning (ICML'01), San Francisco, CA, 2001.
[18] H. P. Li, D. Doermann, and O. Kia, "Automatic text detection and tracking in digital video," IEEE Trans. Image Process., vol. 9, Jan.
[19] J. Liang, D. Doermann, and H. P. Li, "Camera-based analysis of text and documents: A survey," Int. J. Document Anal. Recogn., vol. 7, no. 2-3.
[20] R. Lienhart and A. Wernicke, "Localizing and segmenting text in images and videos," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 4.
[21] X. B. Liu, H. Fu, and Y. D. Jia, "Gaussian mixture modeling and learning of neighboring characters for multilingual text extraction in images," Pattern Recogn., vol. 41, no. 2.
[22] Y. X. Liu, S. Goto, and T. Ikenaga, "A contour-based robust algorithm for text detection in color images," IEICE Trans. Inf. Syst., vol. E89-D, no. 3.
[23] S. M. Lucas, "ICDAR 2005 text locating competition results," in Proc. 8th Int. Conf. Document Analysis and Recognition (ICDAR'05), Seoul, South Korea, 2005.
[24] M. Lyu, J. Q. Song, and M. Cai, "A comprehensive method for multilingual video text detection, localization, and extraction," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 2, 2005.


14 PAN et al.: A HYBRID APPROACH TO DETECT AND LOCALIZE TEXTS IN NATURAL SCENE IMAGES 813 [25] C. Mancas-Thillou and B. Gosselin, Color text extraction with selective metric-based clustering, Comput. Vis. Image Underst., vol. 107, no. 1-2, pp , [26] K. Murphy, A. Torralba, and W. T. Freeman, Using the forest to see the trees: A graphical model for recognizing scenes and objects, Adv. Neural Inf. Process. Syst., vol. 16, [27] W. Niblack, An Introduction to Digital Image Processing. Birkeroed, Denmark: Strandberg Publishing, [28] S. Nicolas, J. Dardenne, T. Paquet, and L. Heutte, Document image segmentation using a 2-D conditional random field model, in Proc. 9th Int. Conf. Document Analysis and Recognition (ICDAR 07), Curitiba, Brazil, 2007, pp [29] Y.-F. Pan, X. W. Hou, and C.-L. Liu, A robust system to detect and localize texts in natural scene images, in Proc. 8th IAPR Workshop on Document Analysis Syetems (DAS 08), Nara, Japan, 2008, pp [30] Y.-F. Pan, X. W. Hou, and C.-L. Liu, Text localization in natural scene images based on conditional random field, in Proc. 10th Int. Conf. Document Analysis and Recognition (ICDAR 09), Barcelona, Spain, 2009, pp [31] J. C. Platt, Fast training of support vector machines using sequential minimal optimization, in Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999, pp [32] P. Schnitzspan, M. Fritz, and B. Schiele, Hierarchical support vector random fields: Joint training to combine local and global features, in Proc. 10th European Conf. Computer Vision (ECCV 08), Berlin, Germany, 2008, pp [33] R. Sedgewick, Algorithms in C, Part 5: Graph Algorithms, 3rd ed. Boston, MA: Addison-Wesley Professional, [34] F. Sha and F. Pereira, Shallow parsing with conditional random fields, in Proc. Conf. North American Chapter Assoc. Computational Linguistics on Human Language Technology (NAACL 03), Morristown, NJ, 2003, pp [35] S. Shetty, H. Srinivasan, M. Beal, and S. 
Srihari, Segmentation and labeling of documents using conditional random fields, in Proc. Document Recognition and Retrieval XIV, Proc. SPIE, San Jose, CA, Jan. 2007, pp. 6500U [36] J. Shotton, J. M. Winn, C. Rother, and A. Criminisi, TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation, in Proc. 9th European Conf. Computer Vision (ECCV 06), Graz, Austria, 2006, pp [37] J. Sochman and J. Matas, WaldBoost Learning for time constrained sequential detection, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR 05), San Diego, CA, 2005, pp [38] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother, A comparative study of energy minimization methods for Markov random fields with smoothness-based priors, IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 6, pp , [39] H. Takahashi, Region graph based text extraction from outdoor images, in Proc. 3rd Int. Conf. Information Technology and Applications (ICITA 05), Sydney, Australia, 2005, pp [40] H. Takatsuka, M. Tanaka, and M. Okutomi, Distribution-based face detection using calibrated boosted cascade classifier, in Proc. 14th Int. Conf. Image Analysis and Processing (ICIAP 07), Modena, Italy, 2007, pp [41] A. B. Torralba, K. P. Murphy, and W. T. Freeman, Contextual models for object detection using boosted random fields, Adv. Neural Inf. Process. Syst., vol. 16, [42] J. Weinman, A. Hanson, and A. McCallum, Sign detection in natural images with conditional random fields, in Proc. 14th IEEE Workshop on Machine Learning for Signal Processing (MLSP 04), São Luís, Brazil, 2004, pp [43] J. Weinman, E. Learned-Miller, and A. Hanson, Scene text recognition using similarity and a lexicon with sparse belief propagation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 10, pp , [44] V. Wu, R. Manmatha, and E. M. Riseman, Finding text in images, in Proc. 2nd ACM Int. Conf. 
Digital Libraries (DL 97), New York, NY, 1997, pp [45] F. Yin and C.-L. Liu, Handwritten text line segmentation by clustering with distance metric learning, in Int. Conf. Frontiers in Handwriting Recognition (ICFHR 08), Montreal, Canada, 2008, pp [46] D.-Q. Zhang and S.-F. Chang, Learning to detect scene text using a higher-order MRF with belief propagation, in Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshop s (CVPRW 04), Washington, DC, 2004, pp [47] J. Zhang and R. Kasturi, Extraction of text objects in video documents: Recent progress, in Proc. 8th IAPR Workshop on Document Analysis Syetems (DAS 08), Nara, Japan, 2008, pp [48] Y. Zhong, H. J. Zhang, and A. K. Jain, Automatic caption localization in compressed video, IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 4, pp , [49] X.-D. Zhou, D.-H. Wang, and C.-L. Liu, A robust approach to text line grouping in online handwritten Japanese documents, Pattern Recogn., vol. 42, no. 9, pp , [50] K. H. Zhu, F. H. Qi, R. J. Jiang, L. Xu, M. Kimachi, Y. Wu, and T. Aizawa, Using Adaboost to detect and segment characters from natural senes, in Proc. 1st Conf. Caramera Based Document Analysis and Recognition (CBDAR 05), Seoul, South Korea, 2005, pp Yi-Feng Pan received the B.S. degree in automation from Southeast University, Nanjing, China, and the M.E. and Ph.D. degrees in pattern recognition and intelligent control from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2004, 2006, and 2010, respectively. Currently, he is a research staff member at the Information Technology Lab, Fujitsu R&D Center, Beijing, China. His research interests include pattern recognition, image processing, computer vision, and especially on text information extraction in natural scene images. Xinwen Hou received the B.S. degree from Zhengzhou University, China, in 1995, the M.S. degree from the University of Science and Technology of China in 1998, and the Ph.D. 
degree in mathematics from Beijing University, Beijing, China, in From 2001 to 2003, he was with the Department of Mathematics, Nankai University, China, as a Postdoc. He joined the National Laboratory of Pattern Recognition, Chinese Academy of Sciences (CASIA) in June His early research focused on face recognition and facial analysis. His current research interests include machine learning, pattern recognition and image recognition, especially manifold learning, ensemble learning, and subspace analysis.

Cheng-Lin Liu (SM'04) received the B.S. degree in electronic engineering from Wuhan University, Wuhan, China, the M.E. degree in electronic engineering from Beijing Polytechnic University, Beijing, China, and the Ph.D. degree in pattern recognition and intelligent control from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 1989, 1992 and 1995, respectively. He was a Postdoctoral Fellow at the Korea Advanced Institute of Science and Technology (KAIST) and later at the Tokyo University of Agriculture and Technology, Japan, from March 1996 to March From 1999 to 2004, he was a research staff member and later a senior researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. Since 2005, he has been a Professor at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China, and is now the deputy director of the laboratory. His research interests include pattern recognition, image processing, neural networks, machine learning, and especially the applications to character recognition and document analysis. He has contributed many effective methods to different aspects of handwritten document analysis. His algorithms have yielded superior performance, and have been transferred to industrial applications including mail sorting and form processing.
He has published over 100 technical papers at prestigious international journals and conferences, including IEEE TPAMI, Pattern Recognition, IEEE TNN, ICPR, and ICDAR. Dr. Liu won the IAPR/ICDAR Young Investigator Award of 2005, and received the Outstanding Youth Fund of NSFC in He is a senior member of the IEEE, a member of the ACM and the IEICE Japan, and a senior member of the Chinese Computer Federation.


I J E E E C International Journal of Electrical, Electronics ISSN No. (Online): 2277-2626 Computer Engineering 3(2): 85-90(2014) Robust Approach to Recognize Localize Text from Natural Scene Images Khushbu