Text-Edge-Box: An Object Proposal Approach for Scene Texts Localization


2017 IEEE Winter Conference on Applications of Computer Vision

Dinh Nguyen 1,3, Lu Shijian 2,3, Nizar Ouarti 1,3, Mounir Mokhtari 3,4
1 University Pierre & Marie Curie, France, 2 Institute for Infocomm Research, Singapore, 3 Image & Pervasive Access Lab, Singapore (UMI 2955), 4 Institut Mines-Telecom, France
dinh.nguyenvan@etu.upmc.fr, slu@i2r.a-star.edu.sg, nizar.ouarti@ipal.cnrs.fr, Mounir.Mokhtari@mines-telecom.fr

Abstract

Text proposal has been gaining interest in recent years due to the great success of object proposal in category-independent object localization. In this paper, we present a novel text-specific proposal technique that provides superior bounding boxes for accurate text localization in scenes. The proposed technique, which we call Text Edge Box (TEB), uses a binary edge map, a gradient map and an orientation map of an image as inputs. Connected components are first found within the binary edge map and scored by two proposed low-cue text features that are extracted from the gradient map and the orientation map, respectively. These scores represent the text probability of connected components and are aggregated in a text edge image. Scene text proposals are finally generated by grouping the connected components and estimating their likelihood of being words. The proposed TEB has been evaluated on two public scene text datasets: the ICDAR Robust Reading Competition 2013 (ICDAR2013) dataset and the Street View Text (SVT) dataset. Experiments show that the proposed TEB greatly outperforms state-of-the-art techniques.

1. Introduction

Texts in scenes provide rich semantic cues for context understanding, and automatic scene text recognition has been attracting increasing interest in recent years [1, 2]. In general, end-to-end scene text recognition comprises two major tasks: text detection and text recognition. The first task searches for and detects text regions in scenes, and the second recognizes words within the detected text regions. Leveraging the prevalent object proposal works, we propose a scene text proposal technique that localizes text regions successfully. To search for text regions in scenes, the traditional approach [3, 4, 5] exploits the sliding window strategy. However, this approach has to deal with an exhaustive search using windows of different scales and aspect ratios. Another typical approach detects text regions based on various segmentation techniques [6, 7, 8, 9]. The segmentation approach achieves promising performance but is very sensitive to different types of degradation that are often introduced by uncontrolled illumination, shadows, geometric distortions, and so on. In recent years, object proposal techniques have been widely investigated due to their capacity to locate category-independent objects. For example, Jaderberg et al. used object proposal as an initial step in their end-to-end scene text recognition work [10], which produces superior recognition performance compared with most state-of-the-art systems [11, 12]. The exploitation of object proposal for scene text localization is inspired by the observation that characters in scenes are actually quite similar to generic objects due to the high intra-class variation that is often introduced by different types of distortion. The intra-class variation increases exponentially when scene text detection moves from character level to word level following the great success of word recognition [10, 13, 14].

On the other hand, generic object proposal techniques often produce a huge number of proposals when applied to the scene text localization task. This shifts object proposals back towards the traditional sliding window approach in terms of the large search space. For scene text localization, the number of proposals can be reduced significantly by incorporating certain text-specific features [15, 16, 17]. We design a novel text-specific proposal technique for detecting text in scenes. The proposed technique makes three contributions. First, we design two low-cue text features, namely an edge pair and an edge variance, which are extracted from a gradient map and an orientation map of an image, respectively. The two features are inspired by the text-specific properties observed in [18, 19]. The edge pair feature differs from the Stroke Width Transform (SWT) [18] because we only monitor the orientations of pixels in connected components instead of their distances. The edge variance and the text-specific image contrast in [19] are both extracted from a gradient map, whereas the proposed feature captures the gradient variance, which better reflects the monotonous contrast along text boundaries.

Figure 1. The flowchart of the proposed Text Edge Box technique is shown on the left; not all proposal boxes are displayed because of their large number. On the right is a pseudo-code illustration of proposal generation; this step is explained in Section 3.2.

Second, a grouping strategy is proposed to cluster connected components into text-line proposals, which are further split into word-level proposals based on their geometric information. The grouping strategy is better suited than the sliding window to cater to the unconstrained aspect ratios of words in scenes. Third, a proposal scoring function is designed by combining the scores of connected components within a proposal region with scores of their correlations, which captures the important characteristics of scene texts effectively.

2. Related works

Object proposal techniques have been investigated in recent years to localize generic objects in scenes [20]. As an alternative to the traditional sliding-window-based object detection framework, they can locate category-independent objects with a much smaller number of image patches, hence boosting object detection and recognition efficiency significantly. Traditional object proposal techniques can be divided into two major categories: the boundary-based and the graph-connectivity-based. In the boundary-based techniques, objects are assumed to have well-defined boundaries, and proposals are produced either based on certain boundary properties and cues [21, 22] or by grouping and scoring image pixels using boundary connection [20, 23, 24, 25, 26]. In the graph-connectivity-based techniques, the connectivity of pixels, super-pixels or segments is exploited to merge them together into proposals [27, 28, 29]. Leveraging the powerful feature extraction and classification capability of deep learning architectures, an increasing number of object localization works have reported impressive performance by applying convolutional-neural-network-based proposal generation [30, 31, 32, 33].

Scene text proposal has been investigated over the past few years due to its advantage in localizing texts in scenes [15, 16, 17]. The pilot work [15] searches and ranks potential text regions based on Maximally Stable Extremal Regions (MSER) and region descriptors, with an AdaBoost classifier implemented to score text regions. It greatly outperforms direct applications of existing generic object proposals. The Symmetry-Text Line detector [16] provides text-line proposals by estimating the symmetry appearance of texts. Symmetry filters are designed to estimate text probability and provide a text heat map. Text-line proposals are found by thresholding the text heat map and are further partitioned into word-level proposals based on the distances between connected components within the original image. With a well-designed text/non-text classifier, it achieves superior detection performance compared with state-of-the-art text detection techniques. The latest approach [17] integrates a pre-trained convolutional neural network (VGG16) with the authors' own Inception Region Proposal Network. Its performance is better than that of other existing generic object proposals.

3. Text-Edge-Box proposal approach

This section describes our proposed technique, which is designed to produce word-level text proposals in scenes.
The framework of the proposed technique is shown in Figure 1. Firstly, we exploit the Canny edge detector [34] to generate a gradient map, an orientation map and a binary edge map. Pixels in the orientation map are normalized into the range of [0, π]. Connected components (CCs) are labelled within the binary map, which are further scored by a combination of two proposed low-cue text features: an edge pair feature (EP) and an edge variance feature (EV).
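As a concrete illustration of this first step, the sketch below builds the three input maps and labels the CCs with OpenCV. It is a minimal sketch under our own assumptions (Sobel derivatives for the gradient and orientation maps, illustrative Canny thresholds), not the authors' released code; the raw gradient direction is kept alongside the normalized orientation because the edge pair feature sketched later needs it.

```python
import cv2
import numpy as np

def build_input_maps(image_bgr):
    """Binary edge map, gradient map, orientation map and CC labels."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # Binary edge map from the Canny detector [34]; thresholds are illustrative.
    edges = cv2.Canny(gray, 100, 200)

    # Gradient magnitude and direction from Sobel derivatives; the magnitude
    # is rescaled to [0, 1] so that CC variances stay in a comparable range.
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    magnitude = np.hypot(gx, gy)
    magnitude /= max(magnitude.max(), 1e-12)
    direction = np.arctan2(gy, gx)          # raw direction in (-pi, pi]
    orientation = np.mod(direction, np.pi)  # normalized into [0, pi), as in the paper

    # Label connected components within the binary edge map.
    num_ccs, labels = cv2.connectedComponents((edges > 0).astype(np.uint8))
    return edges, magnitude, direction, orientation, labels, num_ccs
```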

Figure 2. Panels (a) and (b) show an example image and its binary edge map, respectively. Panel (c) shows examples of connected components: a text connected component (A) and a non-text connected component (B). The red arrows indicate the orientations of the considered pixels in the connected components, and the dashed lines are the search lines corresponding to those orientations. The pixels in the pixel pairs shown in connected component A are defined as edge pair pixels. The text connected component clearly contains far more edge pair pixels than the non-text one.

The two text features are estimated from the orientation and the gradient at the corresponding CC pixels, respectively. The CCs are then merged together to produce word-level proposals. A proposal scoring function is designed, which computes the probability of each word-level proposal being a word by combining the scores of the CCs with the scores of their relationships (correlation in component scores, component sizes, and links between pairs of components). Finally, the word-level proposals are sorted in descending order, and those with high scores are identified as words.

3.1. Text edge image generation

We first define the two proposed low-cue text features, the Edge Pair Feature (EP) and the Edge Variance Feature (EV). A CC scoring function is then presented, which assigns a score to each CC and stores it in a Text Edge Image (TEI).

3.1.1. The edge pair feature

The first feature is the edge pair (EP), which is inspired by the Stroke Width Transform method [18]. It is developed based on the supposition that CCs of text objects are likely to contain a high proportion of pixel pairs with opposite orientations, like the two example pixels of connected component A illustrated in panel (c) of Figure 2. We call such pixels edge pair pixels. To detect them, we start at each given pixel in a CC and use its orientation to define a search line. If a pixel of the same CC with the opposite orientation is found on the search line, the considered pixel and the found one are defined as edge pair pixels. The EP feature of a given CC is defined as the fraction of edge pair pixels in the CC:

$EP(CC) = \frac{N_{pp}(CC)}{N_{p}(CC)} \quad (1)$

where $N_{pp}(CC)$ and $N_{p}(CC)$ denote the number of edge pair pixels and the number of edge pixels belonging to the CC under study, respectively. A CC with a higher EP value is more likely to be a text CC. The value of this feature is in the range of [0, 1].

3.1.2. The edge variance feature

The second feature is the edge variance (EV), which measures the variance of the gradient magnitudes of the pixels in a CC. This measure is useful because the gradients of the pixels on the boundary of an individual character (or on the boundaries of characters in the same word) are often monotonous, so their variance is expected to be small. We utilize an exponential function of the gradient variance to normalize these values into the range of [0, 1] and produce high values for text CCs:

$EV(CC) = e^{-var(CC)} \quad (2)$

where $var(CC)$ denotes the variance of the gradients of the pixels in the CC.
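The two features can be sketched as follows, under simplifying assumptions of ours: "opposite orientation" is read as a gradient-direction difference close to π (using the raw direction rather than the [0, π]-normalized orientation, which would fold opposite directions onto the same value), the search line is rasterized with a fixed maximum length, and the gradient magnitude is pre-scaled to [0, 1]. The helper names are ours, not the paper's.

```python
import numpy as np

def edge_pair_feature(cc_mask, direction, max_len=50, tol=np.pi / 6):
    """EP(CC) = N_pp(CC) / N_p(CC), Eq. (1)."""
    h, w = cc_mask.shape
    ys, xs = np.nonzero(cc_mask)
    n_pair = 0
    for y, x in zip(ys, xs):
        dy, dx = np.sin(direction[y, x]), np.cos(direction[y, x])
        for t in range(1, max_len):              # walk along the search line
            py, px = int(round(y + t * dy)), int(round(x + t * dx))
            if not (0 <= py < h and 0 <= px < w):
                break
            if cc_mask[py, px]:                  # first CC pixel on the line
                diff = np.abs(np.mod(direction[y, x] - direction[py, px]
                                     + np.pi, 2 * np.pi) - np.pi)
                if diff > np.pi - tol:           # roughly opposite direction
                    n_pair += 1
                break
    return n_pair / max(len(xs), 1)

def edge_variance_feature(cc_mask, magnitude):
    """EV(CC) = exp(-var(CC)), Eq. (2); high when gradients are monotonous."""
    return float(np.exp(-np.var(magnitude[cc_mask])))
```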
3.1.3. The text edge image

The text edge image (TEI) is a score map that shows the being-text probability of each CC: pixels in a CC take the value of the CC score, and all other pixels are zero. The score of each CC is estimated as a weighted sum of its two text probability features:

$CC_{score} = \alpha \cdot EP + (1 - \alpha) \cdot EV \quad (3)$

where $\alpha$ is in the range of [0, 1]; its value is determined through the tuning process described in Section 4.2. Since both features take values in the range [0, 1], all pixels in the TEI also lie in the range of [0, 1].
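Aggregating the two features into the TEI is then a direct application of Eq. (3). A minimal sketch reusing the helpers above, with α = 0.54 taken from the tuning result in Section 4.2:

```python
import numpy as np

def text_edge_image(labels, num_ccs, magnitude, direction, alpha=0.54):
    """Score each CC with Eq. (3) and write the score into its pixels."""
    tei = np.zeros(labels.shape, dtype=np.float64)
    for cc in range(1, num_ccs):                 # label 0 is the background
        cc_mask = labels == cc
        ep = edge_pair_feature(cc_mask, direction)
        ev = edge_variance_feature(cc_mask, magnitude)
        tei[cc_mask] = alpha * ep + (1.0 - alpha) * ev   # CC score, Eq. (3)
    return tei
```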

3.2. Scene text proposal generation strategy

As shown in Figure 1(b), the CCs are first merged into text lines, which are then split into smaller subgroups corresponding to word-level proposals. Starting from a given CC (called candidate A), three properties of its bounding box $bb_A$ are exploited: the box height $h_A$, the box width $w_A$ and the box size $s_A$. A corresponding search area is designed by expanding $bb_A$: the search area width $w_{search}$ equals the image width, and the search area height $h_{search}$ is $\gamma$ times $h_A$, obtained by expanding $bb_A$ equally on both sides in the vertical direction. A CC candidate B (with properties $bb_B$, $w_B$, $h_B$ and $s_B$) is merged with candidate A to form a group if $bb_B$ satisfies: (1) the ratio of the intersection between $bb_B$ and A's search area to $s_B$ is higher than $\tau_s$; (2) the ratio between $\min(w_A, w_B)$ and $\max(w_A, w_B)$ is higher than $\tau_w$; and (3) the ratio between $\min(h_A, h_B)$ and $\max(h_A, h_B)$ is higher than $\tau_h$. The parameters $\gamma$ and $\tau_s$ are sensitive to horizontal texts, while $\tau_w$ and $\tau_h$ are sensitive to the size relationship between characters in a word; how to set these parameters is discussed in Section 4.2. To divide the text-line proposals into smaller subgroups corresponding to word-level proposals, the average horizontal distance between adjacent CC boxes is estimated, and dividing positions are placed wherever a distance is larger than this average, as sketched below.
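A minimal sketch of the merging test and the word splitting, assuming (x, y, w, h) bounding boxes and the tuned parameter values from Section 4.2; the bookkeeping of the authors' actual grouping loop may differ:

```python
def can_merge(box_a, box_b, img_w, gamma=1.5, tau_s=0.75, tau_w=0.3, tau_h=0.75):
    """Merging test for CC candidates A and B (criteria (1)-(3) above)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Search area of A: full image width, height gamma * h_A centred on box A.
    sy = ya - (gamma - 1.0) * ha / 2.0
    sh = gamma * ha
    ix = max(0, min(xb + wb, img_w) - max(xb, 0))
    iy = max(0.0, min(yb + hb, sy + sh) - max(yb, sy))
    overlap = (ix * iy) / float(wb * hb)                 # criterion (1)
    width_ratio = min(wa, wb) / float(max(wa, wb))       # criterion (2)
    height_ratio = min(ha, hb) / float(max(ha, hb))      # criterion (3)
    return overlap > tau_s and width_ratio > tau_w and height_ratio > tau_h

def split_line_into_words(boxes):
    """Split a text line at horizontal gaps larger than the average gap."""
    boxes = sorted(boxes, key=lambda b: b[0])
    gaps = [boxes[i + 1][0] - (boxes[i][0] + boxes[i][2])
            for i in range(len(boxes) - 1)]
    if not gaps:
        return [boxes]
    avg_gap = sum(gaps) / len(gaps)
    words, current = [], [boxes[0]]
    for gap, box in zip(gaps, boxes[1:]):
        if gap > avg_gap:
            words.append(current)
            current = []
        current.append(box)
    words.append(current)
    return words
```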
3.3. Ranking

This section elaborates a strategy to produce a list of proposals ranked in decreasing priority order. Four measures are defined, $S_a$, $S_c$, $S_h$ and $S_o$, all normalized into the range of [0, 1]. $S_a$ is the average score of the CCs within a word-level proposal region, where the score of each CC is defined in Eq. 3. $S_c$, $S_h$ and $S_o$ indicate the affinity among the grouped CCs. These measures are designed so that a proposal covering a word obtains high values. In particular, they are calculated from the variance of the scores of the grouped CCs, the variance of the bounding-box heights of the CCs in a proposal region, and the variance of the angles between the lines linking the centroids of neighbouring CCs and the horizontal axis, respectively; these angles are adjusted into the range of [0, π]. Generally, a proposal has a high likelihood of being a word if (1) its CCs have similar scores, (2) the heights of the CCs are approximately stable, and (3) the lines connecting the CCs point in approximately the same direction. The variances of these measures are therefore expected to be small for a word region proposal. To derive high $S_c$, $S_h$ and $S_o$ values for a group of CCs that is likely to be a word, and to normalize the measures into the range of [0, 1], an arctan function is applied to each measure:

$S_x = \frac{2}{\pi} \arctan\left(\frac{k_x}{var_x}\right) \quad (4)$

where $x$ stands for $c$, $h$ or $o$, and $var_x$ refers to the variance of the CCs' scores, heights and angles, respectively. The parameter $k_x$ is set at the middle of each measure's range, i.e. 0.5, half of the image height, and $\pi/2$ for $S_c$, $S_h$ and $S_o$, respectively. The score of a proposal region, $S_p$, is computed as:

$S_p = S_a \cdot \prod_{x \in \{c, h, o\}} \arctan(k_1 \cdot S_x) \quad (5)$

where the factors $\arctan(k_1 \cdot S_x)$ control the relationship between $S_p$ and $S_a$. If an $S_x$ makes $\arctan(k_1 \cdot S_x)$ higher than 1, we say that $S_x$ has a supporting effect ($S_p > S_a$); if it makes the value smaller than 1, it has a penalizing effect ($S_p < S_a$). In other words, even if a proposal has a high $S_a$ value, it is unlikely to be a word proposal if its CCs yield low $S_x$ values (the penalizing effect; see the low-score candidate in panel (c) of Figure 1). The parameter $k_1$ adjusts the influence of the $S_x$ measures: as $k_1$ increases, their influence is reduced. In this work, we expect $\arctan(k_1 \cdot S_x)$ to have a supporting effect whenever $S_x$ is higher than the middle value of its range, 0.5, and vice versa; the parameter $k_1$ is therefore set to 3, as $\arctan(3 \times 0.5) \approx 1$. Note that the $S_x$ measures are considered only when the number of CCs in a proposal is larger than 3; otherwise, $S_p$ is calculated from $S_a$ alone.
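The ranking step can be sketched as below, assuming per-CC scores, heights and centroids are available from the grouping stage. Note that the product over the three arctan factors is our reading of Eq. (5); the $k_x$ choices, $k_1 = 3$ and the more-than-3-CCs condition follow the text above.

```python
import numpy as np

def proposal_score(cc_scores, cc_heights, cc_centroids, image_height, k1=3.0):
    """Rank score S_p of a word-level proposal (Eqs. 4 and 5)."""
    s_a = float(np.mean(cc_scores))
    if len(cc_scores) <= 3:               # too few CCs: fall back to S_a alone
        return s_a

    # Angles of the lines linking neighbouring centroids, adjusted into [0, pi).
    pts = np.asarray(cc_centroids, dtype=np.float64)
    d = np.diff(pts, axis=0)
    angles = np.mod(np.arctan2(d[:, 1], d[:, 0]), np.pi)

    def s_x(values, k_x):                 # Eq. (4): S_x = (2/pi) arctan(k_x / var_x)
        var = np.var(values) + 1e-12      # guard against zero variance
        return (2.0 / np.pi) * np.arctan(k_x / var)

    s_c = s_x(cc_scores, 0.5)                     # k_c: middle of the score range
    s_h = s_x(cc_heights, image_height / 2.0)     # k_h: half the image height
    s_o = s_x(angles, np.pi / 2.0)                # k_o: middle of the angle range

    # Eq. (5): each arctan(k1 * S_x) > 1 supports S_a, < 1 penalizes it.
    return s_a * float(np.prod([np.arctan(k1 * s) for s in (s_c, s_h, s_o)]))
```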

4. Experiments and results

4.1. Experiment set-up

The proposed technique takes a scene image as input and generates proposals that cover the locations of words. The target is to achieve high recall with a small number of proposals. The optimal parameters of the proposed system are estimated on the training sets, and the system's performance is evaluated on the testing sets, of two public datasets: the ICDAR Robust Reading Competition 2013 dataset (ICDAR2013) [1] and the Street View Text dataset (SVT) [35]. The two datasets contain 229 and 101 images for training and 233 and 249 images for testing, respectively. The SVT dataset is more challenging than the ICDAR2013 dataset because it includes images with heavy noise, poor lighting, low contrast and low resolution.

Table 1. The detection rate (in %) of the proposed technique for varying α values (0.3, 0.4, 0.5, 0.52, 0.54, 0.56, 0.58, 0.6 and 0.7) and different IoU thresholds on the combined training sets of the two scene text datasets ICDAR2013 and SVT; the maximum number of proposal regions is 5,000.

The proposed TEB has been compared with three scene text proposal algorithms: the simple text-specific selective search (TP) [15], the Symmetry-Text Line (STL) [16] and the DeepText (DT) [17]. In addition, it is also compared with other generic object proposal methods, including the EdgeBox (EB) [22], the Geodesic Object Proposals (GOP) [24], the Randomized Prim (RP) [28] and the Multiscale Combinatorial Grouping (MCG) [23]. The parameters of the TP, the STL and the other object proposal algorithms are kept at the defaults recommended for their best performance. Note that the implementation of the DT has not been released; the comparison with this method is based on the results reported in the published paper [17]. We cap the proposal number at 5,000. Because of the large image sizes (1194×870 on average, W×H) and the large size range of texts (widths from 4 to 2146 pixels and heights from 3 to 785 pixels) in the two datasets, this cap emphasises the advantage of object proposals over the sliding-window-based exhaustive search strategy. Besides, it also helps to reduce the computational cost of scene text recognition. Due to the diversity of colour, lighting and size of text objects in scenes, the proposed algorithm is run with different colour representations (grey and RGB) and different scales (from 0.1 to 1 with a step of 0.3) to increase the chance of finding positive proposals.

4.2. Parameters tuning

Figure 3. The difference between one-to-one, one-to-many and many-to-one overlaps. The red boxes are the ground truth boxes and the green dashed boxes are the proposal regions.

We follow the evaluation method widely used for evaluating object proposals, as described in [17, 22, 30, 33]. It considers only one-to-one overlaps between proposal regions and ground truth boxes. The same evaluation method was used in the ICDAR2003 competition [36] and is much more constrained than the framework used in the ICDAR2013 competition [37], which also considers one-to-many and many-to-one overlaps for detection evaluation, as illustrated in Figure 3. The proposed technique is evaluated based on the detection rate under various testing conditions formed by combinations of a given number of proposals and an intersection-over-union (IoU) threshold. The IoU measures how well proposals overlap with ground truth boxes; a higher IoU threshold requires better overlap. Generally, an IoU threshold of 0.5 is acceptable for deciding whether objects have been located [38]. However, higher IoU thresholds are usually expected because of unpredictable word proposals, as mentioned in [10], which used the EdgeBox proposals [22] to find text locations in scenes. In addition, good object proposal algorithms are expected to produce a small number of proposals [39]. The proposed technique involves five specific parameters: $\gamma$, $\tau_s$, $\tau_w$, $\tau_h$ and $\alpha$. To improve the robustness of the proposed system under a wide diversity of text appearance, these five parameters are determined on the combined training sets of the ICDAR2013 dataset (for high-contrast texts) and the SVT dataset (for blurred texts).
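All detection-rate evaluations in this section reduce to a one-to-one IoU test between a proposal box and a ground truth box; a minimal sketch with (x, y, w, h) boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / float(union) if union > 0 else 0.0
```

A ground truth box counts as detected if some proposal reaches the chosen IoU threshold, and the detection rate is the fraction of ground truth boxes detected.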
We first focus on generating high-quality groups of proposals that maximize the overlap with the ground truth by varying the four parameters $\gamma$, $\tau_s$, $\tau_w$ and $\tau_h$, while the ranking step is ignored. These four parameters are tuned in the ranges [1, 2], [0.5, 1], [0.1, 1] and [0.5, 1], respectively, with a step of 0.05. All generated proposals are collected for evaluating the detection rate. The best values of the four parameters are found to be $\gamma = 1.5$, $\tau_s = 0.75$, $\tau_w = 0.3$ and $\tau_h = 0.75$. After obtaining good proposals, we concentrate on scoring them and shifting the likely-to-be-text proposals to the top of the list by sorting the found groups in descending order. The parameter $\alpha$ is estimated for this purpose. It controls the contribution of the two proposed features (EP and EV), which is reflected in the values of $S_a$, $S_c$ and $S_p$ in the scoring function (Eqs. 4 and 5). As the results in Table 1 show, when the number of proposals is capped at 5,000, $\alpha$ values around 0.54 and 0.56 provide the optimal performance on the combined training set under many IoU thresholds. We therefore set $\alpha$ to 0.54 for all experiments, including the comparison with other state-of-the-art methods.
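The two-stage tuning just described amounts to a plain grid search. In the sketch below, evaluate_recall and evaluate_ranked_recall are hypothetical stand-ins for the detection-rate measurement on the combined training sets:

```python
import itertools
import numpy as np

def tune_parameters(evaluate_recall, evaluate_ranked_recall):
    # Stage 1: grid search over (gamma, tau_s, tau_w, tau_h), step 0.05.
    grids = [np.arange(1.0, 2.0 + 1e-9, 0.05),    # gamma in [1, 2]
             np.arange(0.5, 1.0 + 1e-9, 0.05),    # tau_s in [0.5, 1]
             np.arange(0.1, 1.0 + 1e-9, 0.05),    # tau_w in [0.1, 1]
             np.arange(0.5, 1.0 + 1e-9, 0.05)]    # tau_h in [0.5, 1]
    best_group = max(itertools.product(*grids),
                     key=lambda p: evaluate_recall(*p))

    # Stage 2: sweep alpha over the candidate values reported in Table 1.
    alphas = [0.3, 0.4, 0.5, 0.52, 0.54, 0.56, 0.58, 0.6, 0.7]
    best_alpha = max(alphas,
                     key=lambda a: evaluate_ranked_recall(best_group, a))
    return best_group, best_alpha
```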

Figure 4. The detection rate versus the number of proposals (top row) and versus the IoU threshold (bottom row) for the Text Edge Box (TEB) and the other state-of-the-art algorithms, including the simple text-specific selective search (TP) [15], the Symmetry-Text Line (STL) [16], the DeepText (DT) [17], the EdgeBox (EB) [22], the Geodesic Object Proposals (GOP) [24], the Randomized Prim (RP) [28] and the Multiscale Combinatorial Grouping (MCG) [23], on the ICDAR2013 dataset.

Table 2. The number of proposal regions needed to reach different recall rates (50% and 75%) at IoU = 0.7 (columns: TEB, TP, STL, EB, RP, MCG, GOP, DT; rows: ICDAR2013 at 50% and 75% recall, SVT at 50% recall). No technique can reach a 75% recall rate at IoU = 0.7 on the SVT dataset. The character 'N' means that no information is available for the comparison; the symbol '-' means that the technique cannot reach the given recall rate.

Table 3. The processing time (in seconds) of the algorithms (TEB, TP, STL, EB, RP, MCG, GOP, DT) on the two popular scene text datasets ICDAR2013 and SVT. The character 'N' means that no information is available for the comparison. The processing time of the STL technique is 6 times slower than its authors' report [16] because we used all of its generated text-line proposals to produce word-level proposals for a fair comparison.

4.3. Experimental results

Figure 4 illustrates the performance of the proposed technique and compares it with state-of-the-art techniques. In the top row, the detection rate versus the number of proposals on the ICDAR2013 dataset is evaluated under three different IoU thresholds, i.e. 0.5, 0.7 and 0.9. The TEB algorithm clearly outperforms the other methods at the different IoU values once the number of proposals grows large enough. The DT leverages a deep learning model for scoring proposal regions; its performance is therefore very competitive for small numbers of proposals, especially at IoU = 0.5, because the deep learning model has an advantage in recognizing non-text regions and eliminating them from the generated proposal list. However, when the IoU threshold increases to 0.7 and 0.9, our proposed system localizes scene texts more successfully. The TP is the most competitive technique when a huge number of proposals is accepted. The EB shows better results than the TP when the number of proposals is small, but its performance deteriorates as the number of proposals increases. The bottom row shows the second experiment, which estimates the detection rate versus the IoU threshold for different numbers of proposals (100, 500, ...). The TEB significantly outperforms the other methods (excluding the DT) under the different proposal budgets. When the number of proposals increases and the IoU requirement becomes more constrained, the proposed TEB performs better than the DT. In addition, we also test the minimum number of proposals required to obtain different desired recalls.

Figure 5. The performance of the end-to-end word spotting systems constructed from the compared proposal techniques and the word recognition model [40]. The performance of RegModel is the result of the word recognition model [40] tested on the ground truth of the testing sets of the two scene text datasets ICDAR2013 and SVT.

Figure 6. Examples of SVT ground truth boxes that our proposals cannot localize at an IoU threshold of 0.7. The red boxes are the ground truths and the green boxes are our proposals. The proposal boxes are much smaller and fit the scene text objects more tightly than the ground truth boxes.

Hosang et al. [41] show that this criterion correlates well with detection performance, and it has been used to evaluate proposal quality in the EB [22] and the HyperNet [33]. Table 2 shows the experimental results on the two datasets. On the ICDAR2013 dataset, the TEB algorithm always requires the smallest number of proposal regions. On the SVT dataset, the TEB performs slightly worse than the EB algorithm but better than the other state-of-the-art algorithms. On the other hand, the minimum numbers of proposal regions required are clearly larger than those for the ICDAR2013 dataset. Besides the poorer image quality of the SVT dataset, one important reason for the lower performance lies in the ground truth of the SVT dataset, where the manually labelled bounding boxes are often much larger than the actual boxes. This is illustrated in Figure 6, where the ground truth boxes in red are clearly much larger than the boxes produced by the proposed TEB in green. Furthermore, the IoU-based evaluation has certain limitations in cases where proposals have small overlap with the ground truth boxes but still cover the entire objects, as illustrated in Figure 6.

We also adopted another evaluation that uses word recognition models to estimate the quality of proposals. The well-known word recognition model provided by Jaderberg et al. [40] is used for this additional task. A proposal is a correct localization if it overlaps with one of the ground truth boxes and provides enough information for the recognition model to recognize the correct word; a better proposal technique thus achieves a higher F-score at the output of the recognition model. The quality of the recognition model is first estimated on the ground truth boxes of the testing sets of the two datasets; its F-scores on the ICDAR2013 and SVT datasets are presented in Figure 5 as the RegModel performance. This is the maximum performance that each proposal technique could obtain if it provided proposals that matched the ground truth boxes perfectly. As shown in Figure 5, the TEB method produces the largest number of good proposals that help the recognition model read the contained words correctly. In addition, the performance of the proposed TEB algorithm changes only slightly when the number of proposals increases from 1000 to 5000 on both datasets. This indicates that the proposed technique ranks proposals better than the other techniques, so most good proposals are ranked at the top of the list.

The efficiency of the proposed technique is also evaluated based on execution time. All of the above techniques are evaluated on the same computer and executed in a single thread (Intel Xeon CPU).
As presented in Table 3, the proposed TEB is comparable to the most efficient methods, except for the original EB method. However, the original EB method does not perform well in terms of the minimum proposal number required and the maximum recall obtained. For the DeepText method [17], the authors have not released their program, so we cannot report its processing time on our device. According to their report,

Figure 7. Examples from the ICDAR2013 dataset that our algorithm fails to localize. The red boxes are ground truths and the green boxes are our proposal regions.

their algorithm takes 1.7 seconds per image on average on the ICDAR2013 dataset, on their device with a single K40 GPU, which is much more powerful than the one we used.

5. Discussion

One distinctive feature of the proposed TEB is that it does not rely on any classifier to eliminate non-text proposals (as implemented in the TP and the DT). It simply uses the two proposed features and the geometric relationships among CCs to rank proposals. Nevertheless, very good performance was obtained on the two public scene text datasets, which demonstrates the effectiveness of the proposed text-specific proposal technique. Besides, when testing on the SVT dataset, which includes a certain amount of non-horizontal text lines, the TEB is still competitive compared with the TP, which is designed without any horizontal restriction. We observe that the proposed algorithm can handle multi-orientation text lines. As discussed in Section 3.2, the horizontal text-line assumption can be relaxed by increasing the parameter γ for a larger search space and reducing the parameter $\tau_s$ to retain more CCs. The scoring function should then be upgraded to handle the huge number of CCs in the merged groups; we will investigate this in future work.

Figure 7 illustrates several typical scenarios where our algorithm often fails to provide good proposals, including ultra-low contrast (a.1), complex background (a.2), very small text size (a.3), and uneven illumination (a.2, a.4). In particular, the edges of texts in a complex background are often connected with the edges of other objects, so the edge pair feature may not be extracted reliably. Similarly, when text objects are covered by shadow or uneven illumination, the shapes of the text boundaries are destroyed, and the text edge features may not be extracted properly either. In the low-contrast case, the text edges can be missed by the Canny edge detector because of their ultra-low gradient magnitudes.

Figure 8. Some example outputs of the proposed technique applied to pill images.

6. Application

An application is under development that aims to use scene text detection and recognition techniques to support elderly people in reading tasks. In particular, we want to apply it to capture the imprint features for pill recognition, which has been widely studied in recent years [42, 43, 44]. Based on the top 1000 proposals in the list, we can locate the imprint areas correctly, as illustrated in Figure 8.

7. Conclusion

In this paper, we proposed a text-specific proposal algorithm to search for text regions in scenes. Two text-specific features, namely an edge pair and an edge variance, were designed to search for likely text components. To measure the text likelihood of the proposal boxes, we designed a scoring function that computes word probability based on the correlations of connected components in their scores, heights, and connection orientations. The effectiveness of the proposed technique has been demonstrated by its superior performance compared with other state-of-the-art algorithms.

References

[1] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez i Bigorda, S. Robles Mestre, J. Mas, D. Fernandez Mota, J. A. Almazán, and L. P. de las Heras, "ICDAR 2013 robust reading competition," in Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, IEEE Computer Society, 2013.
[2] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, "ICDAR 2015 competition on robust reading," in Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, IEEE Computer Society, 2015.
[3] A. Mishra, K. Alahari, and C. V. Jawahar, "Enhancing energy minimization framework for scene text recognition with top-down cues."
[4] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.
[5] K. Wang and S. Belongie, "Word spotting in the wild," in European Conference on Computer Vision (ECCV), Heraklion, Crete, Sept. 2010.
[6] W. Huang, Y. Qiao, and X. Tang, "Robust scene text detection with convolution neural network induced MSER trees," in Computer Vision - ECCV 2014, 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV.
[7] C. Zhang, C. Yao, B. Shi, and X. Bai, "Automatic discrimination of text and non-text natural images," in 13th International Conference on Document Analysis and Recognition (ICDAR 2015), Tunis, Tunisia, August 23-26, 2015.
[8] M. Sung, B. Jun, H. Cho, and D. Kim, "Scene text detection with robust character candidate extraction method," in 13th International Conference on Document Analysis and Recognition (ICDAR 2015), Tunis, Tunisia, August 23-26, 2015.
[9] S. Qin and R. Manduchi, "A fast and robust text spotter," in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2016.
[10] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Reading text in the wild with convolutional neural networks," CoRR.
[11] A. Gordo, "Supervised mid-level features for word image representation," CoRR.
[12] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in European Conference on Computer Vision, 2014.
[13] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word spotting and recognition with embedded attributes," TPAMI.
[14] B. Su and S. Lu, "Accurate scene text recognition based on recurrent neural network," in Computer Vision - ACCV 2014, 12th Asian Conference on Computer Vision, Singapore, November 1-5, 2014, Revised Selected Papers, Part I.
[15] L. Gomez and D. Karatzas, "TextProposals: a text-specific selective search algorithm for word spotting in the wild," arXiv preprint arXiv:1604.02619.
[16] Z. Zhang, W. Shen, C. Yao, and X. Bai, "Symmetry-based text line detection in natural scenes," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015.
[17] Z. Zhong, L. Jin, S. Zhang, and Z. Feng, "DeepText: A unified framework for text proposal generation and text detection in natural images," CoRR.
[18] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in CVPR, IEEE, 2010.
[19] S. Lu, T. Chen, S. Tian, J.-H. Lim, and C.-L. Tan, "Scene text extraction based on edges and support vector regression," Int. J. Doc. Anal. Recognit., vol. 18, June 2015.
[20] B. Alexe, T. Deselaers, and V. Ferrari, "What is an object?," in The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, CA, USA, June 2010.
[21] Z. Zhang, Y. Liu, T. Bolukbasi, M. Cheng, and V. Saligrama, "BING++: A fast high quality object proposal generator at 100fps," CoRR.
[22] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in ECCV, 2014.
[23] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in Computer Vision and Pattern Recognition, 2014.
[24] P. Krähenbühl and V. Koltun, "Geodesic object proposals," in Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, Cham: Springer International Publishing.
[25] A. Humayun, F. Li, and J. M. Rehg, "RIGOR: Reusing inference in graph cuts for generating object regions," in Computer Vision and Pattern Recognition (CVPR), Proceedings of the IEEE Conference on, IEEE, June 2014.
[26] E. Rahtu, J. Kannala, and M. B. Blaschko, "Learning a category independent object detection cascade," in IEEE International Conference on Computer Vision, 2011.
[27] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, 2013.
[28] S. Manén, M. Guillaumin, and L. Van Gool, "Prime object proposals with randomized Prim's algorithm," in ICCV, Dec. 2013.
[29] J. Carreira et al., "Constrained parametric min-cuts for automatic object segmentation."

[30] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497.
[31] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," arXiv preprint.
[32] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," CoRR, vol. abs/1506.02640.
[33] T. Kong, A. Yao, Y. Chen, and F. Sun, "HyperNet: Towards accurate region proposal generation and joint object detection," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[34] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, June 1986.
[35] K. Wang and S. Belongie, "Word spotting in the wild," in Computer Vision - ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part I, Berlin, Heidelberg: Springer.
[36] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in Proceedings of the Seventh International Conference on Document Analysis and Recognition, IEEE Press, 2003.
[37] C. Wolf and J.-M. Jolion, "Object count/area graphs for the evaluation of object detection and segmentation algorithms," International Journal on Document Analysis and Recognition, 2006.
[38] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge 2009 (VOC2009) results."
[39] J. H. Hosang, R. Benenson, and B. Schiele, "How good are detection proposals, really?," CoRR.
[40] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in European Conference on Computer Vision, 2014.
[41] J. H. Hosang, R. Benenson, P. Dollár, and B. Schiele, "What makes for effective detection proposals?," CoRR.
[42] Y.-B. Lee, U. Park, and A. K. Jain, "Pill-ID: Matching and retrieval of drug pill imprint images," Tech. Rep. MSU-CSE-10-4, Department of Computer Science, Michigan State University, East Lansing, Michigan, February 2010.
[43] J. Yu, Z. Chen, S. Kamata, and J. Yang, "Accurate system for automatic pill recognition using imprint information," IET Image Processing, vol. 9, no. 12, 2015.
[44] R. Palenichka, A. Lakhssassi, and M. Palenichka, "Visual attention-guided approach to monitoring of medication dispensing using multi-location feature saliency patterns," in The IEEE International Conference on Computer Vision (ICCV) Workshops, December.


More information

arxiv: v2 [cs.cv] 27 Feb 2018

arxiv: v2 [cs.cv] 27 Feb 2018 Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation arxiv:1802.08948v2 [cs.cv] 27 Feb 2018 Pengyuan Lyu 1, Cong Yao 2, Wenhao Wu 2, Shuicheng Yan 3, Xiang Bai 1 1 Huazhong

More information

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK 1 Po-Jen Lai ( 賴柏任 ), 2 Chiou-Shann Fuh ( 傅楸善 ) 1 Dept. of Electrical Engineering, National Taiwan University, Taiwan 2 Dept.

More information

Real-time Object Detection CS 229 Course Project

Real-time Object Detection CS 229 Course Project Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection

More information

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network Liwen Zheng, Canmiao Fu, Yong Zhao * School of Electronic and Computer Engineering, Shenzhen Graduate School of

More information

Arbitrary-Oriented Scene Text Detection via Rotation Proposals

Arbitrary-Oriented Scene Text Detection via Rotation Proposals 1 Arbitrary-Oriented Scene Text Detection via Rotation Proposals Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, Xiangyang Xue Fig. 1 arxiv:1703.01086v3 [cs.cv] 15 Mar 018 Abstract

More information

Towards Visual Words to Words

Towards Visual Words to Words Towards Visual Words to Words Text Detection with a General Bag of Words Representation Rakesh Mehta Dept. of Signal Processing, Tampere Univ. of Technology in Tampere Ondřej Chum, Jiří Matas Centre for

More information

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection Zeming Li, 1 Yilun Chen, 2 Gang Yu, 2 Yangdong

More information

Deep Direct Regression for Multi-Oriented Scene Text Detection

Deep Direct Regression for Multi-Oriented Scene Text Detection Deep Direct Regression for Multi-Oriented Scene Text Detection Wenhao He 1,2 Xu-Yao Zhang 1 Fei Yin 1 Cheng-Lin Liu 1,2 1 National Laboratory of Pattern Recognition (NLPR) Institute of Automation, Chinese

More information

Scene Text Recognition using Co-occurrence of Histogram of Oriented Gradients

Scene Text Recognition using Co-occurrence of Histogram of Oriented Gradients 203 2th International Conference on Document Analysis and Recognition Scene Text Recognition using Co-occurrence of Histogram of Oriented Gradients Shangxuan Tian, Shijian Lu, Bolan Su and Chew Lim Tan

More information

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation Object detection using Region Proposals (RCNN) Ernest Cheung COMP790-125 Presentation 1 2 Problem to solve Object detection Input: Image Output: Bounding box of the object 3 Object detection using CNN

More information

Finding Tiny Faces Supplementary Materials

Finding Tiny Faces Supplementary Materials Finding Tiny Faces Supplementary Materials Peiyun Hu, Deva Ramanan Robotics Institute Carnegie Mellon University {peiyunh,deva}@cs.cmu.edu 1. Error analysis Quantitative analysis We plot the distribution

More information

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang SSD: Single Shot MultiBox Detector Author: Wei Liu et al. Presenter: Siyu Jiang Outline 1. Motivations 2. Contributions 3. Methodology 4. Experiments 5. Conclusions 6. Extensions Motivation Motivation

More information

Detecting and Recognizing Text in Natural Images using Convolutional Networks

Detecting and Recognizing Text in Natural Images using Convolutional Networks Detecting and Recognizing Text in Natural Images using Convolutional Networks Aditya Srinivas Timmaraju, Vikesh Khanna Stanford University Stanford, CA - 94305 adityast@stanford.edu, vikesh@stanford.edu

More information

Unified, real-time object detection

Unified, real-time object detection Unified, real-time object detection Final Project Report, Group 02, 8 Nov 2016 Akshat Agarwal (13068), Siddharth Tanwar (13699) CS698N: Recent Advances in Computer Vision, Jul Nov 2016 Instructor: Gaurav

More information

Scene text extraction based on edges and support vector regression

Scene text extraction based on edges and support vector regression IJDAR (2015) 18:125 135 DOI 10.1007/s10032-015-0237-z SPECIAL ISSUE PAPER Scene text extraction based on edges and support vector regression Shijian Lu Tao Chen Shangxuan Tian Joo-Hwee Lim Chew-Lim Tan

More information

arxiv: v2 [cs.cv] 10 Jul 2017

arxiv: v2 [cs.cv] 10 Jul 2017 EAST: An Efficient and Accurate Scene Text Detector Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang Megvii Technology Inc., Beijing, China {zxy, yaocong, wenhe, wangyuzhi,

More information

arxiv: v1 [cs.cv] 31 Mar 2016

arxiv: v1 [cs.cv] 31 Mar 2016 Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arxiv:1603.09742v1 [cs.cv] 31 Mar 2016 University of Southern California Abstract.

More information

12/12 A Chinese Words Detection Method in Camera Based Images Qingmin Chen, Yi Zhou, Kai Chen, Li Song, Xiaokang Yang Institute of Image Communication

12/12 A Chinese Words Detection Method in Camera Based Images Qingmin Chen, Yi Zhou, Kai Chen, Li Song, Xiaokang Yang Institute of Image Communication A Chinese Words Detection Method in Camera Based Images Qingmin Chen, Yi Zhou, Kai Chen, Li Song, Xiaokang Yang Institute of Image Communication and Information Processing, Shanghai Key Laboratory Shanghai

More information

arxiv: v1 [cs.cv] 1 Sep 2017

arxiv: v1 [cs.cv] 1 Sep 2017 Single Shot Text Detector with Regional Attention Pan He1, Weilin Huang2, 3, Tong He3, Qile Zhu1, Yu Qiao3, and Xiaolin Li1 arxiv:1709.00138v1 [cs.cv] 1 Sep 2017 1 National Science Foundation Center for

More information

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS Kuan-Chuan Peng and Tsuhan Chen School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

More information

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors [Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors Junhyug Noh Soochan Lee Beomsu Kim Gunhee Kim Department of Computer Science and Engineering

More information

Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies

Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) e-isjn: A4372-3114 Impact Factor: 7.327 Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies Research Article

More information

Learning to Generate Object Segmentation Proposals with Multi-modal Cues

Learning to Generate Object Segmentation Proposals with Multi-modal Cues Learning to Generate Object Segmentation Proposals with Multi-modal Cues Haoyang Zhang 1,2, Xuming He 2,1, Fatih Porikli 1,2 1 The Australian National University, 2 Data61, CSIRO, Canberra, Australia {haoyang.zhang,xuming.he,

More information

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material Yi Li 1, Gu Wang 1, Xiangyang Ji 1, Yu Xiang 2, and Dieter Fox 2 1 Tsinghua University, BNRist 2 University of Washington

More information

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b 1

More information

Dot Text Detection Based on FAST Points

Dot Text Detection Based on FAST Points Dot Text Detection Based on FAST Points Yuning Du, Haizhou Ai Computer Science & Technology Department Tsinghua University Beijing, China dyn10@mails.tsinghua.edu.cn, ahz@mail.tsinghua.edu.cn Shihong Lao

More information

Scene text recognition: no country for old men?

Scene text recognition: no country for old men? Scene text recognition: no country for old men? Lluís Gómez and Dimosthenis Karatzas Computer Vision Center Universitat Autònoma de Barcelona Email: {lgomez,dimos}@cvc.uab.es Abstract. It is a generally

More information

Road Surface Traffic Sign Detection with Hybrid Region Proposal and Fast R-CNN

Road Surface Traffic Sign Detection with Hybrid Region Proposal and Fast R-CNN 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) Road Surface Traffic Sign Detection with Hybrid Region Proposal and Fast R-CNN Rongqiang Qian,

More information

Content-Based Image Recovery

Content-Based Image Recovery Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

More information

Segmenting Objects in Weakly Labeled Videos

Segmenting Objects in Weakly Labeled Videos Segmenting Objects in Weakly Labeled Videos Mrigank Rochan, Shafin Rahman, Neil D.B. Bruce, Yang Wang Department of Computer Science University of Manitoba Winnipeg, Canada {mrochan, shafin12, bruce, ywang}@cs.umanitoba.ca

More information

Visual features detection based on deep neural network in autonomous driving tasks

Visual features detection based on deep neural network in autonomous driving tasks 430 Fomin I., Gromoshinskii D., Stepanov D. Visual features detection based on deep neural network in autonomous driving tasks Ivan Fomin, Dmitrii Gromoshinskii, Dmitry Stepanov Computer vision lab Russian

More information

arxiv: v3 [cs.cv] 2 Jun 2017

arxiv: v3 [cs.cv] 2 Jun 2017 Incorporating the Knowledge of Dermatologists to Convolutional Neural Networks for the Diagnosis of Skin Lesions arxiv:1703.01976v3 [cs.cv] 2 Jun 2017 Iván González-Díaz Department of Signal Theory and

More information

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL Yingxin Lou 1, Guangtao Fu 2, Zhuqing Jiang 1, Aidong Men 1, and Yun Zhou 2 1 Beijing University of Posts and Telecommunications, Beijing,

More information

Scene Text Recognition for Augmented Reality. Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science

Scene Text Recognition for Augmented Reality. Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science Scene Text Recognition for Augmented Reality Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science Outline Research area and motivation Finding text in natural scenes Prior art Improving

More information

arxiv: v1 [cs.cv] 4 Dec 2017

arxiv: v1 [cs.cv] 4 Dec 2017 Enhanced Characterness for Text Detection in the Wild Aarushi Agrawal 2, Prerana Mukherjee 1, Siddharth Srivastava 1, and Brejesh Lall 1 arxiv:1712.04927v1 [cs.cv] 4 Dec 2017 1 Department of Electrical

More information

arxiv: v1 [cs.cv] 2 Jan 2019

arxiv: v1 [cs.cv] 2 Jan 2019 Detecting Text in the Wild with Deep Character Embedding Network Jiaming Liu, Chengquan Zhang, Yipeng Sun, Junyu Han, and Errui Ding Baidu Inc, Beijing, China. {liujiaming03,zhangchengquan,yipengsun,hanjunyu,dingerrui}@baidu.com

More information

Channel Locality Block: A Variant of Squeeze-and-Excitation

Channel Locality Block: A Variant of Squeeze-and-Excitation Channel Locality Block: A Variant of Squeeze-and-Excitation 1 st Huayu Li Northern Arizona University Flagstaff, United State Northern Arizona University hl459@nau.edu arxiv:1901.01493v1 [cs.lg] 6 Jan

More information

arxiv: v1 [cs.cv] 4 Jan 2018

arxiv: v1 [cs.cv] 4 Jan 2018 PixelLink: Detecting Scene Text via Instance Segmentation Dan Deng 1,3, Haifeng Liu 1, Xuelong Li 4, Deng Cai 1,2 1 State Key Lab of CAD&CG, College of Computer Science, Zhejiang University 2 Alibaba-Zhejiang

More information

Classifying a specific image region using convolutional nets with an ROI mask as input

Classifying a specific image region using convolutional nets with an ROI mask as input Classifying a specific image region using convolutional nets with an ROI mask as input 1 Sagi Eppel Abstract Convolutional neural nets (CNN) are the leading computer vision method for classifying images.

More information

arxiv: v1 [cs.cv] 16 Nov 2015

arxiv: v1 [cs.cv] 16 Nov 2015 Coarse-to-fine Face Alignment with Multi-Scale Local Patch Regression Zhiao Huang hza@megvii.com Erjin Zhou zej@megvii.com Zhimin Cao czm@megvii.com arxiv:1511.04901v1 [cs.cv] 16 Nov 2015 Abstract Facial

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Cascaded Sparse Spatial Bins for Efficient and Effective Generic Object Detection

Cascaded Sparse Spatial Bins for Efficient and Effective Generic Object Detection Cascaded Sparse Spatial Bins for Efficient and Effective Generic Object Detection David Novotny,2 Visual Geometry Group University of Oxford david@robots.ox.ac.uk Jiri Matas 2 2 Center for Machine Perception

More information