Text-Edge-Box: An Object Proposal Approach for Scene Texts Localization


2017 IEEE Winter Conference on Applications of Computer Vision

Dinh Nguyen 1,3, Lu Shijian 2,3, Nizar Ouarti 1,3, Mounir Mokhtari 3,4
1 University Pierre & Marie Curie, France, 2 Institute for Infocomm Research, Singapore, 3 Image & Pervasive Access Lab, Singapore (UMI 2955), 4 Institut Mines-Telecom, France
dinh.nguyenvan@etu.upmc.fr, slu@i2r.a-star.edu.sg, nizar.ouarti@ipal.cnrs.fr, Mounir.Mokhtari@mines-telecom.fr

Abstract

Text proposal has been gaining interest in recent years due to the great success of object proposal in category-independent object localization. In this paper, we present a novel text-specific proposal technique that provides superior bounding boxes for accurate text localization in scenes. The proposed technique, which we call Text Edge Box (TEB), uses a binary edge map, a gradient map and an orientation map of an image as inputs. Connected components are first found within the binary edge map and scored by two proposed low-cue text features that are extracted from the gradient map and the orientation map, respectively. These scores represent the text probability of connected components and are aggregated in a text edge image. Scene text proposals are finally generated by grouping the connected components and estimating their likelihood of being words. The proposed TEB has been evaluated on two public scene text datasets: the ICDAR Robust Reading Competition 2013 (ICDAR2013) dataset and the Street View Text (SVT) dataset. Experiments show that the proposed TEB greatly outperforms state-of-the-art techniques.

1. Introduction

Texts in scenes provide rich semantic cues for context understanding, and automatic scene text recognition has been attracting increasing interest in recent years [1, 2]. In general, end-to-end scene text recognition comprises two major tasks: text detection and text recognition. The first task searches for and detects text regions in scenes, and the second recognizes words within the detected text regions. Leveraging the prevalent object proposal works, we propose a scene text proposal technique that localizes text regions successfully. To search for text regions in scenes, the traditional approach [3, 4, 5] exploits the sliding window strategy. However, this approach has to deal with an exhaustive search using windows of different scales and aspect ratios. Another typical approach detects text regions based on various segmentation techniques [6, 7, 8, 9]. The segmentation approach achieves promising performance but is very sensitive to different types of degradation that are often introduced by uncontrolled illumination, shadows, geometric distortions, and so on. In recent years, object proposal techniques have been widely investigated due to their capacity to locate category-independent objects. For example, Jaderberg et al. used object proposal as an initial step in their end-to-end scene text recognition work [10], which produces superior recognition performance compared with most state-of-the-art systems [11, 12]. The exploitation of object proposal for scene text localization is inspired by the observation that characters in scenes are actually quite similar to generic objects due to the high intra-class variation that is often introduced by different types of distortion. The intra-class variation increases exponentially when scene text detection moves from character level to word level following the great success of word recognition [10, 13, 14].

On the other hand, generic object proposal techniques often produce a huge number of proposals when applied to the scene text localization task. This shifts object proposals back towards the traditional sliding window approach in terms of the large search space. For scene text localization, the number of proposals can be reduced significantly by incorporating certain text-specific features [15, 16, 17]. We design a novel text-specific proposal technique for detecting text in scenes. The proposed technique makes three contributions. First, we design two low-cue text features, namely an edge pair and an edge variance, which are extracted from a gradient map and an orientation map of an image, respectively. The two features are inspired by the text-specific properties observed in [18, 19]. The edge pair feature differs from the Stroke Width Transform (SWT) [18] because we only monitor the orientations of pixels in connected components instead of their distances. The edge variance and the text-specific image contrast in [19] are both extracted from a gradient map, whereas the proposed feature captures the gradient variance, which better reflects the monotonous contrast along text boundaries.

Figure 1. The flowchart of the proposed Text Edge Box technique is shown on the left; not all proposal boxes are displayed because of their large number. On the right is a pseudo-code illustration of proposal generation; this step is explained in Section 3.2.

Second, a grouping strategy is proposed to cluster connected components into text-line proposals, which are further split into word-level proposals based on their geometric information. The grouping strategy is better suited than the sliding window to cater to the unconstrained aspect ratios of words in scenes. Third, a proposal scoring function is designed by combining the scores of connected components within a proposal region with scores of their correlations, which captures the important characteristics of scene texts effectively.

2. Related works

Object proposal techniques have been investigated in recent years to localize generic objects in scenes [20]. As an alternative to the traditional sliding-window-based object detection framework, they can locate category-independent objects with a much smaller number of image patches, hence boosting object detection and recognition efficiency significantly. Traditional object proposal techniques can be divided into two major categories: the boundary-based and the graph-connectivity-based. In the boundary-based techniques, objects are assumed to have well-defined boundaries, and proposals are produced either based on certain boundary properties and cues [21, 22] or by grouping and scoring image pixels using boundary connection [20, 23, 24, 25, 26]. In the graph-connectivity-based techniques, the connectivity of pixels, super-pixels or segments is exploited to merge them together into proposals [27, 28, 29]. Leveraging the powerful feature extraction and classification capability of deep learning architectures, an increasing number of object localization works have reported impressive performance by applying convolutional-neural-network-based proposal generation [30, 31, 32, 33].

Scene text proposal has been investigated over the past few years due to its advantage in localizing texts in scenes [15, 16, 17]. The pilot work [15] searches and ranks potential text regions based on Maximally Stable Extremal Regions (MSER) and region descriptors, with an AdaBoost classifier implemented to score text regions. It greatly outperforms direct applications of existing generic object proposals. The Symmetry-Text Line detector [16] provides text-line proposals by estimating the symmetry appearance of texts. Symmetry filters are designed to estimate text probability and provide a text heat map. Text-line proposals are found by thresholding the text heat map and are further partitioned into word-level proposals based on the distances between connected components within the original image. With a well-designed text/non-text classifier, it achieves superior detection performance compared with state-of-the-art text detection techniques. The latest approach [17] integrates a pre-trained convolutional neural network (VGG16) with the authors' own Inception Region Proposal Network. Its performance is better than that of other existing generic object proposals.

3. Text-Edge-Box proposal approach

This section describes our proposed technique, which is designed to produce word-level text proposals in scenes.
The framework of the proposed technique is shown in Figure 1. Firstly, we exploit the Canny edge detector [34] to generate a gradient map, an orientation map and a binary edge map. Pixels in the orientation map are normalized into the range of [0, π]. Connected components (CCs) are labelled within the binary map, which are further scored by a combination of two proposed low-cue text features: an edge pair feature (EP) and an edge variance feature (EV).
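As a concrete illustration of this first step, the sketch below builds the three input maps and labels the CCs with OpenCV. It is a minimal sketch under our own assumptions (Sobel derivatives for the gradient and orientation maps, illustrative Canny thresholds), not the authors' released code; the raw gradient direction is kept alongside the normalized orientation because the edge pair feature sketched later needs it.

```python
import cv2
import numpy as np

def build_input_maps(image_bgr):
    """Binary edge map, gradient map, orientation map and CC labels."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # Binary edge map from the Canny detector [34]; thresholds are illustrative.
    edges = cv2.Canny(gray, 100, 200)

    # Gradient magnitude and direction from Sobel derivatives; the magnitude
    # is rescaled to [0, 1] so that CC variances stay in a comparable range.
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    magnitude = np.hypot(gx, gy)
    magnitude /= max(magnitude.max(), 1e-12)
    direction = np.arctan2(gy, gx)          # raw direction in (-pi, pi]
    orientation = np.mod(direction, np.pi)  # normalized into [0, pi), as in the paper

    # Label connected components within the binary edge map.
    num_ccs, labels = cv2.connectedComponents((edges > 0).astype(np.uint8))
    return edges, magnitude, direction, orientation, labels, num_ccs
```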

Figure 2. Panels (a) and (b) show an example image and its binary edge map, respectively. Panel (c) shows examples of connected components: a text connected component (A) and a non-text connected component (B). The red arrows indicate the orientations of the considered pixels in the connected components, and the dashed lines are the search lines corresponding to those orientations. The pixels in the pixel pairs shown in connected component A are defined as edge pair pixels. The text connected component clearly contains far more edge pair pixels than the non-text one.

The two text features are estimated from the orientation and the gradient at the corresponding CC pixels, respectively. The CCs are then merged together to produce word-level proposals. A proposal scoring function is designed, which computes the probability of each word-level proposal being a word by combining the scores of the CCs with the scores of their relationships (correlation in component scores, component sizes, and links between pairs of components). Finally, the word-level proposals are sorted in descending order, and those with high scores are identified as words.

3.1. Text edge image generation

We first define the two proposed low-cue text features, the Edge Pair Feature (EP) and the Edge Variance Feature (EV). A CC scoring function is then presented, which assigns a score to each CC and stores it in a Text Edge Image (TEI).

3.1.1. The edge pair feature

The first feature is the edge pair (EP), which is inspired by the Stroke Width Transform method [18]. It is developed based on the supposition that CCs of text objects are likely to contain a high proportion of pixel pairs with opposite orientations, like the two example pixels of connected component A illustrated in panel (c) of Figure 2. We call such pixels edge pair pixels. To detect them, we start at each given pixel in a CC and use its orientation to define a search line. If a pixel of the same CC with the opposite orientation is found on the search line, the considered pixel and the found one are defined as edge pair pixels. The EP feature of a given CC is defined as the fraction of edge pair pixels in the CC:

$EP(CC) = \frac{N_{pp}(CC)}{N_{p}(CC)} \quad (1)$

where $N_{pp}(CC)$ and $N_{p}(CC)$ denote the number of edge pair pixels and the number of edge pixels belonging to the CC under study, respectively. A CC with a higher EP value is more likely to be a text CC. The value of this feature is in the range of [0, 1].

3.1.2. The edge variance feature

The second feature is the edge variance (EV), which measures the variance of the gradient magnitudes of the pixels in a CC. This measure is useful because the gradients of the pixels on the boundary of an individual character (or on the boundaries of characters in the same word) are often monotonous, so their variance is expected to be small. We utilize an exponential function of the gradient variance to normalize these values into the range of [0, 1] and produce high values for text CCs:

$EV(CC) = e^{-var(CC)} \quad (2)$

where $var(CC)$ denotes the variance of the gradients of the pixels in the CC.
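The two features can be sketched as follows, under simplifying assumptions of ours: "opposite orientation" is read as a gradient-direction difference close to π (using the raw direction rather than the [0, π]-normalized orientation, which would fold opposite directions onto the same value), the search line is rasterized with a fixed maximum length, and the gradient magnitude is pre-scaled to [0, 1]. The helper names are ours, not the paper's.

```python
import numpy as np

def edge_pair_feature(cc_mask, direction, max_len=50, tol=np.pi / 6):
    """EP(CC) = N_pp(CC) / N_p(CC), Eq. (1)."""
    h, w = cc_mask.shape
    ys, xs = np.nonzero(cc_mask)
    n_pair = 0
    for y, x in zip(ys, xs):
        dy, dx = np.sin(direction[y, x]), np.cos(direction[y, x])
        for t in range(1, max_len):              # walk along the search line
            py, px = int(round(y + t * dy)), int(round(x + t * dx))
            if not (0 <= py < h and 0 <= px < w):
                break
            if cc_mask[py, px]:                  # first CC pixel on the line
                diff = np.abs(np.mod(direction[y, x] - direction[py, px]
                                     + np.pi, 2 * np.pi) - np.pi)
                if diff > np.pi - tol:           # roughly opposite direction
                    n_pair += 1
                break
    return n_pair / max(len(xs), 1)

def edge_variance_feature(cc_mask, magnitude):
    """EV(CC) = exp(-var(CC)), Eq. (2); high when gradients are monotonous."""
    return float(np.exp(-np.var(magnitude[cc_mask])))
```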
3.1.3. The text edge image

The text edge image (TEI) is a score map that shows the being-text probability of each CC: pixels in a CC take the value of the CC score, and all other pixels are zero. The score of each CC is estimated as a weighted sum of its two text probability features:

$CC_{score} = \alpha \cdot EP + (1 - \alpha) \cdot EV \quad (3)$

where $\alpha$ is in the range of [0, 1]; its value is determined through the tuning process described in Section 4.2. Since both features take values in the range [0, 1], all pixels in the TEI also lie in the range of [0, 1].
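Aggregating the two features into the TEI is then a direct application of Eq. (3). A minimal sketch reusing the helpers above, with α = 0.54 taken from the tuning result in Section 4.2:

```python
import numpy as np

def text_edge_image(labels, num_ccs, magnitude, direction, alpha=0.54):
    """Score each CC with Eq. (3) and write the score into its pixels."""
    tei = np.zeros(labels.shape, dtype=np.float64)
    for cc in range(1, num_ccs):                 # label 0 is the background
        cc_mask = labels == cc
        ep = edge_pair_feature(cc_mask, direction)
        ev = edge_variance_feature(cc_mask, magnitude)
        tei[cc_mask] = alpha * ep + (1.0 - alpha) * ev   # CC score, Eq. (3)
    return tei
```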

3.2. Scene text proposal generation strategy

As shown in Figure 1(b), the CCs are first merged into text lines, which are then split into smaller subgroups corresponding to word-level proposals. Starting from a given CC (called candidate A), three properties of its bounding box $bb_A$ are exploited: the box height $h_A$, the box width $w_A$ and the box size $s_A$. A corresponding search area is designed by expanding $bb_A$: the search area width $w_{search}$ equals the image width, and the search area height $h_{search}$ is $\gamma$ times $h_A$, obtained by expanding $bb_A$ equally on both sides in the vertical direction. A CC candidate B (with properties $bb_B$, $w_B$, $h_B$ and $s_B$) is merged with candidate A to form a group if $bb_B$ satisfies: (1) the ratio of the intersection between $bb_B$ and A's search area to $s_B$ is higher than $\tau_s$; (2) the ratio between $\min(w_A, w_B)$ and $\max(w_A, w_B)$ is higher than $\tau_w$; and (3) the ratio between $\min(h_A, h_B)$ and $\max(h_A, h_B)$ is higher than $\tau_h$. The parameters $\gamma$ and $\tau_s$ are sensitive to horizontal texts, while $\tau_w$ and $\tau_h$ are sensitive to the size relationship between characters in a word; how to set these parameters is discussed in Section 4.2. To divide the text-line proposals into smaller subgroups corresponding to word-level proposals, the average horizontal distance between adjacent CC boxes is estimated, and dividing positions are placed wherever a distance is larger than this average, as sketched below.
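A minimal sketch of the merging test and the word splitting, assuming (x, y, w, h) bounding boxes and the tuned parameter values from Section 4.2; the bookkeeping of the authors' actual grouping loop may differ:

```python
def can_merge(box_a, box_b, img_w, gamma=1.5, tau_s=0.75, tau_w=0.3, tau_h=0.75):
    """Merging test for CC candidates A and B (criteria (1)-(3) above)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Search area of A: full image width, height gamma * h_A centred on box A.
    sy = ya - (gamma - 1.0) * ha / 2.0
    sh = gamma * ha
    ix = max(0, min(xb + wb, img_w) - max(xb, 0))
    iy = max(0.0, min(yb + hb, sy + sh) - max(yb, sy))
    overlap = (ix * iy) / float(wb * hb)                 # criterion (1)
    width_ratio = min(wa, wb) / float(max(wa, wb))       # criterion (2)
    height_ratio = min(ha, hb) / float(max(ha, hb))      # criterion (3)
    return overlap > tau_s and width_ratio > tau_w and height_ratio > tau_h

def split_line_into_words(boxes):
    """Split a text line at horizontal gaps larger than the average gap."""
    boxes = sorted(boxes, key=lambda b: b[0])
    gaps = [boxes[i + 1][0] - (boxes[i][0] + boxes[i][2])
            for i in range(len(boxes) - 1)]
    if not gaps:
        return [boxes]
    avg_gap = sum(gaps) / len(gaps)
    words, current = [], [boxes[0]]
    for gap, box in zip(gaps, boxes[1:]):
        if gap > avg_gap:
            words.append(current)
            current = []
        current.append(box)
    words.append(current)
    return words
```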
3.3. Ranking

This section elaborates a strategy to produce a list of proposals ranked in decreasing priority order. Four measures are defined, $S_a$, $S_c$, $S_h$ and $S_o$, all normalized into the range of [0, 1]. $S_a$ is the average score of the CCs within a word-level proposal region, where the score of each CC is defined in Eq. 3. $S_c$, $S_h$ and $S_o$ indicate the affinity among the grouped CCs. These measures are designed so that a proposal covering a word obtains high values. In particular, they are calculated from the variance of the scores of the grouped CCs, the variance of the bounding-box heights of the CCs in a proposal region, and the variance of the angles between the lines linking the centroids of neighbouring CCs and the horizontal axis, respectively; these angles are adjusted into the range of [0, π]. Generally, a proposal has a high likelihood of being a word if (1) its CCs have similar scores, (2) the heights of the CCs are approximately stable, and (3) the lines connecting the CCs point in approximately the same direction. The variances of these measures are therefore expected to be small for a word region proposal. To derive high $S_c$, $S_h$ and $S_o$ values for a group of CCs that is likely to be a word, and to normalize the measures into the range of [0, 1], an arctan function is applied to each measure:

$S_x = \frac{2}{\pi} \arctan\left(\frac{k_x}{var_x}\right) \quad (4)$

where $x$ stands for $c$, $h$ or $o$, and $var_x$ refers to the variance of the CCs' scores, heights and angles, respectively. The parameter $k_x$ is set at the middle of each measure's range, i.e. 0.5, half of the image height, and $\pi/2$ for $S_c$, $S_h$ and $S_o$, respectively. The score of a proposal region, $S_p$, is computed as:

$S_p = S_a \cdot \prod_{x \in \{c, h, o\}} \arctan(k_1 \cdot S_x) \quad (5)$

where the factors $\arctan(k_1 \cdot S_x)$ control the relationship between $S_p$ and $S_a$. If an $S_x$ makes $\arctan(k_1 \cdot S_x)$ higher than 1, we say that $S_x$ has a supporting effect ($S_p > S_a$); if it makes the value smaller than 1, it has a penalizing effect ($S_p < S_a$). In other words, even if a proposal has a high $S_a$ value, it is unlikely to be a word proposal if its CCs yield low $S_x$ values (the penalizing effect; see the low-score candidate in panel (c) of Figure 1). The parameter $k_1$ adjusts the influence of the $S_x$ measures: as $k_1$ increases, their influence is reduced. In this work, we expect $\arctan(k_1 \cdot S_x)$ to have a supporting effect whenever $S_x$ is higher than the middle value of its range, 0.5, and vice versa; the parameter $k_1$ is therefore set to 3, as $\arctan(3 \times 0.5) \approx 1$. Note that the $S_x$ measures are considered only when the number of CCs in a proposal is larger than 3; otherwise, $S_p$ is calculated from $S_a$ alone.
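The ranking step can be sketched as below, assuming per-CC scores, heights and centroids are available from the grouping stage. Note that the product over the three arctan factors is our reading of Eq. (5); the $k_x$ choices, $k_1 = 3$ and the more-than-3-CCs condition follow the text above.

```python
import numpy as np

def proposal_score(cc_scores, cc_heights, cc_centroids, image_height, k1=3.0):
    """Rank score S_p of a word-level proposal (Eqs. 4 and 5)."""
    s_a = float(np.mean(cc_scores))
    if len(cc_scores) <= 3:               # too few CCs: fall back to S_a alone
        return s_a

    # Angles of the lines linking neighbouring centroids, adjusted into [0, pi).
    pts = np.asarray(cc_centroids, dtype=np.float64)
    d = np.diff(pts, axis=0)
    angles = np.mod(np.arctan2(d[:, 1], d[:, 0]), np.pi)

    def s_x(values, k_x):                 # Eq. (4): S_x = (2/pi) arctan(k_x / var_x)
        var = np.var(values) + 1e-12      # guard against zero variance
        return (2.0 / np.pi) * np.arctan(k_x / var)

    s_c = s_x(cc_scores, 0.5)                     # k_c: middle of the score range
    s_h = s_x(cc_heights, image_height / 2.0)     # k_h: half the image height
    s_o = s_x(angles, np.pi / 2.0)                # k_o: middle of the angle range

    # Eq. (5): each arctan(k1 * S_x) > 1 supports S_a, < 1 penalizes it.
    return s_a * float(np.prod([np.arctan(k1 * s) for s in (s_c, s_h, s_o)]))
```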

4. Experiments and results

4.1. Experiment set-up

The proposed technique takes a scene image as input and generates proposals that cover the locations of words. The target is to achieve high recall with a small number of proposals. The optimal parameters of the proposed system are estimated on the training sets, and the system's performance is evaluated on the testing sets, of two public datasets: the ICDAR Robust Reading Competition 2013 dataset (ICDAR2013) [1] and the Street View Text dataset (SVT) [35]. The two datasets contain 229 and 101 images for training and 233 and 249 images for testing, respectively. The SVT dataset is more challenging than the ICDAR2013 dataset because it includes images with heavy noise, poor lighting, low contrast and low resolution.

Table 1. The detection rate (in %) of the proposed technique for varying α values (0.3, 0.4, 0.5, 0.52, 0.54, 0.56, 0.58, 0.6 and 0.7) and different IoU thresholds on the combined training sets of the two scene text datasets ICDAR2013 and SVT; the maximum number of proposal regions is 5,000.

The proposed TEB has been compared with three scene text proposal algorithms: the simple text-specific selective search (TP) [15], the Symmetry-Text Line (STL) [16] and the DeepText (DT) [17]. In addition, it is also compared with other generic object proposal methods, including the EdgeBox (EB) [22], the Geodesic Object Proposals (GOP) [24], the Randomized Prim (RP) [28] and the Multiscale Combinatorial Grouping (MCG) [23]. The parameters of the TP, the STL and the other object proposal algorithms are kept at the defaults recommended for their best performance. Note that the implementation of the DT has not been released; the comparison with this method is based on the results reported in the published paper [17]. We cap the proposal number at 5,000. Because of the large image sizes (1194×870 on average, W×H) and the large size range of texts (widths from 4 to 2146 pixels and heights from 3 to 785 pixels) in the two datasets, this cap emphasises the advantage of object proposals over the sliding-window-based exhaustive search strategy. Besides, it also helps to reduce the computational cost of scene text recognition. Due to the diversity of colour, lighting and size of text objects in scenes, the proposed algorithm is run with different colour representations (grey and RGB) and different scales (from 0.1 to 1 with a step of 0.3) to increase the chance of finding positive proposals.

4.2. Parameters tuning

Figure 3. The difference between one-to-one, one-to-many and many-to-one overlaps. The red boxes are the ground truth boxes and the green dashed boxes are the proposal regions.

We follow the evaluation method widely used for evaluating object proposals, as described in [17, 22, 30, 33]. It considers only one-to-one overlaps between proposal regions and ground truth boxes. The same evaluation method was used in the ICDAR2003 competition [36] and is much more constrained than the framework used in the ICDAR2013 competition [37], which also considers one-to-many and many-to-one overlaps for detection evaluation, as illustrated in Figure 3. The proposed technique is evaluated based on the detection rate under various testing conditions formed by combinations of a given number of proposals and an intersection-over-union (IoU) threshold. The IoU measures how well proposals overlap with ground truth boxes; a higher IoU threshold requires better overlap. Generally, an IoU threshold of 0.5 is acceptable for deciding whether objects have been located [38]. However, higher IoU thresholds are usually expected because of unpredictable word proposals, as mentioned in [10], which used the EdgeBox proposals [22] to find text locations in scenes. In addition, good object proposal algorithms are expected to produce a small number of proposals [39]. The proposed technique involves five specific parameters: $\gamma$, $\tau_s$, $\tau_w$, $\tau_h$ and $\alpha$. To improve the robustness of the proposed system under a wide diversity of text appearance, these five parameters are determined on the combined training sets of the ICDAR2013 dataset (for high-contrast texts) and the SVT dataset (for blurred texts).
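All detection-rate evaluations in this section reduce to a one-to-one IoU test between a proposal box and a ground truth box; a minimal sketch with (x, y, w, h) boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / float(union) if union > 0 else 0.0
```

A ground truth box counts as detected if some proposal reaches the chosen IoU threshold, and the detection rate is the fraction of ground truth boxes detected.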
We first focus on generating high-quality groups of proposals that maximize the overlap with the ground truth by varying the four parameters $\gamma$, $\tau_s$, $\tau_w$ and $\tau_h$, while the ranking step is ignored. These four parameters are tuned in the ranges [1, 2], [0.5, 1], [0.1, 1] and [0.5, 1], respectively, with a step of 0.05. All generated proposals are collected for evaluating the detection rate. The best values of the four parameters are found to be $\gamma = 1.5$, $\tau_s = 0.75$, $\tau_w = 0.3$ and $\tau_h = 0.75$. After obtaining good proposals, we concentrate on scoring them and shifting the likely-to-be-text proposals to the top of the list by sorting the found groups in descending order. The parameter $\alpha$ is estimated for this purpose. It controls the contribution of the two proposed features (EP and EV), which is reflected in the values of $S_a$, $S_c$ and $S_p$ in the scoring function (Eqs. 4 and 5). As the results in Table 1 show, when the number of proposals is capped at 5,000, $\alpha$ values around 0.54 and 0.56 provide the optimal performance on the combined training set under many IoU thresholds. We therefore set $\alpha$ to 0.54 for all experiments, including the comparison with other state-of-the-art methods.
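The two-stage tuning just described amounts to a plain grid search. In the sketch below, evaluate_recall and evaluate_ranked_recall are hypothetical stand-ins for the detection-rate measurement on the combined training sets:

```python
import itertools
import numpy as np

def tune_parameters(evaluate_recall, evaluate_ranked_recall):
    # Stage 1: grid search over (gamma, tau_s, tau_w, tau_h), step 0.05.
    grids = [np.arange(1.0, 2.0 + 1e-9, 0.05),    # gamma in [1, 2]
             np.arange(0.5, 1.0 + 1e-9, 0.05),    # tau_s in [0.5, 1]
             np.arange(0.1, 1.0 + 1e-9, 0.05),    # tau_w in [0.1, 1]
             np.arange(0.5, 1.0 + 1e-9, 0.05)]    # tau_h in [0.5, 1]
    best_group = max(itertools.product(*grids),
                     key=lambda p: evaluate_recall(*p))

    # Stage 2: sweep alpha over the candidate values reported in Table 1.
    alphas = [0.3, 0.4, 0.5, 0.52, 0.54, 0.56, 0.58, 0.6, 0.7]
    best_alpha = max(alphas,
                     key=lambda a: evaluate_ranked_recall(best_group, a))
    return best_group, best_alpha
```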

Figure 4. The detection rate versus the number of proposals (top row) and versus the IoU threshold (bottom row) for the Text Edge Box (TEB) and the other state-of-the-art algorithms, including the simple text-specific selective search (TP) [15], the Symmetry-Text Line (STL) [16], the DeepText (DT) [17], the EdgeBox (EB) [22], the Geodesic Object Proposals (GOP) [24], the Randomized Prim (RP) [28] and the Multiscale Combinatorial Grouping (MCG) [23], on the ICDAR2013 dataset.

Table 2. The number of proposal regions needed to reach different recall rates (50% and 75%) at IoU = 0.7 (columns: TEB, TP, STL, EB, RP, MCG, GOP, DT; rows: ICDAR2013 at 50% and 75% recall, SVT at 50% recall). No technique can reach a 75% recall rate at IoU = 0.7 on the SVT dataset. The character 'N' means that no information is available for the comparison; the symbol '-' means that the technique cannot reach the given recall rate.

Table 3. The processing time (in seconds) of the algorithms (TEB, TP, STL, EB, RP, MCG, GOP, DT) on the two popular scene text datasets ICDAR2013 and SVT. The character 'N' means that no information is available for the comparison. The processing time of the STL technique is 6 times slower than its authors' report [16] because we used all of its generated text-line proposals to produce word-level proposals for a fair comparison.

4.3. Experimental results

Figure 4 illustrates the performance of the proposed technique and compares it with state-of-the-art techniques. In the top row, the detection rate versus the number of proposals on the ICDAR2013 dataset is evaluated under three different IoU thresholds, i.e. 0.5, 0.7 and 0.9. The TEB algorithm clearly outperforms the other methods at the different IoU values once the number of proposals grows large enough. The DT leverages a deep learning model for scoring proposal regions; its performance is therefore very competitive for small numbers of proposals, especially at IoU = 0.5, because the deep learning model has an advantage in recognizing non-text regions and eliminating them from the generated proposal list. However, when the IoU threshold increases to 0.7 and 0.9, our proposed system localizes scene texts more successfully. The TP is the most competitive technique when a huge number of proposals is accepted. The EB shows better results than the TP when the number of proposals is small, but its performance deteriorates as the number of proposals increases. The bottom row shows the second experiment, which estimates the detection rate versus the IoU threshold for different numbers of proposals (100, 500, ...). The TEB significantly outperforms the other methods (excluding the DT) under the different proposal budgets. When the number of proposals increases and the IoU requirement becomes more constrained, the proposed TEB performs better than the DT. In addition, we also test the minimum number of proposals required to obtain different desired recalls.

Figure 5. The performance of the end-to-end word spotting systems constructed from the compared proposal techniques and the word recognition model [40]. The performance of RegModel is the result of the word recognition model [40] tested on the ground truth of the testing sets of the two scene text datasets ICDAR2013 and SVT.

Figure 6. Examples of SVT ground truth boxes that our proposals cannot localize at an IoU threshold of 0.7. The red boxes are the ground truths and the green boxes are our proposals. The proposal boxes are much smaller and fit the scene text objects more tightly than the ground truth boxes.

Hosang et al. [41] show that this criterion correlates well with detection performance, and it has been used to evaluate proposal quality in the EB [22] and the HyperNet [33]. Table 2 shows the experimental results on the two datasets. On the ICDAR2013 dataset, the TEB algorithm always requires the smallest number of proposal regions. On the SVT dataset, the TEB performs slightly worse than the EB algorithm but better than the other state-of-the-art algorithms. On the other hand, the minimum numbers of proposal regions required are clearly larger than those for the ICDAR2013 dataset. Besides the poorer image quality of the SVT dataset, one important reason for the lower performance lies in the ground truth of the SVT dataset, where the manually labelled bounding boxes are often much larger than the actual boxes. This is illustrated in Figure 6, where the ground truth boxes in red are clearly much larger than the boxes produced by the proposed TEB in green. Furthermore, the IoU-based evaluation has certain limitations in cases where proposals have small overlap with the ground truth boxes but still cover the entire objects, as illustrated in Figure 6.

We also adopted another evaluation that uses word recognition models to estimate the quality of proposals. The well-known word recognition model provided by Jaderberg et al. [40] is used for this additional task. A proposal is a correct localization if it overlaps with one of the ground truth boxes and provides enough information for the recognition model to recognize the correct word; a better proposal technique thus achieves a higher F-score at the output of the recognition model. The quality of the recognition model is first estimated on the ground truth boxes of the testing sets of the two datasets; its F-scores on the ICDAR2013 and SVT datasets are presented in Figure 5 as the RegModel performance. This is the maximum performance that each proposal technique could obtain if it provided proposals that matched the ground truth boxes perfectly. As shown in Figure 5, the TEB method produces the largest number of good proposals that help the recognition model read the contained words correctly. In addition, the performance of the proposed TEB algorithm changes only slightly when the number of proposals increases from 1000 to 5000 on both datasets. This indicates that the proposed technique ranks proposals better than the other techniques, so most good proposals are ranked at the top of the list.

The efficiency of the proposed technique is also evaluated based on execution time. All of the above techniques are evaluated on the same computer and executed in a single thread (Intel Xeon CPU).
As presented in Table 3, the proposed TEB is comparable to the most efficient methods, except for the original EB method. However, the original EB method does not perform well in terms of the minimum proposal number required and the maximum recall obtained. For the DeepText method [17], the authors have not released their program, so we cannot report its processing time on our device. According to their report,

Figure 7. Examples from the ICDAR2013 dataset that our algorithm fails to localize. The red boxes are ground truths and the green boxes are our proposal regions.

their algorithm takes 1.7 seconds per image on average on the ICDAR2013 dataset, on their device with a single K40 GPU, which is much more powerful than the one we used.

5. Discussion

One distinctive feature of the proposed TEB is that it does not rely on any classifier to eliminate non-text proposals (as implemented in the TP and the DT). It simply uses the two proposed features and the geometric relationships among CCs to rank proposals. Nevertheless, very good performance was obtained on the two public scene text datasets, which demonstrates the effectiveness of the proposed text-specific proposal technique. Besides, when testing on the SVT dataset, which includes a certain amount of non-horizontal text lines, the TEB is still competitive compared with the TP, which is designed without any horizontal restriction. We observe that the proposed algorithm can handle multi-orientation text lines. As discussed in Section 3.2, the horizontal text-line assumption can be relaxed by increasing the parameter γ for a larger search space and reducing the parameter $\tau_s$ to retain more CCs. The scoring function should then be upgraded to handle the huge number of CCs in the merged groups; we will investigate this in future work.

Figure 7 illustrates several typical scenarios where our algorithm often fails to provide good proposals, including ultra-low contrast (a.1), complex background (a.2), very small text size (a.3), and uneven illumination (a.2, a.4). In particular, the edges of texts in a complex background are often connected with the edges of other objects, so the edge pair feature may not be extracted reliably. Similarly, when text objects are covered by shadow or uneven illumination, the shapes of the text boundaries are destroyed, and the text edge features may not be extracted properly either. In the low-contrast case, the text edges can be missed by the Canny edge detector because of their ultra-low gradient magnitudes.

Figure 8. Some example outputs of the proposed technique applied to pill images.

6. Application

An application is under development that aims to use scene text detection and recognition techniques to support elderly people in reading tasks. In particular, we want to apply it to capture the imprint features for pill recognition, which has been widely studied in recent years [42, 43, 44]. Based on the top 1000 proposals in the list, we can locate the imprint areas correctly, as illustrated in Figure 8.

7. Conclusion

In this paper, we proposed a text-specific proposal algorithm to search for text regions in scenes. Two text-specific features, namely an edge pair and an edge variance, were designed to search for likely text components. To measure the text likelihood of the proposal boxes, we designed a scoring function that computes word probability based on the correlations of connected components in their scores, heights, and connection orientations. The effectiveness of the proposed technique has been demonstrated by its superior performance compared with other state-of-the-art algorithms.

References

[1] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez i Bigorda, S. Robles Mestre, J. Mas, D. Fernandez Mota, J. A. Almazán, and L. P. de las Heras, "ICDAR 2013 robust reading competition," in Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, IEEE Computer Society, 2013.
[2] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, "ICDAR 2015 competition on robust reading," in Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Washington, DC, USA, IEEE Computer Society, 2015.
[3] A. Mishra, K. Alahari, and C. V. Jawahar, "Enhancing energy minimization framework for scene text recognition with top-down cues."
[4] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.
[5] K. Wang and S. Belongie, "Word spotting in the wild," in European Conference on Computer Vision (ECCV), Heraklion, Crete, Sept. 2010.
[6] W. Huang, Y. Qiao, and X. Tang, "Robust scene text detection with convolution neural network induced MSER trees," in Computer Vision - ECCV 2014, 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV.
[7] C. Zhang, C. Yao, B. Shi, and X. Bai, "Automatic discrimination of text and non-text natural images," in 13th International Conference on Document Analysis and Recognition (ICDAR 2015), Tunis, Tunisia, August 23-26, 2015.
[8] M. Sung, B. Jun, H. Cho, and D. Kim, "Scene text detection with robust character candidate extraction method," in 13th International Conference on Document Analysis and Recognition (ICDAR 2015), Tunis, Tunisia, August 23-26, 2015.
[9] S. Qin and R. Manduchi, "A fast and robust text spotter," in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2016.
[10] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Reading text in the wild with convolutional neural networks," CoRR.
[11] A. Gordo, "Supervised mid-level features for word image representation," CoRR.
[12] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in European Conference on Computer Vision, 2014.
[13] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word spotting and recognition with embedded attributes," TPAMI.
[14] B. Su and S. Lu, "Accurate scene text recognition based on recurrent neural network," in Computer Vision - ACCV 2014, 12th Asian Conference on Computer Vision, Singapore, November 1-5, 2014, Revised Selected Papers, Part I.
[15] L. Gomez and D. Karatzas, "TextProposals: a text-specific selective search algorithm for word spotting in the wild," arXiv preprint arXiv:1604.02619.
[16] Z. Zhang, W. Shen, C. Yao, and X. Bai, "Symmetry-based text line detection in natural scenes," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2015.
[17] Z. Zhong, L. Jin, S. Zhang, and Z. Feng, "DeepText: A unified framework for text proposal generation and text detection in natural images," CoRR.
[18] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in CVPR, IEEE, 2010.
[19] S. Lu, T. Chen, S. Tian, J.-H. Lim, and C.-L. Tan, "Scene text extraction based on edges and support vector regression," Int. J. Doc. Anal. Recognit., vol. 18, June 2015.
[20] B. Alexe, T. Deselaers, and V. Ferrari, "What is an object?," in The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), San Francisco, CA, USA, June 2010.
[21] Z. Zhang, Y. Liu, T. Bolukbasi, M. Cheng, and V. Saligrama, "BING++: A fast high quality object proposal generator at 100fps," CoRR.
[22] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in ECCV, 2014.
[23] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in Computer Vision and Pattern Recognition, 2014.
[24] P. Krähenbühl and V. Koltun, "Geodesic object proposals," in Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, Cham: Springer International Publishing.
[25] A. Humayun, F. Li, and J. M. Rehg, "RIGOR: Reusing inference in graph cuts for generating object regions," in Computer Vision and Pattern Recognition (CVPR), Proceedings of the IEEE Conference on, IEEE, June 2014.
[26] E. Rahtu, J. Kannala, and M. B. Blaschko, "Learning a category independent object detection cascade," in IEEE International Conference on Computer Vision, 2011.
[27] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, 2013.
[28] S. Manén, M. Guillaumin, and L. Van Gool, "Prime object proposals with randomized Prim's algorithm," in ICCV, Dec. 2013.
[29] J. Carreira et al., "Constrained parametric min-cuts for automatic object segmentation."

[30] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," CoRR, vol. abs/1506.01497.
[31] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," arXiv preprint.
[32] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," CoRR, vol. abs/1506.02640.
[33] T. Kong, A. Yao, Y. Chen, and F. Sun, "HyperNet: Towards accurate region proposal generation and joint object detection," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[34] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, June 1986.
[35] K. Wang and S. Belongie, "Word spotting in the wild," in Computer Vision - ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part I, Berlin, Heidelberg: Springer.
[36] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in Proceedings of the Seventh International Conference on Document Analysis and Recognition, IEEE Press, 2003.
[37] C. Wolf and J.-M. Jolion, "Object count/area graphs for the evaluation of object detection and segmentation algorithms," International Journal on Document Analysis and Recognition, 2006.
[38] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge 2009 (VOC2009) results."
[39] J. H. Hosang, R. Benenson, and B. Schiele, "How good are detection proposals, really?," CoRR.
[40] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in European Conference on Computer Vision, 2014.
[41] J. H. Hosang, R. Benenson, P. Dollár, and B. Schiele, "What makes for effective detection proposals?," CoRR.
[42] Y.-B. Lee, U. Park, and A. K. Jain, "Pill-ID: Matching and retrieval of drug pill imprint images," Tech. Rep. MSU-CSE-10-4, Department of Computer Science, Michigan State University, East Lansing, Michigan, February 2010.
[43] J. Yu, Z. Chen, S. Kamata, and J. Yang, "Accurate system for automatic pill recognition using imprint information," IET Image Processing, vol. 9, no. 12, 2015.
[44] R. Palenichka, A. Lakhssassi, and M. Palenichka, "Visual attention-guided approach to monitoring of medication dispensing using multi-location feature saliency patterns," in The IEEE International Conference on Computer Vision (ICCV) Workshops, December.


More information

arxiv: v2 [cs.cv] 27 Feb 2018

arxiv: v2 [cs.cv] 27 Feb 2018 Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation arxiv:1802.08948v2 [cs.cv] 27 Feb 2018 Pengyuan Lyu 1, Cong Yao 2, Wenhao Wu 2, Shuicheng Yan 3, Xiang Bai 1 1 Huazhong

More information

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK 1 Po-Jen Lai ( 賴柏任 ), 2 Chiou-Shann Fuh ( 傅楸善 ) 1 Dept. of Electrical Engineering, National Taiwan University, Taiwan 2 Dept.

More information

Real-time Object Detection CS 229 Course Project

Real-time Object Detection CS 229 Course Project Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection

More information

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network Liwen Zheng, Canmiao Fu, Yong Zhao * School of Electronic and Computer Engineering, Shenzhen Graduate School of

More information

Arbitrary-Oriented Scene Text Detection via Rotation Proposals

Arbitrary-Oriented Scene Text Detection via Rotation Proposals 1 Arbitrary-Oriented Scene Text Detection via Rotation Proposals Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, Xiangyang Xue Fig. 1 arxiv:1703.01086v3 [cs.cv] 15 Mar 018 Abstract

More information

Towards Visual Words to Words

Towards Visual Words to Words Towards Visual Words to Words Text Detection with a General Bag of Words Representation Rakesh Mehta Dept. of Signal Processing, Tampere Univ. of Technology in Tampere Ondřej Chum, Jiří Matas Centre for

More information

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection Zeming Li, 1 Yilun Chen, 2 Gang Yu, 2 Yangdong

More information

Deep Direct Regression for Multi-Oriented Scene Text Detection

Deep Direct Regression for Multi-Oriented Scene Text Detection Deep Direct Regression for Multi-Oriented Scene Text Detection Wenhao He 1,2 Xu-Yao Zhang 1 Fei Yin 1 Cheng-Lin Liu 1,2 1 National Laboratory of Pattern Recognition (NLPR) Institute of Automation, Chinese

More information

Scene Text Recognition using Co-occurrence of Histogram of Oriented Gradients

Scene Text Recognition using Co-occurrence of Histogram of Oriented Gradients 203 2th International Conference on Document Analysis and Recognition Scene Text Recognition using Co-occurrence of Histogram of Oriented Gradients Shangxuan Tian, Shijian Lu, Bolan Su and Chew Lim Tan

More information

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation Object detection using Region Proposals (RCNN) Ernest Cheung COMP790-125 Presentation 1 2 Problem to solve Object detection Input: Image Output: Bounding box of the object 3 Object detection using CNN

More information

Finding Tiny Faces Supplementary Materials

Finding Tiny Faces Supplementary Materials Finding Tiny Faces Supplementary Materials Peiyun Hu, Deva Ramanan Robotics Institute Carnegie Mellon University {peiyunh,deva}@cs.cmu.edu 1. Error analysis Quantitative analysis We plot the distribution

More information

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang SSD: Single Shot MultiBox Detector Author: Wei Liu et al. Presenter: Siyu Jiang Outline 1. Motivations 2. Contributions 3. Methodology 4. Experiments 5. Conclusions 6. Extensions Motivation Motivation

More information

Detecting and Recognizing Text in Natural Images using Convolutional Networks

Detecting and Recognizing Text in Natural Images using Convolutional Networks Detecting and Recognizing Text in Natural Images using Convolutional Networks Aditya Srinivas Timmaraju, Vikesh Khanna Stanford University Stanford, CA - 94305 adityast@stanford.edu, vikesh@stanford.edu

More information

Unified, real-time object detection

Unified, real-time object detection Unified, real-time object detection Final Project Report, Group 02, 8 Nov 2016 Akshat Agarwal (13068), Siddharth Tanwar (13699) CS698N: Recent Advances in Computer Vision, Jul Nov 2016 Instructor: Gaurav

More information

Scene text extraction based on edges and support vector regression

Scene text extraction based on edges and support vector regression IJDAR (2015) 18:125 135 DOI 10.1007/s10032-015-0237-z SPECIAL ISSUE PAPER Scene text extraction based on edges and support vector regression Shijian Lu Tao Chen Shangxuan Tian Joo-Hwee Lim Chew-Lim Tan

More information

arxiv: v2 [cs.cv] 10 Jul 2017

arxiv: v2 [cs.cv] 10 Jul 2017 EAST: An Efficient and Accurate Scene Text Detector Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang Megvii Technology Inc., Beijing, China {zxy, yaocong, wenhe, wangyuzhi,

More information

arxiv: v1 [cs.cv] 31 Mar 2016

arxiv: v1 [cs.cv] 31 Mar 2016 Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arxiv:1603.09742v1 [cs.cv] 31 Mar 2016 University of Southern California Abstract.

More information

12/12 A Chinese Words Detection Method in Camera Based Images Qingmin Chen, Yi Zhou, Kai Chen, Li Song, Xiaokang Yang Institute of Image Communication

12/12 A Chinese Words Detection Method in Camera Based Images Qingmin Chen, Yi Zhou, Kai Chen, Li Song, Xiaokang Yang Institute of Image Communication A Chinese Words Detection Method in Camera Based Images Qingmin Chen, Yi Zhou, Kai Chen, Li Song, Xiaokang Yang Institute of Image Communication and Information Processing, Shanghai Key Laboratory Shanghai

More information

arxiv: v1 [cs.cv] 1 Sep 2017

arxiv: v1 [cs.cv] 1 Sep 2017 Single Shot Text Detector with Regional Attention Pan He1, Weilin Huang2, 3, Tong He3, Qile Zhu1, Yu Qiao3, and Xiaolin Li1 arxiv:1709.00138v1 [cs.cv] 1 Sep 2017 1 National Science Foundation Center for

More information

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS Kuan-Chuan Peng and Tsuhan Chen School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

More information

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors [Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors Junhyug Noh Soochan Lee Beomsu Kim Gunhee Kim Department of Computer Science and Engineering

More information

Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies

Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) e-isjn: A4372-3114 Impact Factor: 7.327 Volume 6, Issue 12, December 2018 International Journal of Advance Research in Computer Science and Management Studies Research Article

More information

Learning to Generate Object Segmentation Proposals with Multi-modal Cues

Learning to Generate Object Segmentation Proposals with Multi-modal Cues Learning to Generate Object Segmentation Proposals with Multi-modal Cues Haoyang Zhang 1,2, Xuming He 2,1, Fatih Porikli 1,2 1 The Australian National University, 2 Data61, CSIRO, Canberra, Australia {haoyang.zhang,xuming.he,

More information

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material Yi Li 1, Gu Wang 1, Xiangyang Ji 1, Yu Xiang 2, and Dieter Fox 2 1 Tsinghua University, BNRist 2 University of Washington

More information

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b 1

More information

Dot Text Detection Based on FAST Points

Dot Text Detection Based on FAST Points Dot Text Detection Based on FAST Points Yuning Du, Haizhou Ai Computer Science & Technology Department Tsinghua University Beijing, China dyn10@mails.tsinghua.edu.cn, ahz@mail.tsinghua.edu.cn Shihong Lao

More information

Scene text recognition: no country for old men?

Scene text recognition: no country for old men? Scene text recognition: no country for old men? Lluís Gómez and Dimosthenis Karatzas Computer Vision Center Universitat Autònoma de Barcelona Email: {lgomez,dimos}@cvc.uab.es Abstract. It is a generally

More information

Road Surface Traffic Sign Detection with Hybrid Region Proposal and Fast R-CNN

Road Surface Traffic Sign Detection with Hybrid Region Proposal and Fast R-CNN 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) Road Surface Traffic Sign Detection with Hybrid Region Proposal and Fast R-CNN Rongqiang Qian,

More information

Content-Based Image Recovery

Content-Based Image Recovery Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

More information

Segmenting Objects in Weakly Labeled Videos

Segmenting Objects in Weakly Labeled Videos Segmenting Objects in Weakly Labeled Videos Mrigank Rochan, Shafin Rahman, Neil D.B. Bruce, Yang Wang Department of Computer Science University of Manitoba Winnipeg, Canada {mrochan, shafin12, bruce, ywang}@cs.umanitoba.ca

More information

Visual features detection based on deep neural network in autonomous driving tasks

Visual features detection based on deep neural network in autonomous driving tasks 430 Fomin I., Gromoshinskii D., Stepanov D. Visual features detection based on deep neural network in autonomous driving tasks Ivan Fomin, Dmitrii Gromoshinskii, Dmitry Stepanov Computer vision lab Russian

More information

arxiv: v3 [cs.cv] 2 Jun 2017

arxiv: v3 [cs.cv] 2 Jun 2017 Incorporating the Knowledge of Dermatologists to Convolutional Neural Networks for the Diagnosis of Skin Lesions arxiv:1703.01976v3 [cs.cv] 2 Jun 2017 Iván González-Díaz Department of Signal Theory and

More information

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL Yingxin Lou 1, Guangtao Fu 2, Zhuqing Jiang 1, Aidong Men 1, and Yun Zhou 2 1 Beijing University of Posts and Telecommunications, Beijing,

More information

Scene Text Recognition for Augmented Reality. Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science

Scene Text Recognition for Augmented Reality. Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science Scene Text Recognition for Augmented Reality Sagar G V Adviser: Prof. Bharadwaj Amrutur Indian Institute Of Science Outline Research area and motivation Finding text in natural scenes Prior art Improving

More information

arxiv: v1 [cs.cv] 4 Dec 2017

arxiv: v1 [cs.cv] 4 Dec 2017 Enhanced Characterness for Text Detection in the Wild Aarushi Agrawal 2, Prerana Mukherjee 1, Siddharth Srivastava 1, and Brejesh Lall 1 arxiv:1712.04927v1 [cs.cv] 4 Dec 2017 1 Department of Electrical

More information

arxiv: v1 [cs.cv] 2 Jan 2019

arxiv: v1 [cs.cv] 2 Jan 2019 Detecting Text in the Wild with Deep Character Embedding Network Jiaming Liu, Chengquan Zhang, Yipeng Sun, Junyu Han, and Errui Ding Baidu Inc, Beijing, China. {liujiaming03,zhangchengquan,yipengsun,hanjunyu,dingerrui}@baidu.com

More information

Channel Locality Block: A Variant of Squeeze-and-Excitation

Channel Locality Block: A Variant of Squeeze-and-Excitation Channel Locality Block: A Variant of Squeeze-and-Excitation 1 st Huayu Li Northern Arizona University Flagstaff, United State Northern Arizona University hl459@nau.edu arxiv:1901.01493v1 [cs.lg] 6 Jan

More information

arxiv: v1 [cs.cv] 4 Jan 2018

arxiv: v1 [cs.cv] 4 Jan 2018 PixelLink: Detecting Scene Text via Instance Segmentation Dan Deng 1,3, Haifeng Liu 1, Xuelong Li 4, Deng Cai 1,2 1 State Key Lab of CAD&CG, College of Computer Science, Zhejiang University 2 Alibaba-Zhejiang

More information

Classifying a specific image region using convolutional nets with an ROI mask as input

Classifying a specific image region using convolutional nets with an ROI mask as input Classifying a specific image region using convolutional nets with an ROI mask as input 1 Sagi Eppel Abstract Convolutional neural nets (CNN) are the leading computer vision method for classifying images.

More information

arxiv: v1 [cs.cv] 16 Nov 2015

arxiv: v1 [cs.cv] 16 Nov 2015 Coarse-to-fine Face Alignment with Multi-Scale Local Patch Regression Zhiao Huang hza@megvii.com Erjin Zhou zej@megvii.com Zhimin Cao czm@megvii.com arxiv:1511.04901v1 [cs.cv] 16 Nov 2015 Abstract Facial

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Cascaded Sparse Spatial Bins for Efficient and Effective Generic Object Detection

Cascaded Sparse Spatial Bins for Efficient and Effective Generic Object Detection Cascaded Sparse Spatial Bins for Efficient and Effective Generic Object Detection David Novotny,2 Visual Geometry Group University of Oxford david@robots.ox.ac.uk Jiri Matas 2 2 Center for Machine Perception

More information