Text Particles Multi-band Fusion for Robust Text Detection

Pengfei Xu, Rongrong Ji, Hongxun Yao, Xiaoshuai Sun, Tianqiang Liu, and Xianming Liu

School of Computer Science and Engineering, Harbin Institute of Technology
P.O. BOX 321, West Dazhi Street, Harbin, 150001, China
{pfxu,rrji,yhx,xssun,tqliu,liuxianming}@vilab.hit.edu.cn

A. Campilho and M. Kamel (Eds.): ICIAR 2008, LNCS 5112, pp. 587-596, 2008. © Springer-Verlag Berlin Heidelberg 2008

Abstract. Texts in images and videos usually carry important information for visual content understanding and retrieval. Two main restrictions exist in state-of-the-art text detection algorithms: weak contrast and text-background variance. This paper presents a robust text detection method based on text particles (TP) multi-band fusion to solve these problems. First, text particles are generated from the local binary patterns of pyramid Haar wavelet coefficients in YUV color space. This preserves and normalizes text-background contrast while extracting multi-band information. Second, candidate text regions are generated via density-based text particle multi-band fusion, and LHBP histogram analysis is used to remove non-text regions. Our TP-based detection framework can robustly locate text regions regardless of diversity in size, color, rotation, illumination and text-background contrast. Experimental results on ICDAR 03, compared with existing methods, demonstrate the robustness and effectiveness of the proposed method.

Keywords: text detection, text particle, multi-band fusion, local binary pattern, LHBP.

1 Introduction

In recent years, multimedia content analysis, retrieval and annotation have become a hot topic [1-9]. Compared with other visual content, text extracted from images and videos is close to their high-level semantic cues. Text detection aims at localizing text regions within images/videos via visual content analysis. Generally speaking, there are three approaches:

1) Edge-based methods [2, 3], in which edge detection is conducted and followed by a text/non-text classifier such as an SVM [2] or a neural network [3].
2) Connected component (CC) based methods [4-5], in which connected components of text regions are detected and extracted as descriptors; these are simple to implement for text localization.
3) Texture-based methods [6-8], which usually involve wavelet decomposition and learning-based post-classification.

Texture-based methods have been demonstrated to be robust and effective both in the literature [6-8] and in the ICDAR 03/05 text detection competitions. In our former work [8], we extended Local Haar Binary Patterns (LHBP) based on the wavelet energy feature. That method performs better than the wavelet method [7], but its threshold strategy does not always work well. It does not consider color information, since it extracts the texture feature only from the gray band. When the luminance of the foreground is similar to that of the background but their colors differ, its performance is very poor.

In the state-of-the-art methods [2-8], two problems are not well solved, and they strictly restrict text detection performance in real-world applications:

1. Problem of weak contrast: Although localized thresholds can normalize different texture changes [7], their performance is poor when the text-background contrast is low. When the text region is similar to the background (Fig. 1 (a)), it is difficult to get high performance using either color or texture threshold based methods [4-8] due to the low contrast.
2. Problem of text-background variance: In text regions, background variation strongly affects feature extraction and text region detection. Former methods [2-3, 6-8] extract features according to the gray-level information of each image, which is strongly affected by background variation, especially when the scene image is over-exposed (Fig. 1 (b)).

Fig. 1. (a) Weak contrast between text and background. (b) Text-background variance.

This paper addresses the above-mentioned problems with a unified solution. We propose a Text Particle (TP) descriptor to represent local texture features, which are extracted from the local binary patterns of Haar coefficients. The descriptor can detect text regions while ignoring their variations in scale, illumination uniformity, rotation and the degree of contrast between the foreground and the background. Then, multi-band fusion is used to enhance detection performance, and post-processing with LHBP histogram analysis removes non-text regions that resemble text regions. Fig. 2 presents the proposed text detection framework.

Fig. 2. Text particles multi-band fusion framework

The rest of this paper is organized as follows. In Section 2, we describe the TP text descriptor. Section 3 presents our multi-band fusion strategy based on TP density evaluation and LHBP histogram analysis. Section 4 shows experimental comparisons between the proposed method and some state-of-the-art text detection methods. Finally, this paper concludes and discusses our future research direction.

2 Text Particles Based on Local Haar Binary Patterns

We first describe the two key elements of the proposed method: the local Haar binary patterns (LHBP) (subsection 2.1) and the direction analysis of text regions (subsection 2.2). Then, we explain how to utilize these elements to obtain the Text Particle detector in subsection 2.3.

2.1 Local Haar Binary Patterns (LHBP)

Proposed by Ojala [10-11], the local binary pattern (LBP) is a robust texture descriptor, which is used in video surveillance and face recognition [12-13]. LBP extracts the changes between each pixel and its local neighbors; thus, LBP is not only translation and rotation invariant, but also illumination invariant. We utilize LBP to handle illumination variance in text regions. For each pixel $(x_c, y_c)$ in a given image, we conduct the binary conversion between $(x_c, y_c)$ and its 8-neighborhood pixels as follows:

\[
S(x) = \begin{cases} 1, & f(x_p, y_p) - f(x_c, y_c) \ge 0 \\ 0, & f(x_p, y_p) - f(x_c, y_c) < 0 \end{cases} \tag{1}
\]

where $f(x_p, y_p)$ is the value of the $p$-th 8-neighborhood pixel ($p = 0, \dots, 7$), and $f(x_c, y_c)$ is the value of the center pixel. Subsequently, a mask template (Fig. 3 (b)) with weights $2^p$ is adopted to calculate the LBP value of the center pixel $(x_c, y_c)$ as below:

\[
\mathrm{LBP}(f(x_c, y_c)) = \sum_{p=0}^{7} S\big(f(x_p, y_p) - f(x_c, y_c)\big)\, 2^p \tag{2}
\]

where

\[
S(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}
\]

Fig. 3. (a) the neighbor sequence; (b) a weighted mask; (c) an example; (d) the LBP pattern of (c), its value is 193
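As a minimal sketch of Eq. 2, the LBP code of every interior pixel can be computed as below. The neighbor sequence of Fig. 3 (a) cannot be read from the text, so the ordering used here (clockwise, starting at the top-left neighbor) is an assumption, and the names `OFFSETS` and `lbp_codes` are illustrative only.

```python
import numpy as np

# Assumed neighbor order p = 0..7: clockwise, starting at the top-left pixel.
# Each entry is a (row offset, column offset) relative to the center pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_codes(image):
    """Return the LBP code (Eq. 2) of every interior pixel of a 2-D image."""
    image = np.asarray(image, dtype=np.int64)
    h, w = image.shape
    center = image[1:h - 1, 1:w - 1]
    codes = np.zeros_like(center)
    for p, (dy, dx) in enumerate(OFFSETS):
        neighbor = image[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # S(f(x_p, y_p) - f(x_c, y_c)) is 1 when the neighbor is not darker
        # than the center; the bit is weighted by 2^p.
        codes += (neighbor - center >= 0).astype(np.int64) << p
    return codes


# A bright left column next to a darker center: bits 0, 6 and 7 are set.
patch = [[9, 1, 1],
         [9, 5, 1],
         [9, 1, 1]]
print(lbp_codes(patch))  # [[193]]
```

Under the assumed ordering, a bright column to the left of the center sets bits 0, 6 and 7, which gives 1 + 64 + 128 = 193.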

We develop the local binary patterns (LBP) on the energy of the high-frequency coefficients in the pyramid Haar wavelet transform domain to represent the multi-scale features of images. The 8-neighborhood LBP code is computed on the LH, HL and HH bands, and the result is named local Haar binary patterns (LHBP). In particular, a threshold criterion is adopted to filter out gradual illumination variance:

\[
\mathrm{LHBP}(f(x_c, y_c)) = \sum_{p=0}^{7} S\big(f_{haar}(x_p, y_p) - f_{haar}(x_c, y_c)\big)\, 2^p \tag{3}
\]

where

\[
S(x) = \begin{cases} 1, & x \ge \mathrm{Threshold} \\ 0, & x < \mathrm{Threshold} \end{cases}
\]

Compared with traditional texture descriptors based on wavelet energy, LHBP is a threshold-restricted directional coding of the pyramid Haar coefficients that disregards the magnitude of the directional variation, and it can normalize the illumination variance between text and background in scene images. This is a notable advantage of LHBP.

2.2 Direction Analysis of Text Regions

Compared with non-text regions, text regions have a more significant texture distribution in these directions: horizontal, vertical, diagonal and anti-diagonal. The strokes of letters usually have two or more of the above-mentioned directions. The directional distribution of common characters can be calculated, including capital letters, small letters and Arabic numerals, as depicted in Table 1. The results in Table 1 demonstrate the directional relativity of two or more stroke directions of letters.

Table 1. Directional relativity of letter strokes

Relativity of stroke directions   Common letters
horizontal, vertical              B C D E F G H J L O P Q R S T U a b c d e f g h j m n o p q r s t u 0 2 3 4 5 6 8 9
horizontal, diagonal              A Z z 7
horizontal, anti-diagonal         A
vertical, diagonal                K M N Y k 4
vertical, anti-diagonal           K M N R Y
diagonal, anti-diagonal           A K M V W X Y k v w x y

The relationships between LHBP coding and the directions of texture are depicted as the Directional Texture Coding Table (DTCT) in Table 2. For example, the LHBP code in Fig. 3 (d) corresponds to the texture pattern in Fig. 3 (c).

Table 2. Directional Texture Coding Table of LHBP (DTCT)

Direction of texture   LBP code
horizontal             7, 112, 119
vertical               28, 193, 221
diagonal               4, 14, 64, 224
anti-diagonal          1, 16, 56, 131
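The thresholded coding of Eq. 3 and the lookup in Table 2 can be illustrated with the following sketch. It assumes that one high-frequency Haar band (LH, HL or HH) has already been computed elsewhere and is passed in as a 2-D array; the neighbor ordering (clockwise from the top-left) and the names `lhbp_codes` and `direction_masks` are assumptions made for illustration.

```python
import numpy as np

# Directional Texture Coding Table (Table 2): direction -> LHBP codes.
DTCT = {
    "horizontal": (7, 112, 119),
    "vertical": (28, 193, 221),
    "diagonal": (4, 14, 64, 224),
    "anti-diagonal": (1, 16, 56, 131),
}

# Assumed neighbor order p = 0..7: clockwise, starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lhbp_codes(band, threshold):
    """Eq. 3 on one Haar coefficient band: LBP with a threshold in S(x)."""
    band = np.asarray(band, dtype=np.float64)
    h, w = band.shape
    center = band[1:h - 1, 1:w - 1]
    codes = np.zeros(center.shape, dtype=np.int64)
    for p, (dy, dx) in enumerate(OFFSETS):
        neighbor = band[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Differences below the threshold (gradual changes) contribute 0.
        codes += (neighbor - center >= threshold).astype(np.int64) << p
    return codes


def direction_masks(codes):
    """Map every LHBP code to a boolean mask per DTCT direction."""
    codes = np.asarray(codes)
    return {name: np.isin(codes, table) for name, table in DTCT.items()}


band = [[9, 1, 1],
        [9, 5, 1],
        [9, 1, 1]]
codes = lhbp_codes(band, threshold=2.0)
masks = direction_masks(codes)
print(codes, masks["vertical"])
```

With a threshold of 2 the bright left column still exceeds the center by 4, so the code is 193 and falls in the vertical row of Table 2; raising the threshold to 5 suppresses it.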

2.3 Text Particles Based on Directional LHBP

As mentioned in Section 2.2, text regions have significant texture in certain directions, and the value of LHBP can reveal their texture patterns. In this section we propose a novel text region descriptor, which combines the directional texture distribution of text regions with the directional character of LHBP coding. First, a window-constrained detection template (size $n \times n$) is convolved over each band and scale of the LHBP image. The texture value for each direction is calculated using Eq. 4:

\[
\mathrm{DirThreshold}_d(i, j) = \sum_{k=-n/2}^{n/2-1} \; \sum_{l=-n/2}^{n/2-1} \mathrm{DirTexture}_d(i, j)(k, l) \tag{4}
\]

where

\[
\mathrm{DirTexture}_d(i, j)(k, l) = \begin{cases} 1, & \mathrm{LHBP}(i+k, j+l) \in \mathrm{DTCT}_d \\ 0, & \mathrm{LHBP}(i+k, j+l) \notin \mathrm{DTCT}_d \end{cases}
\]

In Eq. 4, $d$ ranges from 1 to 4, mapping to the four directions: horizontal, vertical, diagonal and anti-diagonal; $\mathrm{DirTexture}_d(i, j)$ describes whether pixel $(i, j)$ and its neighbors within the detection window ($n \times n$) carry direction $d$ of the DTCT; and $\mathrm{LHBP}(x, y)$ is the value of the LHBP image. We adopt a threshold criterion $T_d$ for the $d$-th direction in Eq. 5 and Eq. 6.

\[
\mathrm{DirFlag} = \sum_{d=1}^{4} \mathrm{Flag}(d) \tag{5}
\]

where

\[
\mathrm{Flag}(d) = \begin{cases} 1, & \mathrm{DirThreshold}_d \ge T_d \\ 0, & \text{otherwise} \end{cases}
\]

\[
\mathrm{DirFilter} = \begin{cases} \text{True}, & \mathrm{DirFlag} \ge 2 \\ \text{False}, & \text{otherwise} \end{cases} \tag{6}
\]

In Eq. 6, if DirFilter is True, the region (size $n \times n$) is marked as a candidate text region; such a region is called a Text Particle (TP). TP makes full use of LHBP and the texture directions of text regions. We adopt LHBP to represent the texture in the multiple bands of YUV color space, and it is effective when illumination, contrast and size vary (Fig. 4).

3 Fusion of Candidate Text Regions

3.1 Fusion Strategy

In this section, we describe our TP multi-band fusion strategy to refine candidate text regions based on TP density. With the TPs on every band and scale, the fusion based on TP density aims at combining all of them to obtain more expressive features of text regions.
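Before the densities are fused, each band and scale must yield its text particles. The test of Eqs. 4-6 can be sketched roughly as below, under two simplifying assumptions that are not taken from the paper: the window is moved over non-overlapping $n \times n$ tiles rather than every pixel, and the thresholds `t_d` are placeholder values. The function name `text_particles` is likewise hypothetical.

```python
import numpy as np

# DTCT rows of Table 2 in the order d = 1..4:
# horizontal, vertical, diagonal, anti-diagonal.
DTCT = [
    (7, 112, 119),
    (28, 193, 221),
    (4, 14, 64, 224),
    (1, 16, 56, 131),
]


def text_particles(lhbp, n=8, t_d=(4, 4, 4, 4)):
    """Mark n x n tiles of an LHBP code image that pass Eqs. 4-6.

    Assumption: tiles do not overlap; t_d holds one illustrative
    threshold per direction.
    """
    lhbp = np.asarray(lhbp)
    # DirTexture_d: 1 where the LHBP code belongs to DTCT_d, else 0.
    dir_texture = np.stack([np.isin(lhbp, row) for row in DTCT]).astype(np.int64)
    rows, cols = lhbp.shape[0] // n, lhbp.shape[1] // n
    particles = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            window = dir_texture[:, r * n:(r + 1) * n, c * n:(c + 1) * n]
            dir_threshold = window.sum(axis=(1, 2))                    # Eq. 4
            dir_flag = int((dir_threshold >= np.asarray(t_d)).sum())   # Eq. 5
            particles[r, c] = dir_flag >= 2                            # Eq. 6
    return particles


# Left tile mixes horizontal (7) and vertical (28) codes; right tile is
# purely horizontal, so only the left tile has two dominant directions.
demo = np.zeros((8, 16), dtype=np.int64)
demo[:4, :8] = 7
demo[4:, :8] = 28
demo[:, 8:] = 7
print(text_particles(demo))
```

Only the left tile is marked, because Eq. 6 requires at least two directions to pass their thresholds.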

First, we propose the TP density to evaluate how tightly the TPs are distributed in an area. We calculate the value by density estimation at each TP point as a discrete approximation. The TP density of a detection area $T$ at the $i$-th scale on the $j$-th band is defined as:

\[
D_{ij}(T) = \frac{1}{n} \sum_{k=0}^{n} w_k \, e^{-(x - x_k)^2} \tag{7}
\]

where $n$ is the total number of TPs in the detection area $T$, $x$ is the center of $T$, $x_k$ is the $k$-th point in $T$, $(x - x_k)$ is the L2 distance between $x_k$ and $x$, and $w_k$ is a weight proportional to the area of the $k$-th TP.

To take full account of the results at different scales on different bands, we merge the densities with weights, and the fused density value is calculated as in Eq. 8:

\[
F(T) = \sum_{i=m}^{M} \sum_{j=1}^{3} w_{ij} \, D_{ij}(T) \tag{8}
\]

where the weight $w_{ij}$ represents the confidence in the $i$-th scale of the $j$-th band, and $m$ and $M$ are the minimum and maximum scales of the wavelet transform on every band. However, rather than using all bands, it is more expressive to use only one band when the other ones perform weakly; for example, the V band performs poorly when the text color is similar to the background color (Fig. 1 (a)). We define the TP density at the $j$-th band as:

\[
F_j(T) = \sum_{i=m}^{M} w_j \, D_{ij}(T) \tag{9}
\]

where

\[
w_j = \frac{1}{M - m} \sum_{i=m}^{M} w_{ij}
\]

Then we apply the density criterion $T_s$ to the density $F(T)$ of region $T$ fused over all scales of all bands, and the density criterion $T_d$ to the single-band density $F_j(T)$. Finally, we mark the area $T$ as a candidate text region if it satisfies $T_s$ for the fused density or $T_d$ for a single band:

\[
R(T) = \begin{cases} \text{True}, & \text{if } F(T) \ge T_s \text{ or } F_j(T) \ge T_d \\ \text{False}, & \text{otherwise} \end{cases} \tag{10}
\]

Generally speaking, the fusion strategy fuses the TPs from all bands and scales to locate candidate text regions effectively. As a result, it detects text regions more accurately (Fig. 4).

3.2 Post-processing Based on LHBP Histogram

We observe that the accumulation histograms of non-text regions differ considerably from those of text regions. We divide each detected region (including text regions and non-text regions) into four blocks and calculate the accumulation histogram of LHBP, which has 256 bins, on every block for texture analysis. Then, we evaluate the texture through the weighted histogram difference of the four blocks to remove the non-text regions.
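The density fusion of Section 3.1 (Eqs. 7-10) can be sketched as follows. This is only an illustration under stated assumptions: the density averages over the TP points that are supplied, the per-band weight uses the $1/(M-m)$ factor literally (so at least two scales are required), and a region is accepted when any single band passes $T_d$. The names `tp_density` and `is_candidate` and all numeric values are hypothetical.

```python
import numpy as np


def tp_density(center, points, weights):
    """Eq. 7: mean of w_k * exp(-|x - x_k|^2) over the TP points in an area."""
    points = np.asarray(points, dtype=np.float64).reshape(-1, 2)
    if len(points) == 0:
        return 0.0
    weights = np.asarray(weights, dtype=np.float64)
    offset = points - np.asarray(center, dtype=np.float64)
    sq_dist = (offset ** 2).sum(axis=1)
    return float((weights * np.exp(-sq_dist)).sum() / len(points))


def is_candidate(density, weights, t_s, t_d):
    """Eqs. 8-10 for one region; rows are scales i, columns are bands j."""
    density = np.asarray(density, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    num_scales = density.shape[0]
    if num_scales < 2:
        raise ValueError("the 1 / (M - m) factor needs at least two scales")
    fused = (weights * density).sum()                      # Eq. 8: F(T)
    band_weight = weights.sum(axis=0) / (num_scales - 1)   # w_j
    per_band = band_weight * density.sum(axis=0)           # Eq. 9: F_j(T)
    # Assumption: passing T_d on any one band is enough (Eq. 10).
    return bool(fused >= t_s or (per_band >= t_d).any())


# Two scales by three bands; only the first band responds strongly.
density = [[0.2, 0.1, 0.0],
           [0.4, 0.1, 0.0]]
weights = np.full((2, 3), 0.5)
print(is_candidate(density, weights, t_s=1.0, t_d=0.5))
```

In this example the fused density stays below $T_s$, yet the region is still accepted because the first band alone exceeds $T_d$, which is the situation the single-band criterion is meant to cover.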

Fig. 4. The results of the wavelet-based method [7] (a, c) and the TP multi-band fusion based method (b, d)

4 Experiment

4.1 Datasets and Evaluation

We evaluate our proposed method on the Location Detection Database of the ICDAR 03 Robust Reading Competition Set [14]. The dataset contains 258 training images and 249 validation images, which contain 1107 text regions. We use the validation dataset for testing. Each test image contains one or more text lines. The detection task requires automatically locating the text lines in every test image. The results of the different methods are evaluated with recall, precision and f, which are the same measures as in the ICDAR 2003 competition [14]. The precision ($p$) and the recall ($r$) are defined as follows:

\[
p = \frac{\sum_{r_e \in E} m(r_e, T)}{|E|} \tag{11}
\]

\[
r = \frac{\sum_{r_t \in T} m(r_t, E)}{|T|} \tag{12}
\]

where $m(r, R) = \max \{\, m_p(r, r') \mid r' \in R \,\}$, $m_p$ is the match between two rectangles, defined as the area of their intersection divided by the area of the minimum bounding box containing both rectangles, $E$ is the set of detection results and $T$ is the set of correct text regions. The standard measure $f$ is a single measure of quality combining $r$ and $p$. It is defined as:

\[
f = \frac{1}{\alpha / p + (1 - \alpha) / r} \tag{13}
\]

The parameter $\alpha$ gives the relative weights of $p$ and $r$. It is set to 0.5 to give equal weight in our experiment.

4.2 Experiments

In this section, three experiments are designed to evaluate the performance of the proposed method.
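For reference, the measures of Section 4.1 (Eqs. 11-13) can be computed as in the sketch below. Rectangles are assumed to be axis-aligned tuples `(x0, y0, x1, y1)`; this representation, the function names and the choice to return 0 when a set is empty are conventions of this sketch, not of the competition protocol.

```python
def match(a, b):
    """m_p: intersection area over the area of the joint bounding box."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(inter_w, 0) * max(inter_h, 0)
    box = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / box if box > 0 else 0.0


def best_match(rect, rects):
    """m(r, R): the best match of rect against any rectangle in rects."""
    return max((match(rect, other) for other in rects), default=0.0)


def evaluate(estimates, targets, alpha=0.5):
    """Return (precision, recall, f) following Eqs. 11-13."""
    p = sum(best_match(e, targets) for e in estimates) / len(estimates) if estimates else 0.0
    r = sum(best_match(t, estimates) for t in targets) / len(targets) if targets else 0.0
    f = 1.0 / (alpha / p + (1 - alpha) / r) if p > 0 and r > 0 else 0.0
    return p, r, f


# One exact detection plus one false alarm against a single ground truth.
targets = [(0, 0, 10, 10)]
estimates = [(0, 0, 10, 10), (20, 20, 30, 30)]
print(evaluate(estimates, targets))
```

Here the false alarm halves the precision while the recall stays at 1, so with $\alpha = 0.5$ the combined measure is $f = 1 / (0.5/0.5 + 0.5/1) = 2/3$.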

Experiment 1 (TP based on LHBP): To evaluate the efficiency of the LHBP descriptors in the proposed method, we compare our detection method with 1) the method based on wavelet energy features [7] and 2) the edge-based method [5]. The wavelet energy method uses solely the Haar wavelet texture feature without the color feature. We obtain the P-R curve by varying the thresholds of the methods (Fig. 5). At the peak of f, compared with the wavelet energy method [7], our method's performance improves by over 12% in p, over 23% in r, and over 16% in f. Compared with the experimental result of the edge method in [5] (Fig. 6, Ezaki-Edge), the precision of [5] is almost the same as that of our method, but its recall and f are lower.

Analysis: Looking deeper into this result, the wavelet energy (or edge) based methods extract features only from the I-band image of the original image, and the feature extraction is affected by text-background contrast (illumination) variance. Using the TP based on LHBP, our method can normalize the illumination variance of the text-background contrast to describe the texture effectively.

Fig. 5. P-R curves of the compared methods

Fig. 6. P&R comparison on the ICDAR 03 trial-test set

Experiment 2 (Multi-band Fusion Strategy): To demonstrate the efficiency of combining the TP descriptor with different color bands, the proposed method is compared with the method in [5], which uses solely color features. As presented in Fig. 6, compared with the color-based detection method, performance improves by over 4% in p, over 38% in r, and over 20% in f.

Analysis: The method in [4] solely extracts color features from source images, and these features cannot effectively distinguish text from background when the text color is similar to the background (Fig. 1 (a)). The proposed method extracts the TPs from the YUV color space and fuses the results of every band. It can capture the intrinsic features of text to enhance detection performance (Fig. 8).

Experiment 3 (Results of ICDAR 03): Our method is compared with 1) representative text detection methods [4, 5] based on color and edge features and 2) the competition results of ICDAR 03 [14]. As presented in Fig. 7, compared with those methods, performance is roughly equivalent in p, but improves by over 10% in r and over 7% in f. This demonstrates the effectiveness of the proposed Text Particles Multi-Band Fusion method.

Fig. 7. P&R comparison on the ICDAR 03 trial-test set

Fig. 8. Text detection results compared with other methods. (a, d) the proposed method's results. (b, e) Yi-Edge+Color results. (c, f) Ye-Wavelet results.

5 Conclusion

This paper proposes a text particle detection method based on LHBP multi-band fusion. We not only address the variances in both illumination and text-background contrast, but also fuse color and texture features so that they reinforce each other. Experimental results on ICDAR 03 against three state-of-the-art methods demonstrate the efficiency of the proposed method. In future work, we will further investigate the problem of rigid and non-rigid text region transformation, to extend our system to arbitrary viewpoints in the real world.

Acknowledgement. This research is supported by the State 863 High Technology R&D Project of China (No. 2006AA01Z197), the Program for China New Century Excellent Talents in University (NCET-05-03 34), the Natural Science Foundation of China (No. 60472043) and the Natural Science Foundation of Heilongjiang Province (No. E2005-29).

References

1. El Rube, I., Ahmed, M., Kamel, M.: Wavelet approximation-based affine invariant shape representation functions. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(2), 323-327 (2006)
2. Chen, D.T., Bourlard, H., Thiran, J.P.: Text identification in complex background using SVM. In: International Conference on Computer Vision and Pattern Recognition, pp. 621-626 (2001)
3. Lienhart, R., Wernicke, A.: Localizing and segmenting text in images and videos. IEEE Transactions on Circuits and Systems for Video Technology 12, 256-268 (2002)
4. Ezaki, N., Bulacu, M., Schomaker, L.: Text Detection from Natural Scene Images: Towards a System for Visually Impaired Persons. In: International Conference on Pattern Recognition, vol. 2, pp. 683-686 (2004)
5. Yi, J., Peng, Y., Xiao, J.: Color-based Clustering for Text Detection and Extraction in Image. In: ACM Conference on Multimedia, pp. 847-850 (2007)
6. Gllavata, J., Ewerth, R., Freisleben, B.: Text detection in images based on unsupervised classification of high frequency wavelet coefficients. In: International Conference on Pattern Recognition, pp. 425-428 (2004)
7. Ye, Q.X., Huang, Q.M.: A New Text Detection Algorithm in Image/Video Frames. In: Advances in Multimedia Information Processing, 5th Pacific Rim Conference on Multimedia, Tokyo, Japan, November 30-December 3, 2004, pp. 858-865 (2004)
8. Ji, R.R., Xu, P.F., Yao, H.X., Sun, X.S., Liu, T.Q.: Directional Correlation Analysis of Local Haar Binary Pattern for Text Detection. In: IEEE International Conference on Multimedia & Expo (accepted, 2008)
9. Xi, D., Kamel, M.: Extraction of filled in strokes from cheque image using pseudo 2D wavelet with adjustable support. In: IEEE International Conference on Image Processing, vol. 2, pp. 11-14 (2005)
10. Ojala, T., Pietikäinen, M., Harwood, D.: A Comparative Study of Texture Measures with Classification Based on Feature Distributions. Pattern Recognition 29(1), 51-59 (1996)
11. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multi-resolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971-987 (2002)
12. Li, S., Chu, R., Liao, S., Zhang, L.: Illumination Invariant Face Recognition Using Near-Infrared Images. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(4), 627-639 (2007)
13. Zhao, G., Pietikäinen, M.: Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(6), 915-928 (2007)
14. Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R.: ICDAR 2003 robust reading competitions. In: Proceedings of International Conference on Document Analysis and Recognition, pp. 682-687 (2003)