A Gradient Difference based Technique for Video Text Detection

Palaiahnakote Shivakumara, Trung Quy Phan and Chew Lim Tan
School of Computing, National University of Singapore
{shiva, phanquyt, tancl}@comp.nus.edu.sg

Abstract

Text detection in video images has received increasing attention, particularly scene text detection, as it plays a vital role in video indexing and information retrieval. This paper proposes a new and robust gradient difference technique for detecting both graphics and scene text in video images. The technique introduces the concept of zero crossing to determine the bounding boxes for the detected text lines in video images, rather than using the conventional projection profile based method, which fails to fix bounding boxes when there is no proper spacing between the detected text lines. We demonstrate the capability of the proposed technique by conducting experiments on video images containing both graphics text and scene text with different font shapes and sizes, languages, text directions, backgrounds and contrasts. Our experimental results show that the proposed technique outperforms existing methods in terms of detection rate on a large video image database.

1. Introduction

Since the 1990s, with the rapid growth of available multimedia and the increasing demand for information indexing and retrieval, much effort has been devoted to text detection in video images [1]. A large number of approaches have been proposed and have already obtained impressive performance under some constraints [1]. However, detecting text in video without any constraints remains challenging and interesting due to many undesirable properties of video images, such as low resolution, low contrast, unknown text color, size, position and orientation, color bleeding and unconstrained background [2, 3]. There are two types of text in video: (1) caption/graphics/artificial text, which is artificially superimposed on the video by a human, and (2) scene text, which naturally occurs during video capture. Scene text detection is a more challenging task than graphics text detection due to varying lighting, complex movement and transformation [1].
From the literature review, we find that connected-component based methods are simple but not robust because they are based on geometrical properties of components [4]. On the other hand, texture based methods may be unsuitable for small fonts and poor contrast text [5, 6]. In contrast to the preceding two approaches, edge and gradient based methods are fast and efficient but give more false positives when a complex background is present [7-9]. The major problem of these methods lies in choosing threshold values to classify text and non-text pixels. A method based on uniform colors in L*a*b* space is also proposed in [10] to locate uniform-colored text in video frames. This method fails when text in video contains multiple colors within a text line or within a word. These observations show that there is demand for a robust technique that gives a better detection rate with fewer false alarms, without any constraints, for text detection in video images.

Hence, in this paper, we propose a new robust gradient difference technique for detecting text in video images. We observe that high positive and negative gradient values occur on or near text pixels, compared with the gradient of non-text pixels. This observation motivated us to propose a gradient difference technique for text detection in video images. Further, instead of the conventional projection profile based method, we introduce a zero crossing technique for fixing the boundaries of text lines in video images.

2. Text Detection Algorithm

2.1 Gradient Difference for Text Detection

It is noted in [9] that gradient information in text areas differs from that in non-text regions because of the high contrast of text. This is the basis of our gradient difference technique. For a given gray image as shown in Figure 1(a), the technique computes the gradient image (G) by using a horizontal mask [-1 1], which gives rise to Figure 1(b).
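As a rough illustration of the horizontal gradient step, the following minimal sketch applies the mask [-1 1] to a single image row. The function name `horizontal_gradient`, the sign convention (right neighbour minus current pixel) and the zero used for the last column are assumptions made for this example; the paper specifies only the mask itself.

```python
from typing import List

Image = List[List[float]]


def horizontal_gradient(gray: Image) -> Image:
    """Apply the horizontal mask [-1 1] to every row of a gray image.

    Assumed convention: G(x, y) = F(x, y + 1) - F(x, y), i.e. the right-hand
    neighbour minus the current pixel. The last column has no right-hand
    neighbour and is set to 0 (an assumed border rule).
    """
    grad: Image = []
    for row in gray:
        if not row:
            # Keep empty rows empty so the output shape matches the input.
            grad.append([])
            continue
        diffs = [float(row[j + 1]) - float(row[j]) for j in range(len(row) - 1)]
        grad.append(diffs + [0.0])
    return grad


# One row crossing a dark stroke (10) on a bright background (200).
sample_row = [200, 200, 10, 10, 200, 200]
print(horizontal_gradient([sample_row])[0])
```

In this toy row the stroke produces one strongly negative and one strongly positive gradient value close together, which is the property the gradient difference technique relies on.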
Then the Gradient Difference (GD) is obtained for each pixel in G as the difference between the maximum and minimum gradient values within a local window of size 1 × n centered at the pixel, where n is a value that depends on the character stroke width. In this study, we choose n = 11, keeping small fonts in mind. High positive and negative gradient values in text regions result from the high intensity contrast between the text and background regions. Therefore, text regions will have both large positive and large negative gradients in a local region due to the even distribution of character strokes. This results in locally large GD values. To detect such large values, the technique determines a threshold (T) automatically. The result is shown in Figure 1(c), where we can see text clearly as white patches and background as dark color. Small isolated white patches in Figure 1(c) are removed and the output is shown in Figure 1(d). Boundaries for the white patches representing text lines are computed using a zero crossing technique, which is discussed in Section 2.2, as shown in Figure 1(e). Figure 1(f) shows the text blocks detected and Figure 1(g) shows the extracted text blocks.

[Figure 1. Text detection: (a) Input image, (b) Gradient image, (c) Text segments, (d) Removed small white patches, (e) Text boundary, (f) Text blocks detected, (g) Text blocks extracted]

More specifically, the algorithm for detecting text in video images is as follows. Let F(x, y) be the given gray image, G(x, y) be the gradient image obtained by convolving the horizontal mask [-1 1] with F(x, y), and W(x, y) be the local window of size 1 × 11 centered at (x, y). Obtain the minimum and the maximum gradient values in W over G(x, y) as follows:

$$\mathrm{Min}(x, y) = \min_{(x_i, y_i) \in W(x, y)} G(x_i, y_i) \tag{1}$$

$$\mathrm{Max}(x, y) = \max_{(x_i, y_i) \in W(x, y)} G(x_i, y_i) \tag{2}$$

Using equations (1) and (2), compute GD(x, y) as follows:

$$GD(x, y) = \mathrm{Max}(x, y) - \mathrm{Min}(x, y) \tag{3}$$

Then a pixel is classified as follows:

$$(x, y) = \begin{cases} \text{Text pixel}, & \text{if } GD(x, y) > T \\ \text{Non-text pixel}, & \text{otherwise} \end{cases} \tag{4}$$

The global threshold T is determined based on the average value of the gradient difference, computed as follows. First we compute the average gradient value as

$$AVG = \frac{1}{n \times m} \sum_{x=1}^{n} \sum_{y=1}^{m} G(x, y) \tag{5}$$

where n and m are the dimensions of the gradient image (not the window width n used above). Next we count the number of high gradient values as

$$NHG = \mathrm{count}\big(G(x, y) > AVG\big) \tag{6}$$

The sum of GD is computed as

$$SGD = \sum_{x=1}^{n} \sum_{y=1}^{m} GD(x, y) \tag{7}$$

Finally the value of T is computed as

$$T = \frac{SGD}{(n \times m) - NHG} \tag{8}$$

A graphical representation of GD obtained by the text detection algorithm before and after thresholding is given in Figure 2. It can be seen in Figure 2(d) that non-text areas are suppressed by the threshold T.

[Figure 2. Text and background separation: (a) GD before thresholding, (b) 3D graph for (a), (c) GD after thresholding, (d) 3D graph for (c)]

2.2 Zero Crossing for Fixing Bounding Boxes

The conventional projection profile based method fails to fix bounding boxes for text lines when there is no proper spacing between them. Figure 3(b) shows one such example, where the second and third lines are connected to each other. These situations are common in video text detection. Therefore, we propose a zero crossing technique, which does not require complete spacing between the text lines, to fix the boundaries of such text lines. Zero crossing means a transition from 0 to 1 or from 1 to 0. The method counts the number of transitions from 0 to 1 and from 1 to 0 in each column, from top to bottom, of GD(x, y). As shown in Figure 3(b), GD(x, y) is obtained for Figure 3(a) using the text detection algorithm given in Section 2.1. Next, the method chooses the column that gives the maximum number of transitions to be the boundary for the text lines. Here we ignore a transition if the distance between two transitions is too small. With the help of the number of transitions, the technique draws horizontal boundaries for the text lines as shown in Figure 3(d). Further, the technique looks for spacing between the text components within two horizontal boundaries to draw the vertical boundaries for the words and text lines. Then the detected text blocks are extracted as shown in Figure 3(e).

Lastly, in order to eliminate false positives, we compute the height, width, aspect ratio, the number of Canny edges, the number of Sobel edges and the number of transitions from 0 to 1 and from 1 to 0 in the detected text blocks. We eliminate a text block as a false positive if the number of Canny edges is too small, or the number of transitions is too small, or the absolute difference between the number of Canny edges and the number of Sobel edges is less than 2.

[Figure 3. Advantage of zero crossing technique: (a) Input image, (b) Text line segments, (d) Text line boundary, (e) Text blocks extracted]

3. Experimental Results

3.1 Dataset and Methods for Comparison

Since there is no benchmark database, we have created our own dataset for the purpose of experimentation. In this dataset, we have included a variety of video images, such as images from movies, news clips (business, sports), news containing some scene text, sports videos (golf, athletics), music videos and web images. It also includes images in multiple languages such as English, Korean and Chinese. In this experiment, we have selected 488 video images from the above sources, which give 3231 actual text blocks. The method, implemented in MATLAB, is run on a PC with a Pentium IV 2.33 GHz processor. The approximate processing time for each video image of size 352x288 is about 4 seconds for text detection.

We have chosen three existing methods [7, 9, 10] for comparison. Method [7] is based on Sobel edge information for text detection. Method [9] is based on gradients for text detection. However, as explained in Section 1, these methods suffer from the choice of several thresholds. Method [10] makes use of uniform color for text location.

3.2 Sample Test Results

Figures 4-7 show the text detection results of the proposed method in (a) and of the above three existing methods in (b)-(d) for a variety of sample video images. In (a), we show the original image, the text detection and the final text extraction results using the proposed method. In (b)-(d), we show only the text detection results of the three existing methods.

Figure 4 shows that the three existing methods fail to detect text in a low contrast image, whereas the proposed method detects most of the text in the image correctly.

[Figure 4. Text detection for low contrast image: (a) Proposed method, (b) Edge based, (c) Gradient based, (d) Uniform text]

Figure 5 shows that the proposed method detects both graphics and scene text in the athletic image with one false positive, while the three existing methods fail to fix the text line bounding boxes correctly. The gradient based method fails to detect scene text, but the uniform text color method detects text successfully with a false positive.

[Figure 5. Text detection for athletic image: (a) Proposed method, (b) Edge based, (c) Gradient based, (d) Uniform text]

Figure 6 shows that the proposed method detects text in the news image, including the small font present at the bottom of the image. While the edge and gradient based methods miss some text, the uniform text color method tends to include additional non-text information in the bounding boxes. The gradient based method appears to detect small font and low contrast text better than the other two existing methods.

[Figure 6. Text detection for news image: (a) Proposed method, (b) Edge based, (c) Gradient based, (d) Uniform text]

Figure 7 shows that both the proposed method and the edge based method detect text in a complex background correctly. On the other hand, the gradient based and uniform text color methods fail to detect text.

[Figure 7. Text detection for complex background image: (a) Proposed method, (b) Edge based, (c) Gradient based, (d) Uniform text]

3.3 Comparison Metrics

We evaluate the performance of the proposed method by considering detection rate, false positive rate, misdetection rate and average processing time as decision parameters. The detected text blocks are represented by their bounding boxes. The Average Processing Time (APT) is measured over all images under study. To judge the correctness of the detected text blocks, we manually count the Actual Text Blocks (ATB) in the images in the dataset. We also manually label each of the detected blocks as one of the following categories:

Truly detected text block (TDB): a detected block that contains text fully or partially.
Falsely detected text block (FDB): a detected block that does not contain text.
Text block with missing data (MDB): a truly detected text block that misses some characters.

Based on the number of blocks in each of the categories mentioned above, the following metrics are calculated to evaluate the performance of the techniques:

Detection rate (DR) = number of TDB / number of ATB.
False positive rate (FPR) = number of FDB / (number of TDB + number of FDB).
Misdetection rate (MDR) = number of MDB / number of TDB.

The performance of the proposed technique in comparison with the existing methods is summarized in Table 1 and Table 2. Table 2 shows that the detection rate of the proposed method is higher than that of the three existing methods.
Compared with the existing gradient method, the present method degrades somewhat in the false positive rate and misdetection rate. This is insignificant considering the much higher detection rate of the present method. The average processing time of the present method is also comparable to the existing gradient method.

Table 1: Results based on experimental study for the proposed and existing methods

Method                     ATB    TDB    FDB    MDB
Edge based [7]             3231   1288   112    217
Gradient based [9]         3231   1368   116    0
Uniform text color [10]    3231   1996   379    1035
Proposed                   3231   3085   212    63
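The rates defined in Section 3.3 can be recomputed directly from block counts such as those in Table 1. The sketch below does this for the edge based row; the names `BlockCounts` and `rates` are hypothetical helpers introduced for this example, and the recomputed values are only an independent check that need not coincide with every published entry, since the rounding convention used in the paper is not stated.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BlockCounts:
    atb: int  # actual text blocks, counted manually
    tdb: int  # truly detected text blocks
    fdb: int  # falsely detected text blocks
    mdb: int  # truly detected blocks that miss some characters


def rates(c: BlockCounts) -> Tuple[float, float, float]:
    """Return (DR, FPR, MDR) in percent, following the Section 3.3 definitions."""
    dr = 100.0 * c.tdb / c.atb
    fpr = 100.0 * c.fdb / (c.tdb + c.fdb)
    mdr = 100.0 * c.mdb / c.tdb
    return dr, fpr, mdr


# Counts for the edge based method [7], taken from Table 1.
edge_based = BlockCounts(atb=3231, tdb=1288, fdb=112, mdb=217)
dr, fpr, mdr = rates(edge_based)
print(f"DR={dr:.2f}%  FPR={fpr:.2f}%  MDR={mdr:.2f}%")
```

The function assumes non-zero denominators (at least one actual block and at least one truly detected block), which holds for every row of Table 1.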

Table 2: Performance (%) of the proposed and existing methods based on values reported in Table 1

Method                     DR     FPR    MDR    APT (sec)
Edge based [7]             39.8   8.0    16.8   25
Gradient based [9]         42.3   7.0    0      3
Uniform text color [10]    61.7   15.9   51.8   42
Proposed                   95.4   9.3    2.0    4

3.4 Experiment on Window Size

We have conducted experiments on the image shown in Figure 6 to choose a proper value of n, which is used in Section 2.1 for detecting text candidates using gradient difference values, as shown in Figure 8.

[Figure 8. Choosing n values: (a) n = 4, (b) n = 9, (c) n = 11]

It is noticed from Figure 8 that for n = 4 the low contrast text in the bottom line is lost; for n = 9 the bottom line is restored but the low contrast text on the right side is missed; and for n = 11 the method detects all text lines. Hence we choose n = 11 in this work. Further, it is also noticed in Figure 8(b) that the first line looks cropped, whereas in Figure 8(c) the text line is restored completely.

3.5 Limitation of the Proposed Method

Despite its better performance than the existing methods, the proposed method has a limitation in that it fails to fix bounding boxes for staggered text lines or skewed scene text, as shown in Figure 9. A solution to this problem will be addressed in future work.

[Figure 9. Failure in fixing bounding boxes by the proposed method: (a) Staggered text lines, (b) Skewed scene text]

4. Conclusion and Future Work

In this paper, we propose a gradient difference based text detection technique for extracting both graphics text and scene text with different fonts, sizes, scripts, contrasts, orientations and backgrounds. A zero crossing technique is proposed for fixing bounding boxes for touching text lines, rather than the projection profile based method. Experimental results show that the proposed method gives a good detection rate compared with the results of three existing methods.

For our future work, we plan to use temporal information to reduce the false positive rate and misdetection rate, because temporal information will help in locating the exact text position in the video images. Furthermore, the method can be extended to fix the bounding boxes for text lines of arbitrary direction by considering the detected text block as a seed point to trace the direction of the remaining text portion.

Acknowledgment

This research is supported in part by IDM R&D grant R252-000-325-279.

5. References

[1] J. Zhang and R. Kasturi. Extraction of Text Objects in Video Documents: Recent Progress. The Eighth IAPR Workshop on Document Analysis Systems (DAS2008), Nara, Japan, September 2008, pp. 5-17.
[2] K. Jung, K. I. Kim and A. K. Jain. Text information extraction in images and video: a survey. Pattern Recognition, 37, 2004, pp. 977-997.
[3] Q. Ye, Q. Huang, W. Gao and D. Zhao. Fast and robust text detection in images and video frames. Image and Vision Computing, 23, 2005, pp. 565-576.
[4] A. K. Jain and B. Yu. Automatic Text Location in Images and Video Frames. Pattern Recognition, Vol. 31(12), 1998, pp. 2055-2076.
[5] Y. Zhong, H. Zhang and A. K. Jain. Automatic Caption Localization in Compressed Video. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 22, No. 4, 2000, pp. 385-392.
[6] K. I. Kim, K. Jung and J. H. Kim. Texture-Based Approach for Text Detection in Images using Support Vector Machines and Continuous Adaptive Mean Shift Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 12, December 2003, pp. 1631-1639.
[7] C. Liu, C. Wang and R. Dai. Text Detection in Images Based on Unsupervised Classification of Edge-based Features. ICDAR 2005, pp. 610-614.
[8] P. Shivakumara, W. Huang and C. L. Tan. An Efficient Edge based Technique for Text Detection in Video Frames. The Eighth IAPR Workshop on Document Analysis Systems (DAS2008), Nara, Japan, September 2008, pp. 307-314.
[9] E. K. Wong and M. Chen. A new robust algorithm for video text extraction. Pattern Recognition, 36, 2003, pp. 1397-1406.
[10] V. Y. Mariano and R. Kasturi. Locating Uniform-Colored Text in Video Frames. 15th ICPR, Volume 4, 2000, pp. 539-542.