Accurate Overlay Text Extraction for Digital Video Analysis


Accurate Overlay Text Extraction for Digital Video Analysis
Dongqing Zhang and Shih-Fu Chang
Electrical Engineering Department, Columbia University, New York, NY 10027 (Email: dqzhang, sfchang@ee.columbia.edu)

Abstract

This report describes a system to detect and extract overlay text in digital video. Different from previous approaches, the system uses a multiple hypothesis testing approach: the regions of interest (ROIs) likely to contain overlay text are decomposed into several hypothetical binary images using color space partitioning; a grouping algorithm is then conducted to group the identified character blocks into text lines in each binary image; if the layout of the grouped text lines conforms to the verification rules, the bounding boxes of these grouped blocks are output as the detected text regions. Finally, motion verification is used to reduce false alarms. In order to achieve real-time speed, ROI localization is realized using compressed-domain features, including DCT coefficients and motion vectors in MPEG videos. The proposed method showed impressive results, with average recall of 96.9% and precision of 71.6% in testing on digital news videos.

1. Introduction

Videotext detection and recognition has been identified as one of the key components of video retrieval and analysis systems. It can be used in many applications, such as semantic video indexing, summarization, video surveillance and security, and multilingual video information access. Videotext can be classified into two broad categories: graphic text and scene text. Graphic text, or text overlay, is videotext added mechanically by video editors; examples include news/sports video captions, movie credits, etc. Scene text is videotext embedded in real-world objects or scenes; examples include street names, car license plate numbers, and the number/name on the back of a soccer player. This report addresses the problem of accurately detecting and extracting graphic videotext for videotext recognition.
Although overlay text is added into the video manually, our experiments showed that it is as hard to extract as many video objects, such as faces or people. This is due to the following reasons: 1. many overlay texts appear over cluttered scene backgrounds; 2. there is no consistent color distribution for text across different videos, so the color-tone based approaches widely used in face or people detection cannot be applied to text detection; 3. text regions may be so small that, when a color segmentation based approach is applied, a small text region may merge into the large non-text regions in its vicinity.

There has been much prior work on videotext detection and extraction. T. Sato et al. [1] investigated superimposed caption recognition in news video. They use a spatial differential filter to localize the text region, and size/position constraints to refine the detected area. Their algorithm is used in a specific domain, namely CNN news. R. Lienhart et al. [2] gave an approach using texture, color segmentation, contrast segmentation, and motion analysis. In their system, color segmentation by region merging is done on the global video frame without a localization process; thus the accuracy of the extraction may rely heavily on the segmentation accuracy. Li et al. [3] use Haar wavelets to decompose the video frame into subband images, and a neural network classifies the image blocks into text or non-text based on the subband images; the Haar wavelet filtering is essentially a texture energy extraction process. Y. Zhong et al. [4] employ DCT coefficients to localize text regions in MPEG video I-frames. They did not use color information or layout verification, so false alarms are inevitable on texture-like objects such as buildings and crowds. M. Bertini et al. [5] presented a text location method using salient corner detection; salient corner detection may produce false positives in cluttered backgrounds without a good verification process. J. Shim and C. Dorai et al. [6] use a region-based approach for text detection, working on gray-level images generated from the color video frames; texture features are not used in detection. L. Agnihotri and N. Dimitrova [7] use a texture-based approach to locate the text region, without color segmentation or decomposition. A. Jain [8] presented a method for text location in images and video frames using color space decomposition and integration; texture and motion models are not used for detection, and a simple layout verification method based on the vertical projection profile is used. Some of these works did not explicitly address the problem of accurate text boundary extraction. But accurate boundary extraction of videotext is important for recognition, because recognition errors due to poor extraction are often unrecoverable using enhancement or language/knowledge models.
Our method attempts to address the limitations of the previous systems by avoiding background disturbance and reducing false alarms. We use a multiple hypothesis testing approach: the regions of interest (ROIs) likely to contain overlay text are decomposed into several hypothetical binary images using color space partitioning; a grouping algorithm is then conducted to cluster the identified character blocks into text lines in each binary image; if the layout of the grouped text lines conforms to the verification rules, the bounding boxes of these grouped blocks are output as the detected text regions. Motion verification is also used to reduce false alarms. In order to achieve real-time speed, ROI localization is realized using compressed-domain features, including DCT coefficients and motion vectors in MPEG videos. The overall system is illustrated in the following diagram:

Video -> Localization by Texture & Motion -> Color Space Partitioning -> Block Grouping & Layout Analysis -> Temporal Verification -> Text Block

Figure 1. System Flowchart

In the diagram, texture and motion analysis is used to localize the regions of interest (ROIs) of the videotext by extracting texture and motion energy from compressed-domain features. The color space partitioning divides the HSV color space into a few partitions, and a hypothetical binary image is generated for each partition.
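As a minimal sketch of the Figure 1 pipeline, each stage can be treated as a pluggable function; the names and signatures below are illustrative assumptions, not interfaces from the report:

```python
def detect_overlay_text(frames, localize, partition, group_and_verify, is_consistent):
    """Chain the Figure 1 stages over a sequence of decoded frames.

    localize:        frame -> list of ROIs (texture & motion localization)
    partition:       ROI -> list of hypothetical binary images (color partitions)
    group_and_verify: binary image -> list of verified text-line bounding boxes
    is_consistent:   bounding box -> bool (temporal verification)
    """
    candidates = []
    for frame in frames:
        for roi in localize(frame):
            for binary in partition(roi):
                candidates.extend(group_and_verify(binary))
    # Keep only boxes that survive temporal consistency verification.
    return [box for box in candidates if is_consistent(box)]
```

Wiring identity stages through the skeleton simply passes the input through, which makes the data flow between the stages explicit.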

Character block grouping and layout analysis cluster the character-like blocks into text regions, and layout analysis is performed to verify whether the text regions are able to form text lines. Temporal consistency analysis is then conducted to eliminate false alarms that stay in the video frames for too short a duration. The report is organized as follows: Section 2 describes the localization algorithm to detect the regions of interest. Section 3 describes hypothetical binary image generation using color space partitioning. Section 4 describes a rule-based approach for character block grouping and verification by layout analysis. Section 5 presents the temporal verification procedure to eliminate false alarms.

2. Localization using Compressed Domain Features

A typical size of the video frame is 320x240. Without localization of the interest regions, it is difficult for the detection program to achieve real-time speed. Furthermore, localization using texture or motion features can filter out irrelevant regions that would result in false alarms. However, the capability of the motion-texture based approach to extract the accurate boundary of the text region is usually poor, so an algorithm relying on these features alone may not be suited for accurate text detection. The problem becomes more severe if the background is cluttered.

2.1 Texture Energy and Motion Energy

Texture features may be the most widely used features for videotext detection. The intuition behind the texture based approach is that videotext often has high contrast against its background and sharp stroke edges. These features make the text line hold high energy in the high-frequency band of the Fourier spectrum. For many videos, such as news and sports video, texture features alone are able to detect most of the text regions in the video frames, since text in these videos is deliberately rendered with high contrast for the audience. The method used here is similar to that used by Y. Zhong et al. [4].
The texture energy in the horizontal and vertical directions is extracted from the 8x8 DCT coefficient blocks of an MPEG-1 video:

    E_h(x, y) = Σ_{1 ≤ k ≤ 6} |C_{0k}(x, y)|,    E_v(x, y) = Σ_{1 ≤ k ≤ 6} |C_{k0}(x, y)|    (1)

where (x, y) are the coordinates of a DCT block and C_{jk}(x, y) are its DCT coefficients. E_h is called the horizontal Texture Energy Map and E_v the vertical Texture Energy Map. For some video genres, like news and sports, most of the videotext is static. Thus motion features can also be used in text extraction for those static text blocks; this method has been successfully used in [9] for the extraction of sports video score boxes. Here a measurement called Motion Energy (ME) is used to characterize the motion intensity of the regions. ME is the length of the motion vector of each macroblock in a B or P frame. All motion energies of the macroblocks in a B or P frame form the Motion Energy Map (MEM), which exhibits the motion intensity at different locations in a video frame.
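As a rough Python sketch of Eq. (1), together with the joint texture-motion thresholding described in Section 2.2 below; the nearest-neighbour upscaling and all threshold values are illustrative assumptions:

```python
def texture_energy(dct_block):
    """Eq. (1): horizontal/vertical texture energy of one 8x8 DCT block,
    summing |C_0k| (first row) and |C_k0| (first column) for k = 1..6."""
    e_h = sum(abs(dct_block[0][k]) for k in range(1, 7))
    e_v = sum(abs(dct_block[k][0]) for k in range(1, 7))
    return e_h, e_v

def upscale_by_2(m):
    """Nearest-neighbour stand-in for the upscale-by-2 operator U(., 2)
    (the report uses upsampling followed by interpolation)."""
    return [[v for v in row for _ in (0, 1)] for row in m for _ in (0, 1)]

def joint_map(e_h, e_v, e_m, lam1, lam2, lam3):
    """Eq. (2): binary map set where both texture energies are high
    and the upscaled motion energy is low."""
    e_m2 = upscale_by_2(e_m)
    return [[int(e_h[y][x] > lam1 and e_v[y][x] > lam2 and e_m2[y][x] < lam3)
             for x in range(len(e_h[0]))] for y in range(len(e_h))]
```

The joint map can then be cleaned up with morphological filtering before the color-space stage, as Section 2.2 describes.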

2.2 Combining Texture and Motion Energy

The motion energy map is often unstable due to the inaccuracy of the MPEG motion estimation algorithm. Therefore one cannot rely on the features extracted from the MPEG motion vectors alone. We use a joint measurement combining both texture and motion energy. The combination occurs at each I-frame, where the DCT features come from the current I-frame and the motion features come from the latest B or P frame (a temporal morphological filter can be used to stabilize the MEM using multiple B and P frames). Since the motion energy map is extracted from macroblocks, its resolution is half that of the texture maps; in order to be combined with the texture energy maps, it is upscaled to the same size. The combination makes a high joint measurement correspond to high texture energy and low motion energy, thus we have the following equation:

    TM = (E_h > λ_1) ∧ (E_v > λ_2) ∧ (U(E_m, 2) < λ_3)    (2)

where E_m is the motion energy map and U(E_m, 2) is the upscale-by-2 operator, which is actually upsampling followed by interpolation. The constants λ_1, λ_2, λ_3 are the thresholds for the horizontal texture energy, vertical texture energy, and motion energy. The binarized map may have an irregular boundary, but it can be further de-noised using morphological filters. Figure 2 illustrates the texture energy map, motion energy map, and combined measurement.

Figure 2. Localization by Texture-Motion Analysis. (a) Frame with intensive motion (zoomed in). (b) Frame without any motion. From left to right: original image; texture energy map (after thresholding); motion energy map (after negation, upscaling, and thresholding); combined measurement (before morphological filtering).

3. Color Space Partitioning for Hypothetical Binary Image Generation

The idea of color space partitioning is based on the fact that most text overlays are of uniform or near-uniform color. This also means the color distribution of a videotext line is very localized in color space. On the other hand, the background is often rendered with high color contrast to the text overlay. Thus partitioning the color space into a few subregions can separate the color layer of the text overlay from its background. The color space based approach has been used by previous algorithms; for example, A. K. Jain [8] uses color space decomposition to separate the layer of the texts in images. Some approaches also use color segmentation in color space or gray space, such as [2][6]. One problem with color segmentation is that the number of color clusters is hard to determine a priori. Another problem is that texts of very small size may be merged into background objects if the number of color clusters is not selected properly. Here we use a straightforward approach: we first convert the RGB color space into the HSV color space. Afterwards we divide the whole HSV space into n x m x l cubes, which means the H direction is divided into n segments, the S direction into m segments, and the V direction into l segments (segments can be slightly overlapped). Suppose a color vector in color space is denoted as C = (H_C, S_C, V_C). Then a partition can be represented as the color range from C_low to C_high. A binary image is generated for each color space partition as:

    B = (I ≥ C_low) ∧ (I < C_high)    (3)

where I is the given color image and the comparisons are applied pixel-wise to each color component. These binary images will be verified using character block grouping and layout analysis.

4. Character Block Grouping and Layout Analysis

The idea of using layout analysis is to verify the binary text regions in each generated binary image; a text region is identified if one of these binary images contains text. The layout analysis procedure includes three phases: connected component analysis, grouping of the character blocks, and layout verification.
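The color space partitioning of Section 3 and Eq. (3) can be sketched as follows; a non-overlapping partition grid and pixels given as (H, S, V) triples in [0, 1) are illustrative assumptions:

```python
def hsv_partitions(n, m, l):
    """Lower/upper corners of the n x m x l HSV cubes. This sketch uses
    non-overlapping cubes, although the report allows slight overlap."""
    return [((i / n, j / m, k / l),
             ((i + 1) / n, (j + 1) / m, (k + 1) / l))
            for i in range(n) for j in range(m) for k in range(l)]

def binary_image(img, c_low, c_high):
    """Eq. (3): a pixel is set iff its HSV vector lies in [c_low, c_high),
    component-wise. img is a row-major list of (H, S, V) triples."""
    return [[int(all(c_low[c] <= p[c] < c_high[c] for c in range(3)))
             for p in row] for row in img]
```

Each partition yields one hypothetical binary image, and each binary image is then fed independently to the grouping and layout verification stages of Section 4.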
In each binary image, a connected component analysis (CCA) algorithm is first performed to generate the character blocks. A character block is the outermost bounding box of a given binary component found by CCA. Size and shape filters are then applied to eliminate blocks that are too small or too large, and regions with abnormal aspect ratios. The grouping algorithm uses the following rules to cluster two blocks: 1. the blocks should be aligned such that they share the same bottom line; 2. the heights of the blocks should be close to each other; 3. the blocks should be close enough, meaning the nearest distance between the two blocks should not exceed a certain ratio of their average height. The grouping procedure groups the individual blocks into hypothetical text lines. Then a verification procedure is performed on these text lines. We use a simple verification process: the number of characters in a text line should exceed a certain constant.
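The three grouping rules can be sketched as a pairwise predicate over character blocks represented as (x, y, w, h) bounding boxes; the tolerance values below are illustrative assumptions, since the report does not specify them:

```python
def can_group(b1, b2, bottom_tol=2, min_height_ratio=0.7, max_gap_ratio=1.0):
    """Pairwise grouping test for two character blocks (x, y, w, h),
    with y growing downward so y + h is the bottom line."""
    x1, y1, w1, h1 = b1
    x2, y2, w2, h2 = b2
    # Rule 1: (near-)identical bottom lines.
    same_bottom = abs((y1 + h1) - (y2 + h2)) <= bottom_tol
    # Rule 2: similar heights.
    similar_height = min(h1, h2) / max(h1, h2) >= min_height_ratio
    # Rule 3: horizontal gap below a fraction of the average height
    # (gap is negative when the blocks overlap horizontally).
    gap = max(x1, x2) - min(x1 + w1, x2 + w2)
    close_enough = gap <= max_gap_ratio * (h1 + h2) / 2
    return same_bottom and similar_height and close_enough
```

Repeatedly merging pairs that satisfy the predicate produces the hypothetical text lines that the layout verification step then checks.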

The typical value of the constant is 3, because the lengths of most English words are larger than 3. One example of grouping and layout analysis is shown in Figure 3.

Figure 3. Character Block Grouping and Layout Analysis. (1) Original video frame. (2) ROIs after localization. (3) Text detection results using grouping and layout analysis. (4) Character grouping and layout analysis in ROIs.

5. Temporal Consistency Verification

Multiple frame verification filters out detection false alarms using temporal consistency verification of the bounding boxes corresponding to the text line regions. A text line usually stays on the video for a significantly long time, but false alarm regions are usually transient, lasting only one or two I-frames. Temporal consistency verification is performed on the detected bounding boxes. It calculates the overlap ratio of each bounding box in the current I-frame with all bounding boxes in the previous I-frame, and takes the bounding box with the largest overlap ratio as the most likely match (MLM). The overlap ratio is calculated as follows:

    O(r_1, r_2) = 2 a(r_1 ∩ r_2) / (a(r_1) + a(r_2))    (4)

where a(r) is the area of region r. If the overlap ratio with the MLM is significant (larger than a certain constant), then the two bounding boxes are regarded as temporally consistent. N bounding boxes are said to be n-consistent if each of them is temporally consistent with the bounding box in the nearest I-frame across n consecutive I-frames. A videotext line is detected if the text line is n-consistent, where n is larger than a constant; otherwise it is classified as a false alarm.

6. Experiments

The algorithm is tested using the NIST TREC-2002 benchmark, and on caption detection in news video. The TREC Video Track is an open metric-based evaluation effort aiming at promoting communication and progress in the digital video retrieval field. The task of TREC-2002 is to index and retrieve TREC video based on concept detection (including face, indoor, outdoor, text overlay, etc.) and a query-retrieval framework. The benchmark used 23.26 hours of video for the development of concept detectors and 5.02 hours of video as the Feature Test Set (FTS), i.e., for feature extraction evaluation. Testing our method on the 5-hour Feature Validate (FV) set of the TREC-2002 video data yielded 0.4164 average precision, which compares with the 0.3271 average precision of Dorai, Shim, and Bolle's system (DSB system) [6][11], and the 0.5018 average precision of B. Tseng, C. Lin, and J. Smith's fusion system [10], which combines the described system and the DSB system. The average precision of our system on the TREC Test Set is 0.2941, while the DSB system achieves 0.3324 and the fused system [10] achieves 0.4181.¹ Figure 5 shows some text detection results on the TREC-2002 video data.

Figure 5. Overlay Text Detection Examples on TREC-2002 Videos

Another test was conducted on news videos, which include a large amount of overlay text; the detection and recognition of text overlay is very useful for news video retrieval. The text overlay in news videos may appear in various positions, thus the fixed-position assumption used in our sports video application [] does not work very well for general overlay text detection. The experimental data include four news videos: three are US news videos from three different channels, and one is a Taiwan news video. The overall length of the videos is 2.13 hours. The caption styles and fonts on these videos differ from each other. The videos are in MPEG-1 format with 320x240 resolution. The following table shows the detection results. Some of the detection results are shown in Figure 6. Table 1.
Text Overlay Detection in News Video

              US 1 (Ch 7)       US 2 (Ch 11)     US 3 (Ch 2)      TW 1 (TTV)        Average
Recall        97.2% (105/108)   95.7% (90/94)    95.7% (90/94)    98.1% (154/157)   96.9%
Precision     77.2% (105/136)   58.4% (90/154)   62.5% (90/144)   86.3% (154/179)   71.6%

Legend: Ch = Channel, US = United States, TW = Taiwan

The experiments showed that the performance on news video is better than the performance on TREC video. This is because most of the overlay texts in news videos have clean backgrounds and high contrast. Figure 6 shows some examples of text detection results on news video.

¹ Thanks to Dr. Belle L. Tseng and Dr. Ching-Yung Lin for providing the TREC-2002 text overlay testing results.

Figure 6. Examples of Overlay Text Detection Results on Digital News Video

7. Conclusions and Future Work

This report gives a rule-based approach to detect and extract the text lines in video frames. The system includes compressed-domain feature processing to extract regions of interest, color space partitioning, and region grouping with layout analysis; finally, temporal verification is used to enhance stability and reduce false alarms. Future work is to extend the rule-based approach to a probabilistic framework to improve the accuracy of the whole system.

8. Acknowledgement

Thanks to Dr. Belle L. Tseng and Dr. Ching-Yung Lin, IBM Watson Research Center, for helpful discussions and for the testing results on the TREC-2002 video data.

9. References

[1] T. Sato, T. Kanade, E. Hughes, and M. Smith, "Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions," Multimedia Systems, 7:385-394, 1999.
[2] R. Lienhart and W. Effelsberg, "Automatic Text Segmentation and Text Recognition for Video Indexing," Multimedia Systems, 2000.
[3] H. Li, D. Doermann, and O. Kia, "Automatic Text Detection and Tracking in Digital Video," IEEE Transactions on Image Processing, Vol. 9, No. 1, January 2000.
[4] Y. Zhong, H. Zhang, and A. K. Jain, "Automatic Caption Localization in Compressed Video," IEEE Transactions on PAMI, Vol. 22, No. 4, April 2000.
[5] M. Bertini, C. Colombo, and A. Del Bimbo, "Automatic Caption Localization in Videos Using Salient Points," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2001), Tokyo, Japan, August 2001.

[6] J. C. Shim, C. Dorai, and R. Bolle, "Automatic Text Extraction from Video for Content-Based Annotation and Retrieval," in Proc. 14th International Conference on Pattern Recognition, Vol. 1, pp. 618-620, Brisbane, Australia, August 1998.
[7] L. Agnihotri and N. Dimitrova, "Text Detection in Video Segments," in Proc. of the Workshop on Content-Based Access to Image and Video Libraries, pp. 109-113, June 1999.
[8] A. K. Jain and B. Yu, "Automatic Text Location in Images and Video Frames," Pattern Recognition, Vol. 31, No. 12, pp. 2055-2076, 1998.
[9] D. Zhang and S.-F. Chang, "General and Domain-Specific Techniques for Detecting and Recognizing Superimposed Text in Video," in Proceedings of the International Conference on Image Processing, Rochester, New York, USA.
[10] B. L. Tseng, C.-Y. Lin, D. Zhang, and J. R. Smith, "Improved Text Overlay Detection in Videos Using a Fusion-Based Classifier," IBM T.J. Watson Research Center, Yorktown Heights, New York, 2002.
[11] C. Dorai, "Enhancements to Videotext Detection Algorithms, Confidence Measures, and TREC 2002 Performance," IBM T.J. Watson Research Center, Yorktown Heights, New York, September 2002.