A Bayesian Framework for Fusing Multiple Word Knowledge Models in Videotext Recognition


DongQing Zhang and Shih-Fu Chang
Department of Electrical Engineering, Columbia University, New York, NY 10027, USA.
{dqzhang, sfchang}@ee.columbia.edu

Abstract

Videotext recognition is challenging due to low resolution, diverse fonts and styles, and cluttered backgrounds. Past methods enhanced recognition by using multiple frame averaging, image interpolation and lexicon correction, but recognition using multi-modality language models has not been explored. In this paper, we present a formal Bayesian framework for videotext recognition that combines multiple knowledge sources using mixture models, and describe a learning approach based on Expectation-Maximization (EM). In order to handle unseen words, a back-off smoothing approach derived from the Bayesian model is also presented. We built a prototype that fuses the model from closed caption and that from the British National Corpus. The model from closed caption is based on a unique time distance distribution model of videotext words and closed caption words. Our method achieves a significant performance gain, with a word recognition rate of 76.8% and a character recognition rate of 86.7%. The proposed methods also reduce false videotext detection significantly, with a false alarm rate of 8.2% without substantial loss of recall.

Keywords: Videotext recognition, Video OCR, Video indexing, Information Fusing, Multimodal Recognition.

1. Introduction

Videotext recognition is difficult due to low resolution, diverse fonts, sizes, colors and styles, and cluttered backgrounds. There are two categories of videotext in digital videos: overlay text, which is added by video editors, and scene text, which is embedded in real-world objects. Although overlay text and scene text share some common properties, overlay text is in general easier to detect than scene text and is the focus of this paper. A complete videotext recognition system involves both detection and recognition. Videotext detection has been extensively studied in recent years [1,2,3,4], but videotext recognition is much less explored.
Some relevant works in videotext recognition include template matching [1], SVM classifiers [5], and those using document OCR engines [2]. Enhancement schemes have been studied by many researchers, for example temporal averaging of multiple frames [1,4], spatial interpolation [4], font context [6] and word correction by dictionary [1]. But the potential of using language models, especially multimodal models, has not been explored. The most closely related idea is word correction by dictionary using edit distance [1], but such a method works well only when the character recognition error rate is low.

Language models have been widely adopted in speech recognition [7] and handwriting recognition [8]. To construct a language model, one needs text corpora containing a large number of text documents. The problem encountered by videotext recognition is the difficulty of acquiring sufficient data from videos for language model construction. Language models can be created from general linguistic corpora, but they may be inaccurate.

Recognition using multimodality is another way to enhance performance. Today's broadcast videos are usually associated with many text sources, such as closed captions and online web documents. These documents can be used to enhance videotext recognition, since they contain words which are often related to words in the videotext. However, relying solely on an external document source is not sufficient. Taking closed caption as an example, only about 40% to 50% of videotext words can be found in the closed caption. Therefore, there is great promise in combining language models from different sources with different modalities.

This paper addresses this problem by constructing a Bayesian framework to fuse the word knowledge models from multiple sources. The framework is established using mixture models and its training approach is derived from the Expectation-Maximization (EM) algorithm. In order to increase the recognition performance of characters and unseen words, a smoothing scheme is derived to back off the word recognition to the baseline character recognition approach.
To validate the framework in a practical domain, we use the closed captions in videos and a linguistic corpus to extract the multiple word knowledge models. The knowledge model from closed caption is built by learning a unique distribution model of the time distance between the videotext words and their matched counterparts in the closed caption.

The general linguistic knowledge model is extracted from the British National Corpus. We also developed a multiple frame integration technique as a post-processing stage. Besides using multiple frame averaging [1], we explored a multiple frame voting scheme, which first identifies identical text blocks in different frames and then uses a voting process to select the dominant word recognition output among the text blocks. Figure 1 shows our system diagram for videotext recognition fusing multiple word knowledge models.

We evaluate the system on six news videos from three different channels with about 200 videotext words. The experiments showed a 51% improvement in word accuracy when comparing the proposed method with the baseline technique. The combined model also performs better than the individual models by 4.4%. When used as a post-processing step, the word recognition technique plus temporal voting also helps reduce videotext detection false alarms significantly.

The paper is organized as follows: Section 2 briefly describes the pre-processing approaches including detection, binarization and segmentation. Section 3 presents the baseline character recognition system. Section 4 describes the Bayesian framework for word recognition. Section 5 presents a prototype model using closed caption and the British National Corpus. Section 6 describes the experiments and their results.

Figure 1. Flowchart of the proposed videotext recognition system. It fuses word knowledge models from closed caption and linguistic corpus.

2. Pre-processing

We first briefly describe the pre-processing stage, which includes videotext detection, binarization and segmentation. Careful design of these modules is important for the robustness of the later recognition processes.

2.1 Videotext detection

We use the videotext detection algorithm developed in our prior works [9,10] to extract the videotexts from the videos.
The system first computes texture and motion features by using the DCT coefficients and motion vectors directly from the MPEG compressed streams without full decoding. These features are used to detect candidate videotext regions, within each of which color layering is used to generate several hypothetical binary images. Afterwards a grouping process is performed in each binary image to find the character blocks. Finally, a layout analysis process is applied to verify the layout of these character blocks using a rule-based approach.

2.2 Binarization and character segmentation

Binarization and character segmentation are difficult for videotext due to color variation and small spacing between characters. We developed iterative and multi-hypothesis methods to handle these problems.

A fixed threshold value is not suited for videotext binarization, because videotext intensity may show great variations. We developed an iterative threshold selection method that dynamically adjusts the intensity threshold value until broken strokes of characters appear. This idea is similar to that proposed in [11].

The character segmentation method is based on searching for local minima in the vertical projection profile [10]. A segmentation line is identified by thresholding the projection profile. To reduce the recognition errors caused by character segmentation, multiple segmentation hypotheses are used to produce candidate characters. Prior work in [1] searched for the optimal hypothesis using dynamic programming. In our case, since the number of candidate segmentation points is usually small (one to twenty, mostly less than four), an exhaustive search is performed.

Word segmentation is needed to find complete word segments for recognition. To realize this, the median value of the character spacing is first calculated. If the spacing between two characters is larger than two times the median value, the segmentation line is marked as a word boundary.

3. Character recognition

The character recognition step involves feature extraction from a single character and the shaping of character conditional density functions.

3.1 Character feature extraction

The feature set for character recognition includes {Zernike magnitude; direction proportion; 1st-order peripheral features; second-order peripheral features; vertical projection profile; horizontal projection profile}. These features are selected manually from a larger feature set. For Zernike moment features, readers are referred to [12] for a complete description. The description of the other features can be found in [13]. These features lead to an overall dimension of [...].

3.2 Character conditional density function

The character conditional density function is modeled using a Parzen window [14]. One can also use a Gaussian Mixture Model (GMM). However, the GMM has the

overfitting problem when the dimension of the data space is high. Regularization can be introduced to handle this problem, such as using a Bayesian penalty term [14]. However, in our experiments, we found that the Parzen window approach outperforms the regularized GMM.

For the Parzen window, the sample points are generated using a distortion modeling procedure. We apply a variety of geometrical distortions to each standard font image to obtain training samples. Distortions include 4 fonts, 3 aspect ratios, 3 character weights, and 5 sizes. The size variation has little impact on recognition, therefore we average the feature vectors corresponding to different sizes. This leads to 36 sample data for each character (4 fonts x 3 aspect ratios x 3 weights).

A Gaussian kernel is used for the Parzen window method. The density function can be adjusted by changing the variance of the Gaussian kernel. In order to maximize the character recognition performance, the variance of the Gaussian kernel needs to be tuned using training data.

Given a feature vector for a character image, a baseline system for character recognition compares the likelihood values of different characters and selects the character corresponding to the highest likelihood. We will refer to this method as the baseline character recognition method throughout the paper.

4. Bayes word recognition framework

The videotext word recognition problem can be formulated using the Bayesian method, or maximum a posteriori (MAP) recognition, as:

$$\hat{w} = \arg\max_{w} p(w \mid x) = \arg\max_{w} \left[ \log p(x \mid w) + \log p(w) \right] \qquad (1)$$

where x is the word feature vector and w is a candidate word. p(x | w) is called the word observation model, constructed from the character conditional density function. p(w) is called the language model in the speech recognition community. It specifies the prior probability of each word. Here we use not only the linguistic corpus but also models from other sources, such as closed caption, thus we call p(w) the Word Knowledge Model (WKM).

4.1 Word conditional density function

The word observation model is constructed from the single character conditional density function.
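As an illustration of the character conditional density that the word observation model builds on, the following is a minimal sketch of a Parzen-window estimate with a Gaussian kernel and of the baseline decision rule from Section 3.2. The function names, the isotropic kernel with a single width `sigma`, and the toy two-dimensional prototypes are assumptions made for the example; the paper's actual features and kernel variance are not reproduced here.

```python
import math


def parzen_log_likelihood(x, samples, sigma):
    """Log of a Parzen-window density estimate at feature vector x.

    Assumption: an isotropic Gaussian kernel with standard deviation sigma,
    centred on every training sample of one character class.
    """
    d = len(x)
    log_norm = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    log_kernels = []
    for s in samples:
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
        log_kernels.append(log_norm - 0.5 * sq_dist / sigma ** 2)
    # Average the kernels in the log domain (log-mean-exp) for stability.
    m = max(log_kernels)
    return m + math.log(sum(math.exp(k - m) for k in log_kernels) / len(log_kernels))


def baseline_recognize(x, prototypes, sigma):
    """Baseline rule: return the character whose likelihood is highest."""
    return max(prototypes, key=lambda c: parzen_log_likelihood(x, prototypes[c], sigma))


# Hypothetical 2-D prototypes; a real system would hold 36 samples per character.
toy_prototypes = {
    "a": [[0.0, 0.0], [0.2, 0.1]],
    "b": [[1.0, 1.0], [0.9, 1.2]],
}
toy_label = baseline_recognize([0.1, 0.0], toy_prototypes, sigma=0.3)
```

In this sketch, widening or narrowing `sigma` plays the role of tuning the kernel variance on training data.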
Suppose that after segmentation, N character images are segmented from a word image, and the feature vectors of these characters are x_1, x_2, ..., x_N. Then the constructed word feature vector is x = {x_1, x_2, ..., x_N}. The word observation model, denoted by the word likelihood function, is therefore:

$$p(x \mid w) = p(x_1, \ldots, x_N \mid w) = \begin{cases} \prod_{i=1}^{N} p(x_i \mid c_i), & N_w = N \\ 0, & N_w \neq N \end{cases} \qquad (2)$$

where N_w is the number of characters in w, c_i ∈ A is the i-th character of w, and A is the alphabet, which currently includes 62 characters (26 letters in lower and upper case, plus 10 digits).

4.2 Fusing multiple word knowledge models

As discussed earlier, the language model p(w) could be obtained by using a linguistic corpus, but it may be inaccurate due to the limits of the training data. Combining multiple models could remedy this problem by adding other relevant knowledge to the general model. These additional sources can be acquired easily in today's distributed information environment, for example closed caption in the video stream, or online documents on related web sites.

Suppose that we have obtained or learned the language models from different sources. We denote such models as p(w | K_i), where K_i denotes the i-th information source. Each word knowledge model (WKM) covers a subset of all possible words. Suppose the subset covered by the i-th model is S_i. We use a linear combination of these WKMs to form a mixture word knowledge model:

$$p(w) = \sum_{i=1}^{C} \alpha_i \, p(w \mid K_i) \qquad (3)$$

subject to:

$$\sum_{w \in S_i} p(w \mid K_i) = 1 \quad \text{and} \quad \sum_{i=1}^{C} \alpha_i = 1 \qquad (4)$$

where C is the number of sources. The combined model covers all the words that belong to S = ∪_i S_i. To obtain the weighting vector a = {α_1, α_2, ..., α_C}, we use the videotexts in the training set. The data needed for this training is much less than that required for constructing a general language model, due to the small size of the parameter space. The optimal weighting vector should maximize the joint probability of the training set, based on maximum likelihood training, i.e.:

$$\hat{a} = \arg\max_{a} \prod_{w \in T} p(w \mid a), \quad \text{subject to } \sum_{i=1}^{C} \alpha_i = 1 \qquad (5)$$

where T is the training set. Although this is a standard constrained maximization problem, it is difficult to solve in closed form.
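To make equations (1) to (3) concrete, here is a small sketch of MAP word recognition with a mixture prior. It assumes the per-character log-likelihoods have already been computed and are passed in as one dictionary per segmented character; the function names, the toy probabilities and the two toy word lists are hypothetical and serve only as an illustration.

```python
import math


def word_log_likelihood(char_loglik, word):
    """Equation (2) in the log domain.

    char_loglik[i][c] holds log p(x_i | c) for the i-th segmented character.
    A length mismatch gives zero likelihood, i.e. minus infinity in logs.
    """
    if len(word) != len(char_loglik):
        return float("-inf")
    return sum(char_loglik[i].get(c, float("-inf")) for i, c in enumerate(word))


def mixture_prior(word, models, alphas):
    """Equation (3): linear combination of the word knowledge models."""
    return sum(a * m.get(word, 0.0) for a, m in zip(alphas, models))


def map_recognize(char_loglik, models, alphas):
    """Equation (1): maximise log p(x | w) + log p(w) over all covered words."""
    best_word, best_score = None, float("-inf")
    for w in set().union(*models):
        prior = mixture_prior(w, models, alphas)
        if prior <= 0.0:
            continue
        score = word_log_likelihood(char_loglik, w) + math.log(prior)
        if score > best_score:
            best_word, best_score = w, score
    return best_word, best_score


# Toy example: the character evidence slightly prefers "tan",
# but the combined prior favours "ten".
toy_char_loglik = [
    {"t": math.log(0.9), "f": math.log(0.1)},
    {"e": math.log(0.4), "a": math.log(0.6)},
    {"n": math.log(0.8), "m": math.log(0.2)},
]
toy_cc = {"ten": 0.7, "us": 0.3}
toy_bnc = {"tan": 0.2, "ten": 0.3, "the": 0.5}
toy_word, toy_score = map_recognize(toy_char_loglik, [toy_cc, toy_bnc], [0.4, 0.6])
```

The sketch enumerates every word covered by at least one model, which is only practical for small vocabularies; a real system would prune candidates first.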
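However, it can be solved iteratively by using the Expectation-Maximization (EM) method [15]. Writing w_l for the l-th videotext word in the training set, the update equations based on EM are:

$$p(K_i \mid w_l) = \frac{p(w_l \mid K_i)\, \alpha_i^{old}}{\sum_{j=1}^{C} p(w_l \mid K_j)\, \alpha_j^{old}}, \qquad \alpha_i^{new} = \frac{1}{N} \sum_{l=1}^{N} p(K_i \mid w_l) \qquad (6)$$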
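The EM update of equation (6) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the uniform initialisation, the fixed iteration count and the choice to skip training words that no model covers are all assumptions of the example.

```python
def em_mixture_weights(train_words, models, n_iter=50):
    """Estimate the mixture weights alpha_i of equation (3) by EM.

    train_words: list of training videotext words.
    models: list of dicts, each mapping word -> p(word | K_i).
    """
    n_sources = len(models)
    alphas = [1.0 / n_sources] * n_sources  # assumption: uniform start
    for _ in range(n_iter):
        resp_sums = [0.0] * n_sources
        n_used = 0
        for w in train_words:
            joint = [a * m.get(w, 0.0) for a, m in zip(alphas, models)]
            total = sum(joint)
            if total == 0.0:
                # Assumption: words outside every model are skipped here.
                continue
            n_used += 1
            # E-step: posterior p(K_i | w_l) of each source for this word.
            for i in range(n_sources):
                resp_sums[i] += joint[i] / total
        if n_used == 0:
            break
        # M-step: the new weight is the average posterior over the words used.
        alphas = [s / n_used for s in resp_sums]
    return alphas


# Toy check with two disjoint models: the weights follow the word counts.
toy_alphas = em_mixture_weights(["a", "a", "a", "b"], [{"a": 1.0}, {"b": 1.0}])
```

Because only C weights are estimated, a small training set is enough, which matches the remark above about the size of the parameter space.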

In equation (6), N is the number of training videotext words.

4.3 Recognition of unseen word

The combined model is usually unable to cover all videotext words in a video. For instance, in news video, about 5% of videotext words cannot be found in either the closed caption or the linguistic corpus. Directly applying language models to those unseen words may change correctly recognized words and thus increase recognition errors. To handle this problem, we use a method that backs off the word recognition process to character recognition under certain conditions. Such a method has been used in speech recognition [16]. We derive the back-off method based on the Bayesian recognition framework, where the word knowledge model is decomposed as:

$$p(w) = p(w \mid w \in S)\, p(w \in S) + p(w \mid w \notin S)\, p(w \notin S) = p(w \mid w \in S)(1 - c) + p(w \mid w \notin S)\, c \qquad (7)$$

where S denotes the word dictionary covered by the knowledge models, and c is the probability that a videotext word falls outside S. p(w | w ∈ S) is the prior specified by the knowledge model. For an unseen word, this term is zero. p(w | w ∉ S) is a hypothetical distribution over all words that are not in S, whose value is zero for all seen words. Based on these, we get the following back-off condition when considering a candidate word w_S from S:

$$\left[ \log p(x \mid w_S) + \log p(w_S \mid w_S \in S) \right] - \log p(x \mid w_x) < d_S \qquad (8)$$

where w_x is the baseline character recognition output and d_S is the back-off threshold, which can be trained using a straightforward method. Note that back-off is only applied when the baseline recognition result cannot be found in S. The derivation of this equation and the estimation of d_S using the training set are discussed in [17].

4.4 Post-processing

A unique property of videotext compared with document image text is the redundancy of text images: caption text usually stays on the video for a few seconds, resulting in duplicates of the same text blocks in consecutive frames. Prior systems employed temporal averaging to take advantage of such redundancy.
However, we found that although temporal averaging is able to reduce additive noise, it cannot fully avoid the false recognition caused by segmentation errors or by character corruption from background perturbation. Thus we propose a multiple frame voting method that uses the recognition results from each individual frame. To realize this, we group together similar text blocks within a local temporal window. The similarity is measured using the sum of pixel-to-pixel color distances between videotext blocks. The voting process selects the most frequent output among all word recognition outputs in the same group. This algorithm effectively eliminates the false recognition due to erroneous character segmentation.

The above temporal voting process improves not only the word recognition accuracy but also the text detection accuracy. A detection is filtered out as a false alarm if the posterior of a word is lower than a certain threshold before voting, or if the word count of the most frequent word in a group is one.

5. Prototype using closed caption and British National Corpus

We realize a prototype of the proposed framework and algorithms by using the closed caption (CC) stream associated with the video and an external linguistic corpus, the British National Corpus (BNC). The multiple knowledge models under these two sources can be written as:

$$p(w) = \alpha_{cc}\, p(w \mid CC) + \alpha_{BNC}\, p(w \mid BNC) \qquad (9)$$

5.1 Building the word knowledge model from closed caption

For a word drawn from the CC model, its prior is assumed to depend on only two factors: (1) the time distance between the CC word and the videotext (VT) word being recognized, t = t_sw − t_vw, and (2) the part-of-speech (POS) of the CC word, S. Words far from the videotext word have lower prior probabilities. Words of different POS (e.g., verb vs. noun) have different priors of appearing in the videotext. Using CC we can construct a CC wordlist (CCW), which includes all words that occur in CC at least once. If there are multiple instances of a word, CCW only keeps one instance.
Thus we model the following word prior:

$$p(w = w' \mid CC) = \frac{1}{C} \sum_{w_i = w'} p(w = w_i \mid t_i, S_i) \qquad (10)$$

where w' is the word in CCW that may appear multiple times, w_i is the i-th instance of the word w' appearing in the CC stream, t_i and S_i are the time distance and POS of w_i respectively, and C is a normalization constant:

$$C = \sum_{w' \in CCW} \sum_{w_i = w'} p(w = w_i \mid t_i, S_i) \qquad (11)$$

Because training such a model is complicated due to the presence of POS, we use a simplified model: when w_i is a stop word or preposition, the probability is zero, otherwise it depends solely on the time distance. In other words, only non-stop words and non-preposition words are considered in training and recognition. The likelihood function can take various functional forms, which can be determined by comparing the empirical distribution and the estimated distribution using

a chi-square hypothesis test. We used two hypotheses: a Gaussian function and an exponential function. The hypothesis test shows that the exponential function is closer to the empirical distribution. The non-causal exponential time distance density function we adopted is as follows:

$$p(w = w_i \mid t_i, S_i \neq SP) = \frac{1}{2\lambda} e^{-|t_i| / \lambda} \qquad (12)$$

SP denotes a stop word or preposition. This is a double exponential model (DPM). For a causal model, it is straightforward to modify equation (12) by removing the right tail of the DPM. To train the parameter of this model, a standard maximum likelihood approach [15] is applied to the pool of all matched word pairs in CC and videotext. Based on our experiments, λ = [...] provides satisfactory results. According to the CC model, a word that cannot be found in CC is assigned a zero probability p(w | CC).

5.2 Knowledge model from BNC

A word knowledge model is also extracted from the British National Corpus (BNC) [18]. The British National Corpus includes a large number of text documents for training language models. BNC also provides word lists with the usage frequency of each word. We use the written English version of the lists, containing about 200,000 words. The list includes all inflected forms of each word stem as well as their frequencies. In videotext, stop words are rarely used, but they hold the highest frequencies in the BNC word list. In order to get a more accurate distribution function, the word frequencies of these stop words are manually re-assigned to a small value. After these processes, the word knowledge model extracted from the BNC is the normalized version of the word frequency:

$$P(w \mid BNC) = \frac{Freq(w)}{\sum_{w' \in BNC} Freq(w')} \qquad (13)$$

There are spelling differences between British and American English [19]. However, in the BNC word list, both British and American spellings are included [18] for most words. In experiments, we confirmed that spelling differences did not result in performance degradation.

6. Experiments

Our experiment data include six news videos from three channels, broadcast on different days. The videos include different stories and different fonts and intensity values of videotext.
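For reference when reading the setup below, the two priors of Section 5, whose parameters are estimated in these experiments, can be sketched as follows. This is an illustrative sketch only: the function names, the `(word, time)` representation of the closed caption stream, the toy stop-word set and the value of `lam` are assumptions, not values taken from the paper.

```python
import math


def double_exponential(t, lam):
    """Equation (12): non-causal double exponential density over time distance t."""
    return math.exp(-abs(t) / lam) / (2.0 * lam)


def cc_prior(cc_instances, t_videotext, lam, stop_words=frozenset()):
    """Equations (10) and (11): prior over closed caption words.

    cc_instances: list of (word, time) pairs from the CC stream.
    Stop words and prepositions get zero probability (simplified model).
    """
    scores = {}
    for word, t_cc in cc_instances:
        if word in stop_words:
            continue
        # Sum the density over every instance of the same word.
        scores[word] = scores.get(word, 0.0) + double_exponential(t_cc - t_videotext, lam)
    total = sum(scores.values())  # normalisation constant C of equation (11)
    return {w: s / total for w, s in scores.items()} if total > 0.0 else {}


def bnc_prior(freq):
    """Equation (13): normalised word frequencies."""
    total = sum(freq.values())
    return {w: f / total for w, f in freq.items()}


# Hypothetical caption fragment with times in seconds.
toy_cc_stream = [("senate", 10.0), ("the", 10.5), ("vote", 30.0), ("senate", 12.0)]
toy_cc_prior = cc_prior(toy_cc_stream, t_videotext=11.0, lam=5.0, stop_words={"the"})
toy_bnc_prior = bnc_prior({"senate": 3, "vote": 1})
```

A causal variant would simply return zero from `double_exponential` on one side of t = 0 and renormalise.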
The format of the videos is MPEG-1 with SIF resolution (352x240 pixels). The overall duration of the test set is about six hours.

A cross-validation process is used to evaluate the algorithms. That is, the methods are trained using videos from two channels and tested using the videos from the remaining channel. During the training process, the estimated parameter set includes the parameters of the time distance distribution for the closed caption model, the weighting vectors of the mixture model, and the back-off threshold. The variance of the Gaussian kernel for the Parzen window is also determined empirically using the training set.

In the testing stage, the detection program is first carried out to detect the superimposed text blocks. The overall detection recall rate is 97% and the initial precision rate of detection is 70%. The detected text blocks are then passed to binarization, segmentation, recognition, and post-processing. After word recognition and post-processing, the false detections are filtered, leading to an improved precision rate of 91.8% with a slightly degraded recall rate of 95.6%.

The recognition performance within the correctly detected set is shown in Table 1. Here a word recognition error is counted whenever there are one or more character recognition errors in the word. The improvement in character recognition is large (+9.6%); the improvement in word accuracy is even more significant (+51%).

Table 1. Recognition accuracy

Videos       Char Accuracy       Word Accuracy
             B        K          B        K
w#: [...]    [...]    86.7%      25.8%    76.8%

Legends: B: baseline character recognition, K: knowledge-based recognition, w#: total number of words.

Figure 2 shows some examples of the videotext recognition results, with different types of success and failure grouped together. Under each text image, two recognition results are shown: the left one shows the result using the baseline method, while the right one shows the result using the knowledge-based recognition method combining both BNC and CC models. The one in bold face is the final result selected by our system using the back-off procedure described in Section 4.3.

Table 2. Contribution of CC and BNC

Videos       BNC      CC       CC+BNC   CC Cont
w#: [...]    [...]    48.6%    76.8%    4.4%

Legends: BNC: use BNC only, CC: use CC only, CC+BNC: use both BNC and CC, CC Cont: CC contribution.

We also conducted separate tests to study the individual contributions of each knowledge model. In Table 2, the BNC column shows the performance using the BNC model only, the CC column shows the performance using the closed caption model only, and the CC+BNC column shows the results combining both models. The results show that when used alone, the BNC model is

more effective than the CC model. When they are combined, the CC model adds a 4.4% accuracy improvement on top of the result using the BNC model only. When we further analyzed the data, we found that most of the gain came from the refinement of the word prior probability. Figure 2(c) shows several examples of errors corrected by adding the CC model.

Figure 2. (a) Some results of knowledge models (b) Recognition of videotexts with various styles (c) False recognition corrected by the surrounding CC words (d) Back-off triggered due to unseen words (e) False recognition due to poor segmentation and thresholding.

7. Conclusion

We have developed a Bayesian framework for videotext recognition, in which the prior probabilities of words are estimated by combining multiple word knowledge models. Our current prototype uses synchronized closed caption and a linguistic corpus, the British National Corpus, as knowledge models. We used an EM-based method for learning the fusing model. We have also developed a back-off process to handle unseen words in the model. To estimate the priors for words in the closed captions, we used an effective statistical model that takes into account the time distances from the closed caption words to the videotext. The experiments show that such a multimodality knowledge fusing method results in a significant performance gain. When word recognition and temporal voting are combined in a post-processing stage, false detections from text detection are also significantly reduced.

8. References

[1] T. Sato, T. Kanade, E. Hughes, and M. Smith, "Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions", Multimedia Systems, 7, 1999.
[2] R. Lienhart, W. Effelsberg, "Automatic text segmentation and text recognition for video indexing", Multimedia Systems.
[3] J.C. Shim, C. Dorai and R. Bolle, "Automatic Text Extraction from Video for Content-Based Annotation and Retrieval", Proc. 14th International Conference on Pattern Recognition, Brisbane, Australia, August 1998.
[4] H. Li, D. Doermann, O. Kia, "Automatic text detection and tracking in digital video", IEEE Trans. on Image Processing, Vol. 9, No. 1, January.
[5] C. Dorai, H. Aradhye, J.C. Shim, "End-to-End Videotext Recognition for Multimedia Content Analysis", IEEE Conference on Multimedia and Exhibition (ICME 2001).
[6] H. Aradhye, C. Dorai, J.C. Shim, "Study of Embedded Font Context and Kernel Space Methods for Improved Videotext Recognition", IEEE International Conference on Image Processing (ICIP 2001).
[7] B. Gold, N. Morgan, Speech and Audio Processing, John Wiley & Sons, Inc. (1999).
[8] R. Plamondon, S.N. Srihari, "On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey", IEEE Trans. on PAMI, Vol. 22, No. 1, January.
[9] D. Zhang and S.F. Chang, "Accurate Overlay Text Extraction for Digital Video Analysis", Columbia University ADVENT Group Technical Report 2003 #005.
[10] D. Zhang, R.K. Rajendran and S.F. Chang, "General and Domain-specific Techniques for Detecting and Recognizing Superimposed Text in Video", Proceedings of International Conference on Image Processing, Rochester, New York, USA.
[11] T. Ridler, S. Calvard, "Picture Thresholding Using an Iterative Selection Method", IEEE Transactions on Systems, Man and Cybernetics, August 1978.
[12] A. Khotanzad and Y.H. Hong, "Invariant Image Recognition by Zernike Moments", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 5, May 1990.
[13] R. Romero, D. Touretzky, and R.H. Thibadeau, "Optical Chinese Character Recognition Using Probabilistic Neural Networks", CMU Technical Report.
[14] D. Ormoneit and V. Tresp, "Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging", in Advances in Neural Information Processing Systems, volume 8, The MIT Press, 1996.
[15] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley-Interscience, New York, NY, 2nd ed.
[16] S.M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer", IEEE Trans. on Acoustics, Speech and Signal Processing, 35(3), 1987.
[17] D. Zhang, S.F. Chang, "A Multi-Model Bayesian Framework for Videotext Recognition", ADVENT Technical Report 2003, Columbia University.
[18] British National Corpus, Web homepage.
[19] Dictionary of American and British Us(e)age.
