Segmentation Based Online Word Recognition: A Conditional Random Field Driven Beam Search Strategy


2013 12th International Conference on Document Analysis and Recognition

Arti Shivram 1, Bilan Zhu 2, Srirangaraj Setlur 1, Masaki Nakagawa 2 and Venu Govindaraju 1
1 Center for Unified Biometrics and Sensors, Department of Computer Science and Engineering, University at Buffalo, NY
2 Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology, Tokyo, Japan
{ashivram, setlur, govind}@buffalo.edu, {zhubilan, nakagawa}@cc.tuat.ac.jp

Abstract - In this paper we undertake recognition of online unconstrained cursive handwritten English words. In contrast to popular dynamic programming or HMM-based approaches, we propose a Conditional Random Field (CRF) driven beam search strategy applied in a combined segmentation-and-recognition framework. First, a candidate segmentation lattice is built using over-segmented primitives of the word patterns. Recognition is accomplished by synchronously matching lexicon words with nodes of the lattice. Probable search paths are evaluated by integrating character recognition scores with geometric and spatial characteristics of the handwritten segments in a CRF (conditional random field) model. To make computation efficient, we use beam search to prune the set of likely search paths. The overall system has been benchmarked on a new publicly available dataset, IBM_UB_1, as well as on the existing UNIPEN dataset for comparison.

Keywords - recognition; trie-lexicon; beam search; Conditional Random Field; online; cursive; unconstrained handwriting

I. INTRODUCTION

The pervasiveness of mobile touch screen computing devices like tablets and smartphones has led to a push towards more fluid interaction with these electronics. A natural way of entering text is by writing. This has ushered in the development of applications that seek to recognize electronic handwritten content.
Google's recently launched Google Handwrite feature facilitates handwritten search queries [1]; also, the German automotive manufacturer Audi has equipped a few of their models with handwriting-supportive on-board computers to take in user instructions as an alternative to pressing buttons [1]. An important issue with such applications is that the technology should be robust to recognizing handwriting with all the intrinsic stylistic issues and noise peculiar to a person's writing, rather than requiring the user to modify his/her writing style to accommodate the system. Neither should it require person-specific, extensive initial training. Only then can it be looked at as a natural user experience. Hence, creating a writer-independent, real-time, online, cursive handwriting recognition system is necessary. This forms the main goal of the current research. The system we describe in this paper requires minimal preprocessing of data for stylistic normalizations and adopts a Conditional Random Field driven beam search algorithm that, in turn, uses a trie structure to efficiently search through lexicon entries. We build and test this system on a new publicly available dataset, namely, IBM_UB_1 [2]. This data was collected on the CrossPad device, which was modeled as a notepad where users wrote on actual paper using a special pen. Thus, samples in this dataset reflect the variety and complexity involved in real-life handwriting.

II. MOTIVATION

When dealing with automatic recognition of writing, a number of issues arise that vary in complexity and nature depending on the mode of data capture (online or offline), pattern of writing (discrete, cursive, mixed), representation of data to the recognizer (characters, sub-characters, word-level features etc.) and the underlying objective of the recognition task [3]. In the online mode, the main objective is to recognize the word as the user writes it [4], i.e., real-time recognition. This requires fast processing algorithms.
With this objective in mind we have taken two design decisions: (1) we build our lexicon using a trie structure that speeds up processing considerably in comparison to a flat structure, as shown in [5], and (2) we use a beam search algorithm that prunes out lexicon paths that are not likely to yield good results. This is in contrast to past research approaches which rely on performing an exhaustive search of all words in the lexicon. Examples of such approaches include the HMM-based models in [6] and [7], as also in the work of Graves et al. [8], who use a recurrent neural network (RNN) to recognize unconstrained English words in sentences. With regard to data representation, there are two broad approaches to recognizing written words. One, which is often called the analytical approach, treats a word as a sequence of sub-units (such as characters, strokes etc.). It first breaks the word down into basic units, analyzes them, and ties them together [9]. For example, in the HMM-based systems of Hu et al. [7] and Liwicki et al. [6], each character is modeled as a separate HMM and lexicon word HMMs are constructed by concatenating character HMMs. In contrast, the other approach treats a word as a whole indivisible unit and aims to recognize it using characteristics of the entire word. This is referred to as the holistic or word-based approach [9]. The main advantage of an analytical approach is that a large

number of words can be modeled using a finite set of sub-units such as characters, and therefore, we opt for an over-segmentation based algorithm that divides each word into primitive segments (a character or part of a character) that are later merged in an integrated segmentation-and-recognition framework. However, breaking apart a word into candidate character segments poses significant challenges. This is especially the case for words written either entirely in cursive or in a mixed style [4]. Thus, in order to better identify characters in the word pattern, we use a model-driven (Conditional Random Field) approach that combines shape and geometric features of likely primitives to construct possible character segments and simultaneously recognize the word. To the best of our knowledge, the only other CRF implementation for English word recognition pertains to offline handwriting [10]. Shetty et al.'s approach also significantly differs from ours in that they employ dynamic programming as opposed to our trie-lexicon based beam search. Moreover, our lexicon is over an order of magnitude larger in comparison (5000 vs. 300). Following our motivation and design choices, we now explain the recognition system in detail below.

III. PREPROCESSING AND OVER-SEGMENTATION

Given a word, we first normalize it to a predetermined height and extract feature points using the method described in [11]. The feature points are extracted such that the overall shape of the word pattern as well as the pen-down and pen-up points of each stroke are retained, while simultaneously down-sampling the number of online coordinate points substantially. Subsequently, we remove delayed strokes and detect the baseline and corpus line of the word using regression lines that approximate the local minima and maxima of the stroke trajectory, as described in [5] (Figure 1). We then set the pen-down, pen-up and the minima and maxima points as candidate segmentation points P = {p_1, p_2, p_3, ..., p_g}, which are used to over-segment the word into primitives.
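As a rough illustration of this over-segmentation step, the sketch below collects candidate segmentation points from raw pen trajectories (pen-down/pen-up points plus local vertical extrema). It is a simplified stand-in for the paper's pipeline: the function and variable names are ours, and feature-point resampling, delayed-stroke removal and baseline detection are omitted.

```python
def candidate_segmentation_points(strokes):
    """Collect candidate segmentation points from an online word pattern.

    `strokes` is a list of strokes, each a list of (x, y) pen coordinates.
    Pen-down/pen-up points and local extrema of the vertical coordinate of
    each trajectory are taken as candidates, loosely following the scheme
    in the text.
    """
    points = []
    for stroke in strokes:
        points.append(stroke[0])            # pen-down point
        for prev, cur, nxt in zip(stroke, stroke[1:], stroke[2:]):
            # local extremum: the vertical direction of travel reverses here
            if (cur[1] - prev[1]) * (nxt[1] - cur[1]) < 0:
                points.append(cur)
        points.append(stroke[-1])           # pen-up point
    return points


# A single zig-zag stroke: every interior point is a vertical extremum.
strokes = [[(0, 5), (1, 2), (2, 6), (3, 1), (4, 4)]]
print(candidate_segmentation_points(strokes))
```

Consecutive primitives between such points are what the lattice construction below merges back into candidate character patterns.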
Figure 1: Preprocessing. (a) Original word. (b) Preprocessed word: black dots denote feature points, red dots denote pen-down/pen-up points; blue and green denote maxima and minima respectively.

IV. RECOGNITION SYSTEM

A. Candidate lattice construction

The first module of the recognition system involves creating a candidate lattice for the word to be recognized from the over-segmented primitives. One or more consecutive primitive segments are combined to form a candidate (probable) character pattern. All possible sequence combinations of these candidate character pattern strings are represented in a lattice, where each node in the lattice denotes a candidate character pattern and each edge denotes a segmentation point leading to the next pattern in the sequence. In order to skip ligatures, we define an edge such that each candidate character pattern that exists between segmentation points p_k and p_l (k, l in 1~g) is followed by candidate character patterns that start from segmentation point p_{l+1} (instead of segmentation point p_l). We also do not start a candidate character pattern from a pen-down point and do not end a candidate character pattern at a pen-up point.

Figure 2 shows a segmentation candidate lattice (henceforth referred to as "lattice") for a sample word. Paths in the lattice are assumed to end at a terminal node. For each node in the lattice a vector of possible lengths to the terminal node is calculated. This is done by first setting the length of all candidate nodes containing the final primitive segment to one and then working backwards one level at a time. Thus, for each preceding node, the length vector is calculated by adding one to the lengths of its succeeding nodes in the lattice sequence. This is shown in Figure 2, where the numbers in each node box refer to the possible lengths. This length vector is used in conjunction with the lexicon to prune out unlikely search paths (i.e., lexicon words with lengths different from that of the current lattice path), thereby improving both recognition accuracy and speed.

Figure 2: Segmentation Candidate Lattice

B. Search and Recognition

The second module involves matching sequences in the lattice with entries in the lexicon. For this purpose, the lexicon is set up as a trie [5] (Figure 3). In the lattice, the search space is expanded for each depth (or level) by synchronously matching every node at a particular depth to characters at a similar depth in the trie. This is done by restricting the search to only those word paths in the trie that have the same preceding path sequence as well as the same succeeding path lengths as the lattice node being expanded. Each search path ending at a matched node-character pair (a probable word string) is evaluated according to a path evaluation criterion set by the CRF model detailed in the next section. Here, adopting beam search, a prespecified beam width is used to prune the search paths at every depth level, i.e., for a beam width of n, at every level the paths with the top n scores are retained. All other paths outside of this width are pruned away.
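The backward length-vector computation described above can be sketched as follows. The lattice encoding (`successors`, integer node ids) is our own toy representation, not the paper's:

```python
def length_vectors(successors, terminal_nodes):
    """For every lattice node, compute the set of possible path lengths
    (counted in candidate character patterns) down to a terminal node.

    `successors` maps node id -> list of successor node ids; nodes in
    `terminal_nodes` contain the final primitive segment and get length 1.
    The memoized recursion mirrors the backward, level-by-level pass in
    the text: each node's lengths are its successors' lengths plus one.
    """
    memo = {}

    def lengths(node):
        if node not in memo:
            if node in terminal_nodes:
                memo[node] = {1}
            else:
                memo[node] = {l + 1 for s in successors[node] for l in lengths(s)}
        return memo[node]

    for node in successors:
        lengths(node)
    return memo


# Toy lattice: node 1 may be followed by 2 or 3; 2 leads to 4; 3 and 4 end the word.
vecs = length_vectors({1: [2, 3], 2: [4], 3: [], 4: []}, terminal_nodes={3, 4})
print(vecs[1])  # {2, 3}: paths through node 1 span 2 or 3 characters
```

A lexicon word is then pursued at a lattice node only if the word's remaining length appears in that node's vector, which lets whole trie subtrees be discarded at once.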

We walk through the above described character-synchronous search procedure using a synthetic example (Figures 2, 3 and 4). In Figure 4, d_1 - d_4 represent the depth levels in the search space. Starting with the root node, we observe that lattice nodes (1), (2) and (3) (Figure 2) are expanded synchronously at depth d_1 for possible character matches from the lexicon (Figure 3). At N_1-1, lattice node (1) can be matched to three possible characters: 'n', 'r' and 't'. The case is similar for N_1-2. At N_1-3, since the length vector for node (3) accommodates a length-to-terminal of 5, 4 or 3 and the lexicon paths from it do not match these possible lengths (they possess lengths of 2 or 6), we drop these paths. This forms the expansion for depth d_1, where we have eight pattern-character matches in total. At this stage, all eight paths are evaluated according to the path evaluation criterion and a few top scoring paths are selected while the others are pruned out. The number of selected paths is called the beam width. In our synthetic example the beam width is two, and for node (2), 'r' is selected, whereas for node (3), 'n' is selected (highlighted in red in Figure 4). Similarly, at depth d_2 this process is repeated to expand the search. At depth d_2, following the lattice, we note that for node (2) the possible successor nodes are (11), (12), (13), (14) and (15) (which refer to N_2-1, N_2-2, N_2-3, N_2-4 and N_2-5 respectively), whereas for node (3) the possible successor nodes are (16), (17), (18), (19) and (20) (which refer to N_2-6, N_2-7, N_2-8, N_2-9 and N_2-10 respectively). For N_2-1, the two possible character categories are 'a' and 'o', and both are retained since paths from both satisfy the length requirement of node (11). At N_2-2, the two possible lexicon characters are 'a' and 'o'. Here, only one satisfies the length requirement of node (12) ('o', which has length 3). Hence, the path from 'a' is dropped. Similarly we process the cases for N_2-3 up to N_2-10 for depth d_2. Finally, this expansion is repeated for depths d_3 and d_4.
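The character-synchronous expansion traced in this example can be sketched end-to-end as a toy implementation. Everything here is illustrative: `build_trie` and `beam_search` are our names, the score table stands in for the CRF path evaluation criterion (recast as higher-is-better), and length-vector pruning is left out for brevity.

```python
import heapq

def build_trie(words):
    """Lexicon trie as nested dicts; the key '$' marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def beam_search(successors, start_nodes, terminal_nodes, score, trie, beam_width):
    """Character-synchronous beam search over a segmentation lattice.

    `score(node, ch)` returns a recognition score for reading lattice node
    `node` as character `ch`. At every depth, each surviving
    (lattice node, trie node) pair is expanded synchronously and only the
    top `beam_width` paths are retained.
    """
    # beam entries: (total_score, lattice_node, trie_node, word_so_far)
    beam = [(score(n, ch), n, sub, ch)
            for n in start_nodes for ch, sub in trie.items() if ch != '$']
    results = {}
    while beam:
        beam = heapq.nlargest(beam_width, beam, key=lambda e: e[0])  # prune
        nxt = []
        for total, node, tnode, word in beam:
            if node in terminal_nodes and '$' in tnode:
                # a complete lexicon word aligned with a complete lattice path
                results[word] = max(total, results.get(word, float('-inf')))
            for succ in successors.get(node, []):
                for ch, sub in tnode.items():
                    if ch != '$':
                        nxt.append((total + score(succ, ch), succ, sub, word + ch))
        beam = nxt
    return sorted(results, key=results.get, reverse=True)


# Toy lattice with two candidate character patterns A -> B and a tiny lexicon.
scores = {('A', 'n'): 0.9, ('A', 'o'): 0.2, ('B', 'o'): 0.8, ('B', 'n'): 0.3}
trie = build_trie(['no', 'on', 'not'])
ranked = beam_search({'A': ['B'], 'B': []}, ['A'], {'B'},
                     lambda node, ch: scores.get((node, ch), -1.0),
                     trie, beam_width=2)
print(ranked)  # ['no', 'on']
```

In practice the `score` function would be memoized (e.g. with `functools.lru_cache`), since the same node-character pair recurs across many search paths, as discussed below.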
At d_4, the path ending at node (35) paired with 'w' is carried forward from d_3 because 'w' is a terminal node in the lexicon (as long as this path score falls within the beam width). This procedure yields two candidates, "note" and "new", for recognition.

Figure 3: Lexicon Trie

It may be observed here that a particular candidate character pattern (lattice node) may appear in several search paths (possible word sequences). This would require matching of the same node with possible character matches from the trie for different path evaluations. Evaluating such node-character matches using a character recognizer requires extraction of feature points from the pattern primitives. In order to avoid redundant processing, once a node's features are extracted, we store them. These are then used in subsequent calls. Additionally, for a node and a character category, we store the recognition score at the first recognition instance and use this stored node-character pair score when called upon for subsequent evaluations. In our experiments, these steps greatly improved recognition speed (by about eight times).

Figure 4: Character Synchronous Search

V. CRFS FOR WORD RECOGNITION

A. CRFs on Segmentations

Consider an input word pattern X which is over-segmented into a sequence of primitives using a set of candidate segmentation points P = {p_1, p_2, p_3, ..., p_g}. Relating this to the segmentation lattice described earlier, one can see that a start-to-end path on the lattice is analogous to a probable segmentation sequence S of X. Further, assume that Y refers to the label sequence of S.

Figure 5: Neighborhood graphs of Probable Segmentations

Thus, for a given word, there are many probable segmentations {S_1, S_2, S_3, ..., S_m}, where each S_i = {s_1, s_2, s_3, ..., s_{n_i}} and s_j denotes a candidate character pattern between segmentation points p_k and p_l (k, l in 1~g). For each S_i we construct a neighborhood graph g_i (Figure 5), such that each node denotes a candidate character pattern and each link

represents the relationship between neighboring candidate character patterns. Let (s_{j-1}, s_j) denote a pair of neighboring candidate characters s_{j-1} and s_j. In our CRF framework, s_j and (s_{j-1}, s_j) correspond to unary (single) and binary (pairwise) cliques, respectively. From the definition of a CRF, P(Y|S,X) can be approximated by an arbitrary real-valued energy function E(X, S, Y, Λ) with clique set N and parameters Λ as:

    P(Y|S,X) = exp(-E(X, S, Y, Λ)) / Z(S,X),
    Z(S,X) = Σ_{Y'} exp(-E(X, S, Y', Λ))                                   (1)

Since Z(S,X), the normalization constant, does not depend on Y, it may be ignored if we do not require strict probability values. Then the problem of finding the best label sequence Y*, which involves maximizing P(Y|S,X), becomes equivalent to minimizing the total graph energy:

    Y* = argmax_Y P(Y|S,X) = argmin_Y E(X, S, Y, Λ)                        (2)

Utilizing only unary and binary cliques in our CRF model, the total energy function can be defined as

    E(X, S, Y, Λ) = Σ_{(s_{j-1}, s_j) in S} Σ_{k=1}^{K} λ_k f_k(s_{j-1}, s_j, y_{j-1}, y_j)    (3)

where the f_k(s_{j-1}, s_j, y_{j-1}, y_j) are feature functions on a binary clique (s_{j-1}, s_j), Λ = {λ_k} are weighting parameters, and y_{j-1} and y_j denote the labels of s_{j-1} and s_j, respectively. Without loss of generality, in eq. (3) we use only binary cliques to describe the feature functions, as we assume that the binary cliques (s_{j-1}, s_j) subsume the unary cliques s_j. Here, the total energy function is used to measure the plausibility of a labeling: the smaller E(X, S, Y, Λ) is, the larger P(Y|S,X) will be. In the word recognition task the focus now is on selecting the best path from all possible segmentations {S_1, S_2, S_3, ..., S_m}. Since all paths need not be of uniform length, we need to normalize the path scores. Therefore, we use the following path evaluation criterion to select the best path from all segmentations-cum-recognitions:

    EC(S, Y, X) = E(X, S, Y, Λ) / N(S,X)                                   (4)

where N(S,X) denotes the number of binary cliques (the length of the word).

B. Parameter Learning

We apply the MCE criterion [12], optimized by stochastic gradient descent [13], to find the optimal parameter vector Λ by maximizing the difference between the evaluation criterion of the most confusing segmentation-labeling (S,Y) and that of the correct one:

    L(Λ, X) = σ( EC(S_t, Y_t, X) - min EC_f(S, Y, X) ),
    σ(x) = (1 + e^{-x})^{-1}                                               (5)

where EC(S_t, Y_t, X) and EC_f(S, Y, X) are the evaluation criteria of the true path and of the most confusing path, respectively.

C. Feature Functions

In a CRF, feature functions are used to capture the node attributes (unary) and local dependencies between nodes (binary, in our case). Further, they can be broadly classified into class-relevant/irrelevant on the basis of dependence/independence on the character class. In our model we use two character recognition scores and five geometric feature functions. Of the two character recognition scores for each candidate pattern, one is given by a P2DBMN-MQDF classifier on direction histogram features [14] and the other is given by an MRF classifier [11]. Among the geometric features are three unary features that capture the character structure (such as the number of down strokes), size (height, width) and its relative position in the word (distance from the center line). The binary feature functions measure overlap and positional differences between adjacent character patterns. Though the latter is to be calculated for all possible pairs of characters, we simplify the process by clustering the characters into four superclasses according to the mean vectors of their unary position features. Each of these geometric features is extracted as a feature vector and transformed to a log-likelihood score using quadratic discriminant function (QDF) classifiers. The feature functions are summarized in Table I.

TABLE I. SUMMARY OF FEATURE FUNCTIONS

         Type    Class         Features                                         Classifier
    f_1  Unary   class-rel.    Character shape                                  P2DBMN-MQDF
    f_2  Unary   class-rel.    Character structure                              MRF
    f_3  Unary   class-rel.    Unary geometric (down stroke number, inner gap)  QDF
    f_4  Unary   class-rel.    Unary geometric (character size)                 QDF
    f_5  Unary   class-rel.    Unary geometric (single-character position)      QDF
    f_6  Binary  class-irrel.  Binary geometric (pair-character position)       QDF
    f_7  Binary  class-rel.    Binary geometric (horizontal overlap)            QDF

VI. EXPERIMENTATION

In order to build and empirically validate our model we concurrently utilized two publicly available datasets: IBM_UB_1 and UNIPEN [15]. We chose this concurrent approach in order to overcome the absence of true character segmentation points for words in the IBM_UB_1 dataset. Further, this also helped us benchmark against an already existing and widely adopted handwriting dataset (UNIPEN). IBM_UB_1 consists of a twin-folio structure where each author-specific file consists of a summary-query pair. The summary is a full page of unconstrained cursive writing, whereas the query page consists of approximately 25 handwritten words appearing in the summary counterpart. For this research we used only the data from the query pages, which is at the word-level granularity. Within the UNIPEN dataset we found that a subset of samples in the train_r01_v07 folder were shared across two categories: isolated characters (Benchmark #3) and isolated words (Benchmark #6). Specifically, the "art" folder with 6
writers, "cea" with 6 writers, "ceb" with 4 writers and "ka" with 28 writers. By mapping coordinate information from the character to the word categories for these samples, we were able to obtain word samples with true character-level segmentation labels. Using this data (14,691 characters, 2,127 words, 44 writers), we trained our CRF-based framework to obtain initial (a) weighting parameters and (b) classifiers of the geometric feature functions. With this initialized model and a lexicon of a single word (the ground truth word) we were able to deduce the approximate segmentation for each word in the IBM_UB_1 dataset. The creation of character-level segmentation points for words in IBM_UB_1 represents the first phase of experimentation. In the second phase, we retrained and rebuilt our model using the updated IBM_UB_1 data. For this we selected four pages at random from 20 writers as testing data and used the remainder for training. Due to the data-intensive nature of CRF training, a majority of the data was used for training. The training set consisted of 61,105 words and 355,895 characters across 62 categories (digits, uppercase and lowercase Latin alphabet), while the test set consisted of 1,795 words and 10,987 characters. Further, we used the above retrained model for testing on the UNIPEN subset too, to provide results for comparison (Table II). For both experiments the beam width is set at 1000.

TABLE II. RECOGNITION RESULTS

    Dataset     Lexicon Size (words)   Test Set Size (words)   Recognition Rate
    IBM_UB_1    5000                   1795                    78.72%
    UNIPEN      2127                   2127                    92%

Though we provide recognition rates for the UNIPEN dataset, it must be noted that it consists primarily of laboratory samples and is not reflective of free, unconstrained handwriting from the field. The only other dataset relatively close to IBM_UB_1 is the IAM-OnDB dataset. Here too, since the dataset consists of sentences and not individual words, we are unable to perform appropriate benchmark tests. Extant research using IAM-OnDB has relied upon language models for enhanced recognition performance [6, 8].
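Returning to the model of Section V, the path evaluation criterion and the MCE objective can be sketched numerically as below. The feature functions and weights are toy stand-ins for f_1-f_7 and Λ; only the arithmetic of eqs. (3)-(5) is illustrated, not the paper's actual classifiers.

```python
import math

def path_energy(cliques, feature_fns, lambdas):
    """Total energy E(X, S, Y, Lambda) of one segmentation-labeling path,
    eq. (3): a weighted sum of feature functions over the binary cliques.

    Each clique is a tuple (s_prev, s_cur, y_prev, y_cur); `feature_fns`
    and `lambdas` are illustrative stand-ins for f_1..f_7 and their weights.
    """
    return sum(lam * f(*c) for c in cliques
               for lam, f in zip(lambdas, feature_fns))

def path_criterion(cliques, feature_fns, lambdas):
    """Length-normalized criterion EC = E / N(S, X) of eq. (4); lower is better."""
    return path_energy(cliques, feature_fns, lambdas) / len(cliques)

def mce_loss(ec_true, ec_confusing):
    """Sigmoid MCE loss of eq. (5): small when the true path scores clearly
    better (lower EC) than the most confusing competitor."""
    return 1.0 / (1.0 + math.exp(ec_confusing - ec_true))


# Toy: one geometric-gap feature with weight 0.5 over a two-clique path.
fns = [lambda sp, sc, yp, yc: abs(sc - sp)]
cliques = [(0, 1, 'n', 'o'), (1, 3, 'o', 't')]
print(path_criterion(cliques, fns, [0.5]))  # (0.5*1 + 0.5*2) / 2 = 0.75
```

During training, `mce_loss` would be driven down by stochastic gradient descent over Λ, separating the true path's EC from that of its closest competitor.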
Nevertheless, our model, without the advantage of language models, compares favorably (Table III). A second issue to note is that in [8], which uses an RNN with a language model, the metric used for assessing recognition is similar to string matching using Levenshtein distance. This measure allows for more differences in characters within a word and is thus more forgiving than a simple binary correct/incorrect metric. Our results are presented using the more stringent correct/incorrect metric.

TABLE III. ONLINE UNCONSTRAINED ENGLISH WORD RECOGNITION: COMPARATIVE RESULTS

    Dataset     Data Type        Model       Recognition Rate
    IBM_UB_1    Word Level       CRF Beam    78.72%
    IAM-OnDB    Sentence Level   HMM + LM    70.8%
    IAM-OnDB    Sentence Level   RNN + LM    79%

Recognition results with different combinations of feature functions are summarized below (Table IV). As is evident, the P2DBMN-MQDF and MRF based features and the search length restrictions contribute significantly to overall accuracy.

TABLE IV. EXPERIMENT RESULTS

    Method           f_1      f_2      f_3-f_7   Length restrictions   All feature functions
    Word rec. rate   71.92%   72.53%   77.05%    72.70%                78.72%

VII. CONCLUSION

We explored a CRF-driven beam search method to recognize unconstrained cursive online English words. Combining a trie-lexicon with a character-synchronous lattice search algorithm, we achieve recognition rates that compare favorably to the current state of the art. Our contribution is two-fold: (a) application of a beam search strategy to enable efficient processing of the search space, and (b) merging beam search with a CRF model that combines both feature probability scores and character recognition scores to improve performance. This, we believe, is a promising avenue for future research.

REFERENCES

[1] K. Shabanova, "Do Androids dream of handwriting recognition?", http://www.manufacturing.net/articles/2012/10/do-androids-dream-of-handwriting-recognition.
[2] Center for Unified Biometrics and Sensors, University at Buffalo, "IBM-UB Online and Offline Multi-lingual Handwriting Data Set," 2012.
[3] S. Connell and A. K.
Jain, "Online handwriting recognition using multiple pattern class models," unpublished, 2000.
[4] C. C. Tappert and S.-H. Cha, "English language handwriting recognition interfaces," San Francisco: Morgan Kaufmann, 2007.
[5] S. Jaeger, S. Manke, J. Reichert, and A. Waibel, "Online handwriting recognition: the NPen++ recognizer," International Journal on Document Analysis and Recognition, vol. 3, no. 3, pp. 169-180, 2001.
[6] M. Liwicki and H. Bunke, "HMM-based on-line recognition of handwritten whiteboard notes," Proc. 10th Int. Workshop on Frontiers in Handwriting Recognition, pp. 595-599, 2006.
[7] J. Hu, M. K. Brown, and W. Turin, "HMM based online handwriting recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 10, pp. 1039-1045, 1996.
[8] A. Graves, S. Fernández, M. Liwicki, H. Bunke, and J. Schmidhuber, "Unconstrained online handwriting recognition with recurrent neural networks," Advances in Neural Information Processing Systems, vol. 20, pp. 1-8, 2008.
[9] S. Madhvanath and V. Govindaraju, "The role of holistic paradigms in handwritten word recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 149-164, 2001.
[10] S. Shetty, H. Srinivasan, and S. Srihari, "Handwritten word recognition using conditional random fields," Proc. 9th Int. Conf. on Document Analysis and Recognition, pp. 1098-1102, 2007.
[11] B. Zhu and M. Nakagawa, "A MRF model with parameter optimization by CRF for on-line recognition of handwritten Japanese characters," Proc. Document Recognition and Retrieval XVIII, pp. 1-10, 2011.
[12] B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Trans. Signal Processing, vol. 40, no. 12, pp. 3043-3054, 1992.
[13] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Stat., vol. 22, pp. 400-407, 1951.
[14] C.-L. Liu and X.-D. Zhou, "Online Japanese character recognition using trajectory-based normalization and direction feature extraction," Proc. 10th Int. Workshop on Frontiers in Handwriting Recognition, pp. 217-222, 2006.
[15] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S.
Janet, "UNIPEN project of on-line data exchange and recognizer benchmarks," Proc. 12th IAPR Int. Conf. on Pattern Recognition, Computer Vision & Image Processing, vol. 2, pp. 29-33, 1994.