3rd Asian Conference on Pattern Recognition (ACPR 2015), November 3-6, 2015, Kuala Lumpur, Malaysia.

Online Handwritten Cursive Word Recognition Using Segmentation-free and Segmentation-based Methods

Bilan Zhu(a), Arti Shivram(b), Venu Govindaraju(b) and Masaki Nakagawa(a)
(a) Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology, Tokyo, Japan
(b) Center for Unified Biometrics and Sensors, University at Buffalo, Buffalo, US
{ashivram, govind}@buffalo.edu
{zhubilan, nakagawa}@cc.tuat.ac.jp

Abstract

This paper describes a comparison between online handwritten cursive word recognition using a segmentation-free method and that using a segmentation-based method. To search for the optimal segmentation and recognition path as the recognition result, we attempt two methods: segmentation-free and segmentation-based, where we expand the search space using a character-synchronous beam search strategy. The probable search paths are evaluated by integrating character recognition scores with geometric characteristics of the character patterns in a Conditional Random Field (CRF) model. Our methods restrict the search paths using the trie lexicon of words and the preceding paths during path search. We show this comparison on a publicly available dataset (IAM-OnDB).

1. Introduction

The development of pen-based or touch-based input devices such as tablets and smartphones has led to a push towards more fluid interactions with these electronics. Realizing online handwritten character recognition with high performance is vital, especially for applications such as the natural input of text on smartphones, to provide a satisfactory user experience. Without character recognition cues, characters cannot be segmented unambiguously, because segmentation points between characters are not obvious. A feasible way to overcome the ambiguity of segmentation is called integrated segmentation and recognition, which is classified into segmentation-free and segmentation-based methods.
The segmentation-based method attempts to split cursive words into character patterns at their true boundaries and label the split character patterns. A word pattern is over-segmented into primitive segments such that each segment comprises a single character or part of a character. The segments are combined to generate candidate character patterns (forming a candidate lattice), which are evaluated using character recognition, incorporating geometric and linguistic contexts. On the other hand, the segmentation-free method avoids the problems associated with segmentation: it utilizes the abilities of one-dimensional structural models such as HMMs or MRFs, concatenating character models to construct word models based on the provided lexicon of words during recognition, and erases and selects segmentation points using the character models. Online handwritten word recognition using a recurrent neural network [1] performs word recognition by continuously moving the input window of the network across the frame sequence representation of a word, thus generating activation traces at the output of the network. These output traces are subsequently examined to determine the ASCII string(s) best representing the word image. Which of the segmentation-free and segmentation-based methods is better for handwritten string recognition has attracted much attention and discussion. In this paper, we attempt the two methods and compare them for online handwritten cursive word recognition. We try the two methods (segmentation-free and segmentation-based) to search for the optimal path as the recognition result. We expand the search space using a character-synchronous beam search strategy, and the probable search paths are evaluated by a path evaluation criterion in a CRF model. To evaluate character patterns, we combine an MRF model [2] with a P2DBMN-MQDF (pseudo 2D bi-moment normalization - modified quadratic discriminant function) recognizer [3]. The rest of this paper is organized as follows: Section 2 begins with the description of the preprocessing steps.
Section 3 describes the CRFs for our word recognition, Section 4 presents the two search methods (segmentation-free and segmentation-based), Section 5 presents the experimental results, and Section 6 draws our conclusions.

2. Preprocessing

Given a word, we compute baselines using linear regression lines that approximate the local minima or the local maxima. Then, we normalize the different slants of
words, by shearing every word according to its slant angle. After transforming every word to a given corpus height (the distance between the baseline and the corpus line), we re-sample the trajectory points to ensure that pen-tip coordinates are equidistant, using linear interpolation. Then, smoothing is accomplished by applying a Gaussian filter to each coordinate point in the sequence. Critical points such as start and end points of strokes and points of high curvature are detected and included in the re-sampled/smoothed trajectory output. We also remove delayed strokes (e.g., crosses on 't's, dots on 'i's, etc.). Finally, we extract feature points using the method developed by Ramer [2]. The start and end points of every stroke are picked up as feature points. Then, the most distant point from the straight line between adjacent feature points is selected as a feature point if its distance to the straight line is greater than a threshold value. This selection is done recursively until no more feature points are selected.

Figure 1: Extracting feature points and neighborhood graphs.

3. Word recognition CRFs

3.1. CRFs

Shivram et al. [4] have proposed a CRF model for online handwritten word recognition, where the CRF model was constructed from over-segmented points. Here we apply a similar model, but formulate it based on feature points so as to apply the segmentation-free method. Consider an input word pattern X which has a sequence of feature points F = {f1, f2, f3, ..., fg} as shown in Fig. 1. There are many probable segmentations {S1, S2, S3, ..., Sm}, and each Si = {s1, s2, s3, ..., sni}, where sj denotes a candidate character pattern between feature points pk and pl (k, l = 1~g). For each Si we construct a neighborhood graph as shown in Fig. 1, where each node denotes a candidate character pattern and each link represents the relationship between neighboring candidate character patterns. Let (sj-1, sj) denote a pair of neighboring candidate characters sj-1 and sj. sj and (sj-1, sj) correspond to unary (single) and binary (pair-wise) cliques, respectively.
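The recursive feature-point selection described in Section 2 can be sketched as follows. This is a minimal illustration under our own assumptions: function names, the representation of a stroke as a list of (x, y) tuples, and the threshold value are ours, not the paper's.

```python
import math

def _point_line_distance(p, a, b):
    """Perpendicular distance from point p to the straight line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:  # a and b coincide
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / norm

def extract_feature_points(stroke, threshold=2.0):
    """Ramer-style selection: keep the point most distant from the line
    between two adjacent feature points, recursing while that distance
    exceeds the threshold (cf. Section 2)."""
    if len(stroke) < 3:
        return list(stroke)
    dists = [_point_line_distance(p, stroke[0], stroke[-1])
             for p in stroke[1:-1]]
    idx = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[idx - 1] <= threshold:
        return [stroke[0], stroke[-1]]  # only the endpoints survive
    left = extract_feature_points(stroke[:idx + 1], threshold)
    right = extract_feature_points(stroke[idx:], threshold)
    return left[:-1] + right  # avoid duplicating the split point
```

For a nearly straight stroke this returns only the two endpoints; points of high curvature survive as feature points.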
Assuming that Y refers to the label sequence of S, P(Y|S,X) can be approximated by a real-valued energy function E(X,S,Y,Λ) with clique set N and parameters Λ:

  P(Y|S,X) = exp(-E(X,S,Y,Λ)) / Z(S,X)
  Z(S,X) = Σ_{Y'} exp(-E(X,S,Y',Λ))    (1)

Z(S,X), the normalization constant, may be ignored if we do not require strict probability values. Then, the problem of finding the best label Y, which involves maximizing P(Y|S,X), becomes equivalent to minimizing the total graph energy:

  Y* = argmax_Y P(Y|S,X) = argmin_Y E(X,S,Y,Λ)    (2)

To maximize efficiency we utilize only unary and binary cliques. Thus, the total energy function is defined as

  E(X,S,Y,Λ) = Σ_{(sj-1,sj)∈S} Σ_k λk fk(sj-1, sj, yj-1, yj)    (3)

where fk(sj-1, sj, yj-1, yj) are feature functions on a binary clique (sj-1, sj), Λ = {λk} are weighting parameters, and yj-1 and yj denote the labels of sj-1 and sj, respectively. In eq. (3) we consider only binary cliques (sj-1, sj) since they subsume the feature functions of unary cliques sj. The total energy function is used to measure the plausibility of Y: the smaller E(X,S,Y,Λ) is, the larger P(Y|S,X) will be. For all segmentations {S1, S2, S3, ..., Sm}, we select the best path. Denoting the number of feature points on S by N(S,X), we use the following path evaluation criterion to evaluate all segmentation and recognition paths:

  EC(S,Y,X) = E(X,S,Y,Λ) / N(S,X)    (4)

N(S,X) is the same for all segmentation hypotheses over the total path length, but it differs while evaluating partial paths in a beam search. Therefore, it is necessary to normalize the path scores by N(S,X) when evaluating partial paths in a beam search. When using a Viterbi search such as dynamic programming (DP), this would make the model analogous to a semi-CRF framework as in [5].

3.2. Feature functions

The feature functions are unary if defined on single cliques, and binary if defined on pair-wise cliques. On the other hand, the feature functions are class-relevant if dependent on a character class (or class pair); otherwise they are class-irrelevant.
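To make eqs. (3) and (4) concrete, the following sketch scores a candidate path and prunes partial paths under the length-normalized criterion. The data layout (per-clique feature-function values as lists, partial paths as dicts with an accumulated energy and a consumed feature-point count) is a hypothetical representation of ours, not the authors' implementation.

```python
def path_energy(clique_features, weights):
    """E(X,S,Y,L) of eq. (3): weighted sum of feature-function values
    over all binary cliques (s_{j-1}, s_j) on one path."""
    return sum(lam * f
               for feats in clique_features
               for lam, f in zip(weights, feats))

def prune_partial_paths(paths, beam_band):
    """Beam pruning with the criterion EC = E / N(S, X) of eq. (4).
    Dividing by the number of consumed feature points makes partial
    paths of different lengths comparable; lower EC is better."""
    return sorted(paths, key=lambda p: p["energy"] / p["n_points"])[:beam_band]
```

For example, `prune_partial_paths(paths, 2)` keeps the two partial paths with the smallest normalized energy, which is exactly the beam-band selection used in Section 4.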
So, there are four types of feature functions in total: unary class-relevant, unary class-irrelevant, binary class-relevant and binary class-irrelevant. There are two output scores of character recognizers on candidate character patterns (two unary class-relevant feature functions): a P2DBMN-MQDF classifier on direction histogram features [3] and an MRF classifier [2]. The P2DBMN-MQDF classifier evaluates the shapes of character patterns while the MRF classifier evaluates the structures of character patterns. The extracted feature points (Section 2) are normalized so that the minimum X coordinate is 0 while the Y coordinates are unchanged, and are used to construct one-dimensional structural character MRFs to perform recognition. For the P2DBMN-MQDF classifier, the feature vector dimensionality is reduced using Fisher Linear Discriminant Analysis (FLDA). There are few symbol classes in English (78 for our dataset), and when using FLDA the reduced dimensionality must be less than the number of symbol classes. Using too low a feature dimension results in low recognition accuracy. To resolve
this, we create sub-classes for each symbol category by applying k-means clustering, and use these sub-class categories instead of the true class information. Using sub-classes can also better express the distribution of symbols, resulting in improved accuracies. We input the original re-sampled and smoothed points (Section 2) for each candidate character pattern into the P2DBMN-MQDF recognizer to obtain a recognition score. We use three unary class-relevant feature functions to evaluate the node attributes (character inner structure, character size, single-character position), and use a binary class-relevant feature function and a binary class-irrelevant feature function to capture the dependencies between nodes (pair-character position, horizontal overlap). We use log-likelihood scores from quadratic discriminant function (QDF) classifiers for them.

3.3. Parameter Learning

We train the weighting parameters Λ by a genetic algorithm (GA) using validation data to maximize the recognition rate on that data. We treat each of the weights as an element of a chromosome. For evaluating the fitness of a chromosome, each training word pattern is searched for the optimal path evaluated using the weight values in the chromosome. To save computation, we first set each weight value to 1, and select the top 300 recognition candidates (segmentation-recognition paths) for each training word. We then train the weight parameters by GA using the selected 300 recognition candidates. After some iterations, we use the updated weight values to re-select the top 300 recognition candidates for each training word pattern. We repeat recognition candidate selection three times.

Figure 2: Portion of trie lexicon of words.
Figure 3: Segmentation-free search.
Figure 4: Segmentation candidate lattice.

4. Search methods

To search for the optimal path as the recognition result, we attempt segmentation-free and segmentation-based methods. For the two methods we use a character-synchronous beam search strategy where first a trie lexicon is constructed from a word database, as shown in Fig. 2. When training character MRFs, we can count the range of the number (length) of feature points for each character class. For instance, the length range of feature points for the character 'O' is from 5 to 50, as shown in Fig. 2. Then, the length range of feature points to the terminal for each node of the trie can be calculated according to the length range of feature points for each character class, where the numbers shown in parentheses in each node box are the length range of feature points to the terminal for that node. We can restrict the searched paths by the length ranges, resulting in improved recognition accuracy. We use the length restriction of feature points for the segmentation-free method, while using the length restriction of nodes (characters) to the terminal for the segmentation-based method. For instance, the lengths of nodes to the terminal for the character 'O' are 3, 5 and 7.

4.1. Segmentation-free method

We use the example of the extracted feature points of a word pattern as shown in Fig. 1, and show a search in Fig. 3 to describe our processes. We conduct the search and expansion from the expansion depth d1 to d5. We first search the trie lexicon from its start nodes. In Fig. 2, the start nodes are [O], [p], and [i], and based on these we expand the root node and set its children nodes: N1-1 with a character category [O], N1-2 with a character category [p] and N1-3 with a character category [i], where each child node has a character category C and a start point of a feature sequence. They share the start point f1. For each node, the length from the start point f1 to the terminal point f56 is 56, so the node N1-3 is erased because the length range to the terminal of the trie node [i] is 9-44, and 56 does not satisfy it. For each node of N1-1 and N1-2, we match the feature points from each start point fi to the point fi+M with the states S = {s1, s2, s3, ..., sJ} of the character MRF model of the corresponding C by the Viterbi search, where M is the maximum number of feature points of C, calculated from training patterns. Before matching, we shift the X coordinates of the feature points from fi to fi+M so that the minimum X coordinate is 0. Then we can get MRF paths at the [End] state which correspond to some end points such as f16, f17 and f18 for N1-1. We sort the scores of the MRF paths (note that each score SMRF is normalized by
the number Nfea of the feature points from the start point to the end point of the path, i.e., to SMRF/Nfea), and select the NMRF top MRF paths. In Fig. 3, NMRF is set to three, and three MRF paths (from f1 to f16, from f1 to f17, from f1 to f18) of N1-1, and three MRF paths (from f1 to f12, from f1 to f15, from f1 to f18) of N1-2 are selected; we call them sub-nodes. Node N1-1 has three sub-nodes (N1-1-1, N1-1-2 and N1-1-3) while N1-2 has three sub-nodes (N1-2-1, N1-2-2 and N1-2-3). For all the sub-nodes up to the depth d1, we evaluate all search paths according to the evaluation criterion in eq. (4), sort them, then select only several top paths and erase the others. The number of the selected top paths is called the beam band. In Fig. 3, the beam band is set to two, and two paths ending at N1-1-1 and N1-1-3 are selected for d1. Then, we apply the same method to process the next depth. Finally, the expansion reaches the depth d5. For all nodes up to the depth d5, we evaluate all paths according to the path evaluation criterion, sort them, and then select the optimal path as the recognition result. For each node Nk-j with a character category C and a start point fi, we need to execute MRF matching to decide its end points and sub-nodes. After that, we need to extract features to recognize each sub-node by the P2DBMN-MQDF recognizer. The same pair of C and fi may be required by different nodes. Therefore, for each pair of C and fi, once its end points and sub-nodes are decided and the scores of its sub-nodes are calculated, we store them and reuse them for other nodes from the second time on. For each pair of start and end points, we also store the features for the P2DBMN-MQDF recognizer and reuse them from the second pass. This can greatly improve recognition time. We call this the storage strategy of scores and features (SSSF).

4.2.
Segmentation-based method

The method selects some candidate segmentation points P = {p1, p2, p3, ..., pq} from the feature points F = {f1, f2, f3, ..., fg}, and considers only the selected points for segmentation. We set the pen-up, pen-down and the minima and maxima points as candidate segmentation points, which are used to over-segment the word into primitives. One or more consecutive primitive segments are combined to form a candidate character pattern. All possible combinations of these candidate character patterns are represented in a lattice where each node denotes a candidate character pattern and each edge denotes a segmentation point. Each candidate character pattern that exists between segmentation points pk and pl (k, l = 1~g) is followed by candidate character patterns that start from segmentation point pl. A candidate character pattern does not start from a pen-up point, and does not end at a pen-down point. We also restrict the width of each candidate character pattern to a threshold value.

Figure 5: Segmentation-based search.

Figure 4 shows a segmentation candidate lattice for the word pattern shown in Fig. 1. For each node in the lattice, the possible lengths to the terminal node are calculated. This is done by first setting the length of all terminal nodes to one and then working backwards one level at a time. Thus, for each preceding node, the length vector is calculated by adding one to the lengths of its succeeding nodes. This is shown in Fig. 4, where the numbers in each node box refer to the possible lengths. This length vector is used in conjunction with the lexicon to prune out unlikely search paths, thereby improving both recognition accuracy and speed. We show a search in Fig. 5, where d1 - d5 represent the depth levels in the search space. Starting with the root node, lattice nodes (1) - (6) (Fig. 4) are expanded synchronously at depth d1. At N1-1, lattice node (1) can be matched to three possible characters (O, p, i) from the lexicon (Fig. 2).
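The backward computation of the possible lengths to the terminal can be sketched as follows, assuming the lattice is given as a successor map from node ids to the nodes that follow them; the names `successors` and `terminal_nodes` are our own illustrative choices.

```python
def lengths_to_terminal(successors, terminal_nodes):
    """Possible path lengths (in nodes) from each lattice node to a
    terminal: terminals get {1}; each preceding node adds one to every
    length of its succeeding nodes (cf. Fig. 4)."""
    lengths = {t: {1} for t in terminal_nodes}

    def visit(node):
        if node not in lengths:  # memoize; the lattice is acyclic
            lengths[node] = {l + 1
                             for nxt in successors.get(node, [])
                             for l in visit(nxt)}
        return lengths[node]

    for node in successors:
        visit(node)
    return lengths
```

During the search, a lattice node is then kept only if some word length admitted by the trie lexicon falls inside its length set, which is the pruning described above.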
Since the lengths for node (1) are 6-19, while the lexicon length from 'p' is 4 and that from 'i' is 2, and these do not match the possible lengths, we drop them. The case is similar for N1-2 - N1-6. This forms the expansion for depth d1. At this stage, all paths are evaluated according to the path evaluation criterion in eq. (4), and a few top-scoring paths are selected while the others are pruned out. The number of selected paths is called the beam width. In Fig. 5 the beam width is two, and two paths ending at N1-5-O and N1-6-O are selected for d1. Similarly, at depth d2 this process is repeated to expand the search. Finally, this expansion is repeated until depth d5, yielding one candidate, "Offer", as the recognized word. A candidate character pattern (lattice node) may appear in several search paths. Evaluating node-character matches using a character recognizer requires extraction of features from the character pattern. In order to avoid redundant processing, for each node, once its features are extracted, we store them. These are then used in subsequent calls. Additionally, for a node and a character category, we store the recognition score at the first recognition and use this stored score for subsequent evaluations. We also call this SSSF, similar to the segmentation-free method.

5. Experiments

We evaluate the word recognizers using a publicly available dataset, IAM-OnDB [6], which consists of handwritten sentences acquired from a smart whiteboard. It has 4 disjoint sets: a training set (5,364 lines); two
validation sets (1,438 lines and 1,518 lines); and a test set (3,859 lines). For our isolated word recognition, we generate word-level ground truth using a two-step approach as elaborated in [7], where a few words giving segmentation errors were dropped; however, their number was far too small to make a significant impact on the results. We set the trie lexicon to have only one word (the word string label of the recognized word pattern), used the word recognizer trained in the previous research [4] to recognize each word pattern, and set the recognized segmentation result as the character segmentation labels of the pattern. In this way, we obtained character-level segmentation labels for the word patterns of IAM-OnDB. We used the training data to train the feature function models. Then, we tested the recognition rate on the testing data. Each of these models was trained on 78 symbols. The constructed trie lexicon contains 5,562 words (the actual size of the test set lexicon). The weighting parameters Λ were estimated using the validation sets by GA. The experiments were implemented on an Intel(R) Core(TM) 2 Quad CPU Q9550 @ 2.83 GHz with 4.00 GB memory. Table 1 shows the results. The numbers enclosed in square brackets are the average recognition time for a word.

Table 1: Word recognition rate (%).

  Method      Beam       ALL           Without MRF  Without P2DBMN-MQDF  Without geometric features  Without length restricting
  Seg.-based  300-beam   85.96 [1.0s]  60.63        82.48                83.67                       79.69
  Seg.-based  1000-beam  86.54 [1.2s]  -            -                    -                           -
  Seg.-free   300-beam   85.58 [2.8s]  60.10        82.12                83.11                       80.56
  Seg.-free   1000-beam  86.12 [7.4s]  -            -                    -                           -

From these results, we can see that: (1) Using all feature function models significantly improves the recognition accuracy by combining an MRF model with a P2DBMN-MQDF recognizer. The MRF model applies a structural method that is weak at collecting global character information, but robust against character shape variations. The P2DBMN-MQDF recognizer uses an un-structural method that is robust against noise but weak against character shape variations. By combining them, they compensate for their respective disadvantages.
(2) The length restriction brings higher recognition accuracy. (3) The two methods (segmentation-free and segmentation-based) yielded comparable results, with the segmentation-based method performing a little better. Although both methods use the same path evaluation criterion in a CRF model, their recognition results differ. We consider that the segmentation-based method attempts to select candidate segmentation points at true character boundaries, resulting in reduced confusion and improved recognition speed. The segmentation-free method uses the MRF model to select the NMRF top MRF paths and some end points from a start point for a character category in only one search, resulting in improved processing speed. Therefore, it is possible to significantly improve recognition speed by combining the two methods (segmentation-free and segmentation-based). (4) Our system outperforms a state-of-the-art recurrent neural network (BLSTM), whose recognition rate is 85.3% for the same dataset and lexicon [1]. We also evaluated the SSSF and found that at beam band 300 it led to greatly improved recognition speed (by about eight times for the segmentation-based method, and 89 times for the segmentation-free method). The memory consumption of our recognizer is about 8.5 MB.

6. Conclusion

This paper presented a system for online handwritten English cursive word recognition and compared the segmentation-free and segmentation-based methods under the same path evaluation criterion in a CRF model. Experimental results demonstrate that the segmentation-based method performed better because it attempts to select candidate segmentation points at true character boundaries, resulting in reduced confusion and improved recognition speed. We expect to significantly improve recognition speed by combining the two methods in the future. We have already shown that combining the MRF model with P2DBMN-MQDF achieves a very high recognition rate for online handwritten Japanese text. In this paper, we have shown that this combination is also very effective and successful for online English word recognition.
Acknowledgement

This research is being partially supported by Grant-in-aid for Scientific Research C-15K00225.

References

[1] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. PAMI, 31(5), 855-868, 2009.
[2] B. Zhu and M. Nakagawa. On-line handwritten Japanese characters recognition using a MRF model with parameter optimization by CRF. Proc. 11th ICDAR, 603-607, 2011.
[3] C.-L. Liu and X.-D. Zhou. Online Japanese character recognition using trajectory-based normalization and direction feature extraction. Proc. 10th IWFHR, 217-222, 2006.
[4] A. Shivram, B. Zhu, S. Setlur, M. Nakagawa, and V. Govindaraju. Segmentation based online word recognition: A conditional random field driven beam search strategy. Proc. 12th ICDAR, 852-856, 2013.
[5] X.-D. Zhou, D.-H. Wang, F. Tian, C.-L. Liu, and M. Nakagawa. Handwritten Chinese/Japanese text recognition using semi-Markov conditional random fields. IEEE Trans. PAMI, 35(10), 2413-2426, 2013.
[6] M. Liwicki and H. Bunke. IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. Proc. 8th ICDAR, 956-961, 2005.
[7] C. T. Nguyen, B. Zhu, and M. Nakagawa. A semi-incremental recognition method for on-line handwritten English text. Proc. 14th ICFHR, 234-239, 2014.