Using Mouse Feedback in Computer Assisted Transcription of Handwritten Text Images

2009 10th International Conference on Document Analysis and Recognition

Verónica Romero, Alejandro H. Toselli, Enrique Vidal
Instituto Tecnológico de Informática
Universidad Politécnica de Valencia, Spain
{vromero, ahector, evidal}@iti.upv.es

Abstract

To date, automatic handwriting recognition systems are far from being perfect, and heavy human intervention is often required to check and correct the results of such systems. In order to achieve correct transcriptions, human knowledge can be integrated into the transcription process, following an Interactive Predictive paradigm. We have recently proposed Mouse Actions as a significant feedback information source for the underlying interactive system to improve the productivity of the human transcriptor. In this paper we review this way to interact with the system and report comparative results using the publicly available IAMDB dataset.

1. Introduction

Many documents used every day include handwritten text and, in many cases, it would be interesting to recognize these text images automatically. However, state-of-the-art handwritten text recognition (HTR) systems cannot suppress the need for human work when high quality transcriptions are needed. HTR systems can achieve fairly high accuracy for restricted applications with rather limited vocabulary (reading of postal addresses or bank checks) and/or form-constrained handwriting. However, in the case of unconstrained transcription applications, current HTR technology typically only achieves results which do not meet the quality requirements of practical applications. Therefore, once the full recognition process of one document has finished, heavy human expert revision is required to really produce a transcription of standard quality. Such a post-editing solution is rather inefficient and uncomfortable for the human corrector.
Work supported by the EC (FEDER), the Spanish MEC under the MIPRCV Consolider Ingenio 2010 research programme (CSD2007-00018) and the grant TIN2006-15694-C02-01 by the Universitat Politècnica de València (FPI fellowship 2006-04)

A way of taking advantage of the HTR systems is to combine them with the knowledge of a human transcriptor, constituting the so-called Computer Assisted Transcription of Text Images (CATTI) scenario [8, 5]. In this scenario, the system uses the text image and a previously validated part (prefix) of its transcription to propose a suitable continuation. Then the user finds and corrects the next system error, thereby providing a longer prefix which the system uses to suggest a new, hopefully better continuation. The results obtained show that this system can save significant amounts of human effort.

Mouse Actions (MA) can be used as an additional information source: as soon as the user points to the next system error, the system proposes a new, hopefully more correct, continuation. This way, many explicit user corrections are avoided. Preliminary results of this new kind of interaction were reported on tasks such as the transcription of old documents and spontaneous sentences extracted from survey forms [4]. In this paper we review this way to interact with the system using the MA and report comparative results using the publicly available IAMDB dataset.

2. CATTI Framework

In the CATTI framework, the user is directly involved in the transcription process since he/she is responsible for validating and/or correcting the HTR output. The process starts when the HTR system proposes a full transcription of the input image. Then, the user reads this transcription until finding a mistake; i.e., he or she validates a prefix of the transcription which is error-free. Now, the user enters a word to correct the erroneous text that follows the validated prefix.
This generates a new, extended prefix (the previous validated prefix, plus the user amendments), which is used by the HTR system to attempt a new prediction hypothesis, thereby starting a new cycle that is repeated until a final correct transcription is achieved [8, 5].

978-0-7695-3725-2/09 $25.00 © 2009 IEEE. DOI 10.1109/ICDAR.2009.139

The traditional handwritten text recognition problem can be formulated as the problem of finding a most likely word sequence, $\hat{w}$, for a given handwritten sentence image represented by a feature vector sequence $x$, that is:

$$\hat{w} = \arg\max_{w} \Pr(w \mid x) = \arg\max_{w} \Pr(x \mid w) \cdot \Pr(w) \qquad (1)$$

$\Pr(x \mid w)$ is typically approximated by concatenated character Hidden Markov Models (HMMs) [3] and $\Pr(w)$ is usually approximated by an n-gram word language model [2].

In the CATTI framework, in addition to $x$, a user-validated prefix $p$ of the transcription is available. The HTR should try to complete this prefix by searching for a most likely suffix $\hat{s}$ as:

$$\hat{s} = \arg\max_{s} \Pr(s \mid x, p) = \arg\max_{s} \Pr(x \mid s, p) \cdot \Pr(s \mid p) \qquad (2)$$

Accordingly, the search must be performed over all possible suffixes $s$ of $p$, and the language model probability $\Pr(s \mid p)$ must account for the words that can be written after $p$. As in Eq. (1), $\Pr(x \mid s, p)$ can be modelled by HMMs. On the other hand, to implement the language model constraints involved in $\Pr(s \mid p)$, we can take advantage of the information coming from $p$, as discussed in [8]:

$$\Pr(s \mid p) \simeq \prod_{i=k+1}^{k+n-1} \Pr(w_i \mid w_{i-n+1}^{i-1}) \cdot \prod_{i=k+n}^{l} \Pr(w_i \mid w_{i-n+1}^{i-1}) \qquad (3)$$

where the consolidated prefix is $w_1^k = p$ and $w_{k+1}^l = s$ is a possible suffix. The first term of Eq. (3) accounts for the probability of the $n-1$ words of the suffix conditioned by words from the prefix, and the second one is the usual n-gram probability for the rest of the words in the suffix.

2.1. CATTI using word graphs

The search problem corresponding to Eqs. (2) and (3) can be solved using search techniques based on word graphs. A word graph represents the transcriptions with higher $\Pr(w \mid x)$ of the image sentence. In this case, the word graph is just (a pruned version of) the Viterbi search trellis [2] obtained when transcribing the whole image sentence. During the CATTI process the system makes use of this word graph to complete the prefixes accepted by the user.

A word graph can be represented as a weighted directed acyclic graph, where each edge ($e$) is labeled with a word ($w_e$) and a score ($\mathrm{score}(e)$), and each node ($n$) is labeled with a point (horizontal position) of the handwritten image ($t_n$).
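The word-graph representation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `WordGraph`, the integer node ids (standing in for the horizontal positions $t_n$), the toy words and all scores are hypothetical. The score of a hypothesis is taken as the sum of the log-domain edge scores along a path from the start node to the end node.

```python
from collections import defaultdict


class WordGraph:
    """Weighted DAG whose edges carry a word and a log-domain score.

    Node ids are integers ordered left to right, a stand-in for the
    horizontal image positions t_n (hypothetical simplification).
    """

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.edges = defaultdict(list)  # source node -> [(target, word, score)]

    def add_edge(self, src, dst, word, score):
        if src >= dst:
            raise ValueError("edges must go left to right (acyclic)")
        self.edges[src].append((dst, word, score))

    def best_path(self):
        """Return (score, words) of the highest-scoring start-to-end path."""
        best = {self.start: (0.0, [])}
        # Ascending node order is a topological order because src < dst.
        for node in sorted(self.edges):
            if node not in best:
                continue  # node not reachable from the start node
            score, words = best[node]
            for dst, word, edge_score in self.edges[node]:
                cand = (score + edge_score, words + [word])
                if dst not in best or cand[0] > best[dst][0]:
                    best[dst] = cand
        return best[self.end]


# Toy graph with invented scores, for illustration only
g = WordGraph(start=0, end=3)
g.add_edge(0, 1, "opposed", -1.0)
g.add_edge(1, 2, "this", -2.0)
g.add_edge(1, 2, "the", -2.5)
g.add_edge(2, 3, "Comment", -3.0)
g.add_edge(2, 3, "Government", -3.2)
score, words = g.best_path()
print(score, words)
```

A real system would obtain such a graph by pruning the Viterbi trellis; here the graph is built by hand only to show the data structure and the path-scoring rule.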
For each edge, we denote by $S_e$ and $E_e$ its start and end nodes. The graph has a single start node, which points to the start of the text image, and a single end node.

The score of an edge is computed by multiplying the morphological-lexical probability of the image between its start and end node points, $\Pr(x_{t_{S_e}}^{t_{E_e}} \mid w_e)$, by the language model probability of the given word at the edge, $\Pr(w_e)$. However, in practice, the simple multiplication is modified to balance the absolute values of both probabilities. The most common modification is to use the language weight $\alpha$ and the insertion penalty $\beta$ [1]:

$$\mathrm{score}(e) = \log \Pr(x_{t_{S_e}}^{t_{E_e}} \mid w_e) + \alpha \log \Pr(w_e) + \beta \qquad (4)$$

The word labels of any path from the start node to the end node form a transcription hypothesis, whose score is computed as the sum of the scores of the edges along the path.

As the word graph is a representation of a subset of the possible transcriptions for a source handwritten text image, it may happen that some prefixes given by the user cannot be exactly found in the word graph. To circumvent this problem some heuristics need to be implemented. In this work, we modified the scores associated with each edge in order to cope with the differences between the words in the prefix and the words in the path that best matches the given prefix. This heuristic can be implemented as an error-correcting parsing dynamic programming algorithm. Moreover, this algorithm takes advantage of the incremental way in which the user prefix is generated, parsing only the new suffix appended by the user in the last interaction (see [4]).

The computational cost of this approach is much lower than that of the naïve Viterbi adaptation we had used in previous works. Therefore, using word-graph techniques the system is able to interact with the human transcriptor in a time-efficient way. However, a drawback of this implementation is that some accuracy can be lost.

3. Enriching the CATTI Interaction Process

In CATTI applications the user is repeatedly interacting with the system.
Hence, making the interaction process easy is crucial for the success of the system. In conventional CATTI, before typing a new word in order to correct a hypothesis, the user needs to position the cursor in the place where he wants to type the word. This is done by performing a MA (or equivalent pointer-positioning keystrokes). By doing so, the user is already providing some very useful information to the system: he is validating a prefix up to the position where he positioned the cursor and, in addition, he is signalling that the word located after the cursor is incorrect. Hence, the system can already capture this fact and directly propose a new suitable suffix in which the first word is different from the first word of the previous suffix.

In Fig. 1 we can see an example of such behaviour. As in conventional CATTI, the process starts when the HTR system proposes a full transcription $\hat{s}$ of the input image $x$. Then, the user reads this prediction until a transcription error is found ($e$) and makes a MA ($m$) to position the cursor at this point. This way, the user validates an error-free transcription prefix $p$. Now, before the user introduces a word to correct the erroneous one, the HTR system, taking into account the new prefix $p$ and the wrong word $e$ that follows $p$, suggests a suitable continuation (i.e., a new $\hat{s}$). If the new $\hat{s}$ corrects the erroneous word $e$, a new cycle starts. However, if the new $\hat{s}$ has an error in the same position as the previous one, the user can enter a word, $c$, to correct the erroneous text $e$. This action produces a new prefix $p$ (the previously validated prefix, $p$, followed by $c$). Then, the HTR system takes into account the new prefix to suggest a new suffix and a new cycle starts. This process is repeated until a correct transcription of $x$ is accepted by the user.

    x        (handwritten sentence image)
    INTER-0  p      (empty)
             ŝ      opposed this Comment Bill in that thought
    INTER-1  m      (cursor placed after "opposed")
             p      opposed
             ŝ      these Comment Bill in that thought
             c      the
             p      opposed the
    INTER-2  ŝ      Government Bill in that thought
             m      (cursor placed after "Bill")
             p      opposed the Government Bill
             ŝ      which brought
    FINAL    c      #
             p ≡ t  opposed the Government Bill which brought

Figure 1. Example of CATTI operation using MA.

In Fig. 1 the underlined boldface word in the final transcription ("the") is the only one which was physically corrected by the user. Note that in iteration 1 a (single) MA does not succeed and the correct word needs to be physically typed. However, iteration 2 only needs a MA.

This new kind of interaction need not be restricted to a single pointer-positioning MA. Several scenarios arise, depending on the number of times the user performs a MA. In the simplest one, the user only makes a MA when it is necessary to displace the cursor (single-MA). In this case the MA does not involve any extra human effort, because it is the same action that the user should make in the conventional CATTI to position the cursor before typing the correct word.
Another scenario that can be considered consists in performing a MA systematically before writing, even when the cursor is already in the correct position. In this case, however, there is a cost associated with this kind of MA, since the user does need to perform additional actions, which may or may not be beneficial. This scenario can be easily extended by allowing the user to make several MAs before having to write a correct word him/herself.

Since we have already dealt, in Section 2, with the problem of finding a suitable suffix $\hat{s}$ when the user validates a prefix $p$ and introduces a correct word $c$, we focus now on the problem in which the user only makes a MA. In this case the decoder has to cope with the input image $x$, the validated prefix $p$ and the erroneous word $e$ that follows the validated prefix, in order to search for a transcription suffix $\hat{s}$:

$$\hat{s} = \arg\max_{s} \Pr(s \mid x, p, e) = \arg\max_{s} \Pr(x \mid p, s, e) \cdot \Pr(s \mid p, e) \qquad (5)$$

As in Eq. (1), $\Pr(x \mid p, s, e)$ can be modelled using HMMs. On the other hand, $\Pr(s \mid p, e)$ can be approached by adapting an n-gram language model so as to cope with the validated prefix $p$ and with the erroneous word $e$. The language model presented in Section 2 would provide a model for the probability $\Pr(s \mid p)$, but now the first word of $s$ is conditioned by $e$. Therefore, some changes are needed. Let $p = w_1^k$ be the validated prefix and $s = w_{k+1}^l$ be a possible suffix. Considering that the wrong-recognized word $e$ only affects the first word of the suffix, $w_{k+1}$, $\Pr(s \mid p, e)$ can be computed as:

$$\Pr(s \mid p, e) \simeq \Pr(w_{k+1} \mid w_{k+2-n}^{k}, e) \cdot \prod_{i=k+2}^{k+n-1} \Pr(w_i \mid w_{i-n+1}^{i-1}) \cdot \prod_{i=k+n}^{l} \Pr(w_i \mid w_{i-n+1}^{i-1}) \qquad (6)$$

Now, taking into account that the first word of the possible suffix, $w_{k+1}$, has to be different from the erroneous word $e$, $\Pr(w_{k+1} \mid w_{k+2-n}^{k}, e)$ can be formulated as follows:

$$\Pr(w_{k+1} \mid w_{k+2-n}^{k}, e) = \frac{\delta(w_{k+1}, e) \cdot \Pr(w_{k+1} \mid w_{k+2-n}^{k})}{\sum_{w'} \delta(w', e) \cdot \Pr(w' \mid w_{k+2-n}^{k})} \qquad (7)$$

where $\delta(i, j)$ is 0 when $i = j$ and 1 otherwise.
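The effect of Eq. (7) can be illustrated with a small numerical sketch. The probabilities below are invented for illustration and are not taken from the paper's language model, and the function name `exclude_rejected` is hypothetical. The code only shows the renormalisation step: the rejected word $e$ receives probability zero and the remaining probability mass is rescaled to sum to one.

```python
def exclude_rejected(next_word_probs, rejected):
    """Renormalise a next-word distribution after removing the rejected word.

    next_word_probs: dict mapping word -> Pr(word | history) for a fixed history.
    rejected: the erroneous word e signalled by the mouse action.
    """
    # delta(w, e) is 0 when w == e and 1 otherwise, so e is simply dropped
    kept = {w: p for w, p in next_word_probs.items() if w != rejected}
    total = sum(kept.values())
    if total == 0.0:
        raise ValueError("no probability mass left after removing the rejected word")
    adapted = {w: p / total for w, p in kept.items()}
    adapted[rejected] = 0.0
    return adapted


# Hypothetical Pr(w | "opposed") values, for illustration only
probs = {"this": 0.5, "these": 0.3, "the": 0.2}
adapted = exclude_rejected(probs, "this")
print(adapted)
```

With these made-up numbers, rejecting "this" makes "these" the most probable first word of the new suffix, which mirrors the behaviour shown at iteration 1 of Fig. 1.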

Figure 2. Example of word-graph generated after the user validates the prefix "opposed the Government Bill". The edge corresponding to the wrong-recognized word "in" was disabled.

As in the conventional CATTI, this decoding can be implemented using word-graphs. The restriction entailed by Eq. (7) can be easily implemented by deleting the edge labeled with the word $e$ after the prefix has been matched. An example is shown in Fig. 2.

4. HTR System Overview

The HTR system used here follows a classical architecture composed of three modules: preprocessing, feature extraction and recognition. The first one entails different well-known standard techniques such as skew and slant correction and size normalization. On the other hand, the feature extraction process transforms a preprocessed text line image into a sequence of 60-dimensional feature vectors (see [6]).

The recognition process is based on HMMs. Characters are modelled by continuous density left-to-right HMMs with 6 states and 64 Gaussian mixture components per state. On the other hand, each lexical word is modelled by a Stochastic Finite-State automaton, and text sentences are modelled using bi-grams. All these finite-state models can be integrated into a single global model in which the decoding process is performed using the word-graph obtained by the Viterbi algorithm [2].

5. Experimental Results

In order to test the effectiveness of using MA in the CATTI system, different experiments were carried out. The corpus, the different measures and the obtained experimental results are explained below.

5.1. IAMDB Corpus

This publicly accessible corpus was compiled by the Research Group on Computer Vision and Artificial Intelligence (FKI) at the Institute of Computer Science and Applied Mathematics (IAM). The acquisition was based on the Lancaster-Oslo/Bergen Corpus (LOB). Fig. 1 shows an example of a handwritten sentence image from this corpus. The last released version (3.0) is composed of 1,539 scanned text pages, handwritten by 657 different writers. The database is provided at different segmentation levels: words, lines, sentences and page images.
In our case, the sentence segmentation level is considered (see [9]). The corpus was partitioned into training and test sets. The former is composed of 5,799 text lines which add up to 2,124 sentences, handwritten by 448 different writers, whereas the latter comprises 200 sentences, written by 100 different writers. Table 1 summarizes all this information.

Table 1. Basic statistics of the IAMDB corpus

    Number of:    Training    Test      Total      Lex.
    writers       448         100       548        -
    sentences     2,124       200       2,324      -
    words         42,832      3,957     46,789     8,938
    characters    216,774     20,726    237,500    78

5.2. Assessment Measures

Different evaluation measures have been adopted. On the one hand, the quality of the transcription without any system-user interactivity is given by the well-known word error rate (WER). On the other hand, the word stroke ratio (WSR) can be defined as the number of (word level) user interactions that are necessary to produce correct transcriptions using the CATTI system, divided by the total number of reference words. Finally, the word click rate (WCR) can be defined as the number of additional MAs per word that the user has to perform using the new interaction with respect to using the conventional CATTI system.

The relative difference between WER and WSR (called Estimated Effort-Reduction) gives us an estimation of the reduction in human effort achieved, in terms of words to be corrected, by using CATTI with respect to using a conventional HTR system followed by human post-editing.
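The Estimated Effort-Reduction can be written out explicitly. The sketch below assumes that "relative difference" means (WER - WSR) / WER, which is an interpretation rather than a formula stated in the text, and the function name is hypothetical. The input values are the WER and WSR figures reported in Table 2.

```python
def estimated_effort_reduction(wer, wsr):
    """Relative difference between WER and WSR, in percent.

    Assumes E-R = 100 * (WER - WSR) / WER, with wer and wsr given as percentages.
    """
    if wer <= 0:
        raise ValueError("WER must be positive")
    return 100.0 * (wer - wsr) / wer


# WER and WSR values as reported in Table 2
er_conventional = estimated_effort_reduction(25.6, 23.4)
er_single_ma = estimated_effort_reduction(25.6, 19.6)
print(round(er_conventional, 1), round(er_single_ma, 1))
```

Under this assumption the two calls give about 8.6 and 23.4, which agrees with the E-R columns of Table 2.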

Note that the additional human effort needed for the verification of the transcription and for positioning the cursor in the appropriate place is the same in both the conventional CATTI and the new single-MA user-CATTI interaction systems. In both cases the user should read the transcription proposed by the system until he or she finds an error and then position the cursor in the place where the new word has to be typed.

5.3. Results

Table 2 shows the obtained results. In the first part (left), we can see an estimation of the reduction in human effort (E-R) achieved by using the conventional CATTI system with respect to the classic HTR post-editing. In the second part (right), the results obtained with the new single-MA interaction mode are shown.

Table 2. Results obtained with the corpus IAMDB using the conventional CATTI (left) and the new single-MA interaction (right)

              Conventional CATTI    Single-MA interaction
    WER       WSR       E-R         WSR       E-R
    25.6%     23.4%     8.6%        19.6%     23.4%

It is important to notice that some of the results in Table 2 do not correspond with those reported in [7]. The differences arise because in [7] full Viterbi search was used, while in this work a much faster search technique based on (pruned) word graphs is adopted.

According to Table 2, the estimated human effort to produce error-free transcriptions using MA is significantly reduced with respect to using a conventional HTR system or the conventional CATTI. The new interaction mode can save about 23% of the overall effort.

Fig. 3 shows the WSR, the Estimated Effort-Reduction (E-R) and the word click rate (WCR) as a function of the maximal number of MAs allowed by the user before writing the correct word. The first point (0) corresponds to the results of the conventional CATTI, and the point S corresponds to the single-MA interaction considered in the previous table. A good trade-off is obtained when the maximum number of clicks is around 3, because a significant amount of expected human effort is saved with a fairly low number of extra clicks per word.

Figure 3. WSR, E-R and WCR as a function of the maximal number of MAs allowed by the user before writing the correct word. (Plot for IAMDB; horizontal axis: N. max. clicks, with points 0, S, 1 to 6; left axis: WSR and E-R, 0 to 50; right axis: WCR, 0 to 2.)

6. Remarks and Conclusions

In this paper, we have considered new user feedback sources for CATTI. By considering MA, we have shown that a significant benefit can be obtained, in terms of word-stroke reduction. A simple implementation using word-graphs has been described and some experiments have been carried out.

It is worth noting that alternative (n-best) suffixes could also be obtained with the conventional CATTI system. However, by considering the rejected words to propose the alternative suffixes, the interaction methods studied here are more effective and more comfortable for the user. Moreover, using the single-MA interaction method, a second alternative suffix is obtained without extra human effort.

References

[1] K. T. A. Ogawa and F. Itakura. Balancing acoustic and linguistic probabilities. Proc. IEEE Conf. Acoustics, Speech, and Signal Processing, pages 181-184, 1998.
[2] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1998.
[3] L. Rabiner. A Tutorial of Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, 77:257-286, 1989.
[4] V. Romero et al. Improvements in the computer assisted transcription system of handwritten text images. In Proc. of the PRIS 2008, pages 103-112, June 2008.
[5] V. Romero, A. H. Toselli, L. Rodríguez, and E. Vidal. Computer assisted transcription for ancient text images. In Proc. of ICIAR 2007, Vol. 4633:1182-1193, 2007.
[6] A. H. Toselli et al. Integrated Handwriting Recognition and Interpretation using Finite-State Models. IJPRAI, 18(4):519-539, June 2004.
[7] A. H. Toselli et al. Computer assisted transcription of text images and multimodal interaction. In Proc. of the MLMI, volume 5237 of LNCS, pages 296-308. 2008.
[8] A. H. Toselli, V. Romero, L. Rodríguez, and E. Vidal. Computer assisted transcription of handwritten text. In Proc. of ICDAR 2007, pages 944-948. IEEE Computer Society, 2007.
[9] M. Zimmermann, J.-C. Chappelier, and H. Bunke. Offline grammar-based recognition of handwritten sentences. IEEE TPAMI, 28(5):818-821, May 2006.