Stochastic Segment Modeling for Offline Handwriting Recognition

Size: px

Start display at page:

Download "Stochastic Segment Modeling for Offline Handwriting Recognition"

Kelly Fay Farmer
5 years ago
Views:

2009 10th nternational Conference on Document Analysis and Recognition tochastic egment Modeling for Offline Handwriting Recognition Prem Natarajan, Krishna ubramanian, Anurag Bhardwaj, Rohit Prasad

1 th nternational Conference on Document Analysis and Recognition tochastic egment Modeling for Offline Handwriting Recognition Prem Natarajan, Krishna ubramanian, Anurag Bhardwaj, Rohit Prasad BBN Technologies, 10 Moulton t, Cambridge, MA {prem, Abstract n this paper, we present a novel approach for incorporating structural information into the hidden Markov Modeling (HMM) framework for offline handwriting recognition. Traditionally, structural features have been used in recognition approaches that rely on accurate segmentation of words into smaller units (sub-words or characters). However, such segmentation based approaches do not perform well on real-world handwritten images, because breaks and merges in glyphs typically create new connected components that are not observed in the training data. To mitigate the problem of having to derive accurate segmentation from connected components, we present a novel framework where the HMM based recognition system trained on shorter-span features is used to generate the 2-D character images (the tochastic egments ), and then another classifier that uses structural features extracted from the stochastic character segments generates a new set of scores. Finally, the scores from the HMM system and from structural matching are used in combination to generate a hypothesis that is better than the results from either the HMM or from structural matching alone. We demonstrate the efficacy of our approach by reporting experimental results on a large corpus of handwritten Arabic documents. 1. ntroduction Offline handwriting recognition (OHR) continues to be a challenging research problem due to a variety of reasons. Most recognition approaches that require accurate segmentation of the text into smaller units do not perform well on handwritten text. There are two primary causes for poor performance of segmentationbased approaches on real-world handwritten text. First, segmenting handwritten text for connected scripts such as Arabic is very difficult. econd, most real-world images are prone to degradations that result in breaks and merges in glyphs. This phenomenon creates new connected components that are not observed in training Anurag Bhardwaj contributed to this paper while he was a summer intern at BBN Technologies. data, and therefore the character classifier is unable to accurately recognize the glyphs. OHR systems based on hidden Markov models [1],[2],[3],[4] have been shown to outperform segmentation-based approaches. The advantage with HMM based systems is that they are segmentation-free, that is no pre-segmentation of word/line images into smaller units such as sub-words or characters is required. However, there are well-known limitations with HMM based approaches [5]. These limitations are due to two reasons: (a) the assumption of conditional independence of the observations given the state sequence, and (b) the restrictions on feature extraction imposed by frame-based observations. The limitations noted in [5] are also relevant to OHR systems as they use pixel-level features from narrow slices of the text. pecifically, the narrow windows provide very little contextual information making the conditional independence assumption in these systems unrealistic. n [6], we report on novel work on incorporating structural or longer-span shape features such as Gradient, tructure, and Concavity (GC) into the HMM framework. n that work, we use wider windows for extracting structural features than the standard shorter-span features. Following feature extraction, for each frame, the structural features are concatenated to the shorter-span feature vector for performing model training and recognition. nspired by the work in [5], we present a novel framework for combining structural matching and HMM-based recognition, which has more discriminative power than simply combining the structural and short-span features at each frame. n our framework, an HMM-based system trained on shorterspan features is used to generate the 2-D character images or stochastic segments, and then another classifier that uses structural features extracted from the stochastic character segments generates a new set of scores. The scores from the HMM system and the classifier operating on the structural features are combined to generate the best hypothesis /09 $ EEE DO /CDAR

2 2. tochastic egment Modeling n this section, we formulate the framework for stochastic segment modeling. We begin with the baseline HMM system, which is described as follows. Given a line image,, we extract features, X, using fixed size frames. These frames are typically narrow and capture short span features. The goal of the recognition task is to find a sequence of characters, that maximizes C X), the probability of a sequence of characters C given the feature vector sequence X. Using Bayes' rule, C X) may be written as: P ( C X ) X C) C) / X ) (1) n the equation above, X C) is the character glyph model which can be modeled by a hidden Markov models, and C) is the language model. We now introduce the segment model. A similar model used in speech recognition is the duration model [5]. Let be a segmentation of the image into its constituent characters C. Using Bayesian inference, we can write C ) as: C ) ) X ) C, X ) X ) C ) X ) C ) W X ) C ) C X ) W X ) C ) C X ) W C) (2) n Equation 2, C ) is the score from a VM, NN or other classifier. C X) is the posterior from a HMM. This score includes the character glyph score and also the grammar/language model score. W C) is a character width model. This can be estimated separately from training data as a distribution of normalized width. n deriving equation 2, we assert that C,X) C ); i.e., once is available to the classifier, X does not provide any more information. We also assert that X) W,C X); i.e., has two parts, the character hypothesis C and width W. Finally, we assert that W X) W C); i.e., once C is available, X provides no additional information. The introduction of C ) into equation 2 ensures a tight, interactive coupling between the long- and shortspan features. n order for Equation 2 to be computationally tractable, we restrict the set of possible segmentations,, to small perturbations around the maximum likelihood segmentation produced by the baseline HMM system. n our current paper, we use a simple VM to model C ) and show that even this simple model improves overall performance. 3. Baseline HMM-based OHR system Our HMM based recognizer is the BBN Byblos OCR system [4]. The feature extraction in BBN Byblos system involves horizontally segmenting the line image into frames and then computing a feature vector for each frame. The features used in the current baseline configuration include: Percentile of intensities, Angle, Correlation, and Energy, which we refer to as the PACE features [4]. The training module estimates multi-state, left-to-right HMMs for each character using EM algorithm for maximum likelihood training. The recognition module uses an efficient 2-pass n-best decoder [8]. The decoder performs a two-pass search using both the trained glyph models and the language models (typically a character or word n-gram model trained on a text corpus). For the forward pass of the decoder, a bigram language model is used to generate an initial lattice of hypotheses. The backward pass of the decoder uses an approximate trigram language model to generate a 1-best hypothesis as well as an n-best list or a word lattice. n this paper, we use glyph models that are trained to recognize characters. Ligatures are considered as independent characters and are modeled as such. 4. Design for tochastic egment Modeling The stochastic segment modeling framework involves the following key steps for performing recognition: 1. tochastic egment Generation: First, we generate a set of recognition hypotheses using the HMM system trained on short-span features. Then, for each hypothesis, we extract stochastic segments (2-D character images) using the character segmentation provided by the HMM. 2. egmental Classifier/corer: We extract structural features that represent shape characteristics of the character. Then, we compute a score for each character in the hypothesis using a classifier trained on the stochastic segments from the training data. n this paper, we use support vector machines (VM) as the segmental classifier. 3. core Combination from HMM and egmental Model: We use the score from the HMM and VM for each hypothesis to generate the best hypothesis. 972

3 Figure 1: Example showing the stochastic segmentation of characters in a line image. n the following we describe each of the above steps. n addition, we present the N-best rescoring procedure that we used to evaluate the efficacy of M tochastic egment Generation HMM-based OHR system produces the segmentation into characters as a byproduct of the training and recognition. Therefore, extraction of stochastic segments simply involves using the pixel position from the HMM segments to extract 2-D character images. n Figure 1, we show the segmentation of a line into constituent characters as determined by the HMM system. Figure 2. llustration of the rescoring procedure in which the VM scores are combined with the Glyph egmental Modeling using VMs We use VMs for modeling the stochastic segments using the Gradient-tructure-Concavity (GC) features that have been shown to be effective for offline handwriting recognition [6],[7]. GC features are symbolic, multi-resolution features that combine three different attributes of the shape of a character the gradient representing the local orientation of strokes; structural features that extend the gradient to longer distances and provide information about stroke trajectories; and concavity that captures stroke relationships at long distances. We used the character labels and the GC features extracted from stochastic segments to train a multiclass VM with the radial basis function (RBF) Kernel. The multi-class VM is trained such that, given a GC feature set extracted from a stochastic segment, it outputs a vector of probabilities, each of which is the probability of the given feature set belonging to a particular character. normalized GC feature belonging to the class of its associated character. For generating the composite score for each hypothesis from the segmental model, we compute the geometric mean of the VM scores from each character and use the logarithm of this score as the final VM score. Finally, we use the score from the HMM, the score from the VM, and other scores such as language model probabilities, etc. to compute an overall composite score that is used to re-rank the nbest hypothesis list. 5. Experimental Results n this section, we describe two experiments that were performed using a M built using an VM classifier and GC features. The first experiment was designed to evaluate: (a) the quality of the character segmentation from the HMM, and (b) the classification accuracy of VMs using GC features. The second experiment uses the VM/GC as a segmental scorer for rescoring n-best list together with scores generated from the HMM N-best Rescoring with M Framework A schematic diagram of the N-best rescoring procedure employed for evaluating the efficacy of our proposed stochastic segment modeling framework is shown in Figure 2. Given a test line image, we first extract PACE features from it. These features are then used by the baseline HMM system to generate a set of n-best hypotheses and character segmentation for each hypothesis in the n-best list. Using the character segmentation we extract 2-D character images. The 2D images of characters are then used to extract GC features. The segmental model, in this case the multiclass VM, is then used to generate a probability of the 5.1. Corpus Description We used two sets of corpora in our experiments one corpus is from the Applied Media Analytics (AMA), which we refer to as the AMA corpus and the second one is from the Linguistic Data Consortium (LDC), which we refer to as the LDC corpus. The AMA corpus that we use in our experiments consists of Arabic handwritten documents provided by a diverse body of writers. The collection is based on a set of 200 documents with a variety of formats and layout styles. 973

4 et #mages #Writers Train Dev Test 48 4 Table 1: LDC data used for rescoring experiments The final collection contains a TFF scanned image of each page, an XML file for each document which contains writer and page metadata, the bounding box for each word in the document in pixel coordinates, and a set of offsets representing PAWs (Parts of Arabic Word). We used a subset of the images, scanned at 300dpi for our experiments. This data set is used in our PAW classification experiments as described in ection 4.2. Our LDC corpus consisted of scanned image data of handwritten Arabic text from newswire articles, weblog posts and newsgroup posts, and the corresponding ground truth annotations including tokenized Arabic transcriptions and their English translations. t consists of 1250 images scanned at 300 dpi written by 14 different authors for training, development and testing purposes. n order to ensure a fair test set with no writer or document content in training, 229 images were held-out of the training and development sets. 125 images from this set were randomly chosen as the development set. A total of 48 images by 4 different authors constitute the test set. The details of the split are shown in the Table 1. This dataset is used in our rescoring experiments described in ection Classification using VMs n this section, we describe two experiments. The first experiment uses the VM classifier trained with GC features extracted from manually annotated PAWs in the AMA data set. The second experiment is designed to evaluate character classification accuracy using stochastic segments and also uses images from the AMA data set. n the first experiment, we used manually annotated Part-of-Arabic-Word (PAW) images and the corresponding PAW labels to train a VM classifier. The PAW images and labels were randomly chosen from the from the AMA corpus. We used the entire PAW image to extract features. A total of 6498 training samples from 34 PAW classes were used to train the classifier. The VM training setup is as described in ection 4.2 with the exception that we use PAW images instead of stochastic segments to extract features. The test set consists of 848 PAW images from the same set of 34 PAW classes. From the vector of probability scores produced by the VM for each class label, we chose the class with the highest probability as the classification label for the PAW image. The classification accuracy for this experiment was 82.8%. n the second experiment, we extract stochastic segments from word images from the AMA dataset and use the extracted character segments for training M, as described in ection 4.2. A total of 13,261 character Types of Units # classes Accuracy PAWs % tochastic 74.7% segments 40 Table 2. egment classification accuracy VM classifier. training samples from 40 character classes were used for training. This VM was then used to classify 3315 test samples and resulted in an overall accuracy of 74.7%. As shown in Table 2, the results on the PAW classification task were better than the results for the stochastic segment classification task. There could be several reasons for this degradation to have occurred. The PAWs are bigger than character segments on an average. As a result, PAWs contain more structural information than character segments. Another factor which could have resulted in the poorer performance for the character segment classification is that, since the character boundaries are stochastically determined, they may not have been very accurate or consistent, which could have resulted in features that do not have sharp distributions. We visually analyzed the extracted character images from stochastic segmentation and found them to be pretty accurate and consistent across many instances. n Figure 3, we show multiple instances of the same character appearing in different words written by different authors and see that the character segmentation is pretty consistent across these multiple instances. Hence we do not believe that the approximate stochastic segmentation is a major Figure 3: Each row shows different instances of the same character that has been stochastically segmented from the 974

5 contributor to the drop in accuracy for stochastic segments. Another factor which could have caused the drop in performance is the difference in the number of classes used in training between the two experiments. These results support our view that larger segments provide more structural information, which, when captured by a powerful feature extraction algorithm, can provide more discriminative features. These results also indicate that the GC features are capable of extracting long-span shape information N-best Rescoring with VM for mproving End-to-End Recognition Accuracy This experiment uses the VM using GC features extracted using stochastic segments to rescore an n-best list of hypotheses as described in ection 4.3. A detailed description for this experiment is provided in ection 3. The LDC corpus used for this experiment is described in ection 5.1. The amount of data used for training, development and validation is shown in Table 1. All the training data from the LDC corpus was used for training the baseline HMM system. The VM was trained using 900 randomly chosen stochastic character images for each character class. The results for this experiment, along with the results for the baseline experiment are shown in Table 3. The only difference between the two experiments is the addition of the VM scores for rescoring. The two experiments are otherwise identical. From Table 3, we see that the addition of the VM scores for rescoring improves overall system performance by 2.3% absolute. cores used for Rescoring WER (%) Glyph + LM 55.1 Glyph + LM + M 52.8 Table 3: Results from using the M for N-best rescoring. 5. Conclusions and Future work n this paper, we have described a powerful framework to integrate long-range shape information with shortspan features. Our framework is built on top of an existing HMM system and is segmentation-free, i.e. all needed segmentations are stochastically inferred. We also presented initial experimental results in which we used the stochastic segment models for N-best rescoring. The use of stochastic segment modeling resulted for rescoring gave a 2.3% gain over the baseline HMM system. Future work will involve use of lattices for stochastic segment model based rescoring as well as considering multiple segmentations instead of the 1-best segmentation into characters. 6. Acknowledgments We would like to thank Dr. Venu Govindaraju, Dr. rirangaraj etlur, and Dr. Zhixin hi at the Center for Unified Biometrics and ensors at UNY Buffalo for providing the GC feature extraction library for this research. We would also like to thank Dr. Huping Li t Applied Media Analytics, nc. and Dr. David Doermann from University of Maryland for providing us the AMA corpus. 7. References [1] C. Mokbel, H. Abi Akl, H. Greige. Automatic speech recognition of arabic digits over Telefone network.,proc. nt'l Conf. Research Trends in cience and Technology, [2] V. Märgner, M. Pechwitz, and H. ElAbed. CDAR 2005 Arabic Handwriting Recognition Competition. Proc. nt'l Conf. Document Analysis and Recognition, pp , [3] V. Märgner, M. Pechwitz, and H. ElAbed. Arabic Handwriting Recognition Competition. Proc. nt'l Conf. Document Analysis and Recognition, pp , [4] P. Natarajan,. aleem, R. Prasad, E. MacRostie, K. ubramanian, Multi-lingual Offline Handwriting Recognition Using Hidden Markov Models: A cript- ndependent Approach, pringer Book Chapter on Arabic and Chinese Handwriting Recognition, N: , Vol. 4768, pp , March [5] M. Ostendorf, V. V. Digalakis and O. A. Kimball. From HMM s to egment Models: A Unified View of tochastic Modeling for peech Recognition. EEE Transactions on peech and Audio Processing, 4(5): , 1996 [6]. aleem, K. ubramanian, M. Kamali, R. Prasad, P. Natarajan, mprovements in BBN s HMM-based Offline Arabic Handwritten Recognition ystem, submitted to CDAR [7] J. T. Favata and G. rikantan. A multiple feature/resolution approach to handprinted digit and character recognition. nternationaljournal of maging ystems Technology, 7: , [8]. Austin, R. chwartz and P. Placeway, The forwardbackward search algorithm, EEE nt. Conf. Acoustics, peech, ignal Processing, Toronto, Canada, Vol. V, 1991, pp

(12) U nited States Patent (to) Patent No.: US 8,644,611 B2 Natarajan et al. (45) Date of Patent: Feb. 4,2014

US008644611B2 (12) U nited States Patent (to) Patent No.: Natarajan et al. (4) Date of Patent: Feb. 4,14 (4) SEGMENTAL RESCORING IN TEXT RECOGNITION (7) Inventors: Premkumar Natarajan, Sudbury, MA (US);