NASTAALIGH HANDWRITTEN WORD RECOGNITION USING A CONTINUOUS-DENSITY VARIABLE-DURATION HMM


Reza Safabakhsh and Peyman Adibi
Computational Vision/Intelligence Laboratory, Computer Engineering Department, Amirkabir University of Technology, Hafez Avenue, Tehran, Iran
E-mail: safa@ce.aut.ac.ir, adibi@ce.aut.ac.ir
The Arabian Journal for Science and Engineering, Volume 30, Number 1B, April 2005.

ABSTRACT (translated from the Arabic)

In this paper we present a complete system for recognizing handwritten Farsi Nastaaligh words using a hidden Markov model with continuous observation densities and variable state durations (CDVDHMM). In the preprocessing stage, after binarization, noise removal, and extraction of connected components, a new algorithm detects ascenders, descenders, dots, and other secondary parts and removes them from the main image. A new segmentation algorithm, based on analysis of the upper contour and two auxiliary processes, is then applied. Its purpose is to avoid under-segmentation as far as possible, while variable state durations are used to absorb over-segmentation. After the right-to-left order is obtained, the resulting sequence of sub-characters is modeled by the CDVDHMM. Eight features, comprising three Fourier descriptors and a number of structural features, represent these symbols in the feature space, and the feature vector is invariant to changes of scale. The states of the model are pure characters (without secondary parts) together with several compound forms of the Nastaaligh writing style, so the model is trained easily, without any re-estimation method. In the training stage some of the model parameters are obtained from the set of training images and the rest from the dictionary. A modified version of the Viterbi algorithm is then used for recognition; it yields more than one globally best path, treats the different positions and shapes of letters as sub-states, and supports variable state durations. Tests on handwritten samples with a 50-word dictionary gave good results for the proposed method.

Key words: word recognition, handwritten, segmentation, Farsi, Arabic, Nastaaligh, cursive, hidden Markov model.

ABSTRACT

This paper introduces a complete system for the recognition of Farsi Nastaaligh handwritten words using a continuous-density variable-duration hidden Markov model, CDVDHMM [1]. In the preprocessing stage, after binarization, noise reduction, and connected component specification, new algorithms are applied to find and eliminate ascenders, descenders, dots, and other secondary strokes from the original image. Then a new segmentation algorithm, based on analysis of the upper contour and two other processes, is applied. The main goal of this algorithm is to avoid the under-segmentation problem; the variable-duration states of the model handle the resulting over-segmentation. After the right-to-left order is found, the sequence of obtained sub-characters is modeled by the CDVDHMM. Eight features, including three Fourier descriptors and five structural and discrete features, are used to represent symbols in the feature space. This feature vector is invariant to size and shift. The states of the model are pure characters (without secondary strokes) plus some compound forms of characters in the Nastaaligh handwriting style. Thus, training the model becomes simple and does not need any re-estimation method. In the training stage, some parameters of the model are obtained from the training image set and the others from the dictionary. In the last stage, a modified version of the Viterbi algorithm is applied for recognition. This algorithm provides more than one globally best path, considers the different positions and forms of letters as sub-states, and supports variable-duration states. Experiments on handwritten samples and a 50-word dictionary show very good performance of the system.

Key Words: OCR, handwritten, word recognition, segmentation, Farsi, Arabic, Nastaaligh, cursive, HMM, CDVDHMM.

1. INTRODUCTION

Off-line recognition of handwritten text has many applications in bank check processing, postal address and zip code recognition, and automated handwritten document entry and understanding. As a result, research interest in this field is increasing and some progress has been made. However, the performance of even the best handwritten text recognition systems is still far from human reading ability.

Many papers in recent years have addressed the recognition of Latin, Japanese, and Chinese characters. Although almost one third of the people in the world use Arabic and Farsi characters for writing, only a few scattered efforts have been made toward the automated recognition of these characters. This is probably the result of a lack of adequate support in terms of funding and other resources, such as comprehensive and standard Arabic or Farsi text databases and dictionaries, and certainly of the cursive nature of writing in these languages [2]. More details on the state of the art in Arabic character recognition are presented in [2].

An important criterion in classifying character recognition systems is whether they perform segmentation and, if so, which method they use. The concept and various methods of segmentation are reviewed in [3]. Three basic strategies for segmentation are proposed there, such that each segmentation method can be considered as a weighted combination of these three strategies:
(1) the classic strategy, which attempts to dissect images into classifiable units;
(2) the recognition-based segmentation strategy, which looks for components of the image that match classes of the system's alphabet and decides about segmentation using feedback from the recognition stage; and
(3) the holistic strategy, which tries to recognize a word as a whole.

Holistic methods have the advantage that the difficult dissection stage is not required, but their drawback is that the vocabulary for which the system is designed is limited and cannot be large. The classic and recognition-based methods, on the other hand, are more powerful and are not limited in the number of words they can recognize.

In this paper, we have developed a system for off-line recognition of Nastaaligh handwritten words which uses a recognition-based segmentation method and applies a continuous-density variable-duration hidden Markov model for the recognition task. In Section 2, characteristics of Farsi and Arabic writing are briefly described. In Section 3, the hidden Markov model (HMM) and several word recognition systems based on HMMs are discussed. Section 4 describes the operational stages of the overall system. In Section 5, experimental results for each stage of the method are presented. Section 6 concludes the paper.

2. CHARACTERISTICS OF FARSI/ARABIC CURSIVE SCRIPT

Farsi/Arabic scripts differ from Latin scripts in several ways:
(1) The shape of a Farsi/Arabic letter is a function of the position of that letter in the word. For each letter, there may be up to four different shapes based on the letter position in the word, which are called the first, middle, last, and isolated forms of the letter (Table 1). In addition, for some Farsi writing styles, there is more than one shape for a letter in a fixed position.
(2) Farsi and Arabic writing is naturally cursive. Nevertheless, some characters never connect to the next letter in the word. Because of this, a word can have more than one cursive part. These cursive parts are here referred to as sub-words.
(3) Farsi and Arabic scripts have various styles. Also, each writing style can contain new and compound forms of letters. Thus, if we consider, for example, unconstrained handwritten Farsi text, the number of separate classes that must be considered becomes very large. This makes the recognition process very difficult.
(4) Farsi and Arabic characters can have zero, one, two, or three dots over or under them; and sometimes the only difference between two characters is the existence or the number of these dots.
(5) Farsi and Arabic text, in contrast to Latin text, is written from right to left.

A list of Farsi characters and their different forms is presented in Table 1. The Arabic alphabet is identical to the Farsi alphabet, except that Farsi has four more characters (these four characters are underlined in Table 1). Characters that are similar except for their dots or other secondary strokes can be considered as one family. For example, the family of character Be contains the characters of rows 2 to 5 of Table 1.

Table 1. Farsi character set and shapes of each character in different positions (isolated, first, middle, last). Only the isolated form of each character is shown here. The four Farsi-specific characters, underlined in the original table, are Pe, Che, Zhe, and Gaf.

Row  Character  Letter
1    Alef       ا
2    Be         ب
3    Pe         پ
4    Te         ت
5    Se         ث
6    Jim        ج
7    Che        چ
8    He         ح
9    Khe        خ
10   Dal        د
11   Zal        ذ
12   Re         ر
13   Ze         ز
14   Zhe        ژ
15   Sin        س
16   Shin       ش
17   Sad        ص
18   Zad        ض
19   Ta         ط
20   Za         ظ
21   Ayn        ع
22   Ghayn      غ
23   Fe         ف
24   Ghaf       ق
25   Kaf        ك
26   Gaf        گ
27   Lam        ل
28   Mim        م
29   Noon       ن
30   Waw        و
31   He         ه
32   Ye         ي

Different scripts that use the Farsi alphabet can be divided into eight groups, and each group can include different styles [4]. Some of these styles were more common in the past and some are more common at present. Most of today's handwritten Farsi texts are written in the Nastaaligh and Naskh styles; and the Nastaaligh style, due to its special beauty, is the most popular writing style among most writers. As a result, the Nastaaligh writing style is considered in this paper. The Nastaaligh style, however, despite its wide popularity, is a difficult style with numerous rules and exceptions, which make its automatic recognition very hard. Some examples that show such rules and the difficulties of the work are presented below.

The characters of the Be (ب) family (rows 2 to 5 of Table 1) are written in different forms, depending on the letter following them.
The characters of the He (ح) family (rows 6 to 9 of Table 1) appear in different forms, depending on the letter following them.
The characters of the Kaf (ك) family (rows 25 to 26 of Table 1) appear in different forms, depending on their position in the word.
The character Mim (م) is written in two different forms, rectangular and circular.
The character He (ه) (row 31 of Table 1) can appear in different forms, depending on its position in the word.
Cogs (teeth) in words sometimes appear as curves.
The distance between a character and the middle baseline, that is, the vertical position of the character, can differ depending on the other characters of the word.
Some characters and compound forms rest on the baseline while others do not. In fact, some parts of words are written at an angle of about 30 degrees to the baseline.

Even if we ignore the problems arising from the multiple shapes of some characters, the last two rules above are alone sufficient to indicate the difficulty of segmenting Nastaaligh handwritten words.

In order to create appropriate training and testing sets, we must select a minimal set of words which include the various features of the handwriting style. We assume that the training and testing sets are constrained to the Nastaaligh style only, to reduce the number of patterns that must be recognized. Also, the words included in these sets are carefully selected so that they contain all characters, character shapes, and compound forms of characters in the Nastaaligh style. Figure 1 shows some samples of the words used and Figure 2 illustrates four compound forms of characters in the Nastaaligh style which are included in our system. More details about the selected training and testing sets are presented in Section 5.1.

Figure 1. Some samples of handwritten words in the test set: تخصيص, باج گيرها, هدهد, كاكل, تسريع, مهمات.

Figure 2. Four compound forms in the Nastaaligh Farsi writing style.
1. Compound form made by composition of three letters ك-م-ك or ك-م-ل or ك-م-ا (examples: كمك, اكمل, كمال).
2. The ط-ه composition (example: ظهر).
3. The ه-ا composition (example: بعدها).
4. The ك-ل, ك-ا, and ك-ك compositions (examples: كاكل, كك).
In the above cases, we can also consider گ and ظ in place of ك and ط, respectively.

3. TEXT RECOGNITION BASED ON HIDDEN MARKOV MODELS

3.1. Introduction to HMM

Hidden Markov models are based on doubly stochastic processes whose underlying random process is not directly observable (i.e. it is hidden). The transition of the system from the current state to the next state is governed by this underlying process. Observable outputs, or observations, are produced by another stochastic process, which is determined by symbol probabilities. A hidden Markov model with discrete observation symbols is represented by $\lambda = (A, B, \Pi)$, where $A$ is the state transition probability matrix, $B$ is the set of discrete probability distributions of the observation symbols, and $\Pi$ is the initial state probability distribution [5].
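As a minimal illustration of the discrete model $\lambda = (A, B, \Pi)$, the sketch below evaluates the likelihood of an observation sequence with the standard forward recursion. The function name and the two-state toy parameters are assumptions made for illustration; they are not values from this paper.

```python
def forward_likelihood(A, B, Pi, obs):
    """Return P(obs | lambda) for a discrete HMM lambda = (A, B, Pi).

    A[i][j]: probability of moving from state i to state j
    B[i][o]: probability that state i emits symbol o
    Pi[i]:   probability that the first state is i
    """
    n = len(Pi)
    # Initialisation: alpha_1(i) = Pi_i * b_i(o_1)
    alpha = [Pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: sum over the final states
    return sum(alpha)


# Toy two-state, two-symbol model (hypothetical values).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
Pi = [0.5, 0.5]
print(forward_likelihood(A, B, Pi, [0, 1, 1]))
```

Because the recursion sums over all state paths, the likelihoods of all observation sequences of a fixed length add up to one, which is a convenient sanity check for any implementation.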
In some applications, the distribution of observations is considered continuous and the duration probability of states is modeled explicitly. For example, in [1] a continuous-density variable-duration HMM (CDVDHMM) is applied. This model is represented by $\lambda = (\Pi, A, \Gamma, B, D)$, whose parameters are:

Initial probability: $\Pi = \{\pi_i\}$; $\pi_i = \Pr\{i_1 = q_i\}$, $i_1$ = first state (1)
Transition probability: $A = \{a_{ij}\}$; $a_{ij} = \Pr\{q_j \text{ at } t+1 \mid q_i \text{ at } t\}$ (2)
Last-state probability: $\Gamma = \{\gamma_i\}$; $\gamma_i = \Pr\{i_T = q_i\}$, $i_T$ = last state (3)
Symbol probability: $B = \{b_j(O_t^{t+d})\}$; $O_t^{t+d} = (o_t\, o_{t+1} \ldots o_{t+d})$, $b_j(O) = \Pr\{O \mid q_j\}$ (4)
Duration probability: $D = \{P(d \mid q_i)\}$; $P(d \mid q_i) = \Pr\{\text{duration}(q_i) = d\}$ (5)

In the study of HMMs, there are three basic problems:
(1) Given an observation sequence $O = \{o_1, \ldots, o_T\}$ and a model $\lambda$, how do we compute $P(O \mid \lambda)$ efficiently? This is the scoring problem.
(2) Given an observation sequence $O$ and a model $\lambda$, how can we find the state sequence $q = \{q_1, \ldots, q_T\}$ for $O$ that is optimal in some sense (i.e. best explains the observation sequence)? This is the recognition problem.
(3) How can we find the parameters of the model which maximize $P(O \mid \lambda)$? This is the training problem.

The solution to problem (1) is the forward or backward procedure. For problem (2), the most common optimization criterion is finding a single optimal state sequence (an optimal path); the Viterbi algorithm yields such a sequence. To solve problem (3), one can apply the Baum-Welch re-estimation algorithm. The reader is referred to [5] for more details about these solutions.

3.2. Text Recognition with HMM

In word recognition problems, there are two main approaches to modeling the observation sequence (pseudo-characters) by an HMM [6]. The first approach is called model discriminant HMM. In this strategy, a model is constructed for each class of the problem (each word in the lexicon). Then, to recognize an input word, the score for matching the word to each model is computed, and the class of the model with the maximum score gives the result of the recognition. This approach is reasonable for small dictionary sizes, say up to several hundred words. But when the size of the dictionary grows to about 1000 words or more, this approach has excessive complexity in terms of computation and memory.

The second approach, called path discriminant HMM, is to build only one model for all classes and use different paths (state sequences) to distinguish one pattern from the others. A test pattern is classified into the class which has the maximum path probability over all possible paths. This approach is a better alternative for a large or variable dictionary.

Some researchers have applied path discriminant methods to Latin handwritten word recognition [1, 6-8]. In [7], a second-order HMM is also tested, which for long words has shown better performance than the first-order model. In [6], the inputs to the system are assumed to be unconstrained handwritten words. In this system, the number of states may become too large, so the speed and precision of the system can decrease. In [1], this problem is addressed by considering variable durations for states and using an over-segmentation method which does not leave any two letters unsegmented. In addition, the performance of the system is improved by considering continuous densities for the observation probabilities. The problem with this system is the unreliable training of state duration probabilities from limited training databases. To remove this problem, a system is proposed in [8] whose operation is independent of state duration probabilities.

In [8-12], model discriminant approaches are used. In these systems, a left-to-right (for Latin characters) or right-to-left (for Arabic characters) HMM is considered for each character, and the word model is obtained by concatenation of these HMMs. In [13] and [14], 2-D hidden Markov models are applied to the recognition of printed Arabic words, and in [15], the model is used to improve the recognition of Farsi printed sub-words.

4. THE RECOGNITION SYSTEM

In this paper, a system for off-line recognition of Farsi Nastaaligh handwritten words is presented. The stages of the system are illustrated in the block diagram of Figure 3. First, the necessary preprocessing algorithms are applied to the image of the word. Then, the word is dissected into its letters or pseudo-letters, a set of features is extracted from the image of each segment or combination of adjacent segments, and recognition is done based on the classification of these feature vectors. In the recognition stage, the previously trained model recognizes the word using these feature vectors. Since single segments and combinations of adjacent segments are examined to find the optimal letters, the segmentation method used in our system can be classified as a recognition-based method.

Because of the successful applications of hidden Markov models (HMMs) in word recognition systems [1, 6, 16, 17], an HMM-based method is selected for the recognition stage of the system. The applied model is a continuous-density variable-duration HMM [1]. The cycle of recognition, combination of adjacent segments, and feature extraction, which is driven by the Viterbi algorithm for the HMM, is the key factor in the optimal determination of words in this system.
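To make the idea of scoring single segments and combinations of adjacent segments concrete, here is a minimal sketch of a generic variable-duration Viterbi search, assuming log-domain scores and a maximum duration `D`. It is not the modified Viterbi algorithm of this paper (which also returns several best paths and uses sub-states); the function name `vd_viterbi`, the callback signatures, and the toy string-matching emission score are all hypothetical.

```python
import math


def vd_viterbi(T, N, D, log_pi, log_a, log_gamma, log_dur, log_b):
    """Best path over T segments, N states, each state covering 1..D segments.

    log_dur(j, d): log P(d | q_j); log_b(j, s, t): log-score of state j
    emitting the combined segments s..t-1. Returns ([(state, duration)], score).
    Assumes at least one path has a finite score.
    """
    neg = -math.inf
    # delta[t][j]: best score explaining segments 0..t-1 with state j ending at t
    delta = [[neg] * N for _ in range(T + 1)]
    back = [[None] * N for _ in range(T + 1)]
    for t in range(1, T + 1):
        for j in range(N):
            for d in range(1, min(D, t) + 1):
                s = t - d
                if s == 0:  # state j is the first state of the path
                    prev, arg = log_pi[j], None
                else:       # best predecessor state ending at s
                    prev, arg = max((delta[s][i] + log_a[i][j], i)
                                    for i in range(N))
                score = prev + log_dur(j, d) + log_b(j, s, t)
                if score > delta[t][j]:
                    delta[t][j], back[t][j] = score, (d, arg)
    # Termination with the last-state probability, then backtracking.
    best = max(range(N), key=lambda j: delta[T][j] + log_gamma[j])
    path, t, j = [], T, best
    while t > 0:
        d, i = back[t][j]
        path.append((j, d))
        t, j = t - d, i
    path.reverse()
    return path, delta[T][best] + log_gamma[best]


# Toy example: state 0 is over-segmented into "a" + "b"; state 1 is "c".
segments = ["a", "b", "c"]
templates = ["ab", "c"]
half = math.log(0.5)
path, score = vd_viterbi(
    T=3, N=2, D=2,
    log_pi=[half, half], log_a=[[half, half], [half, half]],
    log_gamma=[half, half],
    log_dur=lambda j, d: half,
    log_b=lambda j, s, t: 0.0 if "".join(segments[s:t]) == templates[j] else -10.0,
)
print(path, score)
```

In the toy run the first state absorbs two adjacent segments, which is how a variable-duration model can tolerate a character that was split by over-segmentation.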

Figure 3. Block diagram of the system: input word image, preprocessing (preprocessed image), segmentation (images of segments), feature extraction (feature vectors), recognition (recognized word), with a feedback loop that combines adjacent segments and passes the images of the combined segments back to feature extraction.

4.1. Preprocessing

In the preprocessing stage, the input image is first binarized by means of the iterative threshold selection method [18]. Then two morphological operations, a closing with a 3×3 and an opening with a 2×2 structuring element, are applied to the image in that order to eliminate spike noise [1]. Next, connected components are found by an algorithm which starts from the top row of the image and builds bounding rectangles. As subsequent rows are added, the height and width of these rectangles are modified, so that when the bottom of the image is reached the final bounding rectangles, i.e. the connected components, are obtained [19]. The pen width is estimated by an algorithm in which the mean value of the vertical run lengths in the columns is computed iteratively; run lengths larger than 1.5 times this mean value are excluded from the computation of the mean in subsequent iterations [6].

For handwritten Farsi/Arabic words, and especially for the Nastaaligh writing style, baseline detection is a difficult and unreliable process. Sometimes, for the Nastaaligh style, more than one baseline, or a slanted baseline, must be considered. Thus, the information provided by the horizontal histograms of words may not be sufficient for baseline detection. Some other methods for baseline detection have also been proposed (e.g. in [20]), but due to their high complexity they should be used only when necessary. As a result, we design our system so that it works independently of the baseline. In our system, since the model receives a sequence of segments, the right-to-left order of the segments must be specified after segmentation.
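The binarization threshold and the pen-width estimate described in this section can be sketched as follows. This is a hedged illustration: the threshold routine is one common form of iterative threshold selection (averaging the means of the two classes until the value stabilises), and whether it matches [18] in every detail is an assumption; the function names, the `eps` tolerance, and the iteration cap are hypothetical. Images are assumed to be non-empty lists of rows.

```python
def iterative_threshold(pixels, eps=0.5):
    """Iterate t = (mean of pixels <= t + mean of pixels > t) / 2 until stable."""
    t = sum(pixels) / len(pixels)
    while True:
        low = [p for p in pixels if p <= t]
        high = [p for p in pixels if p > t]
        if not low or not high:      # all pixels fall in one class
            return t
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2
        if abs(new_t - t) < eps:
            return new_t
        t = new_t


def vertical_runs(img):
    """Lengths of vertical runs of 1-pixels, column by column."""
    runs = []
    for x in range(len(img[0])):
        run = 0
        for y in range(len(img)):
            if img[y][x]:
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    return runs


def estimate_pen_width(img, factor=1.5, max_iter=10):
    """Mean vertical run length, repeatedly discarding runs > factor * mean."""
    runs = vertical_runs(img)        # assumes the image has black pixels
    for _ in range(max_iter):
        mean = sum(runs) / len(runs)
        kept = [r for r in runs if r <= factor * mean]
        if len(kept) == len(runs):   # nothing discarded: estimate is stable
            break
        runs = kept
    return sum(runs) / len(runs)
```

Discarding the long runs keeps vertical strokes, whose run length reflects stroke height rather than pen thickness, from inflating the estimate.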
In the absence of the baseline, the ascenders of the characters Kaf and Gaf (Figure 4(b)) and the descenders of the characters Jim, Che, He, Khe, Ayn, and Ghayn (Figure 4(a)) cause incorrect determination of the right-to-left order. Thus, it is desirable to eliminate the problematic ascenders and descenders in the preprocessing stage. This elimination also prevents their incorrect segmentation, and therefore decreases the segmentation errors. Furthermore, since the recognition process is based on the pure body of the characters, dots and other secondary strokes must also be eliminated from the image in the preprocessing stage. These secondary strokes are processed in a post-processing stage, which completes the recognition task. To eliminate ascenders, descenders, and dots, several new algorithms are proposed, which are explained in the following sections.

Figure 4. Problems arising from descenders and ascenders: (a) in تسريع, Ayn will be considered before Ye; (b) in بكمكش, Kaf will be considered before Be.

4.1.1. Ascender Elimination

The characters Kaf and Gaf have ascenders that should be eliminated. The characteristics that discriminate ascenders from other strokes in a word are their almost 45-degree slope, their straightness, and their relatively large length. These three features form the basis of the ascender detection and elimination algorithm. The character Kaf has only one long ascender, while Gaf has one long and one short ascender. The algorithm for the detection and elimination of ascenders is shown in Figure 5.

1. For each valid connected component (CC) do:
   1.1. Find the top-most point of the lower contour and call it SP (Starting Point). Let k = k1 = 0.
   1.2. While the stop condition is not true, starting from SP, do:
        Traverse the lower contour downward by going to the next point of the lower contour. k = k + 1.
        If (current move is in the left-down direction): k1 = k1 + 1.
   1.3. If (a0 × PW > k > a1 × PW AND k1/k > b1): a short ascender is detected. Mark it.
   1.4. If (k > a0 × PW AND k1/k > b0): a long ascender is detected. Mark it, eliminate its lower contour points from the lower contour of this CC, and go to step 1.1 to search for other ascenders that may exist in this CC.
2. For each ascender detected in step 1 do:
   2.1. Check the validity conditions.
   2.2. If (this ascender is valid): eliminate it from the image by filling it with color ASCENDER_COLOR.

Figure 5. Ascender detection and elimination algorithm.

In this algorithm, PW is the estimated pen width. The parameter values showing the best results in experiments are 4.95, 1.4, 0.58, and 0.33 for a0, a1, b0, and b1, respectively. The lower contour of each connected component is found from the chain code of the outer contour, which is obtained while the connected components are being found. The East, North-East, and South-East directions in the chain code of the outer contour represent the lower contour. In step 1.2 of the algorithm, the stop condition becomes true if one of these situations occurs while the lower contour is traversed:
(i) Movement in the left direction, in the left-up direction, or in a combination of these two directions continues for more than 3, 1, and 3 pixels, respectively.
(ii) A jump occurs with a displacement of more than 1 pixel upward or to the right, more than 4 pixels downward, or more than 3 pixels to the left.
(iii) The last point of the lower contour is reached.
In step 2, a long ascender is considered invalid if it has a long overlap with other strokes near the head of the ascender, if there is a relatively large change in the stroke width near its head, or if there is a downward vertical part at the head of the ascender. A short ascender is valid if it covers approximately all the space of its connected component and lies above and very close to a long ascender.

Figure 6 illustrates the application of the algorithm to a word having two long ascenders and one short ascender. Figure 6(b) shows the lower contour of the word in Figure 6(a). At first, the lower contour is traversed from SP1 to EP1, where EP1 is the last pixel in the lower contour of this connected component. The variables k and k1 satisfy the conditions of a long ascender. Thus SP2 is selected as the new starting point. At EP2 a long jump to the right terminates the loop and another long ascender is detected. Starting again from SP3, the last point of the current connected component, i.e. EP3, is reached, and the values of k and k1 denote a short ascender. The validity conditions are true for all these ascenders, and they are eliminated from the image as shown in Figure 6(c).

Figure 6. (a) A word with the characters Kaf and Gaf. (b) Three ascenders are detected and illustrated with their starting points (SP) and end points (EP). (c) All detected ascenders are valid and are eliminated from the image.

In Figure 7, two samples of the operation of this algorithm are shown. Figure 7(a) shows a successful elimination of the ascenders of two Kaf letters, while Figure 7(b) illustrates a mistake of the algorithm. In this figure the character Te, because of its 45-degree slope and relatively large length, is incorrectly eliminated as an ascender.
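The decision rule on k and k1 in the ascender detection algorithm can be sketched as below, using the thresholds a0, a1, b0, and b1 reported above. This is a simplified illustration under stated assumptions: the contour is given as a list of unit moves `(dx, dy)` with x increasing to the right and y increasing downward, the traversal is assumed to have already stopped (the stop conditions and the validity checks of step 2 are not modeled), and the function name is hypothetical.

```python
# Thresholds reported in the text for a0, a1, b0, b1.
A0, A1, B0, B1 = 4.95, 1.4, 0.58, 0.33


def classify_ascender(moves, pw):
    """Classify a traversed lower-contour stretch as 'long', 'short' or 'none'.

    moves: list of (dx, dy) steps along the lower contour
    pw:    estimated pen width
    """
    k = len(moves)                                          # total steps
    k1 = sum(1 for dx, dy in moves if dx < 0 and dy > 0)    # left-down steps
    if k == 0:
        return "none"
    ratio = k1 / k
    if k > A0 * pw and ratio > B0:
        return "long"
    if A1 * pw < k < A0 * pw and ratio > B1:
        return "short"
    return "none"


left_down, down = (-1, 1), (0, 1)
print(classify_ascender([left_down] * 15 + [down] * 5, pw=3))
```

The ratio k1/k measures how consistently the stroke runs along the diagonal, so a long but mostly vertical stretch is not reported as an ascender.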

Figure 7. Operation of the ascender elimination algorithm: (a) correct operation; (b) incorrect operation.

4.1.2. Descender Elimination

The algorithm which detects and eliminates the descenders of the characters Jim, Che, He, Khe, Ayn, and Ghayn is shown in Figure 8. These descenders cause incorrect determination of the right-to-left order. The algorithm works as follows. In the image, it starts from the right-most column which contains black pixels, selects the bottom-most black run in it, and follows this black run column by column toward the left until the run joins another black run (step 1.1.3). The detected descender is then considered valid if the following conditions are true:
(i) The overlap length of this stroke with the upper strokes (UpRunLen) is relatively large (more than 2.5 × PW, where PW is the estimated pen width).
(ii) The length of this stroke is relatively large (more than 2.5 × PW).
(iii) In at least one column, there are more than two black runs.

1. For each valid connected component (CC) do:
   1.1. For each column of the current CC, starting from the right-most column, do:
        Find the black runs in the current column and let the lowest black run be the current run (CurRun). UpRunLen = 0.
        If (number of black runs is greater than 1 AND current column is not the right-most column): UpRunLen = UpRunLen + 1.
        If (number of black runs which are adjacent to CurRun is more than 1): the current CC does not contain any descender; go to the next CC.
        Else: let the black run which is adjacent to CurRun be CurRun.
        If (number of black runs in the previous column which are adjacent to CurRun is more than 1): a descender is detected:
             Check the validity conditions.
             If (this descender is valid): eliminate it from the image by filling it with color DESCENDER_COLOR.

Figure 8. Descender detection and elimination algorithm.

These characteristics discriminate descenders from other strokes properly. Fig. 9(a) shows a typical result obtained by this algorithm. Fig. 9(b) and 9(c) show situations that make the conditions tested in the algorithm true.

Figure 9. (a) Operation of the descender elimination algorithm. (b), (c) Situations which satisfy the conditions tested in steps of the algorithm.
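Two building blocks of the descender algorithm, finding the black runs of a column and checking validity conditions (i) to (iii), can be sketched as follows. The function names and argument layout are hypothetical, and the column-by-column tracking of CurRun is not reproduced here.

```python
def black_runs(column):
    """Return (start, end) row pairs, end exclusive, of runs of 1s in a column."""
    runs, start = [], None
    for y, value in enumerate(column):
        if value and start is None:
            start = y                       # a run begins
        elif not value and start is not None:
            runs.append((start, y))         # a run ends
            start = None
    if start is not None:                   # run reaches the bottom row
        runs.append((start, len(column)))
    return runs


def is_valid_descender(up_run_len, length, max_runs_in_column, pw, factor=2.5):
    """Conditions (i)-(iii): long overlap, long stroke, some column with > 2 runs."""
    return (up_run_len > factor * pw
            and length > factor * pw
            and max_runs_in_column > 2)


print(black_runs([0, 1, 1, 0, 1]))
print(is_valid_descender(up_run_len=10, length=12, max_runs_in_column=3, pw=3))
```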

11 Secondary Strokes Elimination To detect the secondary strokes, some of their characteristics such as small size and containment in a larger subword can be considered. We consider a connected component as a secondary stroke and eliminate it from the image if its width is less than 2.5 times and its height is less than 5 times the estimated pen width (or vice versa) and it overlaps, at least in 25 percent of its width, with a larger component. The above thresholds are determined such that no other strokes are incorrectly eliminated as secondary strokes. Thus the algorithm retains some secondary strokes that are written moderately large or are distant from character body. To enhance the performance of this algorithm, a primary classifier can be used in the preprocessing stage which can discriminate secondary strokes from letters that have nearly the same size as them (such as single forms of letters Alef, Dal, Zal, Re, Waw, and He ) [21]. Since the number of classes is smaller in this case, the features extracted from the image can be simpler and can be optimized for recognition Segmentation and Determination of the Right-to-Left Order The objective of the segmentation stage here is to achieve an over-segmentation such that each pair of connected characters are split. Then characters can be considered as states in the recognition stage [1]. When a character is segmented to more than one segment, variable duration of HMM states considered in this system, covers this problem. After segmentation, right-to-left order of segments must be found to use in recognition stage. We studied the existing word segmentation techniques and their ability to satisfy the mentioned criteria. The methods that are based on vertical histogram or baseline [22 25] are not suitable for handwritten words, specially for Nastaaligh style, because of various vertical overlaps and horizontal slants that exist in this style. 
Furthermore, methods that use vertical width of strokes [26] do not seem to be very appropriate for moderately free handwritten scripts. We will propose two enhanced methods which are more suitable for Nastaaligh word segmentation. The first method works based on the idea of regular and singular components, and considers the regular components as candidates for segmentation. The second method works based on analysis of upper contour of the words. In next sections, these segmentation methods and the technique used for finding right-to-left order of segments will be explained Segmentation using Regular and Singular Components Segmentation based on regular and singular components (or regularities and singularities) is proposed in [27], [6], and [1] for Latin handwritten and in [19] for Arabic handwritten words. We have implemented a segmentation method, ط, ص based on the same idea. In this method, first the holes in the preprocessed image (such as loops in characters etc.) are filled to avoid segmentation in these loops. Then, an opening operation is performed on the image with a, ه vertical structural element, whose height is a little (one or two pixels) larger than the estimated pen width. In this way, the moderately vertical parts of the image are obtained. Then a closing operation with a horizontal structural element having a small width (about three to five pixels) is performed to join together the vertical parts (resulting from the previous operation) that are close to each other. The results of this operation are called singularities or islands. By subtracting these components from the original image, regularities or bridges are found. At this point, some characters may have too many regularities which will cause an unacceptable over-segmentation. To decrease this problem, those regularities with a width smaller than a threshold (e.g. estimated pen width) are eliminated; i.e., they are added to singularities. 
Then, among the remaining regularities (bridges), those that do not join two singularities (two islands) are also eliminated, i.e. added to the singularities (these regularities are the starting or ending components of the sub-words). Finally, segmentation is performed at the middle of the final regularities.

The following parameters affect the performance of this algorithm:

The height of the structuring element for the opening operation: larger values for this parameter reduce the singularities, leaving more regularities, and thus increase over-segmentation and decrease under-segmentation. Since our goal here is to decrease under-segmentation, a value equal to the estimated pen width plus two is selected for this parameter. This value gave better results in our experiments.

The width of the structuring element for the closing operation: smaller values for this parameter result in more over-segmentation and fewer under-segmentation occurrences. In [1] the value 5 is proposed for this parameter. We have selected the value 3.

April 2005 The Arabian Journal for Science and Engineering, Volume 30, Number 1B.
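The regularity/singularity procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' implementation: the image is a list of 0/1 rows with holes already filled, bridges are detected per column (a column counts as a bridge when it has ink but no island pixel), and the names `opening_1d`, `closing_1d`, `cut_columns`, and `demo_image` are hypothetical. The defaults follow the values given in the text (opening height equal to the pen width plus two, closing width 3).

```python
def opening_1d(bits, k):
    """Opening by a line of length k: keep only runs of ones at least k long."""
    out = [0] * len(bits)
    for i in range(len(bits) - k + 1):
        if all(bits[i:i + k]):
            for j in range(i, i + k):
                out[j] = 1
    return out


def closing_1d(bits, k):
    """Closing by a line of length k: fill gaps shorter than k between ones."""
    padded = [0] * k + list(bits) + [0] * k      # padding keeps border gaps open
    inverted = [1 - b for b in padded]
    closed = [1 - v for v in opening_1d(inverted, k)]
    return closed[k:-k]


def cut_columns(img, pen_width, open_extra=2, close_width=3):
    """Return the columns at which the word would be cut (simplified sketch)."""
    rows, cols = len(img), len(img[0])
    height = pen_width + open_extra
    # Vertical opening, column by column, keeps the moderately vertical parts.
    columns = [opening_1d([img[r][c] for r in range(rows)], height)
               for c in range(cols)]
    opened = [[columns[c][r] for c in range(cols)] for r in range(rows)]
    # Horizontal closing, row by row, joins nearby vertical parts into islands.
    islands = [closing_1d(row, close_width) for row in opened]

    # Column labels: 'I' island, 'B' bridge (ink but no island pixel), ' ' empty.
    label = []
    for c in range(cols):
        if any(islands[r][c] for r in range(rows)):
            label.append("I")
        elif any(img[r][c] for r in range(rows)):
            label.append("B")
        else:
            label.append(" ")

    # Collect maximal runs of bridge columns.
    runs, c = [], 0
    while c < cols:
        if label[c] == "B":
            start = c
            while c < cols and label[c] == "B":
                c += 1
            runs.append((start, c - 1))
        else:
            c += 1

    # Bridges narrower than the pen width are added to the islands.
    kept = []
    for start, end in runs:
        if end - start + 1 < pen_width:
            for k in range(start, end + 1):
                label[k] = "I"
        else:
            kept.append((start, end))

    # Only bridges that join two islands are cut, at their middle column.
    cuts = []
    for start, end in kept:
        if (start > 0 and end < cols - 1
                and label[start - 1] == "I" and label[end + 1] == "I"):
            cuts.append((start + end) // 2)
    return cuts


def demo_image():
    """Two tall strokes joined by a thin bridge, plus a trailing thin stroke."""
    img = [[0] * 14 for _ in range(8)]
    for r in range(1, 7):
        for c in (1, 2, 9, 10):
            img[r][c] = 1
    for r in (5, 6):
        for c in list(range(3, 9)) + [11, 12, 13]:
            img[r][c] = 1
    return img
```

With `pen_width=2`, the demo image is cut once, in the middle of the bridge between the two tall strokes; the trailing thin stroke is not cut because it does not join two islands.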

Figure 10 shows some words, their regularities and singularities, and the resulting segmentation.

Figure 10. Operation of the first segmentation method, which is segmentation based on regularities and singularities. (a) Binarized and noise-reduced images of the words محجوب, بيفكر, and ياسمن. (b) Singularities and regularities shown in black and gray, respectively. (c) Resulting segmentation.

Segmentation using Local Minima of the Word Upper Contour

In [28], the local minima of the upper contour of words are considered as candidate positions for segmentation. If certain conditions are satisfied, segmentation is performed at these positions. In addition, overlapping areas are detected and, if required, segmentation is performed there. We have modified this method to make it suitable for Nastaaligh handwritten words. In the Nastaaligh style, when the character Re is connected to the character before it, it is written without any upper contour minimum between it and the previous character. As a result, this method is not able to segment the character Re. Therefore, a new algorithm for detection and segmentation of the connected Re was developed. First, this algorithm is applied to the word image. Then, the overlapping areas are detected and proper segmentation is performed there. Next, the upper contour is found by a simple method, and its local minima are taken as primary segmentation points (PSPs). A validation process is performed on these PSPs, and the word is segmented at the positions of the valid PSPs. The algorithms for these steps are explained below.

Detection and Segmentation of the Connected Re

The connected Re is detected using an idea similar to the one used for ascender detection. The special characteristics that discriminate the connected Re from other strokes are its almost 45 degree slope and its large length. The proposed algorithm is shown in Figure 11.

1. For each valid connected component do:
   1.1. Let ncol=0.
   1.2. For each column of this connected component, from the leftmost to the rightmost column, do:
        If (there exists more than one black run in the current column OR the width of some run is more than PR0xPW): exit the loop (i.e. goto step 1.3).
        If (the bottommost black point of the current column is more than PR1 pixels under the lowest black point of the first column in the current decreasing trend (to allow for a probable rising end of Re)): exit the loop (i.e. goto step 1.3).
        ncol=ncol+1.
   1.3. If (ncol <= PR3xPW): the sub-word does not contain Re.
        Else:
        The traversed lower contour is considered as a sequence of segments of 3-pixel length, and a label H or S is assigned to each segment on the basis of, respectively, the horizontal or slanted form of the segment.
        The sequence of the labels H and S is smoothed by a state machine (e.g. SHS is converted to SSS, etc.).
        If (the number of columns considered as slanted is more than PR4xPW): a cut with color SEGMENTATION_COLOR is produced at the rightmost of the traversed columns.

Figure 11. Connected Re detection and segmentation algorithm. The values of the parameters are PR0=2.9, PR1=1, PR2=1, PR3=4.8, PR4=2.

The proposed algorithm performs very well. Figure 12 shows some results of this method.

Figure 12. The result of the connected Re detection algorithm for the three words ظهر, شرف, and تيررس.

Detection and Segmentation of Overlapped Strokes

Segmentation using the local minima of the upper contour is not able to segment overlapped strokes either. For example, the middle form of He (row 8 in Table 1) in the Nastaaligh style, the last form of Ye, or the middle form of Mim cannot be segmented by this method. This problem can be solved by finding the overlapped strokes and performing segmentation in these areas (Figure 14). We propose a new algorithm for detection and segmentation of overlapped strokes, which is shown in Figure 13.
Figure 14 shows several good results obtained from this algorithm.

Finding the Minima of Upper Contour and Segmentation

The first step in this stage is finding the upper contour. For this purpose, the chain code at each point of the outer contour found in the previous stages is obtained by traversing the contour, and the points with West, North-West, and South-West directions are saved as the upper contour. Then the weak noise on this contour (one pixel wide and less than ContourNoise high) is eliminated. However, this elimination may cause an under-segmentation problem in some cases in the handwritten Nastaaligh style, e.g. for weak cogs. Thus, we set the parameter ContourNoise to zero. Figure 15 shows the upper contour of a word resulting from this algorithm. Then the local minima of the upper contour are found and the segmentation is performed based on them. The algorithm for these operations is shown in Figure 16.
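The effect of the ContourNoise parameter can be illustrated with a small Python sketch. It is not the chain-code traversal described above: as a simplifying assumption, the upper contour is taken here to be the row of the topmost ink pixel in each column, and only upward one-column bumps are treated as noise. The function names are hypothetical.

```python
def upper_contour(img):
    """Row index of the topmost ink pixel per column (None for empty columns).

    Assumption: a column-wise profile stands in for the chain-code based
    upper contour of the paper. Row 0 is the top of the image.
    """
    rows = len(img)
    contour = []
    for c in range(len(img[0])):
        top = next((r for r in range(rows) if img[r][c]), None)
        contour.append(top)
    return contour


def remove_contour_noise(contour, contour_noise):
    """Flatten one-column-wide bumps that are lower than contour_noise pixels."""
    out = list(contour)
    for c in range(1, len(contour) - 1):
        left, mid, right = contour[c - 1], contour[c], contour[c + 1]
        if None in (left, mid, right):
            continue
        # Height by which this column rises above its higher neighbour.
        bump = min(left, right) - mid
        if 0 < bump < contour_noise:
            out[c] = min(left, right)
    return out


# A weak cog: a single column that rises one pixel above its neighbours.
word = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
]
profile = upper_contour(word)
```

In this toy example, a ContourNoise of 2 flattens the one-pixel cog in column 2, while a value of zero leaves the profile unchanged, which is consistent with the choice reported in the text.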

1. For each valid connected component do:
   1.1. For each column of this connected component, from the rightmost to the leftmost column, do:
        The number, length, and position of the black runs in the current column are found.
        For each black run, starting from the bottommost one, do:
             If (there are no black points in the right column adjacent to this run AND an overlap was found):
                  Move toward the left until the two overlapped pieces join together. In this way, the two columns between which the overlap exists are found.
                  By traversing the outer contour, check that the overlap is not related to a loop.
                  On the upper part of the overlap, the position with the least width is found and marked for segmentation.

Figure 13. Overlapped strokes detection and segmentation algorithm.

Figure 14. Operation of the overlap detection algorithm for the three words محجوب, ياسمن, and انگليسي.

Figure 15. Obtained upper contour for the word تخصيص.

1. For each valid connected component do:
   1.1. Finding PSPs: While the upper contour is traversed, the points at which a falling trend is replaced by a rising one are found by a state machine, and the positions of these local minima are saved as primary segmentation points (PSPs). If the minimum value extends over more than one pixel, the PSP is placed on the pixel where the pen width is minimum; if this situation also extends over more than one pixel, the PSP is placed at the middle of that part.
   1.2. Validation of PSPs: for each PSP found, if there is no loop under it and the width of the pen there is lower than THR1xPW (THR1 is set to 4), the candidate point is valid and is labeled for segmentation.
   1.3. Doing segmentation: at the labeled pixels, a cut is made with color SEGMENTATION_COLOR only if the left and right sides of the cut are black. If one side of the cut is a loop, the column of the hole adjacent to the cut is filled with black to prevent this adjacency.
   1.4. After segmentation, the cuts that result in small segments (less than three pixels wide) are canceled.

Figure 16. Minima of the upper contour detection and segmentation algorithm.
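The PSP search and validation steps of Figure 16 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the contour is given as one height per column (larger means higher), `stroke` holds the local pen width at each column, `loop_below` is a hypothetical precomputed flag per column, and the function names are invented for the example. Only THR1 = 4 comes from the text.

```python
THR1 = 4  # validation threshold reported in the text


def pick_on_plateau(plateau, stroke):
    """Choose the thinnest column of a flat minimum; middle of the first tie run."""
    thinnest = min(stroke[c] for c in plateau)
    run = []
    for c in plateau:
        if stroke[c] == thinnest:
            run.append(c)
        elif run:
            break
    return run[len(run) // 2]


def find_psps(height, stroke):
    """Columns where a falling trend of the contour is replaced by a rising one."""
    psps = []
    falling = False
    plateau_start = None
    for c in range(1, len(height)):
        if height[c] < height[c - 1]:        # contour goes down
            falling = True
            plateau_start = c
        elif height[c] > height[c - 1]:      # contour goes up
            if falling:
                plateau = list(range(plateau_start, c))
                psps.append(pick_on_plateau(plateau, stroke))
            falling = False
        # equal heights: the current trend (and plateau) simply continues
    return psps


def validate_psps(psps, stroke, loop_below, pen_width, thr1=THR1):
    """Keep PSPs with no loop under them and a pen width below thr1 * pen_width."""
    return [c for c in psps
            if not loop_below[c] and stroke[c] < thr1 * pen_width]


height = [5, 5, 3, 2, 2, 2, 4, 5, 3, 6]
stroke = [2, 2, 2, 3, 2, 2, 2, 2, 9, 2]
candidates = find_psps(height, stroke)
valid = validate_psps(candidates, stroke, [False] * len(height), pen_width=2)
```

In this example the flat minimum over columns 3 to 5 yields a single candidate on its thinnest part, and the second candidate is rejected because the stroke there is thicker than THR1 times the pen width.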

Figure 17 shows three samples of the operation of the complete segmentation algorithm, i.e. after running the algorithms of Figures 11, 13, and 16.

Figure 17. Operation of the second segmentation method, which is segmentation based on the minima of the upper contour.

Finding Right-to-Left Order

A relatively complicated algorithm for finding the right-to-left order of the segments is proposed in [1]. The algorithm proposed here is much less complicated. After ascender and descender elimination, the order can be found independently of the baseline. First, the order of the sub-words is found. Then, in each sub-word, the order of the segments is obtained by taking the rightmost segment as the first one in the sub-word, and then traversing the outer contour and recording the order in which the segments are visited. This algorithm is shown in Figure 18.

Feature Extraction

The feature extraction method used in a character recognition system is probably the most important factor in achieving a good recognition rate [29]. Many different feature extraction methods are proposed in the literature, and the most suitable ones are generally found experimentally. After studying various feature extraction methods and testing some of them [21], we selected a mixed feature vector containing various features from the binary and outer contour representations of the pseudo-character images. The features we tested include geometric moments [18] extracted from the binary and thinned representations; Fourier descriptors [30] extracted from the outer contour; discrete and structural features, including loops, the height-to-width ratio, the ratio of the number of black points to the total number of points, and the position of connection to the right and left pseudo-characters [21], extracted from the binary representation; and pixel distribution features plus some other discrete features, such as end points, T-joints, X-joints, and zero-crossing features [6], extracted from skeletons.
Various combinations of these features were tested on images of ideally segmented characters using a mixture-of-Gaussian classifier. Finally, eight features were selected: three Fourier descriptors (descriptors number one to three), the number of loops, the height-to-width ratio, the ratio of the number of black points to the total number of points, and the positions of the right and left connections. In addition to its high discrimination power, this mixed feature vector is short, which increases the speed of recognition. With these features no skeletonization is required, so we avoid the complexity of that process. The feature vector is invariant to scale and shift.

Normalization of the features [6] makes their discrimination effects roughly equal. However, some features may have more discrimination power than others, and normalization ignores this difference. We therefore found a weight for each feature experimentally instead of normalizing the features. These weights, which reflect the importance of each feature, resulted in more discrimination power in the experiments. The Fourier descriptors, which are normalized by the first descriptor, are used with unity weight, the loop feature with weight 40, the height-to-width ratio with weight 15, the ratio of the number of black points to the total number of points with weight 45, and the positions of the left and right connections with weight 40. This feature vector showed a good performance. Table 3 compares the performance of these features with that of the features proposed in [6].
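As a rough illustration of how the weights enter the eight-dimensional feature vector, the following Python sketch assembles it from already measured quantities. The weights are those reported above; everything else is an assumption made for the example: the function name, the argument names, the idea that the three Fourier descriptors arrive already normalized, and the encoding of the connection positions as plain numbers.

```python
# Weights reported in the text.
W_FOURIER = 1.0
W_LOOP = 40.0
W_HEIGHT_TO_WIDTH = 15.0
W_BLACK_RATIO = 45.0
W_CONNECTION = 40.0


def weighted_features(fourier, loops, height, width, black_pixels,
                      right_pos, left_pos):
    """Build the weighted 8-element feature vector of one pseudo-character.

    fourier      -- three Fourier descriptors, assumed already normalized
    loops        -- number of loops
    height/width -- bounding-box size in pixels
    black_pixels -- number of ink pixels inside the bounding box
    right_pos, left_pos -- connection positions (hypothetical numeric encoding)
    """
    if len(fourier) != 3:
        raise ValueError("expected exactly three Fourier descriptors")
    height_to_width = height / width
    black_ratio = black_pixels / (height * width)
    return ([W_FOURIER * f for f in fourier]
            + [W_LOOP * loops,
               W_HEIGHT_TO_WIDTH * height_to_width,
               W_BLACK_RATIO * black_ratio,
               W_CONNECTION * right_pos,
               W_CONNECTION * left_pos])


example = weighted_features([1.0, 0.5, 0.25], loops=1, height=20, width=10,
                            black_pixels=50, right_pos=0.5, left_pos=0.0)
```

Because the ratios are dimensionless, scaling the pseudo-character image leaves them unchanged, which matches the scale invariance claimed for the feature vector.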

1. For each connected component do:
   1.1. Filling the cut positions with black color in a temporary image (a copy of the segmented image): while the outer contour is traversed, each time a pixel with color SEGMENTATION_COLOR is visited, the pixels belonging to this cut are painted black.
2. Finding the right-to-left order of the sub-words: the connected components of the temporary image are sorted by their start columns (i.e. rightmost columns) such that the rightmost sub-word becomes the first one in the order.
3. For each connected component of the temporary image (i.e. each sub-word), in the order obtained in step 2, do:
   3.1. For each connected component of the segmented image (i.e. each segment) do:
        If (the current segment belongs to the current sub-word): it is recorded at the relevant index of an array of sub-words (i.e. the segments contained in each sub-word are recorded).
   The outer contour of the current sub-word is traversed, starting from its right- and top-most point.
   If (the segment currently being traversed is revisited AND there are some unvisited segments): this segment is moved to the end of the order of segments.
   Else: If (there are some unvisited segments): the current segment is added at the end of the order of segments. goto
   If (a pixel with color SEGMENTATION_COLOR is visited): continue traversing in the next segment. goto
   The position of each segment (i.e. first, middle, last, or isolated) in the current sub-word is specified according to the obtained order of segments.
4. The coordinates of one point of each segment are stored in an array, in the obtained order.
5. Ascenders and descenders are added to the image, new connected components are found, and their order is specified using the order of the points stored in step 4.
6. Isolated ascenders are removed from the image.

Figure 18. Finding right-to-left order algorithm.

Recognition

A CDVDHMM model [1] is used for recognition.
The characteristics that make this model fairly suitable for our system are:
(1) The sequential nature of writing: Markov models can successfully encode sequential information.
(2) Hidden states: in the handwritten word recognition task, the system tries to recover the sequence of characters (as hidden states) from the sequence of observed features (as observations).
(3) Continuous symbol probability distribution: there is no vector quantization error in this case, and the multi-shaped property of Farsi/Arabic characters can be modeled fairly well by mixture-of-Gaussian distributions.
(4) Variable duration of states: this aspect can handle the over-segmentation problem.

We consider the pure forms of the characters (i.e. without secondary strokes) and the compound forms of characters in the Nastaaligh style as the states of the model, so the number of states is 25. As mentioned before, Farsi characters have different forms in different positions, and a character in a given position can also have different shapes. Putting all forms of a character in one class therefore does not give a good recognition rate. To compensate for this problem, we have defined the idea of sub-states: the different shapes of each character are treated as sub-states of the state assigned to that character. In the training stage, we then use the training images to obtain the parameters of each sub-state separately. The role of sub-states in improving performance will become clearer in the following sections.

Training the Model

In the training stage, the goal is to estimate the model parameters λ = (Π, A, Γ, B, D) (equations (1) to (5)). As mentioned before, characters are considered as states. The states are therefore meaningful, which makes it possible to avoid re-estimation methods (e.g. the Baum-Welch method) for training, so the training stage becomes simple [1]. Two training sources are used: the training images and the dictionary.
Parameters B and D are obtained from the training images, and the other parameters from the dictionary.
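Because the states are meaningful, the dictionary-derived parameters can be obtained by simple counting rather than re-estimation. The sketch below shows one way this could look. It is an assumption, not the paper's definition: equations (1) to (5) are not reproduced in this section, so the relative-frequency estimates used here for Π (initial states), A (transitions), and Γ (last states) are the conventional choice, and the function name and the toy dictionary are hypothetical.

```python
from collections import Counter, defaultdict


def dictionary_parameters(words):
    """Relative-frequency estimates from a dictionary of state sequences.

    Each word is a sequence of state labels (a string of single-letter
    labels or a list of label strings).
    Returns (pi, transitions, gamma):
      pi[s]             -- fraction of words that start with state s
      transitions[a][b] -- fraction of transitions out of a that go to b
      gamma[s]          -- fraction of words that end with state s
    """
    first = Counter(w[0] for w in words)
    last = Counter(w[-1] for w in words)
    pairs = defaultdict(Counter)
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[a][b] += 1

    total = len(words)
    pi = {s: n / total for s, n in first.items()}
    gamma = {s: n / total for s, n in last.items()}
    transitions = {
        a: {b: n / sum(row.values()) for b, n in row.items()}
        for a, row in pairs.items()
    }
    return pi, transitions, gamma


# Toy dictionary with Latin placeholders standing in for character states.
pi, transitions, gamma = dictionary_parameters(["ab", "abc", "cb", "ba"])
```

With a small dictionary such as the 50-word lexicon mentioned in the abstract, many transitions never occur, so the resulting tables are sparse.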

Training with images

In this stage, the parameters are computed for each sub-state separately. In this subsection, the word "state" refers to a sub-state. After the segmentation algorithm is applied to the training images, the state duration probabilities (D) are computed by manually counting the number of segments of each character. The probability that state $q_i$ has duration $d$, $P(d \mid q_i)$, is equal to the number of times that the character $q_i$ is segmented into $d$ parts, divided by the total number of times that this character appears in the training images. In our training samples, the maximum duration of a state was four. However, to allow for worse cases, we take the maximum duration of the states to be six ($d = 1, 2, \ldots, 6$).

The observation pdf (parameter B) is represented as a finite mixture of the form:

$$b_j(x) = \sum_{m=1}^{M_j} c_{jm}\, N[x, \mu_{jm}, U_{jm}], \quad 1 \le j \le N \tag{6}$$

where $N[\cdot]$ represents a Gaussian distribution with mean vector $\mu_{jm}$ and covariance matrix $U_{jm}$ for the $m$th mixture component at state $j$, $x$ is the vector being modeled, $M_j$ is the number of Gaussian components at state $j$, and $c_{jm}$ is the mixture coefficient for the $m$th Gaussian component at state $j$. The mixture gains satisfy the stochastic constraint:

$$\sum_{m=1}^{M_j} c_{jm} = 1, \quad 1 \le j \le N, \qquad c_{jm} \ge 0, \quad 1 \le m \le M_j \tag{7}$$

We used the k-means clustering algorithm with a free parameter k and a fixed SNR to find the number of Gaussian functions for each state. We used the criterion $J_4 = \mathrm{tr}[S_w] / \mathrm{tr}[S_m]$ to determine a proper SNR [31]. This criterion is the trace of the within-class scatter matrix ($\mathrm{tr}[S_w]$) divided by the trace of the mixture scatter matrix ($\mathrm{tr}[S_m]$). The value 0.9 was found experimentally to be the optimal value for the terminating condition of the algorithm (SNR). The mixture coefficient $c_{jm}$ is the number of training samples in $H_{jm}$ divided by the total number of training samples for state $q_j$, where $H_{jm}$ is the set of samples in cluster $m$ of state $q_j$. For each cluster in state $q_j$, the parameters of the Gaussian distribution are estimated as follows:

$$\mu_{jm} = \frac{1}{N_{jm}} \sum_{x \in H_{jm}} x \tag{8}$$

$$U_{jm} = \frac{1}{N_{jm}} \sum_{x \in H_{jm}} (x - \mu_{jm})(x - \mu_{jm})^{T} \tag{9}$$

where $x$ is the feature vector of a training sample and $N_{jm}$ is the number of samples in $H_{jm}$. The covariance matrix $U_{jm}$ is assumed to be diagonal in our implementation. Because of the limited amount of available training data, a small constant $\rho$ is added to the diagonal elements of the covariance matrix to prevent it from becoming singular [1]. The value 0.1 is selected for $\rho$ in our implementation. The symbol probability density for an observation $O$ is computed in the recognition stage as:

$$b_j(O) = \sum_{m=1}^{M_j} c_{jm}\, \frac{1}{\sqrt{(2\pi)^{n} \det[U_{jm}]}} \exp\!\left[-\frac{1}{2}\,(O - \mu_{jm})^{T} U_{jm}^{-1} (O - \mu_{jm})\right] \tag{10}$$

where $n$ is the dimension of the observation vector. The observation $O$ can be composed of one or several consecutive segments. In handwritten word recognition, the shapes of the consecutive segments resulting from the segmentation process depend on each other. Thus the symbol probability for a composite observation is defined as follows [1]:

$$b_j(o_1 o_2 \ldots o_d) = \left[ b_j(O_1^{d}) \right]^{d} \tag{11}$$

where $O_1^{d}$ is the image built by merging the segment images $o_1, o_2, \ldots, o_d$ together. The power $d$ is used to balance the symbol probability for different numbers of segments. This normalization is necessary when every node in the Viterbi net represents a segment [1].
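Equations (8) to (11) translate fairly directly into code when the covariance matrices are diagonal. The following Python sketch is one possible reading, offered as an illustration only: the clusters are assumed to be given already (for example by k-means), feature vectors are plain lists of floats, and all function names are hypothetical. The value of ρ is the one reported in the text.

```python
import math

RHO = 0.1  # regularization constant reported in the text


def fit_cluster(samples, rho=RHO):
    """Mean and diagonal covariance of one cluster, as in equations (8) and (9)."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(x[k] for x in samples) / n for k in range(dim)]
    # Diagonal of U_jm plus rho, which keeps the matrix non-singular.
    var = [sum((x[k] - mean[k]) ** 2 for x in samples) / n + rho
           for k in range(dim)]
    return mean, var


def fit_state(clusters, rho=RHO):
    """Mixture of one state: a list of (coefficient, mean, variances) tuples."""
    total = sum(len(h) for h in clusters)
    return [(len(h) / total, *fit_cluster(h, rho)) for h in clusters]


def density(obs, mixture):
    """Symbol probability density of equation (10) for diagonal covariances."""
    dim = len(obs)
    value = 0.0
    for coeff, mean, var in mixture:
        det = math.prod(var)  # determinant of a diagonal matrix
        quad = sum((o - m) ** 2 / v for o, m, v in zip(obs, mean, var))
        value += coeff * math.exp(-0.5 * quad) / math.sqrt(
            (2 * math.pi) ** dim * det)
    return value


def composite_density(merged_obs, d, mixture):
    """Equation (11): density of the merged image raised to the power d."""
    return density(merged_obs, mixture) ** d


# Two clusters of 2-D training samples for one state.
state = fit_state([[[0.0, 0.0], [2.0, 0.0]], [[10.0, 10.0]]])
```

Here `merged_obs` stands for the feature vector extracted from the image obtained by merging d consecutive segments; extracting that vector is outside the scope of the sketch.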

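The state duration probabilities described under "Training with images" amount to a normalized histogram per state. A minimal Python sketch follows, assuming the manual counts are available as (state, number of segments) pairs; the function name, the data layout, and the state labels are hypothetical, while the maximum duration of six is the value given in the text.

```python
from collections import Counter, defaultdict

MAX_DURATION = 6  # maximum state duration used in the text


def duration_probabilities(observations, max_duration=MAX_DURATION):
    """Estimate P(d | q) for d = 1..max_duration from (state, parts) pairs.

    Each pair records that one occurrence of the character `state` in the
    training images was split into `parts` segments.
    """
    counts = defaultdict(Counter)
    for state, parts in observations:
        if not 1 <= parts <= max_duration:
            raise ValueError(f"duration {parts} outside 1..{max_duration}")
        counts[state][parts] += 1

    table = {}
    for state, hist in counts.items():
        total = sum(hist.values())
        # Index d - 1 holds the probability of duration d.
        table[state] = [hist[d] / total for d in range(1, max_duration + 1)]
    return table


table = duration_probabilities(
    [("be", 1), ("be", 1), ("be", 2), ("sin", 3)])
```

Durations that never occur in the training data receive probability zero in this sketch; the text does not say whether any smoothing was applied.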