Improved Optical Recognition of Bangla Characters

SUST Studies, Vol. 12, No. 1, 2010; P:69-78

Md. Mahbubul Haque, A.Q.M. Saiful Islam, Md. Mahadi Hasan and M. Shahidur Rahman
Department of Computer Science & Engineering, Shahjalal University of Science & Technology, Sylhet-3114, Bangladesh. Email: rahmanms@sust.edu

Abstract

This paper presents an improved method for optical recognition of Bangla characters, as a step toward building a working system. The recognition process pre-processes the input image, segments it into lines, words and characters, and then recognizes the characters based on a set of classifying criteria. Experience with the OCR (Optical Character Recognition) problem teaches that, for most of the subtasks involved in OCR, no single technique gives perfect results for all the variations of printed text. Based on the features of Bangla characters, we propose a number of classifying features for generating an 11-digit character code, which is then used to identify a particular character. Additionally, the character code is suitable for recognizing conjuncts without breaking them into their constituents. Experimental results show that the proposed algorithm can provide recognition accuracy of up to 99% for regular-shaped distinct basic characters and conjuncts. This gives hope of building a complete working OCR system for Bangla.

1. Introduction

An OCR system can be regarded as one of the most useful tools in the digitization of literature. It is, however, a challenging task to build a practical OCR system [1, 2]. Several problems are encountered in processing a document. The text zone of the document needs to be extracted before recognition of the text can take place. In printed text, skew results in overlap between the text lines, which makes it difficult to split the lines. Background noise can further complicate the situation. Ink spread results in fused characters, and fading can fragment a character.
In OCR, there is a conflicting demand to classify a large set of natural variants into a single class and, at the same time, to discriminate between closely resembling patterns. The last 50 years of research have clearly demonstrated that no single strategy is sufficient for dealing with the complexity of the problem. Moreover, the strategy cannot be the same for reading texts of different scripts and languages. Bangla script is alphabetic in nature, and its words are two-dimensional compositions of characters and symbols, which makes it different from Roman and ideographic scripts. Algorithms that perform well for other scripts can be applied only after extensive pre-processing, which makes simple adaptation ineffective.

An optical character recognition system usually consists of two main stages, namely segmentation and recognition. There is a possibility of ambiguity at each stage. Line segmentation may be ambiguous due to tilt and overlapping of text lines. Some closely written words may not get segmented into individual words. The characters may be fused together or contain unwanted breaks, resulting in wrong segmentation. Very often, adjacent characters touch each other and may lie in overlapping fields. Therefore, it is a complex task to segment a given word correctly into its character components.

B. B. Chaudhuri and U. Pal have done a great amount of work in the area of Bangla OCR as well as segmentation [3-6]. Alamgir et al. have reported various aspects of Bangla script recognition in [7]. Their post-processing system is based on contextual knowledge, which checks the composition syntax. Molla and Talukder [8] have described Bangla numeral recognition based on a structural approach. Neural network approaches for isolated characters have also been reported [9-11]. One of the recent works on character segmentation is found in [12], where the solution concerns the reformation of the characters of the upper and middle zones. Hasnat et al. [13] discussed the problems and provided a training-dependent solution to reduce the segmentation error. Hasnat and Khan dealt with eliminating splitting errors in [14], where they applied rule-based feature information in addition to some classifying features.

Since the basic characters (vowels and consonants) constitute 94%-96% of the text, most research emphasizes recognizing the basic characters. In this paper, in addition to the basic characters, we propose new classifying features for identifying conjuncts. Experimental results show satisfactory accuracy for distinct basic characters and conjuncts. Research is currently going on to attain acceptable results up to sentence and paragraph level so that a true working system for Bangla OCR can be built.

2. Composition of Bangla script from OCR viewpoint

Bangla script can be summarized as comprising 49 basic characters, 14 modifiers called kar, fola, ref and hashanta (placed to the left, right, above, or at the bottom of a character or conjunct), more than conjuncts, 10 numerals and 25 symbols and punctuation marks [15]. Some of the modifiers are placed before or next to the consonant (core modifiers), some are placed above (top modifiers) and some are placed below (lower modifiers) a consonant. In Bangla, a consonant has a pure form as well as a shortened form. Bangla script contains a pure form for all the consonants and a shortened form for many of them.
A consonant in shortened form always touches the next character, yielding conjuncts. Besides the numeric symbols, Bangla script has some punctuation marks, which are used to convey the actual meaning of the text. A horizontal line, referred to as the header line, is drawn on top of all the characters of a word. It is convenient to visualize a Bangla word in terms of three strips: a core strip, a top strip and a bottom strip. The core and top strips are separated by the header line. The top strip contains the top modifiers, the bottom strip contains the lower modifiers, and the core strip contains the characters and core modifiers. If no consonant of a word has a top modifier, the top strip will be empty. Similarly, if no character of a word has a lower modifier, the bottom strip will be empty. A word may have the top strip, the bottom strip, or both.

3. Recognition Methodology

A document can contain images, graphs, tables etc. in addition to the text. Extraction of text zones from the document has been extensively studied [10, 11, 12] and continues to be an active research area. In this paper, we employ a pre-processing stage that extracts uniform text zones from the document image. The current system segments each uniform text zone into text lines, and text lines into words. Words are further segmented into characters and symbols. The characters and symbols may not be valid Bangla symbols when viewed in isolation. We refer to characters and symbols as recognition units. Extraction of units from the preprocessed document image can be performed by the steps outlined in Fig. 1.

A. Line Segmentation

Bangla script is written from left to right and top to bottom. A text line is separated from the previous and following text lines by white space. Line segmentation is therefore based on the horizontal histogram of the document. A horizontal histogram of the uniform text zone is produced. A zero value in the histogram corresponds to a horizontal gap.
The horizontal gaps are assumed to be the line boundaries.
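
The histogram-based line splitting described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the text zone is a binary image stored as a list of rows with 1 for black and 0 for white, and the function names `horizontal_histogram` and `segment_lines` are hypothetical.

```python
def horizontal_histogram(image):
    """Count the black pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image):
    """Return (start_row, end_row) pairs, end exclusive, one per text line.

    Rows whose histogram value is zero are treated as gaps between lines.
    """
    lines = []
    start = None  # first row of the line currently being scanned
    for y, count in enumerate(horizontal_histogram(image)):
        if count > 0 and start is None:
            start = y                      # a new text line begins
        elif count == 0 and start is not None:
            lines.append((start, y))       # a gap closes the current line
            start = None
    if start is not None:                  # text reaches the last row
        lines.append((start, len(image)))
    return lines


zone = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
]
print(segment_lines(zone))
```

The sketch relies on completely empty rows between lines, so skewed or overlapping lines would need the pre-processing discussed in the Introduction before it applies.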

B. Identify Header-Line Position

The position of the header line is a dominating feature for extracting characters and modifiers correctly. The header line is a long horizontal stroke that starts at the left of a word and spans to its right end. Because this line is common to the characters of a word, the horizontal projection shows it as a sharp spike in the plot: the value rises very rapidly and falls rapidly again, with a width of only 1 or 2 pixels. The system checks for such a rapid variation in pixel count while examining the rows one after another. If one is found, its position is stored.

Scan the document image
Preprocessing: Text area recognition
Segmentation:
  => Line segmentation
       Start of line
       End of line
  => Word segmentation
       Detection of top strip
       Detection of core strip
       Detection of bottom strip of word
  => Header line detection
  => Character segmentation
       Preliminary character segmentation
       Defined character bound
       Shadow character segmentation
       Upper and lower modifier segmentation
  => Character recognition
       Categorize character
         Does header exist?
         Bar position? 1. Pre bar? 2. Mid bar? 3. End bar? 4. No bar?
         Is upper modifier?
         Is connected with header line?
       Generate unique character code
         Character Code = Criteria code + Character Sequence (Horizontal & Vertical)
       Identify the character according to the character code
Construction of words and sentences

Fig. 1: The outline of the proposed method

C. Segmentation of a text line into Words

In Bangla script, the characters and symbols of a word are joined together by a header line. As a result, word boundaries are rarely ambiguous. A gap in the header line creates no problem for word boundary identification. A vertical histogram of a text line is made, and every gap of two or more pixels in the histogram is taken to be a word delimiter.
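
A minimal sketch of this word-splitting rule is given below. It assumes a text line is a binary image stored as a list of rows (1 = black, 0 = white); the names `vertical_histogram`, `segment_words` and the parameter `min_gap` are hypothetical, with `min_gap=2` standing for the two-pixel rule stated above.

```python
def vertical_histogram(line_image):
    """Count the black pixels (value 1) in every column of a binary image."""
    return [sum(column) for column in zip(*line_image)]


def segment_words(line_image, min_gap=2):
    """Return (start_col, end_col) pairs, end exclusive, one per word.

    A run of at least `min_gap` empty columns is taken as a word delimiter;
    shorter runs are kept inside the current word.
    """
    hist = vertical_histogram(line_image)
    words = []
    start = None  # first column of the current word
    gap = 0       # length of the current run of empty columns
    for x, count in enumerate(hist):
        if count > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                words.append((start, x - gap + 1))  # word ends before the gap
                start = None
                gap = 0
    if start is not None:
        words.append((start, len(hist) - gap))      # drop trailing empty columns
    return words


text_line = [
    [1, 1, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
]
print(segment_words(text_line))
```

In the example, the single empty column at index 2 stays inside the first word, while the two empty columns at indices 4 and 5 separate it from the second word.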

D. Segmentation of a word into symbols and characters

The header line joins the characters of a word together, which makes the segmentation of a word into its constituent characters slightly difficult. The region above the header line contains upper modifier symbols. The region below the header line contains core characters and lower modifier symbols. Before segmentation can progress any further, the header line must be identified. The header line is easily identified, as it is the most dominant horizontal line in a word. After the header line is removed, vertical gaps separate the top modifiers from their neighbors. The characters below the header line are also separated from their neighbors by vertical gaps.

Fig. 2: Character segmentation from text line (Steps 1 to 6, with the vertical histogram)

E. Segmentation of Modifiers

Modifiers placed to the left or right of a core character can be easily segmented from the vertical projection profile (e.g. বাংলাদেশ in Fig. 2). However, separating upper and lower modifiers is relatively tedious. Horizontal and vertical projections of an upper modifier are taken over the region from (0, 0) to (character width, header-line position). If an upper modifier joins another left or right modifier at the header line, then the concatenation of both is considered a single modifier (as in the cases of 'ি' and 'ী'). Fig. 3 shows an upper modifier, its horizontal and vertical projections, and its concatenation with another (left or right) modifier.

Fig. 3: Segmentation process of the upper modifier (panels: upper modifier, horizontal projection, vertical projection, concatenated)
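
The header-line step of Section D can be illustrated with a short sketch. It is not the authors' code: it assumes a word is a binary image stored as a list of rows (1 = black), takes the header line to be the single row with the most black pixels (the text notes the line may in practice be 1 or 2 pixels thick), and uses the hypothetical names `find_header_line` and `split_below_header`.

```python
def find_header_line(word_image):
    """Return the index of the row with the most black pixels."""
    counts = [sum(row) for row in word_image]
    return counts.index(max(counts))


def split_below_header(word_image):
    """Return (start_col, end_col) pairs, end exclusive, for the pieces
    found below the header line.

    Only the rows under the header row are projected, which has the same
    effect as removing the header line before looking for vertical gaps.
    """
    header = find_header_line(word_image)
    below = word_image[header + 1:]
    hist = [sum(column) for column in zip(*below)]
    pieces = []
    start = None
    for x, count in enumerate(hist):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            pieces.append((start, x))
            start = None
    if start is not None:
        pieces.append((start, len(hist)))
    return pieces


word = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],  # header line
    [0, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0],
]
print(find_header_line(word), split_below_header(word))
```

Touching or overlapping characters leave no empty column between them, so a gap-based split like this one covers only the easy cases.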

Lower modifiers may have a thick joining, a weak joining or no joining at all with the core character. If the lower modifier has no joining with the core character, it can be readily separated by an outer boundary traversal (Fig. 4a). If there is a thick or weak joining, a threshold on the height of the character is determined first, and then the point of lowest density below the threshold height is located. This position is treated as the segmentation point where the core character is separated from the modifier. Fig. 4b shows this case.

Fig. 4: Segmentation process of the lower modifier: (a) no joining, (b) thick joining

4. Recognition of Characters

A. After the word is segmented, the recognition algorithm is applied to the characters. The salient features employed in the recognition process are summarized below.

i. Existence of the header line

In Bangla script, some characters have a complete header line whereas others do not. This can be the first criterion for categorizing the characters, which makes the recognition process easier and faster.

Fig. 5: Recognition of the header line

ii. Position of the Vertical Bar

Bangla characters often exhibit a vertical bar that starts from the header line and spans the whole height of the core strip. Accordingly, characters can be classified as
  Pre Bar characters,
  Mid Bar characters,
  End Bar characters, and
  Non Bar characters.

Fig. 6: Recognition of the placement of the vertical bars
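
One possible way to turn the bar criterion into a digit is sketched below. The paper does not state how the bar is located, so everything here is an assumption made for illustration: a column counts as a bar when at least 90% of its pixels are black (`min_fill`), the character width is split into equal thirds to decide between pre, mid and end, and the function name `bar_position_code` is hypothetical. The returned digits follow the bar codes listed in Table 2 (1 = pre bar, 2 = mid bar, 3 = end bar, 4 = no bar).

```python
def bar_position_code(core_image, min_fill=0.9):
    """Classify the vertical bar of a character's core-strip image.

    core_image is a non-empty list of rows (1 = black, 0 = white).
    """
    height = len(core_image)
    width = len(core_image[0])
    column_counts = [sum(column) for column in zip(*core_image)]
    # Assumption: a bar is a column that is black over (almost) the full height.
    bar_columns = [x for x, count in enumerate(column_counts)
                   if count >= min_fill * height]
    if not bar_columns:
        return 4  # no bar
    # Assumption: the relative centre of the bar columns decides the class.
    centre = (sum(bar_columns) / len(bar_columns) + 0.5) / width
    if centre < 1 / 3:
        return 1  # pre bar
    if centre < 2 / 3:
        return 2  # mid bar
    return 3      # end bar


end_bar = [
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
print(bar_position_code(end_bar))
```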

iii. Top strip

Some characters have a portion in the top strip, which is absent in other characters.

Fig. 7: Recognition of the existence of the top strip

iv. Connection to the Header Line

In some characters the header line is connected with the rest of the character, whereas other characters do not touch the header line, such as 'ত' and 'ভ'.

Fig. 8: Verify if a character touches the header line

v. Dot as a lower modifier

Recognizing the presence of a dot at the lower part of the character further helps classify the characters.

Fig. 9: Recognition of the existence of the dot at the lower part of the character

In addition to the features stated above, some other classifying features can be used for improved accuracy, for example, extra connections with the header line (স, ন), the place of connection with the end bar (ন, স) and the number of connections with the end bar (ন, ষ).

B. Horizontal Zero Crossings

The image of a character is treated as an array of pixels. A black pixel is represented by 1 and a white pixel by 0. The whole array is traced row by row, and the number of transitions from a black pixel to a white pixel is recorded for each row. Let N_i be the number of transitions for the i-th row. The sequence N_i, 0 <= i < n, where n is the pixel height of the character, is referred to as the horizontal zero crossing sequence S.

Fig. 10: The process of horizontal zero-crossings

A character is divided into horizontal segments of equal height, and each segment is represented by the number of zero-transitions that is most frequent in the segment. We divide the sequence S into three sub-sequences of equal length, S1, S2 and S3. For each sub-sequence, the most frequent number is stored in S1-Most, S2-Most and S3-Most. The feature vector (S1-Most, S2-Most, S3-Most) is then used to filter out false character candidates. The following table shows the obtained sequences. The resulting codes are the same for all 3 font sizes of 'ক' and 'খ', namely 131 and 122, respectively. Similar results have been obtained using four different font faces.

Table 1: Sequence S for four samples of the characters 'ক' and 'খ'

Sequence S for 'ক'
  111133323321111 => 11113 33233 21111 => 131
  111133343321111 => 11113 33333 21111 => 131
  111133343321111 => 11111 33343 32111 => 131

Sequence S for 'খ'
  3111 12222 22211 => 3111 12222 22211 => 122
  3111 22212 22211 => 3111 22212 22211 => 122
  3111 22211 22211 => 3111 22211 22211 => 122

C. Vertical Zero Crossings

Vertical zero crossings are computed in the same way as horizontal zero crossings, except that the image block is divided into vertical segments. The process of vertical zero crossing yields another 3-digit code.

Fig. 11: The process of vertical zero-crossings

D. Recognition from the Generated Code sequence

The processes in Sections A, B and C produce an 11-digit code (5 digits of category code, 3 digits from horizontal zero crossings and 3 digits from vertical zero crossings) that is employed for identifying a character.
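
The reduction of a zero-crossing sequence to three digits, and the assembly of the 11-digit code, can be sketched as follows. The function names `three_digit_code` and `character_code` are hypothetical. Two details are assumptions not stated in the paper: a sequence whose length is not a multiple of three is cut at the integer boundaries `part * n // 3`, and a tie for the most frequent value is resolved in favour of the smaller value. The sketch also assumes at least three rows and single-digit transition counts.

```python
def three_digit_code(sequence):
    """Split a zero-crossing sequence into three parts and keep the most
    frequent value of each part (S1-Most, S2-Most, S3-Most)."""
    n = len(sequence)
    digits = []
    for part in range(3):
        chunk = sequence[part * n // 3:(part + 1) * n // 3]
        counts = {}
        for value in chunk:
            counts[value] = counts.get(value, 0) + 1
        # Assumption: on a tie, the smaller value wins.
        digits.append(max(counts, key=lambda v: (counts[v], -v)))
    return "".join(str(d) for d in digits)


def character_code(header, bar, upper_modifier, connected, dot,
                   horizontal_code, vertical_code):
    """Concatenate the 5 criteria digits with the two 3-digit codes."""
    criteria = f"{header}{bar}{upper_modifier}{connected}{dot}"
    return criteria + horizontal_code + vertical_code


# First sequence listed in Table 1.
s = [int(c) for c in "111133323321111"]
print(three_digit_code(s))

# Criteria digits and codes of the first row of Table 2.
print(character_code(1, 2, 0, 1, 0, "133", "122"))
```

A lookup from the resulting string to a character, for example a dictionary keyed by the 11-digit code, would complete the identification step. Since the code depends on font face and size, such a table would have to be built per font.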

Table 2: Generating character code using the classifying features

Criteria code (5 digits):
  Header:                  1 = Yes, 0 = No
  Bar:                     1 = Pre bar, 2 = Mid bar, 3 = End bar, 4 = No bar
  Upper modifier:          1 = Yes, 0 = No
  Connection with header:  1 = Yes, 0 = No
  Existence of dot:        1 = Yes, 0 = No

Character  Header  Bar  Upper modifier  Connection with header  Dot  Horizontal code (3 digit)  Vertical code (3 digit)  Unicode
ক          1       2    0               1                       0    133                        122                      004b
খ          0       3    0               0                       0    121                        232                      004c

The obtained character code can vary according to font face and size. In the next section we present experimental results for the font face SutonnyMJ.

5. Recognition Results

Four snapshots from the program output are shown in Fig. 5. As seen in the figure, four types of input (same-size basic characters, different-size basic characters, same-size conjuncts and different-size conjuncts) have been used to verify the performance of the proposed method. The font size varies from 14 to 32 points. The experimental setup and the obtained recognition accuracy are summarized in Table 3.

Table 3: Recognition accuracy of the proposed method

Font size  No of Characters (basic)  Recognition accuracy (%)  No of Characters (conjuncts)  Recognition accuracy (%)
14         50                        99.9                      50                            99.9
16         100                       85                        100                           90
18         150                       87                        150                           87
20         200                       90                        200                           92
22         250                       92                        250                           93
24         300                       86                        300                           87
26         350                       82                        350                           79
28         400                       90                        400                           92
30         450                       81                        450                           85
32         500                       88                        500                           95

6. Conclusion

In this paper, we have presented a method for recognizing Bangla characters. The results presented here, obtained using frequently used basic characters and conjuncts, are satisfactory. The accuracy, however, decreases when recognizing characters from a complete sentence. Research is going on to improve the accuracy on characters segmented from sentences. At present, we consider only regular-shaped fonts of varying sizes. By incorporating a neural-network-based post-processor, we aim to build a complete working system for Bangla OCR that addresses more variations in printed Bangla text.

References

[1] S. Kahan, T. Pavlidis, and H. S. Baird, "On the recognition of printed characters of any font and size", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 9(2), pp. 274-288, 1987.
[2] M-K. Kim, Y-B. Kwon, "Multi-font and Multi-Size Character Recognition Based on the Sampling and Quantization of an Unwrapped Contour", Proc. of Intl. Conf. on Pattern Recognition, vol. 3, pp. 170, Vienna, Aug 25-29, 1996.
[3] U. Pal, B. B. Chaudhuri, "OCR in Bangla: an Indo-Bangladeshi Language", Proc. of ICPR, pp. 269-274, Jerusalem, Israel, 1994.
[4] B. B. Chaudhuri, U. Pal, "An OCR System to Read Two Indian Language Scripts: Bangla and Devnagari (Hindi)", Proc. of 4th ICDAR, vol. 2, pp. 1011-1015, Ulm, Germany, 1997.
[5] U. Garain and B. B. Chaudhuri, "Segmentation of Touching Characters in Printed Devnagari and Bangla Scripts Using Fuzzy Multifactorial Analysis", IEEE Transactions on Systems, Man and Cybernetics, vol. 32, pp. 449-459, Nov. 2002.
[6] B. B. Chaudhuri, "Digital document processing: major directions and recent advances", Springer, London, 2007.
[7] Mohammed Alamgir, Md. Khademul Islam Molla, and Muhammed Zafar Iqbal, "Bangla Character Recognition System", Proc. of ICCIT 99, pp. 159-163, 3-5 December 1999.
[8] Md. Khademul Islam Molla and Kamrul Hasan Talukder, "Bangla Number Extraction and Recognition from Document Image", Proc. of ICCIT, pp. 200-206, Dhaka, 2002.
[9] A. A. Chowdhury, Ejaj Ahmed, S. Ahmed, S. Hossain and C. M. Rahman, "Optical Character Recognition of Bangla Characters using neural network: A better approach", Proc. of 2nd ICEE, 2002.
[10] J. U. Mahmud, M. F. Raihan and C. M. Rahman, "A Complete OCR System for Continuous Bangla Characters", Proc. of the Conf. on Convergent Technologies, 2003.
[11] S. M. Shoeb Shatil and Mumit Khan, "Minimally Segmenting High Performance Bangla OCR using Kohonen Network", Proc. of 9th ICCIT, 2006.
[12] Md. Abdus Sattar, Khaled Mahmud, Humayun Arafat and A F M Noor Uz Zaman, "Segmenting Bangla Text for Optical Recognition", Proc. of ICCIT, 2007.
[13] Md. Abul Hasnat, S M Murtoza Habib and Mumit Khan, "A High Performance Domain Specific OCR for Bangla Script", Proc. of Int. Joint Conf. on Computer, Information, and Systems Sciences, and Engineering (CISSE), 2007.
[14] Md. Abul Hasnat and Mumit Khan, "Elimination of Splitting Errors in Printed Bangla Scripts", Proc. of the Conference on Language & Technology 2009, 22-24 January 2009.
[15] A. Elahi, "Bangla OCR: A Complete Working System", B.Sc. Thesis, 2006, Dept. of Computer Science & Engineering, Shahjalal University of Science & Technology, Sylhet, Bangladesh.

Fig. 5: Snapshots from the program output

Submitted: 8th July 2009; Accepted for Publication: 25th January, 2010.