Character Segmentation and Ground Truth Preparation for Handwritten Bangla Word Images

Submitted by
SANCHITA MAITY
Exam. Roll No. : MCA of
University Regn. No. : of

Under the guidance of
Mr. Ram Sarkar
Department of Computer Science and Engineering, Jadavpur University

A dissertation submitted in partial fulfillment of the requirements for the award of Master of Computer Application (MCA)

Department of Computer Science and Engineering
Faculty of Engineering and Technology
Jadavpur University
Kolkata

CONTENTS

Chapter 1: Introduction
  1.1 An Overview on Optical Character Recognition (OCR)
    1.1.1 Description
    1.1.2 History of OCR
    1.1.3 Problem of OCR
    1.1.4 Recent Trends in OCR Research
  1.2 Characteristics of Bangla Script
  1.3 Character Segmentation and Ground-truthing
    1.3.1 What is character segmentation?
    1.3.2 What is ground-truthing?
  1.4 Importance of Handwritten Bangla Word
Chapter 2: Review of Existing Work
  2.1 Problems of Character Segmentation from Handwritten Bangla Word Images
  2.2 Some Recent Character Segmentation and Ground-truthing Methodologies
    2.2.1 A fuzzy technique for character segmentation
    2.2.2 A two stage approach for segmentation
    2.2.3 A database for unconstrained handwritten Bangla word images
    2.2.4 A complete handwritten numeral database
  2.3 Motivation
Chapter 3: Present Work
  3.1 Data Collection Methodologies
  3.2 Segmentation
    Selection of SF and DNS Components
    Initial Selection of Obvious SF and DNS Class Components
    Classification of SF/DNS Components using MLP
    Determination of Matra Pixels using a Fuzzy Membership Function and Horizontalness Feature for SF Components
    Determination of Potential Segmentation Points using Two Fuzzy Membership Functions for SF Components
    Identification of Actual Segmentation Points in the SF Components
  Preparation of Ground-truthed Images
Chapter 4: Conclusion
References

Chapter 1
Introduction

1.1 An Overview on Optical Character Recognition (OCR)

1.1.1 Description

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text. Broadly speaking, an OCR system lowers the barrier of the keyboard interface between man and machine to a great extent and helps to advance office automation. In doing so, OCR systems facilitate large-scale document transcription with a huge saving of time and human effort. Such systems have potential applications in reading amounts from bank checks, extracting data from filled-in forms, interpreting handwritten addresses on mail pieces for automatic routing, and so on.

OCR is a field of research in pattern recognition, artificial intelligence and machine vision. Though academic research in the field continues, the focus of OCR has shifted to the implementation of proven techniques. Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields. Because very few applications survive that use true optical techniques, the term OCR has now been broadened to include digital image processing as well.

Early systems required training (the provision of known samples of each character) to read a specific font. Intelligent systems with a high degree of recognition accuracy for most fonts are now common. Some systems are even capable of reproducing output that closely approximates the original scanned page, including images, columns and other non-textual components.

1.1.2 History of OCR

In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Handel, who obtained a US patent on OCR in 1933 (U.S. Patent 1,915,993). In 1935 Tauschek was also granted a US patent on his method (U.S. Patent 2,026,329). Tauschek's machine was a mechanical device that used templates. A photo detector was placed so that when the template and the character to be recognized were lined up for an exact match and a light was directed towards them, no light would reach the photo detector.

In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security Agency in the United States, was asked by Frank Rowlett, who had broken the Japanese PURPLE diplomatic code, to work with Dr. Louis Tordella to recommend data automation procedures for the Agency. This included the problem of converting printed messages into machine language for computer processing. Shepard decided it must be possible to build a machine to do this and, with the help of Harvey Cook, a friend, built "Gismo" in his attic during evenings and weekends. This was reported in the Washington Daily News on 27 April 1951 and in the New York Times on 26 December 1953, after his U.S. Patent 2,663,758 was issued.

Shepard then founded Intelligent Machines Research Corporation (IMR), which went on to deliver the world's first several OCR systems used in commercial operation. While both Gismo and the later IMR systems used image analysis, as opposed to character matching, and could accept some font variation, Gismo was limited to reasonably close vertical registration, whereas the following commercial IMR scanners analyzed characters anywhere in the scanned field, a practical necessity on real-world documents.

The first commercial system was installed at the Reader's Digest in 1955. Many years later, Reader's Digest donated it to the Smithsonian, where it was put on display. The second system was sold to the Standard Oil Company of California for reading credit card imprints for billing purposes, with many more systems sold to other oil companies. Other systems sold by IMR during the late 1950s included a bill stub reader to the Ohio Bell Telephone Company and a page scanner to the United States Air Force for reading typewritten messages and transmitting them by teletype. IBM and others were later licensed on Shepard's OCR patents.

In about 1965 Reader's Digest and RCA collaborated to build an OCR document reader designed to digitize the serial numbers on Reader's Digest coupons returned from advertisements. The documents were printed by an RCA drum printer using the OCR-A font. The reader was connected directly to an RCA 301 computer (one of the first solid-state computers). It was followed by a specialized document reader installed at TWA, where it processed airline ticket stock (a task made more difficult by the carbonized backing on the ticket stock). The readers processed documents at a rate of 1500 documents per minute and checked each document, rejecting those they were not able to process correctly. The product became part of the RCA product line as a reader designed to process turnaround documents, such as the utility and insurance bills returned with payments.

The United States Postal Service has been using OCR machines to sort mail since 1965, based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO). In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post also uses OCR systems, which read the name and address of the addressee at the first mechanized sorting center and print a routing bar code on the envelope based on the postal code. After that, the letters need only be sorted at later centers by less expensive sorters that read only the bar code. To avoid interference with the human-readable address field, which can be located anywhere on the letter, a special ink is used that is clearly visible under ultraviolet light. This ink looks orange in normal lighting conditions. Envelopes marked with the machine-readable bar code may then be processed.

In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and led development of the first omni-font optical character recognition system, a computer program capable of recognizing text printed in any normal font. He decided that the best application of the technology would be to create a reading machine for the blind, which would allow blind people to understand written text by having a computer read it to them out loud. However, this device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind. Called the Kurzweil Reading Machine, the device covered an entire tabletop, but functioned exactly as intended. On the day of the machine's unveiling, Walter Cronkite used the machine to give his signature sign-off, "And that's the way it was, January 13, 1976." While listening to The Today Show, musician Stevie Wonder heard a demonstration of the device and personally purchased the first production version of the Kurzweil Reading Machine.

In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Kurzweil Computer Products thus became a subsidiary of Xerox known as ScanSoft (now Nuance).

1.1.3 Problem of OCR

OCR of textual documents in general involves the following problems.

i) Image acquisition
ii) Text line extraction from document images
iii) Word segmentation and character segmentation
iv) Character recognition and word recognition

Optical scanners attached to PCs are mostly used for capturing digital document images. Extraction of text lines from document images is a trivial problem provided that the document image is unskewed. Text lines can then be extracted easily by identifying the valleys of the horizontal pixel density histogram of the image. In practical situations, however, document images are skewed at least to some extent and this technique fails. Many text lines may also touch each other, and skew is inherent in handwritten text. So special techniques are required for character segmentation of handwritten Bangla word images.

Segmentation of isolated word images, extracted from optically scanned document images of handwritten text, is one of the major problems of optical character recognition (OCR). A better method for segmenting handwritten words into characters also improves the recognition of those characters. So segmentation of words into characters, like character recognition itself, contributes substantially to the overall performance of an OCR system. Characters segmented from a document image are to be recognized so that they can be coded in ASCII or some other standard character code.

For any of the widely used non-holistic OCR approaches, the success of a specific technique depends on how well a word can be segmented into pieces, which are subsequently considered as candidates for its constituent characters. The better the segmentation, the less ambiguity is encountered in the recognition of candidate characters or word pieces. To recognize a candidate character, its context also requires due consideration.
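The valley-based text line extraction mentioned in this section can be sketched as follows. This is a minimal illustration rather than the dissertation's implementation: it assumes an unskewed binary image stored as a list of rows with 1 for black pixels, and the function names and the `min_pixels` threshold are hypothetical.

```python
def horizontal_profile(image):
    """Count the black pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def extract_text_lines(image, min_pixels=1):
    """Return (top_row, bottom_row) for each text line in an unskewed image.

    Rows whose black-pixel count falls below min_pixels are treated as
    valleys of the histogram, i.e. gaps between text lines.
    """
    profile = horizontal_profile(image)
    lines = []
    start = None
    for r, count in enumerate(profile):
        if count >= min_pixels and start is None:
            start = r                      # a text line begins
        elif count < min_pixels and start is not None:
            lines.append((start, r - 1))   # a valley closes the line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


# Toy page with two text lines separated by two empty rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0],
]
print(extract_text_lines(page))
```

On a skewed or touching-line image the valleys no longer reach zero, which is exactly the failure case the text describes.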

Because of the variation in shapes and sizes, character segmentation of handwritten Bangla word images requires more sophisticated techniques than that of printed characters.

1.1.4 Recent Trends in OCR Research

Research on OCR has mostly concentrated on text of European languages based on the Roman alphabet. Possibly the prevalence of European languages in the industrialized West has interested both researchers and entrepreneurs in OCR of text of European languages, including English. Scripts of Asian languages such as Chinese, Korean, Japanese and Arabic have also received considerable attention from researchers working in the field of OCR. Other than these, a number of Indian scripts, viz., Devnagari, Oriya and Bangla, have started to receive attention in OCR-related research in recent years. Of these, Bangla is the second most popular script and language in the Indian subcontinent. As a script, it is used for the Bangla, Assamese and Manipuri languages. Bangla, which is also the national language of Bangladesh, is the fifth most popular language in the world. Such is the importance of Bangla both as a script and as a language. Yet reports of research on OCR of handwritten Bangla characters are few in the literature.

1.2 Characteristics of Bangla script

Characters of Bangla script can be grouped into five categories, viz., vowel, consonant, modified shape, compound character and punctuation symbol. Of these, vowels and consonants, which constitute the Bangla alphabet, are called basic characters. There are 11 vowels and 39 consonants in the Bangla alphabet. There is no concept of upper and lower case characters in Bangla script. Characters in Bangla script are written from left to right.

A vowel following a consonant in a word takes a modified shape in Bangla script. Such shapes of all vowels are termed modified shapes. It is noteworthy that some modified shapes attached to a consonant have two isolated parts appearing on two opposite sides of the consonant. Some modified shapes may appear just below the consonant, and some may reach its top from one of its sides with a curved or partly curved segment. So characters in Bangla script may not always appear in non-overlapping consecutive positions.

Depending on the mode of pronunciation, a Bangla consonant followed by one or two consonants takes a complex shape, which is called a compound character. There are in all 280 compound characters in Bangla script. Apart from the basic characters, the modified shapes and the compound characters, Bangla script also includes 10 digit patterns.

An important feature of Bangla characters is the Matra or head line. Excluding a few, all basic and compound characters of Bangla script have this feature. The width of a Matra is nearly the same as the width of the character it touches. All the Matras of consecutive characters appearing in a Bangla word are joined to form a common Matra of the characters in the word.

Fig. 1 Bangla alphabet basic shape (The first 11 characters are vowels while others are consonants.)

Fig. 2 Examples of vowel and consonant modifiers: (a) vowel modifiers, (b) exceptional cases of vowel modifiers and (c) consonant modifiers

1.3 Character segmentation and ground truthing

1.3.1 What is character segmentation?

Character segmentation is a necessary preprocessing step for character recognition in many handwritten word recognition systems. The most difficult case in character segmentation is cursive script. The fully cursive nature of Bangla handwriting poses considerable challenges for automatic character segmentation.

Character segmentation techniques are mostly script dependent. This is not only because character shapes vary from one alphabet to another but also because of certain script-specific features of the text document. Segmentation of isolated words into constituent characters is a challenging problem for Bangla script. The appearance of consecutive characters in overlapping column positions over a text line makes the segmentation of Bangla words more complex than the segmentation of English words. The problem is compounded in handwritten Bangla words by the variation in sizes and shapes of handwritten characters. Considering all this, a novel technique for segmenting images of handwritten Bangla words is presented in this work.

Before the individual characters of each Bangla word in the text image are segmented, the word is horizontally partitioned into three adjacent zones as shown in Fig. 3. The portion of each word on and above the Matra constitutes the upper zone, the main body of the characters in a word lies within the middle zone, and the portion of the word below the main body, containing mainly modified shapes and period-like isolated character components, forms the lower zone. The technique of word segmentation is based on detection of the Matra.

Fig. 3 Illustration of three zones and region boundaries of a Bangla word

A Matra is a horizontal line that touches the upper part of many characters of Bangla script, as shown in Fig. 4(a). Depending on the character, it covers at most the entire character width. Consecutive characters in a Bangla word that have Matras are joined through a common Matra, formed by joining the Matras of the individual characters as shown in Fig. 4(b). This line may have discontinuities at the positions where characters without Matras appear in the word.

In handwritten Bangla words, the Matras are not as strictly horizontal as they are in printed words. So the technique of removing the Matra of a word to segment its constituent characters may leave many characters joined with each other. Such under-segmentation may complicate classification decisions in the subsequent stage. How to segment handwritten Bangla words into constituent characters efficiently is still a challenging problem of OCR-related research. This is a major motivation behind the present work, which deals with the problem of segmenting handwritten Bangla words into constituent characters.

Fig. 4(a) An illustration of the common Matra of a word
Fig. 4(b) An illustration of Matra of individual characters in a word

In image analysis, manual testing of an algorithm consumes considerable time and manpower, yet it is still widely practised around the world. Even the testing scheme is not standard. Different organizations use different testing schemes, so the reported success rates vary. Standard databases are also scarce. Generating results for a particular technique therefore becomes laborious for researchers, as they need to collect or prepare a database for their own work.

1.3.2 What is ground truthing?

Generation of appropriate ground truth data has always been a challenging and time-consuming task for the kind of problem under consideration. The availability of ground truth information, however, makes any database more useful, enabling proper evaluation of a technique by comparing its output with the corresponding ground truth. In the present work, we have prepared ground truth images for a subset of our database, viz., CMATERg1.1.1 and CMATERg1.2.1 respectively. We have prepared these ground truth images in a semi-automatic way. More specifically, we have employed our previously developed technique [9] to identify individual character segments from any document image. The possible errors generated in the automated character segmentation are corrected using a software tool called GT Gen version 1.1, which we have developed for this project. Basically, we have used GT Gen to recolor the characters that were erroneously labeled by our previously developed technique [9]. It may be noted that all the ground truth images are stored in bitmap (bmp) file format, where the background is labeled in white and individual characters are marked in different colors.

1.4 Importance of handwritten Bangla word

Bangla is an important South Asian script widely used in India and Bangladesh. In popularity, Bangla ranks 5th in the world and 2nd in India, both as a script and as a language. It is also the national language of Bangladesh. Handwritten Bangla words are cursive in most cases, so identifying each character is difficult for any segmentation algorithm. In handwritten Bangla words, the Matras are not as strictly horizontal as they are in printed words. So the technique of removing the Matra of a word to segment its constituent characters may leave many characters joined with each other. Such under-segmentation may complicate classification decisions in the subsequent stage.
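The ground truth format described above (white background, one color per character) can be read back into per-character pixel sets with a few lines of code. The sketch below is only illustrative: it works on an in-memory grid of RGB tuples rather than on a bmp file, it assumes that each distinct non-white color marks exactly one character, and the function name is hypothetical.

```python
WHITE = (255, 255, 255)


def characters_from_ground_truth(pixels, background=WHITE):
    """Group pixel coordinates by color.

    pixels is a list of rows, each row a list of (R, G, B) tuples.
    Every color other than the background is assumed to label one
    character; the result maps that color to its set of (row, col) pixels.
    """
    characters = {}
    for r, row in enumerate(pixels):
        for c, color in enumerate(row):
            if color != background:
                characters.setdefault(color, set()).add((r, c))
    return characters


# Toy ground truth image with two characters, one red and one blue.
RED = (255, 0, 0)
BLUE = (0, 0, 255)
gt_image = [
    [WHITE, RED, RED, WHITE, BLUE],
    [WHITE, RED, WHITE, WHITE, BLUE],
]
chars = characters_from_ground_truth(gt_image)
print(len(chars), sorted(chars[BLUE]))
```

Keeping each character as a set of coordinates makes later comparison against a segmentation result a matter of simple set operations.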

Chapter 2
Review of Existing Work

This chapter reviews previous work on character segmentation from handwritten Bangla word images and its drawbacks.

2.1 Problems of Character Segmentation from handwritten Bangla word images

Character identification is the first and most important step in the process of OCR of document images. If the characters are not identified accurately, for example because two or more characters are connected by a common Matra line, then none of the characters of the word can be recognized correctly. The same problem occurs if a character is accidentally split into two or more parts. The characters might be written so close to one another that accents and similar features become difficult to assign to the correct character. Adjacent characters might even touch one another at some points, and in those cases it becomes very difficult to identify the constituent characters that have joined to form a single component. Character segmentation of handwritten Bangla word images faces many challenges that depend on the writing style of the individual.

In image analysis, manual testing of an algorithm consumes considerable time and manpower, yet it is still widely practised around the world. Even the testing scheme is not standard. Different organizations use different testing schemes, so the reported success rates vary. The present work suggests a method based on comparison of neighboring connected or disconnected components to determine whether they belong to the same character.

2.2 Some recent character segmentation and groundtruthing methodologies

A wide variety of segmentation and ground-truthing methods for handwritten Bangla text have been reported in the literature. The works reviewed here fall under four headings, namely (i) a fuzzy technique for segmentation; (ii) a two stage approach for segmentation; (iii) a database for unconstrained handwritten Bangla word images; and (iv) a complete handwritten numeral database. They cannot be grouped in a unique category since they do not share a common guideline.

2.2.1 A fuzzy technique for character segmentation

A fuzzy technique for segmentation of handwritten Bangla word images has been presented in the work [1]. It works in two steps. In the first step, the black pixels constituting the Matra (i.e., the longest horizontal line joining the tops of individual characters of a Bangla word) in the target word image are identified by using a fuzzy feature. In the second step, some of the black pixels on the Matra are identified as segment points (i.e., the points through which the word is to be segmented) by using three fuzzy features. In experiments with a set of 210 samples of handwritten Bangla words, collected from different sources, the average success rate of the technique is shown to be 95.32%. Apart from certain limitations, the technique can be considered a significant step towards the development of a full-fledged Bangla OCR system, especially for handwritten documents.

2.2.2 A two stage approach for segmentation

Segmentation of handwritten Bangla word images is a challenging problem for researchers. Discontinuity or absence of the Matra, an important feature of Bangla script, may lead to inherent segmentation within the word images. Around 55% of these inherently segmented connected sub-images do not require further segmentation. The authors of the work [2] have designed a novel two-stage approach for segmentation of isolated Bangla word images. In the first stage, a feature-based approach is designed to classify the connected word segments into one of two classes, "Segment further" and "Do not Segment", using a multi-layer perceptron (MLP) based classifier. In the second stage, fuzzy segmentation features are designed to identify the Matra region and the potential segmentation points on the Matra of the connected word segments that belong to the "Segment further" class. Using this technique, the overall segmentation accuracy achieved after the two stages is 95.87% [2].

2.2.3 A database for unconstrained handwritten Bangla word images

The work [7] describes the preparation of a benchmark database for research on off-line Optical Character Recognition (OCR) of document images of handwritten Bangla text and Bangla text mixed with English words. This is the first handwritten database in this area, as mentioned above, available as an open source document. As India is a multi-lingual country with a colonial past, multi-script document pages are very common. The database contains 150 handwritten document pages, of which 100 pages are written purely in Bangla script and the remaining 50 pages are written in Bangla text mixed with English words. This database of offline handwritten scripts is collected from different data sources. After collection, all the document pages have been preprocessed and distributed into two groups, i.e., CMATERdb1.1.1, containing document pages written in Bangla script only, and CMATERdb1.2.1, containing document pages written in Bangla text mixed with English words. Finally, we have also provided useful ground truth images for line segmentation. To generate the ground truth images, we have first labeled each line in a document page automatically by applying one of our previously developed line extraction techniques and then corrected any possible error by using our developed tool GT Gen 1.1. Line extraction accuracies of 90.6% and 92.38% are achieved on the two databases, respectively, using our algorithm. Both databases, along with the ground truth annotations and the ground truth generating tool, are freely available.

2.2.4 A complete handwritten numeral database

The paper [16] describes the ISI database of handwritten Bangla numerals. Bangla is the second most popular language and script of the Indian subcontinent and is used by more than 200 million people all over the globe. The database has several components, which include both on-line and off-line handwritten numerals. Samples of numeral strings and isolated numerals have been collected under both modes of writing. This database has been developed at the Computer Vision and Pattern Recognition Unit laboratory of the Indian Statistical Institute, Kolkata. Samples of the database are properly ground-truthed and subdivided into training and test sets. The off-line sample images are stored in TIFF image format, and the on-line samples are stored in ASCII file format along with various information as a header. Through free access for researchers, this database will facilitate fruitful research on handwriting recognition of Bangla.

Other methodologies are included in the works described in [12-14]. In [5], the character segmentation problem is seen from an artificial intelligence perspective.

2.3 Motivation

Considering all the problems discussed above, we need an automated evaluation tool for OCR systems that compares the segmented results of a technique or algorithm with ground-truthed images. The evaluation technique consists of the following steps. First, a database of word images is prepared. Then a segmentation technique is applied to that database. The characters of the word images are not always segmented properly, so the errors need to be detected and corrected manually. After manual correction, we store the word images in the database as ground-truthed images. Our aim is to compare the segmented word images with the corresponding ground-truthed images automatically and to report the success rate. We intend to create a tool for this purpose in future. It will save time and manpower and will minimize analytical errors.

Ground-truth preparation plays an important role in image analysis, as mentioned above. It is also found that there is no standard database or automatic evaluation tool for handwritten Bangla word images for handwritten Bangla OCR systems. So, in the present work, ground-truthing for handwritten Bangla word images at two levels is introduced. Ground-truthed images are generated for the said database for evaluation of character segmentation algorithms. Character segmentation accuracy on these handwritten word images is also reported in the current work. Ground-truthed images are prepared at component level and character level so that the database would also be very useful for the performance evaluation of character recognition systems.
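The automatic comparison of segmented word images with ground-truthed images, described above as a future tool, could take a form like the following sketch. It is not the tool itself, which the text leaves for future work. It assumes that each character is represented as a set of (row, col) pixels and that a ground-truth character counts as correctly segmented when a distinct predicted segment overlaps it with an intersection-over-union of at least `threshold`; both the criterion and the names are assumptions made for illustration.

```python
def overlap_ratio(a, b):
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def segmentation_success_rate(predicted, ground_truth, threshold=0.9):
    """Percentage of ground-truth characters matched by a predicted segment.

    Each predicted segment may be matched to at most one ground-truth
    character (greedy matching in ground-truth order).
    """
    unused = list(predicted)
    matched = 0
    for truth in ground_truth:
        best = max(unused, key=lambda seg: overlap_ratio(seg, truth),
                   default=None)
        if best is not None and overlap_ratio(best, truth) >= threshold:
            matched += 1
            unused.remove(best)
    return 100.0 * matched / len(ground_truth) if ground_truth else 0.0


# Three ground-truth characters; the segmenter left the last two joined
# (under-segmentation), so only the first one is counted as correct.
truth_chars = [{(0, 0), (0, 1)}, {(0, 3), (0, 4)}, {(0, 6), (0, 7)}]
predicted_chars = [{(0, 0), (0, 1)}, {(0, 3), (0, 4), (0, 6), (0, 7)}]
rate = segmentation_success_rate(predicted_chars, truth_chars)
print(round(rate, 2))
```

A stricter or looser threshold changes what counts as a success, which is one reason a shared ground truth and a fixed evaluation rule make reported success rates comparable.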

Chapter 3
Present Work

The present work on character segmentation and ground-truth preparation for handwritten Bangla word images is described below.

A typical OCR system consists of scanning, preprocessing, word and character separation, recognition and post-processing stages. Each stage has an impact on overall performance. As India is a multi-lingual country with a colonial past, multi-script document pages are very common. The database contains 5000 handwritten word images written purely in Bangla script. The documents of offline handwritten script are collected from different data sources. After collection, all the document pages have been preprocessed. Each document page contains words on an average. Finally, we have also provided useful ground-truthed images for character segmentation. To generate the ground-truthed images, we have first labeled each component in a word image with a unique color by applying our previously developed technique [ ] and then corrected any possible error manually, using a software tool developed for this project. The database would be very useful for handwritten OCR research on Bangla, especially for the performance evaluation of character segmentation methodologies, as there is hardly any standard database of handwritten Bangla word images. Currently the database is available at www.cmaterju.org.

Our aim is to provide ground-truthed images for component level and character level segmentation for Bangla handwritten character recognition systems. The name of the prepared database is CMATERdb1.1.1, where CMATER stands for Center of Microprocessor Application for Training Education and Research, a research laboratory at the Computer Science and Engineering department of Jadavpur University, India, where the current research activity took place; "db" stands for database and is followed by a numeric value.

The overall work flow is shown in Fig. 5, and its implementation details are discussed in the following sections. My present work is highlighted in this flow diagram.

Fig. 5 The basic flow diagram of the overall project

26 3.1 Data Collection Methodologies The materials of the handwritten document pages for the proposed database have been collected from different types of sources, viz., class-notes of students of different age group, handwritten, handwritten manuscript of a popular Bangla monthly magazine Computer Jagat and from document pages written by different person, on request, under our supervision. The document pages written under our supervision were collected from various persons with subject varying from news paper articles and Bangla text books containing both Bangla and English vocabulary. The writers were asked to use a black or blue ink pen and write inside the A-4 size pages. They imposed no other restrictions regarding the kind of pen they used or the style of writing chosen. Special attention was paid to ensure data collection from writers of different age groups and educational levels. Moreover, the pages were collected from different places (home, office, school etc.) in order to include different style of writing. In total 25 men and 15 women were participated in the data collection drive. The main characteristics of our database are as follows. 95% of the writers were native Bengali. Places of data collection: in school/colleges,40% in writers homes, and 20% in public places. Educational level of the writers: 20% 10 th standard school, 40% general high school and 40% college and university. Writers age: 40% between years, 30% between 25-35, and 30% between years. 3.2 Segmentation The current work designs a novel technique for identification of potential segmentation points on the Matra to isolate constituent characters from the word image 20

of Bangla script. In the first stage, the component-labeling algorithm of [14] is applied to identify connected sub-parts of the word images. The second stage classifies the connected sub-parts into either the SF or the DNS class, using a rule-based prior selection followed by a well-known MLP-based classifier with a set of features extracted from these components. Finally, fuzzy segmentation features are used to identify potential segmentation points on the detected fuzzy Matra region, for subsequent extraction of isolated characters or character sub-parts from the overall word image. The basic steps involved in this work for segmentation of handwritten word images of Bangla script are depicted in Fig. 6.

Constituent characters of words, or their sub-parts, often extend above the common Matra or appear below the main character body. In the current work, we have identified three adjacent, horizontally partitioned zones (viz., upper, middle and lower) in each word image, as shown in Fig. 1. More specifically, five rows are identified from the word image: the top row of the upper zone (R1), the top row of the middle zone (R2), the middle row of the middle zone (R3), the bottom row of the middle zone (R4) and the bottom row of the lower zone (R5). A horizontal pixel scan of the word image from top to bottom identifies the first row with any black pixel as the top row of the upper zone, i.e. R1. Similarly, a horizontal scan from bottom to top identifies the first row with any black pixel as the bottom row of the lower zone, i.e. R5.

Identification of the top and bottom row boundaries of the middle zone (a key decision for subsequent feature extraction) is a challenging task in handwritten word segmentation. In the earlier work [2], the authors scanned the whole image to calculate, for each row, the sum of the longest horizontal run lengths, and then estimated R2 from those values. But this may sometimes give misleading information. Depending on the handwriting style of the individual, the row with the maximum sum of longest run lengths may appear anywhere in the word image, in which case R2 is not estimated correctly, as shown in Fig. 7. We have therefore modified the technique for determination of R2.

Fig. 6 Block diagram of basic steps of present work.

The Matra of a handwritten word image generally does not appear in the lower half of the image. So in the present work, to identify the common headline of the word, the horizontalness of each row is computed only from the top to the middle of the word image, i.e. from R1 to R1 + (R5 - R1)/2. Each black pixel of the word image in this region is replaced by the length of the longest horizontal run of black pixels that passes through it. The sum of these run values over all the pixels in a row is then computed for each row of the region. The row with the highest sum is the row with maximum horizontalness. This row marks the possible upper boundary of the middle zone, and we call it the 1st approximation of the upper boundary of the middle zone (R21). Then, from the vertical feature, we have estimated the 2nd approximation of R2 and called it R22.

Fig. 7 Wrong estimation of R2 using technique described in [2]
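The first approximation R21 described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the word image is taken to be a list of rows with 1 for a black (text) pixel and 0 for background, and the function names `longest_run_map` and `estimate_r21` are hypothetical, not taken from the original implementation.

```python
def longest_run_map(img):
    """Replace each black pixel (1) by the length of the horizontal run through it."""
    out = []
    for row in img:
        res = [0] * len(row)
        start = None
        # A trailing 0 acts as a sentinel that closes a run ending at the border.
        for x, v in enumerate(row + [0]):
            if v and start is None:
                start = x
            elif not v and start is not None:
                for k in range(start, x):
                    res[k] = x - start
                start = None
        out.append(res)
    return out


def estimate_r21(img):
    """Row of maximum horizontalness between R1 and R1 + (R5 - R1)/2."""
    black_rows = [y for y, row in enumerate(img) if any(row)]
    r1, r5 = black_rows[0], black_rows[-1]      # first and last rows with ink
    runs = longest_run_map(img)
    limit = r1 + (r5 - r1) // 2                 # integer division is an assumption
    sums = {y: sum(runs[y]) for y in range(r1, limit + 1)}
    return max(sums, key=sums.get)


word = [
    [0, 0, 1, 0, 0, 0],   # part of a stroke above the headline
    [1, 1, 1, 1, 1, 1],   # headline (Matra)
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
print(estimate_r21(word))
```

Restricting the search to the upper half is what keeps the long run in row 4 of the toy image from being mistaken for the Matra.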

But even after estimating R21 and R22, we have observed that in a few cases the Matra regions are not estimated accurately (as shown in Fig. 6). To address this issue, we have determined another R2 estimate, the row containing the longest single run of black pixels, and called it R23. Finally, we have taken the average of the three R2 approximations and called it R2final, such that R2final = (R21 + R22 + R23)/3. This new estimate of R2 (involving three approximations) is observed to be more accurate than that of our prior works, which involved two such approximations. We have taken R2final as the final upper boundary (R2) of the middle zone for a handwritten word image.

Similarly, the bottom row of the middle zone (i.e. R4) of a handwritten word image generally does not appear in the upper zone. So in our current work, to identify R4, horizontal transition points between text and background pixels are computed from the middle to the bottom of the word image, i.e. from R1 + (R5 - R1)/2 to R5. In each row of this region, the number of transitions from text pixel to background pixel and vice versa is counted. The average number of transition points per row in the lower half of the image is denoted eta (η). Scanning upward from the bottom row of the lower zone to the middle of the word image, the first row with more transition points than η is identified as the bottom row of the middle zone (say, R41). Again, as in the case of R2, we have estimated R4 from the vertical feature and called it R42. Then we have taken the average of R41 and R42 as the final R4, i.e. R4final = (R41 + R42)/2. We have taken R4final as the bottom row of the middle zone, i.e. R4. Finally, the middle row of the middle zone is taken as R3, i.e. R3 = (R2 + R4)/2.

In the present work we have used a simple, yet popular, technique for identifying the connected components within the word image. Identification of connected word components requires identification of the connected pixels therein and marking them with identical labels. For this, the CCL algorithm [14] scans the word image pixel by pixel, from left to right and from top to bottom. During scanning, it considers all 8 neighbors of each pixel. For each connected component, all its member pixels appearing in the sub-image are replaced by a single distinct symbol. This is done to complete the labeling of the connected pixels in the image and to generate uniquely coded

connected components as described in Fig. 8. Each of these connected components is subsequently extracted for analysis.

Selection of SF and DNS Components

Among all the digitized word sub-parts generated after CCL, a decision is required to identify only those components that need further segmentation, because word images contain many inherently segmented characters or character sub-parts (as shown in Fig. 9). Thus, not all word components require further segmentation. The components are accordingly classified into SF and DNS classes, as shown in Fig. 7. Segmentation of DNS components is an overhead, as it causes over-segmentation of word components. So, selection of SF and DNS components not only minimizes the character isolation overhead but also minimizes the probability of over-segmentation. For this, we have developed a two-stage selection of SF and DNS class components. These stages are described in the following subsections.

Fig. 8 A sample word image and three of its connected components
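The 8-neighbor labeling step used to obtain the connected components can be illustrated with a short sketch. It uses a breadth-first flood fill rather than the specific scan-based algorithm of [14], so it should be read as an equivalent illustration and not as the original implementation. The image convention (1 for black, 0 for background) and the name `label_components` are assumptions.

```python
from collections import deque


def label_components(img):
    """Label 8-connected black regions; returns (label map, number of components)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):                 # top to bottom
        for x in range(w):             # left to right
            if img[y][x] and not labels[y][x]:
                count += 1             # new component found
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    # Visit all 8 neighbors of the current pixel.
                    for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                        for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                            if img[ny][nx] and not labels[ny][nx]:
                                labels[ny][nx] = count
                                queue.append((ny, nx))
    return labels, count


sample = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
labels, n = label_components(sample)
print(n)
```

Because diagonal neighbors count as connected, the two pixels at the top left of the toy image receive the same label.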

Initial Selection of Obvious SF and DNS Class Components

In work [3], MLP-based schemes were used for this classification problem. However, considering all word sub-parts in the classification algorithm not only increases the computational overhead but also introduces ambiguities in the selection, leading to erroneous classification. To solve this problem, a pre-selection step is developed in the present work that identifies obvious SF and DNS class components. In the designed methodology, two scale-invariant thresholds are used for this pre-selection, prior to the MLP-based classification scheme.

In the current approach, each word component is hypothetically divided into pieces by a horizontal separating line along the middle line of the region R2 to R4. The row through which this separating line passes is selected experimentally from the sample word images of the database. After this hypothetical separation, the number of connected sub-components, or pieces, generated by the division is counted. This number acts as the decision maker: based on the number of generated sub-components, the original component is assigned to one of the two classes, viz., DNS or SF. If the number of sub-components in a component is less than a threshold value T1, the component is considered a member of the DNS class. If, on the other hand, it is greater than another threshold value T2, the component is considered to belong to the SF class. Some sample components classified successfully using this thresholding technique are shown in Fig. 9. The components whose number of sub-components (n) lies between T1 and T2, i.e. T1 < n < T2, are sent to a previously trained MLP classifier, since no decision on these components is possible using either T1 or T2. Experimentally, we have observed the values of T1 and T2 as 2 and 6 respectively.
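The pre-selection rule can be sketched as below. Several points are assumptions made only for illustration: the component is a binary grid (1 for black), the hypothetical separating line is modeled by erasing one row before counting 8-connected pieces, and the boundary cases are resolved by sending n = T1 to DNS and n = T2 to the MLP, since the text assigns neither value explicitly. The names `count_pieces` and `preselect` are hypothetical.

```python
from collections import deque

T1, T2 = 2, 6  # threshold values reported in the text


def count_pieces(component, cut_row):
    """Count 8-connected pieces that remain after erasing the pixels on cut_row."""
    h, w = len(component), len(component[0])
    grid = [row[:] for row in component]
    grid[cut_row] = [0] * w                      # the hypothetical separating line
    seen = [[False] * w for _ in range(h)]
    pieces = 0
    for y in range(h):
        for x in range(w):
            if grid[y][x] and not seen[y][x]:
                pieces += 1
                seen[y][x] = True
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                        for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                            if grid[ny][nx] and not seen[ny][nx]:
                                seen[ny][nx] = True
                                queue.append((ny, nx))
    return pieces


def preselect(component, cut_row, t1=T1, t2=T2):
    """Return 'DNS', 'SF', or 'MLP' (ambiguous, to be decided by the classifier)."""
    n = count_pieces(component, cut_row)
    if n <= t1:          # assumption: n == T1 is treated as DNS
        return "DNS"
    if n > t2:
        return "SF"
    return "MLP"


stroke = [[1], [1], [1]]                              # a single vertical stroke
comb8 = [[1] * 15, [1, 0] * 7 + [1], [1, 0] * 7 + [1]]  # headline with 8 strokes
comb4 = [[1] * 7, [1, 0] * 3 + [1], [1, 0] * 3 + [1]]   # headline with 4 strokes
print(preselect(stroke, 1), preselect(comb8, 1), preselect(comb4, 1))
```

In the toy examples the cut leaves the headline as one piece and each stroke below it as a separate piece, which is why touching multi-character components produce large counts.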

Fig. 9 Sample images of Bangla script which are pre-classified as obvious DNS components, pre-classified as obvious SF components, and sent for MLP-based classification (for subsequent SF/DNS class identification)

From the images of Fig. 9, it is evident that the choice of T1 suits single-character components that are partitioned into two pieces along the hypothetical separating line and subsequently classified as DNS segments. Likewise, T2 is chosen in such a way that multiple touching characters, or their sub-parts, generate more than T2 pieces after being hypothetically partitioned along the separating line. These components are classified as SF components. In all remaining cases ambiguities may exist, and these need more sophisticated techniques, such as MLP-based classifiers and associated feature vectors.

Classification of SF/DNS Components using MLP

In the present work, an MLP-based classifier is used to classify the connected word components that are not classified in the pre-selection stage into one of the two classes, i.e. to decide whether a given component needs to be further segmented or not, using the feature set listed in Table 1. The MLP-based classifier designed for this work is trained with the Back Propagation (BP) algorithm, which minimizes the sum of the squared errors for the training samples by conducting a gradient descent search in the weight space. The number of neurons in the hidden layer is also

adjusted during its training. In the current methodology we have designed a new feature set containing 11 statistical features, as described in Table 1. The following discussion justifies the choice of the respective feature descriptors.

A higher value of feature F1 signifies that the component may belong to the DNS class, as such a component may have some part(s) in the upper zone of the word, as shown in Fig. 10(a). A similar explanation applies to the features F2 and F4, for components in the middle zone and the middle-lower zone respectively, as illustrated in Fig. 10(b). The feature F3 is used to classify the noise segment (i.e. broken part(s) of the Matra), which almost certainly appears partially in the upper and/or middle zone, as shown in Fig. 10(c).

Table 1: Feature vector and their description

Feature ID | Feature Description
F1 | Percentage height (w.r.t. the overall component height) of the component that appears in the upper zone of the word image
F2 | Percentage height (w.r.t. the overall component height) of the component that appears in the middle zone of the word image
F3 | Percentage height (w.r.t. the overall component height) of the component that appears in the lower part of the middle zone of the word image
F4 | Percentage height (w.r.t. the overall component height) of the component that appears in the lower zone of the word image
F5 | Maximum horizontalness of the component within the region R2 to R4
F6 | Area of the component within the region R2 to R4
F7 | Number of data pixels of the component within the region R2 to R4

F8 | Number of data pixels of the component on R2
F9 | Maximum width of the component within the region R2 to R4
F10 | Width of the component along R2
F11 | Number of segmentation-point clusters on the Matra region of the component

Feature F5, i.e. the maximum horizontalness feature, was used in the work [3]. However, due to the writing styles of individuals, this feature value may be higher in the upper, lower or lower half of the middle zone if the ascendant (character sub-parts in the upper zone of the word image) or descendant (character sub-parts in the lower zone of the word image) is extended unnecessarily, as shown in Fig. 11(a). Because of this, in the present work we have additionally used feature F9, i.e. the maximum width of the component within the region R2 to R4, as shown in Fig. 11(b). A smaller value of this feature implies that the component may be categorized as a DNS class component.

For feature F6, as used in work [3], the whole component was considered for area calculation. But this feature value may be high for a component of the DNS class due to an extended ascendant and/or descendant, as shown in Fig. 11(c). For this reason, we have modified F6 to consider only the area of interest, i.e. the area within the region R2 to R4. For the same reason, feature F7, i.e. the number of data pixels, is also calculated only within the region R2 to R4. The higher the value of feature F7, the greater the possibility that the component belongs to the SF class. Similarly, a high value of feature F8 implies a more prominent and continuous Matra, i.e. the component is likely to be a member of the SF class. Again, due to cursive handwriting or discontinuity of the Matra, the value of feature F9 may be low for components that need to be segmented further. That is why we have also taken feature F10, which gives the width of the component in the middle zone, i.e. the vertical projection of the component within R2 to R4 along R2 (central Matra row).
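A few of the features in Table 1 can be computed directly from a component mask, as the following sketch suggests. It covers only F1, F7, F8 and F9, and rests on assumptions: the component is given as a binary grid in word-image coordinates (1 for a data pixel), the upper zone consists of the rows above R2, and "maximum width" is read as the largest left-to-right extent of the component in any single row between R2 and R4. The function name `zone_features` is hypothetical.

```python
def zone_features(comp, r2, r4):
    """Compute F1, F7, F8 and F9 for one component (sketch, see assumptions)."""
    rows = [y for y, row in enumerate(comp) if any(row)]
    top, bottom = rows[0], rows[-1]
    height = bottom - top + 1

    # F1: percentage of the component height lying above R2 (upper zone).
    upper_rows = max(0, min(bottom + 1, r2) - top)
    f1 = 100.0 * upper_rows / height

    band = comp[r2:r4 + 1]                      # rows R2..R4 inclusive
    f7 = sum(sum(row) for row in band)          # data pixels within R2..R4
    f8 = sum(comp[r2])                          # data pixels on R2

    # F9: widest row extent inside the band.
    widths = []
    for row in band:
        xs = [x for x, v in enumerate(row) if v]
        widths.append(xs[-1] - xs[0] + 1 if xs else 0)
    f9 = max(widths)

    return {"F1": f1, "F7": f7, "F8": f8, "F9": f9}


component = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],   # R2
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],   # R4
    [0, 0, 0, 1, 0],
]
features = zone_features(component, r2=2, r4=4)
print(features)
```

Limiting F7 and F9 to the band R2 to R4 mirrors the motivation given above: strokes that extend far above or below the middle zone do not inflate the values.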

(a) Word component in upper zone inside color box
(b) Word components in middle and lower zone inside color boxes
(c) Noise component inside color box
Fig. 10(a-c) Illustration of features F1, F2, F3 and F4

Feature F11 is the number of segmentation-point clusters. Often a component gets segmented so as to generate multiple, close segmentation points on the Matra region (selection of Matra and segmentation points is discussed in Sections 3.2.2 and 3.2.3). Using 8-way connectivity, we have identified clusters of such segmentation points, and the number of such clusters is taken as the feature value. The larger the number of clusters, the higher the chance that the component is classified as SF class. In the previous work [3], the number of segmentation points was considered as the feature. But a larger number of segmentation points may not always imply that the component needs to be segmented further. This is illustrated in Fig. 12(a-b). To compute feature F11, the potential segmentation points in the region R1-R3 of the connected components have to be determined first. The technique for finding potential segmentation points is discussed in Sections 3.2.2 and 3.2.3.

Fig. 11(a-c) Illustration of features F5, F6 and F7

3.2.2 Determination of Matra Pixels using a Fuzzy Membership Function and Horizontalness Feature for SF components

The boundary between the sets of Matra pixels and non-Matra pixels in the region R1-R3 is not distinct in practice. The black pixels lying on the line R2 have the strongest membership in the set of Matra pixels. As pixels lie farther away on either side of the line R2, their degree of belongingness to the set diminishes, as shown in Fig. 13, through a membership function μMATRA(x). In the expression of μMATRA(x), c = R2 denotes the center of the function, shown in Fig. 14, and a and b are parameters of the equation. The value of a is chosen as |R1 - R2|/2 for the upper side of R2 and as |R2 - R3|/2 for the lower side of R2. The value of b is chosen as 1.

To identify Matra pixels in the region R1-R3, all run lengths of black pixels along each row in the region are computed. Taking the average of all these run lengths, the mean run length of black pixels of the region is computed. This can be


considered as the mean horizontalness of all black horizontal line segments in the region. Fig. 14(a-b) shows three word images and the continuous horizontal line segments of black pixels whose lengths exceed the mean horizontalness of the respective words. Candidates for Matra pixels are selected from such line segments. To finally determine whether a black pixel in the region R1-R3 is a Matra pixel, the product of the horizontalness [2] of the black line segment on which the pixel lies and its μMATRA value is computed. If the product exceeds the average value of all such products for all black pixels in the region R1-R3, the pixel is finally considered a Matra pixel. All such Matra pixels constitute the Matra region.

Fig. 12(a-b) More segmentation clusters in (a) but only one segmentation cluster in (b)
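The selection of Matra pixels described above may be sketched as follows. The membership function is assumed here to have the generalized bell form 1/(1 + |(x - c)/a|^(2b)); this is a common membership function with exactly the parameters a, b and c named in the text, but the choice is an assumption and not a confirmed reproduction of the original expression. The image convention (1 for black), the fallback width of 1.0 when two boundary rows coincide, and the names `bell` and `matra_pixels` are likewise assumptions.

```python
def bell(x, a, b, c):
    """Generalized bell membership function (assumed form)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def matra_pixels(img, r1, r2, r3):
    """Return the set of (row, col) positions selected as Matra pixels."""
    a_up = abs(r1 - r2) / 2 or 1.0    # width above R2; 1.0 guards against zero
    a_low = abs(r2 - r3) / 2 or 1.0   # width below R2

    # Collect every horizontal run of black pixels in rows R1..R3.
    runs = []                          # (row, start column, run length)
    for y in range(r1, r3 + 1):
        x, row = 0, img[y]
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                runs.append((y, start, x - start))
            else:
                x += 1

    mean_run = sum(length for _, _, length in runs) / len(runs)

    # Product of horizontalness and membership value for every black pixel.
    product = {}
    for y, start, length in runs:
        mu = bell(y, a_up if y < r2 else a_low, 1, r2)
        for x in range(start, start + length):
            product[(y, x)] = length * mu
    mean_product = sum(product.values()) / len(product)

    # Candidates lie on runs longer than the mean run length.
    candidates = {(y, x) for y, s, n in runs if n > mean_run
                  for x in range(s, s + n)}
    return {p for p in candidates if product[p] > mean_product}


word = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # R2
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
print(sorted(matra_pixels(word, r1=0, r2=2, r3=4)))
```

In the toy image only the pixels of the long run on R2 pass both tests, since short vertical strokes have low horizontalness and lower membership away from R2.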

Fig. 13 The membership function for the set of Matra pixels

(a) Three sample word images
(b) Consecutive black pixels, in the sample images, whose horizontalness exceeds the mean horizontalness form continuous lines highlighted with darker shading
Fig. 14 Illustration of horizontalness (h) feature

3.2.3 Determination of Potential Segmentation Points using Two Fuzzy Membership Functions for SF components

Potential segmentation points are Matra pixels across which the segment is to be fragmented if it falls in the SF category. They remain candidates for segmentation points until classification of the segment into the SF class is completed. Potential segmentation points usually lie at column positions along which the black pixel count is low. The lower the black pixel count along the column position of a Matra pixel, the higher the degree of belongingness of the pixel to the class of potential segmentation points. To model this, a membership function μ1, shown in Fig. 15(a), is introduced here. The function is defined for x ≥ 0. The values of the parameters a, b, c are chosen as follows: c = 0, b = 1, a = WM, where WM is the maximum vertical width of the Matra region defined in Section 3.2.2.

To ascertain a Matra pixel as a potential segmentation point, its distance from the line R2 is also considered. The smaller the distance, the higher its degree of belongingness to the set of potential segmentation points. Ideally, the pixel would lie on R2. On this basis, another membership function, μ2, is introduced here. The function is shown in Fig. 15(b). The values of the parameters of μ2 are the same as those of μMATRA.

To decide whether a Matra pixel in the region R1-R3 is a potential segmentation point, the average of its μ1 and μ2 values is computed. If the average exceeds the mean of the averages for all the Matra pixels in the region R1-R3, the pixel is considered a potential segmentation point. Feature F11 here represents the total number of all such pixels which are identified as potential segmentation points in the region R1-R3.
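The decision rule for potential segmentation points can be illustrated with the sketch below. As before, both membership functions are assumed to be generalized bell functions, which is an assumption about their form. It is further assumed that the black pixel count of a column is taken over the whole height of the image, that the image uses 1 for black, and that the Matra pixels are supplied as a set of (row, col) positions. The names `bell` and `potential_points` are hypothetical.

```python
def bell(x, a, b, c):
    """Generalized bell membership function (assumed form)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def potential_points(img, matra, r1, r2, r3, wm):
    """Select Matra pixels whose mean of mu1 and mu2 exceeds the overall mean."""
    width = len(img[0])
    # Black pixel count of every column (assumption: whole image height).
    col_count = [sum(row[x] for row in img) for x in range(width)]

    a_up = abs(r1 - r2) / 2 or 1.0    # same widths as for the Matra membership
    a_low = abs(r2 - r3) / 2 or 1.0

    score = {}
    for (y, x) in matra:
        mu1 = bell(col_count[x], wm, 1, 0)                    # c = 0, b = 1, a = WM
        mu2 = bell(y, a_up if y < r2 else a_low, 1, r2)       # distance from R2
        score[(y, x)] = (mu1 + mu2) / 2

    mean_score = sum(score.values()) / len(score)
    return {p for p, s in score.items() if s > mean_score}


word = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # R2, the Matra row
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
matra = {(2, x) for x in range(6)}
points = potential_points(word, matra, r1=0, r2=2, r3=4, wm=1)
print(sorted(points))
```

The selected columns in the toy image are those where nothing but the Matra itself is present, which are the natural places to cut between characters.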

Fig. 15 (a-b) The membership functions for determination of potential segmentation points

Identification of Actual Segmentation Points in the SF Components

In determining the actual segmentation points for SF components, there is always a trade-off between under-segmentation and over-segmentation of word images. In the current work, we have attempted to balance the two with minimum loss of information. Segmentation also becomes more difficult in the presence of ascendants in the upper zone of the word component. For this reason, we have further designed algorithm steps to identify a single column for segmentation on the Matra region. The methodology is described as follows.

Often an SF component gets segmented using the fuzzy features so as to generate multiple, neighboring segmentation points along the Matra region. We have identified clusters of segmentation points using the 8-neighbor CCL algorithm, as described for feature F11 in the section on MLP-based classification. It may be observed from Fig. 12(a-b) that actual segmentation should not involve all the potential segmentation points in the cluster, but

focus only on the pixels that optimally separate the connected parts into different characters (or their sub-parts) of the word image. Selecting points that accurately segment the word components (sub-parts) into their constituent characters or sub-parts is a challenging issue. With a poor selection of such points, over-segmentation may occur during the segmentation process. As a result, characters or their sub-parts may be internally broken, leading to loss of information.

In the light of the above facts, in the current work we select the actual (more accurate) segmentation points from the segmentation points generated in each segmentation cluster. Two primary decisions have to be taken for this purpose: first, selection of the row boundaries for segmentation along specific columns on the Matra region, and second, identification of the segmentation column in each segmentation cluster. The algorithmic steps involved in this process are given below:

1. Check whether there is any ascendant in the word component under consideration. The check is done by estimating the height of the upper zone of the component. If the height of this zone exceeds an adaptively tuned threshold value, 0.2*(R4 - R2), the component is said to have an ascendant part in the upper zone.
2. In either case, the generated potential segmentation points are labeled using the 8-way CCL algorithm, so that each cluster of segmentation points is labeled uniquely. For each cluster, the following technique is applied to determine the segmentation column along which the word component under consideration can be segmented:
A) If there is no ascendant in the word component under consideration, calculate the sum of the numbers of data pixels, Matra pixels and segmentation-point pixels for each column in the region from R1 to (R3 - R2)/2. Otherwise, calculate the same for each column in the region from (R2 - R1)/2 to (R2 + (R3 - R2)/2).

B) Within the estimated region (row boundaries), take as the segmentation column the column that has the minimum sum, as calculated in step A.

Once the word components are segmented into constituent characters or their sub-parts, the 8-way CCL algorithm is applied again to separate each such word component. Finally, the segmented components are considered for recognition as meaningful character codes.

3.3 Preparation of Ground-truthed images

After scanning, the documents were binarized by a global thresholding technique, where the threshold was chosen as the mean of the maximum and minimum gray level values in each document image. All the binarized images were archived in DAT format, where the foreground and background pixels were represented as 0 and 1 respectively. The documents were then preprocessed to remove remaining noise, such as salt-and-pepper noise and long lines in the border zone(s). Then the segmentation technique described in Section 3.2 is applied on the word images, which yields the colored images. After obtaining the segmented components, error detection and correction is required, as not all the characters are segmented properly by the segmentation technique. The errors are detected and corrected manually, and the character level and component level ground-truthed database is prepared. The basic steps of this work are shown in Fig. 16.

Using the segmentation technique of Section 3.2, we get isolated characters or their sub-parts. In this stage, the components are reconstructed as a word image according to their positions in the original word image. These word images are

labeled with a distinct color for each component, and consequently we get the colored image with the segmentation effect, as shown in Table 2. But these word images are not always segmented properly. So the errors need to be detected and corrected manually, and ground truth images prepared accordingly.

Fig. 16 Basic steps of generating Ground-truthed images
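Two of the simpler steps of this section, the global thresholding and the coloring of segmented components, can be sketched as follows. The sketch follows the stated convention of 0 for foreground and 1 for background after binarization. The treatment of a pixel exactly equal to the threshold, the small color palette, and the names `binarize` and `colorize` are assumptions made for illustration only.

```python
def binarize(gray):
    """Global threshold at the mean of the maximum and minimum gray levels."""
    flat = [v for row in gray for v in row]
    threshold = (max(flat) + min(flat)) / 2
    # Dark ink becomes 0 (foreground), everything else 1 (background).
    # Assumption: a pixel equal to the threshold is treated as background.
    return [[0 if v < threshold else 1 for v in row] for row in gray]


# Hypothetical palette; colors repeat if there are more components than entries.
PALETTE = [(255, 0, 0), (0, 160, 0), (0, 0, 255), (255, 128, 0), (160, 0, 160)]
WHITE = (255, 255, 255)


def colorize(labels):
    """Map a component label map (0 = background) to an RGB image."""
    return [
        [WHITE if lab == 0 else PALETTE[(lab - 1) % len(PALETTE)] for lab in row]
        for row in labels
    ]


gray = [[10, 200], [240, 20]]
binary = binarize(gray)
colored = colorize([[0, 1], [2, 2]])
print(binary)
print(colored)
```

A white background is used for the colored image because the correction tool described in this section reads word images with a white background.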


Preparation of ground-truthed images works in two levels. In the first level, each character in the word image is identified and separated from the others. Each component of a character in the word image, whether connected or disconnected, is then given a different color. So the first level's work is called component level segmentation. In the second level, each character may have two or more sub-parts, either connected or disconnected, but they carry the same color, as they are components of a single character. So the second level's work is called character level segmentation.

For the purpose of error detection and correction, the tool Paint is used. Paint reads word images with a white background. We can select any color from the color box and use it to recolor the characters that are not segmented properly in the word image, by selecting the intended segment point with the pencil. Using this technique, we can easily correct the errors of our segmentation algorithm to generate ground truth data. A screenshot of the tool Paint with a word image is shown in Fig. 17. The steps of this method are given below:

Steps:
1. Open a word image with the tool Paint.
2. Pick any color from the color box and then select the intended segment point with the pencil.
3. Fill the character that is to be segmented with the color.
4. Repeat until all the characters are segmented.
5. Close the window and save the image.

Fig. 17 A screenshot of the tool Paint with a word image

Table 2: Some results after segmentation

In ground-truthed database generation we work in two levels. In the first level, each character in the word image is identified and separated from the others. Each component of a character in the word image, whether connected or disconnected, is then given a different color. So the first level's work is called component level segmentation. In the second level, each character may have two or more sub-parts, either connected or disconnected, but they carry the same color, as they are components of a single character. So the second level's work is called character level segmentation.

Among all the digitized word sub-parts generated after connected component labeling, a decision is required to identify only those components that need further segmentation, because word images contain many inherently segmented characters or character sub-parts (as shown in Fig. 19(a)). Thus, all word components may not

require further segmentation at all. These components are classified into SF and DNS classes, as shown in Fig. 19. Segmenting DNS word components is an overhead for the isolation of character components and may also cause over-segmentation of word components. So, selection of SF and DNS components not only minimizes the character isolation overhead but also minimizes the probability of over-segmentation. For this we have developed a two-stage selection of SF and DNS class components.

In Table 2, figure 01 is segmented correctly by the work [2], so no change is required. But in figure 02, two consecutive characters remain connected to each other after segmentation. This type of segmentation is called under-segmentation. We need to separate these characters using two distinct colors. To separate them, the intended segment point is identified. The character that has two parts, one in the middle zone and another in the upper zone, also requires segmentation. This is shown step by step in Fig. 18(a). In the second level, two sub-parts of a character, whether connected or disconnected, are given the same color, as they are parts of the same character.

In Table 2, fig. 03 has a character containing two or more colors after segmentation. Word images segmented in this way are called over-segmented. The character level and component level segmentation of the over-segmented word image is shown in Fig. 18(b).

Collecting the component level and character level ground-truthed word images, we create a database, which is shown in Table 3. The results after segmentation are compared automatically with the corresponding ground-truthed images by the proposed tool to obtain the success rate.

Component level segmentation    Character level segmentation

Fig. 18 (a) Illustration of Component level and Character level segmentation of the under segmented word image


Fig. 18 (b) Illustration of Component level and Character level segmentation of the over segmented word image

Table 3: Character and component level ground-truthed database

Sl. # | Original gray level Bangla word images | Corresponding character level ground-truthed images | Corresponding component level ground-truthed images


A two-stage approach for segmentation of handwritten Bangla word images Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri #, Dipak Kumar Basu Computer Science & Engineering Department,