Off Line Sinhala Handwriting Recognition with an Application for Postal City Name Recognition


M.L.M Karunanayaka, N.D Kodikara, G.D.S.P Wimalaratne
University of Colombo School of Computing, No. 35, Reid Avenue, Colombo 7, Sri Lanka.
Tel: +94-11-2581245/8, Fax: +94-11-2587239
{mlm,ndk,spw}@ucsc.cmb.ac.lk

Abstract

Sinhala is the national language of Sri Lanka, and 70% of the people living in Sri Lanka use it for their day-to-day activities. Very little research has been done on Sinhala handwriting recognition. The proposed system focuses on the recognition of Sinhala handwriting, using postal city names as a case study. Training and testing are done using handwriting taken from postal envelopes, so this research is not limited to a single writing style. The number of main post offices in Sri Lanka is limited to five hundred and one. One of the major impediments in this system is touching characters, so the segmentation of handwritten touching characters becomes a crucial step. Conventional segmentation methods are incapable of handling the complexity that exists in Sinhala handwritten characters. The proposed method separates touching characters into isolated characters in two steps, namely the basic projection profile method and the water reservoir concept. Finally, recognition is carried out using a Kohonen artificial neural network. Over 300 patterns were tested in segmentation, with 92% accuracy reported, and the recognition phase was tested using 400 patterns, with an 84.5% success rate reported.

Keywords: Sinhala handwriting, character recognition, character segmentation, noise removal, artificial neural network, postal recognition

1. Introduction

Off-line recognition of handwriting has numerous practical applications in areas such as banking, census, mail sorting and commerce.
There are many techniques available in computational pattern recognition, such as artificial neural networks and hidden Markov models, to recognize handwritten characters. In a character recognition scheme, several steps must be completed before the recognition process starts. These steps fall mainly into two groups: data acquisition and preprocessing. In data acquisition, the input image (i.e. a gray-level image) is converted to a binary image. All image enhancement prior to recognition is done in the preprocessing step, which includes noise detection and removal, thinning and segmentation. The next step is recognition. In our Sinhala handwriting recognition system, data acquisition and noise removal are done using a thresholding technique.

Segmentation and recognition are the most difficult tasks of the proposed system, because most characters in real Sinhala postal writing touch each other and are written in various handwriting styles. A segmentation and recognition system can take one of two approaches: segment the image containing characters prior to the recognition phase [2,3,6,10], or integrate the segmentation and recognition phases [1,7]. In the proposed system the segmentation and recognition phases are done separately. Four different touching character groups have been identified according to the way the characters touch each other: overlapping, touching, connecting and intersecting (Figure 1).

In this paper, a two-level segmentation algorithm is proposed. At the first level, touching characters are identified if they are present in the input image. Those touching characters are then separated into the appropriate groups described above, and a suitable segmentation algorithm is subsequently applied to each touching character.
In the proposed system, recognition is done using a Kohonen artificial neural network.

The paper is organized as follows. Section 2 describes the background of Sinhala handwriting and its different styles, together with previous research in this area. Section 3 describes the methods used to implement the system, including binarization, noise reduction, segmentation and recognition. Section 4 evaluates the results of the present work, and Section 5 presents the conclusions.

Figure 1: Combined character groups

2. Background

Sinhala is the national language of the people living in Sri Lanka [9]. 13 million people, i.e. 70% of the population of Sri Lanka, use Sinhala characters to write their mother tongue. It is not spoken in any other country, except in enclaves of migrants. Being a descendant of a spoken form (Pali) of the root Indic language, Sanskrit, Sinhala can be argued to belong to the large family of Indo-Aryan languages [11]. Sinhala is written from left to right and has a curved script. Characters are written within three horizontal layers: upper, middle and lower. Some characters occupy all three layers, some are written only in the middle layer, and another set of characters occupies the middle layer together with the upper layer, the lower layer, or both (Figure 2).

A few studies have been done on Sinhala handwritten characters [4, 8]. Almost all of them focus on recognizing regular, well-defined Sinhala handwritten characters. Consequently, no research has addressed cursive, unconstrained Sinhala handwritten characters.

3. Methodology

This section explains the procedures for noise removal, separation of touching and individual characters, segmentation of touching characters, and recognition of the postal cities. Section 3.1 describes the noise removal techniques applied to the input image. Section 3.2 describes the separation of touching characters from individual characters.
Section 3.3 describes the segmentation techniques for touching characters. Finally, Section 3.4 describes the recognition methodology, which uses a Kohonen artificial neural network.

3.1. Binarization and Noise Removal Techniques

This section introduces the combined binarization and noise removal techniques. After these techniques are applied, the output image has a background value of 255 and the original foreground pixel values. Three approaches are used.

First, the gray levels of the input image are sorted in ascending order, assuming that the foreground makes up one fourth of all pixels in the image. The first quarter of the sorted pixels is taken, and the maximum gray value of these pixels (MaxPV) is determined. MaxPV is used as the cutoff gray value between the background and the foreground of the image. The following rule (Equation 1) sets the background of the gray-level image to gray level 255:

\[ I'(x,y) = \begin{cases} I(x,y), & I(x,y) \le \mathrm{MaxPV} \\ 255, & I(x,y) > \mathrm{MaxPV} \end{cases} \qquad (1) \]

Figure 2: Sinhala characters written within layers

The second approach uses a 3 x 3 kernel. The kernel is applied to each pixel of the image, together with the respective gray-level values, exactly once. If the number of pixels in a given kernel whose gray values are close to zero (i.e. black) is less than or equal to two, that area is considered to belong to the background of the image, and its gray values are set to 255 (i.e. white).

Figure 3: Hierarchy of segmentation and recognition [flow chart: noise-removed image; vertical projection profile (Section 3.2); condition in Equation 4; labeling (Section 3.3); water reservoir concept (Section 3.3); categories 1 to 3 selected by conditions 1 to 3 in Equation 5, otherwise non-segmented character; Kohonen ANN; recognized character]

The third approach is the dynamic adaptive threshold (DTU) method. Equation 2 determines the adaptive threshold, where I(x,y) is the gray level of a pixel in the image, 1 ≤ x ≤ w and 1 ≤ y ≤ h, and w and h are the width and height of the image respectively:

\[ \text{Adaptive Threshold} = \frac{\min_{x,y} I(x,y) + \max_{x,y} I(x,y) + \mathrm{MaxPV}}{3} \qquad (2) \]

The threshold calculated according to Equation 2 is compared with the gray value of each pixel. If the gray value is above the threshold, that pixel is set to 255. In the final output image, the gray values of the background pixels are 255 and the gray values of the foreground pixels are the same as in the initial image. The foreground gray values are left unchanged in order to preserve as much of the detail and information of the characters as possible.

3.2. Separation of Touching and Individual Characters

This section describes the segmentation procedure carried out in the present work. Figure 3 depicts the flow of the segmentation and recognition procedure. At the beginning, the images are segmented using the vertical projection profile (VPP) method. Touching characters (TC) are considered as a single entity at this level. At the next level, the segmented character entities are classified as either touching characters or single characters. Touching characters are further classified into the groups described in Section 1.
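The MaxPV cutoff, the adaptive threshold of Equation 2 and the vertical projection profile split can be illustrated with a minimal Python sketch. It assumes the image is a list of rows of gray levels (0 is black, 255 is white); the function names are hypothetical and are not taken from the paper, and the sketch is not the authors' implementation.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of gray levels, 0 = black, 255 = white


def max_pv(image: Image, foreground_fraction: float = 0.25) -> int:
    """Largest gray level among the darkest `foreground_fraction` of pixels."""
    pixels = sorted(p for row in image for p in row)
    count = max(1, int(len(pixels) * foreground_fraction))
    return pixels[count - 1]


def adaptive_threshold(image: Image) -> float:
    """Equation 2: mean of the minimum gray level, the maximum gray level and MaxPV."""
    flat = [p for row in image for p in row]
    return (min(flat) + max(flat) + max_pv(image)) / 3


def remove_background(image: Image, threshold: float) -> Image:
    """Set pixels brighter than the threshold to 255; keep darker pixels unchanged."""
    return [[255 if p > threshold else p for p in row] for row in image]


def vertical_projection_segments(image: Image) -> List[Tuple[int, int]]:
    """Column ranges (start, end exclusive) holding at least one foreground pixel."""
    width = len(image[0])
    has_ink = [any(row[x] != 255 for row in image) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a new entity begins
        elif not ink and start is not None:
            segments.append((start, x))    # an empty column closes the entity
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


if __name__ == "__main__":
    toy = [
        [250, 10, 240, 245, 20, 30, 250, 250],
        [250, 15, 240, 245, 25, 35, 250, 250],
    ]
    cleaned = remove_background(toy, adaptive_threshold(toy))
    print(vertical_projection_segments(cleaned))
```

Each returned column range corresponds to one character entity at this level; touching characters still appear as a single range.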
These steps follow the procedure described below. The width of the image and the number of characters in it, as obtained from the VPP, are used to estimate the average character width roughly (Equation 3):

\[ \text{Average Character Width} = \frac{\text{Image Width}}{\text{Number of Characters}} \qquad (3) \]

The segmented character entities are then classified according to Equation 4, so that an entity wider than one and a half times the average character width is treated as a touching character:

If ( Width > ( 3 x AvgWidth / 2 ) )
    Touching Character                                        ( 4 )
Else
    Segmented Character

3.3. Segmentation of Touching Characters

This section describes the touching character segmentation procedures. The first step is to label each touching character in order to distinguish overlapping touching characters from the other touching characters.

The proposed labeling algorithm moves a 3 x 3 kernel horizontally through the input image. When it finds the first dark (foreground) pixel, it moves along the object, assigning the label 1 from a label counter. When the foreground object is discontinued, the label counter is increased by one and the kernel moves freely to find another object. The incremented label counter value is assigned to the newly found object, and this procedure continues until the width of the image is reached. If the label counter is greater than one, it is reasonable to deduce that the segmented unit contains more than one character, and each label then represents a separate character unit. If the label counter is equal to one, the character unit segmented in Section 3.2 belongs to one of the other three groups mentioned in Section 1, namely touching, connecting and intersecting. These groups continue to the next segmentation procedure, the water reservoir concept, discussed briefly in Pal, Belaid and Choisy [8]. Figure 4 shows how the water reservoir concept segments touching characters.
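The width test and the labeling step can be sketched in Python as follows. Two points are assumptions rather than the paper's method: the sketch treats an entity as touching when it is wider than one and a half average widths, and it replaces the kernel-based labeling with standard 8-connected component counting by flood fill. All function names are hypothetical.

```python
from collections import deque
from typing import List

Image = List[List[int]]  # 255 = background, anything else = foreground


def average_character_width(image_width: int, n_characters: int) -> float:
    """Equation 3: image width divided by the number of entities found by VPP."""
    return image_width / n_characters


def is_touching(entity_width: int, avg_width: float) -> bool:
    """Assumed reading of Equation 4: wider than 1.5 average widths means touching."""
    return entity_width > 3 * avg_width / 2


def count_components(image: Image) -> int:
    """Count 8-connected foreground objects with a breadth-first flood fill."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    labels = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] == 255 or seen[y][x]:
                continue
            labels += 1                    # a new, unvisited object starts here
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx] and image[ny][nx] != 255):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return labels
```

A count greater than one corresponds to the overlapping group, where the characters share columns but not pixels; a count of one sends the entity on to the water reservoir stage.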
Figure 4: Applying the water reservoir concept to a touching character

Validation of the segmented units using the water reservoir concept is based on the following attributes:

1. Number of top reservoirs and their heights, volumes and centres of gravity.
2. Number of bottom reservoirs and their heights, volumes and centres of gravity.
3. Number of reservoirs in the unit.
4. Maximum depth of each reservoir, and whether a reservoir has more than one maximum depth point.
5. If the character unit has both top and bottom reservoirs, the angle of the line joining the centres of gravity (each top-reservoir centre of gravity is joined to each bottom-reservoir centre of gravity).

If ( Number of top reservoirs = 1 and
     Top reservoir height > 3 x Character Height / 4 and
     No one maximum depth point )
    PUT BIN 1
Else if ( Number of top reservoirs = 1 and
          Number of bottom reservoirs = 1 and
          Angle of the centres of gravity of the reservoirs is between -45 degrees and 45 degrees and
          No one maximum depth point )
    PUT BIN 2
Else if ( Number of top reservoirs >= 1 and
          Number of bottom reservoirs >= 1 and
          Only one maximum depth point in each reservoir )
    PUT BIN 3
Else
    CANNOT BE SEGMENTED                                       ( 5 )

This is the final step of the segmentation method. Equation 5 is used to separate the touching characters into the three groups mentioned in Section 1. A different segmentation technique is suited to each group. The following paragraphs discuss the segmentation of characters in each category.

Category 1: The gray-level distribution is used to segment characters in this category. In this type of character, the gray values at the connection point are higher than the other gray values of the character. That is, the connection point is lighter than the other points in the character image, because the writer releases the pressure on the pen tip through the connecting area but continues writing with a small amount of pressure. The foreground pixels with the highest gray values form the segmentation point of category 1 character images. This segmentation point should also be checked to see whether it belongs to the bottom line of the reservoir.

Category 2: Characters of this type use the maximum depth point of the top reservoir (MDTR) and the maximum depth point of the bottom reservoir (MDBR) to choose the segmentation point. The maximum depth points at the top and at the bottom are joined, and the length of the connecting line is calculated.
This category has only one top reservoir and one bottom reservoir. The segmentation path follows the line joining MDTR and MDBR and separates the combined character into two isolated characters.

Category 3: Characters in category 3 have more than one top reservoir and more than one bottom reservoir. All maximum depth points in the top and bottom reservoirs are joined. Then the minimum distance between the depth points of the top reservoirs and the depth points of the bottom reservoirs is calculated. This minimum-distance line is the best cut-off line for the combined characters in category 3.

3.4. Recognition

The Kohonen artificial neural network (KANN) is used for the recognition phase of the proposed system. The KANN uses 32 x 32 input neurons and 1 output neuron. The pattern presented to the input neurons is shown in Figure 5; each square in this pattern is one input node of the KANN. The KANN has only one input layer and one output layer. For the proposed system, all available characters are divided into forty groups, as shown in Figure 6. In forming the groups, modifiers that can be separated from their characters are ignored, while characters whose modifiers cannot be separated are taken as a whole together with the modifier. All the modifiers that can be separated are placed in group 9.

In the proposed system, the first pattern is input to the KANN, which produces an output that is one of the forty character groups listed in Figure 6. If the output is group 9, the pattern is ignored and the next pattern is checked; it will be one of the other thirty-nine groups and is selected as the first letter of the word. The second pattern is chosen from the second letters of the list of words that begin with the first letter selected by the KANN, and it is set as the second letter of the word, ensuring that the pattern is not in group 9. This process is continued to select the remaining character patterns in the word image.

Figure 5: Input pattern of KANN

A city image is identified as follows. For example, for a given city image the KANN generates the output signal AQTMX, and the system then searches the database for the city whose symbol string equals AQTMX. Finally, the system determines that the input city is Anamaduwa.

Figure 6: Groups of characters [two-column table pairing each group symbol with its Sinhala character group; symbols as extracted: A, Q, B, R, C, S, D, E, F, f, G, H, I, i, J, s, T, t, U, V, W, w, X, K, Y, L, Z, l, 1, M, 2, m, 3, N, 4, O, 5, P, 9; the Sinhala characters are missing in the source]

4. Results and Evaluation

The proposed method was applied to Sinhala handwritten postal addresses because postal addresses are written by many different people with different educational levels, so the sample data set accommodates a wide variation of handwritten characters, and no restriction is applied when writing the city names. The Sinhala handwriting database available at NSF [4] in Sri Lanka is one source of real postal addresses used to train and test the proposed method. Another source used for training and testing is a set of manually collected postal city names, written by selected students of the University of Colombo and digitized using an HP Scanjet 5200C. For training and testing, the collected data are categorized into four groups: real postal addresses from the NSF database (RPA), words written by students of the University of Colombo (WUOC), a selected sample of combined characters (CC), and isolated characters (IC). The success rate of the segmentation method is 92% and that of the recognition method is 84.5%. Segmentation and recognition results are shown in Table 4.1 and Table 4.2 respectively.
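The overall rates can be recomputed from the per-group counts reported in Tables 4.1 and 4.2. The short Python check below uses only those counts; the dictionary layout and the function name are illustrative choices, not part of the paper.

```python
# (patterns tested, patterns handled correctly) per data group, from Tables 4.1 and 4.2
segmentation = {"RPA": (100, 89), "WUOC": (100, 93), "CC": (100, 94)}
recognition = {"RPA": (100, 76), "WUOC": (100, 87), "CC": (100, 85), "IC": (100, 90)}


def success_rate(results):
    """Return (total tested, total correct, success rate in percent)."""
    tested = sum(n for n, _ in results.values())
    correct = sum(c for _, c in results.values())
    return tested, correct, 100.0 * correct / tested


if __name__ == "__main__":
    for name, results in (("segmentation", segmentation), ("recognition", recognition)):
        tested, correct, rate = success_rate(results)
        print(f"{name}: {correct}/{tested} = {rate:.1f}%")
```

The totals are pooled counts over all groups, which here equals the mean of the per-group rates because every group contains 100 patterns.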
Table 4.1: Results of segmentation

Group    No. of patterns    Segmented correctly    Success (%)
RPA      100                89                     89%
WUOC     100                93                     93%
CC       100                94                     94%
Total    300                276                    92%

Table 4.2: Results of recognition

Group    No. of patterns    Recognized correctly    Success (%)
RPA      100                76                      76%
WUOC     100                87                      87%
CC       100                85                      85%
IC       100                90                      90%
Total    400                338                     84.5%

5. Conclusions

From the observations of the present work it can be concluded that the approach presented in this paper is efficient in recognizing cursive, unconstrained Sinhala handwriting compared with the conventional segmentation and recognition methods. The performance of the present system can be improved in many ways by incorporating other segmentation and recognition methods. More complex touching character groups and very noisy images that could not be handled in the present work can be handled by improving this method as a future enhancement. The next major future enhancement is to introduce a hybrid recognition process, since some character groups are misrecognized when only the Kohonen artificial neural network is used. Examples of such groups are [Sinhala character pairs, missing in the source]. A hybrid recognition process could combine the Kohonen artificial neural network with hidden Markov models as a post-processing technique.

References

1. T. M. Breuel. Segmentation of Handprinted Letter Strings using a Dynamic Programming Algorithm. 6th International Conference on Document Analysis and Recognition, volume 1, pages 821-826, 2001.
2. R.G. Casey and E. Lecolinet. A Survey of Methods and Strategies in Character Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 18, number 7, pages 690-706, 1996.
3. C.E. Dunn and P.S.P. Wang. Character Segmentation Techniques for Handwritten Text: A Survey. International Conference on Pattern Recognition, pages 577-580, 1992.
4. H.C. Fernando, N.D. Kodikara and S. Hewavitharana. A Database for Handwriting Recognition Research in Sinhala Language. Proceedings of the 7th International Conference on Document Analysis and Recognition, Edinburgh, UK.
5. S. Hewavitharana, H.C. Fernando and N.D. Kodikara. Offline Sinhala Handwriting Recognition using Hidden Markov Models. Proceedings of the Third Indian Conference on Computer Vision, Graphics and Image Processing, 2002.
6. D. Lee, S. Lee and H. Park. A New Methodology for Gray-Scale Character Segmentation and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 18, number 10, pages 1045-1050, 1996.
7. S. Messelodi and C.M. Modena. Context driven text segmentation and recognition. Pattern Recognition, volume 17, pages 47-56, 1996.
8. U. Pal, A. Belaid and C. Choisy. Touching numeral segmentation using water reservoir concept. Pattern Recognition Letters, volume 24, pages 261-272, 2003.
9. H.L. Premaratne and J. Bigun. Recognition of Printed Sinhala Characters Using Linear Symmetry. 5th Asian Conference on Computer Vision, pages 23-25, 2002.
10. K. Romeo-Pakker, H. Miled and Y. Lecourtier. A new approach for Latin/Arabic character segmentation. 3rd International Conference on Document Analysis and Recognition, pages 874-877, 1995.
11. R. Weerasinghe. A Statistical Translation Approach to Sinhala-Tamil Language Translation. 5th International Information Technology Conference, pages 136-141, 2003.