A New Approach to Detect and Extract Characters from Off-Line Printed Images and Text


Available online at www.sciencedirect.com

Procedia Computer Science 17 (2013) 434-440

Information Technology and Quantitative Management (ITQM2013)

Amit Choudhary a,*, Rahul Rishi b, Savita Ahlawat c

a Maharaja Surajmal Institute, New Delhi, India
b UIET, Maharshi Dayanand University, Rohtak, India
c Maharaja Surajmal Institute of Technology, New Delhi, India

* Corresponding author. Tel.: +91-991-133-5069. E-mail address: amit.choudhary69@gmail.com.

Abstract

Character extraction is the most critical pre-processing step for any off-line text recognition system because characters are the smallest unit of any language script. This paper proposes an approach for segmenting character images from documents that contain pictures alongside machine-printed or handwritten words. The approach is based on a set of properties computed for each connected component (object) in the binary image of the whole document. The words, which are printed alongside the pictures, vary in length and are set in different cursive fonts of different sizes. The technique is applied to the segmentation of non-touching characters from machine-printed or handwritten words of varying length written on a noisy background that also contains pictures. The results are very promising and indicate the robustness of the proposed character detection and extraction technique.

© 2013 The Authors. Published by Elsevier B.V. Selection and/or peer-review under responsibility of the organizers of the 2013 International Conference on Information Technology and Quantitative Management. ISSN 1877-0509. doi:10.1016/j.procs.2013.05.056

Keywords: OCR; Character Extraction; Character Recognition; Pre-processing Techniques; Cursive Words.

1. Introduction

Character detection, extraction and recognition have been an active field of research for many years, and they remain an open problem in Pattern Recognition and Image Processing. An Optical Character Recognition (OCR) system translates a scanned or printed image of a document into text that can be edited. A character recognition system has three main phases: Preprocessing, Segmentation and Recognition. Preprocessing aims at eliminating the variability that is inherent in printed words. Preprocessing techniques such as background noise removal, scaling, thinning and skew removal have been employed by various researchers to improve the performance of the segmentation and recognition processes. The role of Segmentation is to find correct letter boundaries. Segmentation precedes character recognition, so the output of segmentation becomes the input to the character recognition module. Segmentation of off-line cursive words into characters is one of the most difficult and important processes in handwriting recognition because it directly affects the result of the recognition process [1, 2, 5, 6].

The outline of the paper is as follows. Section 2 describes the work already done in this field. Section 3 describes the steps of the proposed segmentation technique. Section 4 gives the implementation details. Results are discussed in Section 5, and the paper is concluded in Section 6.

2. Related Work

Even though segmentation is still a major problem in off-line text recognition (machine printed or handwritten), there were not many published segmentation algorithms until recent years, and there are still not many that report performance results separately from recognition performance. Furthermore, many of the reported algorithms target machine-printed text and would not work well with handwriting because they depend on special characteristics of machine printing (regularity, little or no slant, etc.). An overview of segmentation of handwritten text can be found in Rehman and Dzulkifli [8]. Similarly, Saba, Rehman and Sulong [9] surveyed character segmentation techniques in general. Samrajya et al. [10] investigate a hypergraph model for segmenting a cursive handwritten word image into isolated characters. The hypergraph model treats an image as packets of pixels.
The authors claim that, by recombining these packets of different sizes, a given word image can be segmented into characters if at least one of the combinations provides a correct segmentation. However, no segmentation results are presented for comparison, and the technique does not appear to yield successful results for touching characters. Dawoud [4] introduces the iterative cross section sequence graph (ICSSG) for character segmentation. ICSSG tracks the growth of characters at equally spaced thresholds. The iterative thresholding reduces the effect of the information loss associated with image binarization. However, the experiments are performed on handwritten digits only. More recently, Lee and Verma [7] proposed a new segmentation algorithm for off-line cursive handwriting recognition. Word images are first dissected heuristically based on the pixel density between the upper and lower baselines. Each segment is then passed through multiple expert-based validation processes to determine valid character boundaries.

3. Proposed Character Extraction Technique

The segmentation module plays a crucial role in the performance of the overall handwriting recognition system. However, it is very difficult to find precise character boundaries without knowledge about the characters. To avoid this error-prone process, the proposed segmentation approach selects the objects or regions whose area falls within a threshold range. Objects whose area is greater than 25 and less than 3000 are assumed to be valid character objects. Objects of area less than 25 are assumed to be noise, and objects of area greater than 3000 are assumed to be a picture, logo, etc. that may be present in the scanned image of the document. The steps followed in the proposed character detection and extraction process [3] can be described as follows:

Step 1: The input scanned document image (having a logo or images and handwritten or typed words) to be segmented into characters is obtained.
Step 2: After pre-processing, i.e. removing the background noise etc., the word image is converted into binary form.
Step 3: The image
Step 4: All foreground connected components (objects) are extracted by method.
Step 5: For each object in the scanned word image, repeat steps 6 and 7.
Step 6: If the area of an object is greater than 25 and less than 3000, it is a valid connected component (object), and the smallest rectangular region containing the complete connected component (object) is drawn around it.
Step 7: Each sub-image containing a single object within its smallest rectangular region is inverted and displayed as a segmented character image.

4. Implementation and Methodology

The images of computer-printed or handwritten words were acquired through a scanner or a digital camera. The words were written in different fonts, sizes and colors on noisy backgrounds of different colors. The input word images were saved in JPEG or BMP format for further processing. Some word image samples from our collected dataset are shown in Table 1.

Table 1. Printed and Handwritten Word Image Samples.
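The steps of Section 3 can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. Several choices are assumptions: the source does not name the labelling method, so 8-connected breadth-first labelling is used; "area" is taken to be the foreground pixel count of a component; and the helper names (`connected_components`, `extract_characters`, `demo_image`) are hypothetical. Only the 25 and 3000 limits come from the text.

```python
from collections import deque


def connected_components(binary):
    """Return the foreground (1) components of a binary image (list of rows).

    Each component is a list of (row, col) pixels. 8-connectivity is an
    assumption; the source does not say which labelling method was used.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    # Scan column by column starting at the upper-left corner.
    for c in range(cols):
        for r in range(rows):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny in (y - 1, y, y + 1):
                    for nx in (x - 1, x, x + 1):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def extract_characters(binary, min_area=25, max_area=3000):
    """Steps 5 to 7: keep components with min_area < area < max_area.

    Returns a list of ((top, left, bottom, right), inverted_crop) pairs.
    """
    results = []
    for pixels in connected_components(binary):
        area = len(pixels)  # assumption: area means foreground pixel count
        if not (min_area < area < max_area):
            continue  # too small (noise) or too large (picture or logo)
        ys = [y for y, _ in pixels]
        xs = [x for _, x in pixels]
        top, left, bottom, right = min(ys), min(xs), max(ys), max(xs)
        # Crop the smallest enclosing rectangle from the binary image and invert it.
        crop = [row[left:right + 1] for row in binary[top:bottom + 1]]
        inverted = [[1 - v for v in row] for row in crop]
        results.append(((top, left, bottom, right), inverted))
    return results


def demo_image():
    """Synthetic 70x90 page: a large 'logo', one 'character' and a noise speck."""
    img = [[0] * 90 for _ in range(70)]

    def fill(top, left, height, width):
        for r in range(top, top + height):
            for c in range(left, left + width):
                img[r][c] = 1

    fill(0, 0, 60, 60)   # 3600 pixels: rejected as a picture or logo
    fill(10, 70, 6, 6)   # 36 pixels: accepted as a character
    fill(30, 80, 2, 2)   # 4 pixels: rejected as noise
    return img


if __name__ == "__main__":
    for box, _ in extract_characters(demo_image()):
        print(box)
```

On the synthetic page, only the 36-pixel block survives the area filter; the large block and the speck are discarded in the same pass, which is how the approach separates characters from both noise and pictures.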

The word images to be segmented into characters were on a noisy background, so the background noise had to be removed to improve the quality of the word image before further processing. Contrast adjustment was also applied to overcome the problems arising from pens of different colors in handwritten words and fonts of different colors in machine-printed words. Fig. 1 shows the original text image containing a logo and a printed word, and Fig. 2 shows the resulting image after background noise removal and binarization using a grayscale intensity threshold.

Fig. 1. Original scanned input image having a logo image and text.

Fig. 2. The input image after background noise removal and contrast adjustment.

Thinning and size normalization are not performed here because this paper presents only the technique for segmenting a word image into individual characters. Resizing and thinning become necessary only if the extracted character images are to be recognized with a pattern recognition technique. The segmentation technique was applied to the word image obtained in Fig. 2. The input image was traced vertically from the upper left corner, and all the connected components were identified based on their foreground area. All the valid connected components have been extracted and enclosed in rectangular regions of the smallest possible area. The rectangular regions enclosing the segmented character images are shown in Fig. 3.
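The pre-processing described above can be sketched as grayscale conversion, a linear contrast stretch and a global intensity threshold. This is an illustrative sketch under stated assumptions, not the authors' code. The function names are hypothetical, the threshold of 0.5 is an assumed value (the source gives none), the ink is assumed to be darker than the background, and the grayscale weights are the standard ITU-R BT.601 luma coefficients. Background noise is only suppressed here by the threshold itself.

```python
def to_gray(rgb_image):
    """Convert rows of (r, g, b) tuples to luma using ITU-R BT.601 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def stretch_contrast(gray):
    """Linearly rescale intensities so the darkest pixel is 0.0 and the brightest 1.0."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:
        return [[0.0] * len(row) for row in gray]  # flat image: nothing to stretch
    return [[(v - lo) / (hi - lo) for v in row] for row in gray]


def binarize(gray01, threshold=0.5):
    """Mark dark pixels as foreground (1); 0.5 is an assumed threshold."""
    return [[1 if v < threshold else 0 for v in row] for row in gray01]


def preprocess(rgb_image, threshold=0.5):
    """Grayscale -> contrast stretch -> global threshold."""
    return binarize(stretch_contrast(to_gray(rgb_image)), threshold)


if __name__ == "__main__":
    paper, paper_noisy = (230, 230, 220), (200, 210, 205)
    blue_ink, red_ink = (0, 0, 160), (180, 0, 0)
    page = [[paper, blue_ink, paper_noisy],
            [red_ink, paper_noisy, paper]]
    print(preprocess(page))
```

Because the stretch maps every image to the same 0 to 1 range before thresholding, inks of different colors end up on the foreground side of a single threshold, which is the purpose the text gives for the contrast adjustment.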

Fig. 3. Extraction of characters from the input image.

It can be observed that the segmented character images have been cropped very well. These individual images can be resized, and a thinning operation can be applied to reduce each character stroke to a width of a single pixel. After such pre-processing, the character images can be used to train a neural network classifier if recognition is to follow segmentation, as is normally the case in an optical character recognition (OCR) system.

5. Experimental Results and Analysis

The proposed character extraction technique performed exceptionally well, as shown in Fig. 3. The technique is based on extracting the smallest bounding box that contains a whole connected component. A connected component (object) is valid if the area of its rectangular region (bounding box) lies within the minimum and maximum range specified in the pseudo code of the technique. Because of this constraint, the dots could not be extracted: the area of the bounding boxes containing a dot (.) was less than 25, as shown in Fig. 5.

Fig. 4. Original image and the pre-processed handwritten word image.

Fig. 5.

If the minimum area requirement for a bounding box containing a valid connected component is lowered below 25, the dots (.) in characters can be extracted, but foreground noise may also be extracted from the scanned word image. The minimum and maximum area criteria for a valid bounding box can be changed as required. If the word image is noise free, the minimum area criterion can be set to a value much lower than 25. Further, it has been observed that the technique under-performs when a machine-printed or handwritten character in the scanned word image does not form a single connected component, as shown in Fig. 6. Here the character image is divided into two connected components, and each is accepted as a valid connected component (object) because both satisfy the minimum bounding box area criterion.

Fig. 6.
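The trade-off described above can be reproduced on synthetic data. The sketch below is only an illustration: the test shapes, their sizes and the helper names are invented for the example, area is assumed to mean foreground pixel count, and 8-connectivity is assumed. It counts how many components pass the filter as the minimum area is lowered, and shows a broken character being accepted as two objects.

```python
def component_areas(binary):
    """Return the pixel count of each 8-connected foreground component."""
    rows, cols = len(binary), len(binary[0])
    seen = set()
    areas = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or (r, c) in seen:
                continue
            seen.add((r, c))
            stack = [(r, c)]
            area = 0
            while stack:
                y, x = stack.pop()
                area += 1
                for ny in (y - 1, y, y + 1):
                    for nx in (x - 1, x, x + 1):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            areas.append(area)
    return areas


def count_valid(binary, min_area, max_area=3000):
    """Number of components with min_area < area < max_area."""
    return sum(1 for a in component_areas(binary) if min_area < a < max_area)


def _canvas(blocks, rows=30, cols=20):
    """Build a binary image from (top, left, height, width) rectangles."""
    img = [[0] * cols for _ in range(rows)]
    for top, left, height, width in blocks:
        for r in range(top, top + height):
            for c in range(left, left + width):
                img[r][c] = 1
    return img


def letter_i_image():
    """A 36-pixel stem, a 9-pixel dot above it and a 4-pixel noise speck."""
    return _canvas([(10, 5, 12, 3), (4, 5, 3, 3), (25, 15, 2, 2)])


def broken_character_image():
    """One character split into two 32-pixel pieces by a two-row gap."""
    return _canvas([(2, 5, 8, 4), (12, 5, 8, 4)])


if __name__ == "__main__":
    for min_area in (25, 8, 3):
        print(min_area, count_valid(letter_i_image(), min_area))
    print(count_valid(broken_character_image(), 25))
```

With a minimum of 25 only the stem is kept. Lowering the minimum to 8 recovers the dot, and lowering it to 3 also admits the noise speck. The broken character yields two accepted objects at the default minimum, so no choice of area threshold alone repairs that case.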

6. Conclusion and Future Scope

Very promising results are achieved by using this technique to extract character images. Although the technique has a few shortcomings in cases where a character image is not completely connected, it can be used in a variety of applications where the extracted character images are used to train and test a character recognition system, such as a car number plate recognition system. Further, the technique gives a direction for the future development of a character extraction technique for touching characters, as in cursive handwriting, where the characters in a word image touch one another.

Acknowledgements

The authors sincerely thank the management and the director of Maharaja Surajmal Institute, C-4, Janakpuri, New Delhi, India, for providing the infrastructure and financial assistance to carry out this research. The excellent cooperation of fellow colleagues and library staff is highly appreciated. The authors gratefully acknowledge the anonymous comments that helped improve the clarity of the work.

References

[1] Broumandnia A, Shanbehzadeh J, Varnoosfaderani MR. Persian/Arabic handwritten word recognition using M-band packet wavelet transform. Image and Vision Computing; 2008, (26) 829-842.
[2] Camastra F. A SVM-based cursive character recognizer. Pattern Recognition; 2007, (40) 3721-3727.
[3] Choudhary A, Rishi R. A Robust Character Extraction Technique for Machine Printed Cursive Words of Varying Length. In:
[4] Dawoud A. Iterative cross section sequence graph for handwritten character segmentation. IEEE Trans Image Process; 2007, 16(8) 2150-2154.
[5] Fujisawa H. Forty years of research in character and document recognition: an industrial perspective. Pattern Recognition; 2010, (41) 2435-2446.
[6] Kherallah M, Haddad L, Alimi A, Mitiche A. On-line handwritten digit recognition based on trajectory and velocity modeling. Pattern Recognition Letters; 2008, (29) 580-594.
[7] Lee H, Verma B. A novel multiple experts and fusion based segmentation algorithm for cursive handwriting recognition. In: ; 2008, 2994-2999.
[8] Rehman A, Dzulkifli M. A simple segmentation approach for unconstrained cursive handwritten words in conjunction with the neural network. Int J Image Process 2(3); 2008, 29-35.
[9] Saba T, Rehman A, Sulong G. Cursive script segmentation with neural confidence. Int J Innov Comput Inf Control; 2011, 7(7).
[10] Samrajya P, Lakshmi M, Hanmandlu, Swaroop A. Segmentation of cursive handwritten words using Hypergraph. TENCON, IEEE Region 10 Conference; 2006, 1-4.