SEVERAL METHODS OF FEATURE EXTRACTION TO HELP IN OPTICAL CHARACTER RECOGNITION

Binod Kumar Prasad*, Bengal College of Engineering and Technology, Durgapur, W.B., India
Rajdeep Kundu, Bengal College of Engineering, Durgapur, W.B., India
*Email: binod.krpr@gmail.com

Article History: Submitted on 17th August, Revised on 27th October, Published on 25th January 2017

Abstract. An Optical Character Recognition (OCR) system consists of three broad steps, namely preprocessing, feature extraction and classification. Feature extraction methods yield the feature vectors on which the classification of a test pattern is based. This paper proposes some methods of feature extraction that may go a long way towards recognizing a Bengali numeral or character. The Pixel Ex-OR method uses a digital gating (Ex-OR) technique to extract the information in an image: two successive elements of a row of the image matrix are Ex-ORed, and the output is again Ex-ORed with the next element. Alphabetical coding codes a binary character image by means of letters of the English alphabet. Directional features find gradient information using Sobel masks to make the position of a stroke in an image clear. The features are derived in eight standard directions, and these eight feature vectors are then merged into four sets of features to reduce system complexity and hence save processing time. These features will help develop a Bengali numeral recognition system.

Keywords. Optical Character Recognition; Feature extraction; Chain code; Directional features; Bengali numerals recognition system.

INTRODUCTION

Recognition of handwritten characters has been a popular research area for many years because of its many potential applications, such as postal automation, bank cheque processing and automatic data entry. There is a large body of work on handwriting recognition for Roman, Japanese, Chinese and Arabic scripts.
Bangla is the third most popular language in the Indian subcontinent and the sixth most popular language in the world. It is a main language of the state of West Bengal. Many official bodies, such as postal services, courts, banks and state government offices, use Bangla numerals in their work. With the rapid growth in the use of computers in West Bengal, the use of Bengali on computers is much discussed and much research is being done.

To take care of the variability in the writing styles of different individuals, Pal and Chaudhuri [1] proposed a robust scheme for the recognition of isolated Bengali off-line handwritten numerals. The scheme is based on features obtained from the concept of water overflow from a reservoir, as well as topological and statistical features. Dutta and Chaudhury [2] developed a system for recognition of isolated Bengali alphanumeric handwritten characters using neural networks. The characters were represented in terms of primitives and structural constraints. The primitives were characterized on the basis of significant curvature events, such as curvature maxima, curvature minima and inflection points, observed in the characters. Bhattacharya et al. [3] used a topology-adaptive self-organizing neural network to extract the skeletal shape from a numeral pattern represented as a graph. Certain features present in the graph, such as loops and junctions, are used to classify a numeral into a smaller group. Finally, multi-layer perceptron networks are used to classify the different numerals uniquely. Early work on combining the decisions of multiple classifiers was done by Bajaj et al. [4] for an Indian script. They represented each numeral as a concatenation of strokes and then calculated strength, density and moment as the feature set. They then trained different networks and combined them to increase the reliability of the recognition results. W. Lu et al. [5] proposed a practical model-based scheme for recognition of handwritten Bengali numerals. Its logic sets are derived from features consisting of normalized distances obtained using the box approach, and the scheme is described as very useful for the Indian postal system. In more recent work, Bhattacharya and Chaudhuri [6] proposed a multi-resolution wavelet analysis method for feature extraction and a multilevel classifier for recognition of Devanagari and Bengali numerals. They used the chain code histogram as the feature on each wavelet sub-band at different resolutions, trained multi-layer perceptron (MLP) neural networks for classification at each resolution, and then combined the classification results by majority voting. Sikdar et al. [7] extracted the features of printed words by the chain code method. The authors of [8] have summarized recent trends in feature extraction and feature selection.

www.ijimc.in

PREPROCESSING

Preprocessing consists of a number of preliminary steps that make the raw data usable by the MATLAB software. It aims to produce data on which MATLAB can operate accurately. A typical preprocessing pipeline includes several steps, namely text digitization, binarization and noise removal. We have used MATLAB R2010a for our project.

Unconstrained handwritten numerals, when scanned at a fixed resolution, usually produce images of different sizes and aspect ratios, with variable stroke thickness. Paper and ink quality, along with the A/D converter, may introduce non-negligible noise. Our first task is therefore to resize every image to a fixed size with a fixed average stroke thickness. In this paper, median filtering has been used to obtain a smooth binary image. The numeral sample images may be of different sizes, with different stroke thicknesses and limited variation in orientation. These images are therefore first resized to 50 × 50 and then binarized using a local thresholding method minimizing inter-class difference. Since each numeral object has a ribbon-like structure, the average thickness needs to be made the same for all objects. A morphological thinning operation is applied to obtain the skeleton of the numeral images.

Figure 1. Steps of preprocessing

TEXT DIGITIZATION

Text digitization can be performed with either a flatbed scanner or a hand-held scanner. A hand-held scanner typically has a low resolution range, so we have used a flatbed scanner. The digitized images are in gray tone, and we have used a local thresholding approach to convert them into two-tone images.

Figure 2. Image in gray code form (partial)

BINARIZATION

Binarization is a technique by which gray-scale images are converted to binary images. It separates the foreground (text) from the background information. We select a suitable threshold for the intensity of the image; all intensity values above the threshold are mapped to intensity value 1 (white), and all intensity values below the threshold are mapped to the other chosen intensity, 0 (black). If g(x, y) is a thresholded version of f(x, y) at some global threshold T, then

$$ g(x, y) = \begin{cases} 1, & \text{if } f(x, y) \ge T \\ 0, & \text{otherwise.} \end{cases} $$
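As a minimal sketch of the thresholding rule above, the following Python function (an illustration, not the authors' MATLAB code) maps a gray-level image to a two-tone image. The function name `binarize`, the nested-list image representation and the default threshold of 240 (the gray-level value quoted in the pixel count section) are assumptions made for this example, as is the choice to treat a pixel exactly equal to T as white.

```python
def binarize(gray, threshold=240):
    """Threshold a gray-level image given as a list of rows of integers.

    Pixels at or above the threshold become 1 (white, background);
    pixels below it become 0 (black, ink).
    The >= at equality is an assumption of this sketch.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in gray]


if __name__ == "__main__":
    # Hypothetical 2 x 3 gray-level patch, values in 0..255.
    patch = [
        [255, 250, 12],
        [239, 240, 241],
    ]
    for row in binarize(patch):
        print(row)
```

A local thresholding scheme would compute a separate threshold for each neighbourhood instead of using one global value; the comparison applied to each pixel stays the same.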

Figure 3. Image in binary code form (partial; sample rows: 11111100000001, 11100000000011)

NOISE REMOVAL

Scanned documents often contain noise arising from the printer, the scanner, the print quality or the age of the document. It is therefore necessary to filter this noise before the image is processed. The commonly used approach is to low-pass filter the image. The objective in designing a filter is to reduce noise without affecting the other aspects of the original image. In median filtering, the value of an output pixel is the median of the pixels in its neighborhood. The median is much less sensitive than the mean to extreme values (called outliers). Median filtering is therefore better able to remove these outliers without reducing the sharpness of the image.

Figure 4. Median filtering

THINNING

Before the features are extracted, character images are converted to thin curve segments of single-pixel thickness. Traditional thinning strategies deform character shapes, especially where the strokes branch out, i.e. at the junction points of the skeleton.

Figure 5. Thinned images of some numerals (original input images 1 to 4 and their thinned skeletons)

FEATURE EXTRACTION

Feature extraction plays an important role in character recognition. Recognizing a character is a challenging task, but a good choice of features significantly improves the recognition rate and minimizes the error. In the feature extraction stage, each character is represented as one or more feature vectors. The major goal of feature extraction is to obtain a set of features that maximizes the recognition rate with the smallest number of elements. The various steps used to extract features are discussed below:

Figure 6. Steps of feature extraction

PIXEL COUNT METHOD (ROW WISE)

After thresholding, every image is represented by a binary matrix containing only 0s and 1s. A pixel that carries information (gray-level value less than 240) is represented by binary 0; otherwise (gray-level value greater than 240) it is represented by binary 1. We treat binarized 0 pixels as ON pixels and binarized 1 pixels as OFF pixels. In this method, the ON pixels (binarized 0) are counted in every row of the image matrix. The row-wise counts differ from one Bangla numeral to another, so they can be used to generate a code vector for each numeral. In our project every image contains 50 × 50 pixels; therefore the final code vector has 50 feature elements.

Code vector for Nine:
0 0 0 0 0 3 3 1 2 1 1 1 2 1 1 2 1 2 1 3 4 4 2 2 2 2 2 2 1 8 5 6 5 5 3 4 5 3 4 3 2 6 4 5 5 11 13 0 0 0

PIXEL EX-OR METHOD (ROW WISE)

Every row of an image matrix contains a different number of ON and OFF pixels. An ON pixel is represented by binary 0 and an OFF pixel by binary 1, so each row yields a different sequence of 0s and 1s, which varies from numeral to numeral. The proposed method applies the digital Ex-OR operation to the successive elements of each row: the first two elements are Ex-ORed, the output is Ex-ORed with the next element, and so on to the end of the row. The result for a row is a single bit, equal to 1 when the row contains an odd number of 1s and 0 otherwise. Since the image size is 50 × 50 pixels, the 50 rows give a 50-bit output. This output is expected to differ between different Bengali numerals, so the code vector can be used for recognition.

Code vector for Three:
0 1 0 1 0 1 1 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

ALPHABETICAL CODING (ROW WISE)

As in the two methods above, every image has 50 × 50 pixels.
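Before the rows are subdivided, the two whole-row features described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' MATLAB implementation: the function names `pixel_count_rows` and `pixel_xor_rows` and the nested-list image format are assumptions, and the 4 × 4 image is a made-up example standing in for a 50 × 50 numeral.

```python
from functools import reduce
from operator import xor


def pixel_count_rows(image):
    """Count the ON pixels (binarized 0) in every row of a binary image."""
    return [row.count(0) for row in image]


def pixel_xor_rows(image):
    """Fold every row with Ex-OR: ((p0 ^ p1) ^ p2) ^ ... -> one bit per row."""
    return [reduce(xor, row, 0) for row in image]


if __name__ == "__main__":
    # Hypothetical 4 x 4 binary image: 0 = ON (ink), 1 = OFF (background).
    image = [
        [1, 1, 1, 1],
        [1, 0, 0, 1],
        [0, 0, 0, 1],
        [1, 0, 1, 1],
    ]
    print(pixel_count_rows(image))
    print(pixel_xor_rows(image))
```

For a 50 × 50 image each function returns 50 elements, matching the code vector lengths given above. Because the Ex-OR fold reduces a row to the parity of its count of 1 (OFF) pixels, it carries less information per row than the pixel count.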
In the first method we counted the ON pixels in each row and then derived a code for a particular numeral. Here we divide every row into two parts and count the ON pixels in each part. Every numeral image contains 50 rows; therefore the code length of this feature is 100 symbols. The count in each half-row can be at most 25, because every row holds 50 bits and is divided into two parts of 50/2 = 25 bits each. The number of ON pixels in a half-row therefore lies in the range 0 to 25, which gives 26 possible values, and the English alphabet contains 26 letters. The counts can thus be coded by the corresponding letters, as shown below.

Table 1. Alphabetical coding logic: numbers represented by corresponding letters

Number  0  1  2  3  4  5  6  7  8  9 10 11 12
Letter  A  B  C  D  E  F  G  H  I  J  K  L  M

Number 13 14 15 16 17 18 19 20 21 22 23 24 25
Letter  N  O  P  Q  R  S  T  U  V  W  X  Y  Z

Code vector for Six:
AAAAAAAAGADADACBACADACADADACADADADACACACABACABACACABABCBKBMBEBCBCCCCCBCCCCBBBCBCBCCDCDCDDDDCBJAAAAAA

DIRECTIONAL FEATURES

Strokes contain a great deal of information about a character, which can be harnessed when recognizing it. If the direction and position of a particular stroke within the character image can be analyzed properly, the recognition rate may improve considerably. Many statistical features used in character recognition are designed according to this idea. Among direction features, gradient features are generally reported to outperform other directional features because they provide higher resolution. To compute the density of line segments in each quantized direction, we use only two masks: the horizontal Sobel mask and the vertical Sobel mask. The phase is quantized into eight directions, as shown below.

Figure 7. Gradient directions

For each quantized phase value, the corresponding magnitudes are added to obtain the total strength in that direction. The angles are then paired up as (0°, ±180°), (45°, −135°), (90°, −90°) and (135°, −45°) to obtain four feature vectors.

Code vector for Eight:
5 1 3 1

CHAIN CODE METHOD

In short, the chain code is a way to represent a binary object by encoding only its boundary. The chain code is composed of a sequence of numbers between 0 and 7. Each number represents the transition between two consecutive boundary pixels, 0 being a step to the right, 1 a step diagonally right/up, 2 a step up, etc.
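The direction convention just described can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: codes 0, 1 and 2 follow the text, while codes 3 to 7 are assumed to continue counter-clockwise in 45° steps (the usual Freeman convention). The function name `chain_code`, the (row, column) coordinate format and the `closed` flag are likewise assumptions, and the boundary pixels are taken as already traced in order.

```python
# Step (d_row, d_col) -> direction code. Rows grow downward in image
# coordinates, so a step "up" has d_row = -1.
DIRECTIONS = {
    (0, 1): 0,    # right                      (from the text)
    (-1, 1): 1,   # diagonally right/up        (from the text)
    (-1, 0): 2,   # up                         (from the text)
    (-1, -1): 3,  # diagonally left/up         (assumed continuation)
    (0, -1): 4,   # left                       (assumed continuation)
    (1, -1): 5,   # diagonally left/down       (assumed continuation)
    (1, 0): 6,    # down                       (assumed continuation)
    (1, 1): 7,    # diagonally right/down      (assumed continuation)
}


def chain_code(boundary, closed=True):
    """Encode an ordered list of (row, col) boundary pixels as direction codes.

    With closed=True the step from the last pixel back to the first is
    included, so the code has one element per boundary pixel.
    """
    path = list(boundary)
    if closed and path:
        path.append(path[0])
    codes = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        step = (r1 - r0, c1 - c0)
        if step not in DIRECTIONS:
            raise ValueError(f"pixels {(r0, c0)} and {(r1, c1)} are not 8-neighbours")
        codes.append(DIRECTIONS[step])
    return codes


if __name__ == "__main__":
    # Hypothetical 2 x 2 square traced clockwise from its top-left pixel.
    square = [(0, 0), (0, 1), (1, 1), (1, 0)]
    print(chain_code(square))
```

Only the differences between consecutive coordinates enter the code, so translating the whole boundary leaves the output unchanged.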

Figure 8. Chain code directions

The chain code thus has as many elements as there are boundary pixels. The chain code encodes the shape of the object, not its location; to recover the location, we need only remember the coordinates of the first pixel in the chain. A chain code encodes a single, solid object. If the object has two disjoint parts, or has a hole, a single chain code cannot describe the full object.

CONCLUSION

This paper reports several methods of feature extraction that are very simple yet promise precise results. The Pixel Count method counts the ON pixels (binarized 0) in every row of the image matrix, yielding fifty feature elements. The Pixel Ex-OR method performs the digital Ex-OR operation on successive binary pixels of the image matrix. Alphabetical coding codes the number of ON pixels in a subdivided row by means of a specific letter of the English alphabet. The effectiveness of a feature extraction method is analyzed or compared on the basis of the accuracy of a recognition system in which it is used together with some classifier to recognize a test pattern. Nevertheless, it can be concluded that the methods proposed in this paper are free of complex mathematical calculations and will hence save considerable processing time. In future work, the proposed features will be used to develop a Bengali numeral recognition system with a robust classifier.

REFERENCES

[1] U. Pal and B. B. Chaudhuri; Indian script character recognition: a survey; Pattern Recognition; 2004, Vol. 37, No. 9, Pg. 1887-1899.
[2] A. Dutta and S. Chaudhury; Bengali alpha-numeric character-recognition using curvature features; Pattern Recognition; 1993, Vol. 26, No. 12, Pg. 1757-1770.
[3] U. Bhattacharya, T. K. Das, A. Datta, S. K. Parui and B. B. Chaudhuri; Recognition of handprinted Bangla numerals using neural network models; Springer, Berlin / Heidelberg; 2002, Pg. 139-161.
[4] R. Bajaj, L. Dey and S. Chaudhury; Devnagari numeral recognition by combining decision of multiple connectionist classifiers; Springer; 2002, Vol. 27 (Part 1), Pg. 59-72.
[5] W. Lu, Y. Lu, Y. Pengfei and S. Pengfei; Handwritten Bangla numeral recognition system and its application to postal automation; Pattern Recognition; 2007, Vol. 40, No. 1, Pg. 99-107.
[6] U. Bhattacharya and B. B. Chaudhuri; Handwritten numeral databases of Indian scripts and multistage recognition of mixed numerals; IEEE Trans. on Pattern Analysis and Machine Intelligence; 2009, Vol. 31, No. 3, Pg. 444-457.
[7] A. Sikdar, S. Banerjee, P. Roy, S. Mukherjee and M. Das; Bengali printed character recognition using a feature based chain code method; Advances in Image and Video Processing; 2014, Vol. 2, No. 3, Pg. 01-09.
[8] Muhammad Arif Mohamad, Haswadi Hassan, Dewi Nasien and Habibollah Haron; A Review on Feature Extraction and Feature Selection for Handwritten Character Recognition; International Journal of Advanced Computer Science and Applications; 2015, Vol. 6, No. 2, Pg. 204-212.