MANJUSHA K.*, ANAND KUMAR M., SOMAN K. P.


Journal of Engineering Science and Technology, Vol. 13, No. 1 (2018) 141-157
School of Engineering, Taylor's University

IMPLEMENTATION OF REJECTION STRATEGIES INSIDE MALAYALAM CHARACTER RECOGNITION SYSTEM BASED ON RANDOM FOURIER FEATURES AND REGULARIZED LEAST SQUARE CLASSIFIER

MANJUSHA K.*, ANAND KUMAR M., SOMAN K. P.

Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering - Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India
*Corresponding Author: k_manjusha@cb.amrita.edu

Abstract

Robust and reliable recognition is a necessary requirement for optical character recognition systems. Distortions present in the document image and pre-processing errors lead optical character recognition systems to apply rejection policies in order to achieve reliable recognition in computer assisted applications. The objective of this paper is to implement a robust and reliable character recognition system for the Malayalam language. Random Fourier features, classified with a regression classifier based on the Regularized Least Square loss function, can approximate non-linear kernel machines. A baseline Malayalam character recognition system based on Random Fourier features and a Regularized Least Square regression classifier is implemented in this paper. On top of this baseline character recognition system, rejection strategies are applied and evaluated on real world document images. An improvement in recognition accuracy is achieved with the simulated Malayalam character recognition system at the cost of rejecting character images having a low classification score.

Keywords: Character recognition, Random Fourier features, Regularized least square classifier, Rejection approach, Accuracy - rejection curve.

1. Introduction

Optical Character Recognition (OCR) can be applied in a wide variety of applications, to speed up data entry or to automate data collection from document images.
OCR tries to convert images of documents, captured through imaging devices, to a machine editable or machine understandable document format. In the case of the Malayalam language, attempts towards building an OCR system have been few, and a complete OCR is still in progress [1-3]. This paper is concerned with the implementation of a robust and reliable character recognition system for Malayalam language documents.

Nomenclatures

c      Number of character classes
d      Dimension of Random Fourier feature
k      Kernel function
n      Number of input data samples
S      Score vector assigned by the classifier
W      Weight matrix in RLS classifier
X      Input data matrix in RLS classifier
Y      Label matrix in RLS classifier
z      Random Fourier feature vector

Greek Symbols

Φ      Mapping function inside kernel
λ      Regularization parameter in RLS classifier
‖.‖_F  Frobenius norm of matrix
δ      Rejection threshold

Abbreviations

ARC    Accuracy - Rejection Curve
GURLS  Grand Unified Regularized Least Squares
OCR    Optical Character Recognition
RBF    Radial Basis Function
RF     Random Fourier
RLS    Regularized Least Square
SVM    Support Vector Machine

Nonlinear kernel machines are important in pattern recognition research due to their excellent capability to model highly nonlinear data. The kernel trick avoids the cost of explicitly mapping input data samples to a high dimensional feature space and allows the classifier to work in an implicit feature space of the data samples. With the help of the kernel trick, kernel machines easily approximate the decision boundary between data classes. The difficulty with kernel machines is that their cost scales quadratically with the number of training data samples (because of kernel matrix creation and storage). This computational complexity makes kernel machines inadequate for working directly with large scale classification problems. To overcome this issue, algorithms based on random sampling have been proposed for approximating the kernel matrix [4] in large scale classification problems. Kernel functions can be approximated using Random Fourier (RF) features [5] and can be effectively utilized in classification problems.
RF features are capable of approximating shift invariant kernel functions and can be combined with linear learning algorithms to achieve the performance level of kernel machines [5-7]. Malayalam character recognition is a large multi-class classification problem. The presence of a large number of similarly shaped characters in the Malayalam language creates the need for a robust character recognizer in Malayalam OCR systems [2]. Support Vector Machine (SVM) classifiers with extended architectures have been evaluated with different feature spaces and found effective in the Malayalam character recognition process [8, 9]. In this paper, the Malayalam character recognition system is built by applying RF features with a Regularized Least Square (RLS) regression classifier.

Generally, OCR systems produce good recognition results on good quality document images. In real world scenarios, the chances of getting good quality document images are low. Document images may contain distortions introduced by defects in the paper, defects that happened during printing, or defects that happened during the digitization process [10]. Besides that, errors made during the preprocessing and segmentation stages of an OCR system affect the overall recognition accuracy. Due to these distortions and errors, the recognition result obtained from well-trained classifiers (trained with a very low error rate) may differ entirely from the expected result. Reliability in recognition has to be introduced in these circumstances to improve the recognition accuracy. Instead of assigning all segmented character image samples to the most probable class, the image samples with a low classification score (the confidence or probability value of the classifier) have to be identified. In computer assisted applications of OCR, image samples with a low classification score can be reported as rejected, which improves the reliability of recognition rather than taking the risk of misclassifying them. Figure 1 shows the architecture of the rejection based approach to character recognition.

[Figure 1: flow chart. Test Character Image -> Character Recognizer (Trained Classifier Model) -> Classification Score Vector -> Classification Reliable? Yes: Target Class Accepted. No: Classification Rejected.]

Fig. 1. Flow chart representing rejection based classification approach.
A critical decision by the classifier, which leads to misclassification, occurs mainly for two reasons. When the data sample is not represented in the data set (an outlier class, or the result of a segmentation error or noise present in the document), the classifier may not be able to identify the character class, and it assigns the data sample a very low classification score. The other reason for misclassification is an overlapping decision boundary between classes (due to similarity between classes), in which case the classification scores for two or more classes are almost the same.

Designing rejection strategies to achieve reliable classification in the above-mentioned circumstances is really difficult. The misclassification rate should decrease monotonically with the rejection rate (the number of image samples rejected with respect to the total number of image samples) by identifying those critical data samples. The optimal trade-off between error rate and rejection rate can be calculated if the conditional probability densities are known [11]. But in almost all real applications, these probability densities are unknown, and the rejection strategy is usually derived from the confidence or reliability measure provided by the classifier for the training data samples [12-15].

In this paper, two rejection strategies based on the classification score are implemented to achieve reliable character recognition for a Malayalam OCR system developed with RF features and a Regularized Least Square (RLS) regression classifier. The performance of the implemented Malayalam OCR system with the rejection strategies is evaluated on real-world document images to analyse the effectiveness of the proposed approach in the context of reliable classification. The baseline Malayalam character recognition system built with RF features and the RLS classifier is discussed in Section 2. The classification rules based on rejection strategies, which are evaluated on top of this baseline, are described in Section 3. The experiments conducted with the baseline recognition system and the rejection strategies are described in Section 4. Finally, the conclusion discusses the work presented in this paper and outlines future work.

2. Baseline Malayalam Character Recognition System

Malayalam belongs to the Dravidian family of languages and is the official language of the state of Kerala [2]. The Malayalam language includes a large number of character classes, with Vowels (V), Consonants (C), Half-Consonants (known as chillu), Vowel Modifiers and Compound characters.
Besides the large number of character classes, script revisions that happened over time and the existence of non-standard font styles are the main challenges in the Malayalam character recognition problem [2]. To implement a robust character recognition system, this paper uses RF features along with an RLS classifier.

2.1. Malayalam character image database

For implementing the character recognition system for printed Malayalam language documents, a character image database (Mal_CharDB) is created using direct (collecting character images from real document images) and synthetic (creating character images by applying various styles and sizes to the text form of Malayalam characters) approaches. Each character image is resized to 32 × 32. Mal_CharDB consists of 130 different character classes, which include independent and dependent vowels, consonants, some commonly used compound characters of the Malayalam language, and digits (0-9). Mal_CharDB contains a total of 24,553 character images. From each character class, 75% of the images are taken for training and the rest are used for testing.
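The per-class split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_per_class`, the fixed random seed and the rounding of the per-class training count are hypothetical choices, not details taken from the paper.

```python
import random
from collections import defaultdict


def split_per_class(labels, train_fraction=0.75, seed=0):
    """Return (train_idx, test_idx), taking train_fraction of every class for training."""
    rng = random.Random(seed)  # assumed seed, only for reproducibility of the sketch
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Rounding of the per-class count is an assumption of this sketch.
        n_train = round(train_fraction * len(indices))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Toy usage: three classes with 8, 12 and 20 images.
labels = [0] * 8 + [1] * 12 + [2] * 20
train_idx, test_idx = split_per_class(labels)
print(len(train_idx), len(test_idx))
```

Splitting inside each class, rather than over the whole database at once, keeps rare character classes represented in both parts.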

2.2. Random Fourier (RF) features

Random Fourier features are inspired by randomization algorithms for approximating kernel functions [5]. Kernel functions define a convenient way of calculating an inner product between data samples without explicitly lifting the data samples to a higher dimensional space. RF features rely on the fact that data samples can be mapped into a randomized feature space of lower dimension, so that the inner product between the data samples in the randomized feature space approximates a positive definite translation invariant kernel function. Let x and y be data points and Φ be the mapping that lifts data points to the higher dimension; then the kernel function k can be defined as shown in Eq. (1).

$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle \tag{1}$$

As per Bochner's theorem, the Fourier transform of a shift invariant kernel k(x - y) is a proper probability distribution function if the kernel is properly scaled. Let p(ω) be the Fourier transform of k; then k(x, y) can be written as in Eq. (2).

$$k(x, y) = k(x - y) = \int p(\omega)\, e^{j\omega^{T}(x - y)}\, d\omega \tag{2}$$

As p(ω) is a probability distribution function, $e^{j\omega^{T}(x - y)}$ is an unbiased estimate of k(x, y) provided that ω is drawn from the probability distribution p. The multivariate vectors $\omega_i$, i = 1, ..., d, can be generated independently from p.

$$\int p(\omega)\, e^{j\omega^{T}(x - y)}\, d\omega = E_{\omega}\!\left[ e^{j\omega^{T}(x - y)} \right] \tag{3}$$

$E_{\omega}[e^{j\omega^{T}(x - y)}]$ can be approximated using the generated $\omega_i$ as shown in Eq. (4).

$$E_{\omega}\!\left[ e^{j\omega^{T}(x - y)} \right] \approx \frac{1}{d} \sum_{i=1}^{d} e^{j\omega_i^{T}(x - y)} \tag{4}$$

The RHS of Eq. (4) can be expanded and is equivalent to an inner product in the function space of z (exponentials of the data samples projected onto the $\omega_i$ vectors sampled from p), and the resulting inner product approximates k(x - y). Here the inner product of complex numbers conjugates its second argument.

$$\frac{1}{d} \sum_{i=1}^{d} e^{j\omega_i^{T}(x - y)} = \frac{1}{d} \sum_{i=1}^{d} \left\langle e^{j\omega_i^{T}x},\, e^{j\omega_i^{T}y} \right\rangle \tag{5}$$

$$= \sum_{i=1}^{d} \left\langle \frac{1}{\sqrt{d}}\, e^{j\omega_i^{T}x},\, \frac{1}{\sqrt{d}}\, e^{j\omega_i^{T}y} \right\rangle \tag{6}$$

$$= \left\langle \begin{pmatrix} \frac{1}{\sqrt{d}}\, e^{j\omega_1^{T}x} \\ \frac{1}{\sqrt{d}}\, e^{j\omega_2^{T}x} \\ \vdots \\ \frac{1}{\sqrt{d}}\, e^{j\omega_d^{T}x} \end{pmatrix},\, \begin{pmatrix} \frac{1}{\sqrt{d}}\, e^{j\omega_1^{T}y} \\ \frac{1}{\sqrt{d}}\, e^{j\omega_2^{T}y} \\ \vdots \\ \frac{1}{\sqrt{d}}\, e^{j\omega_d^{T}y} \end{pmatrix} \right\rangle \tag{7}$$

$$= \langle z(x), z(y) \rangle \tag{8}$$

$$\approx \langle \Phi(x), \Phi(y) \rangle \tag{9}$$

The first vector z(x) inside the inner product in Eq. (8) represents the lifting of data sample x to a randomized feature space of dimension d. In order to avoid computation with complex numbers, the data point can be projected onto cosine and sine bases separately, and the two projections can be appended together to form a vector of dimension 2d. The resultant z(x) can be represented as in Eq. (10).

$$z(x) = \frac{1}{\sqrt{d}} \begin{bmatrix} \cos(\omega_1^{T}x) & \cdots & \cos(\omega_d^{T}x) & \sin(\omega_1^{T}x) & \cdots & \sin(\omega_d^{T}x) \end{bmatrix}^{T} \tag{10}$$

2.3. Regularized least-squares regression (RLS) classifier

A simple linear classification algorithm can be used to approximate the performance of non-linear kernel machines by classifying the extracted RF features. A classifier based on the Regularized Least Squares loss function can be used for this purpose. Regularized least squares multiclass classification is based on an optimization function which minimizes the average loss in classification [16]. In our multi-class classification problem, let c be the total number of classes. X is the data matrix created by appending all training data samples together. If d represents the feature dimension of the data samples, then X has dimension n × d, where n is the total number of data samples. Let Y be the label matrix of the corresponding data samples in X, having dimension n × c. In Y, each row represents the label vector of a data sample, with +1 in the position of the correct label index and -1 in all other entries. The optimization function for the RLS classifier is shown in Eq. (11), where W represents the weight matrix with dimension d × c, which needs to be optimized, and λ is the regularization parameter.

$$\min_{W \in \mathbb{R}^{d \times c}} \; \frac{1}{n} \lVert Y - XW \rVert_F^{2} + \lambda \lVert W \rVert_F^{2} \tag{11}$$

where $\lVert \cdot \rVert_F$ denotes the Frobenius norm of a matrix. The factor n can be multiplied into λ, which is again a scalar, so in the following equations λ represents (n * λ). Let

$$f(W) = \lVert Y - XW \rVert_F^{2} + \lambda \lVert W \rVert_F^{2} \tag{12}$$

$$= \mathrm{Tr}\!\left( (Y - XW)^{T} (Y - XW) \right) + \lambda\, \mathrm{Tr}\!\left( W^{T} W \right) \tag{13}$$

$$= \mathrm{Tr}\!\left( Y^{T} Y \right) + \mathrm{Tr}\!\left( W^{T} X^{T} X W \right) - \mathrm{Tr}\!\left( Y^{T} X W \right) - \mathrm{Tr}\!\left( W^{T} X^{T} Y \right) + \lambda\, \mathrm{Tr}\!\left( W^{T} W \right) \tag{14}$$

The optimum value for W, represented as W*, can be found by setting the derivative of f(W) with respect to W to zero.

$$\frac{\partial f(W)}{\partial W} = 2 X^{T} X W - X^{T} Y - X^{T} Y + 2 \lambda W = 0 \tag{15}$$

$$\left( X^{T} X + \lambda I \right) W = X^{T} Y \tag{16}$$

$$W^{*} = \left( X^{T} X + \lambda I \right)^{-1} X^{T} Y \tag{17}$$

Besides Eq. (17), Cholesky factorization can be applied to solve the linear system of Eq. (16) to find W*. The classification label of a test data sample can be found by projecting the data sample x* (the extracted RF features) onto W* and selecting the label of the class which has the highest projection value. For applying the RLS classifier in the recognition experiments, the GURLS (Grand Unified Regularized Least Squares) package [17] is used, which contains routines for selecting the best possible classification model through automatic parameter selection from the training data samples.

3. Applying Rejection Approach in Recognition

During training, the RLS classifier is provided with the label matrix Y, in which each row represents the label vector corresponding to a data sample in the X matrix. The label vector is of size 1 × c, where c is the total number of different classes considered. If a particular data sample belongs to the i-th class, then the i-th entry in the label vector will be +1 and the rest of the entries will be -1. During training, the RLS classifier tries to minimize the squared error in prediction by optimizing the weight matrix W, and finds W*. During testing, the RLS classifier predicts a score vector of size 1 × c for each test data sample. The i-th entry in the score vector is the classification confidence score assigned to the test data sample for belonging to the i-th class. The ideal situation in the multi-class RLS classifier is that, if the test data sample belongs to class i, the classification score provided for class i is 1 and the score for all other classes is -1.
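The pipeline described so far (the feature map of Eq. (10), the closed-form RLS solution of Eqs. (16)-(17), the +1 / -1 label matrix and the resulting score vectors) can be sketched with NumPy as below. This is a minimal sketch, not the GURLS implementation used in the paper. The RBF kernel form exp(-gamma * ||x - y||^2), the parameter values, the toy data and all function names are assumptions made for illustration.

```python
import numpy as np


def sample_frequencies(input_dim, d, gamma, rng):
    """Draw d frequency vectors omega_i for the assumed kernel exp(-gamma * ||x - y||^2).

    The Fourier transform of this kernel is a Gaussian with covariance
    2 * gamma * I, so each omega_i is sampled from N(0, 2 * gamma * I).
    """
    return rng.normal(scale=np.sqrt(2.0 * gamma), size=(input_dim, d))


def rf_features(X, omega):
    """Map each row of X to the 2d-dimensional vector z(x) of Eq. (10)."""
    projections = X @ omega
    d = omega.shape[1]
    return np.hstack([np.cos(projections), np.sin(projections)]) / np.sqrt(d)


def rls_fit(Z, labels, n_classes, lam):
    """Solve (Z^T Z + lam * I) W = Z^T Y, i.e. Eq. (16), with a +1 / -1 label matrix."""
    n_samples, n_features = Z.shape
    Y = -np.ones((n_samples, n_classes))
    Y[np.arange(n_samples), labels] = 1.0
    A = Z.T @ Z + lam * np.eye(n_features)
    return np.linalg.solve(A, Z.T @ Y)


# Toy data (assumed): three well separated 2-D clusters, 30 samples each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
X = centers[labels] + 0.3 * rng.standard_normal((90, 2))

omega = sample_frequencies(input_dim=2, d=1000, gamma=0.5, rng=rng)
Z = rf_features(X, omega)
W = rls_fit(Z, labels, n_classes=3, lam=1e-3)

S = Z @ W                                  # one score vector per row
predicted = S.argmax(axis=1)               # highest projection value wins
train_accuracy = np.mean(predicted == labels)

# Inner products of RF features approximate the kernel, as in Eqs. (8)-(9).
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
kernel_error = np.abs(Z @ Z.T - np.exp(-0.5 * sq_dist)).max()
```

Solving the linear system directly, rather than forming the inverse in Eq. (17), is the usual numerical choice; a Cholesky-based solver could be substituted because the system matrix is symmetric positive definite for λ > 0.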
But in real cases, the classification score differs from the ideal situation: the score can deviate from 1 or -1, and the classification score provided by the RLS classifier will be in the range (-1-α, 1+β). Three classification rules are formulated based on the classification confidence score vector S generated by the RLS classifier; they are described in Sections 3.1, 3.2 and 3.3.

3.1. Zero-rejection (Max_Rule)

The most commonly used way of deciding the target class of a data sample from the classification score is to assign the data sample to the class with the highest score in the classification score vector. This classification rule is referred to as Max_Rule in the rest of the paper. This rule assigns a target class label to every data sample, without rejection. Max_Rule does not provide reliability in classification because, even if the score of the target class is very low, it still assigns that class label to the data sample. This approach may be intended for applications where computer assistance in recognition is not available (no recognition verification facility is available). If S is the 1 × c classification score vector assigned to the data sample, then the target class label assigned to that data sample can be represented as shown in Eq. (18).

$$\mathrm{Max\_Rule}(S) = i, \quad \text{if } S(i) \geq S(j) \;\; \forall\, j \neq i,\; j \in [1, c] \tag{18}$$

3.2. Rejection based on score value (SR_Max_Rule)

Max_Rule can be modified so that, instead of assigning target classes to all data samples, data samples classified with a maximum score value below the rejection threshold δ are rejected. According to SR_Max_Rule, only a classification score vector with its maximum value above δ is termed a reliable classification, and all data samples whose classification score vector has a maximum value less than or equal to δ are rejected. SR_Max_Rule divides the data samples into two regions, accepted and rejected; it assigns target class labels only to the data samples in the accepted region and assigns the value -1 to data samples in the rejected region. Equation (19) represents SR_Max_Rule.

$$\mathrm{SR\_Max\_Rule}(S) = \begin{cases} \mathrm{Max\_Rule}(S), & \text{if } \max(S) > \delta \\ -1, & \text{if } \max(S) \leq \delta \end{cases} \tag{19}$$

3.3. Rejection based on difference in score (DR_Max_Rule)

Instead of using the maximum value in the classification score vector, the difference between the first and second maximum values in the classification score vector can be used for evaluating reliability in recognition. Let S_1 represent the highest classification score value inside S and S_2 the second highest score value inside S. Then the classification of that data sample is considered reliable only if the difference between S_1 and S_2 is greater than the distance-reject threshold δ. If classes have an overlapping decision boundary and the classifier has to take a critical decision between those classes, this rejection strategy will reject those data samples instead of misclassifying them.

$$\mathrm{DR\_Max\_Rule}(S) = \begin{cases} \mathrm{Max\_Rule}(S), & \text{if } (S_1 - S_2) > \delta \\ -1, & \text{if } (S_1 - S_2) \leq \delta \end{cases} \tag{20}$$

3.4. Proposed rejection based approach

For the classification rules described in Sections 3.2 and 3.3, the rejection threshold has to be estimated from the validation dataset. The rejection threshold estimation is based on the Accuracy - Rejection curve (ARC). The threshold estimation algorithm selects the rejection threshold that gives a desirable recognition accuracy. The algorithm can be modified so that the rejection threshold is instead selected within a desirable rejection rate. In the algorithm, 'Correct' and 'Number' represent functions which calculate the number of correctly classified images and the total number of images, respectively, in the dataset passed as parameter. The algorithm iterates through different rejection thresholds and selects the minimum threshold value which achieves the desirable recognition accuracy among the accepted images.
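The rules of Sections 3.1 to 3.3 and the threshold search outlined in Section 3.4 can be sketched in plain Python as follows. The sketch makes several assumptions that are not in the paper: class indices are 0-based, the threshold step of 0.05 and the upper bound `max_delta` (added so the search terminates when the desired accuracy is never reached) are illustrative, and the score vectors are toy values.

```python
REJECT = -1  # value assigned to rejected samples, as in Eqs. (19) and (20)


def max_rule(scores):
    """Eq. (18): index of the highest score (0-based in this sketch)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def sr_final_score(scores):
    """FinalScore for SR_Max_Rule: the maximum score."""
    return max(scores)


def dr_final_score(scores):
    """FinalScore for DR_Max_Rule: highest minus second highest score."""
    first, second = sorted(scores, reverse=True)[:2]
    return first - second


def classify_with_rejection(scores, final_score, delta):
    """Accept the Max_Rule label only if FinalScore exceeds delta."""
    return max_rule(scores) if final_score(scores) > delta else REJECT


def estimate_threshold(score_vectors, true_labels, final_score,
                       desired_acc, step=0.05, max_delta=2.0):
    """Smallest delta (on a grid) reaching desired_acc among accepted samples.

    Returns (delta, accuracy %, rejection rate %) or None if no delta works.
    step and max_delta are assumptions of this sketch.
    """
    predicted = [max_rule(s) for s in score_vectors]
    finals = [final_score(s) for s in score_vectors]
    delta = 0.0
    while delta <= max_delta:
        accepted = [i for i, f in enumerate(finals) if f > delta]
        if not accepted:
            return None
        correct = sum(predicted[i] == true_labels[i] for i in accepted)
        acc = 100.0 * correct / len(accepted)
        rej_rate = 100.0 * (len(finals) - len(accepted)) / len(finals)
        if acc >= desired_acc:
            return delta, acc, rej_rate
        delta += step
    return None


# Toy validation scores: the second sample is misclassified with a low score.
val_scores = [[0.9, -0.8, -0.9],
              [0.22, 0.1, -0.9],
              [-0.7, 0.8, -0.6],
              [-0.9, -0.2, 0.6]]
val_labels = [0, 1, 1, 2]

sr_result = estimate_threshold(val_scores, val_labels, sr_final_score, 100.0)
dr_result = estimate_threshold(val_scores, val_labels, dr_final_score, 100.0)
sr_labels = [classify_with_rejection(s, sr_final_score, sr_result[0])
             for s in val_scores]
```

Passing the FinalScore function as an argument lets the same search serve both SR_Max_Rule and DR_Max_Rule, which mirrors how the algorithm in Section 3.4 differs between the two rules only in step 1.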

Rejection Threshold Estimation based on desirable recognition accuracy

Input:
1) Classification score vector S of character images in the validation dataset
2) Desirable accuracy DAcc

Algorithm:
1. For each image in the validation dataset (Val), calculate FinalScore.
   In case of SR_Max_Rule, FinalScore = max(S)
   In case of DR_Max_Rule, FinalScore = (S_1 - S_2)
2. Set the initial rejection threshold, δ = 0
3. Decide the accepted images Accept, whose FinalScore > δ
   Calculate the recognition accuracy Acc among accepted images
   Acc = Correct(Accept)/Number(Accept) × 100
4. Decide the rejected images Reject, whose FinalScore <= δ
   Calculate the rejection rate RejRate
   RejRate = Number(Reject)/Number(Val) × 100
5. If Acc >= DAcc, go to step 6. Else increment δ = δ + Δδ, go to step 3.
6. Stop

In rejection based approaches, the estimation of rejection thresholds is crucial. The main challenge is that the rejection threshold should be chosen such that all correctly classified images are accepted while all misclassified images are rejected. Depending on the validation dataset, the estimated rejection threshold may change and can thus affect the overall performance of rejection based recognition systems.

4. Experimental Results and Discussion

The baseline character recognition system (zero rejection) with RF features and the RLS multi-class classifier is built in the MATLAB environment. 19,162 character images from Mal_CharDB are used for training and 5,391 character images (the validation dataset) are used for evaluating the recognition system. Recognition accuracy is calculated as the percentage of correctly classified test character images among the total tested images. Likewise, the misclassification rate is the percentage of incorrectly classified character images among the total tested images.

4.1. Experiment 1: RF - RLS based character recognition

The first experiment is to find a suitable dimension of the RF feature that maximizes the accuracy of the recognition system with respect to the Malayalam character image database. The RF feature z(x) is extracted from the character images as defined in Eq. (10). The dimension of z(x) is determined by the number of random vectors sampled from the Fourier transform of the Radial Basis Function (RBF). If the number of random vectors is d, then 2*d is the size of the RF feature vector. An RLS classifier model is built on the RF features extracted from character images for different values of d, and the classification accuracy is evaluated over the test character images. The recognition accuracy changes with the random vectors drawn from the probability distribution, so instead of taking a single recognition accuracy for a particular d, the average recognition accuracy is calculated from 10 trials with different random vectors. Figure 2 shows the variation in average recognition accuracy with the increase in d.

Fig. 2. Average validation accuracy obtained for the Malayalam character recognition system, plotted against the number of random sampling vectors.

The average recognition accuracy of the system on the validation dataset rises steeply with the initial increase in dimension d. As the value of d approaches 1000, a recognition accuracy near 99% is achieved, and after that the recognition accuracy increases only very slowly with d. Table 1 shows the average recognition accuracy achieved for different values of d, from 1000 to 10,000 in steps of 1000. Even in higher dimensions, the recognition accuracy is still improving, at the cost of heavy computation. Up to d = 5000, there is a noticeable improvement in accuracy with the increase in feature dimension. Beyond that the improvement is very low: the accuracy increases by only 0.03 when the feature dimension is raised from 5000 to 10,000. In the remaining experiments, the dimension d is fixed at 5000 to avoid heavy computation.

The recognition score corresponding to the target class assigned to the validation images by the recognition system is analysed. Figures 3(a) and (b) show histograms of the highest recognition score and of the difference between the first and second highest recognition scores, respectively, for the misclassified character images in the validation dataset. For most of the misclassification cases, the recognition score is concentrated in the lowest region of the graph in both cases. Misclassification in the presence of a high RLS recognition score occurs in the case of similarly shaped characters. These histograms suggest that rejection approaches based on the recognition score can detect most of the misclassifications in the outcome of the character recognition system.
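The two quantities examined for misclassified validation images (the highest score, and the gap between the two highest scores) can be collected as in the sketch below. The helper names, the 0-based class indices, the four-bin histogram and the toy score vectors are assumptions for illustration and do not reproduce the data behind Fig. 3.

```python
def misclassified_scores(score_vectors, true_labels):
    """Return (top scores, top-two margins) for misclassified samples only."""
    top_scores, margins = [], []
    for scores, truth in zip(score_vectors, true_labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if ranked[0] != truth:  # Max_Rule picked the wrong class
            top_scores.append(scores[ranked[0]])
            margins.append(scores[ranked[0]] - scores[ranked[1]])
    return top_scores, margins


def histogram(values, low, high, n_bins):
    """Count values per equal-width bin; values outside [low, high] are clamped."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in values:
        k = int((value - low) / width)
        counts[max(0, min(k, n_bins - 1))] += 1
    return counts


# Toy score vectors: the second and fourth samples are misclassified.
scores = [[0.9, -0.8, -0.9],
          [0.22, 0.1, -0.9],
          [-0.7, 0.8, -0.6],
          [-0.3, -0.2, -0.5]]
truth = [0, 1, 1, 0]

top, margin = misclassified_scores(scores, truth)
top_counts = histogram(top, low=-1.0, high=1.0, n_bins=4)
```

If most misclassified samples fall into the low bins of either histogram, a threshold on that quantity will reject them while leaving confidently classified samples untouched.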
For testing purposes, 67 document images collected from various sources are considered. A level-set based active contour method [18] is used for segmenting characters from the document images. Among the 22,712 segmented character images, 833 images represent characters that are not present in Mal_CharDB (these are denoted as NDB). 483 images have segmentation errors and are denoted as SE. SE and NDB test character images together form the error data samples (ERROR). Image pixel values (IMG) can be used directly as features and are proven feature descriptors in the character recognition process [8]. Histogram of Oriented Gradients (HOG) is capable of producing strong feature descriptors in image classification tasks [19]. IMG and HOG features are compared with RF features, and the recognition accuracy obtained on the test dataset is listed in Table 2. To classify the IMG and HOG features, a Support Vector Machine (SVM) classifier (linear and RBF kernel) is utilized. The RF feature performs better than the other recognizers, with 88.08%.

Table 1. Average recognition accuracy of the character recognition system based on RF features for different values of d on the validation dataset.

No. of random sampling vectors, d    Average recognition accuracy, %
1000                                 99.00
2000                                 99.43
3000                                 99.52
4000                                 99.59
5000                                 99.63
6000                                 99.63
7000                                 99.63
8000                                 99.65
9000                                 99.66
10000                                99.66

Fig. 3. (a) Highest classification score for misclassified character images in the validation dataset, (b) Difference between the highest and second highest classification score for misclassified validation character images.

Table 2. Recognition accuracy on test dataset.

Feature    Classifier    Recognition accuracy (%)
IMG        Linear SVM    87.21
IMG        RBF SVM       87.59
HOG        Linear SVM    87.74
HOG        RBF SVM       80.82
RF         RLS           88.08

4.2. Experiment 2: Estimating rejection thresholds

This experiment estimates the optimal rejection threshold values for SR_Max_Rule and DR_Max_Rule by analysing their effect on the recognition outcome of the baseline recognition system. In SR_Max_Rule, test character images are accepted only if the maximum RLS classification score assigned to them is greater than the rejection threshold, and recognition accuracy is calculated among the accepted character images. For SR_Max_Rule, the Accuracy - Rejection curve is plotted for different rejection thresholds based on the recognition outcome on the validation dataset, and is shown in Fig. 4.

Fig. 4. Accuracy - Rejection curve plotted for SR_Max_Rule on the recognition outcome obtained from the RLS classifier on the validation dataset.

The rejection threshold can be selected from this curve either by choosing the desired recognition accuracy among accepted character images or by limiting the rejection rate of the system to a particular value. In Fig. 4, the recognition accuracy increases at the cost of an increase in rejection rate; the relation between rejection rate and recognition accuracy is monotonic. SR_Max_Rule achieves 100% recognition accuracy by rejecting 4.23% of the character images at a rejection threshold of 0.26. The rejected character images can be labelled as unreliable and presented to the user for easy error correction.

In DR_Max_Rule, the difference between the highest and second-highest classification scores assigned to a test character image by the RLS classifier is calculated, and character images are rejected based on this difference. The idea is that if the classifier has clearly distinguished the test character image as belonging to one particular class rather than any other, then the classification score assigned to the target class will be very high compared to the classification scores of the other classes.

The recognition accuracy among accepted validation character images and the rejection rate according to DR_Max_Rule for different rejection thresholds are plotted in Fig. 5. At a difference rejection threshold of 0.48, the system obtained 100% recognition accuracy among accepted character images by rejecting 1.52% of all character images. Compared to SR_Max_Rule, DR_Max_Rule obtained 100% recognition accuracy by rejecting far fewer character images.

Fig. 5. Accuracy - Rejection curve plotted for DR_Max_Rule on the recognition outcome obtained from the RLS classifier on the validation dataset.

4.3. Experiment 3: Applying rejection approach in recognition

The aim of this experiment is to evaluate Max_Rule, SR_Max_Rule and DR_Max_Rule in real document image recognition. The classification rules are evaluated on the test dataset (containing 22,712 images) described at the end of Experiment 1. The ERROR detection rate is calculated as the percentage of ERROR images rejected by the classification rule among all the ERROR images present in the dataset. All 22,712 character images are tested with the same recognition system, and the classification scores obtained from the RLS classifier are passed to Max_Rule, SR_Max_Rule and DR_Max_Rule. The rejection thresholds estimated in Experiment 2 for SR_Max_Rule and DR_Max_Rule are used. The recognition accuracy is calculated both among the accepted (reliable) classifications and among all the tested character images. The rejection rates of the classification rules, along with the recognition accuracy, are tabulated in Table 3.

Max_Rule classifies the tested character images with 88.08% accuracy without rejecting any character image. This is the actual classification accuracy of the implemented character recognition system. Max_Rule does not check the reliability of a classification; it simply assigns a target label to every tested character image. The recognition accuracy of SR_Max_Rule among accepted test character images is 97.62%, and the rule rejected 29.15% of all the tested character images. The same rule rejected 99.09% of the ERROR data samples present in the dataset. The rejection rate of DR_Max_Rule is 14.75%, about half that of SR_Max_Rule, and it achieved 96.03% recognition accuracy among accepted character images. Among the ERROR character images, DR_Max_Rule correctly rejected 90.65%. SR_Max_Rule performs better than DR_Max_Rule at identifying the ERROR test character images, but at the cost of a high rejection rate. The combination of SR_Max_Rule and DR_Max_Rule is also evaluated on the test dataset. It gives a slight improvement in recognition accuracy, with a slightly higher rejection rate than either rule alone.

Table 3. Performance of different classification rules in real-world document image recognition.

Classification rule          Rejection rate (%)   Accuracy among accepted images (%)   Accuracy among all test images (%)   ERROR detection rate (%)
Max_Rule                     -                    88.08                                88.08                                -
SR_Max_Rule                  29.15                97.62                                69.17                                99.09
DR_Max_Rule                  14.75                96.03                                81.86                                90.65
SR_Max_Rule + DR_Max_Rule    29.18                97.64                                69.15                                99.09

The rejection rules do not actually improve the recognition accuracy of the baseline system; rather, they help to identify probable misclassifications in the recognition outcome. Rejection approaches thus help in finding unreliable classifications and open an opportunity to improve recognition performance through further processing. The overall performance of the classification rules on the recognition outcome of the test dataset is visualized in Fig. 6. With the baseline recognition system, the recognition accuracy obtained without rejecting any character image (with Max_Rule) is 88.08%. This implies that 11.92% of the character images were misclassified during recognition. If the rejection rules can detect these misclassifications correctly, there is a chance of improving the accuracy of the baseline character recognition system by applying further processing to the rejected character images. Ideally, the rejection rules should reject all misclassifications and accept all correct classifications. Of the 29.15% of test character images rejected by SR_Max_Rule, 10.24% (measured against all test images) were misclassified in the recognition process. This implies that SR_Max_Rule could not detect 1.69% misclassified images. DR_Max_Rule could detect only 8.54% out of the 11.92% misclassified character images, which implies that 3.38% of misclassifications were accepted by DR_Max_Rule. The combination of both rules could detect 10.25% misclassified character images, reducing the undetected misclassified character images to 1.67%.
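The rules and the measures reported in Table 3 can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the `REJECT` marker, the function names, the strict comparison used for DR_Max_Rule, and the way the two rules are combined (accept only when both accept, which is consistent with the combined rejection rate being at least as high as either rule's) are all assumptions. ERROR images are given a true label that matches no class, so they can never count as correct.

```python
import numpy as np

REJECT = -1  # hypothetical marker for a rejected image (not from the paper)

def _top_and_margin(scores):
    """Highest score and (highest - second-highest) score for each row."""
    top2 = np.partition(scores, -2, axis=1)[:, -2:]
    return top2[:, 1], top2[:, 1] - top2[:, 0]

def max_rule(scores):
    """Accept every image and assign the class with the highest score."""
    return scores.argmax(axis=1)

def sr_max_rule(scores, threshold):
    """Accept only if the highest score exceeds the threshold."""
    top, _ = _top_and_margin(scores)
    return np.where(top > threshold, scores.argmax(axis=1), REJECT)

def dr_max_rule(scores, threshold):
    """Accept only if the top-two score difference exceeds the threshold."""
    _, margin = _top_and_margin(scores)
    return np.where(margin > threshold, scores.argmax(axis=1), REJECT)

def combined_rule(scores, sr_threshold, dr_threshold):
    """Assumed combination: accept only when both rules accept."""
    sr = sr_max_rule(scores, sr_threshold)
    dr = dr_max_rule(scores, dr_threshold)
    return np.where((sr != REJECT) & (dr != REJECT), sr, REJECT)

def evaluate(decisions, true_labels, is_error):
    """Rejection rate, accuracies and ERROR detection rate, all in percent."""
    accepted = decisions != REJECT
    correct = accepted & (decisions == true_labels)
    nan = float("nan")
    return {
        "rejection_rate": 100.0 * np.mean(~accepted),
        "accuracy_accepted": 100.0 * correct.sum() / accepted.sum() if accepted.any() else nan,
        "accuracy_all": 100.0 * np.mean(correct),
        "error_detection_rate": 100.0 * np.mean(~accepted[is_error]) if is_error.any() else nan,
    }

def accuracy_rejection_curve(rule, scores, true_labels, is_error, thresholds):
    """(threshold, rejection rate, accuracy among accepted) for each threshold."""
    points = []
    for t in thresholds:
        m = evaluate(rule(scores, t), true_labels, is_error)
        points.append((t, m["rejection_rate"], m["accuracy_accepted"]))
    return points

# Toy scores (made-up values): rows are images, columns are classes.
scores = np.array([[0.90, 0.10, 0.00],    # correct, confident
                   [0.40, 0.35, 0.10],    # misclassified (true class is 1)
                   [0.20, 0.10, 0.15],    # ERROR image, belongs to no class
                   [0.10, 0.20, 0.80]])   # correct, confident
true_labels = np.array([0, 1, -2, 2])     # -2 marks an ERROR image here
is_error = np.array([False, False, True, False])

sr_metrics = evaluate(sr_max_rule(scores, 0.3), true_labels, is_error)
dr_metrics = evaluate(dr_max_rule(scores, 0.1), true_labels, is_error)
curve = accuracy_rejection_curve(sr_max_rule, scores, true_labels, is_error,
                                 thresholds=[0.0, 0.3, 0.5])
```

A threshold can then be read off `curve` in either of the two ways described in Experiment 2: take the smallest threshold whose accuracy among accepted images reaches a target, or the largest threshold whose rejection rate stays under a chosen limit.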
Along with the detection of unreliable classifications, the other measure used to evaluate a rejection rule is the presence of correct classifications in the rejected region. Even though further processing is possible on rejected character images, the share of correctly classified images in the rejected region should be as low as possible. SR_Max_Rule accepts only 69.17% out of the 88.08% correctly classified character images (both figures are percentages of all test images). SR_Max_Rule therefore rejected 18.91% correctly classified character images, whereas DR_Max_Rule rejected only 6.21%. Since further processing can be done on the rejected character images, what actually matters is the misclassifications present among the accepted images, so SR_Max_Rule is more suitable than DR_Max_Rule even though its rejection rate is about double.
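The percentages quoted in this analysis can be re-derived from Table 3. The sketch below does so under one assumption: an accepted image keeps the label that Max_Rule would have assigned, so the correct classifications lost by a rule equal the baseline accuracy minus that rule's accuracy among all test images. The function name `breakdown` is hypothetical; every input and output is a percentage of the whole test set.

```python
def breakdown(rejection_rate, accuracy_all, baseline_accuracy=88.08):
    """Split a rule's rejected share into correct and misclassified parts."""
    misclassified_total = 100.0 - baseline_accuracy
    # Correct classifications that the rule rejected.
    correct_rejected = baseline_accuracy - accuracy_all
    # The rest of the rejected share consists of misclassified images.
    misclassified_rejected = rejection_rate - correct_rejected
    # Misclassifications that slipped through as accepted.
    misclassified_accepted = misclassified_total - misclassified_rejected
    return {
        "correct_rejected": round(correct_rejected, 2),
        "misclassified_rejected": round(misclassified_rejected, 2),
        "misclassified_accepted": round(misclassified_accepted, 2),
    }

# (rejection rate, accuracy among all test images) taken from Table 3.
table3 = {
    "SR_Max_Rule": (29.15, 69.17),
    "DR_Max_Rule": (14.75, 81.86),
    "SR_Max_Rule + DR_Max_Rule": (29.18, 69.15),
}
results = {name: breakdown(*row) for name, row in table3.items()}

for name, parts in results.items():
    print(name, parts)
```

The derived values agree with the figures quoted in the text to within 0.01 percentage points, a gap that is presumably due to rounding in the published table.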

The misclassifications present among the accepted images for the combined rule are mainly due to the similarity in shape between character classes. The error due to similarly shaped classes can be reduced by applying a re-check to those particular classes. The risk involved in rejecting character images is lower than the risk of misclassifying them. Further classification and the use of language information during post-processing are possible actions that can be applied to rejected character images.

Fig. 6. Performance analysis of the classification rules on the recognition outcome of the baseline recognition system in real document recognition: (a) Max_Rule, (b) SR_Max_Rule, (c) DR_Max_Rule, (d) Combination of rules.

5. Conclusion and Future work

Reliable recognition is a necessary requirement in most pattern recognition applications. Rejection strategies can be applied to the recognition outcome to identify unreliable classifications. In this paper, we experiment with rejection strategies in a Malayalam character recognition system to achieve reliable recognition results. For implementation purposes, an image database (Mal_CharDB) is created with 130 different character classes. The baseline Malayalam recognition system is created using Random Fourier (RF) features and a Regularized Least Square (RLS) multi-class classifier. At RF feature dimension 5000, the baseline recognition system achieved 99.63% recognition accuracy on Mal_CharDB.

Histogram analysis of the classification scores obtained for the images shows that most of the misclassifications occurred in the lower region of the classification score. Two rejection rules are examined in this paper: the first is based on the highest classification score value (SR_Max_Rule) and the other is based on the difference between the highest and second-highest classification scores (DR_Max_Rule). The rejection threshold values for the two rules are calculated from the Accuracy - Rejection curve. The effectiveness of the rejection rules is evaluated on segmented images extracted from real-world document images. SR_Max_Rule achieved 97.62% recognition accuracy among accepted character images by rejecting 29.15% of the character images. It also rejected 99.09% of the ERROR character images in the real-world test dataset. DR_Max_Rule has a lower rejection rate of 14.75% and accepts most of the correctly classified character images. But as the focus of the paper is on detecting the misclassified and ERROR character images through rejection methods, SR_Max_Rule performs better than DR_Max_Rule. The combination of both rules achieves a slightly better rate of rejected misclassifications than SR_Max_Rule alone. Analysis of the misclassifications present among accepted character images shows that they occur mostly because of the high similarity in character shapes. Further classification, or applying character context information to rejected character images, may further improve the recognition accuracy of the baseline character recognition system. Future work includes improving recognition accuracy with the help of multiple classifier decisions or language modelling.

References

1. Pal, U.; and Chaudhuri, B.B. (2004). Indian script character recognition: A survey. Pattern Recognition, 37(9), 1887-1899.
2. Govindaraju, V.; Setlur, S. editors. (2009). Guide to OCR for Indic Scripts. Springer.
3. Krishnan, P.; Sankaran, N.; Singh, A.K.; and Jawahar, C.V. (2014). Towards a robust OCR system for Indic scripts. 11th IAPR International Workshop on Document Analysis Systems, DAS, 141-145.
4. Achlioptas, D.; McSherry, F.; and Scholkopf, B. (2002). Sampling techniques for Kernel methods. Advances in Neural Information Processing Systems, 335-342.
5. Rahimi, A.; and Recht, B. (2007). Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 1177-1184.
6. Lu, Z.; May, A.; Liu, K.; Garakani, A.B.; Guo, D.; Bellet, A.; Fan, L.; Collins, M.; Kingsbury, B.; Picheny, M.; and Sha, F. (2014). How to scale up kernel methods to be as good as deep neural nets. ArXiv preprint arXiv:1411.4000.
7. Rahimi, A.; and Recht, B. (2009). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in Neural Information Processing Systems, 1(1), 1313-1320.
8. Neeba, N.V.; and Jawahar, C.V. (2009). Empirical evaluation of character classification schemes. Seventh International Conference on Advances in Pattern Recognition, IEEE Computer Society, 310-313.
9. Manjusha, K.; AnandKumar, M.; and Soman, K.P. (2015). Experimental analysis on character recognition using singular value decomposition and random projection. International Journal of Engineering and Technology, 7(4), 1246-1255.
10. Doermann, D.S.; and Tombre, K. (2014). Handbook of document image processing and recognition. Springer.
11. Chow, C.K. (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1), 41-46.
12. Cordella, L.P.; Stefano, C.D.; Tortorella, F.; and Vento, M. (1995). A method for improving classification reliability of multilayer perceptrons. IEEE Transactions on Neural Networks, 6(5), 1140-1147.
13. De Stefano, C.; Fontanella, F.; Marcelli, A.; Parziale, A.; and Freca, A.S.D. (2014). Rejecting both segmentation and classification errors in handwritten form processing. Proceedings of International Conference on Frontiers in Handwriting Recognition, ICFHR, 569-574.
14. De Stefano, C.; Sansone, C.; and Vento, M. (2000). To reject or not to reject: that is the question - an answer in case of neural classifiers. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, 30(1), 84-94.
15. Nadeem, M.; Zucker, J.; and Hanczar, B. (2010). Accuracy - rejection curves (ARCs) for comparing classification methods with a reject option. Machine Learning in Systems Biology, (8), 65-81.
16. Rifkin, R.; Yeo, G.; and Poggio, T. (2003). Regularized least squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190, 131-154.
17. Tacchetti, A.; Mallapragada, P.K.; Rosasco, L.; and Santoro, M. (2013). GURLS: A least squares library for supervised learning. Journal of Machine Learning Research, 14, 3201-3205.
18. Kumar, S.S.; Manjusha, K.; and Soman, K.P. (2014). Novel SVD based character recognition approach for Malayalam language script. Recent Advances in Intelligent Informatics, 435-442.
19. Dalal, N.; and Triggs, B. (2005). Histograms of oriented gradients for human detection. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1, 886-893.