Building Multi Script OCR for Brahmi Scripts: Selection of Efficient Features


Md. Abul Hasnat
Center for Research on Bangla Language Processing (CRBLP), Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh.

Brahmi Script Analysis
Features of the graphemes of the characters: baseline (matraa), vertical bar, and curvatures.

Brahmi Script Analysis
Vowels have dependent and independent forms. A vowel changes shape when combined with a consonant. A consonant followed by a consonant creates a new shape.

Bangla Script

Other Scripts: Devanagari, Gurmukhi

Other Scripts: Tibetan, Sharda

Outline
Feature extraction for OCR.
Classification.
Analysis of feature extraction approaches.
Conclusion.

What is Feature Extraction?
Devijver and Kittler define feature extraction as the problem of "extracting from the raw data the information which is most relevant for classification purposes, in the sense of minimizing the within-class pattern variability while enhancing the between-class pattern variability."
Image features are unique characteristics that can represent a specific image: meaningful, detectable parts of the image.
They overcome the vulnerabilities of template matching and reduce the computation cost.

Feature Extraction in OCR
The selection of image features and corresponding extraction methods is probably the most important step in achieving high performance for an OCR system. The preprocessing stage aims to make the image suitable for different feature extraction algorithms.

Feature Extraction in OCR
Properties of image features: robust to transformations; robust to noise; feature extraction efficiency; feature matching efficiency.
Issues in feature extraction:
Invariance: the features remain unchanged when a particular transformation is applied.
Reconstruction: the image can be reconstructed from the extracted features.

Features and Classifiers
Different feature types may need different types of classifiers.
Graph descriptions: structural or syntactic classifiers.
Discrete features: decision trees.
Real-valued features: statistical classifiers.

Types of Features
Feature extraction methods are based on three types of features:
Statistical: projections and profiles; crossing and distance; zoning.
Structural: nodal features; stroke analysis.
Global transformation and shape based: unitary image transform; shape (boundary and region based).

Statistical Features (Projection Histograms)
Introduced in 1956 in a hardware OCR system. Today, this technique is mostly used for segmenting characters, words, and text lines, and for detecting whether an input image is rotated.
Figure: Vertical projection and horizontal projection
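
A minimal sketch of projection histograms, assuming a binary image stored as a list of rows with 1 for ink and 0 for background. The function name and the toy glyph are illustrative assumptions, not part of the original slides.

```python
def projection_histograms(image):
    """Return (horizontal, vertical) projection histograms of a binary image.

    horizontal[r] counts ink pixels in row r; vertical[c] counts ink pixels
    in column c. Pixels are 1 (ink) or 0 (background).
    """
    horizontal = [sum(row) for row in image]
    vertical = [sum(col) for col in zip(*image)]
    return horizontal, vertical


# Toy 5 x 5 glyph (hypothetical): a top bar (matraa-like) and a vertical stem.
glyph = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
horizontal, vertical = projection_histograms(glyph)
print(horizontal)  # [5, 1, 1, 1, 0]
print(vertical)    # [1, 1, 1, 4, 1]
```

The peak in the horizontal histogram marks the top bar, and an all-zero row or column is a candidate cut point for segmentation.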

Statistical Features (Profile)
Count the distance between the bounding box and the edge of a character image. Used to extract the contour of the character image.
Figure: Profile of a character image
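
One way to compute left and right profiles, sketched under the assumption that the image is already cropped to the character's bounding box and stored as rows of 0/1 pixels. Reporting the full width for an empty row is a convention chosen here, not something the slides specify.

```python
def left_right_profiles(image):
    """Per-row distance from the left and right box edges to the first ink pixel.

    Rows without ink get the full image width on both sides (a convention
    assumed for this sketch).
    """
    width = len(image[0])
    left, right = [], []
    for row in image:
        if 1 in row:
            left.append(row.index(1))         # background pixels before first ink
            right.append(row[::-1].index(1))  # background pixels after last ink
        else:
            left.append(width)
            right.append(width)
    return left, right


# Hypothetical 4 x 5 character image.
glyph = [
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
left, right = left_right_profiles(glyph)
print(left)   # [1, 0, 5, 2]
print(right)  # [2, 1, 5, 0]
```

Top and bottom profiles follow by applying the same function to the transposed image.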

Statistical Features (Crossing)
Count the number of transitions from background to foreground pixels along vertical and horizontal lines through the character image.
Figure: Crossing (V = 2, H = 3)
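
A short sketch of crossing counts, assuming 0/1 pixels and treating everything outside the image as background. The helper names and the toy glyph are assumptions for illustration.

```python
def crossings(line):
    """Count background-to-foreground transitions along one scan line."""
    count = 0
    previous = 0  # outside the image counts as background
    for pixel in line:
        if previous == 0 and pixel == 1:
            count += 1
        previous = pixel
    return count


def crossing_features(image):
    """Crossing counts for every row and every column of a binary image."""
    per_row = [crossings(row) for row in image]
    per_column = [crossings(col) for col in zip(*image)]
    return per_row, per_column


# Hypothetical glyph: three stems joined by a bar at the bottom.
glyph = [
    [1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
per_row, per_column = crossing_features(glyph)
print(per_row)     # [3, 3, 1, 0]
print(per_column)  # [1, 1, 1, 1, 1]
```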

Statistical Features (Distance)
Count the distance of the first image pixel detected from the upper and lower boundaries.
Figure: Crossing and distance (U = 6, L = 5, R = 6, B = 7)

Limitations (Projection, Profile, Crossing & Distance)
Scale dependent. Sensitive to rotation. Sensitive to the variability in writing style. Important information about the character shape seems to be lost.

Statistical Features (Zoning)
Divide the character image (matrix) into a certain number of zones (sub-matrices) and apply the computation on each zone separately. The goal of zoning is to obtain local characteristics instead of global characteristics.
Calculations over each zone:
Percentage of black pixels.
Weight of each zone.
The extent to which the sub-matrix shape matches any direction (used for an MLP-based classifier).
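
The "percentage of black pixels" calculation can be sketched as follows. This assumes non-overlapping zones and image dimensions that divide evenly by the zone grid; the function name is hypothetical.

```python
def zone_densities(image, zone_rows, zone_cols):
    """Percentage of ink pixels in each zone, listed row by row.

    Assumes the image height and width are exact multiples of the zone grid
    and that zones do not overlap (a simplification for this sketch).
    """
    zone_h = len(image) // zone_rows
    zone_w = len(image[0]) // zone_cols
    features = []
    for zr in range(zone_rows):
        for zc in range(zone_cols):
            ink = sum(
                image[r][c]
                for r in range(zr * zone_h, (zr + 1) * zone_h)
                for c in range(zc * zone_w, (zc + 1) * zone_w)
            )
            features.append(100.0 * ink / (zone_h * zone_w))
    return features


# Hypothetical 4 x 4 image split into a 2 x 2 grid of zones.
glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
densities = zone_densities(glyph, 2, 2)
print(densities)  # [100.0, 25.0, 25.0, 0.0]
```

Overlapping zones would need a stride smaller than the zone size in both loops.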

Statistical Features (Zoning)
Figure: Zones of a 32 x 24 image, with one-row and two-row overlaps between zones, and a 60 degree path.
Weight matrix (9 x 7): each of the nine rows is 2.5 3.5 4 4 4 3.5 2.5.

Statistical Features (Zoning)
Observations: Additional features are needed to improve the classifier performance. Overlapping between zones enhances the reliability of the features.

Structural Features (Nodal features)
Figure: Nodes extracted from a character image (actual nodes marked)

Global Transformation (Unitary Image Transform)
Reduces the number of features while preserving most of the information. In the transformed space, pixels are ordered by their variance, and the pixels with the highest variance are used as features. Reconstruction ability.
Limitations: Not rotation invariant. Input images have to be exactly the same size (scaling and resampling are necessary if the size can vary).

Global Transformation (Unitary Image Transform)
Several transformation methods:
Karhunen-Loeve (KL): computationally demanding.
Fourier: recommended by Andrew.
Hadamard (or Walsh): recommended by Andrew.
Haar transform.
Cosine: computationally reasonable, better in terms of image compression.
Sine.
Slant transform.
We applied the Discrete Cosine Transform with a Hidden Markov Model (HMM) as a classifier.
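
A pure-Python sketch of a 2-D Discrete Cosine Transform followed by variance-based coefficient selection. This is a generic orthonormal DCT-II written for illustration. It is not the implementation used in the presented system, and the function names are assumptions.

```python
import math


def dct_1d(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(
            signal[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(image):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def top_variance_indices(samples, count):
    """Positions of the `count` coefficients with the highest variance.

    `samples` is a list of equal-length flattened coefficient vectors,
    one per training image.
    """
    n = len(samples)
    variances = []
    for d in range(len(samples[0])):
        values = [s[d] for s in samples]
        mean = sum(values) / n
        variances.append(sum((v - mean) ** 2 for v in values) / n)
    order = sorted(range(len(variances)), key=lambda d: variances[d], reverse=True)
    return order[:count]


# A constant image has all of its energy in the DC coefficient.
flat = [[1.0] * 4 for _ in range(4)]
coeffs = dct_2d(flat)

# Toy coefficient vectors (hypothetical): position 2 varies most, position 0 not at all.
selected = top_variance_indices([[0, 1, 5], [0, 2, -5], [0, 3, 5]], 2)
print(selected)  # [2, 1]
```

All input images must share one size for the coefficient positions to be comparable, which matches the limitation noted for unitary transforms.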

Global Transformation (Discrete Cosine Transform)
Table: Reconstruction result for different variance differences (images not reproduced; the value 0.7 appears as a label)

Table: Number of features for different variance differences
Difference of the variance   0.1   0.3  0.5  0.7  0.9  1.1  1.3  1.5  1.7  1.9
ক                            1227  322  172  118   80   57   45   36   28   25
খ                            1168  294  157   98   64   45   33   24   20   16
গ                            1171  304  142   89   62   48   30   21   16   12

Shape Based
Features are invariant to translation, scale, rotation, blur, and noise. These are the most commonly used image features, and shape representation is the most important issue.
Classified into two categories: boundary-based invariants and region-based invariants.

Shape (Boundary Based)
Explores only the contour information. Two techniques: chain code and Fourier descriptors.
Cannot capture the interior content of the shape. Reconstruction ability.
Limitations: Cannot deal with disjoint shapes.
Decision: Not appropriate for us.
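
To make the chain code idea concrete, here is a small sketch of an 8-direction Freeman chain code. It assumes the boundary has already been traced into an ordered, closed list of (row, column) points. The direction numbering (0 = east, increasing counter-clockwise) is one common convention assumed here, not one stated in the slides.

```python
# Map a (row step, column step) move to its 8-direction code.
# Rows grow downward, so "north" is a row step of -1.
DIRECTIONS = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}


def chain_code(boundary):
    """Chain code of a closed boundary given as ordered (row, col) points.

    Consecutive points must be 8-connected neighbours; the last point is
    linked back to the first to close the contour.
    """
    codes = []
    for (r0, c0), (r1, c1) in zip(boundary, boundary[1:] + boundary[:1]):
        codes.append(DIRECTIONS[(r1 - r0, c1 - c0)])
    return codes


# Hypothetical boundary: the four pixels of a 2 x 2 square, traced clockwise.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
codes = chain_code(square)
print(codes)  # [0, 6, 4, 2]
```

A character made of several disconnected strokes yields one chain per component, which illustrates why boundary-based descriptors handle disjoint shapes poorly.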

Shape (Region Based)
All of the pixels of the image are taken into account to represent the shape. Can capture some of the global properties.
Popular region-based methods: Hu's seven moment invariants; Zernike moments.
Can also be employed to describe disjoint shapes. Reconstruction ability.

Region Based (Hu's moment invariants)
Hu's seven moment invariants are invariant to image translation, scaling, and rotation.

Table: Hu's seven moments
     1        2         3         4         5          6          7
ক    0.22899  0.000128  9.95E-07  4.74E-05  -1.66E-10  -3.15E-07  2.79E-10
খ    0.22096  0.000568  5.19E-06  7.65E-06  -2.96E-11  1.19E-07   3.81E-11
গ    0.22057  0.005055  1.65E-05  1.83E-05  3.12E-10   1.27E-06   7.07E-11

Computing higher orders of Hu's moment invariants is quite complex.
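
A pure-Python sketch of Hu's seven moment invariants, built from normalized central moments in the standard way. It assumes a binary or grayscale image stored as a list of rows; the function name and the toy glyphs are illustrative, and the values it produces are not meant to reproduce the table above.

```python
def hu_moments(image):
    """Return Hu's seven moment invariants of an image (list of rows)."""
    pixels = [
        (x, y, v)
        for y, row in enumerate(image)
        for x, v in enumerate(row)
        if v
    ]
    m00 = sum(v for _, _, v in pixels)
    cx = sum(x * v for x, _, v in pixels) / m00  # centroid, column direction
    cy = sum(y * v for _, y, v in pixels) / m00  # centroid, row direction

    def eta(p, q):
        """Normalized central moment of order (p, q)."""
        mu = sum((x - cx) ** p * (y - cy) ** q * v for x, y, v in pixels)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03    # recurring sums
    c, d = n30 - 3 * n12, 3 * n21 - n03

    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]


# Hypothetical asymmetric glyph and the same glyph rotated by 90 degrees.
glyph = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
rotated = [list(row) for row in zip(*glyph[::-1])]
original_hu = hu_moments(glyph)
rotated_hu = hu_moments(rotated)
```

On a pixel grid the invariance is exact for translations and 90 degree rotations; for arbitrary angles and scales it holds only approximately because of resampling.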

Region Based (Zernike moments)
Allow independent moment invariants to be constructed easily to an arbitrarily high order. Use the concept of orthogonal moments to recover the image.
Zernike moments are invariant to rotation.
Normalized Zernike moments are invariant to translation, scale, and rotation.

Region Based (Zernike moments)
Table: Number of features for different orders of Zernike moment
Order               20  25  30  35  40
Number of features  11  13  16  18  21

Table: Reconstruction result for different orders of Zernike moment (images not reproduced)
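
The feature counts in the table (11, 13, 16, 18, 21) equal the number of non-negative repetitions m allowed at a single order n, that is, values of m with n - m even and m <= n. That reading is an inference from the numbers, not something the slides state. The sketch below enumerates those repetitions and evaluates the standard Zernike radial polynomial; the function names are assumptions.

```python
from math import factorial


def repetitions(order):
    """Non-negative repetitions m valid for order n: m <= n and n - m even."""
    return [m for m in range(order + 1) if (order - m) % 2 == 0]


def radial_polynomial(n, m, rho):
    """Zernike radial polynomial R_nm(rho) for 0 <= rho <= 1."""
    m = abs(m)
    total = 0.0
    for s in range((n - m) // 2 + 1):
        numerator = (-1) ** s * factorial(n - s)
        denominator = (
            factorial(s)
            * factorial((n + m) // 2 - s)
            * factorial((n - m) // 2 - s)
        )
        total += numerator / denominator * rho ** (n - 2 * s)
    return total


feature_counts = {n: len(repetitions(n)) for n in (20, 25, 30, 35, 40)}
print(feature_counts)  # {20: 11, 25: 13, 30: 16, 35: 18, 40: 21}
print(radial_polynomial(2, 0, 0.5))  # -0.5, since R_20(rho) = 2 * rho**2 - 1
```

A full Zernike moment additionally multiplies the radial polynomial by a complex exponential in the polar angle and sums over the pixels inside the unit disc; rotation invariance comes from taking the magnitude.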

Features of the existing open-source OCRs
OCRopus: Features used by the system currently include gradients; singular points of the skeleton; presence of holes; and unary-coded geometric information (location relative to the baseline, and original aspect ratio and skew prior to skew correction).

Features of the existing open-source OCRs
Tesseract: Features used by Tesseract include segments of the polygonal approximation and the direction of the outline.
For a test character, the features are three-dimensional: x position, y position, angle.
For a training character, the features are four-dimensional: x position, y position, angle, length.

Features of the existing open-source OCRs
GOCR: Features used by GOCR include size, skew, and presence of serifs.

Conclusion
Unique features can be extracted from the similar Brahmi scripts.
Zonal features are useful as secondary features.
Nodal features are useful if properly extracted.
Moments are useful primary features: Hu's seven moment invariants, and Zernike moments up to order 40.


Questions

Thank You