Building Multi Script OCR for Brahmi Scripts: Selection of Efficient Features

Md. Abul Hasnat
Center for Research on Bangla Language Processing (CRBLP), Department of Computer Science and Engineering, BRAC University, Dhaka, Bangladesh.

Brahmi Script Analysis
Features of the graphemes of the characters:
- Baseline / Matraa
- Vertical bar
- Curvatures

Brahmi Script Analysis
- Vowels have dependent and independent forms.
- A vowel changes shape when it is attached to a consonant.
- A consonant followed by a consonant creates a new shape (a conjunct).

Bangla Script

Other Scripts: Devanagari, Gurmukhi

Other Scripts: Tibetan, Sharda

Outline
- Feature extraction for OCR
- Classification
- Analysis of feature extraction approaches
- Conclusion

What is Feature Extraction?
- Devijver and Kittler define feature extraction as the problem of extracting from the raw data the information which is most relevant for classification purposes, in the sense of minimizing the within-class pattern variability while enhancing the between-class pattern variability.
- Image features are unique characteristics that can represent a specific image: meaningful, detectable parts of the image.
- Features overcome the vulnerabilities of template matching and reduce the computation cost.
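
The definition above (low within-class variability, high between-class variability) can be made concrete with a simple score for a single scalar feature. The sketch below is only an illustration: the function name `fisher_ratio` and the toy values are assumptions, not something taken from the slides.

```python
import numpy as np

def fisher_ratio(values, labels):
    """Between-class scatter divided by within-class scatter for one feature.

    Higher is better: classes are far apart relative to their own spread.
    Assumes at least one class has non-zero spread (otherwise division by zero).
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    overall_mean = values.mean()
    between = 0.0
    within = 0.0
    for c in np.unique(labels):
        v = values[labels == c]
        between += len(v) * (v.mean() - overall_mean) ** 2
        within += ((v - v.mean()) ** 2).sum()
    return between / within

labels = [0, 0, 0, 1, 1, 1]
good = fisher_ratio([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], labels)  # classes well separated
poor = fisher_ratio([1.0, 5.0, 3.0, 1.1, 5.1, 2.9], labels)  # classes overlap
```

A feature with a larger ratio separates the two toy classes more cleanly, which is the property the definition asks for.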

Feature Extraction in OCR
- The selection of image features and corresponding extraction methods is probably the most important step in achieving high performance for an OCR system.
- The preprocessing stage aims to make the image suitable for different feature extraction algorithms.

Feature Extraction in OCR
Properties of image features:
- Robust to transformations
- Robust to noise
- Feature extraction efficiency
- Feature matching efficiency
Issues in feature extraction:
- Invariants: features remain unchanged when a particular transformation is applied.
- Reconstruction: the image can be reconstructed from the extracted features.

Features and Classifiers
Different feature types may need different types of classifiers:
- Graph descriptions: structural or syntactic classifiers.
- Discrete features: decision trees.
- Real-valued features: statistical classifiers.

Types of Features
Feature extraction methods are based on three types of features:
- Statistical
  - Projections and profiles
  - Crossing and distance
  - Zoning
- Structural
  - Nodal features
  - Stroke analysis
- Global transformation and shape based
  - Unitary image transform
  - Shape (boundary & region based)

Statistical Features (Projection Histograms)
- Introduced in 1956 in a hardware OCR system.
- Today, this technique is mostly used for:
  - Segmenting characters, words, and text lines
  - Detecting whether an input image is rotated
Figures: Vertical Projection, Horizontal Projection

Statistical Features (Profile)
- Count the distance between the bounding box and the edge of a character image.
- Used to extract the contour of the character image.
Figure: Profile of a character image
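
The profile can be sketched as follows, assuming the character image is already cropped to its bounding box and stored as a binary NumPy array. The function name `left_right_profiles` and the choice to report the full width for rows without ink are assumptions made for this illustration.

```python
import numpy as np

def left_right_profiles(img):
    """Per-row distance from the left and right box edges to the first ink pixel.

    Rows that contain no ink get the full image width as distance.
    """
    ink = np.asarray(img) != 0
    width = ink.shape[1]
    has_ink = ink.any(axis=1)
    # argmax returns the index of the first True value in each row
    left = np.where(has_ink, ink.argmax(axis=1), width)
    right = np.where(has_ink, ink[:, ::-1].argmax(axis=1), width)
    return left, right

glyph = np.array([
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
])
left, right = left_right_profiles(glyph)
```

Top and bottom profiles follow from calling the same function on the transposed image.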

Statistical Features (Crossing)
- Count the number of transitions from background to foreground pixels.
Figure: crossing (V = 2, H = 3)
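
A small sketch of the crossing count, assuming a binary NumPy image and counting only background-to-foreground transitions along each row and each column. The function names are illustrative, and treating the area outside the image as background is an assumption.

```python
import numpy as np

def crossings(line):
    """Number of background-to-foreground transitions along one scan line."""
    ink = (np.asarray(line) != 0).astype(int)
    padded = np.concatenate(([0], ink))   # outside the image counts as background
    return int(np.sum(np.diff(padded) == 1))

def crossing_features(img):
    """Crossing counts for every row and every column of the image."""
    img = np.asarray(img)
    rows = [crossings(r) for r in img]
    cols = [crossings(c) for c in img.T]
    return rows, cols

glyph = np.array([
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
])
row_crossings, col_crossings = crossing_features(glyph)
```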

Statistical Features (Distance)
- Count the distance of the first image pixel detected from the upper and lower boundaries.
Figure: crossing and distance (U = 6, L = 5, R = 6, B = 7)
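
One possible reading of the distance feature, sketched under assumptions: distances are measured along a single chosen row and column, from each of the four image boundaries to the first ink pixel. The keys `U`, `B`, `L`, `R` mirror the labels in the figure, but interpreting them as upper, bottom, left and right, as well as the function names and the scan-line choice, are assumptions.

```python
import numpy as np

def first_ink_distance(line):
    """Background pixels before the first ink pixel; len(line) if there is none."""
    ink = np.asarray(line) != 0
    return int(ink.argmax()) if ink.any() else len(ink)

def boundary_distances(img, row, col):
    """Distances to the first ink pixel along one column (U, B) and one row (L, R)."""
    img = np.asarray(img)
    return {
        "U": first_ink_distance(img[:, col]),     # scanning down from the top
        "B": first_ink_distance(img[::-1, col]),  # scanning up from the bottom
        "L": first_ink_distance(img[row, :]),     # scanning right from the left edge
        "R": first_ink_distance(img[row, ::-1]),  # scanning left from the right edge
    }

glyph = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])
distances = boundary_distances(glyph, row=2, col=2)
```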

Limitations (Projection, Profile, Crossing & Distance)
- Scale dependent.
- Sensitive to rotation.
- Sensitive to the variability in writing style.
- Important information about the character shape seems to be lost.

Statistical Features (Zoning)
- Divide the character image (matrix) into a certain number of zones (sub-matrices).
- Apply the computation on each zone separately.
- The goal of zoning is to obtain local characteristics instead of global characteristics.
- Calculation over each zone:
  - Percentage of black pixels.
  - Weight of each zone.
  - Evaluate the extent to which the sub-matrix shape matches any direction. (Used for an MLP based classifier)
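
The first per-zone calculation (percentage of black pixels) can be sketched as below. This is a simplified illustration that assumes non-overlapping zones and an image size that divides evenly into the grid; the function name `zone_densities` is hypothetical, and the zone weighting and direction matching steps are not shown.

```python
import numpy as np

def zone_densities(img, zone_rows, zone_cols):
    """Fraction of ink pixels in each zone of a zone_rows x zone_cols grid."""
    ink = (np.asarray(img) != 0).astype(float)
    height, width = ink.shape
    if height % zone_rows or width % zone_cols:
        raise ValueError("image size must be divisible by the zone grid")
    zone_h, zone_w = height // zone_rows, width // zone_cols
    # Axes after reshape: (zone row, row inside zone, zone column, column inside zone)
    zones = ink.reshape(zone_rows, zone_h, zone_cols, zone_w)
    return zones.mean(axis=(1, 3))

glyph = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
])
densities = zone_densities(glyph, 2, 2)
features = densities.ravel()  # one value per zone, ready for a classifier
```

Multiplying the result by 100 gives percentages. Overlapping zones would need explicit slicing instead of the reshape used here.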

Statistical Features (Zoning)
Figure: Zones of a 32 X 24 image (zones overlap by one row or two rows; 60 degree path)
Weight matrix (9 X 7):
2.5  3.5  4  4  4  3.5  2.5
2.5  3.5  4  4  4  3.5  2.5
2.5  3.5  4  4  4  3.5  2.5
2.5  3.5  4  4  4  3.5  2.5
2.5  3.5  4  4  4  3.5  2.5
2.5  3.5  4  4  4  3.5  2.5
2.5  3.5  4  4  4  3.5  2.5
2.5  3.5  4  4  4  3.5  2.5
2.5  3.5  4  4  4  3.5  2.5

Statistical Features (Zoning)
Observations:
- Additional features are needed to improve the classifier performance.
- Overlapping between zones enhances the reliability of the features.

Structural Features (Nodal features)
Figure: Nodes extracted from a character image (actual nodes)

Global Transformation (Unitary Image Transform)
- Reduction in the number of features.
- Preserves most of the information.
- Pixels of the transformed image are ordered by their variance, and the pixels with the highest variance are used as features.
- Reconstruction ability.
- Limitations:
  - Not rotation invariant.
  - Input images have to be exactly the same size (scaling and resampling are necessary if the size can vary).

Global Transformation (Unitary Image Transform)
Several transformation methods:
- Karhunen-Loeve (KL): computationally demanding
- Fourier: recommended by Andrew
- Hadamard (or Walsh): recommended by Andrew
- Haar transform
- Cosine: computationally reasonable, better in terms of image compression
- Sine
- Slant transform
We applied the Discrete Cosine Transform with a Hidden Markov Model (HMM) as a classifier.
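
A minimal sketch of variance-ranked transform features, assuming an orthonormal 2-D DCT-II built directly with NumPy and a fixed number `k` of retained coefficients. The function names and the top-k rule are illustrative assumptions: the slides select features by a "difference of the variance" threshold whose exact definition is not given, and the HMM classifier is not shown here.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]            # frequency index
    i = np.arange(n)[None, :]            # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)           # DC row has a different scale factor
    return c

def dct2(img):
    """2-D DCT of an image: transform the rows, then the columns."""
    img = np.asarray(img, dtype=float)
    return dct_matrix(img.shape[0]) @ img @ dct_matrix(img.shape[1]).T

def select_by_variance(images, k):
    """Keep the k DCT coefficients with the highest variance across the images."""
    coeffs = np.stack([dct2(im).ravel() for im in images])
    variance = coeffs.var(axis=0)
    keep = np.argsort(variance)[::-1][:k]  # indices sorted by descending variance
    return keep, coeffs[:, keep]

# Toy data: five random 8 x 6 "images" of identical size (a stated requirement).
rng = np.random.default_rng(0)
images = rng.random((5, 8, 6))
keep, features = select_by_variance(images, k=10)
```

Because the transform is orthonormal, an image can be recovered from its full coefficient set with the transposed matrices, which is the reconstruction ability mentioned for unitary transforms.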

Global Transformation (Discrete Cosine Transform)
Table: Reconstruction result of different variance difference (image table; value shown: 0.7)
Table: Number of features for different variance difference
Difference of the variance   0.1    0.3   0.5   0.7   0.9   1.1   1.3   1.5   1.7   1.9
ক                            1227   322   172   118   80    57    45    36    28    25
খ                            1168   294   157   98    64    45    33    24    20    16
গ                            1171   304   142   89    62    48    30    21    16    12

Shape Based
- Features are invariant to translation, scale, rotation, blur and noise.
- Most commonly used image features, where shape representation is the most important issue.
- Classified into two categories:
  - Boundary-based invariants
  - Region-based invariants

Shape (Boundary Based)
- Explores only the contour information.
- Two techniques:
  - Chain code
  - Fourier descriptors
- Cannot capture the interior content of the shape.
- Reconstruction ability.
- Limitations: cannot deal with disjoint shapes.
- Decision: not appropriate for us.
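
For reference, a chain code can be sketched as below, assuming the contour is already available as an ordered, closed list of 8-connected (x, y) points. The direction numbering (0 = east, counted counter-clockwise, with y growing downward as in image coordinates) and the function name are assumptions. Contour tracing itself is not shown.

```python
# Freeman 8-direction codes keyed by (dx, dy); y grows downward in image coordinates.
DIRECTIONS = {
    (1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
    (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7,
}

def chain_code(points):
    """Chain code of a closed contour given as an ordered list of (x, y) points."""
    codes = []
    # Pair each point with its successor, wrapping around to close the contour.
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        codes.append(DIRECTIONS[(x1 - x0, y1 - y0)])
    return codes

# A unit square traversed east, south, west, north.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
codes = chain_code(square)
```

The code sequence describes only the outline, which illustrates why interior content and disjoint parts are out of reach for this representation.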

Shape (Region Based)
- All of the pixels of the image are taken into account to represent the shape.
- Can capture some of the global properties.
- Popular region-based methods:
  - Hu's seven moment invariants
  - Zernike moments
- Can also be employed to describe disjoint shapes.
- Reconstruction ability.

Region Based (Hu's moment invariants)
- Seven moments.
- Hu's invariants have the property of being invariant to:
  - Image translation
  - Scaling
  - Rotation
- Computing higher-order Hu moment invariants is quite complex.
Table: Hu's seven moments
Character   1         2          3          4          5           6           7
ক           0.22899   0.000128   9.95E-07   4.74E-05   -1.66E-10   -3.15E-07   2.79E-10
খ           0.22096   0.000568   5.19E-06   7.65E-06   -2.96E-11   1.19E-07    3.81E-11
গ           0.22057   0.005055   1.65E-05   1.83E-05   3.12E-10    1.27E-06    7.07E-11
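
A sketch of the seven Hu invariants computed from normalized central moments, using the standard textbook formulas and plain NumPy. The function name `hu_moments` and the toy glyph are assumptions for illustration; this is not the code that produced the table values.

```python
import numpy as np

def hu_moments(img):
    """Seven Hu moment invariants of a grayscale or binary image."""
    f = np.asarray(img, dtype=float)
    y, x = np.mgrid[:f.shape[0], :f.shape[1]]
    m00 = f.sum()
    xc, yc = (x * f).sum() / m00, (y * f).sum() / m00   # centroid

    def eta(p, q):
        # Central moment mu_pq, normalized for scale invariance
        mu = ((x - xc) ** p * (y - yc) ** q * f).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03
    c, d = n30 - 3 * n12, 3 * n21 - n03
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ])

# Toy asymmetric glyph, a translated copy, and a copy rotated by 90 degrees.
glyph = np.zeros((8, 8))
glyph[1:7, 2] = 1
glyph[6, 2:6] = 1
glyph[3, 3] = 1
shifted = np.zeros((12, 12))
shifted[3:11, 2:10] = glyph
rotated = np.rot90(glyph)

hu_original = hu_moments(glyph)
hu_shifted = hu_moments(shifted)
hu_rotated = hu_moments(rotated)
```

On this toy example the three feature vectors agree up to floating-point error, which is the translation and rotation invariance listed above. Rotations by other angles on a pixel grid are only approximately invariant because of resampling.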

Region Based (Zernike moments)
- Allow independent moment invariants to be constructed easily to an arbitrarily high order.
- Use the concept of orthogonal moments to recover the image.
- Zernike moments are invariant to:
  - Rotation
- Normalized Zernike moments are invariant to:
  - Translation
  - Scale
  - Rotation

Region Based (Zernike moments)
Table: Number of features for different order of Zernike moment
Order                20   25   30   35   40
Number of features   11   13   16   18   21
Table: Reconstruction result for different order of Zernike moment
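
The feature counts listed for orders 20 to 40 (11, 13, 16, 18, 21) are consistent with counting, for a single order n, the repetitions m with 0 <= m <= n and n - m even. That reading is an inference from the numbers, not something the slides state. The sketch below enumerates those index pairs and evaluates the standard Zernike radial polynomial; the function names are illustrative.

```python
from math import factorial

def zernike_indices(order):
    """(n, m) pairs with n == order, m >= 0 and n - m even."""
    return [(order, m) for m in range(order + 1) if (order - m) % 2 == 0]

def radial_polynomial(n, m, rho):
    """Zernike radial polynomial R_n^m(rho) for 0 <= rho <= 1."""
    m = abs(m)
    if (n - m) % 2:
        raise ValueError("n - |m| must be even")
    total = 0.0
    for s in range((n - m) // 2 + 1):
        coeff = (-1) ** s * factorial(n - s) // (
            factorial(s)
            * factorial((n + m) // 2 - s)
            * factorial((n - m) // 2 - s)
        )
        total += coeff * rho ** (n - 2 * s)
    return total

counts = {n: len(zernike_indices(n)) for n in (20, 25, 30, 35, 40)}
r20_half = radial_polynomial(2, 0, 0.5)   # R_2^0(rho) = 2 * rho**2 - 1
```

A full Zernike moment additionally needs the angular term and a mapping of the image onto the unit disc, which are omitted here.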

Features of the existing open-source OCRs
OCROPUS: Features used by the system currently include:
- Gradients
- Singular points of the skeleton
- Presence of holes
- Unary-coded geometric information:
  - Location relative to the baseline
  - Original aspect ratio and skew prior to skew correction

Features of the existing open-source OCRs
Tesseract: Features used by Tesseract include:
- Segments of the polygonal approximation.
- Direction of the outline.
- For a test character, features are three dimensional: x position, y position, angle.
- For a training character, features are four dimensional: x position, y position, angle, length.

Features of the existing open-source OCRs
GOCR: Features used by GOCR include:
- Size
- Skew
- Presence of serifs

Conclusion
- Unique features can be extracted from the similar Brahmi scripts.
- Zonal features are useful as secondary features.
- Nodal features are useful if properly extracted.
- Moments are useful primary features:
  - Hu's seven features.
  - Zernike features up to order 40.

Questions

Thank You