Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Similar documents
Network Traffic Measurements and Analysis

Classification: Linear Discriminant Functions

Kernel Methods & Support Vector Machines

Slides for Data Mining by I. H. Witten and E. Frank

Machine Learning Techniques for Data Mining

Regularization and model selection

Accelerometer Gesture Recognition

Dietrich Paulus Joachim Hornegger. Pattern Recognition of Images and Speech in C++

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

Hacking AES-128. Timothy Chong Stanford University Kostis Kaffes Stanford University

Final Report: Kaggle Soil Property Prediction Challenge

Classification of objects from Video Data (Group 30)

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Dimension Reduction of Image Manifolds

CS6716 Pattern Recognition

CS 229 Midterm Review

CSE 573: Artificial Intelligence Autumn 2010

Machine Learning: Think Big and Parallel

Short Survey on Static Hand Gesture Recognition

STA 4273H: Statistical Machine Learning

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

Predicting Messaging Response Time in a Long Distance Relationship

Applying Supervised Learning

Classification: Feature Vectors

Announcements. CS 188: Artificial Intelligence Spring Generative vs. Discriminative. Classification: Feature Vectors. Project 4: due Friday.

Character Recognition from Google Street View Images

Artifacts and Textured Region Detection

Fundamentals of Digital Image Processing

Machine Learning. Chao Lan

Lecture Linear Support Vector Machines

CS229 Final Project: Predicting Expected Response Times

Digital Image Processing

Character Recognition

CHAPTER 8 COMPOUND CHARACTER RECOGNITION USING VARIOUS MODELS

Information Management course

Allstate Insurance Claims Severity: A Machine Learning Approach

All lecture slides will be available at CSC2515_Winter15.html

Ulrik Söderström 16 Feb Image Processing. Segmentation

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

A Taxonomy of Semi-Supervised Learning Algorithms

Lecture 7: Decision Trees

Introduction to Automated Text Analysis. bit.ly/poir599

1 Case study of SVM (Rob)

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

CS570: Introduction to Data Mining

Content-based image and video analysis. Machine learning

Machine Learning in Biology

Linear Models. Lecture Outline: Numeric Prediction: Linear Regression. Linear Classification. The Perceptron. Support Vector Machines

Identifying Important Communications

Describable Visual Attributes for Face Verification and Image Search

AN EFFICIENT BINARIZATION TECHNIQUE FOR FINGERPRINT IMAGES S. B. SRIDEVI M.Tech., Department of ECE

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018

Machine Learning / Jan 27, 2010

Performance Evaluation of Various Classification Algorithms

CSE 446 Bias-Variance & Naïve Bayes

Machine Learning : Clustering, Self-Organizing Maps

5. Feature Extraction from Images

ECE 285 Final Project

LECTURE 6 TEXT PROCESSING

Computer Vision for HCI. Topics of This Lecture

7. Boosting and Bagging Bagging

Learning to Recognize Faces in Realistic Conditions

Machine Learning with MATLAB --classification

Face Recognition via Sparse Representation

CS 490: Computer Vision Image Segmentation: Thresholding. Fall 2015 Dr. Michael J. Reale

Computer Vision Group Prof. Daniel Cremers. 8. Boosting and Bagging

Supervised Learning Classification Algorithms Comparison

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from

A study of classification algorithms using Rapidminer

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Visual Representations for Machine Learning

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Global Journal of Engineering Science and Research Management

Automatic Colorization of Grayscale Images

Data Mining Practical Machine Learning Tools and Techniques

Nonparametric Methods Recap

ECG782: Multidimensional Digital Signal Processing

Image Processing. Image Features

Support Vector Machines + Classification for IR

Random Forest A. Fornaser

Cyber attack detection using decision tree approach

Predicting Gene Function and Localization

UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 15: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

Classifying Depositional Environments in Satellite Images

Classification Algorithms in Data Mining

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Part-Based Skew Estimation for Mathematical Expressions

Nearest neighbor classification DSE 220

GENDER CLASSIFICATION USING SUPPORT VECTOR MACHINES

Mini-project 2 CMPSCI 689 Spring 2015 Due: Tuesday, April 07, in class

To earn the extra credit, one of the following has to hold true. Please circle and sign.

Algorithms for Recognition of Low Quality Iris Images. Li Peng Xie University of Ottawa

UNIVERSITY OF OSLO. Faculty of Mathematics and Natural Sciences

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

Transcription:

Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way to convert the equations in pdf form back to LaTeX code. This automatic conversion requires segmentation of characters from the image, identification of these characters and conversion to LaTeX code. This project deals with the first two steps of this problem. We have proposed a novel character segmentation algorithm for printed text and have experimented with some features and classifiers which classify the segmented characters. II. Segmentation We have made two assumptions for printed text- (a) All characters in the image are axis-aligned (b) Pixels corresponding to a single character are connected The first assumption is good because printed text is well structured. If the image is not axisaligned, it can be corrected using convex-hull detection if the rotation is less than 180 but we have assumed that this rotation is absent from our input images. The second assumption holds true for the most of the characters in printed text. Few characters like i, j, = etc. violate this assumption but they can be dealt with separately. Our segmentation algorithm is based on the intuition that if we sum along the rows of an axis-aligned image, the rows containing characters will give a lower sum than the rows containing blank pixels. Let us illustrate our algorithm. Let I be an m n image (m rows, n columns) containing multiple equations. Figure 1: The steps involved in segmentation of characters from an image of equation (order: top to bottom) Our segmentation algorithm proceeds in the following steps - 1. Threshold the image Apply a threshold of 0.9 to I to obtain a binary image B containing 1 in blank pixels and 0 in character pixels. Remove or add blank pixels on the border such that the minimum distance of any character pixel from the image border is 10 pixels. 2. Perform binary erosion We use a 1 10 horizontal mask to perform binary erosion on I. This spreads the character pixels along the rows so that when we 1

sum up the rows of the image, we get a substantial difference between the sums corresponding to blank rows and rows containing characters. These images are solely generated for the purpose of obtaining row split points. 3. Obtain blank rows Sum the horizontally eroded image along rows to get a list of m numbers {r 1, r 2,... r m } where r i = n j=1 I[i, j]. The set {i r i > 0.99 max(r i )} corresponds to the set of blank rows. 4. Split text lines As expected, many blank rows occur next to each other. We take the mid-point of each contiguous stretch and split the image into many images, each containing one row of text. Let m 1, m 2,... m p be the mid-points of contiguous stretches of blank columns. The images I[m i : m i+1, :] are the images containing a row of text for i = 1 to p 1. 5. Split text vertically We perform steps 2-4 for the image of each line using a vertical mask instead of horizontal and column sums instead of row sums to obtain column splits and divide each text row image further. 6. Obtain segmented characters Most of the characters are segmented after step 5 but if the text contains characters like i or subscripts in, we might not have segmented each character. So, if the image obtained after step 5 has more than 1 connected component, steps 2-5 are again repeated. 7. Scale and center Each obtained image is then centered by addition or removal of blank pixels on the border such that the minimum distance of any character pixel from the image border is 5 pixels. III. Dataset We wrote a python script which takes a txt file as input and and substitutes its contents in a tex file. A bash script then compiles the tex file to generate the pdf document, converts the pdf to an image, runs our segmentation algorithm on the obtained image to segment the different characters and labels each image according to the its character name obtained from the txt file. The above procedure can be used to obtain as much data as we need with very little effort. For the purpose of this project, we have focused on the identification of upper and lower case roman alphabets and digits. Our dataset consists of 8 images of each of these characters and contains 472 images in total. IV. Features Feature selection becomes a critical part of this problem since it involves a large number of classes and the size of each class is much smaller than the number of classes. The following feature mappings map the image of a character to a vector in a multidimensional euclidean space. The learning algorithms are then trained and tested on those vectors. We tested different models on these three sets of features- (i) Average Pixel Intensity features (A) The whole image is divided into a 16x16 grid and the average pixel intensity in each grid is calculated. The feature vector is of size 256x1 and is obtained by concatenating the columns of the 16x16 image. (ii) Neighborhood pattern features (B) The pattern formed by the adjacent pixels seems to be an important cue in the identification of a digit. However, average pixel intensity features do not take the pattern formed by neighboring pixels into account. If we consider a 3 3 neighborhood of a pixel, there are 2 9 = 512 possible patterns that can be formed by the pixels in this neighborhood. We assign a number to each such pattern. Under this feature mapping, we first convert the image into a 16 16 image as described above. Then each non-border pixel, which are 14 14 = 196 in number are assigned the number corresponding to the pattern formed by its 3 3 neighborhood. The feature vector of size 196 1 is obtained by concatenating the numbers assigned to each of the non-border pixels in the 2

16 16 image. (iii) Discrete Fourier Transform features (C) The fourier transform captures the trends in relative patterns and is not as susceptible to noise as the neighborhood pattern features. The feature vector is obtained by taking the DFT of the 16x16 average pixel intensity image. The real and imaginary parts of the DFT coefficients are then concatenated to form a 512 1 feature vector. Figure 2: (i) Original image extracted from text (ii) 16 16 image with average pixel intensities (iii) Absolute value of real part DFT coefficients and (iv) Absolute value of imaginary part DFT coefficients (order: left to right) V. Models We employed the multiclass version of the following models to classify the characters. The training and test error results for these models are shown in the results section. (i) Naive Bayes We chose Naive Bayes as a baseline model so that we could get a rough idea of the difficulty of the task. We trained a Naive Bayes model using the average pixel intensity features and neighborhood pattern features. (ii) Linear Discriminant Analysis This model assumes that the conditional distribution of the features given the class is gaussian. This assumption is not explicitly true for our dataset and all classes having the same covariance matrix does not seem to be a good assumption. Nevertheless, we expected a fairly good result because much variability is not present in our class. For the multiclass formulation of LDA, we used a one-vs-one approach. This involves training a classifier for each pairwise combination of classes. So, if the number of classes is K then we train K(K 1)/2 classifiers. During prediction, we classify the input using all of these classifiers and the class which gets the majority vote is picked to be our predicted class. (iii) Support Vector Machine SVM is a fairly good choice if we assume that the feature space is linearly separable. This is not intuitively clear as our feature space has a high dimension. Nevertheless, as previously mentioned, our classes do not have a large variability and hence we expect classes to be linearly separable pairwise. However, another issue is that our feature space has far more dimensions than the number of data points and our data points are close to each other. SVM (or any linear classifier) may pose the danger of overfitting. For this reason, we did not use polynomial or RBF kernels for SVM because it further increases the effective dimensionality for the feature space. The objective function of or algorithm is 1 min w,b,ζ 2 wt w + C n ζ i i=1 subject to y i (w T φ(x i ) + b) 1 ζ i where ζ i 0 for i = 1,... n Our model uses a one-vs-one approach for the multiclass formulation of SVM as described previously. (iv) Random forest This model can perform well given large feature sets because it combines the predictions of various decision trees to build a more robust classifier. While constructing new decision trees, this method uses a random subset of features. This helps us to get rid of spurious features and might improve the robustness of our estimate. Our random forest model makes use of ID3 algorithm for training decision trees and uses the Gini measure for calculating the goodness of a split criteria. The Gini impurity measure is defined as Gini(X m ) = p mk (1 p mk ) k 3

where p mk is the fraction of times an element of class k occurs in a split X m of set X. Gini approximates the entropy function and hence our decision tree approximates a sequence of split criteria that decrease the entropy of the final partition. After trainig numerous decision trees, our model combines them using the AdaBoost algorithm. VI. Results We experimented with various combinations of features and models described above. The training and test results are as follows. The abbreviations for feature names have been given in the section on features. The errors are reported as the percentage of data misclassified. The training data consists for 354 images and the test data consists of 118 images. Feature Model A B C NB 0.1638 0.3842 - LDA 0.1299 0.3729 0.0 SVM 0.0734 0.3672 0.0 RF 0.1582 0.3616 0.0 Table 1: Training Error of feature model combination Feature Model A B C NB 0.0339 0.3729 - LDA 0.0339 0.3559 0.0678 SVM 0.0169 0.3559 0.0508 RF 0.1017 0.3559 0.0847 Table 2: Test Error of feature model combination VII. Discussion As expected, the average pixel intensities fail to give a good result because a slight shift in the image might lead to a different feature vector altogether. We expected neighborhood pixel pattern features to work better than average pixel features but it was surprising to find that they perform worse. On further deliberation, we feel that these features fail due to a similar reason. A slight shift in the image might severely affect the feature vector. Moreover, the effect of noise is more pronounced in this feature because a noisy pixel will considerably change up to 9 components of the feature vector. These components correspond to the noisy pixel and its neighbors. It is tough to visualize the exact correspondence between a DFT feature and the image. Hence, we had a neutral expectation from this feature mapping before we began working on it. As can be seen, these features give a considerably better result than other features. Our explanation for this observation is that DFT features capture the spatial periodicities of various pixels. DFT can be thought of as averaging after multiplication by sinusoids of different angular frequencies which are integral fractions of 2π. Due to the averaging operation, the DFT features are robust to noise. Further, a shift in the image corresponds to the change in phase of the DFT coefficients and hence the feature vectors are not affected much. As far as models are concerned, we didn t expect a good result from our baseline model, which was a multiclass Naive Bayes classifier. This is because NB assumes that the pixel intensities are independent, which doesn t seem to be a good assumption. The LDA classifier works for our data because we have very little variability within each class and hence the assumption that all classes share the same covariance matrix doesn t have much effect on the accuracy of our algorithm if the classes are far apart in the feature space, which is true for most of the classes. The random forest model is more robust and helps in decreasing the variance of the classifier because it uses a set of decision trees to make its decision. Since we pick up a random subset of features in the process of training the decision trees, we automatically filter out good features which affect our decision. Our intuition is validated by the fact that RF does better than SVM when we use an inferior set of features (features A and B) because of its 4

ability to discard irrelevant features. However, when we have a better set of features (feature C), its performance is close to that of SVM. SVM and DFT features was the best modelfeature combination we found. On test data, our model misclassifies 6 out of 118 images. The errors involve misclassification of 0, 1 and 2 as o, t and a respectively. The confused characters are structurally similar and hence the confusion is plausible. VIII. Future Work Classification of printed characters is a tough task. Close to perfect accuracy is desirable in order to build a system for converting equations to LaTeX commands. We have only worked on a subset of characters in this project. If one were to include greek alphabets too, the error would increase because of presence of similar characters like a and α, B and β etc. While doing a literature survey, we found that neural networks give a good result for handwritten digits. It would be interesting to explore how they work on printed text when lots of different classes are present. Another interesting but non-machine learning aspect of this problem which we have not considered in this project is the generation of LaTeX commands after segmentation and classification of text. References [1] T. Mitchell, Machine Learning: A Guide to Current Research: Springer, 1986. pp52-78 [2] A. Ng, "CS 229 Lecture Notes: Support Vector Machines," [online] cs229.stanford.edu/notes [3] E. Brown, "Character Recognition by Feature Point Extraction," [online], www.ccs.neu.edu/home/feneric [4] Mahmoud, S.A.; Mahmoud, A.S., "Arabic Character Recognition using Modified Fourier Spectrum (MFS)," Geometric Modeling and Imaging - New Trends, 2006, vol., no., pp.155,159, 16-18 Aug. 1993 [5] Scikit Learn- A machine learning library for Python [online] scikitlearn.org 5