Designing a Semantic Ground Truth for Mathematical Formulas

Similar documents
An Online Repository of Mathematical Samples

Refinement of digitized documents through recognition of mathematical formulae

Support Vector Machines for Mathematical Symbol Recognition

Towards a Parser for Mathematical Formula Recognition

Mathematical formulae recognition and logical structure analysis of mathematical papers

Integrated OCR software for mathematical documents and its output with accessibility

Mathematical formula recognition using virtual link network

Framework for Sense Disambiguation of Mathematical Expressions

A Linear Grammar Approach to Mathematical Formula Recognition from PDF

A Ground-Truthed Mathematical Character and Symbol Image Database

Faithful Mathematical Formula Recognition from PDF Documents

JOURNAL OF OBJECT TECHNOLOGY

On-Line Recognition of Mathematical Expressions Using Automatic Rewriting Method

Open XML Requirements Specifications, a Xylia based application

INFORMATION SOCIETY TECHNOLOGIES (IST) PROGRAMME. Project IST MoWGLI. L A TEX - based authoring tool (first prototype)

Database-driven Mathematical Character Recognition

MathDox: Mathematical Documents on the Web Contribution to the OMDoc book

The PAGE (Page Analysis and Ground-truth Elements) Format Framework

Security Issues Formalization

Introduction to Automata Theory. BİL405 - Automata Theory and Formal Languages 1

THE OUTLOOK FOR MATHEMATICS ON THE WEB

Types and Static Type Checking (Introducing Micro-Haskell)

Digitization g of Mathematical Journals. The beginning: Current research subjects: URL:http;// Policy:

A Method of Annotation Extraction from Paper Documents Using Alignment Based on Local Arrangements of Feature Points

Crafting a Compiler with C (II) Compiler V. S. Interpreter

A Design-for-All Approach Towards Multimodal Accessibility of Mathematics

CIS 1.5 Course Objectives. a. Understand the concept of a program (i.e., a computer following a series of instructions)

UML-Based Conceptual Modeling of Pattern-Bases

Object-oriented Compiler Construction

Eclipse Support for Using Eli and Teaching Programming Languages

OpenMath and SMT-LIB

Identifying Layout Classes for Mathematical Symbols Using Layout Context

Automated data entry system: performance issues

A Prolog-based Proof Tool for Type Theory TA λ and Implicational Intuitionistic-Logic

Chapter 3. Semantics. Topics. Introduction. Introduction. Introduction. Introduction

Introducing live graphics gems to educational material

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Evaluation of Hardware Oriented MRCoHOG using Logic Simulation

MMT Objects. Florian Rabe. Computer Science, Jacobs University, Bremen, Germany

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer. Compiler Design

New Accessibility Features in MathJax

Toward Part-based Document Image Decoding

A Small Permutation Group Engine by: Gregory Kip. COMS W4115 Programming Languages and Translators Prof. Stephen Edwards

Compilers. History of Compilers. A compiler allows programmers to ignore the machine-dependent details of programming.

PRINCIPLES OF COMPILER DESIGN UNIT I INTRODUCTION TO COMPILERS

Indexing Mathematical Content Using Full Text Search Engine

Management Science Letters

CMSC 330: Organization of Programming Languages. Formal Semantics of a Prog. Lang. Specifying Syntax, Semantics

Introduction to Lexical Analysis

When do We Run a Compiler?

Formal editing: jedit-mmt. Narrative editing: LaTeX-MMT. Browsing: MMT web server. Building: MMT scripting language. The MMT API: A Generic MKM System

CSE 12 Abstract Syntax Trees

On the Logical Structure of Mathematical Notation

A System to Automatically Index Genealogical Microfilm Titleboards Introduction Preprocessing Method Identification

Let denote the number of partitions of with at most parts each less than or equal to. By comparing the definitions of and it is clear that ( ) ( )

Performance Evaluation of Semantic Registries: OWLJessKB and instancestore

INDEPENDENT POSTULATES FOR THE "INFORMAL" PART OF PRINCIPIA MATHEMATICA*

What do Compilers Produce?

Journal of Asian Scientific Research FEATURES COMPOSITION FOR PROFICIENT AND REAL TIME RETRIEVAL IN CBIR SYSTEM. Tohid Sedghi

3.5 Practical Issues PRACTICAL ISSUES Error Recovery

Basic concepts. Chapter Toplevel loop

Appendix: Generic PbO programming language extension

Inheritance Metrics: What do they Measure?

Reimplementing the Mathematics Subject Classification (MSC) as a Linked Open Dataset

Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization

Group B Assignment 9. Code generation using DAG. Title of Assignment: Problem Definition: Code generation using DAG / labeled tree.

6.001 Notes: Section 6.1

Efficient Image Compression of Medical Images Using the Wavelet Transform and Fuzzy c-means Clustering on Regions of Interest.

Semantic Analysis. Compiler Architecture

Uncertain Data Models

Model-Solver Integration in Decision Support Systems: A Web Services Approach

MIDTERM EXAM (Solutions)

Equivalence Detection Using Parse-tree Normalization for Math Search

Computer algebra systems, mathematical representation, and the DLMF

The Cantor Handbook. Alexander Rieder

7. Introduction to Denotational Semantics. Oscar Nierstrasz

AUTOMATIC ACQUISITION OF DIGITIZED NEWSPAPERS VIA INTERNET

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

COMP 181 Compilers. Administrative. Last time. Prelude. Compilation strategy. Translation strategy. Lecture 2 Overview

Compilers. Prerequisites

An Archiving System for Managing Evolution in the Data Web

History of Compilers The term

The Structure of a Syntax-Directed Compiler

Using Attribute Grammars to Uniformly Represent Structured Documents - Application to Information Retrieval

A Lightweight Language for Software Product Lines Architecture Description

CPS122 Lecture: From Python to Java

Types and Static Type Checking (Introducing Micro-Haskell)

Replacing SEP-Triplets in SNOMED CT using Tractable Description Logic Operators

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

On the relationship between OpenMath and MathML

InftyEditor with InftyReader-pdf2latex

CS101 Introduction to Programming Languages and Compilers

Compiling and Interpreting Programming. Overview of Compilers and Interpreters

SOFTWARE AND MULTIMEDIA. Chapter 6 Created by S. Cox

2.2 Syntax Definition

INFTY An Integrated OCR System for Mathematical Documents

FRACTAL IMAGE COMPRESSION OF GRAYSCALE AND RGB IMAGES USING DCT WITH QUADTREE DECOMPOSITION AND HUFFMAN CODING. Moheb R. Girgis and Mohammed M.

Multilingual mathematical e-document processing

A simple syntax-directed

The Structure of a Compiler

Transcription:

Designing a Semantic Ground Truth for Mathematical Formulas Alan Sexton 1, Volker Sorge 1, and Masakazu Suzuki 2 1 School of Computer Science, University of Birmingham, UK, A.P.Sexton V.Sorge@cs.bham.ac.uk, http://www.cs.bham.ac.uk/ aps vxs 2 Faculty of Mathematics, Kyushu University, Japan suzuki@math.kyushu-u.ac.jp, www.inftyproject.org/suzuki Abstract. We report on a new project to design a semantic ground truth set for mathematical document analysis. The ground truth set will be generated by annotating recognised mathematical symbols with respect to both their global meaning in the context of the considered documents and their local function within the particular mathematical formula they occur. The aim of our work is to have a reliable database available for semantic classification during the formula recognition process with the aim of enabling correct interpretations of mathematical formulae and generating semantic markup such as Content MathML. 1 Introduction Ground Truth sets are manually annotated or validated sets of training data that are important tools for many recognition tasks. In document image analysis research, ground truth data is crucial for the design, training and testing of algorithms for data identification and extraction. A ground truth set for an optical character recognition(ocr) system generally consists of images of single characters together with their correct syntactic interpretation, e.g. in the form of ASCII code. Bespoke ground truth sets have to be developed to cater for types of target documents and recognition methods. In spite of the availability of automated tools for the development of ground truth sets for certain special cases [8], in the majority of cases ground truth sets can only be assembled semiautomatically and their manual correction is generally a very laborious task. For mathematical OCR and formula recognition, assembling a ground truth set is an even more daunting task as it not only needs to contain a large number of, often very similar, symbols but also has to cope with the two dimensional layout of mathematical formulas and therefore contain spatial information. There is currently only one ground truth set for mathematical documents available [6]. It has been constructed from 30 different articles on pure mathematics. It is a database of over 680,000 characters occurring both in text and mathematical formulae in the articles. It contains information about the structural nature of The authors work was supported by Royal Society International Joint Project (2008/R3)

the expression as a whole that the symbols are contained in and can therefore be used as a collection of statistical information about the relative occurrence of and relationships between neighbouring characters. Since most of the characters appear many times in this database, there is a large amount of information that can be mined from the database. For instance, the ground truth set has been used to compile statistics on spatial relationships between mathematical symbols[1] that are exploited to resolve sub- and superscript relationships within the mathematical formula recognition of the Infty system [5]. While [6] can greatly improve the robustness of algorithms for correct syntactic recognition, it is currently still of limited use for extracting semantic meaning of formulas. However, often mathematical formulae can only be correctly recognised if the underlying semantics of the formula is clear. Enriching a ground truth set for mathematical OCR with semantic information could therefore be desirable, and if a semantic ground truth set can be constructed successfully, it should not only enhance current formula recognition techniques, but also enable direct translation of expressions from mathematical documents into semantic markup such as Content MathML or OpenMath. This would aid accessibility tools to interpret functions and their components correctly as well as make content of mathematical documents amenable to mathematical software like Computer Algebra systems. Part of the problem in building high quality mathematical formula recognisers is the ambiguity caused by similar constructs ( used for different mathematical n concepts. To take just one example, is a binomial coefficient (n choose k) k) or a vector? Without context we can not be confident of our answer. A semantic ground truth for mathematical expressions would help us to base our decisions on well-founded scientific data, rather than the programmer s mathematical intuition. 2 Semantic Ground Truth The aim of our work is to compile a semantic ground truth set for mathematical formula recognition. We propose to implement this in two parts. The first part is a semantic ground truth for mathematical characters and symbols here we are concerned with associating low level information of individual symbols and their local neighbours (e.g. font information spacing and relative baseline positions) with mathematical objects and constructs from the mathematical domain that the document in question belongs to. The second part is a semantic ground truth for mathematical expressions as a whole 2.1 Semantic Symbol Ground Truth Set Our approach to constructing the semantic symbol ground truth set is to annotate the mathematical symbols occurring in a syntactic ground truth set similar to [6]. The resulting ground truth set should contain three kinds of information:

1. OCR level: Data on character codes and graphical coordinates. 2. Tree structure level: This describes the structure of a formula in form of an abstract syntax tree and corresponds to presentation mark up or L A TEX conversion level. 3. Semantic level: Semantic information about the nature and origin of symbols and formulas. This level will also contain semantic markup such as OpenMath or Content MathML. Semantic annotations will in turn be based on the following three levels: 1. Subject area 2. Usage of a symbol 3. Definition within a given context. The three different types of annotations enable the description of a symbol s semantics on three different levels of granularity. Subject Area Each symbol will have an annotation attribute for its origin in some mathematical field, which will correspond to the two first digits of the AMS Mathematics Subject Classification of 2000 [7]. This refers to the general mathematical field the document belongs to from which the particular symbol was extracted. Usage of Symbol A common problem with the correct interpretation of mathematical symbols is that they often have different meanings depending on the overall mathematical area or the local context in which a formula occurs. Therefore, one of the semantic annotations will record the exact mathematical usage of each symbol, e.g., is it a function symbol, an operator, a relation etc., in the formula from which it was extracted. For example the following two formulas give the symbol g two distinct meanings: g G (1) g B A (2) In (1) g is declared as an element of a group G, whereas in (2) g represents a function with domain A and codomain B. Consequently the former would be annotated as an ordinary symbol while the latter would be annotated as a function symbol. The usage of g can then be interpreted differently in other expression. For instance, in the expression g(hk) accordingtothe semanticusagein(1), itwouldbepartofamultiplicationwithin a group, while it has to be interpreted as a function application according to the semantics given to it in (2). Definition The most fine-grained semantic annotation will be based on the mathematical definition of a particular symbol in the context of the particular document it has been extracted from. We will use, as far as possible, the definition given in OpenMath content dictionaries [4] as annotations.

2.2 Semantic Expression Ground Truth Set While we can attach semantics to individual symbols, and, to a certain extent, relationships between neighbouring symbols, this does not extend to whole expressions where the relationships are between neighbouring sub-expressions, rather than simply symbols. Before we can hope to attach such expression semantics, therefore, we need to identify sub-expressions for semantics to be attached to. We therefore propose to use the tree structure level of the ground truth set in the form of abstract syntax trees (ASTs) for the expressions in our set and attach semantics to the nodes in these. Since the leaves of the ASTs would consist of the single characters and symbols in the expression, they would automatically have the annotations of the symbol ground truth. The inner nodes of an AST would then inherit the subject area annotation. While an inner node would not have a usage annotation, it will be annotated with the definition corresponding to the semantics of the sub-expression rooted in this node. For example, an expression of the form (1 2 3) in group theory has as symbol annotations open fence, three ordinaries, and closed fence and the three ordinaries in turn have definition annotations as integers. The AST representing the entire expression then will have a separate definition annotation, which will be permutation. 3 Automated Generation We intend to assemble the semantic ground truth set based on the machinery designed and implemented for the syntactic ground truth set presented in [6]. This facilitates the automatic recognition of mathematical symbols using the Infty system with subsequent manual correction. We are extending these tools to enable the handling and storage of semantic annotations. While the subject classification will be entered globally for all symbols from an article, the other two semantic annotations will be entered using a semiautomated approach, via a hangman style completion mechanism. That is, for each article in the ground truth set, symbols occurring in mathematical expressions will be annotated manually. If the same symbol occurs elsewhere in the same article it will be automatically given the same annotation. Thus the number of symbols that need to be annotated should gradually decrease with the majority of work spent on manually checking and correcting annotations if the completion yields incorrect results (e.g., if one symbol is used with different semantic meanings or in different contexts). Moreover, we intend, as much as possible, to exploit automatic classification of symbols with respect to the second semantic annotation for basic mathematical usage of symbols. We have recently developed a mechanism to categorise symbols based on special relations between symbols occurring in a formula [3]. The spatial analysis is based on re-engineering the basic layout rules which are traditional in mathematical typesetting and which are also employed by the

L A TEX system. The mechanism is currently employed for formula recognition from PDF documents [2], however, this will be extended in order to work for formula recognition in a more general context, such as from scanned documents. As a simple example of the working of this algorithm consider the following two formulas: xry yrx (3) x R y y R x (4) Here the spacing between the symbols in (3) would not distinguish the three occurring letters and therefore classify all three as ordinal symbols. On the other hand the increased spacing between the R and x,y in (4) would automatically identify R as a relation symbol, which is indeed its intended meaning in the given formula. While these techniques can assist in assigning semantics to symbols they are not sufficient annotating at the level of expressions in ASTs. Normally, ASTs are the output of a parser using a set of grammar rules. Here, we want to use the resulting ground truth set to deduce the appropriate grammar rules. Therefore, we must construct the ASTs manually, to correspond to how a human mathematician understands the expression. With such ASTs, we can then attach semantic interpretations to the nodes to complete our semantic expression ground truth set. Constructing ASTs manually is a particularly arduous task. We are implementing a tool that uses previous work on formula recognition on PDF documents [2, 3] to construct an initial proposed AST, and provide a convenient user interface for manipulating this tree into one closer to what the human mathematician understands. At this point, semantic annotations for nodes can be added so that, in the binomial coefficient/vector problem mentioned in the introduction, the AST would be the same in both casesbut the semantic annotation would be different. Since our previous recognition tool can already produce some Content MathML we hope to exploit this to aid the manual annotation. 4 Conclusion We have outlined our current project of creating a semantic ground truth set that should help to extend mathematical formula recognition techniques to produce semantically marked-up results. We are currently preparing the basic machinery to assemble the semantic ground truth set. One obstacle is that we can not simply base it on the existing syntactic ground truth set [6] due to copyright issues. While this implies that we will have to start from scratch it will also give us the opportunity for a new, careful selection of documents to include. We currently envisage that the annotation with respect to subject area and general usage of symbols is fairly straightforward and can be mostly automated, while entering the exact mathematical definitions will need to be done manually and might be more difficult to complete for the entire set.

Other potential problems with the construction of a semantic ground truth data set that could occur are: Ambiguities in the meaning of mathematical notation can not be resolved by considering a single article of the ground truth set, but will need a background knowledge of the mathematical literature in the field. The current semantic formalisation given in the OpenMath content dictionaries are not sufficient for annotating the given data. The OpenMath formalisations are not at the right level to give semantic meaning to human oriented mathematics. Since, to the best of our knowledge, no work has been done to create semantic ground truth in the context of document analysis we have currently no comparison to assess these factors. However, we strongly believe that if our attempt is successful the work could potentially be of great impact for mathematical formula recognition. References 1. W. Aly, S. Uchida, A. Fujiyoshi, and M. Suzuki. Statistical classification of spatial relationships among mathematical symbols. In Proceedings of ICDAR 2009, pages 1350 1354. IEEE Society Press, 2009. 2. J. Baker, A. Sexton, and V. Sorge. A linear grammar approach to mathematical formula recognition from PDF. In Proceedings of Intelligent Computer Mathematics, LNAI. Springer Verlag, Germany, 2009. 3. J. Baker, A. Sexton, and V. Sorge. Faithful mathematical formula recognition from PDF documents. In Proceedings of DAS 2010, 2010. Forthcoming. 4. S.Buswell, O.Caprotti, D.P.Carlisle, M.C.Dewar, M.Gaëtano, and M.Kohlhase. The OpenMath Standard. The OpenMath Society, June 2004. 5. M.Suzuki,F.Tamari, R.Fukuda,S.Uchida,andT.Kanahori. Infty anintegrated OCR system for mathematical documents. In Proceedings of ACM Symposium on Document Engineering, pages 95 104. ACM Press, 2003. 6. M. Suzuki, S. Uchida, and A. Nomura. A ground-truthed mathematical character and symbol image database. In Proceedings of ICDAR 2005, pages 675 679. IEEE Society Press, 2005. 7. The American Mathematical Society. 2000 Mathematics Subject Classification, 2000. http://www.ams.org/msc/. 8. J. van Beusekom, F. Shafait, and T. M. Breuel. Automated OCR ground truth generation. In Proceedings of DAS 2008, Sep 2008.