Mathematical formulae recognition and logical structure analysis of mathematical papers

Mathematical formulae recognition and logical structure analysis of mathematical papers
DML 2010, July 7, 2010, Paris
Masakazu Suzuki, Kyushu University
InftyProject (http://www.inftyproject.org)
Science Accessibility Net (http://www.sciaccess.net)

Plan of the talk
- About InftyProject
- Making Rich Digital Mathematical Libraries
- Process Flow and Technical Components
- Formulae Recognition
- Adaptive Method
- Character and Symbol Recognition
- Logical Structure Analysis

Section 1: About InftyProject

InftyProject
The beginning:
- Started in 1995 as a research project to help visually impaired people in scientific fields.
- Digitization of mathematical journals, books, etc.
Current research subjects:
- Recognition and understanding of math documents,
- User interface and data conversion, etc.
Policy:
- Priority on practical system development.

InftyProject: main system development
- InftyReader: math OCR software
- InftyEditor: editor of math documents
- Data conversion (XML, LaTeX, HTML, PDF, etc.)
- ChattyInfty: InftyEditor + speech output
URL: http://www.inftyproject.org

InftyReader is used for
- Helping people with visual handicaps working in scientific fields,
- Digitization of mathematical/scientific journals in Japan, e.g. J. Math. Soc. Japan, Japanese J. Math., Tokyo J. Math., etc. (11 journals of mathematics published in Japan),
by the not-for-profit organization Science Accessibility Net (http://www.sciaccess.net/).

InftyReader: OCR software for math documents
Demonstration: original image and recognition result.
Sample: a math journal digitized using InftyReader.
http://infty.kyushu-u.ac.jp

Section 2: Toward Rich DML

Digitization of Math Journals
Different levels:
- Level 1: Scanned (bitmap) images of printed papers, e.g. GIF, TIFF
- Level 2: Searchable digitized document, e.g. PDF with hidden text
- Level 3: Structured document with links, e.g. XML, HTML (+MathML), LaTeX
- Level 4: (Partially) executable document, e.g. Mathematica, Maple
- Level 5: Formally presented document, e.g. Mizar, OMDoc
Infty: Level 1 -> Level 3

Process Flow of Digitization
Input: image file (TIF), PDF, texts
1. Layout analysis: segmentation of areas (text, table, figure)
2. Recognition per line (character recognition, math/text segmentation, math structure analysis)
3. Document structure analysis (chapter, section, itemize, theorem description, references, etc.)
4. XML
Outputs: LaTeX, HTML+MathML, PDF, Braille codes, etc.

Layout Analysis (preprocessing)
Input: image file (TIF), PDF
1. Segmentation of areas (text, table, figure), followed by table analysis
2. Recognition per line (character recognition, math structure analysis)
3. Document structure analysis (title, chapter, section, itemize, theorem, bibliography, etc.)
4. XML
Outputs: LaTeX, HTML, human-readable TeX, Braille codes, speech data, etc.

Process Flow of Digitization: Line Segmentation
Line segmentation follows the segmentation of areas (text, table, figure) and precedes recognition per line.

Line Segmentation (Sample)

A Method of Line Segmentation

Math/Text Segmentation
The number of characters in math areas is about 20% of all characters in pure math journals.
(Figure: examples without and with math structure.)

Math/Text Segmentation
Recognition of ordinary text and math/text area segmentation are carried out as one simultaneous process using DP (dynamic programming):
1. Combination of different OCR engines
2. Score using relative position check
Current version: Infty + two commercial OCRs (Toshiba + Media Drive)
New version: the FineReader engine will be added.
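
To make the DP idea concrete, here is a minimal sketch and not Infty's actual algorithm. It assumes each character already has a hypothetical "text" score and "math" score (for example from the combined OCR engines and the position check) and that switching between areas costs a hypothetical `switch_penalty`. The function name and all values are illustrative assumptions.

```python
def segment_math_text(text_scores, math_scores, switch_penalty=1.0):
    """Label each character 'T' (text) or 'M' (math) by dynamic programming.

    Hypothetical inputs: per-character scores for each label (higher is
    better) and a penalty paid whenever the label changes.
    """
    n = len(text_scores)
    if n == 0:
        return []
    labels = ("T", "M")
    scores = (text_scores, math_scores)
    # best[k]: best total score of any labeling of the prefix ending in label k
    best = [scores[0][0], scores[1][0]]
    back = []  # back[i - 1][k]: label at position i - 1 on the best path to (i, k)
    for i in range(1, n):
        new_best, choice = [], []
        for k in range(2):
            stay = best[k]
            switch = best[1 - k] - switch_penalty
            if stay >= switch:
                new_best.append(stay + scores[k][i])
                choice.append(k)
            else:
                new_best.append(switch + scores[k][i])
                choice.append(1 - k)
        best = new_best
        back.append(choice)
    # Trace the best path backwards from the better final label.
    k = 0 if best[0] >= best[1] else 1
    path = [k]
    for choice in reversed(back):
        k = choice[k]
        path.append(k)
    path.reverse()
    return [labels[k] for k in path]


example = segment_math_text(
    [0.9, 0.8, 0.4, 0.2, 0.1, 0.9],  # hypothetical text scores
    [0.1, 0.2, 0.6, 0.8, 0.9, 0.1],  # hypothetical math scores
    switch_penalty=0.5,
)
```

The penalty is what makes the decision global rather than per character: a single character that looks slightly more like math is kept as text when switching twice would cost more than it gains.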

Methods for further improvement of ordinary text recognition
Method 1 (recognition of words):
- Multilingual recognition.
- Introduction of the ABBYY FineReader engine.
Method and effect (outputs labeled F, E, I for the same word, and the result):
- F: N ü h a m a - g u n
- E: N i i h a r n a - g u n
- I: N i i / ι a m a - g u n
- Result: N i i h a m a g u n

Methods for further improvement of ordinary text recognition
Method 2 (use character sizes and positions):
- Multilingual recognition.
- Introduction of the ABBYY FineReader engine.

Formulae Recognition
- Recognition of a variety of rare symbols.
- Distinction of fonts (italic, bold, Bbb, calligraphic, etc.).
- Segmentation of touching/broken characters in math areas is still a difficult problem.
- Structure analysis of math formulae that is stable against misrecognition of characters.
- Distinction between noise and small symbols.

Document Structure Analysis
Detection of: title, author, section, subsection, itemization, BibItem, theorem, lemma, etc.
- A naive method: line classification using a combination of features such as character size, font information (bold, italic, small capitals), keywords, indentation, and lines starting with numbers or a special pattern (e.g. [Num]).
- A stronger method is required in actual digitization.
- Hyperlinks inside the document.
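
The naive line classification above can be sketched as a small rule set. This is an illustration only: the `Line` fields, the keyword list, the thresholds and the rule order are all assumptions, not the rules used by Infty.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical per-line features produced by earlier OCR steps."""
    text: str
    font_size: float
    bold: bool = False
    italic: bool = False
    indent: float = 0.0


# Assumed keyword list for theorem-like environments.
THEOREM_KEYWORDS = ("Theorem", "Lemma", "Proposition", "Corollary", "Definition")


def classify_line(line, body_size=10.0):
    """Classify one line from simple features; rules are checked in order."""
    text = line.text.strip()
    if re.match(r"^\[\d+\]", text):            # special pattern [Num]
        return "BibItem"
    if text.startswith(THEOREM_KEYWORDS):      # keyword feature
        return "TheoremLike"
    if re.match(r"^\d+\.\d+\.?\s+\S", text) and (line.bold or line.italic):
        return "Subsection"                    # e.g. "3.1 Voting"
    if re.match(r"^\d+\.?\s+\S", text) and line.bold:
        return "Section"                       # e.g. "3 Formulae Recognition"
    if line.font_size >= 1.4 * body_size:      # character size feature
        return "Title"
    if re.match(r"^(\(\w+\)|[-•])\s+", text) and line.indent > 0:
        return "Item"                          # indentation + item marker
    return "Body"
```

Fixed rules like these break as soon as a journal uses a different style, which is one reason a stronger method is needed in actual digitization.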

Section 3: Formulae Recognition by Infty

Formulae recognition
- R. Anderson, Syntax directed recognition of hand-printed two dimensional mathematics, in: Interactive System for Experimental Applied Mathematics, M. Klerer and J. Reinfelds, Eds., Academic Press, 1968, pp. 436-459.
- M. Okamoto and H. Twaakyondo, Structure analysis and recognition of mathematical expressions, 3rd ICDAR, Montreal, 1995, pp. 430-437.
- R. J. Fateman, T. Tokuyasu, B. P. Berman and N. Mitchell, Optical Character Recognition and Parsing of Typeset Mathematics, Journal of Visual Communication and Image Representation, vol. 7, no. 1, 1996, pp. 2-15.
- Y. Eto and M. Suzuki, Mathematical formula recognition using virtual link network, 6th ICDAR, Seattle, 2001, IEEE Computer Society Press, pp. 430-437.

Infty OCR engine
- Developed in Suzuki Lab. using more than 1,500,000 sample images of characters and symbols from various math books/journals.
- Recognizes more than 500 categories: various math symbols and various fonts (Roman, italic, calligraphic, Bbb, some German fonts, etc.).
- High speed.
- Three-step classification: from rough classification to strict classification.

Three-step classification
Input image (example: α)
1. Sectional features (3 dim) -> classification (DB)
2. Directional element features (36 dim) -> classification (DB)
3. Peripheral feature + density feature (128 dim) -> classification (DB)
Recognition result (candidates): α, a, d

Voting method
- Three features each produce a ranked list of candidates: directional element feature (36 dim), peripheral feature (64 dim), density feature (64 dim).
- Rank scores: 1st = 5, 2nd = 4, 3rd = 3, 4th = 2, 5th = 1.
- Example (candidates α, σ, a, d, e, θ appear in the three lists): total scores α 1st (14), a 2nd (9), d 3rd (7).
- Voting: normalization of the score of symbol recognition.
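
The rank-based voting can be sketched in a few lines. The three rankings below are hypothetical: the individual lists on the slide are not recoverable, so they were chosen only to reproduce the totals shown (α 14, a 9, d 7). The function name `vote` is likewise an assumption.

```python
from collections import defaultdict


def vote(rankings, top_k=5):
    """Sum rank scores (1st = top_k, ..., last = 1) over all rankings."""
    totals = defaultdict(int)
    for ranking in rankings:
        for position, candidate in enumerate(ranking[:top_k]):
            totals[candidate] += top_k - position
    # Highest total first; ties broken by symbol for a deterministic order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical ranked candidates from the three feature classifiers.
rankings = [
    ["α", "a", "d", "e", "θ"],   # directional element feature
    ["α", "σ", "d", "a", "e"],   # peripheral feature
    ["θ", "α", "a", "e", "d"],   # density feature
]
result = vote(rankings)
```

Because only ranks are added, the three classifiers contribute on the same scale even though their raw distance values are not comparable.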

Structure Analysis of Formulae
Input (image) -> Output (tree structure), with links labeled Horizontal and RSubScript.

Structure Analysis of Formulae
Example: $\sum_{i=0}^{n} a_i$

Structure Analysis of Formulae
Some difficult cases:

Structure Analysis of Formulae
Link possibilities:

Structure Analysis of Formulae
Similar characters: * z N - 1 s Z S
Which candidates are appropriate?

Structure Analysis of Formulae
Virtual link network (figure: network over the symbols a, b, x, 2, d, x, with χ as an alternative candidate for x).

Structure Analysis of Formulae
Search for the correct spanning tree (figure: the symbols a, b, x, 2, d, x).

Virtual link network (input image -> virtual link network)
- Nodes = candidates of character recognition.
- Each link has a label and a link cost.
- Link labels: Horizontal, Upper, Under, Rsup, Rsub, Lsup, Lsub.
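
One possible in-memory shape for such a network is sketched below, under stated assumptions: the class names `Node` and `Link`, the helper `tree_cost`, and the cost values are hypothetical, and the validity check covers only the basic tree conditions (one candidate per character, one parent per non-root character, a single root, no detached cycles).

```python
from dataclasses import dataclass

LINK_LABELS = ("Horizontal", "Upper", "Under", "Rsup", "Rsub", "Lsup", "Lsub")


@dataclass(frozen=True)
class Node:
    char_index: int   # which character image this candidate belongs to
    symbol: str       # one recognition candidate for that image


@dataclass(frozen=True)
class Link:
    parent: Node
    child: Node
    label: str
    cost: float

    def __post_init__(self):
        if self.label not in LINK_LABELS:
            raise ValueError(f"unknown link label: {self.label}")


def tree_cost(links):
    """Return the total cost if `links` form a valid tree, else None."""
    chosen = {}      # char_index -> the single candidate symbol selected
    parent_of = {}   # child char_index -> parent char_index
    for link in links:
        for node in (link.parent, link.child):
            if chosen.setdefault(node.char_index, node.symbol) != node.symbol:
                return None   # two different candidates for one character
        if link.child.char_index in parent_of:
            return None       # a character may have only one parent
        parent_of[link.child.char_index] = link.parent.char_index
    roots = set(chosen) - set(parent_of)
    if len(roots) != 1:
        return None
    for start in chosen:      # every character must reach the root
        current, steps = start, 0
        while current in parent_of and steps <= len(chosen):
            current, steps = parent_of[current], steps + 1
        if current not in roots:
            return None
    return sum(link.cost for link in links)


# Hypothetical network for "x squared": the base has candidates x and χ.
x0, chi0, two = Node(0, "x"), Node(0, "χ"), Node(1, "2")
good = [Link(x0, two, "Rsup", 0.25)]
bad = [Link(x0, two, "Rsup", 0.25), Link(chi0, two, "Horizontal", 0.5)]
```

Here `tree_cost(good)` is the cost of one admissible structure tree, while `bad` is rejected because it selects both x and χ for the same character image.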

Link Cost
Definitions of: normalized size (NSize) and normalized center (NCenter).
x:y:z = 28:51:21 (default value)

Link Cost
Distribution map of character pairs in the (H,D)-plane, showing horizontal, superscript, and subscript positions.

Distribution maps by pair type (four panels): Alphabets-Alphabets, Alphabets-Operators, Integrals-Alphabets, Big Operators-Alphabets.

Network -> Extraction of Structure Tree
- Optimization under constraints: minimum total cost, subject to link restrictions.
- Extraction of the minimum-cost spanning tree under these constraints is NP-hard.
- Strategy of the current version: beam search.

Extraction of Structure Tree

Linearize
Each path through the linearized network corresponds to a spanning tree.

Search of spanning tree
- Each path corresponds to a spanning tree.
- Beam search: at each step, we hold a fixed number of paths (= beam) with the lowest costs, and use them at the next step.
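
The beam search described here can be sketched generically. This is a minimal illustration, not the Infty implementation: it assumes the linearized network is given as a list of steps, each offering (choice, cost) options, plus a hypothetical `transition_cost` function that returns an extra cost between consecutive choices, or `None` when a link restriction forbids the pair.

```python
import heapq


def beam_search(steps, transition_cost, beam_width=3):
    """Keep only the `beam_width` cheapest partial paths after every step.

    Returns (total_cost, path) for the cheapest surviving path, or None
    if every path was ruled out.
    """
    beam = [(0.0, [])]
    for options in steps:
        expanded = []
        for cost, path in beam:
            for choice, local_cost in options:
                extra = transition_cost(path[-1], choice) if path else 0.0
                if extra is None:   # forbidden combination
                    continue
                expanded.append((cost + local_cost + extra, path + [choice]))
        # Hold a fixed number of lowest-cost paths for the next step.
        beam = heapq.nsmallest(beam_width, expanded, key=lambda item: item[0])
    return beam[0] if beam else None


# Hypothetical two-step example: χ looks cheaper at first, but it is a
# poor base for the superscript that follows.
steps = [
    [("x", 1.0), ("χ", 0.75)],
    [("Rsup:2", 0.5)],
]


def transition(previous, current):
    return 1.0 if previous == "χ" else 0.0


narrow = beam_search(steps, transition, beam_width=1)
wide = beam_search(steps, transition, beam_width=2)
```

With `beam_width=1` the search commits to χ and ends at cost 2.25, while `beam_width=2` keeps x alive and finds the cheaper path at cost 1.5. This shows on a toy case how a beam that is too narrow can miss the optimal solution.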

Network -> Extraction of Structure Tree
- Optimization under constraints: minimum total cost, subject to link restrictions.
- Extraction of the minimum-cost spanning tree under these constraints is NP-hard.
- Beam search sometimes fails to find the optimal solution.
- Is there some other, better strategy?

Section 4: Large Volume Recognition

Large Volume Digitization
- Retro-digitization of journals,
- Reproduction of old books/series of books,
- Translation to different languages,
- Braille transcription,
- DAISY talking books, etc.

Large Volume Digitization
The adaptive method is efficient. Get information from the target document:
- Character features,
- Math formula parameters,
- Layout parameters, etc.
and use it for recognition, either directly or after manual checking (semi-automatic).

InftySystem for large scale digitization
Applications:
1. InftyReader, downloadable from our web site: http://www.sciaccess.net
2. InftyReader Pro (professional version)
3. BatchInfty
4. CharImageManager

InftySystem for large scale digitization
Process flow using BatchInfty and InftyReader Pro:
1. Noise reduction, centering, etc.
2. Trial recognition
3. Extraction of features:
   - Document style -> logical structure analysis
   - Character cluster images -> OCR engine
4. Recognition and verification
5. PDF output

Problems
Full automation of the adaptive method:
- From the target documents: extraction of character features / layout parameters.
- Improvement of character recognition, formulae recognition, and logical structure analysis.
- Without manual correction.

Problems
Further improvement of character/symbol recognition and structure analysis of math expressions:
- Touching characters and broken characters in math areas
- Low-resolution images
- Different typefaces (old books, typewriter prints, etc.)
- Bold character detection in math areas

Problems
Logical structure analysis (automatic detection and manual correction):
- Title, author, section, subsection, itemization, BibItem, theorem, lemma, etc.
- Hyperlinks inside the document.

Problems
Detection/analysis of figures and tables:
- Detection of characters in figures (graphs)
- Table structure analysis (tables)

Challenge
Is it possible to realize OCR with higher accuracy than manual input/correction by a human?
(I hope that a student who takes on this difficult problem will appear in the near future!)

InftyProject: Conclusion
- Research group of math information processing.
- Demo (InftyReader).
- A brief sketch of the methods used in Infty and the current state of the art.
- Many problems remain unsolved, especially in the practical sense.
- Proposed some problems to be attacked.

INFTY: an integrated OCR for mathematical documents
Thank you!
Masakazu Suzuki
Graduate School / Faculty of Math., Kyushu University
E-mail: suzuki@math.kyushu-u.ac.jp
InftyProject: http://www.inftyproject.org
Science Accessibility Net: http://www.sciaccess.net