AIDAS: Incremental Logical Structure Discovery in PDF Documents

Similar documents
Automatic indexing of PDF documents with ontologies

PDF and Accessibility

3. Formatting Documents

Word Level 1: Beginner. Get started in Word. Apply basic text formatting. Arrange paragraphs on the page

Formatting documents in Microsoft Word Using a Windows Operating System

Assessment of Informational Materials (AIM) Tool. Funded by Alberta Enterprise and Education

Fall 2016 Exam Review 3 Module Test

Accessible PDF Documents with Adobe Acrobat 9 Pro and LiveCycle Designer ES 8.2

Lesson 4 - Basic Text Formatting

Readers are wary of out of date content, so it's important to actively manage the information you publish.

California Open Online Library for Education & Accessibility

Projected Message Design Principles

Web WRITING

Universal Design for Learning Tips

PDF Remediation Checklist

Automation of Semantic Web based Digital Library using Unified Modeling Language Minal Bhise 1 1

Creating Accessible Microsoft Word 2003 Documents Table of Contents

XF Rendering Server 2008

Chapter 6. Design Guides

A Design-for-All Approach Towards Multimodal Accessibility of Mathematics

Plainfield High School CTE Department

Accessibility 101. Things to Consider. Text Documents & Presentations: Word, PDF, PowerPoint, Excel, and General D2L Accessibility Guidelines.

PDF Accessibility Guide

Heading-Based Sectional Hierarchy Identification for HTML Documents

TABLE OF CONTENTS PART I: BASIC MICROSOFT WORD TOOLS... 1 PAGE BREAKS... 1 SECTION BREAKS... 3 STYLES... 6 TABLE OF CONTENTS... 8

Advanced Layouts in a Content-Driven Template-Based Layout System

What is Web Accessibility? Perspective through numbers... 2 Students will not always identify... 2

Graduate Health Sciences Word Topics

Correctly structuring and presenting a technical document

Setting Up Heading Numbers with a Multilevel List

Introduction to CS Page layout and graphics. Jacek Wiślicki, Laurent Babout,

ECDL / ICDL Presentation Syllabus Version 5.0

Setting Up Your Dissertation Format Using MS Word2000. Overview of the Process

How to Write a Technical Manual that Users Can Actually Use

Using Microsoft Office 2003 Intermediate Word Handout INFORMATION TECHNOLOGY SERVICES California State University, Los Angeles Version 1.

FILE TYPES & SIZES BOOK COVER

Accessible and Usable PDF Documents: Techniques for Document Authors Fourth Edition

Content Enrichment. An essential strategic capability for every publisher. Enriched content. Delivered.

EPiServer s Compliance to WCAG and ATAG

University of Sunderland. Microsoft Word 2007

Creating an Accessible Word Document. PC Computer. Revised November 27, Adapted from resources created by the Sonoma County Office of Education

Layout and Content Extraction for PDF Documents

Personal Computing EN1301

This is a paragraph. It's quite short.

Word Training - Maintaining Consistency Supporting Handout Designing Styles within a Word Template Version: Mac

Quark XML Author October 2017 Update with Business Documents

Comp 336/436 - Markup Languages. Fall Semester Week 2. Dr Nick Hayward

TECH 3381 CARRIAGE OF EBU-TT-D IN ISOBMFF VERSION: 1.0 SOURCE: SP/MIM XML SUBTITLES

Word Training - Maintaining Consistency Supporting Handout Designing Styles within a Word Template Version: Windows

UNSW Global Website Branding Guidelines. Website Brand Guidelines

MICROSOFT ACADEMY WORD STUDY GUIDE FOR CERTIFICATION EXAM

The major change in Word is the ribbon toolbar. The File menu has been replaced with a button.

PDF TO HTML CONVERSION Progress Report

IMPLICIT RELIGION Guidelines for Contributors March 2007

How to use styles, lists, columns and table of contents

Productivity Tools Objectives 1

Teach Yourself Microsoft Word Topic 12 - Multipage Document Features Part 1

Certified HTML5 Developer VS-1029

ADVANCED WORD PROCESSING

Introduction to Infographics and Accessibility

Accessibility Guidelines

The Processing of Form Documents

MICROSOFT WORD. Table of Contents. What is MSWord? Features LINC FIVE

This document describes the features supported by the new PDF emitter in BIRT 2.0.

Moving from FrameMaker to Blaze or Flare: Best Practices

Additional Support and Disability Advice Centre

Audience: - Executives and managers who have already been using MS Office want to migrate to Libre Office suit.

DRAFT Section 508 Basic Testing Guide PDF (Portable Document Format) Version 0.1 September 2015

Enhancing Document Structure Analysis using Visual Analytics

MDSRC , November, 2017 Wah/Pakistan. Template for Abstract Submission

Creating Universally Designed Word 2013 Documents - Quick Start Guide

Style guide for Department for Education research reports and briefs

Microsoft Word 2013 Lab. Course Outline. Microsoft Word 2013 Lab. 20 Apr

COMP519 Web Programming Lecture 8: Cascading Style Sheets: Part 4 Handouts

Modeling Systems Using Design Patterns

Word 2010 Skills Checklist

Easing into DITA Publishing with TopLeaf

Natural Language to Database Interface

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

INFS 2150 / 7150 Intro to Web Development / HTML Programming

WJEC Unit IT2, Task 1: Desktop Publishing Self-assessment Review

Required Fixes for Office Files

Latex Manually Set Font Size For Tables

Sample Report Failures by group

TestOut Desktop Pro Plus - English 4.x.x. MOS Instructor Guide. Revised

COURSE DESIGN RUBRIC

Word for Job Seekers. Things to consider while making your resume

This module sets out essential concepts and skills relating to demonstrating competence in using presentation software.

Structure in On-line Documents

California Open Online Library for Education & Accessibility

Web Content Accessibility Guidelines 2.0 Checklist

Extracting Algorithms by Indexing and Mining Large Data Sets

Making Your Word Documents Accessible

WHAT IS BFA NEW MEDIA?

Word Long Docs Quick Reference (Windows PC)

How to use character and paragraph styles

The same can also be achieved by clicking on Format Character and then selecting an option from the Typeface list box.

Text Topics. Human reading process Using Text in Interaction Design

Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization

Quick Access Toolbar: Used for frequent commands and is customizable.

Transcription:

AIDAS: Incremental Logical Structure Discovery in PDF Documents Anjo Anjewierden Social Science Informatics, University of Amsterdam Roetersstraat 15, 1018 WB Amsterdam, The Netherlands anjo@swi.psy.uva.nl Abstract We describe the approach AIDAS uses to extract the logical document structure from PDF documents. The approach is based on the idea that the layout structure contains cues about the logical structure and that the logical structure can be discovered incrementally. 1. Introduction AIDAS is part of a research project in which the aim is to turn technical manuals into a database of indexed training material. The role AIDAS plays in this project is to take a PDF file, extract the logical structure and assign indexes to each element in this logical structure. The indexes can either be about the content of the element (e.g. this section is about the rear part of a car, this is a schematic drawing of the control unit ) or about how the element can be used in instruction (e.g. this section provides a brief overview of the components of a car ). An instructor can retrieve appropriate training material by querying the database using the indexes. The advantages of an automated approach to indexing training material and making it available electronically may be apparent. Whereas until recently the industrial partners in the project literally used scissors and glue to produce the material, they can now use the indexed database and cutand-paste to presentation programs. In order for AIDAS to be able to do its work we distinguish the following stages: Interpreting PDF. The starting point for AIDAS is a document in PDF [1]. PDF (Portable Document Format) is a page description language that is powerful in terms of the rendering capabilities it provides and widely used for document exchange. Section 2 describes how AIDAS extracts the layout structure from PDF. Discovering the logical document structure. The fragments that AIDAS needs to store correspond to the logical document structure (sections, tables, images, items, etc.). During this stage the layout structure is incrementally analysed in order to discover the logical structure (see Sections 3 4). Indexing and fragmenting the logical structure. The logical structure produced by the previous stage is annotated using a domain ontology and reasoning about the logical structure [4]. During this stage AIDAS looks at the content of the logical structure, for example comparing the title of a section to the list of concepts in the ontology. If a match between a title and a concept occurs, the section is indexed as a fragment with the name of the concept as the topic. This stage is not further discussed in this paper. Storing the document in a multi-media database. This takes the annotated fragments and converts them to a database. Text fragments are stored as ASCII and graphical fragments are stored as SVG [8]. 2. PDF vs. scanned images PDF represents a document as a set of instructions. When these instructions are processed by the PDF image model a bitmap is generated which can be rendered on a graphics device. The instructions in PDF fall into four categories: (1) control instructions which work on the image model and produce no output; (2) text instructions render glyphs; (3) graphics instructions render lines, curves and rectangles etc.; and (4) image instructions render bitmapped images. Many document analysis systems take a scanned image as the source and extract text, graphics and images from it. Such systems could also deal with PDF by including a pre-processing step that renders PDF in a bitmap. We have chosen for a design in which the stream of PDF instructions 1

is converted to a set of text, graphics and image objects. This conversion is implemented on top of xpdf [5]. The layout structure used by AIDAS consists of the set of objects generated by the conversion. There are about ten different classes of layout objects (text, line, rectangle, image, curve, etc.). And each of these classes has about ten features (position, font face, font size, color, line width etc.). A complication of PDF that does not occur with scanned images is that there are an infinite number of different PDF documents that render precisely the same image. For example, a word can be a single text object or there can be separate text objects for each character in the word. However, the layout structure generated from PDF is more reliable than extracting the layout structure from scanned images. There are complications, however, for example the issue of visibility. A PDF document could contain an instruction to render some text followed by an instruction to render a filled rectangle on top of it. 3. Logical structure discovery The logical structure (sections, item lists, tables, etc.) is not explicitly available in the layout structure and needs to be discovered. Various approaches to logical structure discovery have been suggested. If the document style (e.g. font face and size of section titles, line spacing, number of columns) is available then the use of top-down grammars that incorporate the document style is possible, but even then extensive error recovery is necessary to cater for idiosyncrasies [3]. An alternative approach is to first discover a hierarchy of layout objects (glyph, word, line, text block, column) and then map this hierarchy on the logical document structure hierarchy [9]. This approach will work well if the two hierarchies can be easily mapped on each other. The approach in AIDAS is based on the idea that layout objects represent only their layout features explicitly, but that these features contain cues about the role in the logical structure [6], [7]. For example, a text object in a large bold font (the form) contains the cue that it could be a section title (the function), and a text object containing the literal * (form) could be a bullet (function). The distinction between form and function is also seen at the logical structure. For example, a paragraph (form) can be the body of a bullet or an item (function) or it can, along with other paragraphs, be the body of a section. An extreme case is that the paragraph may be the title of a section in the case of a title spanning multiple lines. AIDAS uses this idea by assigning a set of possible functions to each layout object and incrementally chunking them to more complex objects. This process is performed incrementally until the logical structure is produced. Determining the possible functions of a layout object can be done in a bottom-up fashion, whereas the function is determined using top-down techniques (grammars). 4. Implementation The first step performed by AIDAS is to identify the overall layout of a document. This step identifies the columns, headers and footers, and the dominant font. All of these are determined by statistical analysis and can be overridden by the user. This process is easier in PDF compared to scanned images. Thereafter AIDAS breaks up each page in segments. Drawings are recognised by many graphical elements versus few text elements and tables are recognised by a lot of text classified as floating. All remaining segments are text segments. 4.1. Classifying individual layout objects The next step is to look at individual layout objects generated by the PDF interpreter. During this step AIDAS classifies each layout object on several dimensions: geometry, markup and (textual) content. For example, given the following layout object: <layout x=25 y=200 w=180 h=14 1.1 Introduction the classification results in: <layout x=25 y=200 w=20 h=14 <geometry alignment=left/> <markup size=larger emphasis=bold/> <function sectionnum="1.1"/> 1.1 <layout x=50 y=200 w=155 h=14 <geometry indentation=25/> <markup size=larger emphasis=bold/> Introduction The geometry element has been abstracted from the column boundaries, the markup element from the dominant font and the function element has been determined by matching the text against the patterns for section numbers.

This form abstraction can proceed without regard of the surrounding text because no information is lost. Should it later be discovered that the 1.1 object is not a section number then the abstractions are ignored and, perhaps, the two text objects are concatenated again (for example as part of the phrase see also Section 1.1 Introduction ). Other examples of abstractions are bullet characters ( *, -, etc.), enumeration indicators ( (a), 1), etc.) and, specifically for technical manuals, references to figures ( General Theory of Operation (Fig. 14-15A). 4.2. Shallow grammars to detect logical elements The logical structure itself is discovered by taking the set of layout objects, sorting them on y/x position and then running a set of shallow grammars on them. Each of these grammars detects a specific element of the logical structure and leaves the objects that do not match untouched. This architecture has several pleasant characteristics: (a) the grammars are shallow and complexity is roughly linear; (b) new grammars that discover other logical structure elements can easily be added; and (c) no error recovery is necessary as the grammars ignore unrecognised input. A simplified version of the grammar for detecting section titles is given below: sectiontitle(sectiontitle) --> section_number(num), section_name(name), SectionTitle is sectiontitle(sectionnumber(num), sectionname(name)). sectionnumber(num) --> Num is next_node, match(num, function, sectionnum=n), match(num, geometry, alignment=left), match(num, markup, emphasis=bold), match(num, markup, size=larger). sectionname(name) --> Name is next_node, match(name, geometry, indentation=n), match(name, markup, emphasis=bold), match(name, markup, size=larger). A match results in the replacement of layout objects by a structural object that represents a section title: <sectiontitle> <sectionnumber> <layout x=25 y=200 w=20 h=14.../> <geometry alignment=left/> <markup size=larger... /> <function sectionnum="1.1"/> 1.1 </sectionnumber> <sectionname> <layout x=50 y=200 w=155 h=14.../> <geometry indentation=25/> <markup size=larger.../> Introduction </sectionname> </sectiontitle> The incremental nature of the process can be illustrated further by considering how the next level in the logical structure, sections, is discovered: section(section) --> section_title(title), section_body(body), Section is section(title, sectionbody(body)). section_title(title) --> Title is next_node, element(title, sectiontitle). section_body(body) --> /* Collect everything until a sectiontitle element is seen. */ In the current implementation grammars for the following logical elements are used: section titles, sections, bullets, bullet lists, items, item lists, headings, paragraphs, tables and figures. The final step is to remove all abstractions from the objects and output the logical structure itself. For the above example this results in: <section level=2 id="1.1"> <sectiontitle>introduction</sectiontitle> <sectionbody>... </sectionbody> </section> This logical structure is then processed by the indexing and fragmentation module of AIDAS [2]. 5. Experimental results At the time of writing AIDAS is being used by the industrial partners to fragment technical manuals in three domains: car repair; military equipment; and electronic control systems for traffic lights.

The car repair manuals have a very specific logical structure. For each topic (e.g. body work) there is a separate section with short descriptions and lots of figures of the car and how to use tools for repair. A domain specific shallow grammar has been added to recognise this structure and it performs very well according to the users. The manuals for the military equipment (radar-based interception) consist of a mix of theory and technical drawings. The theoretical material is in four columns (A3 pages) and the drawings are in separate books with captions. One of the problems here was to create a link between the books containing the theoretical sections and the books containing the technical drawings. We have not received a formal evaluation for this material, but experiments on parts show that extracting the logical structure from the text is more than acceptable. One of the virtues of the incremental approach used in AIDAS is that it is robust with respect to unexpected input. Although it will not recognise the unexpected input (and therefore leave it as layout objects), domain specific shallow grammars can be added to recognise additional logical structures if necessary. This is very important for technical manuals, as there is no such thing as a logical document structure for technical manuals as is illustrated by the military and car domains. Technical manuals take into account the skills of the readers: in the military domain theory is supplemented by figures, whereas in the car repair domain the emphasis is on manual skills which is mainly supported by figures. The figure on the last page shows a screendump of the tool after analysing a page from the military domain. The rectangles represent segments. To the left of each recognised logical structure element is a the name of the element (e.g. sectiontitle, item, paragraph, heading). Colour is used to show to the user what part of the text is recognised. on Artificial Intelligence (BNAIC), pages 23 30, Amsterdam, October 2001. [3] T. Hu and R. Ingold. A mixed approach toward an efficient logical structure recognition from document images. Electronic Publishing, 6(4), December 1993. [4] S. Kabel, B. Wielinga, and R. de Hoog. Ontologies for indexing technical manuals for instruction. In Workshop on Ontologies for Intelligent Educational Systems, Le Mans, July 1999. [5] D. Noonburg. xpdf: A C++ library for accessing PDF. www.foolabs.com/xpdf. [6] K. Summers. Toward a taxonomy of logical document structures. In Electronic Publishing and the Information Superhighway: Proceedings of the Dartmouth Institute for Advanced Graduate Studies (DAGS), pages 124 133, 1995. [7] K. Summers. Automatic discovery of logical document structure. PhD thesis, Cornell University, August 1998. [8] W3C. SVG (Scalable Vector Graphics). www.w3c.org. [9] Y. Wang, I. Phillips, and R. Haralick. From image to SGML/XML representation: one method. In International workshop on Document Layout Interpretation and its Applications (DLIA), Bangalore, India, September 1999. 6. Acknowledgements The work presented in this paper has been supported by the European Commission as part of the IMAT (Integrating Materials and Training) project. We gratefully acknowledge comments on previous versions of this paper by Jan Wielemaker and Robert de Hoog. We also would like to thank the three anonymous reviewers whose comments we have tried to take into account. References [1] Adobe Systems Incorporated. PDF Reference version 1.3. Addison Wesley, Boston, second edition, July 2000. [2] A. Anjewierden and S. Kabel. Automatic indexing of documents with ontologies. In 13th Belgian/Dutch Conference

Figure 1. AIDAS analysing a technical manual