PDF Essentials. The Structure of PDF Documents

Similar documents
PDF PDF PDF PDF PDF internals PDF PDF

A Short Introduction to PDF

PDF and Accessibility

PDF in Smalltalk. Chris1an Haider

Introducing PDF/UA. The new International Standard for Accessible PDF Technology. Solving PDF Accessibility Problems

Vulnerability Report

DOWNLOAD OR READ : WHERE TO GET FREE ENGINE DIAGRAMS 1995 SPORT JET MERCURY PDF EBOOK EPUB MOBI

Portable Document Format Reference Manual Version 1.2

Instructions for Using PDF Tests and Journals

DEFINE FILE FORMAT SPECIFICATIONS. CSV /TSV Specifications

DATA HIDING IN PDF FILES AND APPLICATIONS BY IMPERCEIVABLE MODIFICATIONS OF PDF OBJECT PARAMETERS

How to set up a local root folder and site structure


HTMLDOC On Line Help

DOWNLOAD PDF WHAT IS OPEN EBOOK

DOWNLOAD OR READ : TOOLS FOR TEXT AND IMAGE ANALYSIS AN INTRODUCTION TO APPLIED SEMIOTICS PDF EBOOK EPUB MOBI

Webbed Documents 1- Malcolm Graham and Andrew Surray. Abstract. The Problem and What We ve Already Tried

Converting HTML to PDF author debbiet

Modern Requirements4TFS 2018 Release Notes

Portable Document Format (PDF): Security Analysis and Malware Threats

Reader. PDF Writer. PostScript File. Distiller. Control whether or not the Save File As dialog box is displayed.

Q U A L I T Y PR I NT I NG

UNIT - V DHARANI KUMAR.S/AP/MECH

Adobe Acrobat Reader 4.05

CREATING ACCESSIBLE SPREADSHEETS IN MICROSOFT EXCEL 2010/13 (WINDOWS) & 2011 (MAC)

SPECS. Innova Ideas & Services 304 Main Street, Ames, Iowa AD SIZES: TRIM SIZE OF PUBLICATION IS 8.5" X 11" FILE REQUIREMENTS:

DOWNLOAD OR READ : THE LATEST WORD OF UNIVERSALISM PDF EBOOK EPUB MOBI

BOXOFT Image to PDF s allow you scans paper documents and automatically s them as PDF attachments using your existing software

Using Label Merge Fields

User s Guide to Creating PDFs for the Sony Reader

Numerical Analysis Timothy Sauer Second Edition

Malicious Document Analysis Beginners Guide.

An Introduction To GCC: For The GNU Compilers GCC And G++ By Richard M. Stallman, Brian J. Gough

Supporting Level 2 Functionality

Adobe EXAM - 9A Adobe InDesign CS5 ACE Exam. Buy Full Product.

DRF Programs for Authors

HOW TO UTILIZE MICROSOFT WORD TO CREATE A CLICKABLE ADOBE PORTABLE DOCUMENT FORMAT (PDF)

AutoPagex Plug-in User s Manual

InDesign ACA Certification Test 50 terms hollymsmith TEACHER

This document describes the features supported by the new PDF emitter in BIRT 2.0.

Microsoft Office Publisher

BEST FILE FORMAT FOR HIGH RESOLUTION

k-depth Mimicry Attack to Secretly Embed Shellcode into PDF Files

Creating an Accessible Word Document. PC Computer. Revised November 27, Adapted from resources created by the Sonoma County Office of Education

FineReader Engine Overview & New Features in V10

LANGSEC. Arthur, Jan, Marco. Berlin, 14. Oktober Humboldt-Universität zu Berlin Institut für Informatik Lehrstuhl Praktische Informatik

DOWNLOAD OR READ : THE BEST OF IT PDF EBOOK EPUB MOBI

STD 7 th Paper 1 FA 4

FDA Portable Document Format (PDF) Specifications

Lecture 19 Media Formats

Dvipdfm User s Manual

Document management applications Raster Image Transport and Storage Use of ISO (PDF/raster) Version 1.0


What will I learn today?

Approach to Archive Programs in PDF and Retrieve Programs from PDF Files

Introduction to Network Operating Systems

III-6Exporting Graphics (Windows)

DOWNLOAD OR READ : WHAT TO DO WHEN YOU WANT TO BORROW SMALL BUSINESS MANAGEMENT GUIDES BOOK 1 PDF EBOOK EPUB MOBI

Introduction. Secondary Storage. File concept. File attributes

Chapter 2 Creating and Editing a Web Page

ADDITIONAL RECOMMENDED STANDARDS FOR DRS DEPOSIT:

Creating an Accessible Word Document. Mac Computer. Revised November 28, Adapted from resources created by the Sonoma County Office of Education

Universal Design for Learning Tips

Programs We Support. We accept files created in these major design and layout programs. Please contact us if you do not see your program listed below.

HARD COPIES TO BE SENT WITH THE OTHER END-OF-SEMESTER MATERIALS (see Semester Forms Checklist for list of all materials required):

Adobe Acrobat Reader Help

Creating PDF Files: QuarkXPress 7 thru 10 Mac

User Services. WebCT Integrating Images OBJECTIVES

Accessible Formatting for MS Word

Using 3D PDF with MIL-STD-31000A BEST PRACTICES 3D PDF CONSORTIUM DRAFT VERSION 3 09/25/17

compart PDF/A-Support in Compart Products PDF/A White Paper White Paper May 2006

D CLIENT for DIRECTOR/DIRECTOR PRO Series Publishing System Operator s Guide

Everything You Wanted to Know About Adobe Acrobat Annotation Handlers But Were Afraid to Ask

ABBYY FineReader 14. User s Guide ABBYY Production LLC. All rights reserved.

What you will learn 2. Converting to PDF Format 15 Converting to PS Format 16 Converting to HTML format 17 Saving and Updating documents 19

IMPORTING FILES FREE viviso.com IMPORTING FILES FREE. page 1 / 5

SQL Parsers with Message Analyzer. Eric Bortei-Doku

Creating Interactive PDF Forms

DOWNLOAD OR READ : WORD 10 FOR MAC OS X VISUAL QUICKSTART GUIDES PDF EBOOK EPUB MOBI

Creating PDF documents for User Manual

for Beginners COPYRIGHT NATIONAL SEMINARS TRAINING. ALL RIGHTS RESERVED.

REIHE INFORMATIK 5/2001 RTP/I Payload Type Definition for Shared Whiteboards Jürgen Vogel Universität Mannheim Praktische Informatik IV L15, 16

AFP Container Architecture

ICT IGCSE Practical Revision Presentation Web Authoring

Calc Guide. Chapter 6 Printing, Exporting and ing

qstart_guide.book Page 1 Tuesday, June 20, :52 AM Quick-Start Guide

Making Your PowerPoint Presentations Accessible

Preview tab. The Preview tab is the default tab displayed when the pdffactory dialog box first appears. From here, you can:

Data Compression. An overview of Compression. Multimedia Systems and Applications. Binary Image Compression. Binary Image Compression

Amyuni PDF Converter. User s Manual Updated September 26, 2005

MEDIA RELATED FILE TYPES

Text Style Italic ROTATE RGB RED. ROTATE CMYK Yellow. ROTATE CMYK Cyan ROTATE RGB BLUE

How to use the Acrobat interface and basic navigation

Numbers Basics Website:

Making Your Word Documents Accessible

Media Types. Web Architecture and Information Management [./] Spring 2009 INFO (CCN 42509) Contents. Erik Wilde, UC Berkeley School of

Blind digital watermarking in PDF documents using Spread Transform Dither Modulation.

PDF Expert for ipad User guide

The ABC of PDF with itext

Scan to PC Desktop Professional v9 vs. Scan to PC Desktop SE v9 + SE

Transcription:

Dr. Edgar Huckert 63773 Goldbach, Germany E-Mail: Huckert@compuserve.com 6-97 PDF Essentials PDF is an acronym for "Portable Document Format". The format has been designed by Adobe for the exchange of documents on various platforms. Currently the format is used under Windows, MAC and various UNIX platforms. This is essentially a final format - in contrast to SGML based formats which are aimed to be revisable. "Final" means: it is not primarily aimed at being edited - although this is possible. The most visible difference compared to other popular formats is the complete absence of information about the logical structure of a document: there are no information in PDF about chapters, headers, footers, tables of content etc. The most important structural concept is the page: PDF is like its close parent Postscript a page oriented format. PDF started in version 1.0 as a 7-bit encoded format. In later versions the size aspect became more important so that the 7-bit restriction was dropped. LZW compression for the text portions and JPEG or CCITT compressions without additional 7-bit encoding à la uuencode are now common in PDF documents. The other major addition to the set of PDF features was the enhancement of the link concept. You can now define links within a document, links to other documents and links to external applications. The link concept appears in PDF as a generalized annotation concept. The PDF specification has been published by Adobe. There is no major risk that PDF documents can't be understood in the future as the internal structure of PDF is known. This is its major advantage over formats like DOC used by Microsoft WORD whose internals are not known. The first Adobe PDF spec (version 1.0) was published as the normal book. The last PDF specification I have is the version 1.1. At the time of this writing a newer version 1.2 should be available. The specs can be found on the Adobe ftp home. They can also be downloaded from the Adobe web home. The Structure of PDF Documents PDF documents consist mainly of four parts: a very short header the document body the cross reference

the trailer The header identifies a document as a PDF document via a string "%PDF-1.1". This is like a magic number in UNIX files. The document body contains the information about the objects on the pages. This is a collection of objects organized as a tree describing the page structure, the pages and the elements (text, graphics etc.) on the pages. The PDF objects are the leafs of the PDF tree. Each object has three essential components: it has a number, it has a fixed position in the PDF file (an offset) and it has a content. The cross reference allows a PDF parser (like the Acrobat reader) to quickly access the objects. It starts with the keyword xref followed by the number of objects. An entry in the cross reference is mainly an offset in the document file and a flag indicating whether this object is free (can be reused) or occupied. For advanced applications multiple cross references can be kept in a PDF file thus allowing to maintain a chain of versions for a document. Note that the cross reference doesn t allow much syntactic freedom: it must be written and read like a fixed header in formats like PCX. This is the start of a typical cross reference: xref 0 20 0000000000 65535 f 0000000009 00000 n 0000000116 00000 n... The trailer is a quick entry into the most essential document parts: it contains a pointer to the start of the cross reference and the object numbers for the most essential objects: the root object (this is the start of the tree of pages) and the info object containing some important meta information. This is a sample trailer: trailer /Size 20 /Root 19 0 R /Info 18 0 R startxref 2354 %%EOF Whereas the PDF document body must be parsed by a program (the token must be identified and interpreted) the header, cross reference and the trailer must be handled like data structures. The body of a PDF document The body of a PDF document is essentially a tree of objects starting with the root object mentioned before. The root object has a reference to the /Pages object; this is the start of the page

tree. Remember that PDF is a page oriented format - you will find no logical layout information in the documents. If you look into the /Pages object you will see an array of type /Kids : this array contains the object numbers of the /Page objects, i.e. of the document pages. Each page starts with a /Page object. Such a /Page object has a set of page-wide attributes and a set of content attributes referenced by the /Contents array. Here is a typical /Page object: 17 0 obj /Type /Page /Parent 11 0 R /MediaBox [0 0 600 810] /Resources /Font /F1 1 0 R /F2 2 0 R /ProcSet 10 0 R /Contents [16 0 R 15 0 R ] The array following the /MediaBox keyword defines the size of the page. The /Resources dictionary (dictionaries appear between and ) names to fonts to be used on the page. If you follow the reference to the /ProcSet object you will see whether this page uses graphics oder video or audio components. Note that the page object also contains a reference to the parent object, i.e. to the /Pages object. Let s now look at the operators used to draw text and simple line graphics. The operators used in PDF have much in common with the operators used in Postscript. The following extract from a PDF file is an object that draws an header on the page. It displays a text line followed by a horizontal line used as a separator: 6 0 obj /Length 88 /F1 11 Tf 1 0 0 1 25 798 Tm (kopfzeile - 1 -) Tj 2 w 12 792 m 550 792 l S The important operators are her: Tj shows the text (enclosed in ( and ) Tm defines the text matrix (the coordinates) w defines the line width m moves the cursor

A PDF sample document Note: if you plan to write a simple PDF document with your favorite editor be careful with the offsets in the cross reference. You must know whether you favorite operating system expands linefeeds to CR + LF or not! If ever you transfer a PDF document from one operating system to an other operating system use the BINARY option with ftp. This sample PDF document contains two pages. Each page only shows a single line of text. A header line appears at the top of each page. If you try to understand the structure of the PDF file remember that PDF documents form a tree and that you should start from the end of the file: you have to locate the root object (here object 12) and then the /Pages object (here object 4). From there on you find the two /Page objects mentioned as /Kids: the objects 7 and 10. The rest should be easy. %PDF-1.1 1 0 obj /Type /Font /Subtype /Type1 /Name /F1 /BaseFont /Helvetica /Encoding /WinAnsiEncoding 2 0 obj /Type /Font /Subtype /Type1 /Name /F2 /BaseFont /Helvetica-Bold /Encoding /WinAnsiEncoding 3 0 obj [/PDF /Text /ImageB] 5 0 obj /Length 81 /F1 13 Tf 1 0 0 1 25 762 Tm (Das ist nur ein Beispielstext auf Seite 1) Tj 6 0 obj /Length 88 /F1 11 Tf 1 0 0 1 25 798 Tm (kopfzeile - 1 -) Tj

2 w 12 792 m 550 792 l S 7 0 obj /Type /Page /Parent 4 0 R /MediaBox [0 0 600 810] /Rotate 0 /Resources /Font /F1 1 0 R /F2 2 0 R /ProcSet 3 0 R /Contents [6 0 R 5 0 R ] 8 0 obj /Length 90 /F1 13 Tf 1 0 0 1 25 774 Tm () Tj /F1 13 Tf 1 0 0 1 25 762 Tm (Text auf Seite 2) Tj 9 0 obj /Length 88 /F1 11 Tf 1 0 0 1 25 798 Tm (kopfzeile - 2 -) Tj 2 w 12 792 m 550 792 l S 10 0 obj /Type /Page /Parent 4 0 R /MediaBox [0 0 600 810] /Rotate 0 /Resources /Font /F1 1 0 R /F2 2 0 R /ProcSet 3 0 R /Contents [9 0 R 8 0 R ] 11 0 obj /Subject (kopfzeile)

/Title (mktxtpdf demo) /Creator (Program mktxtpdf) /Author (Dr.E.Huckert, Goldbach (Germany)) /CreationDate (21.06.1997) 4 0 obj /Type /Pages /Count 2 /Kids [7 0 R 10 0 R ] 12 0 obj /Type /Catalog /Pages 4 0 R xref 0 13 0000000000 65535 f 0000000009 00000 n 0000000116 00000 n 0000000228 00000 n 0000001314 00000 n 0000000264 00000 n 0000000394 00000 n 0000000531 00000 n 0000000702 00000 n 0000000841 00000 n 0000000978 00000 n 0000001150 00000 n 0000001379 00000 n trailer /Size 13 /Root 12 0 R /Info 11 0 R startxref 1429 %%EOF