The New Document: Digital, Polymorphic, Ubiquitous, Actionable
Patrick P. Bergmans, University of Ghent
The Traditional Document
- Documents have been around for thousands of years
  - The Bible is a document
  - The Dead Sea Scrolls are documents
  - The hieroglyphs of Ancient Egypt are documents
- Documents have been, and continue to be, the support of a large fraction of human knowledge
- Documents are stored on a specific medium
  - For centuries, the traditional medium for documents has been paper
  - Recently, the storage medium for documents has become digital
The Traditional Document
- Paper documents were a fairly simple concept
- Digital documents are much more complex, because of their numerous additional attributes
- The Digital Document is polymorphic: it has many different embodiments and representations
- Computer scientists have introduced formal Document Models
- These models are used to
  - Analyze document transformations and evolutions
  - Identify the resources needed for those transformations
  - Define the Document Processes that govern these transformations
Dimensions of Document Space
- In these models, documents are contained in a multidimensional document space (content, structure, format, time, spatial, others), and their specific properties are identified along the axes of that space
- Document transformations are trajectories in document space, describing the life of a document and its evolutions
- The multidimensional document space can be simplified by projecting it onto sub-spaces
- The initial (content, structure, format) model considers the subspace of documents independently of time and space
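The idea of a point in document space, a trajectory, and a projection onto a sub-space can be sketched in a few lines of Python. This is only an illustration: the class `DocumentPoint`, the function `project`, and the sample values are hypothetical names chosen here, and the `time` and `location` fields are simple stand-ins for the time and spatial axes.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DocumentPoint:
    """One position of a document in document space (axis names follow the slide)."""
    content: str
    structure: str
    format: str
    time: int       # hypothetical: a version counter standing in for the time axis
    location: str   # hypothetical: a storage location standing in for the spatial axis


def project(point, axes=("content", "structure", "format")):
    """Project a point onto a sub-space by keeping only the named axes."""
    values = asdict(point)
    return tuple(values[axis] for axis in axes)


# A trajectory is the ordered list of points a document passes through.
trajectory = [
    DocumentPoint("report text", "none", "none", 0, "laptop"),
    DocumentPoint("report text", "chapters", "none", 1, "laptop"),
    DocumentPoint("report text", "chapters", "house style", 2, "server"),
]

# The same trajectory seen in the (content, structure, format) sub-space.
csf_path = [project(point) for point in trajectory]
```

Projecting every point of the trajectory drops the time and spatial axes, which is what the (content, structure, format) model does.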
Expected Model Benefits
- Precise definitions (giving common terminology)
- Definition of (generic) operators for document transformation
  - Copy, Move, Erase, Print, ...
- Explicitly show where conceptual difficulties lie, giving some idea of their fundamental nature
- Enable reasoning on document transitions (e.g. versioning and property inheritance, document rights)
Content-Structure-Format Model (three-dimensional projection)
Content-Structure-Format model
- This sub-space of the full document model can be used to illustrate how knowledge, meaning and content are derived and transformed during the structuring, formatting and physical output of a document
- The vertical axis is some sort of overall evolutionary axis, but not exactly a time axis
- Local transformations are
  - Content transformations at any time
  - Structure transformations in the structured document plane
  - Format transformations in the styled document plane
Layer & Property Examples
Patrick P. Bergmans / The New Document / Analogous Spaces / May 17, 2008
[Figure: the layer stack from top to bottom. Layers are at the first level, the transition properties between them are indented, each with examples. Structured Content and Styled Content are labelled "Structure/Format planes"; the layers from Structured Content to Raw Digital Image are labelled "Digital Documents".]
- Knowledge, Intent, Meaning: logical premises
  - Form: language, pictorial, musical, gesture
- Basic Content: text, artwork
  - Logical Structure: DTD
- Structured Content: SGML, XML, (HTML)
  - Presentation Format: style sheet, XSL
- Styled Content: DOC, WPF, RTF, (HTML)
  - Resources: fonts
- Output Representation: PDF, PS, PCL, MIDI
  - Media Properties: page size, screen resolution
- Raw Digital Image: TIFF, GIF, BMP, WAV
  - Device Properties: screens, CD, audio cassette, VHS, Minidisk, DVD
- Physical Representation: paper, sound, video, voice
Digital Documents
- There are many forms of Digital Documents
- It is extremely important to distinguish them
  - According to the expected usage
  - In terms of storage, editability, etc.
- Issue: coexistence at several levels of representation
  - Logical and physical
  - Logical concepts: chapter, paragraph, sentence, word space
  - Physical concepts: page, column, line of text
The Four Types of Digital Documents
- The Digital Document: Structured, Styled, PDL, Bitmap
- The Paper Document
Digital - Bitmap
- Document stored as an array of pixels
- Is really a digital picture of the document
- Simple 1-to-1 representation of the physical document
- Examples: TIFF, GIF, BMP, PNG, JPG
- Large storage volume
- Little processing for imaging
- Essentially not editable (except with image processing tools); no text reflow
Digital - Page Description
- Contains objects, such as characters (glyphs), graphics and images, and a description of where (and sometimes how) they appear on the page
- Examples: PostScript, PDF, PCL
  - PostScript is a programming language
  - PDF is a non-procedural data representation system
- Reasonably compact storage
- Processing required for imaging (RIP)
- Device independent
- Marginally editable (moving objects), but no text reflow
Digital - Styled Document
- Document contains styled and sequenced graphic elements, and a limited amount of structure
- Examples: RTF, DOC (MS Word document), WPF
- Reasonably compact storage
- Requires processing for output (driver)
- Completely editable, and text may be reflowed
- But no structure-driven editing
Digital - Structured Document
- Document is highly structured, and structure-controlled
- Examples: SGML, XML
  - HTML is a hybrid (many properties of a markup language, some of RTF)
- Powerful concept of document type definition (DTD)
- High structure-controlled editability
- Text is contained in unprocessed elements; text reflow is possible because of this very representation
- Often requires complex editing tools
- Often used in technical documents
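The properties of the four types of digital documents can be gathered into one small lookup table. The sketch below is illustrative only: the names `DocumentType`, `DOCUMENT_TYPES`, `types_with_reflow` and `classify` are hypothetical, while the examples and the editability wording are taken from the slides. HTML is deliberately left out of the examples because the slides describe it as a hybrid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentType:
    name: str
    examples: tuple
    editability: str    # wording taken from the slides
    text_reflow: bool


# Ordered from the lowest level (bitmap) to the highest (structured).
DOCUMENT_TYPES = (
    DocumentType("Bitmap", ("TIFF", "GIF", "BMP", "PNG", "JPG"),
                 "essentially not editable", False),
    DocumentType("Page Description", ("PostScript", "PDF", "PCL"),
                 "marginally editable", False),
    DocumentType("Styled", ("RTF", "DOC", "WPF"),
                 "completely editable", True),
    DocumentType("Structured", ("SGML", "XML"),
                 "structure-controlled editing", True),
)


def types_with_reflow():
    """Names of the document types in which text can be reflowed."""
    return [t.name for t in DOCUMENT_TYPES if t.text_reflow]


def classify(file_format):
    """Return the type whose example list contains the format, else None."""
    wanted = file_format.upper()
    for doc_type in DOCUMENT_TYPES:
        if wanted in (example.upper() for example in doc_type.examples):
            return doc_type.name
    return None
```

A call such as `classify("pdf")` returns the name of the matching type, and a format that is not listed (HTML, for instance) returns `None`.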
[Figure: the layer stack from top to bottom, with the downward transform and transition property between layers, plus examples and tools]
- Knowledge, Meaning, Intent (Select): logical premises
  - Express, Form: language, pictorial, musical, gesture
- Basic Content: text, artwork
  - Organize, Logical Structure: DTD
- Structured Content: SGML, XML, (HTML); tools: Framemaker, XML parser
  - Style, Presentation Format: style sheet, XSL
- Styled Content: DOC, WPF, RTF; tools: Microsoft Word, Quark Xpress, Postscript Driver
  - Compose, Resources: fonts
- Output Representation: PDF, PS, PCL, MIDI; tools: Adobe Illustrator, RIP, Speech & Sound Synthesizer
  - Render, Media Properties: page size, screen resolution
- Raw Digital Image: TIFF, GIF, BMP, WAV; tools: Adobe Photoshop, marking engine, CRT, LCD, AV system
  - Playback, Device Properties: screens, CD, audio cassette, VHS, DVD
- Physical Representation: paper, sound, video
Starting from Paper
- What if the original document is paper?
  - Scan to a digital document
  - What level do we scan to?
- Digital-to-paper is many-to-one
  - Green button operation
- Paper-to-digital is one-to-many
  - Level depends on purpose
  - For storage, bitmap level might be sufficient
  - For edits, at least styled content level is needed
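The rule "level depends on purpose" can be written as a small lookup. This is a sketch under stated assumptions: the names `SCAN_LEVELS`, `MINIMUM_LEVEL` and `sufficient_levels` are hypothetical, the level order follows the four types of digital documents, and only the two purposes named on the slide are covered.

```python
# Levels a scan can target, ordered from lowest to highest
# (order taken from the four types of digital documents).
SCAN_LEVELS = ["bitmap", "page description", "styled content", "structured content"]

# Minimum level per purpose, as stated on the slide.
MINIMUM_LEVEL = {
    "storage": "bitmap",
    "edit": "styled content",
}


def sufficient_levels(purpose):
    """Return every scan level that is high enough for the given purpose."""
    start = SCAN_LEVELS.index(MINIMUM_LEVEL[purpose])
    return SCAN_LEVELS[start:]
```

Because paper-to-digital is one-to-many, the function returns a list of acceptable targets rather than a single answer.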
Upward Transforms
[Figure: the layer stack read from bottom to top, with the transition property and the upward transform between layers]
- Physical Representation
  - Device Properties: Capture
- Raw Digital Image
  - Media Properties: Segment
- Output Representation
  - Resources: Recognize
- Styled Content
  - Presentation Format: Re-Structure
- Structured Content
  - Logical Structure: Fragment
- Basic Content
  - Form: Understand
- Knowledge, Intent, Meaning: Learn
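The layer diagrams suggest that moving a document between two layers is a chain of named transforms. The sketch below assumes one reading of those diagrams, namely that each transform sits between two adjacent layers; Select and Learn, which appear at the top of the stack, are left out because their position is not clear from the diagrams. The names `LAYERS`, `DOWNWARD`, `UPWARD` and `path` are hypothetical.

```python
# Layers from top to bottom, as named in the diagrams.
LAYERS = [
    "Knowledge/Intent/Meaning",
    "Basic Content",
    "Structured Content",
    "Styled Content",
    "Output Representation",
    "Raw Digital Image",
    "Physical Representation",
]

# DOWNWARD[k] leads from LAYERS[k] down to LAYERS[k + 1] (assumed alignment).
DOWNWARD = ["Express", "Organize", "Style", "Compose", "Render", "Playback"]

# UPWARD[k] leads from LAYERS[k + 1] up to LAYERS[k] (assumed alignment).
UPWARD = ["Understand", "Fragment", "Re-Structure", "Recognize", "Segment", "Capture"]


def path(source, target):
    """List the transforms needed to go from one layer to another."""
    i, j = LAYERS.index(source), LAYERS.index(target)
    if i <= j:
        return DOWNWARD[i:j]
    return list(reversed(UPWARD[j:i]))
```

Under this reading, scanning a paper document so that it can be edited corresponds to `path("Physical Representation", "Styled Content")`, and printing a styled document corresponds to the reverse call.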
[Figure: the layer stack, annotated with examples of transformations within a layer]
- Knowledge, Intent, Meaning: Product Specification, Customer Documentation (re-targeting)
- Basic Content: English, German, French (translation)
- Structured Content: SGML, HTML, XML (structure edits, conversions)
- Styled Content: WPF, DOC, RTF (content edits, conversions)
- Raw Digital Image: TIFF, GIF, BMP (processing, format conversion)
Examples of Applications of the Model
[Figure: Analog Copier. Two copies of the layer stack, from Knowledge/Intent/Meaning down to Physical Representation, side by side.]
[Figure: Digital Copier. Two copies of the layer stack side by side, with these labels at the Raw Digital Image level: Bitmap, Image Processing, Bitmap.]
[Figure: Multi-Function Devices. Two copies of the layer stack side by side, with these labels: DOC and TextBridge at the Styled Content level; PDL, OCR and PDL Ripping at the Output Representation level; Bitmap, Image Processing and Bitmap at the Raw Digital Image level.]
[Figure: Translating Copier. The layer stack with the upward transforms (Capture, Segment, Recognize, Re-Structure, Fragment, Understand, Learn) on one side and the downward transforms (Select, Express, Organize, Style, Compose, Render, Playback) on the other.]
[Figure: The function of the RIP. The layer stack divided into the Digital World (the digital layers down to Raw Digital Image) and the Paper World (Physical Representation). The RIP sits between the PDL (PS, PDF, etc.) and the Raw Digital Image. Label: One-to-one.]
[Figure: Steps to improve OCR. The layer stack with the upward transforms (Capture, Segment, Recognize, Re-Structure, Fragment, Understand, Learn), with the recognition and understanding steps expanded.]
- Processing steps: Recognize Basic Form, Recognize Language, Morpho-analysis, Syntactic Parsing, Semantic analysis, Pragmatic analysis
- Intermediate results: Text content, English Text, Tagged Content, Content Dependencies, Semantic Content, Basic Content
- Form: Language, Parts of speech, Syntactic Structure, Semantic Structure, Content type
- External information: Trigram model, Morphosyntactics, Language syntax, Language semantics, Pragmatic knowledge
The Networked Document (1)
- The Paper Document
- The Digital Document: Structured, Styled, PDL, Bitmap
- The Networked Document
  - Distributed and Hyperlinked Documents
  - Documents with Network Intelligence
  - Documents with Workflow Intelligence
  - Mobile Documents
The Networked Document (2)
- Parts of documents are stored in different locations on the network
  - For example, images on an image server
  - Or a large number of logically linked servers
- Documents are dynamically assembled
  - When viewing
  - When printing
- Requires networks with
  - High performance
  - High availability
- The technology for dynamic document assembly is hyperlinking
The Networked Document (3)
- Hypertext was a major fundamental advance in Document Storage Architecture
- Documents are linked to integrate external objects
- Powerfully implemented in HTML and XML
  - HTML is vulnerable, unfit for a robust corporate Document Management System
  - XML is much better for linking purposes (through XLL)
- Network-based storage of corporate documents requires a Document Storage Architecture with robust links and strong link management
- This means distinguishing between intranet and Internet
The Networked Document (4)
- Bi-directional linking, link registry, link ownership and link lifetime management are key
[Figure: an intranet with full object control, showing linked objects A, B1, B2, C1, C2, C3, X1, X2, X3 and Z1. C2 knows it is used by B1 and C3; X2 and Z1 don't know who uses them.]
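A link registry with bi-directional links can be sketched as two indexes that are always updated together. This is a minimal illustration, not a description of any particular document management system: the class `LinkRegistry` and its methods are hypothetical, and the object names B1, C2 and C3 are borrowed from the figure.

```python
from collections import defaultdict


class LinkRegistry:
    """Hypothetical registry that records every link in both directions."""

    def __init__(self):
        self._uses = defaultdict(set)      # source -> targets it links to
        self._used_by = defaultdict(set)   # target -> sources that link to it

    def add_link(self, source, target):
        self._uses[source].add(target)
        self._used_by[target].add(source)

    def remove_link(self, source, target):
        self._uses[source].discard(target)
        self._used_by[target].discard(source)

    def used_by(self, target):
        """Reverse lookup: which objects use this one?"""
        return sorted(self._used_by.get(target, ()))

    def can_delete(self, obj):
        """An object can be deleted safely only when nothing links to it."""
        return not self._used_by.get(obj)


registry = LinkRegistry()
registry.add_link("B1", "C2")
registry.add_link("C3", "C2")
users_of_c2 = registry.used_by("C2")
```

With the reverse index in place, C2 "knows" it is used by B1 and C3, so deleting it can be refused until both links are removed. An object outside the registry has no such protection.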
The Network Intelligent Document (1)
- Documents which adapt themselves to the (limited) bandwidth of the network
- Documents with hierarchical information representation
  - On a printer
  - On a display
  - On a Portable Document Reader
- Requires several levels of representation
  - Explicitly stored internally
  - Or automatic summarization
The Network Intelligent Document (2)
- Automatically generated at authoring time
- To be available when needed (like thumbnails)
- Reproduction adapted to available bandwidth or storage
[Figure: levels of representation on a scale from small bandwidth and small storage to large bandwidth and large storage: URL, Summary, Full Text, Full Images]
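Adapting the reproduction to the available bandwidth amounts to choosing the richest level of representation that fits a transfer budget. The sketch below is only an illustration: the function `choose_representation`, the `max_wait_s` parameter and the sizes in bytes are made-up assumptions, and the order of the levels (URL, summary, full text, full images) is one reading of the figure.

```python
# Levels ordered from cheapest to richest.
# The sizes in bytes are made-up, illustrative numbers.
REPRESENTATIONS = [
    ("url", 100),
    ("summary", 2_000),
    ("full text", 50_000),
    ("full images", 5_000_000),
]


def choose_representation(bandwidth_bytes_per_s, max_wait_s=2.0):
    """Pick the richest level that can be transferred within max_wait_s."""
    budget = bandwidth_bytes_per_s * max_wait_s
    chosen = REPRESENTATIONS[0][0]   # fall back to the URL if nothing else fits
    for name, size in REPRESENTATIONS:
        if size <= budget:
            chosen = name
    return chosen
```

The same selection could be driven by available storage instead of bandwidth by passing a storage budget in place of the product of bandwidth and waiting time.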
The New Document is
- Live, Dynamic, Updatable
  - Freezes when printed or converted to a static (conventional digital) document; lives on the net
- Linked, Hyperlinked
  - With links resolved at rendering time (printing or viewing)
  - Implementing the ultimate late binding capability
  - Inherently supporting variable/personalized publishing
  - With reverse link control for document integrity
- Intelligent, Adaptable
  - Integrates some of its own workflow procedures
  - Understands the limitations of communication channels, and of viewing or printing equipment, and adapts itself
  - Auto-translating, Auto-summarizing
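Late binding of links can be shown with a tiny renderer that resolves links only when the document is rendered. Everything in this sketch is hypothetical: the `{{name}}` link syntax, the `render` function, and the `store` dictionary that stands in for objects stored on the network.

```python
import re

# Hypothetical link syntax: {{name}} refers to an object stored elsewhere.
LINK = re.compile(r"\{\{(\w+)\}\}")


def render(template, resolve):
    """Resolve every link at rendering time by calling resolve(name)."""
    return LINK.sub(lambda match: resolve(match.group(1)), template)


store = {"price": "10 EUR"}            # stands in for objects on the network
live = "Current price: {{price}}"      # the live document holds only the link

first = render(live, store.__getitem__)    # a frozen copy, e.g. a printout
store["price"] = "12 EUR"                  # the linked object changes
second = render(live, store.__getitem__)   # a later rendering sees the update
```

The first rendering is frozen, like a printed copy, while the live document picks up the change the next time it is rendered. Personalized publishing follows the same pattern with a different `resolve` function per reader.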
The New Document is
- The generator of a whole new range of activities, such as
  - Document-based collaboration activities
    - Collaborative authoring
  - Document-based business and administrative processes
    - Supports complex product design and approval cycles
    - Integrates document rights
    - Integrates digital signatures and biometric data
  - Document-based search methods and engines
    - Search engines for the WWW
    - Search engines for DMS
    - Meta-search engines and restricted-domain search engines
The New Document is
- Digital, of course, and
- Polymorphic: exists in many different variations, media, formats, etc.
- Ubiquitous: linked and hyperlinked, distributed, dynamic and mobile
- Actionable: supporting business processes and generating activities unthinkable a decade ago
Thank you