File Formats for Digital Preservation

File Formats for Digital Preservation
Fabian M. Suchanek
Based on "Best File Formats for Archiving"

Pre-Digital Storage: How old is this? (Image: Code of Hammurabi)

Pre-Digital Storage: And this? (Image: St Cuthbert Gospel)

Pre-Digital Storage: And this? Can you still read it? (Image: US Declaration of Independence)

Digital Storage: And this? Can you still read it? (Image: floppy disk)

Digital Storage: And this? Can you still read it? (Image: CD-ROM)

Digital Media may be irrecoverable
(Hypothetical graph, log scale: number of documents over time, from Antiquity through the Middle Ages to Modern times, with a "Digital Dark Age".)
This poses problems for historians, for society (legal problems), and for the individual.
Example: The original footage of the 1969 moon landing was lost. Only low-quality copies remain. [Wikipedia / Apollo 11 missing tapes]
Other example: [Wired Magazine, 2017-01-19]
(Images: my grandfather in his twenties; me in my twenties.)

Digital Preservation
There are (at least) 3 sources of obsolescence:
1) Digital media decays
2) Digital media becomes obsolete
3) File format becomes obsolete

Life expectancy of media
(Chart of media life expectancy. Estimates differ widely. This one is by Crashplan.com.)
Orthogonal dangers:
- hazards (fire, theft, loss, mishandling)
- media no longer supported
Opinion of the Digital Preservation Workshop: "Media technology changes so rapidly that high-longevity media is likely to be threatened by obsolescence before its useful life is over."

Digital media becoming obsolete
Example: BestBuy stops selling CDs.
Today: optical storage, USB keys, cloud services.
The best solution appears to be to always copy the data from the old medium to the new one.
[DB Workshop] See also: Museum of Obsolete Media

File Formats can become obsolete
"The [British] National Archives, which holds 900 years of written material, has more than 580 terabytes of data (the equivalent of 580,000 encyclopaedias) in older file formats that are no longer commercially available." [BBC: Warning of data ticking time bomb, 2007-07-03]

Established File Formats
A file format is considered established if
- it has been around for a sufficiently long time
- it is supported by several vendors (and not just by a single company)
- it is platform-independent (works on Windows, Mac, Linux, mobile)
Examples: MP3 for audio (since 1993), JPG for images (since 1992), PDF for documents (since 1993)

Example: Flash
Flash is a software suite by Adobe for the production of animations, browser games, rich Internet applications, desktop applications, mobile applications, and mobile games. It consists of
- FLA: the main file format of Flash projects
- SWF (Shockwave Flash): a file format for multimedia and action scripts
- FLV: the main file format for Flash videos
Flash has been around since 2000, can be played in most desktop browsers, and is thus platform-independent => very established.
80% of Web users interacted with Flash at least once a day in 2014. [Chromium.org]

BUT: Flash was abandoned
Flash has security problems, and was superseded by HTML5 capabilities. [Adobe News]

Not everybody has noticed...
You may also still have Flash videos on your computer! FLV is a container format, so you might be able to recover the content losslessly.

Established & not abandoned
The rest of this lecture is concerned with file formats (1) that are established and (2) that show no signs of abandonment.

File Formats
A file format is a standard way that information is encoded for storage in a computer file [Wikipedia].
(Figure: the data on one side, the data as a sequence of bytes stored in a file on the other. The file format defines the translation between the two.)

File Extension
The file extension is the part of the file name behind the last dot. It identifies the file format.
(Figure: data stored as a sequence of bytes in a file with the file extension JPG.)
Text documents: DOCX, ODT, ...
Images: JPG, PNG, ...
Audio: MP3, OGG, ...
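
The definition above ("the part of the file name behind the last dot") can be tried out directly. The following is a minimal sketch using Python's standard `pathlib` module; the function name `file_extension` and the example file names are made up for illustration.

```python
from pathlib import Path


def file_extension(file_name: str) -> str:
    """Return the part of the file name behind the last dot, in upper case."""
    suffix = Path(file_name).suffix  # e.g. ".jpg"; empty string if there is no extension
    return suffix[1:].upper()


print(file_extension("holiday.jpg"))
print(file_extension("archive.tar.gz"))
```

Note that the extension is only a hint: renaming a file does not change the bytes stored in it.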

Types of Data: Introduction, Images, Audio, Video, Office Documents, Summary

SVG
Scalable Vector Graphics is a file format intended for vector images (= images that consist of simple geometric shapes). File extension: SVG
Data stored in file (simplified): <circle cx="30" cy="30" r="10" stroke="blue" /> <line x1="15" x2="15" ... />
SVG describes the shapes in XML (a human-readable format).

SVG is for geometric shapes
SVG has been around since 1999, and can be displayed in all browsers => very established
SVG is great for geometric shapes, but NOT for more complex images.
Data stored in file: <man look=left nose=big /> <tie style=old color=... This does NOT WORK!

PNG
Portable Network Graphics is a file format intended for raster images (= images that consist of pixels). File extension: PNG
Data stored in file (simplified): WWWWWWBBBBB WWWBBBBBBBB...
The file stores the color of every pixel. The data is then compressed.

PNG Details
PNG files start with the bytes 0x89 "PNG" 0x0D 0x0A 0x1A 0x0A, i.e. if a DOS line break (CR LF) were transformed into a Linux line break (LF) or vice versa, we would notice.
PNG files can define their colors in a palette.
Palette: 0 = dark gray, 1 = light gray, 2 = light brown, ...
Data stored in file (simplified): 22222220000011111111...
There are also standard palettes (most notably red/green/blue).
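
The role of the eight signature bytes can be illustrated with a few lines of Python. This is a minimal sketch; the helper name `looks_like_png` is hypothetical, and the check covers only the signature, not the rest of the file.

```python
# The 8 signature bytes: 0x89, "PNG", CR LF, 0x1A, LF
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])


def looks_like_png(data: bytes) -> bool:
    """Check whether the data starts with the PNG signature."""
    return data[:8] == PNG_SIGNATURE


# Simulate a transfer that converts DOS line breaks (CR LF) to Unix ones (LF)
damaged_signature = PNG_SIGNATURE.replace(b"\r\n", b"\n")

print(looks_like_png(PNG_SIGNATURE + b"...image data..."))
print(looks_like_png(damaged_signature + b"...image data..."))
```

Because the signature contains both a CR LF pair and a lone LF, a line-break conversion in either direction changes it, so the damage is noticed before the image data is interpreted.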

PNG Details
There are more filtering steps. Finally, the data is compressed using the same algorithm as ZIP.
Data stored in file (simplified): 7 2 5 0 8 1...
PNG can interlace the data, so that the image shows in low resolution when it has been transferred only partially.
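
The effect of the final compression step can be sketched with Python's `zlib` module, which implements DEFLATE, the algorithm used by ZIP. The pixel data below is a made-up stand-in (one byte per pixel, letters instead of color values), not real PNG data, and the PNG filtering steps are skipped.

```python
import zlib

# Hypothetical image: 100 rows of 6 white and 5 black pixels, one byte per pixel
row = b"W" * 6 + b"B" * 5
raw = row * 100

compressed = zlib.compress(raw)         # DEFLATE compression
restored = zlib.decompress(compressed)  # lossless: we get the same bytes back

print(len(raw), "bytes raw,", len(compressed), "bytes compressed")
print("identical after decompression:", restored == raw)
```

Repetitive images such as screenshots compress well this way; photos with a lot of fine detail compress much less.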

PNG Summary
PNG exists since 1997, can be displayed in any image software and in any browser, and is the most widely used lossless image format on the Web => very established
PNG is great for scanned photos and screenshots, but not so great for geometric shapes (use SVG).

PNG Competitors
Compared to GIF:
- PNG supports partial transparency (alpha channel)
- PNG supports 16 million colors
- PNG does not support animation
Compared to TIFF:
- PNG is more widely supported
- PNG does not support multi-page images (and many other features)
- PNG does not support the CMYK color model

CMYK
(Figure: the RGB color model is used on the screen; the CMYK color model is used in printing. Image © Mississippi State University)

Resolution
The resolution of an image is the number of pixels in each dimension, for example 1500 × 2500 pixels.
For paper, the resolution is often given in dots per inch (DPI), where 1 inch = 2.54 cm. For example: 600 pixels in 1 inch => 600 DPI

Choosing the Resolution
(Figure: a human eye looking at an image of height d from distance d.)
One eye cell can distinguish 31.5 arc seconds => about 6000 pixels in an image of height d seen from distance d.
If you stand at least as far away from the image as the image is high, the image does not need more than 6000 pixels vertically. (A higher resolution is needed for closer distances, zooming, post-processing, etc.)
The required resolution scales linearly with the ratio of image height to viewing distance:
verticalPixels = (height / distance) × 6000
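
The figure of 6000 pixels can be checked with a short calculation. This is a sketch under the slide's assumption of 31.5 arc seconds per eye cell; both function names are made up. The first function computes the angle that the image subtends at the eye, the second is the linear rule of thumb.

```python
import math

ARC_SECONDS_PER_EYE_CELL = 31.5  # figure taken from the slide


def vertical_pixels(height: float, distance: float) -> float:
    """Angle subtended by the image, divided by the angle one eye cell resolves."""
    angle_degrees = 2 * math.degrees(math.atan(height / (2 * distance)))
    return angle_degrees * 3600 / ARC_SECONDS_PER_EYE_CELL


def vertical_pixels_rule_of_thumb(height: float, distance: float) -> float:
    """Linear approximation: 6000 pixels when distance equals height."""
    return height / distance * 6000


# Image 1 m high, viewed from 1 m and from 2 m
print(round(vertical_pixels(1.0, 1.0)), vertical_pixels_rule_of_thumb(1.0, 1.0))
print(round(vertical_pixels(1.0, 2.0)), vertical_pixels_rule_of_thumb(1.0, 2.0))
```

The two functions agree closely when the viewer is at least as far away as the image is high; the rule of thumb overestimates for very close viewing distances.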

The problem with PNG
A typical smartphone picture nowadays has a resolution of 3000 × 4000 pixels. That's 20 megabytes per picture! (If you scan a photo at 600 DPI, you get 10 MB to 20 MB.)
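
The order of magnitude can be verified with simple arithmetic. The sketch below assumes 3 bytes per pixel (8 bits each for red, green, and blue), which is an assumption not stated on the slide; it computes the size before compression.

```python
width, height = 3000, 4000
bytes_per_pixel = 3  # assumption: 8 bits each for red, green, and blue

raw_bytes = width * height * bytes_per_pixel
raw_megabytes = raw_bytes / 1_000_000

print(raw_megabytes, "MB before compression")
```

Lossless compression shrinks a photo only moderately, which is consistent with a PNG in the region of 20 MB.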

JPEG
JPEG (also: JPG) is a file format for raster images that omits details that are less visible to the human eye. File extension: JPG or JPEG
(Figure: nobody distinguishes the shades in a smooth region => omit detail.)
In return, JPG adds artifacts around sharp contours (see picture).

JPEG Details
The human eye is more sensitive to light than to color => JPG stores color at a lower resolution (subsampling).
1. Split the image into chroma (color) and luma (light). (Image © Algr)
2. Store only half the resolution for chroma.
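
The idea of storing chroma at half the resolution can be sketched on a tiny made-up image. This is a simplification: the function `subsample` simply keeps every second sample in both directions, whereas real encoders may average neighboring values and offer several subsampling schemes. All values are hypothetical.

```python
def subsample(channel, factor=2):
    """Keep every `factor`-th sample in both directions."""
    return [row[::factor] for row in channel[::factor]]


# Hypothetical 4 × 4 image, already split into luma (light) and chroma (color)
luma = [
    [200, 198, 120, 119],
    [201, 197, 118, 121],
    [ 90,  92,  60,  61],
    [ 91,  89,  62,  59],
]
chroma = [
    [10, 11, 40, 41],
    [10, 12, 40, 42],
    [70, 71, 20, 21],
    [70, 72, 20, 22],
]

stored_chroma = subsample(chroma)  # 2 × 2 instead of 4 × 4
values_before = 2 * 16
values_after = 16 + sum(len(row) for row in stored_chroma)
print(stored_chroma)
print(values_before, "values before,", values_after, "values after")
```

Luma is kept at full resolution because the eye is more sensitive to it; only the color channel loses detail.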

JPEG Summary
JPG exists since 1992, it can be displayed in any image program and any browser, and it is the most common format for photos on the Web => extremely established.
Competitors are:
- HEIC (iPhones, more space-efficient)
- WebP (Google, also for space)
... but these are nowhere near as established as JPG.

Lossy File Formats
A file format is lossy if it loses information (beyond resolution). Lossy file formats can degrade in quality when
- the files are repeatedly modified and saved
- the files are transferred to a different file format
JPEG is lossy: the image degrades after repeated modification and after transfer to another format (illustrative examples).
PNG & SVG are lossless: the image stays intact after repeated modification and after transfer to another lossless file format.

Established Image Formats
- Scalable Vector Graphics (SVG): vectorized (i.e., lossless), only for geometric shapes
- Portable Network Graphics (PNG): lossless, for raster images, high space consumption (similar: TIFF, GIF)
- JPEG: lossy, for raster images, less space consumption (similar but less established: HEIC, WebP)

Types of Data: Introduction, Images, Audio, Video, Office Documents, Summary

MIDI
The Musical Instrument Digital Interface provides a file format for music that stores the notes together with the instruments. File extension: MIDI
Data stored in file (simplified): Piano: [notes]
The MIDI file just contains the name of the instrument + the notes (in an encoded format).

MIDI Summary
MIDI cannot be used to record music, because (1) it is not easy to separate the instruments in played music, and (2) MIDI cannot express variations in sound, force, distance, perfection, and volume.
MIDI can only store vectorized music (a bit like SVG for images). It is lossless. MIDI cannot store arbitrary sounds (or voice).
MIDI exists since the 1980s, it is very popular in the digital instrument community, and can be played on all major operating systems => very established

FLAC
The Free Lossless Audio Codec is a file format for digital audio. File extension: FLAC
FLAC stores a digital version of the sound wave (simplified).

Sampling Rate
The sampling rate of an audio file is the number of data points per second, measured in Hz (≈ the resolution of the audio file).
sampling rate = number of data points per second
8 kHz: Telephone, /s/ sounds like /f/
32 kHz: Camcorder, satellite radio
44.1 kHz: Audio CD (gold standard for consumers)
48 kHz: Professional digital equipment
>50 kHz: Brings no advantage to the human ear
[Wikipedia]

FLAC Summary
FLAC exists since 2001, and can be played in all major browsers => well established
FLAC is lossless (up to the chosen resolution).
BUT: FLAC files are very large (≈ 20 MB for a song of 3 minutes)
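
The size of sampled audio follows directly from the sampling rate. The sketch below computes the size before any compression, assuming audio-CD parameters (44,100 samples per second, 2 bytes per sample, 2 channels); the sample size and channel count are assumptions not given on the slides, and the function name is made up.

```python
def raw_audio_bytes(seconds, sampling_rate_hz=44_100, bytes_per_sample=2, channels=2):
    """Size of uncompressed sampled audio (assumed CD parameters by default)."""
    return seconds * sampling_rate_hz * bytes_per_sample * channels


song = raw_audio_bytes(3 * 60)
print(song / 1_000_000, "MB uncompressed for a song of 3 minutes")
```

FLAC's lossless compression reduces this figure somewhat, which fits the order of magnitude of about 20 MB per song.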

MP3
MPEG-1 Audio Layer III (or MPEG-2 Audio Layer III) is a lossy file format for audio. File extension: MP3
(Figure: data → sampled data → data stored in the MP3 file, simplified.)
MP3 discards details that are less audible to the human ear, thus saving space. MP3 uses insights from psychoacoustics to determine what to leave out, e.g., soft sounds in the presence of loud sounds.

MP3 Details
MP3 loses data on two fronts:
1. by sampling (good sampling rate: 44,100 Hz)
2. by compressing
Compression is measured in kilobits per second (kbit/s). More kbit/s => more faithful, more space consumption. Humans cannot distinguish 256 kbit/s from the original.
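
Since the bit rate is given per second, the size of an MP3 file follows from its duration. This is a minimal sketch that ignores metadata and framing overhead; the function name is made up, and 1 kbit is taken as 1000 bits.

```python
def mp3_bytes(seconds: int, kbit_per_second: int) -> int:
    """Approximate size of the audio data at a constant bit rate."""
    return seconds * kbit_per_second * 1000 // 8


song = mp3_bytes(3 * 60, 256)
print(song / 1_000_000, "MB for a song of 3 minutes at 256 kbit/s")
```

Compared with the roughly 20 MB that the lecture quotes for the same song in FLAC, this is a bit less than 30%.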

MP3 Summary
MP3 exists since 1993, can be played in all browsers and on all major operating systems, and is by far the most popular audio format => extremely established
MP3 is lossy, but file size is 20%-30% of FLAC (at 256 kbit/s).
BUT: Technicolor held a patent on MP3, and required all MP3 software producers to pay a fee => MP3 was not open. (This did not prevent people from using and implementing MP3 in practice. The patent expired in 2017.)

Open & Proprietary File Formats
From proprietary (top) to open (bottom):
1. File formats without public documentation: compression format RAR, audio format WMA
2. File formats with a documentation for a fee: ISO-standardized file formats
3. File formats with software licenses, patents, or IP rights: audio format MP3, video format HEVC, image format HEIC
4. File formats where a company claims IP rights in retrospect: image format JPG
5. Free file formats under control of a single company
6. Free file formats standardized by a consortium
   (examples for 5 and 6: Microsoft Office formats, document format PDF, audio format MIDI)
7. Free file formats developed by a community: image formats SVG & PNG, audio format FLAC

Opus
Opus (the successor of Vorbis) is a completely open lossy audio format. File extension: OGG, OGA, or OPUS (Technically, OGG is the container, and Opus is the codec.)
The project started in 2000. Nearly all browsers can play Opus. Wikipedia encourages the use of Vorbis/Opus. => reasonably established, but not as established as MP3
Opus is open and less lossy than MP3 at the same bit rate.

Established Audio File Formats
- MIDI: lossless, vectorized, practically open, but only for musical notes
- FLAC: lossless, open, but very large file sizes (compresses better than WAV) (proprietary competitors: ALAC, M4A, WMA)
- MP3: lossy, practically open today (less lossy, less open competitor: MP4+AAC) (truly open, less lossy, but less established competitor: Opus)

Types of Data: Introduction, Images, Audio, Video, Office Documents, Summary

Containers
Videos live in container formats that contain the video data, the audio data, subtitles, and/or other information. These nested formats are called codecs.
- Container format, e.g., OGG, MP4, WebM
- Audio codec, e.g., MP3, Opus
- Video codec, e.g., AVC, AV1
Usually, certain containers go mainly together with certain codecs.

MPG
MPG is a lossy, nowadays practically open video container format, together with a video codec and an audio codec. File extension: MPG
Data stored in file (simplified): I-frames store an entire picture. P-frames store the difference to the previous frame.
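
The division of labor between I-frames and P-frames can be sketched on toy data. In the sketch below, a frame is just a list of pixel values, and the function names `encode` and `decode` are made up. Unlike real MPG, the sketch is lossless and stores only one I-frame at the start.

```python
def encode(frames):
    """Store the first frame whole (I-frame) and all others as differences (P-frames)."""
    encoded = [("I", list(frames[0]))]
    for previous, current in zip(frames, frames[1:]):
        encoded.append(("P", [c - p for p, c in zip(previous, current)]))
    return encoded


def decode(encoded):
    """Rebuild every frame by adding each difference to the previous frame."""
    frames = []
    for kind, data in encoded:
        if kind == "I":
            frames.append(list(data))
        else:
            frames.append([p + d for p, d in zip(frames[-1], data)])
    return frames


# Hypothetical video: three frames of four pixels each, with little change
video = [[1, 1, 1, 1], [1, 2, 1, 1], [1, 2, 3, 1]]
stored = encode(video)
print(stored)
print(decode(stored) == video)
```

When little changes between frames, the P-frames consist mostly of zeros, which compress very well.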

MPG Details
With each P-frame, MPG can also store a motion vector. (Figure: 1. move by the motion vector, 2. add in the stored difference.)
Like JPEG, MPG uses color subsampling. It also quantizes the data, limiting each pixel to a fixed number of different values. In addition, MPG uses run-length encoding and Huffman coding.
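
Run-length encoding, one of the techniques named above, replaces a run of identical values by the value and its count. The following is a minimal, generic sketch with made-up function names; it is not the exact scheme that MPG applies to its quantized data.

```python
from itertools import groupby


def run_length_encode(values):
    """Replace each run of identical values by a (value, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand the (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


pixels = "WWWWWWBBBBB"
encoded = run_length_encode(pixels)
print(encoded)
print("".join(run_length_decode(encoded)) == pixels)
```

The encoding pays off only when the data contains long runs, which is why it is applied after quantization has made many neighboring values equal.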

MPG Summary
The audio of MPG videos is stored as MP3 or as MP2 (the predecessor of MP3). MPG is lossy, and nowadays practically open.
MPG exists since ca. 1990, and is the most widely compatible lossy audio/video format in the world => very established
MPG is used on video DVDs:
Folder VIDEO_TS: VTS_01_1.VOB, VTS_01_2.VOB, ... These contain MPG videos. The other files contain menus, etc.

Resolution of videos
Videos have 3 types of resolution:
1. Resolution of the image:
   320 × 240 (for mobile devices)
   1920 × 1080 (1080p, Full HD)
   4096 × 2160 (4K, digital cinema, iPhone)
   7680 × 4320 (UHD, 8K, maximum on YouTube)
2. Resolution in time (pictures or frames per second): usually between 24 (cinema) and 30
3. Sampling rate of the audio: as discussed before
The higher the resolution, the more space the video will occupy.

MP4+AVC+AAC
MP4 is a container format that is often used together with the lossy video codec AVC (H.264) and the lossy audio codec AAC. File extension: MP4
It improves on MPG by allowing
- different subsampling rates
- more fine-grained motion vectors
- P-frames to reference more than one other frame
=> it uses half as much space as MPG
MP4+AVC+AAC is one of the most established video formats. [caniuse.com]

MP4 and HEVC
MP4 was inspired by Apple's QuickTime movie format (MOV), and MOV can be transformed losslessly into MP4.
The successor of MP4/AVC is the highly efficient video codec HEVC. iPhones support HEVC.
(Figure: lineage of MPG, MP4, HEVC, and MOV.)
BUT: Neither HEVC nor MP4 is free! Licensors claim that a license fee has to be paid for every copy of a software that supports MP4 or HEVC. => big problem for free software! Firefox uses the implementation of the operating system.

WebM
WebM is a free container format that goes with the free lossy Opus audio codec and the free lossy AV1 video codec (successor of VP8 and VP9). Extension: WEBM
WebM is championed by the Alliance for Open Media, where Google is a driving force. [Mozilla]
At first, only Apple stuck to HEVC. Apple joined in January 2018.

Video Formats
Common video formats are usually lossy.
- (first logo, name not preserved): obsolete
- MPG (logo © Moving Picture Experts Group): most common video format, practically open (nearly equivalent: VOB on video DVDs)
- MP4: very established format, better compression than MPEG, not open (nearly equivalent: MOV)
- WebM: new, open format, not established. Better compression than MP4. (non-free competitor: HEVC)

Types of Data: Introduction, Images, Audio, Video, Office Documents, Summary
All file formats for Office documents presented here are lossless.

Plain Text Documents
A plain text document is a file that stores text without any formatting (no fonts, no text styles, no images, etc.). File extension: TXT
Data: Hello! Data stored in the file: Hello!

Plain Text Document Details
To write accents or non-Latin characters, you need to choose a character encoding. The standard nowadays is UTF-8.
On Windows, text documents open with the Notepad software:
- It does not save UTF-8 by default
- It's buggy
- It cannot deal with Unix line breaks
On Mac, text documents open with the TextEdit software:
- TextEdit is buggy
- TextEdit will do WYSIWYG with HTML
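
Why the character encoding matters can be seen in a small experiment. The sketch below encodes a made-up text with an accented character as UTF-8 and then reads the same bytes back with a different encoding (Latin-1), which is what happens when an editor guesses wrong.

```python
text = "Hello, Zoë!"

utf8_bytes = text.encode("utf-8")              # "ë" becomes two bytes in UTF-8
read_correctly = utf8_bytes.decode("utf-8")    # same encoding: text restored
read_wrongly = utf8_bytes.decode("latin-1")    # wrong encoding: garbled accent

print(len(text), "characters,", len(utf8_bytes), "bytes")
print(read_correctly)
print(read_wrongly)
```

The bytes in the file are the same in both cases; only the interpretation differs. This is why it helps to save and open text files with an explicitly chosen encoding.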

Plain Text Documents Summary
Plain text documents are the easiest, safest, most compatible, and most established way to store a text. TXT is completely open.
Caveats:
- you cannot use formatting
- watch out for the character encoding (use UTF-8)
- use Notepad++ on Windows (open-source, most used editor)

Formatted Text Documents
Formatted text documents can contain different fonts, different font styles (italic, bold, colored, etc.), and other objects such as images.
Data: Hello world! (with an image grpa.jpg)
Data stored in file: Hello <font color=blue> world</font>! <img src=grpa.jpg>
The extra information is usually sprinkled in one way or the other into the plain text. The way of annotating the text document is called a markup language. External objects are usually linked.

HTML
The Hypertext Markup Language HTML is an open file format for formatted text that is developed by the W3C. Extension: HTML
HTML file: Hello <font color=blue> world</font>! <img src=grpa.jpg>
(Figure: the file and the image grpa.jpg, displayed in a Web browser.)

HTML Caveats
Most software for writing HTML requires knowledge of HTML, is not free, is outdated, does not support all HTML features, bloats the HTML, or shows the layout slightly differently. (Figure: the same page in a browser and in LibreOffice.)

HTML Caveats
An HTML file refers to external objects (such as images, style sheets, fonts, videos, etc.). This causes problems if the file or the object is moved, renamed, or deleted.
Solution 1: Store everything in a single folder, treat it as a unit.
In one folder (bundle): index.html (Hello <font color=blue> world</font>! <img src=grpa.jpg>) and grpa.jpg
Solution 2: Encode the external object in Base64, embed it into the HTML file.
HTML file: Hello <font color=blue> world</font>! <img src=data:image/jpeg;base64,...>
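
Embedding an object in Base64 can be sketched with Python's standard `base64` module. The helper name `to_data_uri` and the placeholder bytes are made up for illustration; the bytes are not a real JPEG image.

```python
import base64


def to_data_uri(data: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode binary data in Base64 and wrap it into a data: URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes standing in for the content of grpa.jpg (not a real image)
image_bytes = b"\xff\xd8\xff\xe0placeholder"

html = f'Hello <font color="blue">world</font>! <img src="{to_data_uri(image_bytes)}">'
print(html)
```

The resulting HTML file no longer depends on a separate image file, at the price of being about one third larger than the image itself, because Base64 represents every 3 bytes by 4 characters.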

HTML Summary
HTML exists since 1992, can be displayed in any browser, and can be read and written by many word processing programs => extremely established
HTML is developed by the World Wide Web Consortium and is thus completely open.
However, there is no outstanding software support for writing HTML => often written by hand
Markdown is an easy markup language that can be compiled to HTML. It is open and currently being standardized.
HTML file: Hello <font color=blue> world</font>! <img src=data:image/jpeg;base...>

Latex
Latex is an open file format for formatted text that is very popular in academia. File extension: TEX
Data: Hello world! (with an image grpa.jpg)
Data stored in file: Hello \textcolor{blue}{world}! \includegraphics{grpa.jpg}

Latex is difficult

Latex Summary
Latex has been around since 1985 => very established (in academia). Latex is completely open.
Latex has to be compiled to PDF in order to be displayed:
TEX file: Hello \textcolor{blue}{world}! \includegraphics[width=2cm]{grpa}
=> Latex compiler (Latex layout) => PDF: Hello world!
There are tools to help with writing Latex, but one usually spends at least as much time doing the layout as writing the text (2% of a Latex document are backslashes).

PDF
The Portable Document Format PDF is a file format for formatted text that was developed by Adobe and is nowadays open. File extension: PDF
Most text processing software can produce PDF. (Figure: Microsoft Office, LibreOffice, Google Docs, the Latex compiler, and Web browsers all produce PDF.)

PDF Details
PDF is based on PostScript, a language that describes how a document shall be printed. The layout and the fonts are vectorized. This allows for very precise, scalable, and immutable layouts.
Example: /Courier 20 selectfont 72 500 moveto (Hello world!) show
(Example document © J. Lajus & F. M. Suchanek @ WWW2018)

PDF cannot be easily modified
PDF defines the layout, and the semantic structure gets lost
=> PDFs cannot be easily modified
=> one cannot always copy/paste from a PDF
Copy/paste may yield disconnected areas. Copy/paste will merge ligatures ("fi", "ff", etc.).
(Example document © J. Lajus & F. M. Suchanek @ WWW2018)

PDF Scans
Scanners can produce PDF documents from paper documents, but these are just large images, not actual characters. (Image © Martin Hosse, 1975)

PDF Summary
PDF has been around since 1993, and can be displayed by all Web browsers, as well as on all operating systems => extremely established
PDF has been standardized and is nowadays practically open.
PDF allows for precise, scalable, and immutable layout => it is perfect for sending documents and for printing
PDF cannot be modified easily
=> the document content cannot be modified
=> the text often cannot be recovered
=> the transformation to another data format may be lossy
=> PDF is the end of the line
All other file formats presented in this lecture are modifiable, even though lossy file formats suffer from the modification.

ODT
The Open Document Format is an open file format for formatted text that can be produced by what-you-see-is-what-you-get software such as LibreOffice. File extension: ODT (The Open Document Format also defines file formats for spreadsheets, presentations, etc.; see later)
Data stored in file (simplified): <xml> <text> Hello world </text> </xml>
ODT stores the formatted text in XML (similar to HTML), and then ZIPs it.
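
The principle of "XML, then ZIP" can be sketched with Python's standard `zipfile` module. The archive built below is only an illustration of the principle and not a valid ODT document: a real ODT package contains further files and more elaborate XML. The entry name `content.xml` and the XML snippet are chosen for illustration.

```python
import io
import zipfile

xml = "<text>Hello world</text>"

# Write the XML into a ZIP archive that is kept in memory
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", xml)

# Read it back, as any ZIP tool could
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    restored = archive.read("content.xml").decode("utf-8")

print(names)
print(restored)
```

For preservation this means that the text of such a document remains reachable with generic tools: a ZIP reader and an XML viewer are enough.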

ODT Summary
ODT is a free and open standard. It can be read and written by a wide range of software (including Microsoft Office). It is the mandatory standard in the NATO countries.
ODT exists since 2006 => established
Caveat: Since ODT is noncommercial, ODT software (such as LibreOffice) is sometimes perceived as not as stable, as comfortable, and as interoperable as Microsoft Office.

DOCX
Office Open XML (the successor of Microsoft Word documents DOC and the Rich Text Format RTF) is a file format for formatted text that is used by the Microsoft Word software. File extension: DOCX (Office Open XML also defines file formats for spreadsheets, presentations, etc.)
Data stored in file (simplified): <xml> <text> Hello world </text> </xml>
Office Open XML became a standard after ODT, against heated opposition by the ODT community.

DOCX Summary
DOCX exists since 2006. It is ubiquitous, and for some people the only formatted text format that they know.
DOCX can be displayed natively on Windows and on iOS, and is supported by a large range of software (some of which is free) => very established
DOCX became a free standard in 2006.
BUT: Only Microsoft products implement the full standard of DOCX. DOCX remains difficult to handle on Linux.

Google Docs
Google Docs is a Web-based word processor that Google offers as part of its free Google Drive cloud storage.
Google Docs
- is easy to use
- can be used collaboratively
- stores infinite history
- does not require software installation
- is free
BUT: you share all your documents with Google (and the NSA) -> Data Security
European open alternative: framapad.org

Google Docs Exporting
For archiving purposes, Google Docs documents have to be exported. We still have to make the choice of the file format.

Established File Formats for Text
Plain text:
- TXT: vanilla standard. Watch out with UTF-8 encoding. No formatting.
Formatted text:
- DOCX: Microsoft standard, ubiquitous
- ODT: Open competitor of DOCX
- PDF: Ubiquitous, but not modifiable ("end of the line")
For geeks:
- HTML: open, ubiquitous, but requires knowledge of markup language
- LaTex+PDF: de facto standard in academia, complicated to write

Established Formats for Spreadsheets
Plain tabular data:
- TSV: can be processed by all spreadsheet software and databases. Just cellular data, no calculations, graphs, layout.
Spreadsheet office software:
- XLSX: Microsoft standard, ubiquitous (Excel)
- ODS: Open competitor of XLSX
- PDF: Ubiquitous, but read-only ("end of the line")
For geeks:
- HTML: open, but no established standards for calculations or graphs
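
TSV (tab-separated values) files are simple enough to read and write with a few lines of code. The sketch below uses Python's standard `csv` module with a tab as delimiter; the table content is made up for illustration.

```python
import csv
import io

rows = [["format", "lossless"], ["PNG", "yes"], ["JPG", "no"]]

# Write the rows as tab-separated values (into a string instead of a file)
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
tsv_text = buffer.getvalue()

# Read them back
parsed = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(tsv_text)
print(parsed == rows)
```

The file contains nothing but the cell values, one row per line, which is why practically every spreadsheet program and database can import it.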

Established Formats for Presentations
Presentation office software:
- PPTX: Microsoft standard, ubiquitous (PowerPoint)
- ODP: Open competitor of PPTX
- PDF: Ubiquitous, but read-only ("end of the line")
For geeks:
- HTML: open, but no established standard for slides
- SVG: open, but no established standard for slides
- LaTex+PDF (Beamer): open, easy to read, difficult to write

Summary

Format              Established     Lossless        Open
Images
  SVG               Yes             Yes (vector)    Yes
  PNG               Yes             Yes             Yes
  JPG               Yes             No              Disputed
Audio
  MIDI              Yes             Yes (vector)    Yes
  FLAC              Yes             Yes             Yes
  MP3               Yes             No              Yes (now)
Video
  MPG               Yes             No              Yes
  MP4               Yes             No              No
  HEVC              No              No              No
  WebM              Not yet (?)     No              Yes
Office
  TXT               Yes             Yes             Yes
  DOCX/PPTX/XLSX    Yes             Yes             Yes, but...
  ODT/ODS/ODP       Yes             Yes             Yes
  HTML/SVG          Yes for text    Yes for text    Yes
  PDF               Yes             Read-only       Yes