File Formats for Digital Preservation Fabian M. Suchanek based on Best File Formats for Archiving
Pre-Digital Storage How old is this? Code Of Hammurabi 2
Pre-Digital Storage And this? St Cuthbert Gospel
4 Pre-Digital Storage And this? Can you still read it? US Declaration of Independence
Digital Storage And this? Can you still read it? Floppy Disk 5
6 Digital Storage And this? Can you still read it? CD-ROM
Digital Media may be irrecoverable. [Hypothetical graph, log scale: number of documents over Antiquity, Middle Ages, and Modern times, with a "Digital Dark Age" marked.] This poses problems for historians, for society (legal problems), and for the individual.
Example: The original footage of the 1969 moon landing was lost. Only low-quality copies remain. [Wikipedia / Apollo 11 missing tapes]
Other example: [Wired Magazine, 2017-01-19]
Example for the individual: [Photos: my grandfather in his twenties; me in my twenties]
Digital Preservation: There are (at least) 3 sources of obsolescence:
1) Digital media decays
2) Digital media becomes obsolete
3) File format becomes obsolete
Life expectancy of media: Estimates differ widely. [Chart: this one is by Crashplan.com.] Orthogonal dangers: hazards (fire, theft, loss, mishandling); media no longer supported. Opinion of the Digital Preservation Workshop: "Media technology changes so rapidly that high-longevity media is likely to be threatened by obsolescence before its useful life is over."
Digital media becoming obsolete. Example: BestBuy stops selling CDs. Today: optical storage, USB keys, cloud services. The best solution appears to be to always copy the data from the old medium to the new one. [DB Workshop] See also: Museum of Obsolete Media
File Formats can become obsolete: "The [British] National Archives, which holds 900 years of written material, has more than 580 terabytes of data (the equivalent of 580,000 encyclopaedias) in older file formats that are no longer commercially available." [BBC: Warning of data ticking time bomb, 2007-07-03]
Established File Formats: A file format is considered established if
1) it has been around for a sufficiently long time,
2) it is supported by several vendors (and not just by a single company),
3) it is platform-independent (works on Windows, Mac, Linux, mobile).
Examples: MP3 for audio (since 1993), JPG for images (since 1992), PDF for documents (since 1993)
Example: Flash. Flash is a software suite by Adobe for the production of animations, browser games, rich Internet applications, desktop applications, mobile applications, and mobile games. It consists of
FLA: the main file format of Flash projects
SWF (Shockwave Flash): a file format for multimedia and action scripts
FLV: the main file format for Flash videos
Flash has been around since 2000, can be played in most desktop browsers, and is thus platform-independent => very established. 80% of Web users interacted with Flash at least once a day in 2014. [Chromium.org]
BUT: Flash was abandoned. Flash has security problems, and was superseded by HTML5 capabilities. [Adobe News]
Not everybody has noticed... You may also still have Flash videos on your computer! FLV is a container format, so you might be able to recover the content losslessly. 20
Established & not abandoned The rest of this lecture is concerned with file formats (1) that are established and (2) that show no signs of abandonment. 21
File Formats A file format is a standard way that information is encoded for storage in a computer file [Wikipedia]. Data Data as a sequence of bytes stored in a file: Fileformat defines the translation ñé*0nvéyo9rqann$=m o lqe é>jçc!éˆhüxau6kndtp2k iépbj1ïeûkwwmròok ñ7f... 22
File Extension: The file extension is the part of the file name behind the last dot. It identifies the file format. Data as a sequence of bytes stored in a file with file extension JPG: ñé*0nvéyo9rqann$=m o lqe é>jçc!éˆhüxau6kndtp2k iépbj1ïeûkwwmròok ñ7f...
Text documents: DOCX, ODT, ...
Images: JPG, PNG, ...
Audio: MP3, OGG, ...
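The rule "the part of the file name behind the last dot" can be sketched in a few lines of Python. The helper name file_extension and the example file names are made up for illustration; note that the extension is only a hint given by the file name, not a guarantee about the bytes inside the file.

```python
from pathlib import PurePath


def file_extension(file_name: str) -> str:
    """Return the part of the file name behind the last dot, in upper case."""
    suffix = PurePath(file_name).suffix  # ".jpg" for "holiday.jpg", "" if there is no dot
    return suffix[1:].upper()


print(file_extension("holiday.jpg"))         # JPG
print(file_extension("backup.2019.tar.gz"))  # GZ (only the part behind the LAST dot)
print(file_extension("README") == "")        # True: no dot, no extension
```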
Types of Data Introduction Images Audio Video Office Documents Summary 24
SVG: Scalable Vector Graphics is a file format intended for vector images (= images that consist of simple geometric shapes). File extension: SVG. Data stored in file (simplified): <circle cx=30 cy=30 r=10 stroke=blue /> <line x1=15 x2=15... SVG describes the shapes in XML (a human-readable format).
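Since SVG is plain XML, a complete (non-simplified) file can be produced and read back with standard tools. The following is a minimal sketch in Python; the sizes and coordinates are arbitrary illustration values. Unlike the simplified slide version, real SVG needs quoted attribute values, a namespace declaration, and cx/cy for the center of a circle.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_svg() -> str:
    """Build a small SVG document with one circle and one line."""
    return (
        f'<svg xmlns="{SVG_NS}" width="60" height="60">\n'
        '  <circle cx="30" cy="30" r="10" stroke="blue" fill="none" />\n'
        '  <line x1="15" y1="5" x2="15" y2="55" stroke="black" />\n'
        "</svg>\n"
    )


svg_text = make_svg()

# Any XML parser can read the shapes back: the format is human- and machine-readable.
root = ET.fromstring(svg_text)
shapes = [child.tag.split("}")[1] for child in root]  # strip the "{namespace}" prefix
print(shapes)  # ['circle', 'line']
```

Saving svg_text under a name ending in .svg should give a file that a Web browser can display.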
SVG is for geometric shapes SVG has been around since 1999, and can be displayed in all browsers => very established SVG is great for geometric shapes, but NOT for more complex images. Data stored in file File format: SVG <man look=left nose=big /> <tie style=old color=... This does NOT WORK! 26
PNG: Portable Network Graphics is a file format intended for raster images (= images that consist of pixels). File extension: PNG. Data stored in file (simplified): WWWWWWBBBBB WWWBBBBBBBB... The file stores the color of every pixel. The data is then compressed.
PNG Details: PNG files start with 0x89 PNG 0x0D 0x0A 0x1A 0x0A, i.e., if a DOS line break (CRLF) were transformed into a Linux line break (LF) or vice versa, we would notice. PNG files can define their colors in a palette. Palette: 0 = darkgray, 1 = lightgray, 2 = light brown... Data stored in file (simplified): 22222220000011111111... There are also standard palettes (most notably red/green/blue).
PNG Details: There are more filtering steps. Finally, the data is compressed using the same algorithm as ZIP (DEFLATE). Data stored in file (simplified): 7 2 5 0 8 1... PNG can interlace the data, so that the image shows in low resolution when it has been transferred only partially.
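To make the PNG details concrete, here is a hedged sketch that writes a tiny grayscale PNG by hand, using only Python's standard library. It uses no palette, no interlacing, and the trivial filter type 0, so it shows only the signature, the chunk layout, and the ZIP-style (DEFLATE) compression. The function names and the pixel values are chosen for illustration.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # 0x89 "PNG" 0x0D 0x0A 0x1A 0x0A


def chunk(kind: bytes, data: bytes) -> bytes:
    """One PNG chunk: length, type, data, and a CRC-32 over type and data."""
    crc = zlib.crc32(kind + data)
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def grayscale_png(rows: list[list[int]]) -> bytes:
    """Encode rows of 8-bit gray values (0 = black, 255 = white) as a PNG file."""
    height, width = len(rows), len(rows[0])
    # IHDR: width, height, bit depth 8, color type 0 (grayscale),
    # compression method 0, filter method 0, no interlacing.
    header = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each row is prefixed with a filter-type byte; 0 means "no filter".
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    return (
        PNG_SIGNATURE
        + chunk(b"IHDR", header)
        + chunk(b"IDAT", zlib.compress(raw))  # same algorithm as ZIP
        + chunk(b"IEND", b"")
    )


W, B = 255, 0
png_bytes = grayscale_png([[W, W, W, B, B], [W, B, B, B, B]])
print(png_bytes[:8] == PNG_SIGNATURE)  # True
```

Writing png_bytes to a file with the extension PNG should yield a 5 × 2 pixel image that ordinary image software can open.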
PNG Summary: PNG exists since 1997, can be displayed in any image software and in any browser, and is the most widely used lossless image format on the Web => very established. PNG is great for scanned photos and screenshots, but not so great for geometric shapes (use SVG).
PNG Competitors
Compared to GIF: PNG supports transparency; PNG supports 16 million colors; PNG does not support animation.
Compared to TIFF: PNG is more widely supported; PNG does not support multi-page (and many other features); PNG does not support the CMYK color model.
CMYK: [Figure: the color model used on the screen (red/green/blue) next to the color model used in printing (CMYK). © Mississippi State University]
Resolution: The resolution of an image is the number of pixels in each dimension (in the figure: 1500 pixels by 2500 pixels). For paper, the resolution is often given in dots per inch (DPI), where 1 inch = 2.54 cm. For example: 600 pixels in 1 inch => 600 DPI
Choosing the Resolution: [Figure: a human eye looking at an image of height d from distance d.] One eye cell can distinguish 31.5 arc seconds => 6000 pixels in an image of height d. If you stand at least as far away from the image as the image is high, the image does not need more than 6000 pixels vertically. (A higher resolution is needed for closer distances, zooming, post-processing, etc.) The required resolution scales linearly with the ratio of the image height to the viewing distance: $\text{vertical pixels} = \frac{\text{height}}{\text{distance}} \times 6000$
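The rule of thumb can be checked numerically. The sketch below compares the slide's formula with a direct geometric estimate (viewing angle of the image divided by 31.5 arc seconds). Both function names are made up for illustration, and the geometric variant is one possible reading of the figure, not necessarily the original derivation.

```python
import math

ARCSEC_PER_EYE_CELL = 31.5  # value taken from the slide


def vertical_pixels_rule_of_thumb(height: float, distance: float) -> int:
    """Slide formula: height / distance * 6000 (both lengths in the same unit)."""
    return round(height / distance * 6000)


def vertical_pixels_from_angle(height: float, distance: float) -> int:
    """Angle covered by the image, divided by the angle one eye cell can distinguish."""
    angle_degrees = 2 * math.degrees(math.atan(height / (2 * distance)))
    return round(angle_degrees * 3600 / ARCSEC_PER_EYE_CELL)


# Image viewed from a distance equal to its height: both estimates are near 6000.
print(vertical_pixels_rule_of_thumb(1.0, 1.0), vertical_pixels_from_angle(1.0, 1.0))

# A 20 cm high photo viewed from 40 cm needs only about half as many pixels.
print(vertical_pixels_rule_of_thumb(0.2, 0.4))
```

For large distances the two estimates agree closely; for very close viewing the linear formula overestimates, which errs on the safe side.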
The problem with PNG: A typical smartphone picture nowadays has a resolution of 3000 × 4000 pixels. That's 20 megabytes per picture! (If you scan a photo at 600 DPI, you get 10 MB to 20 MB.)
JPEG JPEG (also: JPG) is a file format for raster images that omits details that are less visible to the human eye. File extension: JPG or JPEG Nobody distinguishes the shades here => omit detail In return, JPG adds artifacts around sharp contours see picture >details 36
JPEG Details: The human eye is more sensitive to light than to color => JPG stores color at a lower resolution (subsampling). 1. Split the image into chroma (color) and luma (light). 2. Store only half the resolution for chroma. [Image © Algr]
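The two steps can be sketched in plain Python. The conversion below uses the common JFIF-style YCbCr formulas, and the subsampling averages each 2 × 2 block of chroma values; both are assumptions about one typical JPEG configuration, since real encoders offer several subsampling modes and apply further lossy steps afterwards. The function names are made up for illustration.

```python
def rgb_to_ycbcr(r: int, g: int, b: int) -> tuple[float, float, float]:
    """Split one RGB pixel into luma (Y) and two chroma values (Cb, Cr)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def halve_resolution(plane: list[list[float]]) -> list[list[float]]:
    """Average each 2x2 block (assumes even width and height)."""
    return [
        [
            (plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, len(plane[0]), 2)
        ]
        for y in range(0, len(plane), 2)
    ]


def subsample(image: list[list[tuple[int, int, int]]]):
    """Keep luma at full resolution, store chroma at half resolution."""
    converted = [[rgb_to_ycbcr(*pixel) for pixel in row] for row in image]
    luma = [[p[0] for p in row] for row in converted]
    cb = halve_resolution([[p[1] for p in row] for row in converted])
    cr = halve_resolution([[p[2] for p in row] for row in converted])
    return luma, cb, cr


image = [[(200, 30, 30)] * 4 for _ in range(4)]  # a 4x4 reddish test image
luma, cb, cr = subsample(image)
stored = sum(len(row) for plane in (luma, cb, cr) for row in plane)
print(stored, "values instead of", 4 * 4 * 3)  # 24 values instead of 48
```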
JPEG Summary: JPG exists since 1992, it can be displayed in any image program and any browser, and it is the most common format for photos on the Web => extremely established. Competitors are HEIC (iPhones, more space-efficient) and WebP (Google, also for space), but these are nowhere near as established as JPG.
Lossy File Formats: A file format is lossy if it loses information (beyond resolution). Lossy file formats can degrade in quality when the files are repeatedly modified and saved, or when the files are transferred to a different file format.
JPEG is lossy: after repeated modification (illustrative example); after transfer to another format (illustrative example).
PNG & SVG are lossless: after repeated modification; after transfer to another lossless file format.
Established Image Formats Scalable Vector Graphics (SVG) Vectorized (i.e., lossless), only for geometric shapes Portable Network Graphics (PNG) Lossless, for raster images, high space consumption (similar: TIFF, GIF) JPEG Lossy, for raster images, less space consumption (similar but less established: HEIC, WebP) 40
Types of Data Introduction Images Audio Video Office Documents Summary >MIDI 41
MIDI: The Musical Instrument Digital Interface provides a file format for music that stores the notes together with the instruments. File extension: MIDI. Data stored in file (simplified): Piano: a. The MIDI file just contains the name of the instrument + the notes (in an encoded format).
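What "instrument + notes in an encoded format" looks like can be sketched by writing a one-note MIDI file by hand. This is a minimal illustration, assuming a format-0 Standard MIDI File with 480 ticks per quarter note, General MIDI program 0 (acoustic grand piano), and note number 69 (the note a); the function names are made up.

```python
import struct

TICKS_PER_QUARTER = 480  # time resolution chosen for this example


def variable_length(value: int) -> bytes:
    """Encode a delta time as a MIDI variable-length quantity (7 bits per byte)."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)  # continuation bit on all but the last byte
        value >>= 7
    return bytes(reversed(out))


def one_note_midi(note: int = 69, program: int = 0) -> bytes:
    """A MIDI file that selects an instrument and plays a single quarter note."""
    events = (
        variable_length(0) + bytes([0xC0, program])                     # choose the instrument
        + variable_length(0) + bytes([0x90, note, 64])                  # note on, velocity 64
        + variable_length(TICKS_PER_QUARTER) + bytes([0x80, note, 64])  # note off one quarter later
        + variable_length(0) + bytes([0xFF, 0x2F, 0x00])                # end of track
    )
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, TICKS_PER_QUARTER)  # format 0, one track
    track = b"MTrk" + struct.pack(">I", len(events)) + events
    return header + track


midi_bytes = one_note_midi()
print(len(midi_bytes), "bytes")  # 38 bytes
```

The whole "piece" takes 38 bytes, which illustrates why MIDI is compared to vector graphics: it stores a description of the music, not the sound wave.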
MIDI Summary: MIDI cannot be used to record music, because (1) it is not easy to separate the instruments in played music, and (2) MIDI cannot express variations in sound, force, distance, perfection, and volume. MIDI can only store vectorized music (a bit like SVG for images). It is lossless. MIDI cannot store arbitrary sounds (or voice). MIDI exists since the 1980s, it is very popular in the digital instrument community, and it can be played on all major operating systems => very established
FLAC The Free Lossless Audio Codec is a file format for digital audio. File extension: FLAC Data Data stored in file File format: FLAC (simplified) FLAC stores a digital version of the sound wave. 44
Sampling Rate: The sampling rate of an audio file is the number of datapoints per second, measured in Hz (≈ the resolution of the audio file).
8 kHz: telephone, /s/ sounds like /f/
32 kHz: camcorder, satellite radio
44.1 kHz: audio CD (gold standard for consumers)
48 kHz: professional digital equipment
>50 kHz: brings no advantage to the human ear
[Wikipedia]
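A small sketch of what sampling means in practice: the same one-second tone, stored at telephone quality and at CD quality. The 440 Hz test tone and the function name sample_sine are arbitrary choices for illustration.

```python
import math


def sample_sine(frequency_hz: float, sampling_rate_hz: int, seconds: float) -> list[float]:
    """Measure a sine wave sampling_rate_hz times per second."""
    count = int(sampling_rate_hz * seconds)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sampling_rate_hz)
        for n in range(count)
    ]


telephone = sample_sine(440, 8_000, 1.0)
audio_cd = sample_sine(440, 44_100, 1.0)
print(len(telephone), len(audio_cd))  # 8000 44100
```

More datapoints per second describe the wave more finely, at the cost of proportionally more storage.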
FLAC Summary: FLAC exists since 2001, and can be played in all major browsers => well established. FLAC is lossless (up to the chosen resolution). BUT: FLAC files are very large (about 20 MB for a song of 3 minutes)
MP3: MPEG-1 Audio Layer III (or MPEG-2 Audio Layer III) is a lossy file format for audio. File extension: MP3. [Figure, simplified: data => sampled data => MP3 data stored in file.] MP3 discards details that are less audible to the human ear, thus saving space. MP3 uses insights from psychoacoustics to determine what to leave out, e.g., soft sounds in the presence of loud sounds.
MP3 Details: MP3 loses data on two fronts: 1. by sampling (good sampling rate: 44,100 Hz), 2. by compressing. [Figure, simplified: data => sampled data => MP3 data stored in file.] Compression is measured in kilobits per second (kbit/s). More kbit/s => more truthful, more space consumption. Humans cannot distinguish 256 kbit/s from the original.
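The relation between bit rate and file size is simple arithmetic, sketched below for a song of 3 minutes. The uncompressed figure assumes CD-style audio (2 channels, 2 bytes per sample), which is an assumption for this example and not stated on the slides; the function names are made up.

```python
def mp3_size_megabytes(bitrate_kbit_s: float, seconds: float) -> float:
    """File size of a constant-bit-rate MP3 (ignoring headers and tags)."""
    return bitrate_kbit_s * 1000 * seconds / 8 / 1_000_000


def uncompressed_size_megabytes(
    sampling_rate_hz: int, seconds: float, channels: int = 2, bytes_per_sample: int = 2
) -> float:
    """Size of the raw sampled data (assumed: stereo, 16 bits per sample)."""
    return sampling_rate_hz * channels * bytes_per_sample * seconds / 1_000_000


song_seconds = 3 * 60
print(round(mp3_size_megabytes(256, song_seconds), 2))              # 5.76
print(round(uncompressed_size_megabytes(44_100, song_seconds), 2))  # 31.75
```

Under these assumptions, a 256 kbit/s MP3 of a 3-minute song takes roughly 6 MB, compared to roughly 32 MB of raw samples.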
MP3 Summary MP3 exists since 1993, can be played in all browsers, on all major operating systems, is by far the most popular audio format => extremely established MP3 is lossy, but file size is 20%-30% of FLAC (at 256 kbit/s). BUT: Technicolor held a patent on MP3, and required all MP3 software producers to pay a fee => MP3 was not open. (This did not prevent people from using and implementing MP3 in practice. The patent expired in 2017.) 49
Open & Proprietary File Formats (ordered from proprietary to open)
1. File formats without public documentation: compression format RAR, audio format WMA
2. File formats with documentation for a fee: ISO-standardized file formats
3. File formats with software licenses, patents, or IP rights: audio format MP3, video format HEVC, image format HEIC
4. File formats where a company claims IP rights in retrospect: image format JPG
5. Free file formats under control of a single company
6. Free file formats standardized by a consortium: Microsoft Office formats, document format PDF, audio format MIDI
7. Free file formats developed by a community: image formats SVG & PNG, audio format FLAC
Opus Opus (the successor of Vorbis) is a completely open lossy audio format. File extension: OGG, OGA, or OPUS (Technically, OGG is the container, and Opus is the codec.) Data Data stored in file File format: Opus (simplified) The project started in 2000. Nearly all browsers can play Opus. Wikipedia encourages the use of Vorbis/Opus. => reasonably established, but not as established as MP3 Opus is open and less lossy than MP3 at the same bit rate. 51
Established Audio File Formats MIDI: lossless, vectorized, practically open, but only for musical notes FLAC: lossless, open but very large file sizes (compresses better than WAV) (proprietary competitors: ALAC, M4A, WMA) MP3: lossy, practically open today (less lossy, less open competitor: MP4+AAC) (truly open, less lossy, but less established competitor: Opus) 52
Types of Data Introduction Images Audio Video Office Documents Summary 53
Containers Videos live in container formats that contain the video data, the audio data, subtitles and/or other information. These nested formats are called codecs. Audio codec, e.g. MP3, Opus Container format, e.g., OGG, MP4, WebM Video codec, e.g. AVC, AV1 Usually, certain containers go mainly together with certain codecs. 54
MPG MPG is a lossy, nowadays practically open video container format, together with a video codec, and an audio codec. File extension: MPG Data stored in file File format: MPG (simplified) I-frames store an entire picture. P-frames store the difference to the previous frame. >details 55
MPG Details: With each P-frame, MPG can also store a motion vector. [Figure: 1. move the previous frame by the motion vector, 2. add in the difference.] Like JPEG, MPG uses color subsampling. It also quantizes the data, limiting each pixel to a fixed number of different values. In addition, MPG uses run-length encoding and Huffman coding.
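The idea behind I-frames, P-frames, and run-length encoding can be sketched on a toy "video" whose frames are single rows of pixel values. This is a strong simplification for illustration only: it has no motion vectors, no quantization, and no Huffman coding, and real MPG works on blocks of transformed image data. All names are made up.

```python
def frame_difference(previous: list[int], current: list[int]) -> list[int]:
    """P-frame content: what changed with respect to the previous frame."""
    return [c - p for p, c in zip(previous, current)]


def apply_difference(previous: list[int], difference: list[int]) -> list[int]:
    """Rebuild a frame from the previous frame and the stored difference."""
    return [p + d for p, d in zip(previous, difference)]


def run_length_encode(values: list[int]) -> list[tuple[int, int]]:
    """Replace runs of equal values by (value, run length) pairs."""
    runs: list[list[int]] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, length) for value, length in runs]


i_frame = [10, 10, 10, 50, 50, 10, 10, 10]     # stored entirely
next_frame = [10, 10, 10, 50, 60, 10, 10, 10]  # only one pixel changed

p_frame = frame_difference(i_frame, next_frame)
print(p_frame)                     # [0, 0, 0, 0, 10, 0, 0, 0]
print(run_length_encode(p_frame))  # [(0, 4), (10, 1), (0, 3)]
print(apply_difference(i_frame, p_frame) == next_frame)  # True
```

Because consecutive frames are usually similar, the difference consists mostly of zeros, which is exactly the kind of data that run-length encoding compresses well.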
MPG Summary: The audio of MPG videos is stored as MP3 or as MP2 (the predecessor of MP3). MPG is lossy, and nowadays practically open. MPG exists since ca. 1990, and is the most widely compatible lossy audio/video format in the world => very established. MPG is used on Video DVDs: the folder VIDEO_TS contains the files VTS_01_1.VOB, VTS_01_2.VOB, ... These contain MPG videos. The other files contain menus, etc.
Resolution of videos: Videos have 3 types of resolutions:
1) Resolution of the image: 320 × 240 (for mobile devices), 1920 × 1080 (1080p Full HD), 4096 × 2160 (4K digital cinema, iPhone), 7680 × 4320 (HD, 8K, maximum on YouTube)
2) Resolution in time (pictures or frames per second): usually between 24 (cinema) and 30
3) Sampling rate of the audio: as discussed before
The higher the resolution, the more space the video will occupy.
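How quickly the space grows can be estimated with a short calculation. The sketch below computes the size of the uncompressed images only, assuming 3 bytes per pixel (one byte each for red, green, and blue), which is an assumption for illustration; the function name is made up.

```python
def raw_video_gigabytes(
    width: int, height: int, frames_per_second: float, seconds: float, bytes_per_pixel: int = 3
) -> float:
    """Size of the uncompressed image data of a video, without audio."""
    return width * height * bytes_per_pixel * frames_per_second * seconds / 1_000_000_000


one_minute = 60
for name, (width, height) in {
    "mobile": (320, 240),
    "Full HD": (1920, 1080),
    "8K": (7680, 4320),
}.items():
    size = raw_video_gigabytes(width, height, 30, one_minute)
    print(f"{name}: {size:.2f} GB per minute")
```

Under these assumptions, one minute of uncompressed Full HD video at 30 frames per second already takes about 11 GB, which is why practically all video formats are lossy.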
MP4+AVC+AAC: MP4 is a container format that is often used together with the lossy video codec AVC (H.264) and the lossy audio codec AAC. File extension: MP4. It improves on MPG by allowing different subsampling rates, more fine-grained motion vectors, and P-frames that reference more than one other frame => it uses half as much space as MPG. MP4+AVC+AAC is one of the most established video formats. [caniuse.com]
MP4 and HEVC: MP4 was inspired by Apple's QuickTime movie format (MOV), and MOV can be transformed losslessly into MP4. The successor of MP4/AVC is the highly efficient video codec HEVC. iPhones support HEVC. [Figure: relationships between MPG, MP4, HEVC, and MOV.] BUT: Neither HEVC nor MP4 is free! Licensors claim that a license fee has to be paid for every copy of a software that supports MP4 or HEVC. => big problem for free software! Firefox uses the implementation of the operating system.
WebM: WebM is a free container format that goes with the free lossy Opus audio codec and the free lossy AV1 video codec (successor of VP8 and VP9). Extension: WEBM. WebM is championed by the Alliance for Open Media, where Google is a driving force. [Logo: Mozilla] Only Apple stuck to HEVC.
Update: Apple joined the Alliance for Open Media in January 2018.
Video Formats: All common video formats are usually lossy. [Figure: format logos, one of them labeled "obsolete"; © Moving Picture Expert Group]
MPG: most common video format, practically open (nearly equivalent: VOB on video DVDs)
MP4: very established format, better compression than MPEG, not open (nearly equivalent: MOV)
WebM: new, open format, not established. Better compression than MP4. (non-free competitor: HEVC)
Types of Data Introduction Images Audio Video Office Documents Summary All file formats for Office documents presented here are lossless. 64
Plain Text Documents A plain text document is a file that stores text without any formatting (no fonts, no text styles, no images, etc.). File extension: TXT. Data Hello! Data stored in a file: Hello! File format: TXT >caveats 65
Plain Text Document Details: To write accents or non-Latin characters, you need to choose a character encoding. The standard nowadays is UTF-8.
On Windows, text documents open with the Notepad software: it does not save UTF-8 by default, it's buggy, and it cannot deal with Unix line breaks.
On Mac, text documents open with the TextEdit software: TextEdit is buggy, and TextEdit will do WYSIWYG with HTML.
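Why the character encoding matters can be shown with a single accented word. The sketch below stores the same text in UTF-8 and in Latin-1 (chosen here as an example of an older encoding), and then reads the UTF-8 bytes with the wrong encoding.

```python
text = "café"

utf8_bytes = text.encode("utf-8")      # the accent takes two bytes
latin1_bytes = text.encode("latin-1")  # the accent takes one byte

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Reading UTF-8 bytes with the wrong encoding garbles the accent.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©

# Reading them with the right encoding recovers the text.
print(utf8_bytes.decode("utf-8") == text)  # True
```

The bytes on disk do not say which encoding was used, so for archiving it helps to stick to one encoding (UTF-8) and to state it explicitly when opening files.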
Plain Text Documents Summary: Plain text documents are the easiest, safest, most compatible, and most established way to store a text. TXT is completely open. Caveats: you cannot use formatting; watch out for the character encoding (use UTF-8); use Notepad++ on Windows (open-source, most used editor).
Formatted Text Documents Formatted text documents can contain different fonts, different font styles (italic, bold, colored, etc.), and other objects such as images. Data Hello world! Data stored in file: Hello <font color=blue> world</font>! <img src=grpa.jpg> grpa.jpg The extra information is usually sprinkled in one way or the other into the plain text. The way of annotating the text document is called a markup language. External objects are usually linked. >HTML&LaTex 68
HTML The Hypertext Markup Language HTML is an open file format for formatted text that is developed by the W3C. Extension: HTML HTML-file: Hello <font color=blue> world</font>! <img src=grpa.jpg> grpa.jpg displayed in Web browser >details 69
HTML Caveats Most software for writing HTML requires knowledge of HTML, is not free, is outdated, does not support all HTML features, bloats the HTML, or shows the layout slightly differently. in a browser in LibreOffice >details 70
HTML Caveats: An HTML file refers to external objects (such as images, style sheets, fonts, videos, etc.). This causes problems if the file or the object is moved, renamed, or deleted. Solution 1: Store everything in a single folder, and treat it as a unit (bundle). Example: index.html (Hello <font color=blue> world</font>! <img src=grpa.jpg>) and grpa.jpg in one folder.
HTML Caveats, Solution 2: Encode the external object in Base64, and embed it into the HTML file. HTML-file: Hello <font color=blue> world</font>! <img src=data:image/jpeg;base... where grpa.jpg becomes data:image/jpeg;base64,4aaqsk ZJRgABAQAASABIAjv69sej2IB18HSW..
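Embedding an external object as Base64 can be sketched with Python's standard library. The helper names data_uri and embed_image are made up, the image bytes are a placeholder (only the first bytes of a JPEG file), and the simple string replacement assumes that the HTML uses a quoted src="..." attribute.

```python
import base64


def data_uri(content: bytes, mime_type: str) -> str:
    """Encode the bytes of a file as a data: URI."""
    encoded = base64.b64encode(content).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def embed_image(html: str, file_name: str, content: bytes, mime_type: str) -> str:
    """Replace the link to an external image by the embedded image data."""
    return html.replace(f'src="{file_name}"', f'src="{data_uri(content, mime_type)}"')


html = 'Hello <font color="blue">world</font>! <img src="grpa.jpg">'
fake_jpeg = b"\xff\xd8\xff"  # placeholder: real code would read the bytes of grpa.jpg

print(embed_image(html, "grpa.jpg", fake_jpeg, "image/jpeg"))
```

The resulting HTML file no longer depends on grpa.jpg, at the price of being about one third larger than the image itself, since Base64 encodes 3 bytes as 4 characters.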
HTML Summary: HTML exists since 1992, can be displayed in any browser, and can be read and written by many word processing programs => extremely established. HTML is developed by the World Wide Web Consortium and is thus completely open. However, there is no outstanding software support for writing HTML => it is often written by hand. Markdown is an easy markup language that can be compiled to HTML. It is open and currently being standardized. HTML-file: Hello <font color=blue> world</font>! <img src=data:image/jpeg;base...
Latex: Latex is an open file format for formatted text that is very popular in academia. File extension: TEX. Data stored in file: Hello \textcolor{blue}{world}! \includegraphics{grpa.jpg}
Latex is difficult see 7 other answers 75
Latex Summary: Latex has been around since 1985 => very established (in academia). Latex is completely open. Latex has to be compiled to PDF in order to be displayed: TEX file (Hello \textcolor{blue}{world}! \includegraphics[width=2cm]{grpa}) => Latex compiler (Latex layout) => PDF (Hello world!). There are tools to help with writing Latex, but one usually spends at least as much time doing the layout as writing the text (2% of a Latex document are backslashes).
PDF: The Portable Document Format PDF is a file format for formatted documents, developed by Adobe, and nowadays open. File extension: PDF. Most text processing software can produce PDF: Microsoft Office, LibreOffice, Google Docs, Web browsers, and the Latex compiler.
PDF Details PDF is based on PostScript, a language that describes how a document shall be printed. The layout and the fonts are vectorized. This allows for very precise, scalable, and immutable layouts. PDF c J. Lajus & F. M. Suchanek @ WWW2018 /Courier 20 selectfont 72 500 moveto (Hello world!) show 78
PDF cannot be easily modified PDF defines the layout, and the semantic structure gets lost => PDFs cannot be easily modified => one cannot always copy/paste from a PDF Copy/paste may yield disconnected areas. Copy/paste will merge ligatures ( fi, ff, etc.). c J. Lajus & F. M. Suchanek @ WWW2018 79
PDF Scans Scanners can produce PDF documents from paper documents, but these are just large images, not actual characters. c Martin Hosse 1975 80
PDF Summary PDF has been around since 1993, and can be displayed by all Web browsers, as well as on all operating systems => extremely established PDF has been standardized and is nowadays practically open. PDF allows for precise, scalable, and immutable layout => it is perfect for sending documents and for printing PDF cannot be modified easily => the document content cannot be modified => the text often cannot be recovered => the transformation to another data format may be lossy => PDF is the end of the line All other file formats presented in this lecture are modifiable, even though lossy file formats suffer from the modification. 81
ODT: The Open Document Format is an open file format for formatted text that can be produced by what-you-see-is-what-you-get software (e.g., LibreOffice). File extension: ODT. (The Open Document Format also defines file formats for spreadsheets, presentations, etc.; see later.) Data stored in file (simplified): <xml> <text> Hello world </text> </xml> ODT stores the formatted text in XML (similar to HTML), and then ZIPs it.
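The principle "XML, then ZIP" can be sketched with Python's standard library. The package built below is NOT a valid ODT file: a real ODT contains further entries (such as a manifest) and uses XML namespaces. The sketch only borrows the name content.xml, under which an ODT stores the main document, and the function names are made up.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def pack(xml_text: str) -> bytes:
    """Store an XML document as content.xml inside a ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("content.xml", xml_text)
    return buffer.getvalue()


def unpack_text(package: bytes) -> str:
    """Unzip the package and extract the plain text from the XML."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return "".join(root.itertext()).strip()


package = pack("<document><text>Hello world</text></document>")
print(package[:2])           # b'PK' (every ZIP archive starts like this)
print(unpack_text(package))  # Hello world
```

For preservation, this means that even without office software, the text of such a document can be recovered with any unzip tool and any XML parser.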
ODT Summary ODT is a free and open standard. It can be read and written by a wide range of software (including Microsoft Office). It is the mandatory standard in the NATO countries. ODT exists since 2006 => established Caveat: Since ODT is noncommercial, ODT software is sometimes perceived as not as stable, as comfortable, and as interoperable as Microsoft Office. LibreOffice 83
DOCX: Office Open XML (the successor of Microsoft Word documents DOC and the Rich Text Format RTF) is a file format for formatted text that is used by the Microsoft Word software. File extension: DOCX. (Office Open XML also defines file formats for spreadsheets, presentations, etc.) Data stored in file (simplified): <xml> <text> Hello world </text> </xml> Office Open XML became a standard after ODT, against heated opposition by the ODT community.
DOCX Summary: DOCX exists since 2006. It is ubiquitous, and for some people the only formatted text format that they know. DOCX can be displayed natively on Windows and on iOS, and is supported by a large range of software (some of which is free) => very established. DOCX became a free standard in 2006. BUT: Only Microsoft products implement the full standard of DOCX. DOCX remains difficult to handle on Linux.
Google Docs: Google Docs is a Web-based word processor that Google offers as part of its free Google Drive cloud storage. Google Docs is easy to use, can be used collaboratively, stores infinite history, does not require software installation, and is free. BUT: you share all your documents with Google (and the NSA) -> Data Security. European open alternative: framapad.org
Google Docs Exporting For archiving purposes, Google Doc documents have to be exported. We still have to make the choice of the file format 87
Established File Formats for Text
Plain text: TXT: vanilla standard. Watch out for the character encoding (use UTF-8). No formatting.
Formatted text: DOCX: Microsoft standard, ubiquitous. ODT: open competitor of DOCX. PDF: ubiquitous, but not modifiable ("end of the line").
For geeks: HTML: open, ubiquitous, but requires knowledge of the markup language. LaTex+PDF: de facto standard in academia, complicated to write.
Established Formats for Spreadsheets
Plain tabular data: TSV: can be processed by all spreadsheet software and databases. Just cellular data, no calculations, graphs, or layout.
Spreadsheet office software: XLSX: Microsoft standard, ubiquitous (Excel). ODS: open competitor of XLSX. PDF: ubiquitous, but read-only ("end of the line").
For geeks: HTML: open, but no established standards for calculations or graphs.
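TSV is simple enough to be written and read with a few lines of code. The sketch below uses Python's csv module with a tab as the delimiter; the function names and the example rows are made up for illustration.

```python
import csv
import io


def to_tsv(rows: list[list[str]]) -> str:
    """Write rows as tab-separated values, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def from_tsv(text: str) -> list[list[str]]:
    """Read tab-separated values back into rows of cells."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


table = [["format", "lossless"], ["PNG", "yes"], ["JPG", "no"]]
tsv_text = to_tsv(table)
print(tsv_text)
print(from_tsv(tsv_text) == table)  # True
```

Every cell is stored as plain text, which is why TSV carries no calculations, graphs, or layout.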
Established Formats for Presentations
Presentation office software: PPTX: Microsoft standard, ubiquitous (PowerPoint). ODP: open competitor of PPTX. PDF: ubiquitous, but read-only ("end of the line").
For geeks: HTML: open, but no established standard for slides. SVG: open, but no established standard for slides. LaTex+PDF (Beamer): open, easy to read, difficult to write.
Summary

Type    Format          Established   Lossless       Open
Images  SVG             Yes           Yes (vector)   Yes
Images  PNG             Yes           Yes            Yes
Images  JPG             Yes           No             Disputed
Audio   MIDI            Yes           Yes (vector)   Yes
Audio   FLAC            Yes           Yes            Yes
Audio   MP3             Yes           No             Yes (now)
Video   MPG             Yes           No             Yes
Video   MP4             Yes           No             No
Video   HEVC            No            No             No
Video   WebM            Not yet (?)   No             Yes
Office  TXT             Yes           Yes            Yes
Office  DOCX/PPTX/XLSX  Yes           Yes            Yes, but...
Office  ODT/ODS/ODP     Yes           Yes            Yes
Office  HTML/SVG        Yes for text  Yes for text   Yes
Office  PDF             Yes           Read-only      Yes