National Digital Heritage Archive Programme


Evaluating the historical persistence of DROID asserted PUIDs

Document control

Revision history

Revision | Date | Author | Reason for change
 | th February 2012 | Jay Gattuso | First draft for consultation
 | th March 2012 | Jay Gattuso | Second draft for review
 | nd March 2012 | Jay Gattuso | Final Draft
 | th June 2012 | Jay Gattuso | Final

Related documents

Path | Title
N/A | Main Results of the NLNZ DROID Version Tests

Summary
Audience
Motivation
Expected Outcomes
Summary of Results
Summary of Conclusions
Background
The Purpose of File Format Assertions
Basic DROID Operation
Signature Matching
File Extension Matching
Logical/Hierarchical based Matching
Container Matching
DROID FAST Mode
Rosetta Development and File Format Tools
Example 1: File Formats Not Known To PRONOM
Example 2: Files with Multiple PUID Assertions
Scope of Research
DROID Test Variables Selection
DROID Version
DROID Signature File
DROID Fast/Slow Mode
Describing the Test Set
Selection Criteria
Source Set List
Discussion
Being Open about File Format Identification Capabilities
Test Sets - Crowd Sourcing Solutions?
Managing Change and Supporting Legacy Assertions
Explaining the Results
File Format Error Types
Method
Hardware Used
Software Used

Step One - Running The Data Through DROID
Step Two - Ingest Results into Database
Step Three - Construct MySQL Queries to Extract Meaningful Data
Example Results Table
Step Four - Analysis and Visualisation of Data
Example Error Types and Visualisations
Reading the Visualisations
Example Error Types
Target Pattern: None
Type A - versions of DROID
Pattern for Type A - versions of DROID errors
Type B - assertions between signatures
Pattern for Type B - assertions between signatures errors
Type C - inconsistent FAST and SLOW
Pattern for Type C - inconsistent FAST and SLOW errors
Type D - extension ID is inconsistent with signature ID
Pattern for Type D - extension ID is inconsistent with signature ID errors
Pattern for Type E - subset/format variation errors
Pattern for Type E - subset/format variation errors
Type F Multi PUID
Pattern for Type F Multi PUID errors
Results
Summary of Results
Results by PUID
ExL-fmt/22 - CDX
ExL-fmt/41 - epub (OPS) epubs
ExL-fmt/61 - MPEG-4 Media File mp
ExL-fmt/62 - Microsoft Office Open XML docx, xlsx, pptx
ExL-fmt/ FLAC (Free Lossless Audio Codec) - FLAC
fmt/3 - Graphics Interchange Format 1987a - gif
fmt/4 - Graphics Interchange Format 1989a - gif
fmt/5 - Audio/Video Interleaved Format - avi
fmt/6 - Waveform Audio - wav
fmt/7 - Tagged Image File Format - tif

fmt/11 - Portable Network Graphics png
fmt/12 - Portable Network Graphics png
fmt/14 - Acrobat PDF Portable Document Format - pdf
fmt/15 - Acrobat PDF Portable Document Format - pdf
fmt/16 - Acrobat PDF Portable Document Format - pdf
fmt/17 - Acrobat PDF Portable Document Format - pdf
fmt/18 - Acrobat PDF Portable Document Format - pdf
fmt/19 - Acrobat PDF Portable Document Format - pdf
fmt/20 - Acrobat PDF Portable Document Format - pdf
fmt/39 - Microsoft Word for Windows Document 6.0/95 - doc
fmt/40 - Microsoft Word for Windows Document doc
fmt/41 - Raw JPEG Stream - jpg
fmt/42 - JPEG File Interchange Format jpg
fmt/43 - JPEG File Interchange Format jpg
fmt/44 - JPEG File Interchange Format jpg
fmt/45 - Rich Text Format rtf
fmt/49 - Rich Text Format rtf
fmt/50 - Rich Text Format rtf
fmt/52 - Rich Text Format rtf
fmt/61 - Microsoft Excel 97 Workbook - xls
fmt/62 - Microsoft Excel Workbook - xls
fmt/95 - Acrobat PDF/A - Portable Document Format - pdf
fmt/96 - Hypertext Markup Language html, htm
fmt/99 - Hypertext Markup Language 4.0 html, htm
fmt/100 - Hypertext Markup Language 4.01 html, htm
fmt/101 - Extensible Markup Language xml
fmt/111 - OLE2 Compound Document Format - doc, xls, ppt
fmt/116 - Windows Bitmap bmp
fmt/117 - Windows Bitmap 3.0 NT - bmp
fmt/126 - Microsoft Powerpoint Presentation ppt
fmt/132 - Windows Media Audio - wma
fmt/132 Only With File Extensions
fmt/133 - Windows Media Video - wmv
fmt/134 - MPEG 1/2 Audio Layer 3 mp

fmt/149 - JTIP (JPEG Tiled Image Pyramid) jpg
fmt/276 - Acrobat PDF Portable Document Format - pdf
x-fmt/
x-fmt/62 - Log File - log
x-fmt/92 - Adobe Photoshop - psd
x-fmt/111 - Plain Text File txt, log
x-fmt/135 - Audio Interchange File Format - aiff
x-fmt/219 - Alexa Archive File - arc
x-fmt/263 - ZIP Format - zip
x-fmt/279 - MPEG 1/2 Audio Layer 3 Streaming mp
x-fmt/279 Only With File Extensions
x-fmt/279 Only Without File Extensions
x-fmt/385 - Exchangeable Image File Format (Uncompressed) - tif
x-fmt/387 - Exchangeable Image File Format (Compressed) jpg
x-fmt/390 - Exchangeable Image File Format (Compressed) jpg
x-fmt/391 - Exchangeable Image File Format (Compressed) jpg
x-fmt/394 - WordPerfect for MS-DOS/Windows Document wpd, wp
x-fmt/398 - Exchangeable Image File Format (Compressed) jpg
x-fmt/409 - MS-DOS Executable - exe
x-fmt/411 - Windows Portable Executable - exe

Summary

Audience

This paper is intended for Digital Preservation (DP) practitioners. It is especially targeted at users of The National Archives (UK) (TNA) DROID file format characterisation tool or the TNA PRONOM file format registry, and at anyone with an active interest in the process of file type identification. Any observed changes to the DROID PUID assertions relative to the various parameters will be of interest to the primary DROID/PRONOM stakeholders, as such changes provide evidence for assessing the outcomes and impact of changes to the records and data used by these two systems.

Motivation

The key question this paper asks is 'how do changes to PRONOM and related tools affect the preservation planning of any collecting institution?' The idea for this research paper arose from format-related discussions in the National Digital Heritage Archive (NDHA) at the National Library of New Zealand (NLNZ). The author and other DP colleagues in the Library, Archives New Zealand (ANZ) and international institutions (including TNA, the British Library and the Library's DP system vendor, ExLibris) held a number of discussions that sought to understand how format identification has changed over recent years, and how changes made to the TNA toolset had resulted in changes to the file identifications being made in enterprise-level digital repositories.

Of specific interest was gaining a historic view of some primary file types, and understanding how a single file could have different file format assertions made against it as file identification signatures and the DROID tool have evolved and matured over time. This is especially pertinent to the NDHA, as these changes are found in the toolsets adopted in the Library's DP system (Rosetta), and are reflected in various digital objects relative to the development path of the Rosetta application.
Specifically, DROID is currently incorporated as the primary format identification tool, and has been since v1.0 of the Rosetta system. To add a layer of complexity, the Rosetta application itself undertakes the extension matching aspect of DROID, checking an internal replica of the PRONOM database for matches. When the Rosetta application hands a file to DROID for format identification, it hands over a bitstream with no file extension label, forcing DROID to make signature-based matches only.

This complex blend of standalone tools, globally recognised record sets and less-than-vanilla deployment of tools means that it is essential that the NDHA keeps a close eye on both how the DROID tool behaves and how the Rosetta application performs. Any changes in behaviour need to be understood, and the root cause identified. Over time the NDHA has encountered a number of occasions where systematic changes in primary processes (those linked to format identity) have been explained by a change to a PRONOM record or a behavioural change in a newer version of DROID.

The purpose of this paper is not to comment on the accuracy or completeness of the various PRONOM records, but rather to find compelling methods of finding and assessing any changes to DROID file type assertions, and to comment on the day-to-day and long-term impact of these changes for DROID/PRONOM users. This paper was conceived to give the NDHA an opportunity to explore the history of the PUIDs offered by the DROID tool in response to being given a file to characterise, as they have been encountered by the Library.

Expected Outcomes

It is expected that a concise report will be completed that compiles a broadly comparative corpus of data exploring historical file format assertions, identifies points of interest or difference, and makes some recommendations on the future direction of format identification tools.

Summary of Results

1) There is a clear pattern of some file type assertions being less than persistent.
2) Of the 61 tested PUIDs, 75% performed identically for all tested versions of DROID and signature versions, including files with no signature match and files with multiple PUID assertions.
3) Of the 61 tested PUIDs, 40% consistently offered a single PUID across the range of DROID tests.
4) In 26% of the 61 tested file types, multiple PUIDs are equally asserted by DROID at various times.
5) In 16% of the 61 tested file types, DROID version 6 in FAST mode performs differently from DROID version 6 in standard mode, with both using the same signature file. This is a surprising and concerning outcome.

Summary of Conclusions

6) There is a clear requirement for a community-owned dataset that spans the PRONOM catalogue.
7) It is strongly recommended that more research is undertaken into the persistence of PUID assertions, to give the community a more complete history of file type assertions by PRONOM/DROID.
8) It is noted that the DROID v6 FAST function demonstrates some behavioural differences for a specific set of file types, and it is recommended that it is not used on integrated systems with legacy data for this reason.
9) It is suggested that the management of changes to PRONOM records be explored with a view to limiting the impact of change on PRONOM users.

Background

The Purpose of File Format Assertions

NDHA makes purposeful use of file format assertions as a primary tenet of the Rosetta Digital Preservation System. All files that are ingested into the system have a file format assigned to them, predominantly via an automated process (DROID and system-based auto correction rules 3) or through a manual process, generally invoked by the system where neither DROID nor automatic processing results in a single file type assertion being made.
3 In the context of Rosetta, an auto correction rule is a system used to force some characterisation information about an object. Rules are typically used where automatic tools are unable to give a single positive result (e.g. DROID returning multiple PUIDs against a single file) or to fill knowledge gaps (e.g. where a format is not found in PRONOM/DROID, but can be accurately identified by its specific flow into the system). Rules generally allow a PUID to be asserted without a positive DROID ID. Rules are constructed using simple logic statements, such as: if file extension = ABC, PUID = fmt/abc. The source of the file format IDs is predominantly PRONOM.

File type is used by NDHA/Rosetta to steer any digital object through the various technical processes that are used to validate, enrich, and deliver digital information. File type is also considered by NDHA to be one of the primary identifiers used to associate any digital object with its inherent risks. These risks include access to significant/technical properties, preservation task planning, rendering, metadata extraction and any other Library business process that requires some interaction with a digital object.

Given the importance of the file format assertion to the preservation process, it is clearly essential that these assertions can be made accurately and consistently. To ensure this, the PRONOM file format registry was selected as the source for file format references early in the development of Rosetta. PRONOM supports the use of a single opaque descriptor as the method of describing a file format assertion. This descriptor is known as a Persistent Unique Identifier, or PUID 4, and it is this very persistence of identification, and the ramifications of any non-persistence of identifiers, that this paper explores.

Basic DROID Operation

To understand the impact of the data collected during this paper, it is worth reflecting briefly on what DROID does, and why. DROID is a file format classifier. Its purpose is to assess a file and offer (with as much accuracy as possible) a file format assertion taken from the PRONOM catalogue. A single file format assertion is the ideal outcome; however, on occasion DROID may only be able to offer a small number of file type assertions, or it may not be able to offer a file type assertion at all. DROID is not a file type validator. DROID has no concept of file validity, well-formedness or other such structural concepts.

The mechanisms used by DROID to make file type assertions can be described as four processes:
(1) Signature Matching
(2) File Extension Matching
(3) Logical/Hierarchical based Matching
(4) Container Matching

Signature Matching

In this process a signature is developed that includes any number of byte sequences that can be found in various locations inside the file. These are typically at the Beginning Of File (BOF), at the End Of File (EOF), or at variable positions throughout the file.
These byte patterns often consist of ASCII/UTF-8 strings that are consistently found in fixed or variable locations through the file, or other binary elements that are consistently found in the target format. Various offsets can be used to tune the pattern matching process and allow data blocks to be masked as potentially containing a specific pattern.

4 UID.29_scheme
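The signature matching process described above can be illustrated with a minimal Python sketch. It is not DROID's implementation and it does not use the real PRONOM signature syntax: the `BytePattern` class, the `identify` function and the two example signatures (the GIF header strings and trailer byte) are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BytePattern:
    """A byte sequence plus where it is expected (hypothetical model)."""
    sequence: bytes
    anchor: str      # "BOF", "EOF" or "VAR" (anywhere in the file)
    offset: int = 0  # bytes between the anchor and the sequence


def pattern_matches(data: bytes, pattern: BytePattern) -> bool:
    """Return True if the sequence is found where the pattern says it should be."""
    if pattern.anchor == "BOF":
        # Sequence must start exactly `offset` bytes after the start of the file.
        return data.startswith(pattern.sequence, pattern.offset)
    if pattern.anchor == "EOF":
        # Sequence must end exactly `offset` bytes before the end of the file.
        end = len(data) - pattern.offset
        return end >= 0 and data.endswith(pattern.sequence, 0, end)
    # "VAR": the sequence may appear anywhere in the bitstream.
    return pattern.sequence in data


# Illustrative signatures only; these are NOT copied from a DROID signature file.
SIGNATURES: Dict[str, List[BytePattern]] = {
    "fmt/3": [BytePattern(b"GIF87a", "BOF"), BytePattern(b"\x3b", "EOF")],
    "fmt/4": [BytePattern(b"GIF89a", "BOF"), BytePattern(b"\x3b", "EOF")],
}


def identify(data: bytes) -> List[str]:
    """Return every PUID whose patterns all match; the list may be empty."""
    return [
        puid
        for puid, patterns in SIGNATURES.items()
        if all(pattern_matches(data, p) for p in patterns)
    ]
```

Under these assumptions, a bitstream that begins with `GIF89a` and ends with the byte `0x3b` yields the single assertion `fmt/4`, while a bitstream matching no signature yields an empty list, which corresponds to the case where no file type assertion can be offered.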

DROID uses advanced searching algorithms (e.g. Boyer-Moore-Horspool), which allow files to be rapidly scanned for signature matches. There are currently some 579 distinct sets of patterns used inside the DROID signature file 6 to make signature-based file format assertions.

File Extension Matching

In this process the file extension supplied with the source file is used by DROID to make a file format assertion. This is considered to be a lower-weighted function than signature pattern matching (as described above). A signature pattern is generally associated with a file extension or a number of file extensions. When a signature match is made by DROID, a file extension match is also attempted, and any mismatch is noted by DROID. For example, if the result of the signature match indicates that the given file should have a file extension of .jpeg or .jpg, and this is found to be true, the file passes the file extension match; if, however, a different extension is found, e.g. .tif, then a file extension mismatch is reported. There are currently some 1141 distinct sets of file extensions used inside the DROID signature file 7 to make file extension based file format assertions.

Logical/Hierarchical based Matching

This process is used by DROID to attempt to resolve some overlapping signature matches. Within the file format description included in the DROID signature file is a mechanism that can be used to rank a signature match as higher than others. In practice this process allows multiple matches to be filtered by priority: if a matching priority statement is found in any of the file format records returned by DROID, the priority statements are assessed and any matching or relevant statements are used to refine the file format matches returned by DROID. There are currently some 113 distinct sets of priority statements used inside the DROID signature file 8 to make priority based file format assertions.
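The extension check and the priority filter described above can be sketched together in a few lines of Python. This is a simplified model under stated assumptions, not DROID's code: the `FormatRecord` class and both functions are hypothetical, the extension lists are taken from the Source Set List in this paper, and the priority of fmt/40 over fmt/111 is used only as an illustrative relationship.

```python
import os
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FormatRecord:
    """A cut-down format record (hypothetical; not the signature file schema)."""
    puid: str
    extensions: Set[str]
    has_priority_over: Set[str] = field(default_factory=set)


def extension_mismatch(filename: str, record: FormatRecord) -> bool:
    """True when the file's extension is not one listed for the matched format.

    Assumption: a file with no extension at all is also reported as a mismatch.
    """
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return ext not in record.extensions


def apply_priorities(matches: List[FormatRecord]) -> List[FormatRecord]:
    """Drop any match that another returned match claims priority over."""
    outranked: Set[str] = set()
    for record in matches:
        outranked |= record.has_priority_over
    return [record for record in matches if record.puid not in outranked]


# Illustrative records only.
jfif_102 = FormatRecord("fmt/44", {"jpeg", "jpg"})
ole2 = FormatRecord("fmt/111", {"doc", "xls", "ppt"})
word = FormatRecord("fmt/40", {"doc"}, has_priority_over={"fmt/111"})
```

With these records, a file named `scan.tif` that matched the fmt/44 signature would be flagged as an extension mismatch, and a file matching both the fmt/111 and fmt/40 signatures would be refined to fmt/40 alone. Where no priority statement links the returned records, all of them are kept, which is how multiple PUID assertions can survive to the final result.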
Container Matching

A new process added in DROID v6 is the notion of containers. This process allows DROID to identify a container object via some alternative mechanisms. This process was not explored in this paper, and the same container signature file was used throughout all tests. It is worth being aware of the container signature process for DROID v6 onwards, as it fundamentally changes how DROID processes some specific format types. This newer process is very similar to how the existing file signature process functions, but allows a higher level of analysis of an individual file by looking into its potential payload. It is expected that this new process will be reflected in the assertions recorded during this paper.

DROID FAST Mode

Another new process added in DROID v6 was the ability to select the portion of the file to be scanned. The native behaviour of DROID versions prior to v6 only allowed the scanning of the whole file.

6 Signature File v56 used.
7 Signature File v56 used.
8 Signature File v56 used.

One of the requirements for v6 was to include a method that supports more rapid scanning of larger files, and thus the maximum byte scanning feature was introduced. In essence this feature allows a DROID user to define a block of data, as a small portion of a complete file, within which DROID searches for patterns. This block size is used to clip the beginning section and the end section of the file for pattern searching, and allows the middle section of the file to be ignored. The block size is variable and is declared by the DROID user in an options menu. The block size is the same for the BOF and EOF sections. For these tests a block size of 64kB was used, simply because it is the default value in DROID v6.01.

Broadly speaking, it is expected that the use of the DROID FAST mode will not change the PUID assertion relative to the same signature file being used in the standard way (where the whole file is scanned for pattern matches, referred to as SLOW mode for the remainder of this paper). Any inconsistency between FAST and SLOW modes of operation would be concerning, and unwelcome.

Rosetta Development and File Format Tools

For a number of years the NDHA has been using DROID as an embedded part of its production Digital Preservation System (DPS), Rosetta. Rosetta has gone through a number of development iterations, and with each of these iterations aspects of DROID have been upgraded, ranging from updated signature versions through to full implementation of new versions of the DROID tool. Throughout every upgrade the production system has remained in live use, meaning that actual live data was continuously being ingested. The table below describes the versioning history, how it relates to DROID versions/signatures, and the number of files ingested during each iteration.
Rosetta Version | Duration of version as used by the NDHA | DROID Version | Signature Version | Files ingested during time period
V1 | 30/10/2008 to 26/03/2009 | V3 | V13 | 203,042
V1.1 | 27/03/2009 to 12/03/2010 | V3 | V13 | 106,420
V2 | 13/03/2010 to 20/11/2010 | V3 | V13 | 312,262
V2.1 | 21/11/2010 to 26/06/2011 | V5 | V37 | 1,631,366
V | /06/2011 to 05/11/2011 | V5 | V45 | 1,259,311
V2.2 | 06/11/2011 to 01/02/2012 | V5 | V49 | 2,314,572
Total | 30/10/2008 to 01/02/2012 | | | 5,826,973

As well as using DROID to offer a PUID assertion for each file being ingested, Rosetta uses a rules-based method to further refine DROID outputs where required. In Rosetta, the given file extension is stripped from the file before the file is presented to DROID, forcing DROID to operate exclusively in a signature-only mode (i.e. this implementation of DROID has no knowledge of the original file extension, meaning only signature-based assertions can be made by DROID). The file extension matching function of DROID is replicated internally by the Rosetta inbuilt format library function. This process allows internally derived rules to be used to help narrow any ambiguous or tentative DROID results. The Rosetta system demands that only a single PUID is asserted, and that the PUID assertion must be a valid PUID as found in the internal format library. This library predominantly comprises the PRONOM registry, with some additional formats that are not currently found in PRONOM.

Example 1: File Formats Not Known To PRONOM

NDHA ingests a large number of .cdx files as part of web harvesting activities. CDX files are not currently known to PRONOM, so DROID is unable to offer a correct PUID assertion for these files. Using the auto correction rules, CDX files that come from the web harvest process are assigned a PUID of ExL-fmt/22, based on their file extension and the ingest source. The file format record is held in the internal format library structure.

Example 2: Files with Multiple PUID Assertions

Until very recently all .tif files characterised by DROID received (tentative) PUID assertions of fmt/7, fmt/8, fmt/9 and fmt/10. These PUIDs cover tif version 3 through to tif version 6. DROID is not able to offer an accurate single tif version as there is currently no signature that can differentiate between versions. To allow Rosetta to refine this multiple PUID assertion, an auto correction rule is used to assert a single PUID against the files 9.

Scope of Research

The tests included in this paper are limited to a basic exploration of the file format PUIDs offered by various versions of the DROID application in response to a closed set of files being repeatedly tested. The coverage of the test files is described in the section entitled Describing the Test Set. The various signature files, DROID versions and DROID implementations are described in the section DROID Test Variables Selection. It is not the purpose of this paper to comment on the accuracy of any of the individual DROID/PRONOM format descriptions.
It may be possible to make some specific recommendations by inspecting the resulting data; however, this is not the primary purpose of the tests, nor is it within the scope of this paper.

9 It is worth noting that the PRONOM registry was amended in signature version 51 to include fmt/353, an encompassing tif PUID that allows generic description of tif files without making reference to a specific tif version. NDHA will be reflecting this change for all its ingested tif files, and amending the asserted PUID from fmt/7 to fmt/353. The impact of this change for NLNZ is not trivial, and will require a full auditable change in the records held for many hundreds of thousands of digital objects. See Discussion: Managing Change and Supporting Legacy Assertions.

DROID Test Variables Selection

The primary areas of interest were identified through the day-to-day business as usual operations of the NDHA Digital Preservation unit. These areas were reduced to a number of core variations, mainly to reduce the variable space to something more manageable, and the justifications are described below.

DROID Version

DROID versions v3, v5, and v6 were identified as being critical to the NDHA. Specifically, v3 was used in the Rosetta DPS, Rosetta was then upgraded to v5, and it will be upgraded to v6 in due course. For the purpose of this paper DROID v6 FAST and DROID v6 SLOW are considered to be different versions of DROID. This distinction is made simply to reduce the complexity of the variables being assessed. This means that there are ostensibly four versions of DROID under test.

DROID Signature File

The Rosetta DPS has been deployed with signature versions v13, v37, v45 and v49. The NDHA therefore contains digital objects that have been run through DROID with any of these signatures. Signature file v50 was also added to the pool as it was the latest released version at the time the testing was planned.

DROID Fast/Slow Mode

This variable describes the use of the DROID v6 feature called 'Max Byte Scan'. The default value is 64kB. This means that only the first and last 64kB of any file are scanned for signature matches. This variable can be set to -1, in which case the whole file will be scanned (as per the usual operation of DROID v3 and DROID v5). There is a current proposal to include the DROID v6 Max Byte Scan feature, set to 64kB, as the default setting for the Rosetta DPS. NDHA wanted to investigate the broad impact of this proposal, explore this function and thereby assess and understand its impact. It is worth noting that this is not a binary variable. Any block size can be stipulated, and the NDHA has not explored the impact of any other block size.

Describing the Test Set

To understand the results of these tests it is important to understand the notion of ground-truthed data and the specific implications this has for format identification and file characterisation. The term ground-truthed data refers to the use of a dataset that is known to the user.
The quality of being 'known' in this case specifically means that each digital object has a 100% accurate PUID assertion prior to any experimental testing. This is to support the analysis of any experimental data. For example, file A.jpg is known to be a JPEG version 1.02 file, known in PRONOM as fmt/44. This particular file could be known to be of type fmt/44 because it was created by a suitable image creation tool (and that tool conforms to the corresponding JPEG specification), or because it has been verified by a community accepted process or tool to be of that given type. Any format type (PUID) assertions resulting from tests that do not agree with this statement should be regarded as a 'fail' and of interest. Any PUID assertions that agree with this statement should be regarded as a 'pass'. In ideal conditions any experimental results should match the ground-truthed results 100%, in which case the specific experimental setup producing those results behaves consistently with the expected PUID assertions.
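The pass/fail comparison against a reference set can be expressed as a short Python sketch. The function names and the data layout are hypothetical, and one assumption goes beyond the text above: a result containing several PUIDs, or none, is graded as a 'fail' even if the expected PUID is among those returned.

```python
from typing import Dict, List


def grade(reference: Dict[str, str],
          asserted: Dict[str, List[str]]) -> Dict[str, str]:
    """Grade each file as 'pass' or 'fail' against its reference PUID.

    reference: filename -> the PUID the file is known (or assumed) to have
    asserted:  filename -> the list of PUIDs returned by one test run
    Assumption: only a single PUID equal to the reference counts as a pass.
    """
    results = {}
    for filename, expected in reference.items():
        puids = asserted.get(filename, [])
        results[filename] = "pass" if puids == [expected] else "fail"
    return results


def agreement_rate(results: Dict[str, str]) -> float:
    """Fraction of graded files that passed (0.0 for an empty result set)."""
    if not results:
        return 0.0
    passed = sum(1 for outcome in results.values() if outcome == "pass")
    return passed / len(results)


# Illustrative data only, using the A.jpg example from the text.
reference = {"A.jpg": "fmt/44", "B.tif": "fmt/7"}
run = {"A.jpg": ["fmt/44"], "B.tif": ["fmt/7", "fmt/8", "fmt/9", "fmt/10"]}
outcome = grade(reference, run)
```

Here `A.jpg` is graded as a pass and `B.tif` as a fail, giving an agreement rate of 0.5 for this run. Repeating the grading for each combination of DROID version and signature file gives one such rate per test configuration.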

There is currently no community accepted 'gold standard' dataset that covers a broad range of format types with 100% accuracy, especially at the volumes required for statistically sound experimentation. If such a set were available it would have been the ideal file set for this research.

All the files used in these tests are taken from the NDHA Digital Repository. This means that all files have previously been through the predominantly automated object characterisation process, which includes being run through DROID (and JHOVE). The automated processing, a feature of Rosetta, also includes some 'auto-correction' rules to mitigate standard exceptions, such as multiple PUID assertions given by DROID (e.g. where DROID indicates fmt/7, fmt/8, fmt/9 and fmt/10 as equal options for tif files, the auto correction rule would conflate these options to a single assertion of fmt/7), and finally some degree of manual format assertion, where a human operator can override or influence specific PUID assertions (e.g. a file may not get a DROID-based PUID assertion, at which time the human operator can investigate the file and make an appropriate PUID assertion if required).

Another consideration requiring some discussion is the use of different signature files during business as usual (BAU) processing. As these files were culled from a live system without an accurate history of their processing data (e.g. signature version used, auto rules used etc.), the only definitive statement that can be made is that at some point in time the NDHA has either accepted an automatic PUID assertion, or given a manual PUID, of the type associated with the object. It is not correct or accurate to claim that the file set used is a 'gold standard' ground-truthed reference. At best it could be described as a baseline set or, in simpler terms, 'a starting point used for comparison'.
This means that there is no high degree of confidence in the source PUID; it is simply a reference point with which to compare all other results. It may be incorrect. A file may be listed as being jpeg v1.01 (PUID fmt/43) when it would be more accurately listed as being in the jpeg v1.02 (fmt/44) set. There are examples of incorrect source PUID assertions in the test set, but it is a referential 'point in time' for the objects under test, and should be viewed as such. This lack of ground truth clearly impacts on the consistency with which PUID assertions can be measured and compared, and therefore affects the strength of any comments that can be made about the individual PRONOM file format records.

It would be very beneficial to the whole community of DROID users to develop a shared catalogue of test/exemplar files for each PUID listed in PRONOM. This ground-truthed data set would be the gold standard with which all other tests are compared, allowing some definitive and clear patterns and behaviours to be explored.

Selection Criteria

The National Library of New Zealand has a broad collection policy ranging from legal deposits to agreed and negotiated donations. With this broad range of items comes a range of access restrictions. To ensure that the items are accessed appropriately (by staff and Library users alike) the Library uses a level system to describe the access conditions. Access Level 400 is the most restricted (visible only with the explicit permission of the curator and on the Library's internal technology domain), and Access Level 100 is the least restricted (open access with no restrictions and accessible from anywhere, including the internet).

With this in mind it was important to ensure that any objects selected for this research were of a suitable Access Level, both to ensure that the Library's access policy was followed, and to support the possibility of sharing data/files in the future. Notwithstanding the access criteria, a substantial number of files were required to provide some robustness to the results, and to cover as many file formats as possible. An upper limit of 500 files per PUID was selected to allow the collection of a manageable number of files. The final selection criteria were twofold:
i) All objects have a restriction type of 100.
ii) All PUIDs that appear in the repository (and are of the access type described above) must be represented, at a maximum of 500 individual files (and a minimum of 1 file where there is a limit to the number of files available).

These criteria resulted in a set of 13,326 files being exported from Rosetta, which spanned 61 different PUIDs. Of those collected files it was only possible to gather 500 items for ~24 PUIDs 10, with some PUIDs only being represented by a few files. There was no specific attention given to the diversity of objects inside each PUID set (e.g. different creation applications, creation time periods, creation operating systems etc.). The assumption here was that all objects of the stated PUID are 'equal'. This is not likely to be an accurate statement; however, PUID is the lowest granularity of format description, and these objects reside in the NDHA with these PUID assertions. The statement that all objects of the stated PUID are equal should therefore be accepted as an axiom. The smallest granular identification that can be made with DROID/PRONOM is at the format level, and so it follows that all files of an identified format must share the same characteristics and behavioural responses, else the PRONOM label provides no purposeful granular separation of bitstreams.
As the DROID file characterisation process contains an element of file extension matching for some format types, the base set was extended to include a duplicate of all the selected files. A duplicate set of 13,326 files was created that were exact copies of the original set, other than having their file extensions removed. This second set was created to explore the difference between signature match based assertions and file extension match based assertions. This resulted in a set of 23,352 files covering 61 PUID types and spanning nearly 570GB of storage.

10 This is an approximation because the data set was not significantly ground truthed prior to testing. The reference or starting point PUIDs were taken from the live Rosetta system data held for each item, and so are only as accurate as the PUID assertion made against the file using the version of DROID / DROID signature file deployed at the time.
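The selection criteria and the extension-stripped duplicate set can be summarised in a short Python sketch. It is a reconstruction under stated assumptions rather than the export process actually used: the record layout (filename, PUID, access level) and all function names are hypothetical, and files are simply taken in the order they are encountered until the cap for their PUID is reached.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str, int]  # (filename, PUID, access level); hypothetical layout


def select_files(records: Iterable[Record],
                 access_level: int = 100,
                 cap: int = 500) -> Dict[str, List[str]]:
    """Keep at most `cap` files per PUID, all at the required access level."""
    chosen: Dict[str, List[str]] = defaultdict(list)
    for filename, puid, level in records:
        if level == access_level and len(chosen[puid]) < cap:
            chosen[puid].append(filename)
    return dict(chosen)


def strip_extension(filename: str) -> str:
    """Return the filename without its final extension."""
    return os.path.splitext(filename)[0]


def build_test_set(selection: Dict[str, List[str]]) -> List[str]:
    """Original files followed by their extension-stripped duplicates."""
    originals = [name for files in selection.values() for name in files]
    return originals + [strip_extension(name) for name in originals]


# Illustrative records only.
records = [(f"doc{i}.pdf", "fmt/18", 100) for i in range(600)]
records += [("image.tif", "fmt/7", 100), ("letter.doc", "fmt/40", 400)]
selection = select_files(records)
test_set = build_test_set(selection)
```

In this illustration the 600 open-access PDF records are capped at 500, the single tif file is kept, and the restricted doc file is excluded, so the combined test set holds 1,002 names. A real export would also need to keep the stripped copies in a separate location, since two files that differ only in extension would otherwise collide.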

Source Set List

There are 5 PUIDs on the baseline list that did not have PRONOM PUIDs when they were first ingested:

ExL-fmt/22 - CDX file format - part of the Web Curator Tool file structure
ExL-fmt/ the same as fmt/
ExL-fmt/41 - epub (OPS) 2.0 files
ExL-fmt/61 - the same as fmt/199
ExL-fmt/62 - the same as fmt/189

These files have non-PRONOM PUID names as they were classified by the NDHA prior to appearing in the PRONOM/DROID signature file that was deployed in Rosetta, or there was no equivalent PRONOM PUID at the time of testing.

PUID | Format Name | Extensions | Qty
ExL-fmt/22 | CDX | CDX | 1000
ExL-fmt/41 | epub (OPS) 2.0 | epub | 30
ExL-fmt/61 | MPEG-4 Media File | mp4 | 290
ExL-fmt/62 | Microsoft Office Open XML 2007 | docx, xlsx, pptx | 2
ExL-fmt/ | FLAC (Free Lossless Audio Codec) | flac | 1000
fmt/3 | Graphics Interchange Format 1987a | gif | 12
fmt/4 | Graphics Interchange Format 1989a | gif | 38
fmt/5 | Audio/Video Interleaved Format | avi | 130
fmt/6 | Waveform Audio | wav | 1000
fmt/7 | Tagged Image File Format | tif, tiff | 1000
fmt/11 | Portable Network Graphics 1.0 | png | 34
fmt/12 | Portable Network Graphics 1.1 | png | 56
fmt/14 | Acrobat PDF Portable Document Format | pdf | 32
fmt/15 | Acrobat PDF Portable Document Format | pdf | 86
fmt/16 | Acrobat PDF Portable Document Format | pdf | 986
fmt/17 | Acrobat PDF Portable Document Format | pdf | 996
fmt/18 | Acrobat PDF Portable Document Format | pdf | 1000
fmt/19 | Acrobat PDF Portable Document Format | pdf | 1000
fmt/20 | Acrobat PDF Portable Document Format | pdf | 1000
fmt/39 | Microsoft Word for Windows Document 6.0/95 | doc | 22
fmt/40 | Microsoft Word for Windows Document | doc | 1000
fmt/41 | Raw JPEG Stream | jpeg, jpg | 1000
fmt/42 | JPEG File Interchange Format 1.00 | jpeg, jpg | 220
fmt/43 | JPEG File Interchange Format 1.01 | jpeg, jpg | 1000
fmt/44 | JPEG File Interchange Format 1.02 | jpeg, jpg | 1000
fmt/45 | Rich Text Format 1.0 | rtf | 2
fmt/49 | Rich Text Format 1.4 | rtf | 88
fmt/50 | Rich Text Format 1.5 | rtf |
fmt/52 | Rich Text Format 1.7 | rtf | 186
fmt/61 | Microsoft Excel 97 Workbook | xls | 2
fmt/62 | Microsoft Excel Workbook | xls | 1000
fmt/95 | Acrobat PDF/A - Portable Document Format | pdf | 1000
fmt/96 | Hypertext Markup Language | html, htm | 6
fmt/99 | Hypertext Markup Language 4.0 | html, htm | 50
fmt/100 | Hypertext Markup Language 4.01 | html, htm | 8
fmt/101 | Extensible Markup Language 1.0 | xml | 1000
fmt/111 | OLE2 Compound Document Format | doc, xls, ppt | 1000
fmt/116 | Windows Bitmap 3.0 | bmp | 310
fmt/117 | Windows Bitmap 3.0 NT | bmp | 2
fmt/126 | Microsoft Powerpoint Presentation | ppt | 64
fmt/132 | Windows Media Audio | wma | 98
fmt/133 | Windows Media Video | wmv | 2
fmt/134 | MPEG 1/2 Audio Layer 3 | mp |
fmt/149 | JTIP (JPEG Tiled Image Pyramid) | jpg | 6
fmt/276 | Acrobat PDF Portable Document Format | pdf | 142
x-fmt/16 | Unicode Text File | txt, log | 1000
x-fmt/62 | Log File | log | 1000
x-fmt/92 | Adobe Photoshop | psd | 126
x-fmt/111 | Plain Text File | txt, log | 2
x-fmt/135 | Audio Interchange File Format | aiff | 2
x-fmt/219 | Alexa Archive File | arc | 1000
x-fmt/263 | ZIP Format | zip | 26
x-fmt/279 | MPEG 1/2 Audio Layer 3 Streaming | mp3 | 20
x-fmt/385 | MPEG-1 Video Format | mpg, mpeg | 4
x-fmt/387 | Exchangeable Image File Format (Uncompressed) | tif | 1000
x-fmt/390 | Exchangeable Image File Format (Compressed) 2.1 | jpg | 1000
x-fmt/391 | Exchangeable Image File Format (Compressed) 2.2 | jpg | 1000
x-fmt/394 | WordPerfect for MS-DOS/Windows Document 5.1 | wp, wpd, wp5 | 148
x-fmt/398 | Exchangeable Image File Format (Compressed) 2.0 | jpg | 48
x-fmt/409 | MS-DOS Executable | exe | 2
x-fmt/411 | Windows Portable Executable | exe | 2

This format label is correct, as per the Rosetta internal format library.

Discussion

Being Open about File Format Identification Capabilities

In very broad terms, most collecting institutions undertaking even basic digital preservation use file type identification as a fundamental process, often driving decisions that relate to access, storage and searching, as well as core digital preservation functions such as migration, emulation or other specialist activities. With this in mind it is essential that, as a community, DP practitioners firstly understand the basic principle of file type identification and secondly are collaboratively invested in the development of tools and processes that support this activity.

The PRONOM registry is the most commonly used reference for the DP community, and its existence is paramount to the success of a large number of current DP activities. The NDHA benefits significantly from the expertise and knowledge that goes into the development of PRONOM and related tools. It would be impractical for the NDHA (or any such collecting institution) to repeat the work completed to date on supporting file type identification, and in global terms the heritage sector stands to make significant efficiency savings by leveraging open access works (such as PRONOM).

Perhaps it is time to explore the question of ownership/stewardship of these sources, and maybe even to look towards building more complex file identification tools that allow both a common, community-agreed centre and locally derived extensions that can also be shared amongst interested users. It is likely that format identification will never be an exact science - and as we all strive to make accurate decisions about the long term care of our digital objects, what message should we be sending with them into the future? Is the answer to wait until the tools of tomorrow can answer the question of accurate file type identification?
Or should we strive to cooperate more effectively to help support the activities of today in preparation for the preservation actions of tomorrow?

Test Sets - Crowd Sourcing Solutions?

It was very apparent that this research would have benefitted from access to a large, controlled set of ground-truthed digital objects. This is not the first time the NDHA has encountered situations where a set of accurate examples for any of the 'known' file types would have significantly aided research and other DP related activities. It is to the absolute detriment of the global DP community that to date there is no established repository that allows DP practitioners to contribute to a pool of accurately assessed digital objects, to share files and file information in a controlled way, or otherwise collaboratively engage in the process of filling in the collective knowledge gaps.

Accurate file identification is likely to be one of the single most important functions DP practitioners undertake, as it forms the nucleus for most, if not all, of the preservation actions that follow. As our individual awareness and knowledge grows it is proposed that we should work more collectively to tackle this problem.

Managing Change and Supporting Legacy Assertions

This paper is one of the first studies undertaken to explore the issue of longevity and persistence in the PRONOM signature domain. The research has explored the difficulties of ongoing changes to PRONOM/DROID identifiers and the Max Byte Scanning function of DROID v6. It also calls the digital preservation community's attention to the work that remains to be completed to ensure that we can deal effectively with format type as a persistent identifier.

The Library is in the relatively novel and fortunate position of having a custom-built end-to-end preservation-centric repository. In the four years since Rosetta was initially developed the NDHA has learnt a great deal about the automated management of digital objects, and what precisely happens when file type assertions change over time.

One of the most significant challenges the NDHA must address is how PRONOM/DROID identifier changes affect the hundreds of thousands of digital objects in our archives. Should the file type assertions be updated on all previously ingested objects? Should the new PUID only receive support from the time it is deployed into the system via a signature update? Perhaps the new PUID should be ignored in favour of an old PUID, considering that one of the purposes of file type assignation is to support the meticulous cataloguing of the contents of our repository.

There are a number of reasons why a PUID may change, such as correcting an inaccuracy, refining an existing signature, adding a signature pattern to an existing extension-based record, or simply adding a new format to the registry. While these remain valid grounds for changing the PRONOM record, they can also have a considerable impact on wider production systems if they are not carefully planned and managed.
Perhaps there should be a robust, visible threshold for a change request to existing signatures as well as a peer-reviewed quality test? Or is it enough that the PRONOM owner receives or creates an update, and seeds it into the registry for release with the next update? While the impact of these changes remains a central challenge for the NDHA and the Rosetta vendor, Ex-Libris, we must remain wary of the significant obstacles raised by changes to PRONOM signatures: modifications that are often abrupt, contain minimal detail, and have a limited historical audit or visible discussion of the justifications for these changes.

Secondly, this research spotlights the Max Byte Scanning function in DROID v6. The Library feels significant attention should be given to this paper's results, especially since there is a clear, demonstrable difference in behaviour when DROID v6 is utilised. The Library does not currently support the use of this feature as it stands, given the substantial percentage of the tested format types that displayed some change in their PUID assertions. It is reasonable for users to assume that this mode is transparent to the operation of DROID, meaning that any assertions made by DROID are consistent irrespective of the use of this feature. Some signature types are not suited to the fast mode of operation, and the Library feels that this concern should be explored and resolved before deploying this feature in a BAU context.

Managing the PRONOM record is a complex, time consuming and costly venture. Fortunately for the digital preservation community the TNA have willingly stepped into the void and continue to maintain this record for our collective benefit. Nevertheless, it is time to reflect on the present model, our long-term strategy for managing change in the PRONOM records and to consider whether we, as a community, have got it right. Does the burden of ownership rest in the best place? Are the controls around that ownership correctly resourced? Have we invested enough intellectually to ensure that format identification is managed collaboratively as we move forward? The Library invites all members of the digital preservation community to reflect upon these questions.

Explaining the Results

In an ideal world there would be no difference between any of the test setups, and all files would receive the same PUID assertion. This would mean that, regardless of the choice of DROID version, DROID implementation or DROID signature file, any file would be consistently identified as being of a single PUID. Whilst desirable, this scenario is not likely to be encountered for every format type that is being investigated, not least because there are known changes, corrections and additions to the signature files that will affect PUID assertions regardless of individual DROID version performance.

It is worth noting that for these results the fact that DROID offered no PUID as a response (due to the file not matching any signature/extension) is as useful an outcome as any PUID assertion. As this paper is exploring the response of DROID over time, it is the consistency of the response that is of most interest, not the specific response itself.

The perfect result would be: $A_{PUID} = B_{PUID}$ for all DROID conditions, with no extension based variation.

As these tests were undertaken using real world data collected from a live system, and given the prior knowledge of documented changes (footnote 12) to the DROID signatures covering file format types defined by this paper as baselined PUIDs, it is anticipated that a number of different classes of errors and inconsistencies will be identified and grouped in a useful way.

File Format Error Types

Type A - Files get inconsistent PUID assertions between versions of DROID
$A_{PUID} \neq B_{PUID}$ for all DROID Versions
$A_{PUID} = B_{PUID}$ for all DROID Signature Versions
$A_{PUID} = B_{PUID}$ for all 'FAST' / 'SLOW'

Type B - Files get inconsistent PUID assertions between signature versions
$A_{PUID} = B_{PUID}$ for all DROID Versions
$A_{PUID} \neq B_{PUID}$ for all DROID Signature Versions
$A_{PUID} = B_{PUID}$ for all 'FAST' / 'SLOW'

Type C - PUID assertions are inconsistent between FAST and SLOW settings
$A_{PUID} = B_{PUID}$ for all DROID Versions
$A_{PUID} = B_{PUID}$ for all DROID Signature Versions
$A_{PUID} \neq B_{PUID}$ for all 'FAST' / 'SLOW'

Type D - No signature ID, or Extension ID is inconsistent with signature ID
Of given set $B_{PUID}$, at least two distinct sets of $A_{PUID}$ can be identified, where $A1_{PUID}$ and $A2_{PUID}$ differ only by the presence (or not) of a file extension

Type E - PUID assertions indicate a significant format subset/format variation
Of given set $B_{PUID}$, at least two distinct sets of $A_{PUID}$ can be identified

Type F - PUID assertions are consistently inconsistent. At least 2 competing PUIDs are given to one file.
Of given set $B_{PUID}$, at least two overlapping sets of $A_{PUID}$ can be identified

None - no significant inconsistencies observed.
$A_{PUID} = B_{PUID}$
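One way to read the error types above is as a comparison of assertions in which exactly one test variable changes at a time. The Python 3 sketch below illustrates that reading only; it is not the method used to produce the paper's results. The record layout, the function name `classify`, and the convention of labelling DROID v3 and v5 runs as "SLOW" are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def classify(observations):
    """Return the set of error-type letters seen in one baseline PUID set.

    Each observation is (file_id, has_ext, droid, sig, speed, puid).
    puid may be None when DROID offered no identification. An empty
    result corresponds to 'None - no significant inconsistencies'.
    Hypothetical layout: file_id is shared by a file and its
    extensionless twin; has_ext tells the two copies apart.
    """
    # (file_id, has_ext, droid, sig, speed) -> set of PUIDs asserted
    seen = defaultdict(set)
    for file_id, has_ext, droid, sig, speed, puid in observations:
        seen[(file_id, has_ext, droid, sig, speed)].add(puid)

    errors = set()
    if any(len(puids) > 1 for puids in seen.values()):
        errors.add("F")  # one copy, one condition, competing PUIDs

    # Pairwise comparison is quadratic; acceptable for a small sketch.
    for (k1, p1), (k2, p2) in combinations(seen.items(), 2):
        if p1 == p2:
            continue  # identical assertions are never an inconsistency
        (f1, e1, d1, s1, m1), (f2, e2, d2, s2, m2) = k1, k2
        same_copy = (f1, e1) == (f2, e2)
        differing = [d1 != d2, s1 != s2, m1 != m2]
        if same_copy and differing == [True, False, False]:
            errors.add("A")  # only the DROID version changed
        elif same_copy and differing == [False, True, False]:
            errors.add("B")  # only the signature version changed
        elif same_copy and differing == [False, False, True]:
            errors.add("C")  # only FAST/SLOW changed
        elif f1 == f2 and e1 != e2 and not any(differing):
            errors.add("D")  # only the presence of an extension changed
        elif f1 != f2 and e1 == e2 and not any(differing):
            errors.add("E")  # different files, identical conditions
    return errors


sample = [
    ("a", True, "v5", "v13", "SLOW", "fmt/96"),
    ("a", True, "v5", "v45", "SLOW", "fmt/99"),
]
print(classify(sample))
```

In the sample, the same copy of file "a" receives a different PUID when only the signature version changes, so the sketch reports a Type B inconsistency. Real result sets usually mix several types, which is why the function returns a set rather than a single label.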

Method

Hardware Used

All tests were completed on an HP Pavilion dv7 laptop (i7 processor with 8Gb RAM), with the test files stored on an external 2TB Seagate (USB2) Hard Disk Drive.

Software Used

All tests were undertaken on the above PC, running a 64bit install of Windows 7.

DROID v3.0, DROID v5.0.3, DROID v6.01, Python 2.7
WAMPSERVER 2.2 (MySQL v5.5.20, PHP v5.3.9, Apache v2.2.21, PHPMyADMIN v3.4.9)

Step One - Running The Data Through DROID

A matrix was drawn up to ensure that all permutations of DROID version, signature version, and 'FAST'/'SLOW' mode were covered.

DROID version (columns): v3, v5, v6 FAST, v6 SLOW
Signature version (rows): v13, v37, v45, v49, v50

On each occasion the appropriate version of DROID was run, set to the signature version under test and, where applicable, with the max byte scanning value set. Once set up, a new DROID profile was started and the test set of files added for characterisation. Upon completion of the characterisation process the resulting log file was exported as a comma separated value (CSV) file. As each of the three versions of DROID results in a different CSV structure, some Python code was written to homogenise all the outputs into a single structure type so they could be more easily compared.

Step Two - Ingest Results into Database

The reshaped CSV files were ingested into a MySQL database, and some further minor cleaning tasks were undertaken to ensure that only pertinent and accurate data was represented. The table structure selected was:

Apart from columns 1, 2 and 3 the structure is taken directly from the DROID v6 export CSV. Columns 1, 2 and 3 were added to allow test setups to be tracked against their respective PUID assertions.

A second MySQL table was constructed that included basic details of the source test set:

Where SourceFileName is the original file name as supplied to DROID, SourceExt is the file extension of the file, hasExtension is a binary flag used to record the presence of a file extension and SourcePUID is the baseline PUID statement that was previously given to the file in question.
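The reshaping of the per-version CSV exports, together with the three tracking columns described above, might look something like the following Python 3 sketch. The original code is not reproduced in this paper, so everything here is illustrative: the function name `homogenise` and the source header names in `COLUMN_MAPS` are hypothetical placeholders, and the real DROID v3, v5 and v6 export headers should be checked before adapting it. The target names `DROID_V`, `SIG_V`, `SPEED`, `NAME` and `PUID` are the ones used by the queries in Step Three.

```python
import csv
import io

# Hypothetical source-header -> target-column maps, one per DROID version.
# The real export headers differ between versions; these are placeholders.
COLUMN_MAPS = {
    "v3": {"FileName": "NAME", "PUID": "PUID"},
    "v5": {"File": "NAME", "Format PUID": "PUID"},
    "v6": {"NAME": "NAME", "PUID": "PUID"},
}


def homogenise(csv_text, droid_v, sig_v, speed):
    """Reshape one exported log into rows sharing a single structure.

    The three tracking columns (DROID_V, SIG_V, SPEED) are prepended so
    that each assertion can be traced back to its test setup.
    """
    mapping = COLUMN_MAPS[droid_v]
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = {"DROID_V": droid_v, "SIG_V": sig_v, "SPEED": speed}
        for source_col, target_col in mapping.items():
            # Missing cells become empty strings rather than raising.
            row[target_col] = record.get(source_col, "")
        rows.append(row)
    return rows


sample_v5 = "File,Format PUID\nexample.html,fmt/96\n"
print(homogenise(sample_v5, "v5", "v13", "SLOW"))
```

Keeping the per-version differences in one mapping table means a new DROID release only needs a new dictionary entry, not a new parsing routine.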

Step Three - Construct MySQL Queries to Extract Meaningful Data

The ingested logs resulted in nearly 800,000 individual rows being added to the main table. To suitably understand this data it was essential to construct some basic MySQL queries that would allow some detailed analysis.

Basic Query

There is a single 'question' that can be asked of the resulting data: "What PUIDs were given to any set of files, defined by having a common baseline PUID?" This question was 'translated' into MySQL, and used to collect the results for each set:

    SELECT `PUID`, `DROID_V`, `SIG_V`, `SPEED`,
        COUNT(distinct IF(sourcelist.hasExtension=1, NAME, NULL)) as Ext,
        COUNT(distinct IF(sourcelist.hasExtension=0, NAME, NULL)) as NoExt,
        COUNT(distinct NAME) as `All`
    FROM sourcelist, main_small
    WHERE sourcelist.sourcepuid = 'PUID_of_interest'
        AND main_small.name = sourcelist.sourcefilename
    GROUP BY `PUID`, `DROID_V`, `SIG_V`, `SPEED`
    ORDER BY `DROID_V` ASC, `SIG_V` ASC, `SPEED`;

The only variation is PUID_of_interest, which was changed to explore the specific source PUID of interest. For example, for HTML documents the whole query would be:

    SELECT `PUID`, `DROID_V`, `SIG_V`, `SPEED`,
        COUNT(distinct IF(sourcelist.hasExtension=1, NAME, NULL)) as Ext,
        COUNT(distinct IF(sourcelist.hasExtension=0, NAME, NULL)) as NoExt,
        COUNT(distinct NAME) as `All`
    FROM sourcelist, main_small
    WHERE sourcelist.sourcepuid = 'fmt/96'
        AND main_small.name = sourcelist.sourcefilename
    GROUP BY `PUID`, `DROID_V`, `SIG_V`, `SPEED`
    ORDER BY `DROID_V` ASC, `SIG_V` ASC, `SPEED`;

By using this query and a number of narrower expressions it was possible to track the performance of an individual file through all the tests.
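To show what the basic query returns, the Python 3 sketch below runs an equivalent query against a tiny in-memory database. Several things are swapped in for the sake of a self-contained example and are not the paper's setup: SQLite (through Python's standard `sqlite3` module) stands in for MySQL, `CASE WHEN` stands in for MySQL's `IF()`, `PUID` is added to the `ORDER BY` as a tiebreaker so the output order is deterministic, and the four sample files and their assertions are invented.

```python
import sqlite3

QUERY = """
SELECT m.PUID, m.DROID_V, m.SIG_V, m.SPEED,
       COUNT(DISTINCT CASE WHEN s.hasExtension = 1 THEN m.NAME END) AS Ext,
       COUNT(DISTINCT CASE WHEN s.hasExtension = 0 THEN m.NAME END) AS NoExt,
       COUNT(DISTINCT m.NAME) AS "All"
FROM sourcelist AS s
JOIN main_small AS m ON m.NAME = s.SourceFileName
WHERE s.SourcePUID = ?
GROUP BY m.PUID, m.DROID_V, m.SIG_V, m.SPEED
ORDER BY m.DROID_V, m.SIG_V, m.SPEED, m.PUID
"""


def run_demo():
    """Load four invented files and one test condition, then run QUERY."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        "CREATE TABLE main_small "
        "(DROID_V TEXT, SIG_V TEXT, SPEED TEXT, NAME TEXT, PUID TEXT);"
        "CREATE TABLE sourcelist "
        "(SourceFileName TEXT, SourceExt TEXT, hasExtension INTEGER, "
        "SourcePUID TEXT);"
    )
    con.executemany(
        "INSERT INTO sourcelist VALUES (?, ?, ?, ?)",
        [
            ("a.html", "html", 1, "fmt/96"),
            ("a", "", 0, "fmt/96"),
            ("b.html", "html", 1, "fmt/96"),
            ("b", "", 0, "fmt/96"),
        ],
    )
    con.executemany(
        "INSERT INTO main_small VALUES (?, ?, ?, ?, ?)",
        [
            ("v6", "v50", "SLOW", "a.html", "fmt/96"),
            ("v6", "v50", "SLOW", "a", "fmt/96"),
            ("v6", "v50", "SLOW", "b.html", "fmt/96"),
            # Invented disagreement: the extensionless twin of b differs.
            ("v6", "v50", "SLOW", "b", "fmt/99"),
        ],
    )
    # The baseline PUID is passed as a bound parameter, not pasted in.
    rows = con.execute(QUERY, ("fmt/96",)).fetchall()
    con.close()
    return rows


for row in run_demo():
    print(row)
```

Each output row is one asserted PUID under one test condition, with the counts of extension-bearing files, extensionless files and all distinct files that received it. Passing the baseline PUID as a bound parameter also avoids the quoting problems that arise when a value such as fmt/96 is written into the SQL text by hand.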

This is an invaluable method of exploring what is a very rich dataset. It is recommended that the following printed tables and images are used for guidance only and that any further analysis be completed via MySQL queries run directly against the database. This can be made available upon request.

Example Results Table

The files used to generate these results were six files that purport to be HTML documents (fmt/96). Three of the files had the appropriate file extension (in this case .html or .htm), and three files had no file extension (but were otherwise complete copies of the previous three files).

The above table should be interpreted as described: DROID_V describes the version of DROID used, SIG_V describes the signature version used and SPEED is only relevant for the v6 FAST vs SLOW comparisons. The Ext column is a count of the number of files that match the given row PUID (and described DROID conditions) and where the files used had the original file extension. The NoExt column is as Ext, except it only includes files where there was no file extension offered. Tally is a sum of the Ext and NoExt fields.

Step Four - Analysis and Visualisation of Data

The table data as exported from the MySQL query was collected and is presented as a primary supporting paper to this document (available on request).

Having demonstrated the accuracy of the MySQL queries, the final stage was to write a method for visualising the resulting data so it can be understood. Python was used to create some representations of the dataset, based on the output from the SQL. This tool was used to generate the following source PUID summary visualisations.

Example Error Types and Visualisations

Reading the Visualisations

There are essentially three informational parts to the visualisations used to display the results:

A describes the source file set used to generate a set of results
B describes the results encountered for a specific set of files (described in A)
C describes the colour coding used to display the relationship between the results found, and the combination of DROID versions and DROID signatures that caused each result set

These separate parts contain some specific and useful pieces of information which are explained in detail below. As the resulting data was vast and of a complex nature, the following visualisation approach was found to be the simplest method for concisely displaying the complex relationships discovered as a result of this research. It is worth spending some time getting familiar with the following data representations before exploring the results.

The centre circle (A) describes the source file set used to generate a set of results:

- Source is a label to note that it represents the source for the test.
- fmt/a is the PUID associated with the source set of files (as noted previously, this is not a ground truth, but a live system record associated with a file).
- (All) indicates which files of the source type were used. The options used for the entire series of tests were (All), (No Extension), (Extension) or (DROID v6 only), where:
  o All indicates that all the files of the associated PUID were used for the test
  o No Extension indicates that only files of the associated PUID without a file extension label were used for the test
  o Extension indicates that only the files of the associated PUID with a file extension were used for the test
  o DROID v6 indicates that only DROID v6 (fast and slow) were used for the test
- Hits: n is the total number of times any tested version of DROID made a PUID assertion against a file in the source set. This gives an indication of the number of comparisons made: the higher the number, the more individual comparisons made (and thus the higher the confidence in the result).
- Files: n is the total number of files that comprised the source set for the associated PUID. This value is useful to give an indication of the relative confidence there is in the resulting data. If this is a low number (e.g. 2 files) the results are not very definitive, and may not be very indicative of the source format type. If this is a high number (e.g. 2,000 files) the results can be regarded as relatively indicative for the source format type.

The outer ring of circles (B) describes and represents the results generated by a source set of files. For every test the number of outer circles matches the number of individual PUID labels returned across all the tests undertaken in each set. If there is only one outer circle, DROID only returned one PUID label for all files and all combinations of DROID versions / DROID signature files. If there are fifteen outer circles, DROID returned fifteen different PUID labels for all files comprising the test set, across the different combinations of DROID versions / DROID signature files.

These circles always hold three pieces of data:

- fmt/x identifies a specific PUID label that DROID returned for files in the source set.
- n%: the percentage figure indicates the share of the total hits that make up this resulting subset of PUID assertions. If this figure is 100%, it indicates that all the matches made in the duration of the test relate to the single result PUID. If this figure is 50%, it indicates that half the matches made in the duration of the test relate to the associated result PUID. To place this into some context, if the total number of hits was 22,000 then for a 50% result set, 11,000 hits were for the associated PUID. In the visible part of the example on the left, 20% of the hits were for fmt/e and 20% of the hits were for fmt/d.
- Files: n describes the number of distinct files that triggered the result set. This number can be either the same as the source Files value or less than it. It cannot be more than the source Files value. If this figure is less than the source Files value it indicates that some files in the source set are being appraised differently by DROID. The specific difference is captured in the colour of the circle, and described by the key.

The size of the circle also conveys some information. The larger the result circle, the larger the percentage of hits it represents. A small circle indicates that a small number of hits are being represented by the result circle.

Using the colour, the number, and the size of the results circles, it is possible to very quickly understand the broad types of relationships and information encountered while exploring a specific source PUID set.

Finally, the key object helps to explain in some detail the number of DROID versions used to generate a result and the number of DROID signature files used to generate a result.
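The percentage and Files figures shown in each result circle can be derived directly from the individual assertions. The Python 3 sketch below illustrates the arithmetic only; the function name `summarise` and the (file name, PUID) row layout are assumptions for the example, not the code used to draw the paper's graphics.

```python
from collections import defaultdict


def summarise(rows):
    """Summarise hits per result PUID.

    rows is an iterable of (file_name, puid) pairs, one pair per PUID
    assertion ('hit') made by any tested DROID configuration.
    Returns {puid: {"percent": share of all hits, "files": distinct files}}.
    """
    hits = defaultdict(int)
    files = defaultdict(set)
    for name, puid in rows:
        hits[puid] += 1
        files[puid].add(name)
    total = sum(hits.values())
    return {
        puid: {
            "percent": 100.0 * count / total,
            "files": len(files[puid]),
        }
        for puid, count in hits.items()
    }


# Two invented files, each tested under two conditions (four hits).
sample_hits = [
    ("a", "fmt/d"),
    ("a", "fmt/d"),
    ("b", "fmt/d"),
    ("b", "fmt/e"),
]
print(summarise(sample_hits))
```

In the sample, fmt/d accounts for three of four hits (75%) across two distinct files, while fmt/e accounts for one hit (25%) from a single file. The same calculation gives the figure quoted in the text: 50% of 22,000 hits is 0.5 x 22,000 = 11,000 hits.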

The key (C) is essentially a look-up table with two axes: Number of DROIDs and Number of Signature files. To allow the graphic to be quickly read, these relationships have been colour coded, and the results set (outer circles) have been coloured accordingly. The highlighted sections (the areas with the bold border) assist the reader in seeing at a glance what colours are used by all the results circles in the result set.

There is a very basic relationship between the colour used and the information it portrays. The red horizontal line (starting top left, and moving progressively towards pink when read left to right) indicates that only one version of DROID generated the associated result set. The darkest red indicates that all the signature files tested gave the same resulting PUID; the lightest pink indicates that only one signature file gave the resulting PUID. Similarly, the green horizontal line (starting bottom left, and moving progressively towards light green when read left to right) indicates that all four tested versions of DROID agreed with the associated resulting PUID. The darkest green indicates that all the signature files tested gave the same resulting PUID; the lightest green indicates that only one signature file gave the resulting PUID.

This relationship can be checked by reading the two axis labels to establish the number being indicated; however, the basic rule of thumb is that the colour hue itself relates to the number of DROID versions and the colour saturation/intensity relates to the number of DROID signature files.

In this example there are three separate result sets being indicated by the key. These would be accompanied by three circles of corresponding colour:

- The first set (darkest green) indicates that the result circle of the matching colour was comprised of results from all versions of DROID and all DROID signatures
- The second set (middle green) indicates that the result circle of the matching colour was comprised of results from all versions of DROID and three of the DROID signatures
- The third set (lightest green) indicates that the result circle of matching colour was comprised of matches from all versions of DROID and one DROID signature

These three informational components can be drawn together to allow the fast comprehension of the basic trends and patterns found when undertaking these tests.
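The rule of thumb for the key (hue follows the number of DROID versions, saturation follows the number of signature files) can be expressed as a small function. The Python 3 sketch below is an assumption-laden illustration: the exact palette used in the paper's graphics is not specified, so the red-to-green hue interpolation, the linear saturation scale and the function name `key_colour` are all hypothetical.

```python
import colorsys


def key_colour(n_droids, n_sigs, max_droids=4, max_sigs=5):
    """Return a hex colour for one cell of the look-up key.

    Assumed mapping (not taken from the paper's actual palette):
    hue runs from red (1 DROID version) to green (all versions), and
    saturation grows with the number of agreeing signature files.
    """
    if not (1 <= n_droids <= max_droids and 1 <= n_sigs <= max_sigs):
        raise ValueError("counts fall outside the axes of the key")
    # 0.0 is red and 1/3 is green on the HSV colour wheel.
    hue = (n_droids - 1) / (max_droids - 1) * (120 / 360)
    saturation = n_sigs / max_sigs  # fewer signature files -> paler
    red, green, blue = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return "#{:02x}{:02x}{:02x}".format(
        round(red * 255), round(green * 255), round(blue * 255)
    )


print(key_colour(1, 5))  # one DROID version, all five signature files
print(key_colour(4, 5))  # all DROID versions, all five signature files
print(key_colour(1, 1))  # one DROID version, one signature file
```

Under these assumptions the cell for one DROID version and five signature files is a saturated red, the cell for four DROID versions and five signature files is a saturated green, and reducing the signature count moves either colour towards a paler tint.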

From the example above it is possible to make a few definitive statements about the results behind the graphic:

1) There were three different PUIDs identified as being the file type of the source files
2) One PUID (PUID/B) was indicated for 200 files for all versions of DROID and one of the five tested signature versions.
3) One PUID (PUID/C) was indicated for 300 files for all versions of DROID and for three of the five tested signature versions.
4) One PUID (PUID/D) was indicated for all 500 test files in all versions of DROID and with all signature versions.
5) There is a Type B (signature variation) error associated with these files
6) There is a Type F (Multi PUID) error associated with at least 200 of these files
7) There could be a Type E (subset) error associated with at least 200 of these files
8) Of all the hits, 50% of the DROID matches were for PUID/D
9) The test results cover all files of the set PUID/A, including variants with and without file extensions

Example Error Types

None - No inconsistencies

This is the best case result. For all versions of DROID and all signature files, all the files of the given source format returned the same PUID. This indicates that (1) all the files are characterised as the same PUID, (2) all versions of DROID treat the files in the same way (including the FAST mode in DROID v6) and (3) the file signature is consistent for all signature versions. This type is identifiable by there being only one result set, as indicated in the key.

Pattern for None - No inconsistencies

To quickly detect a set with no errors the pattern to look for is:

- A single dark green result circle
  o Of equal size to the Source circle
  o With a 100% hit rate value
  o With matching File counts in the source circle, and the result circle
- A single highlighted square in the Key
  o Matching 4 DROID versions and 5 signature files

Check that you can find these features in the above example.

Type A - versions of DROID

Files get inconsistent PUID assertions between versions of DROID.

In this example it is apparent that different versions of DROID return different PUIDs irrespective of the signature version. This indicates that the different versions of DROID are handling the files differently. This error type is identifiable by the result sets found in the 'No. Droids' columns in positions other than '4', as indicated in the key.

Pattern for Type A - versions of DROID errors

To quickly detect a set with Type A errors (inconsistent PUID assertions between versions of DROID) the pattern to look for is:

- At least 2 non-green results circles
  o With matching File counts in the source circle, and the result circles
- At least 2 highlighted squares in the Key
  o Matching 1, 2 or 3 DROID versions and 5 signature files

Check that you can find these features in the above example.

Type B - assertions between signatures

Files get inconsistent PUID assertions between signature versions.

In this example different signature versions return different PUIDs irrespective of the version of DROID used. This indicates that the signature used for a particular file type has changed over time. This error type is identifiable by the result sets found in the 'No. Sigs' rows in positions other than '5', as indicated in the key.

Pattern for Type B - assertions between signatures errors

To quickly detect a set with Type B errors (inconsistent PUID assertions between signatures) the pattern to look for is:

- More than one green results circle
  o Without matching File counts in the source circle, and the result circles
- At least 2 highlighted squares in the Key
  o Matching 4 DROID versions and 1, 2, 3, 4 or 5 signature files

Check that you can find these features in the above example.

Type C - inconsistent FAST and SLOW

PUID assertions are inconsistent between FAST and SLOW settings.

In this example, the DROID v6 'FAST' mode specifically caused some results inconsistent with the rest of the results for that set. A specific graphic is used in this case to clearly demonstrate that the two versions of DROID (DROID v6 'SLOW' and DROID v6 'FAST') offered different results. If there was no difference, the graphic would only contain green sets. The graphic does not show which version of DROID created which result.

Pattern for Type C - inconsistent FAST and SLOW errors

To quickly detect a set with Type C errors (PUID assertions are inconsistent between FAST and SLOW settings) the pattern to look for is:

- More than one results circle, at least one being red
  o Without matching File counts in the source circle, and the result circles
  o The Key only contains 2 rows, Red and Green
- More than one highlighted square in the Key
  o With at least one matching 1 DROID version

These errors are always indicated with a second graphic (displaying the reduced Key table).

Check that you can find these features in the above example.

Type D - extension ID is inconsistent with signature ID

No signature ID, or extension ID is inconsistent with signature ID.

In this example, the files respond differently if they have no file extension. Two additional graphics are used in this case to clearly demonstrate how the two sets of files (with file extensions and without file extensions) offer different results. The different subsets (with extension and without extension) are indicated as highlighted.

Pattern for Type D - extension ID is inconsistent with signature ID errors

This error type always has three graphics, notable by the combination of (All), (Ext) and (No Ext) labels. It is the only error type to have these three graphics.

Check that you can find these features in the above example.

Type E - subset/format variation

PUID assertions indicate a significant format subset/format variation.

In this example it is apparent that some of the files are treated by DROID/Signature versions differently to the main set. This response is consistent across all variations, indicating that there are actually two distinct sets of file types in this set. This type is identifiable by looking at the number of files found in any results set. In the example above it is apparent that there is a set of 112 files that behaved differently to a second set of 388 files.

Pattern for Type E - subset/format variation errors

To quickly detect a set with Type E errors (PUID assertions indicate a significant format subset/format variation) the pattern to look for is:

- More than one results circle, all being dark green
  o Without matching File counts in the source circle, and each of the result circles
- Only 1 highlighted square in the Key
  o Matching 4 DROID versions and 5 signature files

Check that you can find these features in the above example.

Type F - Multi PUID

PUID assertions are consistently inconsistent: at least 2 PUIDs are given to one file.

In this example it is apparent that there is no single PUID offered, and all the files in the set are indicated as equally belonging to a number of possible PUIDs.

This type is identifiable by looking at the number of files found in any results set. In the example above it is apparent that all sets of results have the same number of files. This figure may match the number of source files (as in this example) or it may be comprised of a discrete subset of files.

Pattern for Type F - Multi PUID errors

To quickly detect a set with Type F errors (PUID assertions consistently offer multiple PUIDs for a single file) the pattern to look for is:

• More than one results circle, all of the same colour
  o With matching File counts in the source circle, and each of the result circles
  o Each of the associated results circles have matching File counts

N.B. This error type can be found as a subset result. This means the indicators are slightly broader than for other error types.

Check that you can find these features in the above example.
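A Type F condition amounts to one file receiving two or more PUIDs within a single run. The following Python sketch shows one way this could be checked against tabular results. It is an illustration only: it assumes (file name, condition, PUID) tuples in which a file with several PUIDs appears once per PUID, and the function name, condition label and placeholder PUIDs are hypothetical.

```python
from collections import defaultdict


def multi_puid_files(rows):
    """Return {file_name: [puids]} for files given 2+ PUIDs in a single run.

    rows: iterable of (file_name, condition, puid) tuples; a file that
    received several PUIDs in one run appears once per PUID (assumed layout).
    """
    per_run = defaultdict(set)
    for file_name, condition, puid in rows:
        per_run[(file_name, condition)].add(puid)
    flagged = defaultdict(set)
    for (file_name, _condition), puids in per_run.items():
        if len(puids) >= 2:
            flagged[file_name].update(puids)
    return {name: sorted(puids) for name, puids in sorted(flagged.items())}


# Placeholder data: PUID-1 and PUID-2 are not real PRONOM identifiers.
sample_rows = [
    ("file_a", "condition-1", "PUID-1"),
    ("file_a", "condition-1", "PUID-2"),
    ("file_b", "condition-1", "PUID-1"),
]
print(multi_puid_files(sample_rows))  # {'file_a': ['PUID-1', 'PUID-2']}
```

Because the check is made per file, it also picks up the case noted above where only a subset of the source files is affected.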

Results

To start to understand the results, each set of source PUID files will be evaluated for its resulting asserted PUIDs, grouped by the DROID conditions the files were passed through. This raw data can be found in the supporting paper: Main Results of the NLNZ DROID Version Tests. A summary table, which endeavours to classify the source PUID sets into the error types described above, follows:

Error Types: DROID (A), Sig (B), FAST (C), Extension (D), Subset (E), Multi PUID (F), NONE

PUID            Extensions          Files
ExL-fmt/22      CDX                 1000
ExLfmt/24417    epub                1000
ExL-fmt/41      mp4                 30
ExL-fmt/61      docx, xlsx, pptx    290
ExL-fmt/62      flac                2
fmt/3           gif                 12
fmt/4           gif                 38
fmt/5           avi                 130
fmt/6           wav                 1000
fmt/7           tif, tiff           1000
fmt/11          png                 34
fmt/12          png                 56
fmt/14          pdf                 32
fmt/15          pdf                 86
fmt/16          pdf                 986
fmt/17          pdf                 996
fmt/18          pdf                 1000
fmt/19          pdf                 1000
fmt/20          pdf                 1000
fmt/39          doc                 22
fmt/40          doc                 1000
fmt/41          jpeg, jpg           1000
fmt/42          jpeg, jpg           220
fmt/43          jpeg, jpg           1000
fmt/44          jpeg, jpg
fmt/45          rtf                 2
fmt/49          rtf                 88
fmt/50          rtf                 136
fmt/52          rtf                 186
fmt/61          xls                 2
fmt/62          xls                 1000
fmt/95          pdf                 1000
fmt/96          html, htm           6
fmt/99          html, htm           50
fmt/100         html, htm           8
fmt/101         xml                 1000
fmt/111         doc, xls, ppt       1000
fmt/116         bmp                 310
fmt/117         bmp                 2
fmt/126         ppt                 64
fmt/132         wma                 98
fmt/133         wmv                 2
fmt/134         mp3
fmt/149         jpg                 6
fmt/276         pdf                 142
x-fmt/16        txt, log            1000
x-fmt/62        log                 1000
x-fmt/92        psd                 126
x-fmt/111       txt, log            2
x-fmt/135       aiff                2
x-fmt/219       arc                 1000
x-fmt/263       zip                 26
x-fmt/279       mp3                 20
x-fmt/385       mpg, mpeg           4
x-fmt/387       tif                 1000
x-fmt/390       jpg                 1000
x-fmt/391       jpg                 1000
x-fmt/394       wp, wpd, wp5        148
x-fmt/398       jpg
x-fmt/409       exe                 2
x-fmt/411       exe                 2

Summary of Results

Of the 61 different PUIDs tested:

• 75% displayed the same results for all versions of DROID and all signature files, including multi PUID and extension errors
• 40% displayed no inconsistencies
  a. By extension: gif, avi, png, jpg, html, xml, bmp, wp, and some subsets of doc, ppt and exe
• 7% displayed some inter-DROID version inconsistencies (excluding DROID v6 FAST issues)
  a. By extension: doc, xls, ppt, some pdf, and wma
• 26% displayed some inter-signature version inconsistencies
  a. By extension: docx, xlsx, pptx, some pdf, doc, xls, ppt, txt, log, aiff, and arc
• 16% displayed some specific DROID v6 'FAST' mode inconsistencies
  a. By extension: epubs, mp4, flac, wav, zip and some subsets of pdf, xls, tif and exe
• 23% displayed some extension related inconsistencies
  a. By extension: cdx, some pdf, mp3, mpg, some jpg, txt, log and psd
• 7% indicated some significant subset inside the set
  a. By extension: mp3, arc, some doc, and xls
• 26% consistently indicated multiple PUIDs for a single file
  a. By extension: tif, pdf, rtf, txt, arc and some xls

Results by PUID

It is important to understand that the following graphics offer only a high level view of the results for each source PUID. There are a number of limitations to the presentation of complex datasets and the weaknesses in this approach should be apparent. For example, it is not possible to tell which direction the signature version changes have taken solely by using the following graphs (e.g. has the signature become more accurate or less accurate over time). The same can be said for most of the error types identified as being of interest.

It should also be understood that signature based matches are unquestionably preferred over file extension based matches. The relative weakness of extension matches is a known concern with the PRONOM file registry; however, it is included in these results as it remains an important consideration for assessing the consistency of file type identifications, especially where an extension based match is replaced with a signature based match at some point in time.

It is possible to make some inferences directly from the graphic; however, detailed analysis should be completed using the graphics, the results tables, and ideally the MySQL database (available upon request) as informational sources.

ExL-fmt/22 - CDX
ExL-fmt/22 - Only With File Extensions
ExL-fmt/22 - Only Without File Extensions

ExL-fmt/41 - epub (OPS) - epubs
ExL-fmt/41 - DROID v6 Only

ExL-fmt/61 - MPEG-4 Media File - mp4

ExL-fmt/62 - Microsoft Office Open XML - docx, xlsx, pptx
ExL-fmt/62 - DROID v6 Only

ExL-fmt/ FLAC (Free Lossless Audio Codec) - FLAC

fmt/3 - Graphics Interchange Format 1987a - gif

fmt/4 - Graphics Interchange Format 1989a - gif

fmt/5 - Audio/Video Interleaved Format - avi

fmt/6 - Waveform Audio - wav
fmt/6 - DROID v6 Only

fmt/7 - Tagged Image File Format - tif

fmt/11 - Portable Network Graphics - png

fmt/12 - Portable Network Graphics - png

fmt/14 - Acrobat PDF Portable Document Format - pdf
fmt/14 - Only With File Extensions
fmt/14 - Only Without File Extensions

fmt/15 - Acrobat PDF Portable Document Format - pdf
fmt/15 - DROID v6 Only

fmt/16 - Acrobat PDF Portable Document Format - pdf
fmt/16 - DROID v6 Only

fmt/17 - Acrobat PDF Portable Document Format - pdf
fmt/17 - Only With File Extensions
fmt/17 - Only Without File Extensions

fmt/18 - Acrobat PDF Portable Document Format - pdf
fmt/18 - Only With File Extensions
fmt/18 - Only Without File Extensions

fmt/19 - Acrobat PDF Portable Document Format - pdf
fmt/19 - Only With File Extensions
fmt/19 - Only Without File Extensions

fmt/20 - Acrobat PDF Portable Document Format - pdf
fmt/20 - Only With File Extensions
fmt/20 - Only Without File Extensions

fmt/39 - Microsoft Word for Windows Document 6.0/95 - doc

fmt/40 - Microsoft Word for Windows Document - doc
fmt/40 - Only With File Extensions
fmt/40 - Only Without File Extensions

fmt/41 - Raw JPEG Stream - jpg

fmt/42 - JPEG File Interchange Format - jpg

fmt/43 - JPEG File Interchange Format - jpg

fmt/44 - JPEG File Interchange Format - jpg

fmt/45 - Rich Text Format - rtf

fmt/49 - Rich Text Format - rtf

fmt/50 - Rich Text Format - rtf

fmt/52 - Rich Text Format - rtf

fmt/61 - Microsoft Excel 97 Workbook - xls
fmt/61 - DROID v6 Only

fmt/62 - Microsoft Excel Workbook - xls
fmt/62 - DROID v6 Only

fmt/95 - Acrobat PDF/A - Portable Document Format - pdf
fmt/95 - Only With File Extensions
fmt/95 - Only Without File Extensions

fmt/96 - Hypertext Markup Language - html, htm

fmt/99 - Hypertext Markup Language 4.0 - html, htm

fmt/100 - Hypertext Markup Language 4.01 - html, htm

fmt/101 - Extensible Markup Language - xml

fmt/111 - OLE2 Compound Document Format - doc, xls, ppt
fmt/111 - DROID v6 Only

fmt/116 - Windows Bitmap - bmp

fmt/117 - Windows Bitmap 3.0 NT - bmp

fmt/126 - Microsoft Powerpoint Presentation - ppt

fmt/132 - Windows Media Audio - wma
fmt/132 - Only With File Extensions
fmt/132 - Only Without File Extensions

fmt/133 - Windows Media Video - wmv

fmt/134 - MPEG 1/2 Audio Layer 3 - mp3
fmt/134 - Only With File Extensions
fmt/134 - Only Without File Extensions

fmt/149 - JTIP (JPEG Tiled Image Pyramid) - jpg
fmt/149 - Only With File Extensions
fmt/149 - Only Without File Extensions

fmt/276 - Acrobat PDF Portable Document Format - pdf
fmt/276 - Only With File Extensions
fmt/276 - Only Without File Extensions

x-fmt/16 - Unicode Text File - txt, log
x-fmt/16 - Only With File Extensions
x-fmt/16 - Only Without File Extensions

x-fmt/62 - Log File - log
x-fmt/62 - Only With File Extensions
x-fmt/62 - Only Without File Extensions

x-fmt/92 - Adobe Photoshop - psd
x-fmt/92 - Only With File Extensions
x-fmt/92 - Only Without File Extensions

x-fmt/111 - Plain Text File - txt, log
x-fmt/111 - Only With File Extensions
x-fmt/111 - Only Without File Extensions

x-fmt/135 - Audio Interchange File Format - aiff
x-fmt/135 - Only With File Extensions
x-fmt/135 - Only Without File Extensions

Characterisation. Digital Preservation Planning: Principles, Examples and the Future with Planets. July 29 th, 2008

Characterisation. Digital Preservation Planning: Principles, Examples and the Future with Planets. July 29 th, 2008 Characterisation Digital Preservation Planning: Principles, Examples and the Future with Planets. July 29 th, 2008 Manfred Thaller Universität zu * Köln manfred.thaller@uni-koeln.de * University at, NOT

More information

Different File Types and their Use

Different File Types and their Use Different File Types and their Use.DOC (Microsoft Word Document) Text Files A DOC file is a Word processing document created by Microsoft Word, a word processor included with all versions of Microsoft

More information

RECOMMENDED FILE FORMATS

RECOMMENDED FILE FORMATS Research and Enterprise Services RECOMMENDED FILE FORMATS University of Reading Research Data Archive Contents Introduction: file format categories... 1 Overview: formats for preservation and use... 1

More information

MEDIA RELATED FILE TYPES

MEDIA RELATED FILE TYPES MEDIA RELATED FILE TYPES Data Everything on your computer is a form of data or information and is ultimately reduced to a binary language of ones and zeros. If all data stayed as ones and zeros the information

More information

Importance of cultural heritage:

Importance of cultural heritage: Cultural heritage: Consists of tangible and intangible, natural and cultural, movable and immovable assets inherited from the past. Extremely valuable for the present and the future of communities. Access,

More information

Where to store research data during and after a project. Dr. Chris Emmerson Research Data Manager

Where to store research data during and after a project. Dr. Chris Emmerson Research Data Manager Where to store research data during and after a project Dr. Chris Emmerson Research Data Manager Welcome Research Data Service Data Lifecycle Data Storage Questions 1 Research Data Service 2 Research Data

More information

QLIKVIEW ARCHITECTURAL OVERVIEW

QLIKVIEW ARCHITECTURAL OVERVIEW QLIKVIEW ARCHITECTURAL OVERVIEW A QlikView Technology White Paper Published: October, 2010 qlikview.com Table of Contents Making Sense of the QlikView Platform 3 Most BI Software Is Built on Old Technology

More information

HTM, HTML, MHT, MHTML Web document Brightspace Learning Environment strips the <title> tag and text within the tag from user created web documents

HTM, HTML, MHT, MHTML Web document Brightspace Learning Environment strips the <title> tag and text within the tag from user created web documents Dropbox basics What is Dropbox? Learners use the tool to upload and submit assignment submissions to assignment submission folders in Brightspace Learning Environment, eliminating the need to mail, fax,

More information

Preserving PDF at the coalface

Preserving PDF at the coalface Preserving PDF at the coalface PDF/A at the Archaeology Data Service Tim Evans 15-07-2015 Introduction The Archaeology Data Service: Established in 1996 Based within the Department of Archaeology, University

More information

Basics in good research data management (RDM) for reviewing DMPs

Basics in good research data management (RDM) for reviewing DMPs Basics in good research data management (RDM) for reviewing DMPs S. Venkataraman Digital Curation Centre, Edinburgh s.venkataraman@ed.ac.uk https://doi.org/10.5281/zenodo.1461601 FOSTER & OpenAIRE webinar,

More information

SIP AIP AIP DIP. Preservation Planning. Data Management. Ingest. Access. Archival Storage. Administration MANAGEMENT P R O D U O N S U M E R E R 4-1.

SIP AIP AIP DIP. Preservation Planning. Data Management. Ingest. Access. Archival Storage. Administration MANAGEMENT P R O D U O N S U M E R E R 4-1. Performance Study of Digital Object Format Identification & Validation Tools Quyen Nguyen ERA Systems Engineering National Archives & Records Administration Agenda Background Format Identification Tools

More information

ExtremeTech Technology News - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

ExtremeTech Technology News - FTP Site Statistics. Top 20 Directories Sorted by Disk Space ExtremeTech Technology News - FTP Site Statistics Property Value FTP Server ftp.extremetech.com Description ExtremeTech Technology News Country United States Scan Date 14/Oct/2014 Total Dirs 281 Total

More information

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION - QUALITATIVE

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION - QUALITATIVE PROCESSING AND CATALOGUING DATA AND DOCUMENTATION - QUALITATIVE....... INGEST SERVICES UNIVERSITY OF ESSEX... HOW TO SET UP A DATA SERVICE, 8-9 NOVEMBER 2012 PRE - PROCESSING Liaising with depositor: consent

More information

File Upload extension User Manual

File Upload extension User Manual extension User Manual Magento & Download extension allows admin to upload product attachments for users in order to provide additional information for products. Table of Content 1. Extension Installation

More information

National Aeronautics and Space Admin. - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

National Aeronautics and Space Admin. - FTP Site Statistics. Top 20 Directories Sorted by Disk Space National Aeronautics and Space Admin. - FTP Site Statistics Property Value FTP Server ftp.hq.nasa.gov Description National Aeronautics and Space Admin. Country United States Scan Date 26/Apr/2014 Total

More information

The IDN Variant TLD Program: Updated Program Plan 23 August 2012

The IDN Variant TLD Program: Updated Program Plan 23 August 2012 The IDN Variant TLD Program: Updated Program Plan 23 August 2012 Table of Contents Project Background... 2 The IDN Variant TLD Program... 2 Revised Program Plan, Projects and Timeline:... 3 Communication

More information

AVS4YOU Programs Help

AVS4YOU Programs Help AVS4YOU Help - AVS Document Converter AVS4YOU Programs Help AVS Document Converter www.avs4you.com Online Media Technologies, Ltd., UK. 2004-2012 All rights reserved AVS4YOU Programs Help Page 2 of 39

More information

DOWNLOAD OR READ : CONVERTING WORD DOCUMENT TO FORM PDF EBOOK EPUB MOBI

DOWNLOAD OR READ : CONVERTING WORD DOCUMENT TO FORM PDF EBOOK EPUB MOBI DOWNLOAD OR READ : CONVERTING WORD DOCUMENT TO FORM PDF EBOOK EPUB MOBI Page 1 Page 2 converting word document to form converting word document to pdf converting word document to form How Do I improve

More information

Challenges and Successes: Running the Rosetta Format Library

Challenges and Successes: Running the Rosetta Format Library Challenges and Successes: Running the Rosetta Format Library Peter McKinney and Jan Hutař on behalf of Format Library Working Group Rosetta User Group meeting, Leuven, June 2015 Background (1) The Format

More information

Atari Games - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

Atari Games - FTP Site Statistics. Top 20 Directories Sorted by Disk Space Property Value FTP Server ftp.infogrames.net Description Atari Games Country United States Scan Date 02/Apr/2015 Total Dirs 488 Total Files 1,547 Total Data 26.66 GB Top 20 Directories Sorted by Disk Space

More information

Supported File Types

Supported File Types Supported File Types This document will give the user an overview of the types of files supported by the most current version of LEP. It will cover what files LEP can support, as well as files types converted

More information

WakeSpace Digital Archive Policies

WakeSpace Digital Archive Policies WakeSpace Digital Archive Policies Table of Contents 1. Community Policy... 1 2. Content Policy... 1 3. Withdrawal Policy... 2 4. WakeSpace Format Support... 2 5. Privacy Policy... 8 1. Community Policy

More information

Sustainable File Formats for Electronic Records A Guide for Government Agencies

Sustainable File Formats for Electronic Records A Guide for Government Agencies Sustainable File Formats for Electronic Records A Guide for Government Agencies Electronic records are produced and kept in a wide variety of file formats, often dictated by the type of software used to

More information

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION: QUALITATIVE

PROCESSING AND CATALOGUING DATA AND DOCUMENTATION: QUALITATIVE PROCESSING AND CATALOGUING DATA AND DOCUMENTATION: QUALITATIVE.... LIBBY BISHOP... INGEST SERVICES UNIVERSITY OF ESSEX... HOW TO SET UP A DATA SERVICE, 3 4 JULY 2013 PRE - PROCESSING Liaising with depositor:

More information

GUIDELINES FOR CREATION AND PRESERVATION OF DIGITAL FILES

GUIDELINES FOR CREATION AND PRESERVATION OF DIGITAL FILES GUIDELINES FOR CREATION AND PRESERVATION OF DIGITAL FILES October 2018 INTRODUCTION This document provides guidelines for the creation and preservation of digital files. They pertain to both born-digital

More information

Uploading a File in the Desire2Learn Content Area

Uploading a File in the Desire2Learn Content Area Uploading a File in the Desire2Learn Content Area Login to D2L and open one of your courses. Click the Content button in the course toolbar to access the Content area. Locate the Table of Contents on the

More information

Advanced High Graphics

Advanced High Graphics VISUAL MEDIA FILE TYPES JPG/JPEG: (Joint photographic expert group) The JPEG is one of the most common raster file formats. It s a format often used by digital cameras as it was designed primarily for

More information

Funcom Multiplayer Online Games - FTP Site Statistics. Top 20 Directories Sorted by Disk Space

Funcom Multiplayer Online Games - FTP Site Statistics. Top 20 Directories Sorted by Disk Space Property Value FTP Server ftp.funcom.com Description Funcom Multiplayer Online Games Country United States Scan Date 13/Jul/2014 Total Dirs 186 Total Files 1,556 Total Data 67.25 GB Top 20 Directories

More information

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment

Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Archives in a Networked Information Society: The Problem of Sustainability in the Digital Information Environment Shigeo Sugimoto Research Center for Knowledge Communities Graduate School of Library, Information

More information

New Features. Importing Resources

New Features. Importing Resources CyberLink StreamAuthor 4 is a powerful tool for creating compelling media-rich presentations using video, audio, PowerPoint slides, and other supplementary documents. It allows users to capture live videos

More information

Workshop Background. Purpose. Context. To provide you with resources and tools to help you know how to handle file format decisions as a researcher.

Workshop Background. Purpose. Context. To provide you with resources and tools to help you know how to handle file format decisions as a researcher. Workshop Background Purpose To provide you with resources and tools to help you know how to handle file format decisions as a researcher. Context Workshop Series: Preservation and Curation of ETD Research

More information

ABBYY FineReader 14 YOUR DOCUMENTS IN ACTION

ABBYY FineReader 14 YOUR DOCUMENTS IN ACTION YOUR DOCUMENTS IN ACTION Combining powerful OCR with essential PDF capabilities, FineReader provides a single solution for working with PDFs and scanned paper documents. Content Your Single Solution for

More information

Administration Guide. BlackBerry Workspaces. Version 5.6

Administration Guide. BlackBerry Workspaces. Version 5.6 Administration Guide BlackBerry Workspaces Version 5.6 Published: 2017-06-21 SWD-20170621110833084 Contents Introducing the BlackBerry Workspaces administration console... 8 Configuring and managing BlackBerry

More information

Developing an Electronic Records Preservation Strategy

Developing an Electronic Records Preservation Strategy Version 7 Developing an Electronic Records Preservation Strategy 1. For whom is this guidance intended? 1.1 This document is intended for all business units at the University of Edinburgh and in particular

More information

Website Overview. Your Disclaimer Here. 1 Website Overview

Website Overview. Your Disclaimer Here. 1 Website Overview This training guide will provide an overview of the Client Website. The Client Website is a Personal Financial Website that will provide you with a consolidated view of your financial information. There

More information

Open Preservation Foundation and The Preservation Action Registry. Martin Wrigley, Executive Director, OPF

Open Preservation Foundation and The Preservation Action Registry. Martin Wrigley, Executive Director, OPF Open Preservation Foundation and The Preservation Action Registry Martin Wrigley, Executive Director, OPF Martin Wrigley 30+ years experience delivering software and solutions -mostly in Mobile Telecoms

More information

Computing: Digital Media Elements for Applications (SCQF level 5)

Computing: Digital Media Elements for Applications (SCQF level 5) National Unit Specification General information Unit code: F1KS 11 Superclass: CB Publication date: November 2013 Source: Scottish Qualifications Authority Version: 02 Unit purpose This Unit is designed

More information

Technical What s New. Autodesk Vault Manufacturing 2010

Technical What s New. Autodesk Vault Manufacturing 2010 Autodesk Vault Manufacturing 2010 Contents Welcome to Autodesk Vault Manufacturing 2010... 2 Vault Client Enhancements... 2 Autoloader Enhancements... 2 User Interface Update... 3 DWF Publish Options User

More information

DOWNLOAD OR READ : WHEN YOU ARE CONVERTED PDF EBOOK EPUB MOBI

DOWNLOAD OR READ : WHEN YOU ARE CONVERTED PDF EBOOK EPUB MOBI DOWNLOAD OR READ : WHEN YOU ARE CONVERTED PDF EBOOK EPUB MOBI Page 1 Page 2 when you are converted when you are converted pdf when you are converted JPG to PDF Free Online Converter Our JPG to PDF converter

More information

Invitation to Tender Content Management System Upgrade

Invitation to Tender Content Management System Upgrade Invitation to Tender Content Management System Upgrade The IFRS Foundation (Foundation) is investigating the possibility of upgrading the Content Management System (CMS) it currently uses to support its

More information

Introduction to Content

Introduction to Content Content Introduction to Content... 2 Understanding the Organization of Content... 3 Course Overview... 3 Bookmarks... 3 Upcoming Events... 3 Table of Contents... 3 Create a New Module... 4 New Module...

More information

Digital Preservation DMFUG 2017

Digital Preservation DMFUG 2017 Digital Preservation DMFUG 2017 1 The need, the goal, a tutorial In 2000, the University of California, Berkeley estimated that 93% of the world's yearly intellectual output is produced in digital form

More information

PEERNET File Conversion Center

PEERNET File Conversion Center PEERNET File Conversion Center Automated Document Conversion Using File Conversion Center With Task Scheduler OVERVIEW The sample is divided into two sections: The following sample uses a batch file and

More information

SciVee Conferences AUTHOR GUIDE

SciVee Conferences AUTHOR GUIDE SciVee Conferences AUTHOR GUIDE 1 TABLE OF CONTENTS 1. ABOUT THIS DOCUMENT... 3 INTENDED READERSHIP... 3 FREQUENTLY USED TERMS... 3 2. SYSTEM REQUIREMENTS, PUBLISHING AND PERMISSIONS... 3 SYSTEM REQUIREMENTS...

More information

A Standards-Based Registry/Repository Using UK MOD Requirements as a Basis. Version 0.3 (draft) Paul Spencer and others

A Standards-Based Registry/Repository Using UK MOD Requirements as a Basis. Version 0.3 (draft) Paul Spencer and others A Standards-Based Registry/Repository Using UK MOD Requirements as a Basis Version 0.3 (draft) Paul Spencer and others CONTENTS 1 Introduction... 3 1.1 Some Terminology... 3 2 Current Situation (Paul)...4

More information

CollegiateLink Student Leader User Guide

CollegiateLink Student Leader User Guide CollegiateLink 2011 Last updated February 2011 0 Table of Contents Getting Started... 2 Managing Your Organization s Site... 3 Managing Your Organization s Interests... 5 Managing Your Organization s Roster...

More information

Example 1: Denary = 1. Answer: Binary = (1 * 1) = 1. Example 2: Denary = 3. Answer: Binary = (1 * 1) + (2 * 1) = 3

Example 1: Denary = 1. Answer: Binary = (1 * 1) = 1. Example 2: Denary = 3. Answer: Binary = (1 * 1) + (2 * 1) = 3 1.1.1 Binary systems In mathematics and digital electronics, a binary number is a number expressed in the binary numeral system, or base-2 numeral system, which represents numeric values using two different

More information

Introducing PDF/UA. The new International Standard for Accessible PDF Technology. Solving PDF Accessibility Problems

Introducing PDF/UA. The new International Standard for Accessible PDF Technology. Solving PDF Accessibility Problems Introducing PDF/UA The new International Standard for Accessible PDF Technology Solving PDF Accessibility Problems Introducing PDF/UA Agenda Why PDF What is PDF What is PDF/UA PDF/UA & WCAG 2.0 CommonLook

More information

DIGITAL RECORDS MANAGEMENT GUIDELINES

DIGITAL RECORDS MANAGEMENT GUIDELINES DIGITAL RECORDS MANAGEMENT GUIDELINES This Digital Records Management Guidelines document will primarily address the following types of digital records: Email Media Born Digital Records Scanned Records

More information

4. TECHNOLOGICAL DECISIONS

4. TECHNOLOGICAL DECISIONS 35 4. TECHNOLOGICAL DECISIONS 4.1 What is involved in preserving digital resources? Preservation is concerned with ensuring the longevity of a digital resource through changing technological regimes with

More information

Elementary Computing CSC 100. M. Cheng, Computer Science

Elementary Computing CSC 100. M. Cheng, Computer Science Elementary Computing CSC 100 1 Graphics & Media Scalable Outline & Bit- mapped Fonts Binary Number Representation & Text Pixels, Colors and Resolution Sound & Digital Audio Film & Digital Video Data Compression

More information

Six Sigma in the datacenter drives a zero-defects culture

Six Sigma in the datacenter drives a zero-defects culture Six Sigma in the datacenter drives a zero-defects culture Situation Like many IT organizations, Microsoft IT wants to keep its global infrastructure available at all times. Scope, scale, and an environment

More information

Response to the. ESMA Consultation Paper:

Response to the. ESMA Consultation Paper: Response to the ESMA Consultation Paper: Draft technical standards on access to data and aggregation and comparison of data across TR under Article 81 of EMIR Delivered to ESMA by Tahoe Blue Ltd January

More information

PRODUCT SHEET. LookAt Technologies LTD

PRODUCT SHEET. LookAt Technologies LTD PRODUCT SHEET LookAt Technologies LTD WWW.LOOKAT.IO TABLE OF CONTENTS 1. OVERVIEW... 4 2. SYSTEM REQUIREMENTS... 5 OPERATING SYSTEM... 5 WEB BROWSERS... 5 LOCALIZATION... 5 3. FILES... 5 FILE TYPE SUPPORT...

More information

Desktop DNA r11.1. PC DNA Management Challenges

Desktop DNA r11.1. PC DNA Management Challenges Data Sheet Unicenter Desktop DNA r11.1 Unicenter Desktop DNA is a scalable migration solution for the management, movement and maintenance of a PC s DNA (including user settings, preferences and data).

More information

Client Website Overview Guide

Client Website Overview Guide This training guide will provide an overview of the Client Website. The Client Website is a Personal Financial Website that will provide you with a consolidated view of your financial information. There

More information

strategy IT Str a 2020 tegy

strategy IT Str a 2020 tegy strategy IT Strategy 2017-2020 Great things happen when the world agrees ISOʼs mission is to bring together experts through its Members to share knowledge and to develop voluntary, consensus-based, market-relevant

More information

to PDF. For Outlook Export s & attachments to PDF. Bahrur Rahman AssistMyTeam

to PDF. For Outlook Export  s & attachments to PDF. Bahrur Rahman AssistMyTeam V9 Email to PDF For Outlook Export emails & attachments to PDF Bahrur Rahman AssistMyTeam Welcome to Email to PDF for Outlook- A fast, light-weight add-in for Microsoft Outlook that makes it easy and effortless

More information

QA Hub User Guide. IM11 V001 dated

QA Hub User Guide. IM11 V001 dated QA Hub User Guide 1 QA HUB USER GUIDE What is the Quality Assurance (QA) Hub? The QA Hub is a web-based data management tool, which has replaced Elmhurst s original Monitoring system. The QA Hub will maintain

More information

Document Management Release Notes
