EPIC: Using Plato
Explorations in using Plato
Anna Collins, Barbara Bültmann, David Piper, Elin Stangeland
University of Cambridge


These materials are released under Creative Commons licence 2.0 BY-NC-SA: By Attribution, Non-Commercial, Share-Alike

This report forms part of the EPIC project, which was funded from February 2011 to August 2011 as part of JISC's Information Environment Programme, under the Preservation strand, with oversight from the JISC Programme Manager, Neil Grindley.

EPIC Using Plato, Version 1.2

Contents
Introduction
The EPIC Project
EPIC and DSpace@Cambridge
An overview of Plato
Level of Plato knowledge required
Requirements of deposited items
Define Requirements
Define Basis
Defining the Policy Tree
Defining Sample Records
Identifying sample records
Identify Requirements
Defining the objective tree
Evaluate Alternatives
Define Alternatives
Migrating PDF to PDF/A
Migrating Word to PDF/A
Go/No Go
Develop Experiments
Run Experiments
Uploading results files
Evaluate Experiments
Evaluating each tool for each file
Analyse Results
Transform Measured Values
Assigning transformation values to measurements
Set Importance Factors
Weighting the attributes
Analyse Results
Deciding on the best tool
Adjustments to the experiment
Using automated evaluation
Conclusions on using Plato to assess text documents
Evaluating the use of XCDL in Plato to compare image files automatically
Suggestions for developments
Creating and evaluating the decision trees
Experiments
Transforming values
Assigning weights
Assessing outcomes

Assessing the relative importance of different stages
Producing output
Reflections on using Plato
The utility of Plato for non-experts
Setting up a suitable objective tree
Choosing the sample set
Format specific migrations
Carrying out the transformation procedure
Assigning suitable weights
Producing a Preservation Plan
Bugs
Automated evaluation tools
Displaying numerical values
Importing mind maps from newer versions of Freemind
Overall Evaluation of Plato
References
Appendices
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Appendix F

Cover picture adapted from: 1928 (Corner of the inner Royal Commonwealth Society Library, with globes. Miss Winifred Hill, cataloguer, later second wife of P. Evans Lewin, Librarian from 1910 to 1946, and (standing) Miss Oppenheim. Reference no. 5/18/628). RCS-VA-004, Royal Commonwealth Society Library, Cambridge University Library, University of Cambridge. Cambridge University Library.

Introduction

The EPIC Project

EPIC (Evaluating Plato in Cambridge) was a JISC-funded project at the Cambridge University Library, which ran from February to August 2011. It investigated ways of improving the digital preservation services currently provided with DSpace@Cambridge, the institutional repository for the University of Cambridge. A key activity of the project was to explore the feasibility of using Plato and the associated PLANETS Suite [1] for preservation planning and appropriate preservation activities.

One aspect of the project involved working with people who had deposited items within the repository in order to gain an understanding of the properties or characteristics that were regarded as significant in designated communities; this would inform our testing of the tools within Plato. This also involved investigating how best to capture this information from users who are unfamiliar with and uninterested in the formal discourse of the digital preservation community.

This report describes our exploratory work into the practical application of Plato and the PLANETS tools within the context of our repository environment. While there is considerable expertise in preservation activities within the University, and the University Library in particular, we were also interested in evaluating the utility of Plato for people with limited experience in digital preservation (such as a typical repository team member), to assess its suitability for use by non-experts.

EPIC and DSpace@Cambridge

The EPIC project involved gathering an overview of the content of DSpace@Cambridge in order to identify collections or individual items that may be at risk of obsolescence; this is reported in greater detail elsewhere [2]. The principal focus was on text-based documents; the analysis of file types within the collection is summarised in Table 1.

File extension            Number of items
.txt                      [value missing]
.pdf                      3338
.doc                      175
.rtf                      35
.sxw                      26
.bbl/.latex/.sty/.tex     5
.ps                       4
.sxi                      2
.bbl/.tex                 1
.docx                     1
.odp                      1

Table 1. Summary of the numbers of items of different file types for textual files within DSpace@Cambridge. Whilst the majority of text files were in .txt format, they were not considered an 'at risk' format.

[1] See [URL missing] and [URL missing].
[2] See the project final report.


Thus, one of the most common file formats in the repository is PDF. However, this includes different types of PDF, and not necessarily PDF/A, which is the preferred format for archiving. Therefore, for the long-term preservation of these files, we are interested in tools which allow us to produce files in PDF/A format from PDF and .doc formats, and so we decided to evaluate Plato in this context. Overall, we did not identify any collections where there was an immediate risk associated with obsolescence, and so we were able to explore using Plato without an urgent need for an appropriate preservation action.

An overview of Plato

Plato is a preservation planning tool which was developed as part of the PLANETS suite in a 4-year project co-funded by the European Union to address core digital preservation challenges. Following the end of that funding, it is now maintained and developed by the Open Planets Foundation (OPF). The programme is described as:

"a decision support tool that implements a solid preservation planning process and integrates services for content characterisation, preservation action and automatic object comparison in a service-oriented architecture to provide maximum support for preservation planning endeavours." [3]

The programme guides the user through the sequential process of producing a comprehensive preservation plan for a collection. There are numerous, iterative stages in the preservation planning process, with the knowledge base informing (and being informed by) each of them. Figure 1 summarises the Plato Workflow.

Figure 1. Diagram summarising the Plato Workflow [4].

[3] See [URL missing].
[4] From the Plato home page.


Familiarity and experience in the field of preservation planning are likely to give more robust recommendations; similarly, it is possible to increase one's knowledge base at each stage. Some of the stages identified in Figure 1 are subdivided within the programme; this is shown in Figure 2.

Figure 2. Diagram showing the different steps in the Plato workflow, and the associated input and output information.

Thus, for each principal stage within the Plato workflow, several intermediate steps are involved. These will be described in greater detail as the process is worked through.

Level of Plato knowledge required

This investigation into Plato was undertaken by a novice in using the tool, with little practical experience in digital preservation. Thus, very little prior knowledge is assumed. However, we recommend that anyone unfamiliar with Plato reads this report with an open web-based version of Plato running, as this will help illustrate the situations being described.

Requirements of deposited items

An important aspect of the EPIC project was to determine what attributes of the deposited item needed to be preserved in a migrated document. This was done through a series of interviews with researchers who had contributed items to the DSpace@Cambridge repository. The interview process and the results are discussed more fully elsewhere [5]; however, the results have a direct impact on our preservation planning process and so some key aspects will be described here.

[5] See for example: [URL missing]

Eleven researchers were interviewed, representing a range of academic disciplines at the University of Cambridge. We used sample objective trees provided by Becker et al. (2007) from other project plans to draw up a list of attributes which were likely to be important when considering the preservation of a textual document. For each of these attributes, the interviewees were asked to say how important it was that the attribute remained unchanged. These were assessed on a scale from 0, indicating that no change was acceptable, to 5, indicating that all changes would be acceptable. These values were averaged across all disciplines. In cases where an attribute was considered 'not applicable' by a researcher, that response was disregarded when producing averages. [6]

Additionally, the interviewer made a subjective, numerical assessment of the relative importance of those attributes; the larger the value assigned, the more important the attribute. The sum of the values for the attributes for each interviewee was 1.

Define Requirements

Define Basis

This stage involves some documentation on the collection being considered by the plan (such as file types, any mandates relating to this collection and the trigger for producing the preservation plan) and the inclusion of a policy tree, where the strategy, policy, goals and constraints of the holding institution are incorporated into the overall preservation plan.

Defining the Policy Tree

Defining the policy tree was a useful exercise. While DSpace@Cambridge has a preservation policy [7] and has previously considered aspects of preservation policy in order to deal with day-to-day management issues, this was the first time that we produced a detailed systematic review of the policies that we would want to apply to a specific file format. We used the FreeMind mindmap template on the Plato site [8] as the starting point and removed the items that were unnecessary. The template provided us with a comprehensive list of requirements and we could not think of any other policy requirements that we needed to add.

Given our lack of experience in producing a formalised preservation action plan, we sometimes made abstract decisions based on ideal solutions; with further experience these decisions will be better informed. Perhaps because of this lack of experience, we struggled with making absolute Yes/No choices in a number of areas. For example: 'Preservation action tool shall be open source Y/N'. This would be a Yes if an open source tool exists that meets all requirements, but if a commercial alternative provided a significantly better solution then we would want to look at the cost benefit. The policy tree we used can be found in Appendix A.

[6] Later, these values were converted so that 0 indicated that all changes were acceptable and 5 that no change was acceptable. This was to align these values with the transformation process, where 5 indicated a good result and lower values related to progressively worse outcomes. The table in Appendix C reflects these converted values.
[7] This is available at [URL missing]
[8] This can be downloaded from [URL missing]
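The averaging of interview scores and the later conversion of the scale (footnote 6) can be sketched as follows. This is a minimal illustration rather than the procedure the project actually ran: the function names and the sample responses are hypothetical, and `None` is assumed to stand for a 'not applicable' response.

```python
from statistics import mean

def mean_acceptability(scores):
    """Average the interview scores, ignoring 'not applicable' (None) responses."""
    applicable = [s for s in scores if s is not None]
    if not applicable:
        return None  # nobody considered the attribute applicable
    return mean(applicable)

def convert_scale(score, top=5):
    """Flip the 0-5 scale so that 5 means 'no change acceptable'."""
    return top - score

# Hypothetical responses for one attribute; one researcher answered 'not applicable'.
responses = [0, 1, None, 2, 1]
average = mean_acceptability(responses)
converted = convert_scale(average)
```

Because the conversion is a simple reflection of the scale, converting the average gives the same result as averaging the converted scores.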

Defining Sample Records

The principal activity at this stage is to define a sample set of files from the collection which represent the range of characteristics within that whole collection. The recommended sample size is 3-10 files. These files are then uploaded into Plato. Once a file has been uploaded, its file type is identified automatically using JHOVE [9] and FITS [10]; if these tools suggest more than one file type, the user should confirm the choice of file type. It is at this point that XCDL descriptions of the files can be produced, if these are required by the user. XCDL is an Extensible Characterisation Definition Language written as part of the development of Plato (there is an associated Extensible Characterisation Extraction Language, XCEL), and it aims to allow for comparison of files.

Identifying sample records

While we ultimately intended to run a sample of the recommended size, for the initial assessment we decided to use only one document, chosen at random. We used only one file in order to speed up the evaluation process and to gain an initial understanding of the programme.

Identify Requirements

The key process here is setting up an appropriate objective tree. This needs to contain all the attributes of the migrated document that need to be compared to the original (e.g. appearance, coherence); other elements of the action path should also be considered at this point. [11] While this issue will be discussed more fully later (see Creating and evaluating the decision trees), it is worth emphasising here the importance of distinguishing between which factors are important and how important they are. Our experience suggests that, when in doubt, it is better to ascribe Ordinal criteria rather than Boolean criteria for attributes. However, if meaningful numerical criteria can be applied then this may be preferable to Ordinal assessments, which may involve a degree of subjectivity. The Transforming stage allows criteria to be applied more or less strictly, depending on context.

Defining the objective tree

The objective tree we used at this stage is given in Appendix B. It is based on the attributes discussed in the interviews with DSpace@Cambridge users, and is therefore indirectly based on the objective tree used by Becker et al. (2007). We originally used a 0-5 integer range as assessment criteria for most attributes, with 0 indicating that no change was acceptable, and 5 indicating that changes could be made; this was chosen to allow us to align the values we assigned at this stage with the important attributes identified from our interviews with users. However, we soon amended this to an ordinal scheme, to give increased clarity on the range being assessed. [12]

For most values, we used the options 'Identical/small changes/large changes/garbage'. The value 'garbage' was chosen to indicate succinctly that there was such a severe change that the output was entirely meaningless, whereas a large change would indicate that the migrated text could be associated with the original text but with severe loss of clarity, information, etc.

[9] JHOVE (JSTOR/Harvard Object Validation Environment) is a Java application which performs format-specific identification, validation, and characterization of digital objects.
[10] FITS (File Information Tool Set) identifies, validates, and extracts technical metadata for various file formats, utilising several third-party open source tools, including JHOVE.
[11] As we progressed through the planning process, we recognised that we had not defined all necessary objectives, and that some criteria should be refined.
[12] Later in the process we also found the transformations from an ordinal scale to target values more intuitive.

For Behaviour: URLs behaviour, the ordinal values chosen were:
- 'work', indicating that the link still functioned;
- 'text', indicating that the text of the URL was correct, but that it did not work as a hyperlink;
- 'doesn't work', indicating that the text was incorrect and/or the hyperlink did not work.

Note that a hyperlink did not have to point to a valid website, only to link correctly to the site specified. The same range was applied to Behaviour: Jump references.

For Structure: Figure: Size figure, the ordinal values were 'identical/different/distorted'. 'Different' indicates that the figure has changed size, becoming either bigger or smaller, but that the aspect ratio is retained (either change could affect the clarity of the figure, or have a knock-on effect on the overall document structure). 'Distorted' indicates a change in aspect ratio, and may also include a change in size (depending on the measurement system, e.g. number of pixels).

The following attributes were assessed on Boolean (Yes/No) criteria:
- Content: Figure Content: Open specification Figure
- Content: Figure Content: International standard Figure
- Behaviour: Script Blocking
- Behaviour: Deactivation of security mechanism
- Behaviour: Content machine readable

The attribute Content: Figure Content: Market share Figure was defined on a Number Range (0-100) as it represented a percentage value. This was also defined as a single calculation (i.e. performed on only one object in the sample for each migration tool, when the sample size is greater than one), in order to save some time in the experimental evaluations.

Having identified the requirements of the preservation and defined the parameters by which each attribute would be assessed, we moved on to set up the experiments.
Evaluate Alternatives

Define Alternatives

At this stage, possible tools for migrating or emulating the original files are identified and added to a list; the appropriate action will be performed on each of these alternatives in subsequent stages of Evaluate Alternatives. Plato offers suggestions for appropriate tools that are integrated in the PLANETS suite, but tools that are available locally can also be chosen and described. At this point, it is sensible to give each tool a clear name for easy identification of the tool in later stages (see Experiments section).

Migrating PDF to PDF/A

Our hope was that we could use Plato and the PLANETS suite to convert PDF files to PDF/A. There was no tool available within the PLANETS toolset that would do that explicitly. We did find a tool for converting PDF to PDF, though no further information was available about the types of PDF file involved, and in particular whether a particular type of PDF file would be produced. In light of this, we decided to try an alternative preservation action. [13]

Migrating Word to PDF/A

Because we could not find the migration tools that we wanted for converting PDF specifically to PDF/A, we tried to migrate a Word file to PDF, as this is the other file format held in appreciable numbers within DSpace@Cambridge. We encountered a similar problem as with the PDF to PDF/A conversion: there were no tools integrated within Plato to convert a .doc file to .pdf. At this point, we decided to use tools available to us locally rather than tools that were available through the PLANETS toolbox. We defined two alternatives: the Word 'Save as PDF/A' option, which we chose because this tool is readily available [14], and Adobe Acrobat Pro conversion to PDF/A, as we expected a high-quality tool from Adobe, who provide an industry standard.

Go/No Go

At this point, it would be possible to discard some of the selected tools, giving reasons for their unsuitability, before using them to carry out a migration or emulation process. A 'Go' decision is necessary to proceed with the planning process, but there is a drop-down menu offering alternative decisions (e.g. 'Deferred Go'). The reasons for the decision must be provided, as well as a description of the actions which will occur.

Develop Experiments

Further detail about the experiments can be given at this stage.

Run Experiments

At this stage, tools which are integrated in Plato are used to perform the experiments. Alternatively, if the tools are not included within Plato, the results of using these tools on the sample are uploaded one by one. Once these results files have been produced or uploaded, it may be possible to describe them in XCDL.

Uploading results files

In this case, rather than run the experiments, we uploaded files that we generated externally. However, we were not able to produce XCDL descriptions of these files; no error message, or other indication of a problem, was produced, but the file generated when we attempted to produce the XCDL description was 0 bytes, with no content. We do not know why the XCDL description failed in these cases; it may be that XCDL descriptions for these types of files are still in development, but we could not find any clear reference to this. Zeierau et al. (2008) commented that "the XC*L languages are still under development and only operate on limited formats", but we have not been able to locate a list of the formats on which XCDL operates.

A cursory visual inspection of the two files at this stage did not show any obvious differences. A line-by-line inspection would be required, which is a time-consuming process to do thoroughly by eye. We realised that any automated tools, if available, would have the potential to identify differences between the two files more efficiently and objectively.

[13] This required starting the preservation planning approach again, but the process used was the same as has been described; we also decided that it was unlikely that there would need to be significant changes to the policy tree and objective tree, and used those as previously defined. Note that in a 'real' preservation plan, these processes would need to be re-evaluated. However, as we were not confident (due to our lack of experience in preservation planning in this way) that all aspects had been covered in sufficient detail, we felt that it would be more profitable to continue with the exercise and re-evaluate our decisions when we understood the process better. We did need to choose an alternative sample file and selected an expedition report because it contained a lot of pictures and figures.
[14] If we found that the Word-generated PDF/A produced suitable output, it may then be appropriate to recommend this as a tool for researchers and contributors to produce their own PDF/A files in the future.
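Since the empty XCDL descriptions mentioned above were produced without any error message, a simple local check on generated files may help to catch such silent failures early. The following is a minimal sketch under stated assumptions: the helper name `empty_outputs` and the file names are hypothetical, and it only inspects files on the local file system, not anything inside Plato.

```python
from pathlib import Path

def empty_outputs(paths):
    """Return the paths that are missing or are 0 bytes long."""
    problems = []
    for path in map(Path, paths):
        if not path.exists() or path.stat().st_size == 0:
            problems.append(path)
    return problems

# Hypothetical usage: file names are placeholders for locally saved descriptions.
for path in empty_outputs(["original.xcdl", "word_pdfa.xcdl", "acrobat_pdfa.xcdl"]):
    print(f"No usable description: {path}")
```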

One difference that became apparent at this stage was with the file sizes; the PDF file produced using Word 2010 was smaller than the original document, but the Acrobat file was larger than the original document. At this point our objective tree was focussed on the requirements of the researchers, and the file size of the migrated file was not a criterion. This was an effective reminder that the objective tree must include the requirements of the curators as well as those of the community it serves; we could use this information to inform subsequent plans.

Evaluate Experiments

At this stage, for each attribute in the objective tree, an assessment of how each tool has performed is made, using the scale described in the objective tree (e.g. Ordinal, Numeric, Boolean). This requires comparison of the output with the original value; for certain attributes this can be done automatically as part of Plato. Plato has an option for making an initial assessment of suggested values, but these need to be confirmed by the user or the process cannot proceed. Values can also be entered manually.

Evaluating each tool for each file

It became clear very quickly that this is an incredibly time-consuming process, but one that it is important to do well in order to produce a useful preservation plan. As we had used a lot of ordinal scales, we sometimes found our assessments of 'small changes' versus 'large changes' to be rather subjective. It is also possible to give a reason for each assessment, and this was useful in such cases (particularly if we felt that changes may be needed in the objective tree, for example considering a change from Boolean to Ordinal ranges). Both the number of tools being tested and the sample size have a considerable effect on the amount of time required to complete this stage. For the purposes of this initial trial of Plato, we assigned values to each tool randomly.

We found that expanding the objective tree fully at the start of this stage, and then saving at regular intervals, was very helpful to avoid loss of input information.

Analyse Results

Transform Measured Values

The transformation process involves mapping the assessments made on each attribute in the evaluation stage to a target value, where 5 indicates the best performance, and 0 indicates a 'knockout blow' (a tool which scores 0 will not be recommended). The results from the different tools over all the sample files are combined (aggregated) in one of two possible ways. Worst case aggregation uses the lowest score obtained for a tool for a given attribute. Arithmetic mean aggregation uses the average (mean) score obtained by the tool for that attribute. For attributes assessed by an ordinal scale, each value on the scale is assigned a value, as in Figure 3.
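The transformation and the two aggregation modes described above can be sketched in a few lines. This is a minimal illustration, not Plato's implementation: the target values in `TRANSFORM` and the sample assessments are hypothetical, chosen only to show how a single knock-out value affects each mode.

```python
from statistics import mean

# Hypothetical mapping from ordinal assessments to target values (0-5).
TRANSFORM = {"identical": 5, "small changes": 4, "large changes": 2, "garbage": 0}

def transform(assessments, mapping=TRANSFORM):
    """Map each ordinal assessment to its target value."""
    return [mapping[a] for a in assessments]

def aggregate(values, mode):
    """Combine one tool's target values for one attribute across the sample files."""
    if mode == "worst case":
        return min(values)
    if mode == "arithmetic mean":
        return mean(values)
    raise ValueError(f"unknown aggregation mode: {mode}")

# One tool, one attribute, three hypothetical sample files.
values = transform(["identical", "small changes", "garbage"])
worst = aggregate(values, "worst case")
average = aggregate(values, "arithmetic mean")
```

With these hypothetical values, worst case aggregation gives the knock-out value 0, whereas the arithmetic mean is 3 and hides the single failure.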


Figure 3. Screenshot of the mapping of an ordinal assessment to the target value.

For attributes which are assessed by numerical values, the assigned value is given for a corresponding target value, as shown in Figure 4.

Figure 4. Screenshot of the mapping of numerical values to target values.

Assigning transformation values to measurements

The help notes in Plato indicate that "Experience has shown that a scale with the resolution of discrete values 0-5 with 0 being an unacceptable value and 5 the best possible result works very well." In order to relate the assessment that no change was acceptable to the comparison tools in Plato, an interview score of 0 was equated with a score of 5 when transforming measurements, so that the best result is achieved when the migrated file is identical to the original in that respect.

At this stage we were able to make further use of our assessment of the requirements of the research community. The importance of some file attributes was strongly dependent on the discipline of the researcher, and thus the features of the document. For example, preservation of equation structure is seen as vital for the Physical Sciences, but was not applicable to researchers from the Arts and Humanities, where font type and preservation of diacritics and alphabets may be crucial. This has a subsequent effect on the choice of files for a sample set: in order to be confident in a single migration tool, it would need to deal with all these issues (and others) effectively.

Initially, the overall requirements of researchers were aggregated, disregarding any disciplinary trends. However, the results from this requirements gathering suggest that, if no tool can be found which deals with all the requirements successfully, different migration tools may need to be considered in order to identify the most suitable one for an individual item.

As migration inevitably involves the loss of some information, it may be necessary to assess which characteristics of an item are the most important. While doing this for each individual item may not be practical, there may be sufficient common ground within certain communities and/or collections to allow some aggregation on that basis. Assessing the migration tools may require different transformation parameters or different weighting schemes, and the use of different objective trees may also need to be considered.

There were only two attributes (Appearance: Internal References and Appearance: URLs) for which none of the respondents said that no change was acceptable. This indicated that throughout the research community these values were considered less critical, and so these were the only criteria where the lowest transformed value was 1 rather than 0 (with 0 indicating a knock-out blow for a tool that fails to perform this function to the minimum standard required).

We also used the mean values of interviewees' assessments (see Appendix C; see also 'Requirements of deposited items' above). If an attribute had a mean value of 2.5 or less, the average aggregation mode was used. If an attribute had a mean value of greater than 2.5, the worst case aggregation mode was used. This is an arbitrary decision, and may seem at odds with the criteria established above. The intention is to ensure that, for those requirements which are considered more critical, migration tools that perform consistently well are promoted over ones which occasionally encounter significant problems, even if they generally perform as well or better. If several programs have similar high levels of success, the aggregation mode would be changed to make the selection more stringent, either by decreasing the value at which the worst case mode is used, or by using the worst case mode to aggregate the values for all criteria.
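The rule for choosing the aggregation mode can be written down compactly. The sketch below is illustrative only: the function name and the example mean values are hypothetical, and the means are assumed to be on the converted scale, where 5 indicates that no change is acceptable.

```python
def aggregation_mode(mean_importance, threshold=2.5):
    """Choose the aggregation mode from the mean interview value for an attribute."""
    if mean_importance > threshold:
        return "worst case"
    return "arithmetic mean"

# Hypothetical mean values for two attributes.
critical = aggregation_mode(4.2)       # above the threshold
less_critical = aggregation_mode(2.5)  # exactly at the threshold

# Making the selection more stringent by lowering the threshold.
stricter = aggregation_mode(2.5, threshold=2.0)
```

Lowering `threshold` moves more attributes into worst case mode, and setting it below the lowest mean value applies worst case aggregation to all criteria.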
Set Importance Factors

The weighting scheme (also referred to as the balancing criteria) is set up so that each level of nodes in the objective tree is weighted separately, and the weights at that level need to sum to 1.

Weighting the attributes

Here we used the interviewer's assessment of the relative importance of the different components in the objective tree to assign weights. The weighting scheme used, based on the researchers' requirements, is given in Appendix D. Many of the attributes turned out to have low weights, and some of the weights are effectively zero (e.g. header and footer).

Analyse Results

At this stage, the stages of the preservation planning process covered so far are summarised. Those tools which have achieved non-zero scores by the weighted multiplication method of summarising the results can be used to make a recommendation for the preservation plan. If no tool has a non-zero score, then a tool cannot be recommended and the process cannot continue.

Deciding on the best tool

The results we obtained did not give us a recommended tool, as each scored 0 for at least one attribute.
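The effect of a knock-out value on the two summary methods can be illustrated as follows. This is a sketch under stated assumptions, not Plato's code: weighted multiplication is assumed here to be the product of each target value raised to its weight, the knock-out is applied explicitly so that a 0 still eliminates a tool when its weight is effectively zero, and the attribute names, weights and values are hypothetical.

```python
import math

def weights_valid(weights):
    """Check that the weights at one level of the tree sum to 1."""
    return math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9)

def weighted_sum(values, weights):
    """Sum of target values multiplied by their weights."""
    return sum(weights[k] * values[k] for k in weights)

def weighted_multiplication(values, weights):
    """Assumed form: product of value ** weight, with 0 treated as a knock-out."""
    if any(values[k] == 0 for k in weights):
        return 0.0
    return math.prod(values[k] ** weights[k] for k in weights)

# Hypothetical weights and target values for two tools.
weights = {"appearance": 0.5, "structure": 0.3, "behaviour": 0.2}
tool_a = {"appearance": 5, "structure": 4, "behaviour": 0}
tool_b = {"appearance": 4, "structure": 4, "behaviour": 4}

for name, tool in (("A", tool_a), ("B", tool_b)):
    print(name, weighted_sum(tool, weights), weighted_multiplication(tool, weights))
```

In this example tool A keeps a weighted sum of about 3.7 despite its knock-out value, but its weighted multiplication score is 0, so only tool B could be recommended.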


14 Figure 5. Screen shot of the results for weighted multiplication and weighted sum. Here, neither tool meets the requirements set up by the transformations values for our objective tree. Here, it is clear that the random values for assessment that we chose resulted in neither tool being suitable; the chart can be expanded further to see what areas were assigned knock-out values of 0. In this case, we had not assigned suitable transformation values to some criteria, so good ('identical') results were being transformed to 0. Adjustments to the experiment Using automated evaluation The experiment described above indicated that if we wanted to assess a number of different options for a sample of several different files, it would be preferable to use as many automated tools as possible, and not use visual inspection (which is necessarily more subjective) unless no alternative exists. Therefore, we set up a trial objective tree where all criteria could be evaluated automatically. This objective tree is given in Appendix B. All attributes were assessed using Boolean criteria, except relative file size which was a ratio of size of migrated file: size of original file. This attribute was defined as a positive number, with units of 'ratio' (while a ratio is a dimensionless quantity, Plato requires units to be given to numerical values). The same sample and output files were used as before. However, the results were less useful than we had hoped. In order for the automatic evaluation to work for some attributes, it is necessary to describe the files in XCDL. However, we found that the input file was described in XCDL of 0bytes; in the case of the migrated files, although the message "Successfully described all result files" was obtained at the Run Experiments stage. Moving on to the next stage (Evaluate Experiments) gave the message "Some XCL descriptions for samples are missing, XCL comparison will not work for these samples." 
Seemingly the results files had not been successfully described after all. Some of the criteria could still be evaluated automatically, but output was only obtained for the three criteria in File Properties, with both migrated files being valid and well-formed, and with relative file sizes of 1.41 and 0.46 respectively (accurately reflecting the sample file size of 9 183 KB, and […] KB and 4 269 KB for the Adobe- and Word-generated PDF files respectively).
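The relative file size attribute described above can be reproduced by hand as a check on the automated output. The following is a minimal sketch, not part of Plato; the function name `relative_file_size` is hypothetical, and the sizes are the two legible values reported above for the sample file and the Word-generated PDF.

```python
def relative_file_size(migrated_kb: float, original_kb: float) -> float:
    """Return size of migrated file divided by size of original file.

    Hypothetical helper for illustration only; it is not a Plato function.
    """
    if original_kb <= 0:
        raise ValueError("original size must be a positive number")
    return migrated_kb / original_kb


original_kb = 9183  # sample file size reported in the text
word_pdf_kb = 4269  # Word-generated PDF size reported in the text

ratio = relative_file_size(word_pdf_kb, original_kb)
print(round(ratio, 2))  # agrees with the 0.46 reported by Plato
```

A value below 1 means the migrated file is smaller than the original, and a value above 1 means it is larger.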

Conclusions on using Plato to assess text documents

Our use of Plato on text documents suggests that there are insufficient integrated migration tools in Plato and the PLANETS suite for it to be of significant value in discovering new migration tools. Initially we wanted to use Plato to help us to migrate PDF files to PDF/A format, but we are also interested in migration of other text file formats to PDF, and could not find integrated tools for doing this.[15]

Plato has pushed us to assess the preservation requirements of the text-based resources, and has helped us to explore the requirements of originators of text-based outputs when it comes to preservation. However, our problems with generating working output files for some migration actions, and the failure of the automated evaluation tools (without any indication that they have failed and at times with the erroneous suggestion that they have worked correctly), lead us to conclude that it does not currently offer sufficient support or tools for us to make use of it in the near future. We are fortunate that we did not identify any immediate areas of concern through our scoping work on the text collections. When future assessment of the deposited contents indicates that action is required to preserve such documents, we will re-investigate the utility of Plato and the Planets suite to help our preservation process.

Evaluating the use of XCDL in Plato to compare image files automatically

Becker et al. (2008) discuss a PNG to TIF migration, with comparison of input and output files using XCDL; many of the automated functions apply only to images (e.g. 'image height', 'image width', 'resolution'). Because we feel that the automated evaluation has the potential to be the most valuable aspect of Plato in developing a preservation plan, we decided to try out these tools using a sample set of 3 TIF files: a coloured diagram, a photo, and a screen shot.
It should be emphasised that the key aims were to gain a better understanding of setting up an objective tree with leaf attributes that could be automatically defined and to see what type of output was produced by the automated evaluation functions, rather than to reflect accurately our priorities and requirements for image preservation. To this end, we decided to set up a small objective tree (see Appendix B).

Most of the requirements are assessed by Boolean criteria. Although a series of yes/no responses would make a speedy evaluation process, it was not realistic for all the criteria we wished to assess. For example, if we knew that the images were not identical (Significant properties: Content: Image pixelwise identical > No) we wanted to use other criteria to assess how different they were. This was expressed by Significant properties: Content: Image similarity (RMSE). We were also interested in the resolution (in dpi) of the final output file. Requirements on the format were also felt to be more meaningful if a value could be attached. In the case of Complexity, we decided to use a three-point ordinal system. For Comparative size, a simple ratio would help comparison with images of different starting sizes. However, it should be noted that these choices were largely arbitrary, and chosen around the issues that we thought were most likely to give interpretable results, rather than reflecting our priorities in image preservation. For example, we did not choose to evaluate the image metadata as it was not clear that this included the technical EXIF metadata.[16]

As we were interested in using automatic evaluation tools, we needed to find file types and migration tools which would produce XCDL files. For our starting files, we were successful in producing XCDL from TIF files. For the output files, we tried TIF > TIF and TIF > JPEG 2000 migrations, but could not generate the required XCDL from the migrated files.

Overall, this indicated that the automated evaluation tools were of limited utility in their current form; they are focussed on preservation action on image files, which means that their scope is insufficient to incorporate them into preservation actions required in the repository, where around 2790 items are image files.[17] Successful use also requires the generation of valid XCL for both starting file and end file, and this was unreliable; the XCL tools relating to text files (such as font information) were not tried because we could not generate the required XCDL files for our text-based formats.

[15] The issue of format-specific actions will be discussed in Format specific migrations.

Suggestions for developments

Creating and evaluating the decision trees

The policy/objective trees can be quite long. It takes a long time to go through them systematically, and the way in which Plato is set up suggests that objective trees should be short; no advice is given on how much detail should be covered in the objective tree. In light of this we make several suggestions:

The ability to evaluate on a particular branch/node, rather than the whole tree, would be helpful (e.g. if there is a branch that is considered more important). This could be used to save time (especially as the experiments are not necessarily quick to run), and would be particularly relevant if you are genuinely trying to assess software options; you may want to compare several options directly on a couple of key criteria, and then fine-tune the options.
This could be done using different plans; however, if that is the case then the coherency of the policies and the objective tree is lost. While it is possible to weight according to some criteria, this can only be done after the evaluations and transformations, which can be very time-consuming.

When comparing several different alternatives, with relatively large samples, it can be easy to lose track of which 'leaf' is being evaluated. This is particularly true when the assessment criteria are defined in broadly similar ways (e.g. if there are several Boolean assessments, or several attributes being assessed on the same ordinal scale). A clearer system to indicate what parameter is being evaluated would be valuable.

Rather than having the save option at the end of the page only, have it after each leaf, or provide a 'jump to bottom' button, in the same way as there is a 'jump to top' button (this would help avoid scrolling down the page at other times too, for example having confirmed calculated values, to be able to Save and Progress).

[16] When we did return to try to assess Image metadata (from the format: output automatic evaluation option), we were not able to generate any meaningful results; the evaluation indicated that the image metadata was 'UNDEFINED' for both the original image and the target. This was also the case for the options Image metadata: software and Image metadata: description.

[17] This does not include images that form part of another item, e.g. webpage, article.

Better highlighting of values which are missing or have not been entered correctly. Such values stop the program from progressing, but the only indication is an asterisk by the relevant text box. In general, this is not easy to spot and is particularly problematic in cases where the objective tree, sample size, or number of migration tools is large. Highlighting them in red (as for the node/leaf with no text when setting up the objective tree) would be considerably clearer.

Experiments

The pop-up box to start the experiments could be formatted better. For example, it could give a clearer summary of what will happen (n samples and m migration tools resulting in n × m experiments), and follow this with a clear question, e.g. 'Do you wish to continue?' with Yes/No (or Cancel) options, rather than a green arrow, which does not clearly indicate that the experiment has not started yet.

The help pages explain the icons indicating experimental results as follows:

Figure 6. Screenshot of the help explanation of experimental results icons.

It seems most likely that this is an error; probably the green tick should signify that 'the experiment has executed successfully', but it is confusing nonetheless.

Having selected a Preservation Action (Define Alternatives page), the information on which preservation action tool is which is limited on subsequent pages; this is particularly problematic if more than one action from a given provider is selected, as by default these are labelled with numbers rather than with a description of what the action does (e.g. tif > bmp). It is possible to rename the tools in Plato, instead of using the default name given to each tool. It would be helpful to make it clearer that renaming is possible.

Similarly, in the evaluation tree, the result from each migration could be made clearer by giving the name of the migrated file (including any file extensions), as well as the original filename.
Transforming values

If an attribute is assessed using a numerical range, the user enters which values map to the defined target values (1-5). (This contrasts with ordinal ranges, where the user enters the target value that should be associated with a given category.) It is not clear that if the numerical value is less than the value that maps to 1, that numerical value will map to 0. Nor is it clear whether this is affected by using Linear Threshold Stepping where interpolation is expected. It is important that values that should be mapped to 0 are clearly defined, as these indicate knock-out criteria.

Similarly, it is unclear if transformation values need to be evenly distributed between the lowest and highest values when setting up the transformations, particularly from numerical measurements.
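To make the ambiguity concrete, the sketch below shows one possible reading of a numerical transformation: five ascending thresholds map to the target values 1 to 5, anything below the first threshold becomes a knock-out 0, and an optional flag interpolates between neighbouring thresholds. This is an illustration under stated assumptions (higher measurements are better, and the function name `transform` and its `interpolate` flag are hypothetical); it is not taken from Plato's code and may not match what Plato actually does.

```python
from bisect import bisect_right


def transform(value, thresholds, interpolate=False):
    """Map a measured value to a target value between 0 and 5.

    thresholds: five strictly ascending measured values that map to the
    target values 1, 2, 3, 4 and 5 (assumption: higher is better).
    Values below thresholds[0] map to 0, i.e. a knock-out.
    """
    if len(thresholds) != 5:
        raise ValueError("exactly five thresholds are required")
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly ascending")

    if value < thresholds[0]:
        return 0.0  # knock-out: below the value that maps to 1
    if value >= thresholds[-1]:
        return 5.0  # at or above the value that maps to 5

    # Number of thresholds <= value; this is the target value reached.
    reached = bisect_right(thresholds, value)
    if not interpolate:
        return float(reached)  # stepping: stay on the lower step

    lower = thresholds[reached - 1]
    upper = thresholds[reached]
    return reached + (value - lower) / (upper - lower)


thresholds = [10, 20, 30, 40, 50]
print(transform(5, thresholds))                     # below the first threshold
print(transform(25, thresholds))                    # stepping
print(transform(25, thresholds, interpolate=True))  # interpolated
```

Note that the thresholds need not be evenly spaced for this sketch to work; whether Plato expects even spacing is the open question raised above.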

Assigning weights

Weights can only be changed using the slider. While the presence of a text box suggests that values for weights can be entered directly, this does not seem to be the case; direct entry would be a welcome addition.

We had originally weighted each attribute individually, rather than assessing each node of the objective tree individually. While this does help break down the problem, in our case it involved a certain amount of recalculation. There is value in both techniques, so greater flexibility in how the details are entered into Plato may be beneficial.

The layout makes it unclear which value relates to which total.

Assessing outcomes

At the Analyse Results stage, stating the maximum score of any tool would help demonstrate how the various tools compare to the ideal.

The help comment "The final ranking is based on a rational scale" could be explained more fully; an example of how the score for a given tool is calculated for each scenario (Weighted Multiplication and Weighted Sum) could also be of value.

A consistent colour scheme would also improve clarity, e.g.:

Figure 7. Screenshot showing how the colour associated with a given tool may change, depending on the results of other tools. Here, the same tool (Adobe Acrobat 9 Pro) has been assigned a different colour in Weighted sum and Sensitivity analysis than in Weighted multiplication, as a tool earlier in the list (Word PDF converter) has been discounted.

Assessing the relative importance of different stages

Several of the stages do not have mandatory steps. It is hard to evaluate the importance, and in some cases the relevance, of these. For example, at the Take the Go decision stage, one can only proceed if 'Go' is entered. While this is sensible in the context of carrying out the experiments and the ultimate production of the plan, it does mean that considerable time and assessment is locked into the programme.
It is also unclear why there is an option of 'Provisional Go', and what this is intended to represent; similarly for 'Deferred Go'.
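As an illustration of the kind of worked example requested under Assessing outcomes, the sketch below computes a tool's overall score under the two aggregation scenarios. It assumes the common textbook forms: the weighted sum adds weight × transformed value over all leaves, and the weighted multiplication multiplies the transformed values raised to the power of their weights, with weights normalised to sum to 1. Whether Plato computes its scores in exactly this way is an assumption, and the function names and example numbers are hypothetical.

```python
import math


def weighted_sum(values, weights):
    """Sum of weight * transformed value over all leaves."""
    return sum(w * v for v, w in zip(values, weights))


def weighted_multiplication(values, weights):
    """Product of transformed values raised to the power of their weights.

    A single transformed value of 0 (a knock-out) makes the product 0.
    """
    return math.prod(v ** w for v, w in zip(values, weights))


weights = [0.5, 0.3, 0.2]   # hypothetical leaf weights, summing to 1
tool_a = [5, 3, 0]          # hypothetical values; one knock-out of 0
tool_b = [4, 4, 4]          # hypothetical values; no knock-out

for name, values in (("tool A", tool_a), ("tool B", tool_b)):
    print(name,
          round(weighted_sum(values, weights), 2),
          round(weighted_multiplication(values, weights), 2))
```

Under these assumptions the knock-out value removes tool A from consideration in the weighted multiplication even though its weighted sum remains well above 0, which is consistent with the behaviour described for Figure 5, where criteria transformed to 0 left neither tool suitable.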

Producing output

It would be useful to be able to output the analysis carried out for use in reports, dissemination (particularly to non-experts), etc. While it is possible to produce output in XML, it does require some manipulation to produce clear output for inclusion in printed documents. This applies at stages prior to the completion of a preservation plan (which may not be possible, if a preservation plan cannot be completed because of the lack of suitable tools, the need to consult an outside expert at an intermediate stage, etc.), as well as for a completed and approved plan.

Reflections on using Plato

The utility of Plato for non-experts

As noted by Becker (2010), considerable manual effort and expert preservation planning knowledge are needed to produce the preservation plan: "The case studies described [in chapter 5] each involved several people for about a week, including a planning expert to coach the decision makers for many organisations, applying the planning approach to all or even just the most valuable collections is not feasible." Our experience tends to corroborate this. Considerable time is needed to gain sufficient understanding of the program to be able to use it or to assess the options available within it with confidence; this is before running any of the tools or evaluating the experiments, which may also be time-consuming.

As is often the case for new software, the initial barriers to use are high, and Plato does seem to be aimed at highly expert preservation specialists who have access to a large number of resources or pools of expertise. The evaluation process is lengthy, and our experience suggests that it may need to be done in stages, particularly for people new to preservation planning.
Without appropriate help to enable the user to define a suitable policy tree and create a robust objective tree, it is very likely that some important aspects of the action path will be omitted in the first instance, through lack of experience.

As a related point, throughout the programme we found the help functions to be of limited use. The help button that accompanies many text boxes often provides little additional information. For example, at the Transform Measured Values stage, the help for the Comments box is "Document any relevant observations and issues that you deem relevant [sic] for the transformation process." Some key stages have a question which links to further information and examples, and this was more helpful for the new user, but is not comprehensive, is sometimes contradictory and would benefit from links to further resources. We found Becker's thesis to be the main source of the help and background information that is necessary to understand the programme. Plato would benefit from having more of this content integrated into the documentation.

Setting up a suitable objective tree

Drawing up an appropriate objective tree is a crucial part of producing a preservation plan, but is a time-consuming task even when experts are involved. We suggest that to provide support for this, a checklist (similar to the DCC Checklist for a Data Management Plan) could be a valuable resource. Clearly, it would be impossible to draw up an exhaustive checklist of factors that should be considered, not least because these will depend on the institution, the size and type of collection, the urgency of the preservation plan, etc. However, while publicly-available preservation plans can help with identifying factors, a checklist could ensure that common significant issues are highlighted for consideration; this would be particularly useful for small organisations with little experience in preservation planning. If appropriate, these checklists could be different according to the original format (for example issues with fonts would be relevant to documents with text, but not to images).

One of the main challenges when drawing up an objective tree is the separation of what characteristics you want to assess and what level you want the output file to meet. Here, the challenge is principally one of conceptualisation; it becomes apparent as one meets unexpected issues in the output file and realises that setting up the objective tree differently (e.g. asking slightly different questions, changing assessment from Boolean to Ordinal or one of the numerical options) would allow these differences to be addressed more effectively. That is, we were required to reassess how we regarded a particular attribute in the starting (or resulting) file.

This issue can be illustrated by thinking of an image, such as a figure, within a file. Intuitively we might say that it has to be the same, and set up a leaf 'Image identical' with 'Yes' and 'No' as the available options. However, this does not allow for nuances in differences. Is the image the same size? If not, is it smaller? Does this affect clarity if key points are now difficult to resolve by eye? Is it larger? Does this affect clarity if the image becomes pixelated? Has it been distorted? Is the aspect ratio the same? Has text been resized differently from non-text components? It is easy to see that it is hard to produce an effective tree that covers all possible issues, but also that not addressing these issues makes assessment of the migrated file difficult.

At the Evaluate Experiments stage, it is possible to add comments to explain the reason for the value assigned for the output.
For example, if, for a migrated file, the attribute Character Appearance > Font size is assessed on an ordinal scale as having small changes, a comment outlining the nature of these changes can be added, e.g. 'Font appears smaller in places'.

Figure 8. Screen shot indicating how additional comments can be used to give more detail when evaluating experiments.

However, these comments are not displayed later in the preservation plan, even though they provide important information about the results.

Choosing the sample set

Plato help information suggests a sample size of 3-10 items. In our experience, this is certainly a sample size that would not be too daunting to evaluate; however, it does seem small if the number of items to be migrated is large. If there is any evidence that the tool does not perform consistently, or if the majority of items for assessment can be evaluated automatically, then it would seem sensible to test any promising tools further using a larger (and possibly randomly chosen, rather than carefully selected) sample set. Similarly, if there is no clearly favoured tool, running those which perform similarly well on a larger sample may be beneficial. The weighting scheme chosen may also be a factor in the relative performance of competing tools, and when this is flagged by Plato at the Validate Preservation Plan stage, caution is recommended. Again, it may be more beneficial to use more samples as well as investigating the effect of changes to the weighting scheme; this may depend on the level of confidence in the weighting scheme being applied.

In practice, as we adopted an iterative process to try out the various automated evaluation options and the alternative migration tools, our sample sizes were very small (in some cases only one file). However, for a genuine migration plan, we would feel more confident in the strategy chosen if it had been tested on several files.

Although it is necessary to compare the different tools, it would also be valuable to be able to see clearly how each tool handled each assessment criterion.

Format specific migrations

Becker (2010) provides many valuable insights on how the program is intended to be used. For example, he comments on the common practice of trying to find the best format for the content, rather than the 'action path' required to arrive at the target, and notes that not all tools will produce outcomes with the same characteristics, even if the end file type is the same. Regarding the latter point, this was one of the reasons that we were interested in using Plato: to help us compare the results of different tools, for example to check if the output is standards-compliant. We feel that, while the action path is important, if the resulting files are not fit for purpose then the suitability of the action path is irrelevant. (Of course, this can work both ways; if the action path is not suitable, then that preservation tool is effectively unusable.)
Often, there will be an industry standard, or the preservation team will have some a priori knowledge which they have used to select a preferred output type. Plato encourages the digital preservation team to think holistically about the process, and couples this with the ability to compare the outputs of different software tools which are meant to be producing the same type of output.

Carrying out the transformation procedure

Intuitively, we expected this to occur before the experiments, as the minimum levels that the migration tools should attain should be (at least to some extent) independent of the results of the experiment. For example, it may be known a priori that any change in the format of an equation would be unacceptable. Setting this a posteriori seems to imply that we are simply ranking the different options relative to one another, rather than against a set of defined criteria. This is also suggested by the help tool and the automatically calculated transformation suggestions, which make use of the best and worst values obtained. Certainly, one may wish to re-evaluate the stringency of these criteria in light of the experimental outcomes, for example if no experiment reaches the required standard. This could then be logged (or otherwise documented), before carrying out the transformation again.

Assigning suitable weights

Becker (2010) comments that too much time can be spent on assigning weights to factors with low importance. In this case, many of the attributes turned out to have low weights. As these weights were established based on what researchers themselves thought were important factors, rather than any preconceptions about what may be required, this could not have been known a priori.

Assigning suitable weights is further complicated by the confusing layout used at this stage in the process, as it is hard to see which node is being assessed.

Producing a Preservation Plan

We have touched on this issue in Producing output, but it bears re-examining here. Preservation planning clearly does not happen in a vacuum. Those involved in the planning process will be part of a larger group. For example, in the repository team at DSpace@Cambridge, it is likely that one person would take part in preservation planning of a particular collection. While this will happen in consultation with colleagues, they would not necessarily need to have access to Plato to contribute. Aspects of the preservation plan may have wider implications, such as for budgets, staffing or long-term strategies within the institution. Expecting all those involved with this process to have access to Plato is not realistic, nor is using the XML output (without considerable manipulation). We regard these difficulties with dissemination as a real barrier to Plato's utility.

Adam Farquhar (Ashenfelder, 2010) indicates that it is possible to "produce structured documents in PDF and other formats" from Plato; however, we were not able to find any option that would allow us to do this, nor could we find any advice on producing documents in PDF in the help facility.

Bugs

Automated evaluation tools

Several of the tools that are meant to allow for automated evaluation do not seem to work, giving completely blank output for both decision (Yes/No) and description. These include, for example (from the combined scanned image tree): x resolution identical (similarly y resolution identical), resolution unit identical, Stability, IPR protection, Open license, Format characteristics > Ubiquity, Format documentation > quality, Format characteristics > Complexity, Format characteristics > Disclosure, Licensing > Open Source.

Displaying numerical values

There are some issues with floating point values, and with presenting numerical results of transformations in a meaningful way, e.g.:

Figure 9. Example of problems with rounding errors.

While this can be altered easily by the user (especially as checking automatically-calculated values is required), a more robust method of displaying numbers in a meaningful way would inspire confidence.
