my Grid Requirements for the information repository and management of information in mygrid Document Class: Requirements document


Document Reference: PL2
Issue No: 0.2
Author: Peter Li
Institution: University of Newcastle
Date:
Pages: 49

Abstract: This document reports on a study to identify the requirements for the information repository and the management of information in the mygrid project. It is intended that the document will provide a reference for the design of the mygrid information repository and stimulate discussion amongst the mygrid developers, investigators, and the wider life sciences community regarding its specification and implementation.

0 Document Information

0.1 Table of Contents

0 Document Information
Table of Contents
Document History
Forecast Changes
References
Introduction
Requirements Gathering
Terminology
Data Storage
Data formats
Flat files
XML files
Relational data
Data Archiving
Metadata
Metadata types
Technical metadata
Contextual metadata
Ownership
Versioning
Life Sciences Identifier
Metadata storage
Provenance
Contents of Provenance
Workflows and provenance of MIR usage
Query Capability
Data Formats
Relational data
XML data
Flat file data
Metadata querying
Free text searching
Distributed Query Processing
Types of repositories in mygrid
Data Formats
Distributed Query Processing and Workflows
Wrapping of public databases
Views
Data Processing and Transformation
Data Transformation
User-Defined Functions
Management Information Service
User Management

Ref PL1 Issue 0.2 Page 2 of 49

10.1 User accounts
Project working
Notification
Distributed Annotation System
Security
Security Issues
Authentication
Authorisation
Role-Based Access and Privileges
Views
Auditing mygrid Information Repository Activities
Capacity and Performance
Data Volume
Indexing
OGSA-DAI
Fault Tolerance
Counter measures
Transactions
Concurrency Control
Database Backup and Recovery
Personalisation
Personal Annotation of Data
Integration of Legacy Data
Views
Notification Profiles
Appendix I
User requirements for biologists
Bio Bio Bio Bio Bio
User requirements for specialist bioinformaticians
Bioinf Bioinf Bioinf Bioinf
User requirements for system administrators
S
User requirements for tool builders
T
Appendix II An example of a flat file entry in Swiss-Prot
Appendix III An example sequence entry in FASTA format

0.2 Document History

Revision and description of change:
- Initial issue to mygrid WP3 Newcastle.
- Second draft based on comments received from Paul Watson and Anil Wipat. Released to mygrid WP3 Newcastle and Manchester. The next draft will take into account the comments received from WP3 Manchester, Alan Robinson and Nick Sharman.

0.3 Forecast Changes

The requirements described in this document are subject to change based on the outcome of the following issues in mygrid:
- Requirements for mygrid gathered from industrial users at GlaxoSmithKline, AstraZeneca, Merck and Non-Linear Dynamics.
- The storage of metadata in the form of RDF.
- Provision of a mygrid e-lab book.

0.4 References

[1] mygrid User Group web pages (2002).
[2] Apgar et al. (2002) Life Sciences Identifier (LSID): Draft Specification for Review and Comment.
[3] Werner, P. (2002) Life Sciences Identifier (LSID): A Foundation for Wide Area, Scientific Collaboration and Informatics Interoperability.
[4] Greenhalgh, C. (2002) Towards a simple operational model for mygrid?
[5] Smith, J. et al. (2002) Distributed Query Processing on the Grid. Proc. Grid Computing GRID, ed. M. Parashar, Baltimore, USA, November. LNCS 2536, Springer Verlag.
[6] Watson, P. (2002) Databases and the Grid. Newcastle University Computing Science Technical Report CS-TR-755.
[7] IBM DiscoveryLink web pages.
[8] Atkinson, M. et al. (2002) Grid database access and integration: Requirements and functionalities.
[9] Pearson, D. (2002) Data requirements for the Grid: Scoping study report.
[10] Goble, C. et al. (2001) mygrid project proposal.
[11] Distributed Annotation System.

1 Introduction

This report documents the results of a requirements analysis for the mygrid information repository (MIR) and the integration of its data with external user repositories and public databases. This analysis was deemed necessary by the Information Repository Management work package in the mygrid project to provide a basis for the design and implementation of the information repository and its distributed query processing service.

The remainder of the report consists of three parts. Section 2 describes how the requirements were gathered and Section 3 defines the terminology used in this report. The remaining sections contain a list of generic requirements for the MIR and for the distributed querying of data repositories.

2 Requirements Gathering

The requirements for the MIR were derived from discussions within the Information Repository Management work package in the mygrid project and from resources available on the web. The results of the work undertaken by the mygrid User Group were also considered in this requirements analysis [1]. The mygrid User Group developed a taxonomy of mygrid end users, of which there were five main types: biologists, bioinformaticians, application tool builders, system administrators and managers. These user types are underlined in Figure 1. People conforming to these roles were recruited from various academic institutions in the UK and their requirements of the mygrid platform were captured using a semi-structured interview. The results from these user interviews are presented in Appendix I.

[Figure 1. A taxonomy of mygrid users. Node labels: MyGrid Users; Biologists; Computer Specialists; Managers; Rare Users; Occasional Users; Bioinformaticians; Tool Builders; Systems Administrators; Project Managers; Bioinformatics Managers; Bioinformatics Tool Builders.]

The requirements for the MIR were either directly extracted or inferred from the results of the interviews with end users. The requirements for the MIR from the developers on the mygrid project were also considered.

3 Terminology

This section introduces several terms used in this document. Data is a collective term for the values assigned to data items, and data format describes how this data has been structured. Data is generated by a data producer such as an application program or laboratory experiment. A database is an organised collection of data which is stored and managed by a database management system (DBMS). A service is a capability provided by a resource, which could be a data analysis program or a data repository.

The personal repository is now referred to in this document as the mygrid information repository (MIR), since this name conveys its role in mygrid more accurately. Each mygrid user possesses data which is stored in a MIR deployed by their organisation. Databases in the mygrid environment can be arbitrarily categorised into three types: the MIR, public databases such as EMBL and Swiss-Prot, and external user repositories. These latter repositories contain biological data which is proprietary to a user and/or their research group.

4 Data Storage

This section describes the requirements for storing data in the MIR. It describes the formats in which this data may be structured and how the integrity of the data within these formats should be maintained.

The MIR is required to accommodate data generated from laboratory and in silico experiments. Examples of the types of data which might be stored in the MIR include gene sequences, protein structures, signalling pathways and abstracts from scientific papers. Other types of data requiring storage are provenance and intermediary data generated from the execution of workflows. In addition, the workflow definitions constructed by users should also be stored in their personal repositories.

4.1 Data formats

In the life sciences, data is structured in a number of different formats: flat files, XML and relational data. Biological data needs to be stored in the MIR in these data formats.

Flat files

A flat file is a file containing data which has no structured interrelationships and which takes no account of its visual representation. Flat files are a common format for storing textual data in the life sciences. There are various types of flat file formats used to represent the different types of biological data. Examples of flat file formats representing DNA information include EMBL and GenBank. Common file formats holding protein information are Swiss-Prot and PDB. These flat files represent data by flagging each line with an n-lettered code to indicate the type of data present on that line. An example of a Swiss-Prot entry is shown in Appendix II.

DNA, RNA and protein sequences are also commonly represented in the FASTA format. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
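The FASTA layout described above (a description line starting with ">" followed by lines of sequence data) can be read with a few lines of code. The following is a minimal sketch in Python, not part of the MIR design; the function name `parse_fasta` and the two example records are hypothetical and serve only as illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each description line (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    description = None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            # A '>' in the first column marks a new description line.
            description = line[1:].strip()
            records[description] = ""
        elif description is None:
            raise ValueError("sequence data found before any description line")
        else:
            # Sequence data may span several lines; concatenate them.
            records[description] += line.strip()
    return records


# Hypothetical example data, invented for this sketch.
example = """>seq1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical DNA fragment
ATGCGTAC
""".splitlines()

records = parse_fasta(example)
```

A parser of this kind is the sort of routine the MIR would need before the contents of a flat file can be queried or transformed.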
An example sequence in FASTA format is shown in Appendix III.

XML files

The XML format is becoming increasingly common for structuring biological data. Examples of where XML is used as a general framework for annotating data in the life sciences include the Bioinformatic Sequence Markup Language (BSML) and the Biopolymer Markup Language (BioML). In mygrid, workflows are being represented using XML in the form of Web Services Flow Language (WSFL). Provenance records generated from the execution of workflows are also being represented using XML. The structure of XML documents is defined by Document Type Definitions (DTDs) and XML schemas. These files should either be stored in the MIR or be made accessible to it, so that they can be used to validate the XML files stored there and maintain data integrity.

Relational data

Biological data may be structured into relational tables which embody different aspects of the data but contain overlapping information. Users should be able to store relational data in the MIR.

The integrity of relational data must be maintained in the MIR. This involves the ability to control relationships between data using DBMS capabilities such as referential integrity, check constraints and triggers.

Referential Integrity

Referential integrity should be used to enforce the relationships between tables in relational data. For example, a data record in a table cannot exist for a user ID if that user ID does not exist in the Users table.

Check Constraints

Check constraints should be used to enforce rules on a column of data. Examples of rules which should be enforced using check constraints include "a protein sequence cannot contain the letters J, O, U or X" and "a DNA sequence can only contain the letters A, T, C and G".

Triggers

Triggers are compiled SQL procedures in a database which perform actions in response to other actions that occur in the database. Triggers should be created and executed to enforce rules other than referential integrity between tables, before and after the insertion, update and deletion of relational data in the MIR. For example, triggers could be used to remove the dependencies on a piece of data which a user wants deleted. The functionality provided by triggers can also be used as a basis for notifying data change events in the MIR, to support the general mygrid notification service.

4.2 Data Archiving

Industrial organisations are compelled to store outdated data for extended periods of time (decades) for legal and regulatory reasons. They will also wish to do this so that potentially valuable information is not lost. This may create significant capacity and performance demands on the MIR. Therefore, a facility for transferring old data to a separate off-line data archive may be required.
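The three integrity mechanisms described for relational data in Section 4.1 (referential integrity, check constraints and triggers) can be sketched together. The example below uses Python's built-in `sqlite3` module purely as a stand-in DBMS; the document does not name the DBMS for the MIR, and the table names, column names and the `try_execute` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (user_id TEXT PRIMARY KEY);

CREATE TABLE dna_sequences (
    seq_id   INTEGER PRIMARY KEY,
    -- Referential integrity: the owner must exist in the users table.
    user_id  TEXT NOT NULL REFERENCES users(user_id),
    -- Check constraint: only the letters A, C, G and T are allowed.
    residues TEXT NOT NULL
             CHECK (residues <> '' AND residues NOT GLOB '*[^ACGT]*')
);

CREATE TABLE change_log (seq_id INTEGER, action TEXT);

-- Trigger: record every deletion so that it could feed a notification service.
CREATE TRIGGER log_delete AFTER DELETE ON dna_sequences
BEGIN
    INSERT INTO change_log (seq_id, action) VALUES (OLD.seq_id, 'deleted');
END;
""")


def try_execute(sql, params=()):
    """Run a statement; return False if an integrity rule rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


insert = "INSERT INTO dna_sequences (user_id, residues) VALUES (?, ?)"
conn.execute("INSERT INTO users VALUES ('bio3')")
ok_valid = try_execute(insert, ("bio3", "ATGC"))           # accepted
ok_bad_letter = try_execute(insert, ("bio3", "ATGX"))      # rejected by CHECK
ok_unknown_user = try_execute(insert, ("nobody", "ATGC"))  # rejected by foreign key

conn.execute("DELETE FROM dna_sequences WHERE residues = 'ATGC'")
log = conn.execute("SELECT action FROM change_log").fetchall()
```

The same rules could be expressed in any DBMS that supports ANSI SQL constraints and triggers; only the pattern-matching syntax in the check constraint is specific to SQLite.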

5 Metadata

This section describes the requirements for metadata and its role in performing data operations in the MIR.

5.1 Metadata types

Metadata is a term for data that describes other data. Metadata is important to the MIR since it provides references, context and meaning to the data stored there. The types of metadata required by the MIR to perform its data operations can be grouped into technical, contextual, currency and ownership metadata.

Technical metadata

Users' data stored in the MIR and in external databases needs to be described in terms of its technical characteristics. These technical characteristics are location, data structure and data resource characteristics.

Location

The MIR needs to know where data is located in order to retrieve it. This is especially important if legacy data belonging to a user is kept in another database, as it must be retrievable by the MIR if it is required for processing by mygrid services. The data location may be expressed in the form of a logical reference to the data such as a full file pathname, a uniform resource locator (URL) or an object name in a database.

Data structure

The data structure defines the logical groupings of data items and their interrelationships in a data source, along with their order of appearance, format, size and type within each logical grouping. Knowledge of the data structure in a data resource is required to navigate through its contents and to access specific pieces of data directly. Examples of data structures include database schemas, such as that for the MIR, which is required by the mygrid portal, and the record structures of flat files and of XML files, the latter in the form of DTDs and XML schemas.

Data resource characteristics

The technical characteristics of data resources are required for determining effective and efficient methods of discovering, retrieving and managing data from data resources. Size is an important data resource characteristic which might be required for determining where data might be stored and how much local space must be allocated before it is downloaded. Information on data resource characteristics is particularly important for distributed query processing, which relies on a data dictionary containing statistics such as the number of processor nodes, the distribution of data volume over these nodes and the availability of the resource.

Contextual metadata

The data stored in the MIR must be furnished with contextual metadata to provide it with meaning and context. The contextual metadata that is used should conform to a specific

standard so that data can be defined accurately and without ambiguity by the MIR and by other mygrid services. Data in the pre-prototype and 0.1 version of the MIR was annotated with its concept type. These concept type terms were obtained from ontologies provided by the Metadata and Ontologies work package.

Ownership

The MIR must assign an owner to each item of data it stores. This ownership metadata is important for a number of reasons. It is required for maintaining data security in the MIR, for establishing intellectual property rights over data and for users to credit an owner when using their data. Knowledge of the data owner also provides an indication of the quality and value of the data.

Versioning

Data in the life sciences is volatile. An example of a process which generates extremely volatile data is contig assembly. This volatility is caused by the updating of data, which can involve the amendment or deletion of the data contents, or the addition of new data. However, previous versions of data may need to be retrieved if, for example, a biologist disagrees with the new functional annotation of a protein and needs to re-assess the previously annotated record of the protein. It is therefore essential to be able to distinguish between data in different states which arise and co-exist over time in the MIR, by annotating data with versioning metadata.

Life Sciences Identifier

The Life Sciences Identifier (LSID) is a naming scheme for uniquely identifying biological data in federated repositories which has been submitted to the I3C for approval [2]. These data identifiers are unique because they incorporate technical metadata. The LSID is a two-tiered naming system which separates the name and the physical location of the data item. The name of the data item is based on the DNS domain of the authority defining the data, a namespace which denotes a particular database, and an object ID, which is an alphanumeric identifier unique within that database. An optional field containing a unique integer representing the version of the object ID can also be incorporated into the LSID [3]. The general form of an LSID, followed by an example for an entry in a Swiss-Prot database housed at the European Bioinformatics Institute, is shown below:

Urn:LSID:<AuthorityID>:<NamespaceID>:<ObjectID>:<Version>
Urn:LSID:ebi.ac.uk:swiss-prot:P10166:3

Data items in the 0.1 version of the MIR and in other external repositories are identified using an alphanumeric or integer identifier. These identifiers are meaningless without knowledge of the database that they refer to and the location of this database. For example, consider an integer which is the unique identifier for a gene in a database. Which database is the identifier referring to, and where is the physical location of this database? For these reasons, it is suggested that the MIR supports the use of the LSID to distinguish data between multiple mygrid information repositories and other external databases.

Adoption of the LSID to describe the origin and ownership of data will also benefit the recording of provenance in mygrid. It is essential that the origin of data being analysed is known, and this will be conveniently provided by its LSID. In addition, the LSID standard provides a way of incorporating versioning metadata into identifiers, thereby enabling the MIR to distinguish between two or more versions of data.
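The LSID layout given above (authority, namespace, object ID and an optional integer version) lends itself to a small parser. The following Python sketch assumes only the colon-separated form shown in this section; the `LSID` type and `parse_lsid` function are hypothetical names, and the sketch does not attempt to implement the full LSID specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str          # DNS domain of the authority defining the data
    namespace: str          # denotes a particular database
    object_id: str          # identifier unique within that database
    version: Optional[int]  # optional integer version of the object


def parse_lsid(text: str) -> LSID:
    """Split an identifier of the form Urn:LSID:authority:namespace:object[:version]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(not part for part in parts[2:]):
        raise ValueError(f"empty field in LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    # The version field is optional; the text describes it as an integer.
    version = int(parts[5]) if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, version)


entry = parse_lsid("Urn:LSID:ebi.ac.uk:swiss-prot:P10166:3")
unversioned = parse_lsid("Urn:LSID:ebi.ac.uk:swiss-prot:P10166")
```

Because the authority and namespace travel with the object ID, two repositories can hold the same object ID without ambiguity, and two versions of the same object can be told apart by the final field.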

5.2 Metadata storage

An open issue is where and how metadata should be stored in the mygrid environment so that it can be queried by mygrid services. In mygrid 0.1, the metadata required by the MIR and other mygrid services was stored in the MIR itself. The types of metadata stored in the MIR were defined by its database schema. Consequently, this schema had to be modified every time a new type of metadata had to be stored in the MIR. The Resource Description Framework (RDF) has been suggested as a more flexible framework for describing data [4]. If RDF repositories are used for metadata storage, then the MIR will be required to query and manipulate metadata stored in this format.

Another issue is the location of the metadata repository. Since the MIR relies heavily on metadata to perform its data operations, it is natural to continue to store metadata with data in the MIR.

6 Provenance

This section describes the role of the MIR in provenance in mygrid. It also describes the relationship between workflows and the provenance of the data operations made by users on the MIR.

6.1 Contents of Provenance

Provenance is a form of metadata that describes the history and origin of data and thereby gives an indication of the value and trustworthiness of the data. This is essential in the life sciences, where research is based on analysing data which has been created and maintained by someone else, for example the DNA and protein entries in the EMBL and Swiss-Prot databases maintained by the European Bioinformatics Institute. In a similar fashion, users will also want to know how data in their personal repositories has been generated so that they can ascertain the quality and value of their data. This requires the provenance record to be tightly linked to the data it describes in the MIR.

A provenance record can be considered as an audit trail which traces the sourcing, moving and processing of data by recording the metadata describing each of these steps. Metadata associated with a provenance record for MIR data generated by the execution of workflows includes its date of creation, its owner and the bioinformatics services which were used in creating the data. These types of metadata form a core set which should be available in provenance records. However, other types of metadata will also be required in provenance records, specific to the type of data analysis being performed by the user. These types of metadata will consist of the parameters used in services within workflows. For example, a biologist might want to know what E-value threshold was used in determining homology between sequences in a BLAST analysis. This will require a flexible means of adding new metadata to the provenance record for the interpretation and analysis of data by the user.

6.2 Workflows and provenance of MIR usage

The operations that are performed by users on data in the MIR can also be considered as metadata which is important to provenance. As more functionality is added to the MIR and used for manipulating and transforming data, writing these operations to the provenance record will be essential. Moreover, the data operation steps in transforming data from one format to another may form the basis of mini-workflows which users may want to repeat, share with other users and incorporate into other workflows.
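The split described in Section 6.1 between a core set of provenance metadata (creation date, owner, services used) and analysis-specific parameters can be sketched as a simple record type. This is an illustrative Python sketch only; the class name, field names and all values are hypothetical and do not reflect the actual mygrid provenance format, which the text says is XML.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class ProvenanceRecord:
    # Core metadata that every provenance record should carry.
    data_id: str
    created: date
    owner: str
    services: List[str]
    # Open-ended, analysis-specific parameters (for example a BLAST
    # E-value threshold); new keys can be added without a schema change.
    parameters: Dict[str, str] = field(default_factory=dict)


# Hypothetical values, invented for this sketch.
record = ProvenanceRecord(
    data_id="Urn:LSID:ebi.ac.uk:swiss-prot:P10166:3",
    created=date(2002, 11, 1),
    owner="bio3",
    services=["BLAST"],
)
record.parameters["e_value_threshold"] = "1e-5"
```

Keeping the extensible part in a separate mapping is one way to meet the requirement for "a flexible means of adding new metadata" without altering the core fields.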

7 Query Capability

This section describes the requirements for querying data in the MIR.

7.1 Data Formats

The MIR must have the means to retrieve any data which has been stored there by its users. Since data will be stored as relational data, XML and flat files, the MIR will need to be able to query data at all levels of granularity associated with these formats in order to select the specific piece of data which is required for retrieval.

Relational data

Structured Query Language (SQL) is the standard language used for querying relational data in relational database management systems. It is therefore essential that the DBMS used for storing relational data implements ANSI SQL. Operators will be required to compare and categorise data, and aggregate functions will be required to summarise data. Relational data will also have to be sorted using group functions and textually restructured using character functions. More sophisticated database queries will require joining tables to integrate data between relations, and the use of subqueries to define unknown data and to combine multiple queries into one.

XML data

The MIR will need to query all parts of an XML document. This will require the MIR to support an XML query language such as XPath that allows queries to specify location paths through the data. These location paths are the sequences of XML tags which are required to identify elements and attributes for extraction. The DBMS should also allow elements or attributes in XML documents to be indexed if they are frequently queried.

Flat file data

The querying of data in flat files is addressed in the Data Processing and Transformation section of this document.

7.2 Metadata querying

RDF has been suggested as a framework for storing metadata in mygrid. If this is the case, then the information repository will need to be able to query, retrieve and manipulate metadata in RDF repositories in order to perform its data operations. This will require the information repository to support the use of Jena or another application programming interface which can be employed to query and manage RDF.

7.3 Free text searching

The ability to query free text found in flat files, XML and relational data stored in the MIR will be a useful function since biologists often want to retrieve information based on a keyword search. For example, a scientist might want to find out which projects are developing a drug compound against tyrosine kinase receptors, which will require a search against the MIR using the keywords tyrosine, kinase and drug.
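The XPath location paths mentioned in Section 7.1 can be illustrated with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The XML document below is invented for this sketch; its tag and attribute names are assumptions and do not follow BSML, BioML or any other schema named in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element and attribute names are made up.
DOCUMENT = """
<sequences>
  <entry id="P10166" type="protein">
    <description>hypothetical entry</description>
    <residues>MKTAYIAKQR</residues>
  </entry>
  <entry id="X00001" type="dna">
    <description>another hypothetical entry</description>
    <residues>ATGCGTAC</residues>
  </entry>
</sequences>
"""

root = ET.fromstring(DOCUMENT)

# Location path: every <residues> element inside an <entry> whose
# type attribute equals "protein".
protein_residues = [
    element.text
    for element in root.findall("./entry[@type='protein']/residues")
]

# Location path selecting all entries, from which an attribute is extracted.
entry_ids = [element.get("id") for element in root.findall("./entry")]
```

The path is a sequence of tags with an attribute predicate, which is the granularity of access the requirement asks for: a single element or attribute can be selected without retrieving the whole document.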

8 Distributed Query Processing

This section describes the requirements for distributed query processing (DQP) in mygrid.

8.1 Types of repositories in mygrid

Data repositories in the mygrid user environment can be arbitrarily grouped into three types: the MIR, public databases such as EMBL and Swiss-Prot, and external user repositories. These latter repositories are proprietary databases of biological data that users wish to integrate with mygrid. The analyses that users want to perform may involve federating data which has been distributed amongst these three types of repositories. Combining data from these repositories will require DQP to integrate the data contained within them.

8.2 Data Formats

The latest work on DQP in the Information Repository Management work package has been concerned with federating structured data from relational and object-orientated databases [5]. Since data in the MIR will also be stored as flat files and XML, DQP will be required to integrate data in these formats whilst masking the differences, idiosyncrasies and implementation of the underlying data source from the user.

8.3 Distributed Query Processing and Workflows

mygrid users will want to incorporate DQP into their workflows. An example of the use of DQP is to compare data, for example a gene sequence, referenced by the same identifier in local and remote repositories and then send the latest version to a mygrid service for analysis. This step should be incorporated into workflows if the latest versions of data from repositories are required for analysis by the user.

8.4 Wrapping of public databases

The DQP service being provided by WP3 will require the databases containing the data being federated to be wrapped using OGSA-DAI Grid services. The use cases provided by the mygrid User Group indicate that EMBL, Swiss-Prot and PDB/MSD would be popular candidates amongst the biologists and bioinformaticians for OGSA-DAI service wrapping. Using these databases, scientists can obtain information on genes, proteins and protein structures which, for example, can be used to study the relationship between protein sequence and protein structure, which is an area of research for Bioinf2 [1]. Furthermore, the biologist referred to as Bio3 in the mygrid use cases studies single nucleotide polymorphisms (SNPs). She might want to integrate her SNP data, which might be stored in the MIR, with data from EMBL, Swiss-Prot and PDB/MSD to determine whether her SNPs in genes are silent or produce a change in the corresponding protein sequence and protein structure [1].

8.5 Views

See the Personalisation section in this document.
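The workflow step described in Section 8.3 (compare records that share an identifier across local and remote repositories, then pass on the latest version) can be sketched as follows. This Python sketch works on in-memory lists and is not a DQP implementation; the `VersionedRecord` type, the `latest_versions` function and the example records are all hypothetical, and it assumes that versions are comparable integers.

```python
from typing import Dict, Iterable, NamedTuple


class VersionedRecord(NamedTuple):
    identifier: str  # shared identifier, e.g. for a gene sequence
    version: int     # assumed to increase with each update
    source: str      # which repository the record came from
    payload: str     # the data itself


def latest_versions(
    *repositories: Iterable[VersionedRecord],
) -> Dict[str, VersionedRecord]:
    """Keep, for every identifier, the record with the highest version."""
    latest: Dict[str, VersionedRecord] = {}
    for repository in repositories:
        for record in repository:
            current = latest.get(record.identifier)
            if current is None or record.version > current.version:
                latest[record.identifier] = record
    return latest


# Hypothetical contents of a local MIR and a remote public database.
local = [
    VersionedRecord("gene-1", 2, "MIR", "ATGC"),
    VersionedRecord("gene-2", 1, "MIR", "GGCC"),
]
remote = [VersionedRecord("gene-1", 3, "public", "ATGCA")]

chosen = latest_versions(local, remote)
```

In a real deployment the two record sets would come from wrapped data services rather than lists, but the selection rule a workflow needs is the same.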

9 Data Processing and Transformation

This section describes the requirements for transforming data between different formats. It also describes the requirements for processing data held in the information repository into management information and the caveats of producing this type of information.

9.1 Data Transformation

The transformation of data between different formats is a task that is frequently undertaken by bioinformaticians and biologists. Many of the in silico analyses undertaken by the scientists interviewed by the mygrid User Group involved workflows composed of multiple computational steps of database queries and applications of analytical tools or algorithms. However, users were continually faced with the problem of transforming, exporting or saving their data in a different format that could be imported into another application, which is a tedious and time-consuming process. Due to these interoperability problems, there is a requirement for the MIR to seamlessly marshal the flow of data from one application or repository to the next. This involves manipulating or transforming the saved intermediary data into the form required by the following application.

The use of XML to store data has solved some of these problems, but data present in XML elements may also need to be manipulated. For example, the data analysis performed by the biologist referred to as Bio3 in the use cases involved identifying signal peptides in protein sequences using a bioinformatics application called SignalP. Since only the first 60 amino acids of a protein sequence are required by SignalP to perform its service, Bio3 was required to write a Perl script which extracted the first 60 amino acids from the protein sequences which she wanted to analyse.

User-Defined Functions

Due to the diversity of flat file formats in the life sciences, the ability to create and execute user-defined functions (UDFs) will be required for data transformation in the DBMS used to implement the MIR. A UDF is a piece of procedural functionality created in the DBMS using a host programming language that can be incorporated into a SQL statement. Open source libraries such as BioJava, which contain classes that can manipulate bioinformatics data, could be employed in the creation of UDFs. For example, classes are available in BioJava which can be used to transcribe DNA to RNA, perform in silico enzymatic digests of proteins and convert between flat file formats. MIR mediation of the data flow between services in workflows will provide gains in computational performance which will not be realised if data transformation is performed by the user portal or by a third-party transformation service.

9.2 Management Information Service

Management information systems is a term for the computer systems in an organisation that provide information about its business operations. This might be information about sales, inventories and other data that would help in managing the running of the organisation. Value can be added to the MIR by processing data from all personal repositories into management information which can then be used to aid managers in their decision making. For example, a manager might need to justify the decision to purchase a licence to use a commercial database of genomic information after the 90-day trial period. Management information on how

often the genome database has been queried during its trial period would be useful in deciding whether it is worthwhile buying a licence.

Generation of management information requires the data stored in the MIR to be processed into a form which managers can use as an aid to decision making in managing their organisations. The generation of management information is a service which should be provided by the MIR. This management information service (MIS) can be delivered using data warehousing techniques, triggers, complex SQL queries or a combination of the three.

Collection of management information calls for responsible information handling, since the management information will include a user's personal data. If the MIS is to be used, then its existence should be made publicly known along with the uses of the management information. Data access security needs to be configurable to prevent the contravention of data privacy laws in the country in which the MIR is deployed. Appropriate security measures such as views should be used to restrict access to sensitive data.
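The user-defined functions discussed in Section 9.1 can be illustrated with a small sketch. It registers two Python functions with SQLite through the standard `sqlite3` module so that they can be called from SQL. SQLite stands in here for whichever DBMS implements the MIR, Python stands in for the host language, and BioJava is not used; the function names, table and data are hypothetical.

```python
import sqlite3


def first_residues(sequence: str, count: int) -> str:
    """Return the first `count` residues, as needed for the SignalP example."""
    return sequence[:count]


def transcribe(dna: str) -> str:
    """Transcribe a DNA coding-strand sequence to RNA by replacing T with U."""
    return dna.upper().replace("T", "U")


conn = sqlite3.connect(":memory:")
# Register the Python functions as UDFs: SQL name, argument count, callable.
conn.create_function("first_residues", 2, first_residues)
conn.create_function("transcribe", 1, transcribe)

conn.execute("CREATE TABLE proteins (name TEXT, residues TEXT)")
# Hypothetical 100-residue sequence, invented for this sketch.
conn.execute("INSERT INTO proteins VALUES (?, ?)", ("hypothetical", "M" + "A" * 99))

# The UDFs are now usable inside ordinary SQL statements.
truncated = conn.execute(
    "SELECT first_residues(residues, 60) FROM proteins"
).fetchone()[0]
rna = conn.execute("SELECT transcribe('ATGC')").fetchone()[0]
```

Performing the truncation inside the query means that only the 60 residues needed by the next service leave the repository, which is the kind of performance gain the section attributes to MIR-mediated data flow.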

10 User Management

This section describes the requirements for user management in the MIR. It also describes the requirements for project working amongst users.

User accounts

A stable user management system is mandatory for maintaining the security of the MIR. Users must be properly managed in order to protect the data held in the MIR. New users of the personal repository will require accounts to be created for them on their organisation's MIR, with the default profile and access privileges that they need to accomplish their duties. In addition, users will need to have their accounts dropped from the MIR if they no longer require access to the personal repository. The ability to alter a user's profile after account creation will be required in case the role undertaken by the user changes.

Project working

Much of the work in the life sciences is undertaken in an environment which requires the sharing of information on a regular basis. A data sharing facility is required which allows people to access project-specific data collections, create and share annotations about data, and view the results of in silico analyses. Managers will need to be able to create projects and provide them with metadata, e.g. a project name, for their identification. The project administrator will need to be able to add members to the project and allocate them the appropriate access permissions and privileges.

Notification

The ability of the MIR to notify users when a piece of data has been inserted, modified (e.g. annotated) or deleted is required for project working. An example of the need for notification during collaborative working is when a user is waiting to analyse data which is being generated by a colleague in the project.

Distributed Annotation System

The creation and sharing of annotations on sequence data could be facilitated by supporting the Distributed Annotation System (DAS) in the MIR [11]. The DAS standard allows scientists to view and compare annotations which are distributed across the web on other DAS servers. In addition to storing DAS annotations, the MIR will need to act as a Grid-enabled DAS annotation server to allow annotations to be shared amongst mygrid users within project teams, research groups and organisations. A Grid-enabled DAS reference server will also be required to provide the genome maps and sequences on which the DAS annotations in the MIR are based.

11 Security

This section describes the requirements for controlling access to data stored in the MIR.

Security Issues

Protection of data from unauthorised usage is of the utmost importance to mygrid users of the information repository. The use of mygrid will be greatly affected by users' confidence in how secure their data is in the MIR and in how it will be used. The data in the MIR should be kept secure using the security measures provided by the DBMS that is employed to implement the MIR. Furthermore, the security measures used in the MIR will have to be integrated with mygrid-wide security processes.

Security in the MIR is important for a number of reasons. Organisations will need to control who has access to their MIR and what kinds of data they can retrieve for viewing, so that data is not disclosed without authorisation. For example, a manager will need to be able to view the data of those mygrid users who are members of his or her group and to decide what to publish and to whom. Malicious and accidental modification and deletion of MIR data also need to be prevented. Such threats might involve the deletion of project data by a hacker or the unauthorised amendment of sequence data by a student.

There are four key areas of data security which the MIR must address: authentication, authorisation, privileges and integrity.

Authentication

Authentication is the process by which the DBMS verifies a user's identity. This is the first layer of security that is required from the MIR. The MIR will be required to integrate its user authentication procedure with the security facility for mygrid services provided by the Architecture work package. That security model is to be based on an X.509 certificate authentication mechanism.

Authorisation

Authorisation is the next layer of security required from the MIR. This is the process by which a DBMS determines which data operations a user can perform and which database objects they can access. The level and type of data access will vary amongst users of the MIR, from a single element in an XML file to all of the data contained in a repository. Authorisation mechanisms must be present in the MIR which can provide this range of granularity in data access.

Role-Based Access and Privileges

The fine level of control over data access required by the MIR can be achieved through the use of privileges. Four types of data access privilege can be granted to MIR users: read, write, update and delete. Providing a large number of individuals with different combinations of these privileges at various levels of granularity will make security difficult to manage in the MIR. Instead, roles could be created for MIR users, with each role configured with only those data access privileges that enable the user to perform their job duties. These roles should be based on the types of users identified by the mygrid User Group: biologist, bioinformatician, system administrator, application tool builder and manager. Database privileges need to be defined for each of these roles so that the correct authority levels are set for data access and manipulation, and system administration. A role will need to be associated with each user when their account on the MIR database is created.

The work undertaken by a user may cross the boundaries of two or more types of mygrid end user. For example, a biologist may also perform activities associated with a bioinformatician. In these cases, users will need to be registered with two or more roles in order to accomplish their work.

Views

The data stored in the MIR is of a sensitive nature since workflows and provenance contain information about the activities of its users on mygrid. Various mygrid services such as the MIS will require access to this data for retrieval and processing. However, this personal information may need to be made anonymous before it is released from the MIR so that users' activities cannot be traced back to a specific individual. This requirement is necessary in order to adhere to any data protection and confidentiality legislation in force in the country where the MIR is deployed. If personal information is required to be made anonymous, the specific pieces of data which identify users can be hidden through the use of views in the DBMS. These views are defined by predefined queries and can therefore be used as a form of security, preventing data such as usernames from being accessed by a service or a user.

Auditing mygrid Information Repository Activities

Authentication and authorisation procedures can be employed to control access to data by known users but are not sufficient for preventing unknown or unauthorised access to the MIR. Monitoring the operations made on the MIR can improve the regulation of data access and ultimately help to prevent unauthorised access. This will require an audit facility which records the data operations performed on the MIR. These operations are then examined to determine whether they were authorised and, if not, who was responsible for performing them.
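The interplay between roles, privileges and auditing described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a design for the MIR: the class AuditedRepository, the usernames and the assignment of privileges to roles are all assumptions, since the requirements name the roles and the four privileges but do not say which role holds which privilege.

```python
from dataclasses import dataclass, field

# Hypothetical privilege sets per role (an assumption for this sketch).
ROLE_PRIVILEGES = {
    "biologist": {"read", "write"},
    "bioinformatician": {"read", "write", "update"},
    "system administrator": {"read", "write", "update", "delete"},
}


@dataclass
class AuditedRepository:
    user_roles: dict                               # username -> set of role names
    audit_log: list = field(default_factory=list)  # every attempted operation

    def is_authorised(self, user, operation):
        # A user may hold several roles; the privileges are their union.
        roles = self.user_roles.get(user, set())
        return any(operation in ROLE_PRIVILEGES.get(r, set()) for r in roles)

    def attempt(self, user, operation, item):
        """Record the attempted operation whether or not it is allowed."""
        allowed = self.is_authorised(user, operation)
        self.audit_log.append((user, operation, item, allowed))
        return allowed

    def unauthorised_attempts(self):
        """Audit step: who attempted which operation without authority."""
        return [(u, op, item) for u, op, item, ok in self.audit_log if not ok]


repo = AuditedRepository({
    "student": {"biologist"},
    "dual_role_user": {"biologist", "bioinformatician"},
})
results = [
    repo.attempt("student", "read", "sequence_1"),
    repo.attempt("student", "delete", "sequence_1"),
    repo.attempt("dual_role_user", "update", "sequence_1"),
    repo.attempt("unknown_user", "read", "sequence_1"),
]
violations = repo.unauthorised_attempts()
```

Because every attempt is logged before the decision is returned, the audit trail covers unknown users as well as known users who exceed their privileges.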

12 Capacity and Performance

This section gives an indication of the volume of data that the MIR will have to accommodate and handle. It also describes the measures that the MIR should adopt to improve its scalability and performance.

Data Volume

The growth of data in the life sciences has been fuelled by high-throughput technologies and the proliferation of computational tools for data analysis and processing. The volume of data in the life sciences is currently estimated to be in the petabyte range [7]. Public databases store data in the range of hundreds of gigabytes; for example, EMBL currently stores approximately 150 gigabytes of genomic data [10].

It is difficult to predict in advance how much data will be stored in the MIR. The figure will depend on the usage of mygrid, which will differ between users and organisations. However, there are performance requirements which apply to every MIR regardless of the number of users or the type of organisation it serves. Low response times for complex queries will be required by applications that retrieve subsets of data for further processing, such as the management information service provided by WP3. In addition, the MIR will have to support high access throughput to cope with large numbers of clients accessing data simultaneously. To this end, the MIR must be designed and deployed as a mission-critical database that can scale to the level of performance required by its users. The MIR is more akin to the databases found at the heart of financial systems than to a typical scientific database.

Indexing

The large volumes of data, provenance and workflows that will be stored in the MIR will have a detrimental effect on the speed of data query and retrieval. The efficiency with which data is queried and retrieved should be improved by using indices in the MIR to reference the data that are most frequently queried by users and mygrid services.

OGSA-DAI

The analyses performed by mygrid users may involve intensive computation over large datasets. This will require the MIR, external user databases and public databases to be wrapped by OGSA-DAI Grid services to provide efficient transfer of data between the user and the data repositories.
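The effect of the indexing requirement above can be illustrated with a small sketch. It uses SQLite through Python's sqlite3 module purely as a stand-in for whichever DBMS implements the MIR; the provenance table, its columns and the index name are hypothetical. The query plan is inspected before and after the index is created on a frequently queried column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of provenance records, each tied to a workflow name.
conn.execute(
    "CREATE TABLE provenance (id INTEGER PRIMARY KEY, workflow TEXT, detail TEXT)"
)
conn.executemany(
    "INSERT INTO provenance (workflow, detail) VALUES (?, ?)",
    [(f"workflow_{i % 100}", f"record {i}") for i in range(10_000)],
)

QUERY = "SELECT id FROM provenance WHERE workflow = ?"


def query_plan(connection):
    """Return SQLite's description of how it would execute QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("workflow_7",))
    # The last column of each plan row holds the human-readable detail.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_provenance_workflow ON provenance (workflow)")
plan_after = query_plan(conn)   # the planner can now search via the index

matching = conn.execute(QUERY, ("workflow_7",)).fetchall()
```

Comparing plan_before with plan_after shows the planner switching from a table scan to a search on idx_provenance_workflow; the exact wording of the plan text varies between SQLite versions.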

13 Fault Tolerance

This section describes the requirements for fault tolerance mechanisms in the MIR. It also describes the procedures required to recover the MIR in the event of a database crash.

Counter measures

Procedures must be available to counter situations which might lead to inconsistencies in the data stored by the MIR. For example, the failure of a long-lived workflow that interacts with the MIR could leave its data in a state that violates referential integrity.

Transactions

There is a requirement to conduct transactions in the MIR so that data operations performed by a user or mygrid service can be grouped into units of work. To minimise the amount of recomputation needed should a workflow fail, the transaction control commands Commit, Rollback and Savepoint must be available in the MIR and used appropriately, depending on whether a transaction executes successfully.

Concurrency Control

Collaborative working carries with it the potential for inconsistencies to occur in data that are shared and operated on by users. These inconsistencies arise when the same piece of data is manipulated simultaneously by data operations in multiple transactions. Concurrency control protocols are required to schedule transactions in such a way that they do not interfere with one another. These protocols may involve locking methods, which deny a transaction access to a data item that is already being accessed by another transaction in the MIR, or timestamping methods, which allow a transaction read/write access to data only if the last update to that data was performed by an older transaction.

Database Backup and Recovery

The MIR will need to be recovered in the event of a failure such as a power failure or a system crash. The following facilities will be required from the MIR for its recovery:

A backup mechanism to make off-line backup copies of the MIR at regular intervals, which can then be used to restore a crashed MIR.

A logging facility to record the current state of transactions and associated data operations. The maintenance of a log file is also required for security, in the audit of database operations, and for supporting user sessions.

A checkpoint facility to record when the MIR and its log file have been synchronised.
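The use of Commit, Rollback and Savepoint to limit recomputation after a workflow failure, as required under Transactions above, can be sketched as follows. The example again uses SQLite via Python's sqlite3 module as a stand-in DBMS; the results table, the step names and the run_workflow function are hypothetical. Each workflow step writes inside its own savepoint, so a failing step is undone without discarding the steps that succeeded.

```python
import sqlite3

# isolation_level=None turns off the module's implicit transactions so that
# BEGIN, SAVEPOINT, ROLLBACK TO and COMMIT are issued explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE results (step TEXT PRIMARY KEY, value TEXT NOT NULL)")


def run_workflow(connection, steps):
    """Run all steps in one transaction; return the names of failed steps."""
    failed = []
    connection.execute("BEGIN")
    for name, compute in steps:
        connection.execute("SAVEPOINT before_step")
        try:
            # The step first registers itself, then stores its result.
            connection.execute(
                "INSERT INTO results VALUES (?, 'pending')", (name,)
            )
            connection.execute(
                "UPDATE results SET value = ? WHERE step = ?", (compute(), name)
            )
        except Exception:
            # Undo only this step's partial writes.
            connection.execute("ROLLBACK TO SAVEPOINT before_step")
            failed.append(name)
        finally:
            connection.execute("RELEASE SAVEPOINT before_step")
    connection.execute("COMMIT")
    return failed


def broken_step():
    raise RuntimeError("simulated service failure")


failed_steps = run_workflow(conn, [
    ("align", lambda: "alignment-output"),
    ("annotate", broken_step),
    ("summarise", lambda: "summary-output"),
])
stored = conn.execute("SELECT step, value FROM results ORDER BY step").fetchall()
```

After the run, only the failed step needs to be recomputed: the partial "pending" row written by the broken step has been rolled back, while the other two results are committed.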

14 Personalisation

This section describes how the Information Repository Management work package can contribute to the personalisation of mygrid for its users. Personalisation may be defined as the process through which the working environment is altered to suit the needs and individual preferences of users. The MIR together with DQP can address personalisation by storing personal annotations of data, provenance and workflows, by integrating legacy data with the MIR, and by providing views of the user's own data in the MIR and in public databases.

Personal Annotation of Data

A laboratory scientist's paper log book acts as a persistent store for their data and provenance, a function similar to that of the MIR. However, the major difference between these two data stores is that the mygrid 0.1 version of the MIR does not record the textual commentary that is also found in log books. This textual commentary is important since it adds further context, from the user's perspective, to the results obtained from experiments. For example, it is common for scientists to comment on the quality of the data they have generated based on the provenance captured while an experiment was running. The MIR should provide a facility that allows users to furnish data, provenance and workflows with personal annotations.

Integration of Legacy Data

Most laboratories will have accumulated legacy data during their existence, and this data may also form the basis of the research performed by their current scientists. In addition, a mygrid user may not wish to store their data in the MIR, or it might not be practical to do so. In these scenarios, there is a requirement for a facility to integrate the MIR with repositories containing the laboratory's legacy data and external user data so that this data can be analysed by mygrid services. This could be achieved using federation middleware such as IBM's DiscoveryLink software or the DQP service currently being developed by work package 3. If the latter is implemented then there is a requirement for the DBMS used to store legacy data to be wrapped with OGSA-DAI services.

Views

In addition to providing security, views can be used to let users access data in a way that is customised to their needs. Most of the biologists and bioinformaticians interviewed worked on multiple projects, running several wet lab and in silico experiments at any one time. The volume of data, provenance and workflows generated from these experiments will soon become difficult to manage without a facility to sort these data into groups. However, the structure and naming of these groups will differ between users since they will depend on the nature of their analyses and their personal preferences. Views should be used to present the data and its associated provenance and workflows in the MIR in the form of a hierarchical file directory, based on metadata that specifies the position of each data item in the hierarchy. Currently, only domain entities such as gene and protein sequences can be sorted into groups in the 0.1 version of the MIR.

Once data has been sorted into user-defined groups, views can then be used as a form of data aggregation to provide summaries of data in the MIR. These summaries are required by biologists who need to record their wet lab and in silico experiments in paper log books. For example, the mygrid user cases describe a biologist referred to as Bio1 who printed out a hard copy of the results of her bioinformatics analyses for storage. The MIR should allow users to collate data from in silico experiments for presentation in a user-friendly format for printing and pasting into paper log books.

Users will only want to view certain types of the metadata associated with domain entities, provenance records and workflows which are found in their flat file or XML file. For example, when a user examines proteins, he or she may only want to see the amino acid sequence of a protein, information about its functional domains and its associated journal reference. The MIR should provide a facility which allows users to specify what types of metadata they wish to view along with a domain entity, provenance record or workflow.

Furthermore, the stock views of data provided by public repositories may not be suited to the user. The display of data entries in public databases should be adaptable to users' requirements. For example, the user may not be interested in all the fields of an EMBL entry and should be allowed to choose the fields, e.g. gene name, description and reference, which they are interested in reading. Views should be employed to display only the data that users are interested in seeing from entries in public repositories. Creating these sorts of views will involve using DQP to select the required data from OGSA-DAI wrapped public repositories.

Notification

All of the scientists interviewed by the mygrid user group indicated that they wished to be notified of changes in the data content of public databases. The changes in data that users wanted to be notified about depended upon their area of research. The granularity of the data changes that users wanted to be notified about also varied from user to user. For example, Bio1 wanted to be notified of any changes in sequence data relating to chromosomes 9 and 10 in humans, whilst Bioinf4 was only interested in changes in those genes expressed in excretory cells in C. elegans.

Profiles

Personalisation depends on gathering information on the activities of users and storing this information persistently, so that the personalisation of mygrid is still in place when users return to use it. Whilst this information can be requested directly from the user, it can also be actively gathered from data held in the MIR, using complex queries similar to those found in the WP3 management information service, and used to create a user profile. This user profile can then be employed to tailor mygrid services and components to the work performed by the life scientist. For example, knowledge of a user's field of research can be used to configure the notification service so that the user is notified about entries in public databases based on keywords obtained from their user profile. Similarly, information about the services which a user frequently uses can be employed to arrange services or tools on the portal GUI in a manner specific to the needs of the user.
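The way a user profile could drive the notification service can be sketched as follows. This is an assumption-laden illustration: the classes UserProfile and DatabaseChange, the idea of tagging each change with keywords, and the entry identifier are all hypothetical, and the interests shown are loosely modelled on the Bio1 and Bioinf4 examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    username: str
    keywords: frozenset  # research interests held in the profile


@dataclass(frozen=True)
class DatabaseChange:
    database: str
    entry_id: str        # hypothetical identifier of the changed entry
    tags: frozenset      # keywords describing the changed entry


def users_to_notify(change, profiles):
    """Return the users whose profile shares at least one keyword with the change."""
    return sorted(p.username for p in profiles if p.keywords & change.tags)


profiles = [
    UserProfile("Bio1", frozenset({"human chromosome 9", "human chromosome 10"})),
    UserProfile("Bioinf4", frozenset({"c. elegans excretory cell"})),
]
change = DatabaseChange(
    database="EMBL",
    entry_id="hypothetical-entry-1",
    tags=frozenset({"human chromosome 9", "sequence update"}),
)
recipients = users_to_notify(change, profiles)
no_recipients = users_to_notify(
    DatabaseChange("EMBL", "hypothetical-entry-2", frozenset({"mouse"})), profiles
)
```

Matching on sets of keywords keeps the granularity of notification under the control of each profile: a user with narrow interests simply holds fewer or more specific keywords.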
