
UNIVERSIDAD CARLOS III DE MADRID
ESCUELA POLITÉCNICA SUPERIOR
DEPARTAMENTO DE INFORMÁTICA

Semantically-Enhanced Bioinformatics Platform (SEBIO)

Report (Memoria) — Convocatoria Movilidad de Jóvenes Doctores

Author: Dr. Juan Miguel Gómez
Supervisor at the host institution: Prof. Dr. Ying Liu
Host institution: Laboratory of Bioinformatics and Medical Informatics, University of Texas at Dallas (UTD)

Table of Contents

1 Introduction
  1.1 Context and Objectives
  1.2 Contributions of this Report
  1.3 Report Organization
2 Problem Statement
  2.1 Problem Scenarios
    2.1.1 Real World Scenarios
    2.1.2 Semantic Heterogeneity
3 Related Work
4 SEBIO Fundamental Concepts
  4.1 Bioinformatics
  4.2 The Semantic Web
  4.3 SEBIO Components
5 MASIA: A Micro-Array Information and Data Integration Semantics-based Architecture
  5.1 Introduction and Goals
  5.2 Micro-Array Data Sources
    5.2.1 Heterogeneity in Biological Data
    5.2.2 Micro-Array Data Sources
    5.2.3 The MGED Ontology
  5.3 Micro-Array Data Integration
  5.4 The MASIA Approach
  5.5 The MASIA Software Architecture
6 BLISS: A Biomedical Literature Social Ranking System
  6.1 Introduction and Goals
  6.2 Collaborative Discovery
  6.3 Bridging the Gap: Social Semantics
  6.4 BLISS: A Biological Literature Social Ranking System
7 BIRD: Biomedical Information Integration and Discovery with Semantic Web Services
  7.1 Introduction and Goals
  7.2 BIRD: Biological Information Integration Discovery
  7.3 Needle in a Haystack: Dealing with Semantic Web Services
  7.4 Using BIRD for Biomedical Information Integration
8 Conclusions and Future Work
References
  Printed Publications
  World Wide Web Resources

1 Introduction

This report summarizes and outlines the scope of the joint scientific work and collaboration that took place during a two-month research stay in the context of the Convocatoria de Movilidad de Jóvenes Doctores de la Universidad Carlos III de Madrid. The research was conducted in the Laboratory of Bioinformatics and Medical Informatics of the University of Texas at Dallas (UTD), headed by Prof. Dr. Ying Liu, one of the most prominent researchers in the area.

In recent years, technological advances in high-throughput techniques and efficient data gathering methods, coupled with a worldwide effort in computational biology, have resulted in a vast amount of life science data, often available in distributed and heterogeneous repositories. These repositories contain valuable information such as sequence and structure data, annotations for biological data, results of complex and expensive computations, genetic sequences and multiple bio-datasets. However, the multiplicity and heterogeneity in the objectives, methods, representation and platforms of these data sources and analysis tools have created an urgent and immediate need for research in resource integration and platform-independent processing of investigative queries involving heterogeneous data sources and analysis tools. It is now universally recognized that a database approach to the analysis and management of biological data offers a convenient, high-level and efficient alternative to high-volume biological data processing and management.

The Semantic Web and Semantic Web Services paradigm promises a new level of data and process integration that can be leveraged to develop novel high-performance data and process management systems for biological applications. Biomedical ontologies constitute a best-of-breed approach for addressing the aforementioned problems: using semantic technologies as a key technology for the interoperation of various datasets enables the integration of the vast amount of biological and biomedical data.

In this research proposal, we aim at providing a Semantically-Enhanced Bioinformatics Platform (SEBIO) that effectively handles:

- Conceptual Models for Biological Data
- Use of Semantics to manage the interoperation of Biomedical datasets
- Biomedical Data Engineering using ontologies

- Support of Ontologies for Biological Information Retrieval and Web Services

Our SEBIO approach supports knowledge discovery and provides a semantically-enhanced solution to harvest and integrate information from text, biological databases, ontologies and terminological resources. This would, for example, be used with large-scale knowledge bases such as the Gene Ontology. Equally, we will benefit from the existence of semantically annotated corpora for testing and training. We will also focus on the study of topics of paramount importance for efficient semantic mining based on NLP techniques.

1.1 Context and Objectives

The goal of this research stay is to start and advance an international research collaboration line that enables mutual synergy and knowledge transfer and exchange. This would translate into fruitful cooperation between the Universidad Carlos III and the SoftLab Group on one side, and the University of Texas at Dallas and the group of Prof. Dr. Ying Liu on the other. These objectives have been fully achieved, as this report will show, establishing a two-way scientific cooperation solidly built on the accomplishments of the SEBIO proposal. In the following sections, we document this outcome and the contributions of this report.

1.2 Contributions of this Report

Technology is a means of optimizing benefits and resources for a number of disciplines, particularly research disciplines. Several of them, such as bioinformatics, where information integration is critical, could benefit from harnessing the potential of new approaches such as the Semantic Web and Semantic Web Services. We believe that the Semantic Web and Semantic Web Services paradigm (see section 4.2 for further details) promises a new level of data and process integration that can be leveraged to develop novel high-performance data and process management systems for biological applications. Using semantic technologies as a key technology for the interoperation of various datasets enables the integration of the vast amount of biological and biomedical data. The breakthrough of adding semantics to such data is depicted in Figure 1. In a nutshell, the use of knowledge-oriented biomedical data

integration would lead to achieving Intelligent Biomedical Data Integration, which will bring biomedical research to its full potential.

Figure 1. Intelligent Biomedical Data Integration

In the following sections, we will unfold how the SEBIO approach contributes to, and sets itself as a cornerstone for, the development of Intelligent Biomedical Data Integration.

1.3 Report Organization

This report is organized as follows. In chapter 2, the problem statement is presented: we describe the context of our work and, particularly, our domain of interest; we illustrate our motivation with precise problem scenarios and use them to formulate the leading guidelines of our work. In chapter 3, we review the related work and the state of the art in bioinformatics and biomedical research.

The gist of our work is discussed in chapters 4, 5, 6 and 7, where the SEBIO approach is unfolded. Finally, the report closes with our conclusions and future work.

2 Problem Statement

As discussed in [Cohen, 04], it is undeniable that, among the sciences, biology played a key role in the twentieth century, and that role is likely to acquire further importance in the years to come. Over the past fifteen years we have witnessed a dramatic transformation in the practice of life sciences research. We have already picked much of the proverbial low-hanging fruit of dominant mutations and simple diseases. Chronic and more complex diseases, as well as efforts to design microbes for engineering needs or to uncover the basis of genetic repair, need the ladder of IT to reach the higher branches in living systems. At the same time, technological improvements in sequencing instrumentation and automated sample preparation have made it possible to create high-throughput facilities for DNA sequencing, high-throughput combinatorial chemistry for drug screening, high-throughput proteomics, high-throughput genomics, etc. As a consequence, what was once a cottage industry marked by scarce, expensive data obtained largely through the manual efforts of small groups of graduate students, post-docs and a few technicians has become industrialized and data-rich, marked by factory-scale sequencing organizations.

In the wake of the work of Watson and Crick [Watson and Crick, 2003] and the sequencing of the human genome, far-reaching discoveries are constantly being made. One of the central factors promoting the importance of biology is its relationship with medicine: fundamental progress in medicine depends on elucidating some of the mysteries that occur in the biological sciences. However, biomedical research is now information-intensive; the volume and diversity of new data sources challenge current database technologies. The development and tuning of database technologies for biology and medicine will maintain and accelerate the current pace of innovation and discovery. There are four main classes of situations in which data management technology is critical to supporting health-related goals:

- The rapid construction of task-specific databases to manage diverse data for solving a targeted problem.
- The creation of data systems that assist research efforts by combining multiple sources of data to generate and test new hypotheses, for instance about diseases and their treatments.
- The management of databases that accumulate data supporting entire research communities.
- The creation of databases to support data collection and analysis for clinical and field decision support.

Such data integration is technically difficult for several reasons. First, the technologies on which different databases are based may differ and do not interoperate smoothly; standards for cross-database communication are needed to allow the databases (and their users) to exchange information. Second, the naming conventions for many scientific concepts (such as individual genes, proteins or drugs) in fast-developing fields are often inconsistent, so mappings are required between different vocabularies. Third, the underlying biological model for the data may differ (scientists view things differently), so integrating these data requires a common model of the relevant concepts and their allowable relations. This reason is particularly crucial because unstated assumptions may lead to improper use of information that, on the surface, appears to be valid. Fourth, as our understanding of a particular domain improves, not only will data change, but even database structures will evolve; any user of a data source, and in particular any data integrator, must be able to manage such data source evolution. These problems will be tackled in more detail in the problem scenarios of the following section.

2.1 Problem Scenarios

In this section, we first focus on three real-world scenarios where the aforementioned problems are found. Subsequently, we focus on the major problem that encompasses them, namely semantic heterogeneity. We finally discuss its major consequences and how we will tackle this problem.

2.1.1 Real World Scenarios

The first real-world scenario is the Four Corners hantavirus outbreak. Identifying new pathogens used to take months to years; the identification of the Legionnaires' disease and AIDS pathogens are cases in point. However, in 1993, when healthy young people in the American southwest began to die from an unknown pathogen, the virus responsible was identified in only one week using a combination of molecular biology and bioinformatics approaches. Traditional immunological approaches were only able to suggest that the virus involved in the Four Corners epidemic was distantly related to a family of viruses known as hantaviruses, not enough information to prevent or treat the disease. DNA sequences of related viruses in the hantavirus family were retrieved from DNA sequence databases and allowed the design of molecular probes (PCR primers), which were used in the first definitive test for the virus (confirming it as the pathogen) and allowed the determination of the DNA sequence of the new virus. In turn, the DNA sequence allowed the identification of the new virus's closest relatives (viruses found in Korea), which shared similar animal vectors (rodents) and produced similar symptoms. Because the Four Corners hantavirus produces symptoms that resemble those of cold or flu before progressing to pulmonary arrest and sudden death, the assay developed based on sequences found in DNA sequence databases was critical in stopping the spread of this epidemic. If this information had not been available online, well described and searchable, it might have taken several years and many deaths before this pathogen was identified. In the intervening ten years, electronic data resources have continued to grow, leading to ongoing challenges in building the kind of integrated, online resources needed to attack similar diseases. The 2003 SARS threat underlines this need.

A second real-world scenario is related to the WTC victim identification. After the tragedy of September 11, 2001, the police officers in New York City had the task of identifying the remains of victims, so that they could be returned to family members. Existing database systems were built predominantly on the assumption that individual remains would be found and identified on a one-by-one basis; the possibility of more than 3,000 victims and tens of thousands of samples was never considered in the design of the initial database system.

There are two sources of DNA in the tissues: nuclear and mitochondrial. Each of these sources has a number of attributes that can be measured, and the combination of these attributes tends to be unique for individuals, thus allowing identification. Given a sample of known origin (taken from the personal effects of the victims and gathered

from their families), it can be compared with the profile of attributes gathered from the unknown samples, and matched. In many cases, additional evidence is required, including DNA samples from parents and siblings (who share some, but not all, DNA attributes with their relatives), information about where the remains were found, information about which personal effects were used for identification and the contact information of all the people reported as missing. To manage these data, the investigators built a complex system using cutting-edge database technology and state-of-the-art understanding of how to use genetics and other evidence to identify victims. The resulting tool continues to evolve, but has assisted in the identification of many victims and the return of their remains to loved ones. Although this database was built under extraordinary circumstances, the need for urgent assembly and integration of data and the provision of novel analytic capabilities based on these data occurs routinely both in biomedical research and in the delivery of healthcare. When these needs arise, it is too late to perform the essential background research to support these efforts, so they must be anticipated in order to respond in a timely manner to urgent needs.

Finally, the malaria studies real-world scenario also deals with information integration. The malaria parasite, Plasmodium falciparum, is responsible for nearly 1.1 million deaths annually of children under the age of five. One of the great scientific achievements of 2002 was the publication of the full genome (the DNA sequence) of both the parasite and the mosquito (Anopheles gambiae) that carries it to human victims, and the first public release of the full genome database (PlasmoDB) [PlasmoDB, 03]. For the first time ever, we have the complete triad of genomes involved in this disease (the parasite, the vector mosquito and the human host). A primary health goal is to develop new drugs to effectively treat and perhaps eradicate malaria as a major threat to human health. The genome database provides the list of the genes that are present in the parasite, but does not organize these genes into the pathways and networks of interactions that could be used to understand the underlying wiring of the parasite and how it works. Fortunately, there are other databases, including the MetaCyc database [MetaCyc, 03] of metabolic pathways, that can be used to assemble the genes into the metabolic machine that makes the parasite run. With a clear picture of this machine, we are able to identify vulnerable regions that can be targeted for interference with new drugs. In order to validate these targeted metabolic capabilities, we use other research databases (revealing where and when genes are turned on and off, including micro-array databases and proteomics databases) in order to prioritize the possible targets and assess their

likelihood of success. Given a set of genes that would be good targets, we can further filter them by comparing them to human genes in order to help ensure that the new drugs will not be toxic for human use. In some cases, the gene targets are proteins with known three-dimensional structures (or strong similarity to known structures), stored in the Protein Data Bank (PDB); in those cases we can explore the detailed atomic structure of these proteins, and use databases of existing compounds to get a detailed understanding of how a potential drug might actually interact with its target and what modifications might make the drug more potent. At the end of this pipeline, then, we will have a relatively small set of candidates for further drug development that have been filtered using disparate information sources, each of which provides a unique type of information. The resulting drugs can then be tested experimentally, and the process of drug discovery has begun taking full advantage of all the relevant data sources upfront, thus decreasing the time to useful new drugs.

2.1.2 Semantic Heterogeneity

The need to manage bioinformatics data has been coming into increasingly sharp focus for some time. Years ago, these data sat in silos attached to specific applications. Then the Web came into the arena, bringing the hurly-burly of data becoming available across applications, departments and entities in general. However, throughout these developments, a particular underlying problem has remained unsolved: data resides in thousands of incompatible formats and cannot be systematically managed, integrated, unified or cleansed. To make matters worse, this incompatibility is not limited to the use of different data technologies or to the multiple flavors of each technology (for example, the different relational databases in existence); the most challenging incompatibility arises from semantic differences. In principle, each data asset is set up with its own world-view and vocabulary, i.e. its schema. This incompatibility exists even if both assets use the same technology. For example, one database could have a table called Protein A, intending to model a particular protein, classifying its function, categorizing it and relating it with some other proteins. Another database could simply refer to the same concept (the very same Protein A) as Protein Alfa, sub-divided in a different way, related to some other proteins and linked to various functions. Since both records (despite describing the same protein) present such dissimilarities,

they will never be related or co-related. If a particular researcher wants to know all the information about Protein A, he or she will not be able to obtain a complete overview of the information, since these sources are absolutely unrelated. In a larger context, this problem may be multiplied by thousands of data structures located in hundreds of incompatible databases and message formats. And the problem is growing: bioinformatics-related techniques continue to gather more data, reengineer data-intensive processes and integrate with more sources. Moreover, developers continue to write new applications and to create new databases based on requests from users, without worrying about overall data management issues.

3 Related Work

In this section, we focus on the related work surrounding the scope of this work. We first give the rationale for the importance of bioinformatics and information integration by providing a background and context for them. We also explain the requirements that motivate this work and provide a contextualized and well-referenced description of the related work.

Integration of heterogeneous data in life sciences is a growing and recognized challenge. As discussed in [Gopalacharyulu et al, 05], several approaches for biological data integration have been developed. Well-known approaches include rule-based links such as SRS [Etzold and Argos, 93] or [Etzold et al, 96], federated middleware frameworks such as the Kleisli system [Davidson et al, 97] or [Chung and Wong, 99], as well as wrapper-based solutions such as IBM DiscoveryLink [Haas et al, 01]. In parallel, progress has been made to organize biological knowledge in a conceptual way by developing ontologies and domain-specific vocabularies, such as in [Ashburner et al, 00] or [Bard and Rhee, 04].

With the emergence of the Semantic Web, the ontology-based approach to life science data integration has become more prominent. In this context, data integration comprises problems like homogenizing the data model through schema integration, combining multiple database queries and answers, and transforming and integrating the latter to construct knowledge based on the underlying knowledge representation. However, the ontology-based approach cannot solve the problem of evolving concepts in biology, and its best

promise lies in specialized domains and environments where concepts and vocabularies can be well controlled [Searls, 05].

A similar approach to the work presented here has been followed in [Gopalacharyulu et al, 05]. Their integration approach is based on the premise that relationships between biological entities can be represented as a complex network. Context dependency is achieved by a judicious use of distance measures on these networks. The biological entities and the distances between them are mapped, for the purpose of visualization, into a lower-dimensional space using Sammon's mapping. Finally, their system implementation is based on a multi-tier architecture using a native XML database and software tools for querying and visualizing complex biological networks. However, the approach is hampered by the fact that the mappings are stated at a pure XML level, without taking their particular semantics into account, and hence it cannot exploit the semantics inherent to the data formats. In this work, we present a novel approach to achieve the integration of micro-array data stemming from massive data gathering experiments, and our future work will focus on finding more use cases and real-world scenarios to validate the efficiency of our approach and determine the feasibility of the semantic matching of lightweight ontologies and mappings in particular contexts.

This section has described several requirements extracted from our example and provided definitions and contextualized explanations of them. In the next section, we will examine how these requirements are faced in the current state of the art.

4 SEBIO Fundamental Concepts

In this section, we define the fundamental concepts encompassed by a Semantically-Enhanced Bioinformatics Platform (SEBIO): what bioinformatics is and what its main goals are, what the Semantic Web is and how it can help with data and information integration, and finally how we partition our approach into three main projects realizing this vision.

4.1 Bioinformatics

There is no commonly recognized worldwide definition of bioinformatics, and there are several standpoints on current bioinformatics goals. Fundamentally, [Cohen, 04] claims the main role of bioinformatics is to aid biologists in gathering and processing genomic data to study protein function, and to aid medical researchers in making detailed studies of protein structures to facilitate drug design. From a more general perspective, these goals can be outlined as follows:

- Inferring a protein's shape and function from a given sequence of amino acids.
- Finding all the genes and proteins in a given genome.
- Determining sites in the protein structure where drug molecules can be attached.

Hence, the major role of bioinformatics is to help infer gene function from existing data, this data being varied, incomplete and noisy. For that, a number of techniques, strategies and approaches are summarized in the following, as explained in [Cohen, 04]:

- Comparing Sequences: Given the huge number of gene sequences available, algorithms to compare them must be developed that allow deletions, insertions and replacements of symbols representing nucleotides or amino acids, for such transmutations occur in nature.
- Constructing Evolutionary (Phylogenetic) Trees: Often constructed after comparing sequences from different organisms, these trees group the sequences according to their degree of similarity. In particular, they serve as a guide to reasoning about how these sequences have been transformed through evolution.
- Detecting Patterns in Sequences: Using machine learning, probabilistic grammars or neural networks, the goal here is to detect parts of DNA and amino acid sequences.

- Determining 3D Structures from Sequences: Inferring shapes from RNA sequences has proved difficult and remains an unsolved problem (of cubic complexity).
- Inferring Cell Regulation: Genes interact with each other, and proteins can also prevent or assist the production of other proteins. What drives the behavior of these interactions? It is relevant to study the role of a gene or protein in a metabolic or signaling pathway.
- Determining Protein Function and Metabolic Pathways: The objective here is to interpret human annotations for protein function and also to develop databases representing graphs that can be queried for the existence of nodes (specifying reactions) and paths (specifying sequences of reactions).
- Assembling DNA Fragments.

A different approach is taken by [Ignacimuthu, 05], where the definition and main goals of bioinformatics are split into aims, tasks and areas. The main aims of bioinformatics are as follows:

- To organize data in a way that allows researchers to access existing information and to submit new entries as they are produced.
- To develop tools and resources that aid in the analysis and management of data.
- To use these tools to analyze the data and interpret the results in a biologically meaningful manner.

The main tasks involve the analysis of sequence information, which implies the following:

- Identifying the genes in the DNA sequences of various organisms.
- Developing methods to study the structure and function of newly identified sequences and corresponding structural RNA sequences.
- Identifying families of related sequences and developing models.
- Aligning similar sequences and generating phylogenetic trees to examine evolutionary relationships.

Finally, the main areas to be tackled are:

- Handling and management of biological data, including its organization, control, linkages, analysis and so on.

- Communication among people, projects and institutions engaged in biological research and applications.
- Organization, access, search and retrieval of biological information, documents and literature.
- Analysis and interpretation of the biological data through computational approaches concerning visualization, mathematical modeling and the development of algorithms for highly parallel processing of complex biological structures.

In a nutshell, and as far as the scope of this work is concerned, bioinformatics is the use of techniques from applied mathematics, informatics, statistics and computer science to solve biological problems, whereby the integration and exchange of data within and among organizations is a universally recognized critical need. Bioinformatics and biomedical research deal with the problem of information and data integration since, by far, the most obvious frustration of a life scientist today is the extreme difficulty of putting together information available from multiple distinct sources. As also discussed in section 2, and in more detail in the real-world problem scenarios of section 2.1.1, a commonly noted obstacle for integration efforts in bioinformatics is that relevant information is widely distributed, both across the Internet and within individual organizations, and is found in a variety of storage formats, both traditional relational databases and non-traditional sources (e.g. text data sources in semi-structured text files or XML, and the results of analytic applications such as gene-finding applications or homology searches). Finally, as discussed in section 2.1.2, arguably a more critical need in data integration is to overcome semantic heterogeneity, i.e. to identify objects in different databases that represent the same or related biological objects (genes, proteins, etc.) and to resolve the differences in database structures or schemas among the related entities.

4.2 The Semantic Web

The Semantic Web term was coined in [Berners-Lee et al., 01] to describe the evolution of a Web that consisted largely of documents for humans to read towards a new paradigm that includes data and information for computers to manipulate. Ontologies

[Fensel, 02] are its cornerstone technology, providing structured vocabularies that describe a formal specification of a shared conceptualization. The fundamental aim of the Semantic Web is to provide a response to the ever-growing need for data integration on the Web. The benefit of adding semantics is bridging nomenclature and terminological inconsistencies so that the underlying meaning can be comprehended in a unified manner. Since a common data format will likely never be achieved, semantics can be achieved by formally capturing the meaning of data, eventually leading to efficient data management through a common understanding [Shadbolt et al, 05].

The de facto Semantic Web standard ontology language is OWL (Web Ontology Language) [OWL, 04]. OWL is a markup language for publishing and sharing data using ontologies on the Internet. OWL is a vocabulary extension of the Resource Description Framework (RDF) and is derived from the DAML+OIL Web Ontology Language. The OWL specification is maintained by the World Wide Web Consortium (W3C). OWL currently has three flavors: OWL Lite, OWL DL and OWL Full. These flavors incorporate different features; in general, it is easier to reason about OWL Lite than about OWL DL, and about OWL DL than about OWL Full. OWL Lite and OWL DL are constructed in such a way that every statement can be decided in finite time, while OWL Full can contain endless 'loops'. OWL DL is based on description logics; its subset OWL Lite is based on a less expressive logic. A more detailed explanation of the three increasingly expressive sublanguages, designed for use by specific communities of implementers and users, follows.

OWL Lite supports those users primarily needing a classification hierarchy and simple constraints. For example, while it supports cardinality constraints, it only permits cardinality values of 0 or 1. It should be simpler to provide tool support for OWL Lite than for its more expressive relatives, and OWL Lite provides a quick migration path for thesauri and other taxonomies. OWL Lite also has a lower formal complexity than OWL DL.

OWL DL supports those users who want maximum expressiveness while retaining computational completeness (all conclusions are guaranteed to be computed) and decidability (all computations will finish in finite time). OWL DL includes all OWL language constructs, but they can be used only under certain restrictions (for example, while a class may be a subclass of many classes, a class cannot be an instance of another class). OWL DL is so named due to its correspondence with description logic, a field of research that has studied the logics that form the formal foundation of OWL. Finally, OWL Full is meant for users who want maximum expressiveness and the syntactic

freedom of RDF with no computational guarantees. For example, in OWL Full a class can be treated simultaneously as a collection of individuals and as an individual in its own right. OWL Full allows an ontology to augment the meaning of the pre-defined (RDF or OWL) vocabulary. It is unlikely that any reasoning software will be able to support complete reasoning for every feature of OWL Full.

A more lightweight ontology language is the Resource Description Framework (RDF) [Hayes, 04]. RDF is a family of specifications for a metadata model that is often implemented as an application of XML. The RDF family of specifications is maintained by the World Wide Web Consortium (W3C). The RDF metadata model is based upon the idea of making statements about resources in the form of subject-predicate-object expressions, called triples in RDF terminology. The subject is the resource, the "thing" being described. The predicate is a trait or aspect of that resource, and often expresses a relationship between the subject and the object. The object is the target of the relationship or the value of that trait. RDF's simple data model and its ability to model disparate, abstract concepts have also led to its increasing use in knowledge management applications unrelated to Semantic Web activity.

4.3 SEBIO Components

In recent years, technological advances in high-throughput techniques and efficient data gathering methods, coupled with a worldwide effort in computational biology, have resulted in a vast amount of life science data, often available in distributed and heterogeneous repositories. These repositories contain valuable information such as sequence and structure data, annotations for biological data, results of complex and expensive computations, genetic sequences and multiple bio-datasets. However, the multiplicity and heterogeneity in the objectives, methods, representation and platforms of these data sources and analysis tools have created an urgent need for research in resource integration and platform-independent processing of investigative queries involving heterogeneous data sources and analysis tools. It is now universally recognized that a database approach to the analysis and management of biological data offers a convenient, high-level and efficient alternative to high-volume biological data processing and management. We believe that the Semantic Web and Semantic Web Services paradigm (see section 4.2 for further details) promises a new level of data and process integration that can be leveraged to develop novel high-performance data and process management systems for biological applications. Using semantic technologies as a key technology

for the interoperation of various datasets enables the integration of the vast amount of biological and biomedical data. The breakthrough of adding semantics to such data is that knowledge-oriented biomedical data integration leads to Intelligent Biomedical Data Integration, which will bring biomedical research to its full potential.

For that, we have divided SEBIO and its main goals into three main projects, which together achieve the aforementioned goals. Each of these projects tackles a particular feature of the SEBIO platform, namely: semantic data integration, semantic web services integration and literature data integration. The three projects and the features covered are shown in Figure 2.

Figure 2. SEBIO components

The Micro-Array Information and Data Integration Semantics-based Architecture (MASIA) is an architecture to enable the integration of micro-array data sources. The Biomedical Information Integration and Discovery with Semantic Web Services (BIRD) project aims at achieving fundamental integration for biomedical information sources. Finally, the Biomedical Literature Social Ranking System (BLISS) offers a wide range of documents and literature ranked in terms of interest for a number of topics.
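Since all three components build on the RDF model introduced in section 4.2, a minimal sketch of what an RDF statement looks like in practice may be helpful. The following Python fragment uses the open-source rdflib library; the namespace and resource names are purely illustrative and are not part of SEBIO itself:

```python
from rdflib import Graph, Literal, Namespace

# Illustrative namespace; SEBIO does not prescribe these URIs.
EX = Namespace("http://example.org/sebio/")

g = Graph()
g.bind("ex", EX)

# One triple each: subject (a protein), predicate (a trait), object (its value).
g.add((EX.ProteinA, EX.hasFunction, Literal("kinase activity")))
g.add((EX.ProteinA, EX.interactsWith, EX.ProteinB))

# Serialize the statements in Turtle syntax.
print(g.serialize(format="turtle"))
```

Each `g.add(...)` call asserts exactly one subject-predicate-object triple; a dedicated store such as YARS (section 5.5) accumulates large numbers of such triples and answers queries over them.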

These three projects are encompassed by the SEBIO approach, but significant research has been carried out in the context of each and every one of them. Hence, in the next sections, these projects will be discussed in detail, being the real outcome of this work.

5 MASIA: A Micro-Array Information and Data Integration Semantics-based Architecture

With the advent of online accessible bioinformatics data, information integration has become a fundamental issue. In recent years, the ability to perform biological in silico experiments has increased massively, largely due to high-throughput techniques gathering massive amounts of data. As the Semantic Web matures and data and information integration grows, harnessing the synergy of both approaches can compensate for the lack of widely accepted standards by fostering the use of semantic web technologies to represent, store and query metadata and data across bioinformatics datasets. In this section, we present MASIA, a fully-fledged semantically-enhanced architecture for the integration of micro-array data sources. We propose the MGED ontology as a basis for the various data formats to be integrated and depict a use-case scenario to show the advantages of our approach.

5.1 Introduction and Goals

Searching for and integrating data from various sources has become a fundamental issue in bioinformatics research. Particularly in fields facing massive data gathering, the need for information integration is critical, preserving by all means the semantics inherent to the different data sources and formats. As discussed in [Ignacimuthu, 05], such integration would permit organizing data properly, fostering the analysis of and access to such information to accomplish critical tasks such as processing micro-array data to study protein function and supporting medical researchers in making detailed studies of protein structures to facilitate drug design.

A DNA micro-array (also commonly known as a gene chip, DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic or silicon chip, forming an array for the purpose of expression profiling: monitoring

expression levels for thousands of genes simultaneously. Measuring gene expression using micro-arrays is relevant to many areas of biology and medicine, such as studying treatments, disease and developmental stages. For example, micro-arrays can be used to identify disease genes by comparing gene expression in diseased and normal cells. These experiments are, however, pooling out massive amounts of data, which are stored in a variety of data formats and data sources, hampering interoperability. The main problem looming over this lack of integration is the fact that the current Web is an environment primarily developed for human users, and micro-array data resources lack widely accepted standards, which leads to tremendous data heterogeneity. The need to add semantics to the Web and to use semantics to achieve information integration becomes even more critical as information systems become more complicated and data formats gain a more complex structure.

The Semantic Web is about adding machine-understandable and machine-processable metadata to Web resources through its key enabling technology: ontologies [Fensel, 02]. Ontologies are a formal, explicit and shared specification of a conceptualization. The goal of the Semantic Web is to provide a response to the ever-growing need for data integration on the Web. The benefit of adding semantics is bridging nomenclature and terminological inconsistencies to comprehend the underlying meaning in a unified manner.

In this section, we present MASIA, a Micro-Array Information and Data Integration Semantics-based Architecture. The breakthrough of MASIA is using semantics as a formal means of reconciling different vocabularies and terminologies and fostering integration. Firstly, the MASIA approach consists of a methodology to gather requirements, collect and classify metadata and the different data schemas stemming from the data resources to be integrated, construct a Unifying Information Model (UIM), rationalize the data semantics and utilize it. Secondly, we depict the MASIA software architecture as a fully-fledged software architecture by outlining the functionality of its components and its capability to enable integration.

The remainder of this section is organized as follows. Section 5.2 describes a number of micro-array data sources and the problem scenario. In Section 5.3, micro-array data integration is introduced. Section 5.4 presents the MASIA methodology and requirements. Finally, section 5.5 depicts the proof-of-concept architecture.

5.2 Micro-Array Data Sources

In this section, we first discuss the current caveats in data integration for biological data. We then focus on micro-array data integration problems and depict how the MGED ontology can serve as a backbone for the integration of such formats.

5.2.1 Heterogeneity in Biological Data

The need to manage bioinformatics data has been coming into increasingly sharp focus for some time. Years ago, these data sat in silos attached to specific applications. Then the Web came into the arena, bringing the hurly-burly of data becoming available across applications, departments and entities in general. However, throughout these developments, a particular underlying problem has remained unsolved: data resides in thousands of incompatible formats and cannot be systematically managed, integrated, unified or cleansed. To make matters worse, this incompatibility is not limited to the use of different data technologies or to the multiple flavors of each technology (for example, the different relational databases in existence); the most challenging incompatibility arises from semantic differences. In principle, each data asset is set up with its own world-view and vocabulary, i.e. its schema. This incompatibility exists even if both assets use the same technology. For example, one database could have a table called Protein A, intending to model a particular protein, classifying its function, categorizing it and relating it with some other proteins. Another database could simply refer to the same concept (the very same Protein A) as Protein Alfa, sub-divided in a different way, related to some other proteins and linked to various functions. Since both records (despite describing the same protein) present such dissimilarities, they will never be related or co-related. If a particular researcher wants to know all the information about Protein A, he or she will not be able to obtain a complete overview of the information, since these sources are absolutely unrelated. In a larger context, this problem may be multiplied by thousands of data structures located in hundreds of incompatible databases and message formats. And the problem is growing: bioinformatics-related techniques continue to gather more data, reengineer data-intensive processes and integrate with more sources. Moreover, developers continue to write new applications and to create new databases based on requests from users, without worrying about overall data management issues. The sketch below illustrates the Protein A / Protein Alfa problem in code.
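To make the problem concrete, the following Python sketch (with invented record layouts; neither database schema comes from an actual source) shows how a naive join on local names silently loses the connection, while a shared ontology identifier recovers it:

```python
# Two hypothetical databases describing the *same* protein under different
# local names and schemas (all names invented for illustration).
db1 = [{"name": "Protein A", "function": "kinase activity"}]
db2 = [{"label": "Protein Alfa", "pathway": "MAPK signaling"}]

# Naive integration: join on the local name. The records never match.
naive = [(r1, r2) for r1 in db1 for r2 in db2 if r1["name"] == r2["label"]]
print(naive)  # [] -- the same protein stays split across the two sources

# Semantic integration: both local names are first mapped to one shared
# ontology URI, and the join is performed on that identifier instead.
uri_of = {
    "Protein A": "http://example.org/uim/Protein_A",
    "Protein Alfa": "http://example.org/uim/Protein_A",
}
semantic = [
    (r1, r2)
    for r1 in db1
    for r2 in db2
    if uri_of[r1["name"]] == uri_of[r2["label"]]
]
print(semantic)  # one merged view of Protein A's function and pathway
```

The two-entry mapping table here is the manual analogue of what MASIA's Mappings Engine and Unifying Information Model provide at scale (sections 5.3 to 5.5).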

5.2.2 Micro-Array Data Sources

The lack of standardization in arrays presents an interoperability problem in bioinformatics, which hinders the exchange of array data. A number of micro-array data sources scattered all over the world provide such information. One of the most prominent efforts is the Stanford Micro Array Database (SMD), a database of micro-array experiment results hosting data from experiments and public experiments, 7192 spots, 1553 users, 309 labs, 43 organisms and 29 publications. Another micro-array data source is the European Bioinformatics Institute (EBI) ArrayExpress, a public repository for micro-array data, complemented by the ArrayExpress Data Warehouse, which stores gene-indexed expression profiles from a particular subset of experiments in the repository. Likewise, the maxd project from the University of Manchester is a data warehouse and visualization environment for genomic expression data.

The aforementioned lack of standardization in the data formats of these resources hampers the potential exchange of array data and analyses. Various grass-roots open-source projects are attempting to facilitate the exchange and analysis of data produced with non-proprietary chips. The "Minimum Information About a Micro-array Experiment" (MIAME) XML-based standard for describing a micro-array experiment is being adopted by many journals as a requirement for the submission of papers incorporating micro-array results. The goal of MIAME is to outline the minimum information required to interpret unambiguously, and potentially reproduce and verify, an array-based gene expression monitoring experiment. Although details for particular experiments may differ, MIAME aims to define the core that is common to most experiments. MIAME is not a formal specification, but a set of guidelines; a MIAME-style completeness check is sketched after this paragraph. A standard micro-array data model and exchange format, MAGE, which is able to capture the information specified by MIAME, has been submitted by EBI (for MGED) and Rosetta Biosoftware and recently became an Adopted Specification of the OMG standards group. Many organizations, including Agilent, Affymetrix and Iobion, have contributed ideas to MAGE. Also, the ArrayExpress source from EBI claims to be compliant with this initiative.
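As a toy illustration of the "minimum information" idea, the following Python fragment checks a submission against a required-field list; the field names are loosely based on the bio-source properties listed in section 5.3 and are not the actual MIAME checklist:

```python
# Illustrative subset of MIAME-like required annotations (not the real
# MIAME checklist; field names are simplified for this sketch).
REQUIRED_FIELDS = {"organism", "sex", "age", "organism_part", "cell_type"}

def missing_annotations(experiment: dict) -> set:
    """Return the required annotations absent from a submitted experiment."""
    return REQUIRED_FIELDS - experiment.keys()

submission = {
    "organism": "Homo sapiens",
    "organism_part": "liver",
    "cell_type": "hepatocyte",
}
print(missing_annotations(submission))  # {'sex', 'age'} -> request them
```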

However, the heterogeneity among these formats, and many others that lie beyond the scope of this report, has triggered the development of a conceptual model or ontology, the MGED ontology, which is detailed in the next section.

5.2.3 The MGED Ontology

The MGED ontology is a conceptual model for micro-array experiments in support of MAGE v.1. The aim of MGED is to establish concepts, definitions, terms and resources for the standardized description of a micro-array experiment in support of MAGE v.1. The MGED ontology is divided into the MGED Core ontology, which is intended to be stable and in sync with MAGE v.1, and the MGED Extended ontology, which adds further associations and classes not found in MAGE v.1. Since MGED has been recognized as a de-facto unifying terminology by most of the actors in the micro-array data sources scenario, it is a perfect gold-standard candidate for a common understanding model. The MGED ontology is depicted in the next figure.

Figure 3. The MGED Ontology

As previously mentioned, the primary purpose of the MGED Ontology is to provide standard terms for the annotation of micro-array experiments. These terms will enable structured queries over elements of the experiments. Furthermore, the terms will also enable unambiguous descriptions of how the experiment was performed. The terms will be provided in the form of an ontology, which means that they will be defined and organized into classes with properties, and a standard ontology format will be used. The sketch below gives a flavour of such term-based annotation and querying.
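The following Python sketch shows how shared ontology terms enable structured queries; it uses rdflib, and the namespace URI and property names are illustrative stand-ins, not the official MGED identifiers:

```python
from rdflib import Graph, Literal, Namespace

# Stand-in namespace; the real MGED ontology defines its own URIs and terms.
MO = Namespace("http://example.org/mged-like/")

g = Graph()
g.bind("mo", MO)

# Annotate two hypothetical experiments with shared ontology terms.
g.add((MO.exp1, MO.organismPart, Literal("liver")))
g.add((MO.exp1, MO.diseaseState, Literal("normal")))
g.add((MO.exp2, MO.organismPart, Literal("liver")))
g.add((MO.exp2, MO.diseaseState, Literal("hepatocellular carcinoma")))

# A structured query: all liver experiments and their disease states.
query = """
    PREFIX mo: <http://example.org/mged-like/>
    SELECT ?exp ?state WHERE {
        ?exp mo:organismPart "liver" ;
             mo:diseaseState ?state .
    }
"""
for exp, state in g.query(query):
    print(exp, state)
```

Because both experiments are annotated with the same `organismPart` term, the query finds them regardless of which laboratory or data format they originally came from.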

For descriptions of biological material (biomaterial) and certain treatments used in the experiment, terms may come from external resources that are specified in the Ontology. Software programs utilizing the Ontology are expected to generate forms for annotation, populate databases directly, or generate files in the established MAGE-ML format. Thus, the Ontology will be used directly by investigators annotating their micro-array experiments as well as by software and database developers, and will therefore be developed with these very practical applications in mind. Currently, the MGED ontology has an OWL syntax (see section 4.2); it also had an RDF-S syntax, but this was retired due to incompleteness. The fundamental importance of an ontology as a formal specification of a shared conceptualization, and its impact on our approach, is discussed in section 4.2, where the Semantic Web languages and their formal foundations are explained.

5.3 Micro-Array Data Integration

Micro-array experiments involve massive data gathering. Efforts to integrate micro-array data from such experiments will always have to struggle with a large number of physically different data formats. While a common data format will likely never be achieved, the key to efficiently managing data is to establish a common understanding. This is the idea behind semantics: bridging nomenclature and terminological inconsistencies to comprehend the underlying meaning in a unified manner. Semantics can be achieved by formally capturing the meaning of data. This is accomplished by relating physical data schemas to concepts in an agreed-upon model. We call this central model, which acts as a unifying entity under whose umbrella physical data schemas are related, a Unifying Information Model (UIM). The UIM does not reflect any specific data model, but rather the agreed-upon scientific view, scientific vocabulary and rules which provide a common basis for understanding data. Semantics builds upon traditional informal metadata and captures the formal meaning of data in agreed-upon terms.

For example, in the MIAME context, a number of bio-source properties (properties that stem from the probe or sample being examined in the experiment) must be provided, namely: organism (NCBI taxonomy), contact details for the sample and several descriptors relevant to the particular sample. These descriptors are: sex, age, development stage, organism part (tissue), cell type, animal/plant strain or line, genetic variation (e.g. gene knockout, transgenic variation), individual genetic characteristics (e.g. disease alleles,

polymorphisms), disease state (or normal), additional clinical information available and the individual (for the interrelation of the samples in the experiment). Clearly, proper management of this hierarchical data structure must take place. Following our approach, the UIM captures the major meaning or intention of concepts such as the descriptors and the more specific concepts of sex, age and so forth. A semantic mapping then relates the physical data schemas of the various micro-array data formats to the Unifying Information Model. For instance, a semantic mapping might capture the fact that one of the micro-array experiments has a concept called age that is called maturity in a relational database table, years in the XML Schema of another experiment and time-to-live in yet another. The semantic mapping therefore formally captures the meaning of the data by reference to the agreed-upon terminology, in this case the MIAME model as a basis for the UIM (a code sketch of this mapping idea closes this section).

The motivation to capture data semantics gains momentum when data management and use become critical for the analysis and statistical treatment of the data. Essentially, semantics saves time by capturing the meaning of data once. Without semantics, each data asset will be interpreted multiple times by different users as it is designed, implemented, integrated, cleansed, extended, extracted and eventually decommissioned. This independent interpretation is time-consuming and error-prone. With clearly defined semantics, the data asset is mapped and interpreted only once, ready to be related, processed and linked properly with any number of subsequent assets. Secondly, new assets can be generated from the Unifying Information Model. Finally, the most significant impact of semantics is a strategic one. Semantics turns hundreds of data sources into one coherent body of information. The semantic architecture includes a record of where data is and what it means. Using this record of which information is represented in each data asset, it becomes possible to automate the search for overlap and redundancy. The UIM provides the basis for creating new data assets in a consistent way and serves as a reliable reference for understanding the interrelationships between disparate sources and for automatically planning how to translate between them. Finally, semantics provides a central basis for impact analysis and for the smooth computer-aided realization of change.

In summary, semantics plays a key role in integration, enabling an effective strategic approach to data management. In the following section, we will present the MASIA methodology and architecture to turn micro-array data integration based on a UIM into a solution for the data integration heterogeneity problem scenario.
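The following Python fragment makes the age/maturity/years/time-to-live mapping concrete; the schema names and UIM term names are invented for illustration and do not come from MASIA's actual implementation:

```python
# Per-source semantic mappings: physical field name -> agreed-upon UIM term.
# All schema and term names below are illustrative.
MAPPINGS = {
    "relational_db":  {"maturity": "age", "tissue": "organism_part"},
    "xml_experiment": {"years": "age", "body_part": "organism_part"},
    "legacy_format":  {"time-to-live": "age"},
}

def to_uim(source: str, record: dict) -> dict:
    """Translate one source record into UIM vocabulary, keeping unmapped
    fields under their original names."""
    mapping = MAPPINGS[source]
    return {mapping.get(field, field): value for field, value in record.items()}

# Three physically different records yield one unified view:
print(to_uim("relational_db", {"maturity": 42, "tissue": "liver"}))
print(to_uim("xml_experiment", {"years": 42}))
print(to_uim("legacy_format", {"time-to-live": 42}))
# each now exposes the same UIM field: {'age': 42, ...}
```

Once every source is mapped, analysis code only ever sees UIM terms, which is exactly the "interpret once, reuse everywhere" benefit argued above.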

5.4 The MASIA Approach

In this section, we present the MASIA methodology, which ties together the two ends of the data sources and the Unifying Information Model (UIM). We also present the functional requirements of the MASIA system.

Some problems have to be faced when trying to use semantic information management. Firstly, a fragmented data environment leads to information quality problems: how do we bridge the gap between simple data and structured information? Also, information management is a key issue in a dynamic environment such as the modern enterprise, where application deployment, business process reengineering or possible restructuring of the data models leads to a burden of hard-coded scripts, data assets and proprietary definitions. Finally, the meaning and context of the data must be captured and managed in a way that represents long-term value. How to bridge the gap between this situation and the semantic information management level is defined by the Semantic Information Management methodology. This methodology is structured such that each stage adds value in its own right, while simultaneously progressing towards the benefits of full semantic data integration:

1. Gather requirements: Establish the project scope, survey the relevant data sources and capture the organization's information requirements.
2. Collect and classify metadata: Catalog data assets and collect metadata relevant to the organization and its use of data.
3. Construct the Unifying Information Model: Capture the desired world-view, a comprehensive vocabulary and the relevant rules.
4. Rationalize the data semantics: Capture the meaning of data by mapping to the Information Model.
5. Publish/Deploy: Share the Information Model, metadata and semantics with relevant stakeholders; customize it to their specialized needs.
6. Utilize: Create processes to ensure utilization of the architecture in achieving data management, data integration and data quality.

Figure 4. Methodology

For these steps to be followed, the system must also present certain features. Successful semantically-enhanced integration must be supported by an appropriate suite of fully integrated architectural components. Key components of the supporting system should include:

- Metadata Repository: A repository for storing metadata on data assets, schemas and models associated with the assets.
- Data Semantics: Integrated tools for ontology modeling, to support the creation of a Unifying Information Model, and for semantically mapping data schemas to that Unifying Information Model.
- Data Management Services: The system should use the Unifying Information Model's standard terminology as a lens through which data is managed. Data management should include the ability to author and edit the Information Model, discover data assets for any given business concept, administer data, create reports and statistics about data assets, test and simulate the Unifying Information Model and analyze impact in support of change.
- Data Integration Services: The system should automatically generate code for queries and data transformation scripts between any two mapped data schemas, utilizing the common understanding provided by the data semantics.
- Data Quality Services: In order to provide a systematic approach to data quality, the system should support the identification and decommissioning of redundant data assets. It should support comparison for ensuring consistency among semantically different data and validation/cleansing of individual sources against the central repository of rules.

Metadata Interface: The system must be able to collect metadata and data models directly from relational databases and other asset types, and to exchange metadata with other metadata repositories. Similarly, the metadata and models accumulated by the system must be open to exchange with other systems through the use of adaptors and standards such as XMI (the XML Metadata Interchange standard).

Run-Time Interface: A key differentiator of semantic information technology is its active data integration capabilities. The Run-Time Interface ensures that queries, translation scripts, schemas and cleansing scripts generated automatically by the system may be exported using standard languages.

User Interface: The User Interface should include a rich thick client for power users in the data management group.

Platform: The system should include a platform supporting version control, collaboration, permission management and configuration for all metadata and active content in the system.

In this section, we have presented the MASIA methodology and software functional requirements, including a brief description of the functionality and capabilities of particular software components.

5.5 The MASIA Software Architecture

In this section, we present a novel and promising architecture to tackle the situation depicted in the previous section. We propose a tailor-made, value-adding technological solution which addresses the aforementioned challenges and solves the integration problem of searching, finding, interacting with and integrating heterogeneous sources by means of semantic technologies. The MASIA architecture is composed of a number of components, depicted in the following figure.

Figure 5. The MASIA Software Architecture

These components are detailed in what follows:

Crawler: A software agent which browses the information sources in a methodical, automatic manner. It is a technology suitable for nearly any application that requires full-text search, especially across platforms.

Mappings Engine: The Mappings Engine is a set of integrated tools for semantically mapping data schemas to the Unifying Information Model. Its purpose is to enhance the semi-automatic mapping between schemas and the concepts or categories of the UIM, in order to alleviate a tedious process which otherwise requires human intervention. Since fully automatic mapping is not advisable, due to semantic incompatibilities and ambiguities among the source schemas and data formats, the engine should bridge the gap between cost-efficient machine-learning mapping techniques and pure human interaction. The Mappings Engine takes the MGED ontology (see section 5.2.3) as a conceptual basis for the mappings from the various sources. It then relates, as explained in section 5.3, data schemas with the semantic structure of the ontology, as sketched in the example below.
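To make the idea of schema-to-ontology mappings concrete, the following is a minimal sketch, in Python with the rdflib library, of how such correspondences could be recorded as RDF. The column names, the mapping vocabulary (MAP) and the ontology namespace URI are illustrative assumptions rather than the actual MASIA mapping tables.

    # A sketch of recording schema-to-MGED correspondences as RDF.
    # Column names, the MAP vocabulary and the ontology URI are
    # illustrative assumptions, not the actual MASIA mappings.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    MGED = Namespace("http://mged.sourceforge.net/ontologies/MGEDOntology.owl#")
    MAP = Namespace("http://example.org/masia/mappings#")

    g = Graph()
    g.bind("mged", MGED)
    g.bind("map", MAP)

    # Hypothetical correspondences: a column of a micro-array source
    # is asserted to denote a concept of the MGED-based UIM.
    schema_to_mged = {
        "array_design_id": MGED.ArrayDesign,
        "biosample_type": MGED.BioSample,
        "hybridization_id": MGED.Hybridization,
    }

    for column, concept in schema_to_mged.items():
        mapping = MAP[column]
        g.add((mapping, RDF.type, MAP.SchemaMapping))
        g.add((mapping, MAP.sourceColumn, Literal(column)))
        g.add((mapping, MAP.denotesConcept, concept))

    print(g.serialize(format="turtle"))

Once such mappings are in place, any record harvested by the crawler can be re-expressed in terms of the UIM concepts before being stored.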

YARS: The YARS (Yet Another RDF Store) system is a semantic data store that allows semantic querying and offers a higher abstraction layer enabling fast storage and retrieval of large amounts of RDF (see section 4.2 for more details about Semantic Web languages such as RDF), while keeping a small footprint and a lightweight architecture. YARS deals with data and legacy integration.

GUI: This is the component that interacts with the user. It collects the user's request and presents the results obtained. In our particular architecture, the GUI collects requests pertaining to search criteria such as, for example, a descriptor. The GUI communicates with the Execution Manager component, providing the user request, and displays the results returned as a response by the Execution Manager component.

Query Engine: The Query Engine component uses a query language to pose queries to the YARS storage system. The semantics of a query are defined not by a precise rendering of a formal syntax, but by an interpretation of the most suitable results of the query. Since YARS stores RDF triples (see section 4.2 for more details about Semantic Web languages), there are several candidate query languages. The RDF Query Language (RDQL), a W3C Member Submission, has been superseded by a W3C Recommendation, the SPARQL Query Language for RDF. Since YARS enables SPARQL querying, for pragmatic reasons SPARQL is the query language of our choice, as in the example below.
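As an illustration, the following minimal sketch (Python, standard library only) shows the kind of SPARQL query the Query Engine could issue against a YARS HTTP endpoint. The endpoint URL and the property names are illustrative assumptions, standing in for the MGED-based terms of the actual UIM.

    # A sketch of the Query Engine issuing a SPARQL query to YARS over
    # HTTP. The endpoint URL and property names are assumptions; real
    # queries would use the MGED-based terms of the UIM.
    import urllib.parse
    import urllib.request

    YARS_ENDPOINT = "http://localhost:8080/sparql"  # assumed endpoint

    # Retrieve all experiments annotated with a given descriptor.
    query = """
    PREFIX mged: <http://mged.sourceforge.net/ontologies/MGEDOntology.owl#>
    SELECT ?experiment ?source
    WHERE {
      ?experiment mged:hasDescriptor "osteoclast differentiation" .
      ?experiment mged:fromDataSource ?source .
    }
    """

    url = YARS_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url) as response:
        print(response.read().decode("utf-8"))  # SPARQL result set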

Execution Manager: The Execution Manager component is the main component of the architecture. It manages the different interactions among the components. Firstly, it communicates with the Mappings Engine to verify that the information extracted by the crawler is being correctly mapped onto the MGED ontology as a Unifying Information Model (UIM) and finally stored into YARS in RDF syntax. Secondly, it accepts the user's search request through the GUI and hands it over to the Query Engine, which, in turn, queries YARS to retrieve all RDF triples related to the particular search criteria. By retrieving a large number of triples from all the integrated resources, the user benefits from a knowledge-aware search response which is mapped to the underlying terminology and unified criteria of the Unifying Information Model, with the added advantage that all resources can be tracked and identified separately, i.e. data provenance can be traced and assigned to a particular resource.

6 BLISS: A Biomedical Literature Social Ranking System

With the explosion of online accessible bioinformatics literature, selecting the most suitable resources has become very important for further progress. Access to the bioinformatics literature relies heavily on the Web, but searching for quality literature is hindered by information overload. Recently, the exchange of information on the Web has gained momentum with the rise of socially-oriented collaborative trends. Phenomena such as blogging, wikis and social software sites such as Digg or Slashdot have emerged as a paradigm shift in which the consumer-producer equation on the Web has been reverted. The increasing success of these initiatives in pointing at and recommending resources on the Web is fuelling a new type of social recommendation for the discovery and location of resources. Together with current Semantic Web technologies and vocabularies, which have gained momentum and proved useful, they can help to overcome the significant shortcomings of information overload and foster sharing and collaboration through semantics. In this section, we present the BLISS system, a proof-of-concept implementation of a biological literature social ranking system used in the bioinformatics field.

6.1 Introduction and Goals

Over the past fifteen years we have witnessed a dramatic transformation in the practice of life sciences research. As a consequence, the biomedical literature has grown exponentially, and so far it has mostly been exploited by endlessly searching documents and articles with old-fashioned information retrieval techniques. Achieving the full potential of searching biomedical information resources, fundamentally articles about a particular topic or subject, needs the ladder of IT to reach the higher branches.

The Web is undergoing a significant change with regard to how people communicate. A shift in the Web content consumer-producer paradigm is making the Web a means of conversation, cooperation and mass empowerment. Emerging killer applications combine information sharing with a social dimension, undermining the very principles on which content delivery has relied for decades, namely information asymmetry and top-down delivery. The Semantic Web has emerged as an attempt to provide machine-processable metadata for the ever increasing information resources on the Web. Following this paradigm shift, initiatives such as the FOAF project [FOAF, 05] or the SIOC vocabulary [SIOC, 05] aim at fostering the social aspect. In these approaches, a vocabulary and proper semantics are defined for widely used terms, which thereafter benefit from annotation, taxonomy or tagging, and custom semantics. This does not follow the traditional network definition of devices or objects (phones, fax machines, computers or documents) being linked, but moves to the next level, where what are being linked are people and organizations [Reed et al, 05].

The breakthrough of adding semantic metadata to services (in our case, Web Services) is the ability to enable automatic or semi-automatic discovery. However, this leads to the so-called chicken-and-egg problem of metadata: the provider of the service demands a good reason, application or benefit before providing the metadata; yet if the metadata is not generated, no application or value-added functionality can be achieved. In this section, we argue that collaborative discovery can bridge a number of gaps found in real-world scenarios, based on an analogy with how current Web resources are being found, shared and provided by the aforementioned social software trends. Hence, collaborative discovery would learn from the recent changes on the Web, becoming simply old wine in new bottles.

The remainder of this section is organized as follows. Section 6.2 discusses several aspects of collaborative discovery. Section 6.3 discusses how collaborative discovery could bridge the gap between semantically-enabled discovery techniques and social software trends. Finally, section 6.4 provides an overview of BLISS, a proof-of-concept implementation used in the bioinformatics field, and concludes the section.

6.2 Collaborative Discovery

It has recently been acknowledged that the blogging and social bookmarking phenomena are among the most popular means of communication on the Web, affecting public opinion and mass media around the world [Kline et al, 05]. A weblog, or blog, is simply a website in which items are posted and displayed with the newest at the top; posts combine text, images and links to other websites or blogs. Social bookmarking consists in locating, ranking and sharing bookmarks and classifying them appropriately by tagging them, i.e., assigning each a tag, a descriptive keyword which summarizes the category it belongs to. Moreover, the combination of blogging and social bookmarking has merged into collaborative websites such as Digg or Slashdot. On these websites, news, stories and pointers to the location of other interesting Web resources are submitted by users, and then promoted to the front page through a user-based ranking system. This differs from the hierarchical editorial system that many other news sites employ. More particularly, in Digg, readers can go through all of the stories (or pointers to resource locations) that have been submitted by the users in the "digg all" section of the site. A digg is a vote from a registered user by which he stresses the importance of, and reveals interest in, a submission. Once a story has received enough "diggs", depending on the calculations performed by Digg's algorithm, it appears on Digg's front page. Should the story not receive enough diggs, or if enough users make use of the problem-report feature to point out issues with the submission, the story will remain in the "digg all" area, where it may eventually be removed. If a user wants to look for a particular resource on a particular topic, he can use the tag hierarchy to navigate through topics and then find resources that have been considered interesting and useful by the other Digg users. A sketch of this promotion mechanism is given below.
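The following minimal Python sketch illustrates the user-based promotion mechanism just described. The vote and report thresholds are invented for illustration; Digg's actual algorithm is more elaborate and not public.

    # A sketch of the user-based ranking mechanism described above.
    # The fixed thresholds are simplifying assumptions.
    from dataclasses import dataclass, field

    PROMOTE_THRESHOLD = 25   # assumed votes needed for the front page
    REPORT_THRESHOLD = 5     # assumed problem reports before removal

    @dataclass
    class Submission:
        url: str
        tag: str
        votes: set = field(default_factory=set)    # ids of users who voted
        reports: set = field(default_factory=set)  # ids of users who reported

        def vote(self, user_id: str) -> None:
            self.votes.add(user_id)  # one vote per registered user

        def report(self, user_id: str) -> None:
            self.reports.add(user_id)

        @property
        def status(self) -> str:
            if len(self.reports) >= REPORT_THRESHOLD:
                return "removed"
            if len(self.votes) >= PROMOTE_THRESHOLD:
                return "front page"
            return "pending"  # stays in the 'all submissions' area

    story = Submission(url="http://example.org/story", tag="bioinformatics")
    for uid in (f"user{i}" for i in range(30)):
        story.vote(uid)
    print(story.status)  # -> "front page"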

Figure 6. A screenshot from Digg

This can be envisaged as a collaborative search strategy in which the users act as a filter against the current information overload on the Web. Fundamentally, users build up a certain type of human-generated metadata which acts, in a distributed manner, as an algorithm for locating truly valuable resources.

6.3 Bridging the Gap: Social Semantics

In principle, a user aims at finding a particular resource to fulfill a particular goal [Fensel & Bussler, 02]. For example, a user who would like to locate a History book would just try one of the well-known, world-wide search engines such as Google, Yahoo, etc. Fundamentally, this can be achieved by two means: either the service provider (e.g. Amazon) provides metadata and waits for a software agent to find and interpret it, finally accessing the service, or a third party points at the service, vouching for its quality. An issue that has loomed over the first approach is that the lack of motivation, accuracy or efficiency on the provider's side in supplying the metadata hampers its full potential. As mentioned in the introduction, the so-called chicken-and-egg problem of metadata shows up.

The provider of the service demands a good reason, application or benefit before providing the metadata; however, if the metadata is not generated, no application or value-added functionality can be achieved.

In the latter approach, let us imagine John, a user of the aforementioned Digg software, who points at a great library where the resource can be found by bookmarking, blogging, ranking and qualifying it. The gist of the matter consists of turning this simple recommendation into human-generated metadata. This is what can be achieved using Social Semantics, i.e., semantic metadata harvested from social collaborative software. First of all, metadata can be automatically extracted from John's assertion in a machine-understandable manner, using an RDF vocabulary such as the Semantically Interlinked Online Communities [SIOC, 05] vocabulary. This implies having a machine-readable syntax and lightweight semantics provided by the RDF graph (fundamentally, relationships among resources). Secondly, since the assertion has been assessed, ranked and filtered in a collaborative manner by the users of the collaborative software environment, it counts on a wide consensus, which turns the assertion into significant, precise and reliable information. Finally, all this is achieved without the least effort or hassle on the part of the service provider, in a cost-effective manner, and it is ready to be widely used. The sketch below illustrates the idea.
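As a minimal sketch of this idea, the following Python code (using the rdflib library) turns John's recommendation into SIOC-based RDF. The post, user and resource URIs are illustrative assumptions; the property names are taken from the published SIOC namespace.

    # A sketch of turning a user recommendation into machine-readable
    # metadata with the SIOC vocabulary. URIs below are illustrative.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, DCTERMS

    SIOC = Namespace("http://rdfs.org/sioc/ns#")

    g = Graph()
    g.bind("sioc", SIOC)
    g.bind("dcterms", DCTERMS)

    post = URIRef("http://example.org/posts/42")      # John's recommendation
    user = URIRef("http://example.org/users/john")
    resource = URIRef("http://library.example.org/history-book")

    g.add((post, RDF.type, SIOC.Post))
    g.add((post, SIOC.has_creator, user))
    g.add((post, SIOC.links_to, resource))            # the recommended resource
    g.add((post, DCTERMS.subject, Literal("History")))

    print(g.serialize(format="turtle"))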

In addition, the tagging system constitutes an interesting development in its own right, since the folksonomies that are emerging organically appear to be a potential source of metadata. They arise because a large number of people are interested in particular information and are encouraged to describe it; rather than a centralized form of classification, this is a free, bottom-up attempt to classify information [Shadbolt et al, 06]. Folksonomies are very close to the concept of shallow ontologies, which comprise relatively few unchanging terms that organize very large amounts of data by means of a set of very common, recurring terms and relations. Finally, regarding how more metadata could be harvested from the blogging side of such a website, we refer to [Karger and Quan, 04] for an account of how blogs provide an important source of metadata. In a nutshell, a blog is structured and annotated; it can usually be found in RSS 2.0 (not RDF compliant) but also sometimes in RSS 1.0 (based on RDF). It also complies with a well-known and widely used structure which already provides some metadata that could be expressed with, for example, the Dublin Core. Eventually, it is also categorized and tagged. Regarding the effectiveness of the approach, without bogging ourselves down in details, a number of social science studies on the effect of the opinion of the masses, or of the opinion leader [Wolf, 87], show how critical mass, in terms of balanced bias of interest and attention attraction, is highly effective.

These semantic capabilities are the cornerstones of what we call SITIO, a Social Semantic Recommendation system. The idea is to provide a collaborative discovery system such as Digg or Slashdot with the aforementioned semantic metadata. The advantage, apart from what has been previously stressed, is that query answering based on keywords does not allow exploiting the semantics inherent to these structured or semi-structured data formats; a semantically-enriched system also benefits from formal reasoning and inference strategies for classifying and relating information.

6.4 BLISS: A Biological Literature Social Ranking System

As pointed out in [Cohen, 04], it is undeniable that, among the sciences, biology played a key role in the twentieth century. That role is likely to acquire further importance in the years to come. In the wake of the work on DNA and the sequencing of the human genome, far-reaching discoveries are constantly being made. One of the central factors promoting the importance of biology is its relationship with medicine: fundamental progress in medicine depends on elucidating some of the mysteries that occur in the biological sciences. Biologists need software that is reliable and can deal with huge amounts of data, as well as interfaces that facilitate human-machine interaction.

Consider a biologist working on osteoporosis, a major bone disease affecting millions of people. To target osteoporosis, it is important to understand the balance between the cells which produce bone substance and the cells, called osteoclasts, which consume it. Imagine that a scientist would like to find fundamental, best-of-breed information about gene expression levels of osteoclasts during cell differentiation. He would also wish to find all the important articles about the subject without having to wade through the vast amount of literature held in digital libraries, journals and huge information repositories. All this information, taken together, may lead to insights into which proteins may be suitable targets to treat bone-related diseases. Most of the needed information and analysis tools are accessible over the Web. However, they are designed for low-throughput human use and not for high-throughput automated use. The vision of a Semantic Web for bioinformatics transparently integrates some of these resources through the use of mark-up languages, ontologies and metadata provided for the applications involved in this process.

The Biological Literature Social Ranking System (BLISS) is a joint research effort between the Universidad Carlos III de Madrid and the Laboratory of Bioinformatics and Medical Informatics of the University of Texas at Dallas (UTD). A screenshot of BLISS is depicted in the next figure.

Figure 7. The BLISS implementation

The main features of the system are outlined as follows:

A user (biologist, bioinformatician, medical doctor, etc.) finds an article interesting and wants to communicate it to the community. To do so, he selects the article (providing a URL as a pointer) and selects a category under which it is relevant (e.g. Yeast or Lung Cancer).

Users who join the system can, given their experience in the field, vote on and hence rank the documents properly. The more votes an article gets, the higher it climbs.

New users can then be recommended a number of articles of particular importance for a number of topics, which, given the social nature of the approach, ensures the quality of the articles and the feedback around them.

In BLISS, the biologist would find all the information regarding osteoporosis classified under the relevant categories (in this case, under the Survey of Oncology Society taxonomy), where users of the system have recommended it. Apart from that, BLISS provides relevant metadata that can be harvested and used for intelligent collaborative discovery, as discussed in section 6.3. For example, BLISS provides a labelled graph (based on an RDF representation) of resources recommended under a common topic or sharing similar features.

MEDLINE is a major repository of biomedical literature maintained by the U.S. National Library of Medicine (NLM). It currently collects and maintains more than 15 million abstracts in the fields of biology and medicine, and it grows by thousands of new articles every day. PubMed is the most popular interface for accessing the MEDLINE database. If the articles the biologist searches for about osteoporosis are in MEDLINE, he will find them via a PubMed identifier link. He can then be confident that an article is of a certain quality (since it has been verified and recommended by a pool of users) and access it directly via the BLISS interface, as sketched below.
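For illustration, the following minimal Python sketch resolves a PubMed identifier (PMID) to its abstract through NCBI's public E-utilities interface; the PMID used is arbitrary. BLISS itself only stores the identifier and links out to PubMed.

    # A sketch of resolving a PMID to its abstract via NCBI E-utilities.
    # The PMID below is an arbitrary placeholder.
    import urllib.parse
    import urllib.request

    def fetch_abstract(pmid: str) -> str:
        params = urllib.parse.urlencode({
            "db": "pubmed",
            "id": pmid,
            "rettype": "abstract",
            "retmode": "text",
        })
        url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
               + params)
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")

    print(fetch_abstract("12345678"))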

Current work in this research line is proving very fruitful and interesting, with case studies and scenarios coming from the real world. As the use of new communication paradigms and technologies on the Web grows and changes, the problem of finding and relating appropriate resources in order to achieve a particular goal will become more acute. In this section, we have proposed an approach based on collaborative discovery and a social semantic ranking system as a means to bridge the gap between metadata provided from the service provider's perspective and the current collaborative discovery techniques and initiatives on the Web. The benefits of our approach are twofold. On the one hand, current technology can easily be extended with plug-ins that add the described functionality and benefit from harvesting more information, as noted previously. On the other hand, a larger critical mass of users of social-software-based collaborative discovery techniques can foster the effectiveness and efficiency of discovery, eventually enhancing the whole resource discovery approach. We have also presented BLISS, a proof-of-concept implementation that is being used in the bioinformatics field. Finally, our future work will focus on finding more use cases and real-world scenarios to validate the efficiency of our approach and to determine the feasibility of collaborative discovery in particular domains. This work is related to existing efforts on social software and new distributed collaborative trends.

7 BIRD: Biomedical Information Integration and Discovery with Semantic Web Services

Biomedical research is now information intensive; the volume and diversity of new data sources challenge current database technologies. The development and tuning of database technologies for biology and medicine will maintain and accelerate the current pace of innovation and discovery. Promising new application fields such as the Semantic Web and Semantic Web Services can leverage the potential of biomedical information integration and discovery, facing the problem of semantic heterogeneity of biomedical information sources held in a variety of storage and data formats and widely distributed both across the Internet and within individual organizations. In this section, we present BIRD, a fully-fledged biomedical information integration solution that combines natural language analysis and semantically-empowered techniques to ascertain how user needs can best be met. Our approach is backed by a proof-of-concept implementation in which the benefits and efficiency of integrating the biomedical publications database PubMed, the Database of Interacting Proteins (DIP) and the Munich Information Center for Protein Sequences (MIPS) have been tested.

7.1 Introduction and Goals

Integration and exchange of data within and among organizations is a universally recognized need in bioinformatics and genomics research. By far the most obvious frustration of a life scientist today is the extreme difficulty of putting together information available from multiple distinct sources. A commonly noted obstacle for integration efforts in bioinformatics is that relevant information is widely distributed, both across the Internet and within individual organizations, and is found in a variety of storage formats, both traditional relational databases and non-traditional sources (e.g. text data sources in semi-structured text files or XML, and the results of analytic applications such as gene-finding applications or homology searches).

Arguably, the most critical need in biomedical data integration is to overcome semantic heterogeneity, i.e. to identify objects in different databases that represent the same or related biological objects (genes, proteins, etc.) and to resolve the differences in database structures or schemas among the related objects. Such data integration is technically difficult for several reasons. First, the technologies on which different databases are based may differ and do not interoperate smoothly.

Standards for cross-database communication allow the databases (and their users) to exchange information. Secondly, the precise naming conventions for many scientific concepts (such as individual genes, proteins or drugs) in fast-developing fields are often inconsistent, so mappings are required between different vocabularies. Third, the precise underlying biological model for the data may be different (scientists view things differently), so integrating these data requires a common model of the concepts that are relevant and their allowable relations. This reason is particularly crucial because unstated assumptions may lead to improper use of information that, on the surface, appears to be valid. Fourth, as our understanding of a particular domain improves, not only will data change, but even database structures will evolve. Any user of a data source, including in particular any data integrator, must be able to manage such data source evolution.

Since the current Web is an environment primarily developed for human users, the need to add semantics to the Web becomes more critical as organizations rely on service-oriented architecture paradigms to expose their data sources by means of Web Services. The Semantic Web is about adding machine-understandable and machine-processable metadata to Web resources through its key enabling technology: ontologies [Fensel, 02]. An ontology is a formal, explicit and shared specification of a conceptualization. The breakthrough of adding semantics to Web Services leads to the Semantic Web Services paradigm, which offers the possibility of ascertaining which services could best fit the wishes and fulfil the goals of the user. Semantic Web Services can be discovered, located and accessed, since they provide formal means of leveraging different vocabularies and terminologies and foster mediation. However, the problem of bridging the gap between the current Web, primarily designed for human users whose intentions are expressed in natural language, and the formalization of those wishes remains. Potential users might be deterred from using Semantic Web Services, since the underlying formalization and difficulty of use hamper their adoption from a rich user-interaction perspective. Hence, we present in this section our work on the Biomedical Information Integration and Discovery with Semantic Web Services (BIRD) platform, which fosters intelligent interaction between natural language user intentions and existing Semantic Web Services execution environments. Our contribution is an overall solution, based on a fully-fledged architecture and a proof-of-concept implementation, that transforms user intentions into semantically-empowered goals that can be used to encompass interaction with a number of available Semantic Web Services architectures such as WSMX [WSMX], the OWL-S Virtual Machine [OWL-S] and METEOR-S [METEOR-S].

The remainder of this section is organized as follows. Section 7.2 describes the Biomedical Information Integration and Discovery with Semantic Web Services (BIRD) platform. In Section 7.3, the interaction between BIRD and the available Semantic Web Services execution environments is introduced. Finally, section 7.4 presents the proof-of-concept implementation based on a real-world scenario in which the integration of the biomedical publications database PubMed, the Database of Interacting Proteins (DIP) and the Munich Information Centre for Protein Sequences (MIPS) has been tested.

7.2 BIRD: Biological Information Integration Discovery

BIRD is a two-faced software agent designed to interact with human beings on one side and, on the other, to act as a gateway, or man-in-the-middle, towards Semantic Web Services execution environments. The main goal of the system is to help users express their needs in terms of information retrieval and to achieve information integration by means of Semantic Web Services. BIRD allows users to state their needs via natural language or to go through a list of the most important terms, extracted from the Gene Ontology (GO). For this, BIRD makes use of ontology-driven data mining. This implies that it firstly captures and gathers the terms the user would like to search for (e.g. Gene A, Protein Y), using the aforementioned GO terms as a reference. Secondly, it builds up a lightweight ontology, i.e. a very simple graph made of the relationships among those terms. Finally, it looks for the goal in the goal template repository that best fits the search criteria and requirements of the user; a goal in Semantic Web Services technology refers to the aim a user expects to fulfil by using the service. Once BIRD has inferred the goals derived from the user's wishes, it sends them to the suitable Semantic Web Services execution environment, which retrieves the outcome resulting from the integration of the applications being accessed (e.g. all the biomedical and biological publications and medical databases). This pipeline is sketched below.

The aim of this section is to describe the functionality of the components in the architecture. Loose coupling and reusability of components have been the major intentions behind the architectural decisions and implementation. Some of these details are reflected in the particular components to make them more understandable.
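Before turning to the individual components, the following minimal Python sketch illustrates the three-stage pipeline just described: capturing GO terms from the request, building a lightweight graph of their relationships, and selecting the best-fitting goal template. The GO term subset and the goal templates are illustrative assumptions, not the actual BIRD repository.

    # A sketch of the BIRD pipeline: term capture, lightweight ontology,
    # goal matching. GO labels and goal templates are invented examples.
    from itertools import combinations

    GO_TERMS = {"protein", "gene expression", "cell differentiation"}

    GOAL_TEMPLATES = {  # hypothetical repository entries
        "find_interacting_proteins": {"protein"},
        "find_expression_articles": {"gene expression", "cell differentiation"},
    }

    def capture_terms(request: str) -> set:
        """Stage 1: keep only the phrases that match GO terms."""
        return {term for term in GO_TERMS if term in request.lower()}

    def lightweight_ontology(terms: set) -> set:
        """Stage 2: a very simple graph -- every pair of captured terms
        found in the same request is taken to be related."""
        return set(combinations(sorted(terms), 2))

    def match_goal(terms: set) -> str:
        """Stage 3: pick the goal template sharing the most terms."""
        return max(GOAL_TEMPLATES, key=lambda g: len(GOAL_TEMPLATES[g] & terms))

    request = "Find articles on gene expression during cell differentiation"
    terms = capture_terms(request)
    print(lightweight_ontology(terms))
    print(match_goal(terms))  # -> "find_expression_articles"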

Figure 8. The BIRD Architecture

The figure above depicts the main components of BIRD. The core component is the Control Manager, which supervises the whole process and acts as an intermediary among the other components. The GUI is placed between the user and the Control Manager. Users have two possibilities: they can either introduce text in natural language or use the ontology-guided tool, which assists them in expressing their goals (though this option requires further work that is not envisaged in this section). When the Control Manager has extracted the user intention, it invokes the Goal Loader. The Goal Loader retrieves all the possible goals from the repository, and the Goal Matcher infers which goals are needed to satisfy the user's requests. Finally, the Control Manager sends these goals separately to the Goal Sender, which is responsible for dispatching them to the suitable execution environment. In what follows, a concise description of each of these components is presented.

Language Analyzer: The task of the Language Analyzer is to filter and process the input introduced by the user in natural language and to determine the concepts (attributes and values) and relations included in it.

Goal Loader: This component looks for goal templates in the Goal Template Repository, where different types of goals are stored. In fact, the Goal Loader retrieves all the goal templates and transmits them to the Control Manager. Since in this version of BIRD there is no fixed Semantic Web Services execution environment, different types of goal repositories are taken into account. The repository sits outside the architecture, so that anybody may plug in his/her own goal repository.

Goal Matcher: Matching is a widely used term which, in our case, encompasses both a syntactic and a semantic perspective. The Goal Matcher compares the ontology elements obtained from the analysis of the user's wishes to the descriptions of the goal templates extracted from the repository. From this matching, several goals are selected and composed by the Control Manager in order to build up the sequence of execution.

Goal Sender: This component sends the different goals to the execution environment, which returns the results obtained from the execution of the services. Its functionality is quite simple, since the sequence of execution is predefined in the BIRD Control Manager. The sending of goals is sequential, without taking into account any other workflow constructs.

GUI: This is the component that interacts with the user. It collects the user's request and presents the results obtained. The following figure depicts the basic outlook of the GUI.

Figure 9. Simple GUI outlook

Control Manager: This is the main component of the architecture. It manages the different interactions among the components. Firstly, it accepts the user's input through the GUI; this can be either natural language text or a structured sentence written with the assistance of the ontology-guided input. If the input is in natural language, it instructs the Language Analyzer to attempt the recognition of the major concepts in the text, and communicates with the Goal Loader and the Goal Matcher to orchestrate the different goals that will be sent to the execution environment through the Goal Sender. It then communicates with the GUI so that the user receives a view of the selected goals and decides whether they are correct and comply with his expectations. Finally, if the user approves them, they are sent sequentially.

In this section, we have depicted the BIRD architecture. In the following, we discuss how BIRD deals with the Semantic Web Services approach.

7.3 Needle in a haystack: Dealing with Semantic Web Services

One of the most important features of the system is its capability to interoperate with different Semantic Web Services execution environments. Several approaches to Semantic Web Services have emerged and, as the process of agreeing on a common standard for this technology has not yet been finalized, it is important not to overlook any of them. Therefore, BIRD has been designed to support and interact with several of these approaches (those incorporating an execution environment). In this section, the approaches submitted to the W3C (World Wide Web Consortium) are briefly described, and the way BIRD deals with them is detailed.

Along with some of the W3C submissions for Semantic Web Services, execution environments have been defined to automate the execution of semantically annotated Web Services. WSMX, for example, as we pointed out before, is an execution environment intended to serve as the reference implementation of WSMO. It enables automatic discovery, selection, mediation, invocation and interoperation of Semantic Web Services by accepting goals in a specified format as input and returning the Web Service invocation results. Another example is the OWL-S Virtual Machine [OWL-S] for OWL-S annotated Web Services. It uses OWL-S descriptions of Web Services and OWL ontologies to control the interaction between Web Services. Similarly to WSMX, the OWL-S Virtual Machine is a complete framework for Semantic Web Services that starts by parsing an OWL-S description and executes the process model consistently with the OWL-S operational semantics.

Besides, it uses the OWL-S Grounding to transform the abstract description of the information exchanged between the provider and the requester into WSDL operations. Finally, METEOR-S [METEOR-S] can be associated with WSDL-S. The METEOR-S project attempts to add semantics to the complete Web process lifecycle. In particular, four categories of semantics are identified: Data Semantics (semantics of the inputs/outputs of Web Services), Functional Semantics (what does a service do?), Execution Semantics (correctness and verification of execution) and QoS Semantics (performance and cost parameters associated with the service). The main advantage of this approach is that it builds upon existing service-oriented architecture and Semantic Web standards where possible, thus adding semantics to current industry standards.

BIRD is able to interact with these three execution environments. It possesses a goal template repository that contains a different set of goal templates for each execution environment. When BIRD obtains the knowledge representing the user goals, it tries to match it against the whole set of goal templates. It can be the case that goal templates of all three execution environments are needed in order to achieve the user goals; if so, BIRD sends the goals to the different execution environments sequentially, as needed. It is also possible that all the goals can be accomplished by the same execution environment, making it easier for BIRD to meet the user's expectations. The way this works is depicted in Figure 10 and sketched below.
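A minimal Python sketch of this sequential dispatch follows. The endpoint URLs and the goal payload format are illustrative assumptions, since each environment (WSMX, the OWL-S Virtual Machine, METEOR-S) in fact expects its own goal notation.

    # A sketch of BIRD's sequential dispatch of goals to the execution
    # environment each goal template belongs to. Endpoints are assumed.
    ENVIRONMENTS = {
        "wsmo":   "http://localhost:8050/wsmx/goals",      # assumed WSMX
        "owl-s":  "http://localhost:8060/owlsvm/goals",    # assumed OWL-S VM
        "wsdl-s": "http://localhost:8070/meteor-s/goals",  # assumed METEOR-S
    }

    def send_goal(goal: dict) -> None:
        """Send one goal to the environment its template was written for."""
        endpoint = ENVIRONMENTS[goal["formalism"]]
        print(f"POST {endpoint}: {goal['template']}")  # stand-in for HTTP call

    # Goals are dispatched strictly in sequence, as in the Goal Sender.
    matched_goals = [
        {"formalism": "wsmo", "template": "retrieve_pubmed_articles"},
        {"formalism": "owl-s", "template": "lookup_dip_interactions"},
    ]
    for goal in matched_goals:
        send_goal(goal)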

Figure 10. BIRD dealing with Semantic Web Execution Environments

In this section, we have described how BIRD deals with Semantic Web Services execution environments. In the following, we focus on the benefits and efficiency of applying this approach to biomedical data sources.

7.4 Using BIRD for Biomedical Information Integration

In this section, we present a use case scenario based on the real world in order to show the advantages provided by BIRD from the user perspective. Three data sources are being integrated, detailed in what follows. Firstly, the PubMed database, a free search engine offered by the United States National Library of Medicine (as part of the Entrez information retrieval system). The inclusion of an article in PubMed does not endorse that article's contents; the service allows searching the MEDLINE database. MEDLINE covers over 4,800 journals and also offers access to citations to articles that are out of scope (e.g., covering plate tectonics or astrophysics) from certain MEDLINE journals, primarily general science and general chemistry journals, for which the life sciences articles are indexed for MEDLINE, as well as in-process citations, which provide a record for an article before it is indexed and added to MEDLINE or converted to out-of-scope status.


More information

IBM Research Report. Model-Driven Business Transformation and Semantic Web

IBM Research Report. Model-Driven Business Transformation and Semantic Web RC23731 (W0509-110) September 30, 2005 Computer Science IBM Research Report Model-Driven Business Transformation and Semantic Web Juhnyoung Lee IBM Research Division Thomas J. Watson Research Center P.O.

More information

Advances In Data Integration: The No ETL Approach. Marcos A. Campos, Principle Consultant, The Cognatic Group. capsenta.com. Sponsored by Capsenta

Advances In Data Integration: The No ETL Approach. Marcos A. Campos, Principle Consultant, The Cognatic Group. capsenta.com. Sponsored by Capsenta Advances In Data Integration: The No ETL Approach Marcos A. Campos, Principle Consultant, The Cognatic Group Sponsored by Capsenta capsenta.com INTRODUCTION Data integration. It s a costly activity. Current

More information

Semantic agents for location-aware service provisioning in mobile networks

Semantic agents for location-aware service provisioning in mobile networks Semantic agents for location-aware service provisioning in mobile networks Alisa Devlić University of Zagreb visiting doctoral student at Wireless@KTH September 9 th 2005. 1 Agenda Research motivation

More information

Acurian on. The Role of Technology in Patient Recruitment

Acurian on. The Role of Technology in Patient Recruitment Acurian on The Role of Technology in Patient Recruitment Wearables smartphones social networks the list of new technological tools available to patients and healthcare providers goes on and on. Many clinical

More information

Bridging the Gap between Semantic Web and Networked Sensors: A Position Paper

Bridging the Gap between Semantic Web and Networked Sensors: A Position Paper Bridging the Gap between Semantic Web and Networked Sensors: A Position Paper Xiang Su and Jukka Riekki Intelligent Systems Group and Infotech Oulu, FIN-90014, University of Oulu, Finland {Xiang.Su,Jukka.Riekki}@ee.oulu.fi

More information

Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications

Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications Amanda Clare 1,3, Samuel Croset 2,3 (croset@ebi.ac.uk), Christoph Grabmueller 2,3, Senay

More information

Health Information Exchange Content Model Architecture Building Block HISO

Health Information Exchange Content Model Architecture Building Block HISO Health Information Exchange Content Model Architecture Building Block HISO 10040.2 To be used in conjunction with HISO 10040.0 Health Information Exchange Overview and Glossary HISO 10040.1 Health Information

More information

Organizing Information. Organizing information is at the heart of information science and is important in many other

Organizing Information. Organizing information is at the heart of information science and is important in many other Dagobert Soergel College of Library and Information Services University of Maryland College Park, MD 20742 Organizing Information Organizing information is at the heart of information science and is important

More information

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid

THE GLOBUS PROJECT. White Paper. GridFTP. Universal Data Transfer for the Grid THE GLOBUS PROJECT White Paper GridFTP Universal Data Transfer for the Grid WHITE PAPER GridFTP Universal Data Transfer for the Grid September 5, 2000 Copyright 2000, The University of Chicago and The

More information

Vocabulary Harvesting Using MatchIT. By Andrew W Krause, Chief Technology Officer

Vocabulary Harvesting Using MatchIT. By Andrew W Krause, Chief Technology Officer July 31, 2006 Vocabulary Harvesting Using MatchIT By Andrew W Krause, Chief Technology Officer Abstract Enterprises and communities require common vocabularies that comprehensively and concisely label/encode,

More information

Using DAML format for representation and integration of complex gene networks: implications in novel drug discovery

Using DAML format for representation and integration of complex gene networks: implications in novel drug discovery Using DAML format for representation and integration of complex gene networks: implications in novel drug discovery K. Baclawski Northeastern University E. Neumann Beyond Genomics T. Niu Harvard School

More information

DEVELOPMENT OF ONTOLOGY-BASED INTELLIGENT SYSTEM FOR SOFTWARE TESTING

DEVELOPMENT OF ONTOLOGY-BASED INTELLIGENT SYSTEM FOR SOFTWARE TESTING Abstract DEVELOPMENT OF ONTOLOGY-BASED INTELLIGENT SYSTEM FOR SOFTWARE TESTING A. Anandaraj 1 P. Kalaivani 2 V. Rameshkumar 3 1 &2 Department of Computer Science and Engineering, Narasu s Sarathy Institute

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Electrical engineering. data management. A practical foundation for a true mechatronic data model

Electrical engineering. data management. A practical foundation for a true mechatronic data model W H I T E P A P E R Z u k e n T h e P a r t n e r f o r S u c c e s s Electrical engineering data management A practical foundation for a true mechatronic data model d a t a m a n a g e m e n t z u k e

More information

RD-Action WP5. Specification and implementation manual of the Master file for statistical reporting with Orphacodes

RD-Action WP5. Specification and implementation manual of the Master file for statistical reporting with Orphacodes RD-Action WP5 Specification and implementation manual of the Master file for statistical reporting with Orphacodes Second Part of Milestone 27: A beta master file version to be tested in some selected

More information

Massive Data Analysis

Massive Data Analysis Professor, Department of Electrical and Computer Engineering Tennessee Technological University February 25, 2015 Big Data This talk is based on the report [1]. The growth of big data is changing that

More information

Automatic Generation of Workflow Provenance

Automatic Generation of Workflow Provenance Automatic Generation of Workflow Provenance Roger S. Barga 1 and Luciano A. Digiampietri 2 1 Microsoft Research, One Microsoft Way Redmond, WA 98052, USA 2 Institute of Computing, University of Campinas,

More information

SEBI: An Architecture for Biomedical Image Discovery, Interoperability and Reusability based on Semantic Enrichment

SEBI: An Architecture for Biomedical Image Discovery, Interoperability and Reusability based on Semantic Enrichment SEBI: An Architecture for Biomedical Image Discovery, Interoperability and Reusability based on Semantic Enrichment Ahmad C. Bukhari 1, Michael Krauthammer 2, Christopher J.O. Baker 1 1 Department of Computer

More information

The Emerging Data Lake IT Strategy

The Emerging Data Lake IT Strategy The Emerging Data Lake IT Strategy An Evolving Approach for Dealing with Big Data & Changing Environments bit.ly/datalake SPEAKERS: Thomas Kelly, Practice Director Cognizant Technology Solutions Sean Martin,

More information

Automating Instance Migration in Response to Ontology Evolution

Automating Instance Migration in Response to Ontology Evolution Automating Instance Migration in Response to Ontology Evolution Mark Fischer 1, Juergen Dingel 1, Maged Elaasar 2, Steven Shaw 3 1 Queen s University, {fischer,dingel}@cs.queensu.ca 2 Carleton University,

More information

Mining the Biomedical Research Literature. Ken Baclawski

Mining the Biomedical Research Literature. Ken Baclawski Mining the Biomedical Research Literature Ken Baclawski Data Formats Flat files Spreadsheets Relational databases Web sites XML Documents Flexible very popular text format Self-describing records XML Documents

More information

3.4 Data-Centric workflow

3.4 Data-Centric workflow 3.4 Data-Centric workflow One of the most important activities in a S-DWH environment is represented by data integration of different and heterogeneous sources. The process of extract, transform, and load

More information

TITLE OF COURSE SYLLABUS, SEMESTER, YEAR

TITLE OF COURSE SYLLABUS, SEMESTER, YEAR TITLE OF COURSE SYLLABUS, SEMESTER, YEAR Instructor Contact Information Jennifer Weller Jweller2@uncc.edu Office Hours Time/Location of Course Mon 9-11am MW 8-9:15am, BINF 105 Textbooks Needed: none required,

More information

Full file at

Full file at Chapter 2 Data Warehousing True-False Questions 1. A real-time, enterprise-level data warehouse combined with a strategy for its use in decision support can leverage data to provide massive financial benefits

More information

ICT-SHOK Project Proposal: PROFI

ICT-SHOK Project Proposal: PROFI ICT-SHOK Project Proposal: PROFI Full Title: Proactive Future Internet: Smart Semantic Middleware Overlay Architecture for Declarative Networking ICT-SHOK Programme: Future Internet Project duration: 2+2

More information

Semantics Modeling and Representation. Wendy Hui Wang CS Department Stevens Institute of Technology

Semantics Modeling and Representation. Wendy Hui Wang CS Department Stevens Institute of Technology Semantics Modeling and Representation Wendy Hui Wang CS Department Stevens Institute of Technology hwang@cs.stevens.edu 1 Consider the following data: 011500 18.66 0 0 62 46.271020111 25.220010 011500

More information

The Future of Interoperability: Emerging NoSQLs Save Time, Increase Efficiency, Optimize Business Processes, and Maximize Database Value

The Future of Interoperability: Emerging NoSQLs Save Time, Increase Efficiency, Optimize Business Processes, and Maximize Database Value The Future of Interoperability: Emerging NoSQLs Save Time, Increase Efficiency, Optimize Business Processes, and Maximize Database Value Author: Tim Dunnington Director of Interoperability, Informatics

More information

The GenAlg Project: Developing a New Integrating Data Model, Language, and Tool for Managing and Querying Genomic Information

The GenAlg Project: Developing a New Integrating Data Model, Language, and Tool for Managing and Querying Genomic Information The GenAlg Project: Developing a New Integrating Data Model, Language, and Tool for Managing and Querying Genomic Information Joachim Hammer and Markus Schneider Department of Computer and Information

More information

Unstructured Text in Big Data The Elephant in the Room

Unstructured Text in Big Data The Elephant in the Room Unstructured Text in Big Data The Elephant in the Room David Milward ICIC, October 2013 Click Unstructured to to edit edit Master Master Big title Data style title style Big Data Volume, Variety, Velocity

More information

WHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES. Introduction. Production rules. Christian de Sainte Marie ILOG

WHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES. Introduction. Production rules. Christian de Sainte Marie ILOG WHY WE NEED AN XML STANDARD FOR REPRESENTING BUSINESS RULES Christian de Sainte Marie ILOG Introduction We are interested in the topic of communicating policy decisions to other parties, and, more generally,

More information

Knowledge and Ontological Engineering: Directions for the Semantic Web

Knowledge and Ontological Engineering: Directions for the Semantic Web Knowledge and Ontological Engineering: Directions for the Semantic Web Dana Vaughn and David J. Russomanno Department of Electrical and Computer Engineering The University of Memphis Memphis, TN 38152

More information

SEMANTIC SUPPORT FOR MEDICAL IMAGE SEARCH AND RETRIEVAL

SEMANTIC SUPPORT FOR MEDICAL IMAGE SEARCH AND RETRIEVAL SEMANTIC SUPPORT FOR MEDICAL IMAGE SEARCH AND RETRIEVAL Wang Wei, Payam M. Barnaghi School of Computer Science and Information Technology The University of Nottingham Malaysia Campus {Kcy3ww, payam.barnaghi}@nottingham.edu.my

More information

Financial Dataspaces: Challenges, Approaches and Trends

Financial Dataspaces: Challenges, Approaches and Trends Financial Dataspaces: Challenges, Approaches and Trends Finance and Economics on the Semantic Web (FEOSW), ESWC 27 th May, 2012 Seán O Riain ebusiness Copyright 2009. All rights reserved. Motivation Changing

More information

Data Virtualization Implementation Methodology and Best Practices

Data Virtualization Implementation Methodology and Best Practices White Paper Data Virtualization Implementation Methodology and Best Practices INTRODUCTION Cisco s proven Data Virtualization Implementation Methodology and Best Practices is compiled from our successful

More information

An Algebra for Protein Structure Data

An Algebra for Protein Structure Data An Algebra for Protein Structure Data Yanchao Wang, and Rajshekhar Sunderraman Abstract This paper presents an algebraic approach to optimize queries in domain-specific database management system for protein

More information

D360: Unlock the value of your scientific data Solving Informatics Problems for Translational Research

D360: Unlock the value of your scientific data Solving Informatics Problems for Translational Research D360: Unlock the value of your scientific data Solving Informatics Problems for Translational Research Dr. Fabian Bös, Senior Application Scientist Certara Spain SL Martin-Kollar-Str. 17, 81829 Munich

More information