Master Thesis. Andreas Schlicker


Master Thesis

A Global Approach to Comparative Genomics: Comparison of Functional Annotation over the Taxonomic Tree

by Andreas Schlicker

A Thesis Submitted to the Center for Bioinformatics of Saarland University, Saarbrücken, Germany, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Bioinformatics

supervised by
Dr. Francisco Domingues
Prof. Dr. Thomas Lengauer, Ph.D.

Computational Biology and Applied Algorithmics Group
Max Planck Institute for Informatics
Saarbrücken, Germany

August 30, 2005

Abstract

Genome sequencing projects produce large amounts of data that are stored in sequence databases. Entries in these databases are annotated using the results of different experiments and computational methods. These methods usually rely on homology detection based on sequence similarity searches. Gene Ontology (GO) provides a standard vocabulary of functional terms and allows a coherent annotation of gene products. These annotations can be used as a basis for new methods that compare gene products on the basis of their molecular function and biological role.

In this thesis, we present a new approach for integrating the species taxonomy, protein family classifications and GO annotations. We implemented a database and a client application, GOTaxExplorer, that can be used to perform queries with a simplified language and to process and visualize the results. It allows the user to compare different taxonomic groups with respect to the protein families or the protein functions associated with the different genomes. We developed a method for comparing GO annotations which includes a measure of functional similarity between gene products. The method was able to find functional relationships even if the proteins show no significant sequence similarity. We provide results for different application scenarios, in particular for the identification of new drug targets.

I hereby declare that this thesis is entirely my own work except where otherwise indicated. I have used only the resources given in the list of references.

Andreas Schlicker
August 30, 2005

Acknowledgments

I would like to thank my supervisors Francisco Domingues and Thomas Lengauer. Francisco helped me with his ideas and experience during my thesis. He gave me valuable advice on doing scientific research and on scientific writing. I thank Prof. Lengauer for giving me the chance to write this master's thesis in his research group. Furthermore, I want to thank all members of AG3 for creating a nice and warm atmosphere. Special thanks go to Jörg Rahnenführer for his assistance in developing the funeqscore and for proofreading the thesis. Additionally, I would like to thank Mario Albrecht, Andreas Kämper, and Ingolf Sommer for proofreading, and Joachim Büch for continuous computer support.

Table of Contents

List of Figures
List of Tables
List of Algorithms
List of Abbreviations

1 Introduction

2 Annotations and ontologies
    Ontologies
    Ontologies in general
    Gene Ontology (GO)
    Semantic similarity
    A graph-based measure
    Resnik's measure
    Lin's measure
    Comparison of sim Resnik and sim Lin
    Applications of semantic similarity to GO
    Data sources
    Pfam
    SMART
    NCBI Taxonomy
    Gene Ontology Annotation (GOA)
    UniProt

3 Methodology
    Combining families, taxonomy and function
    The database implementation
    Functional equivalent gene products
    Calculating the funeqscore
    Adding relevance to the FuneqScore
    Finding Funeqs

4 Implementation
    Database
    Gene database
    Completed genome database
    GOTaxExplorer
    Query language
    Graphical user interface (GUI)
    Command line interface (CLI)
    Using GOTaxExplorer

5 Results
    Selecting families from alveolata
    Investigating the PHP domain
    Comparing biological processes
    Finding functionally equivalent proteins
    Saccharomyces cerevisiae
    Mycobacterium paratuberculosis
    Relevance score

6 Conclusions

Bibliography

Appendix

A User Manual
    A.1 System requirements
    A.2 GOTaxExplorer
    A.2.1 Configuration file
    A.2.2 Command Line Interface (CLI)
    A.2.3 Query language
    A.2.4 Graphical User Interface (GUI)
    A.3 Helper programs for database maintenance

List of Figures

2.1 Pane of GO Biological Process
Comparison of Resnik's and Lin's measures of semantic similarity
Information triangle
Information flow in the database implementation
Diagram of the comparison of two GO term sets
Diagram of the calculation of the funeqscore for two proteins
Data schemes of the two databases
Main window of GOTaxExplorer with query and result frame visible
Window with the 2D view on the hierarchies
D view on a GO comparison result
D output frame for GO and Taxonomy results
Searching for the accession number of the PHP domain
GO results example
D view on query results
GO comparison results
Tree view of GO comparison results
Results from a query for functionally equivalent proteins
DNA gyrase subunit A
Distribution of the PHP domain over the taxonomic tree
Distribution of the funeqscore Lin, avg and funeqscore Lin, max for the comparison of S. cerevisiae with human
Distribution of the funeqscore Lin, avg and funeqscore Lin, max for the comparison of M. paratuberculosis with human
Distribution of the funeqscore Relevance, avg and funeqscore Relevance, max for the comparison of M. paratuberculosis with human
A.1 The File menu
A.2 The Frames menu
A.3 The Query menu
A.4 The Results menu
A.5 The query frame. This frame contains facilities to incrementally build a query.
A.6 Frame for entering a SQL query
A.7 The results frame contains all query results in tabs
A.8 The name search frame provides the possibility to search for an entity with a part of its name
A.9 Popup menu displayed for a node in the 3D graph view

List of Tables

4.1 Entities used in the examples
Pfam families in alveolata that do not appear in mammals
Pathogens with proteins belonging to PHP domain
Biological processes of proteins with PHP domain
Proteins from different pathogens with the PHP domain that are involved in "DNA replication" (GO: )
The 4 biological processes from fungi that do not appear in mammals but have the highest similarity to mammalian processes according to sim Lin
The 5 biological processes from fungi that do not appear in mammals and have the lowest similarity to mammalian processes (calculated with sim Lin)
Number of S. cerevisiae proteins with functional equivalents in the corresponding data set
Number of M. paratuberculosis proteins with functional equivalents in the corresponding data set
Number of M. paratuberculosis proteins with functional equivalents in the corresponding data set computed with sim Relevance

List of Algorithms

1 Lord et al.'s algorithm for calculating semantic similarity
Algorithm for comparing sets of GO terms
Algorithm for computing the funeqscore of two gene products A and B

List of Abbreviations

BioDW    Bio-Data Warehouse
BP       Biological Process
CLI      Command Line Interface
DAG      Directed Acyclic Graph
EBI      European Bioinformatics Institute
Funeqs   Functional Equivalents
GO       Gene Ontology
GOA      Gene Ontology Annotation
GOLD     Genomes Online Database
GUI      Graphical User Interface
HMM      Hidden Markov Models
IC       Information Content
lca      Lowest Common Ancestor
MST      Minimum Spanning Tree
MF       Molecular Function
MGI      Mouse Genome Informatics
NCBI     National Center for Biotechnology Information
RNA      Ribonucleic Acid
rRNA     Ribosomal Ribonucleic Acid
SGD      Saccharomyces Genome Database
SRS      Sequence Retrieval System
SCBIT    Shanghai Center for Bioinformation Technology
SIGF     Sigma Translation Initiation Factor
SNP      Single Nucleotide Polymorphisms
SIB      Swiss Institute of Bioinformatics
tRNA     Transfer Ribonucleic Acid
URL      Uniform Resource Locator

Chapter 1

Introduction

According to the Genomes Online Database (GOLD™) as of July 2005 [1], 266 genomes are fully sequenced and another 1226 genomes are in the process of being sequenced. These sequencing projects generate large amounts of raw data that need to be further analyzed. The raw sequences contain no biological information. This information has to be added through annotations of the sequences. The process of annotation can be divided into three steps [2]:

Where? Nucleotide-level annotation locates the different sequence features like genes and promoters in the genome.
What? Protein-level annotation creates a catalog of all proteins of an organism and their molecular functions.
How? Process-level annotation relates the genome to the biological processes in the cell.

The nucleotide-level annotation addresses several problems. Researchers locate genes, tRNA and other non-coding RNAs. Gene finding in prokaryotic genomes amounts to finding open reading frames, but the gene structure in eukaryotes is much more complex and therefore more challenging to predict correctly. Non-coding RNAs like tRNA and rRNA can be found through their similarity to known sequences, while smaller non-coding RNAs are more diverse and harder to detect. Genetic markers that were known before the genome sequence was available are mapped to the sequence, which connects the pre-genomic literature with post-genomic research. Repetitive elements account for a large proportion of eukaryotic genomes. They are detected and masked before the assembly of the complete genome starts. Single nucleotide polymorphisms (SNP) are also identified and mapped at this stage.

Protein-level annotations rely heavily on bioinformatics methods and on direct assay experiments. Experiments are used to reveal the functions of a protein and possible binding partners. A powerful bioinformatics approach to characterize new proteins is the identification of homologs.
New sequences are compared to sequences of already known proteins in databases in order to find homologous relationships. Homologs provide a basis for assigning families to unknown proteins as well as for generating testable hypotheses about their molecular function. These putative functions are verified with experimental methods. Bioinformatics methods for the detection of homology are fast and fully automated. Orthology, as defined by Fitch [3], is a

homology relationship where sequence divergence follows speciation events. However, gene duplication events lead to the occurrence of paralogous sequences. These paralogs, in contrast to orthologs, have high sequence similarity but can differ in function. Orthology can be inferred by comparing a sequence tree, derived from a multiple sequence alignment, with the species tree [4].

Protein-level annotations are available in the Swiss-Prot and Swiss-Prot TrEMBL [5] databases. The proteins in Swiss-Prot are manually annotated with functional terms, whereas Swiss-Prot TrEMBL contains automatically annotated sequences. Swiss-Prot and TrEMBL are cross-referenced with secondary databases that classify proteins into domains and families. Examples of family databases are Pfam [6] and SMART [7].

Process-level annotation is a challenging annotation step. Different experimental techniques can be used to elucidate the molecular function and the biological process of a protein. These include high-throughput methods like microarray expression analysis and other techniques like yeast two-hybrid studies. Gene knockout and gene silencing experiments are used to determine which gene products are essential for the cell, and provide functional information. Genome context methods predict in which biological processes a protein takes part [8]. The hypotheses can then be tested with experimental techniques. Most relevant for functional annotations is the Gene Ontology Annotation (GOA) [9] project at the European Bioinformatics Institute (EBI). This project annotates the protein sequences in UniProt [10] with Gene Ontology terms. Gene Ontology [11] is a resource for functional annotations that provides a vocabulary for describing the molecular function, biological process and the cellular component of a gene product.
Details on GOA and Gene Ontology are given in Chapter 2.

Comparative genomics methods rely on the comparison of sequences at the genome scale [12, 13]. One application is the mapping of genes in new genomes. The comparison of two genomes reveals possible syntenic regions, that is, parts where the gene order is conserved in both genomes. This allows information to be transferred from an annotated genome to a new one. The comparison of two genomes provides information not only on the genes but also on the regulatory elements. Since these elements are more conserved than other intergenic regions, they can be found through a comparison of closely related organisms. This method can be used to find genes in the human genome based on the genome annotations of model organisms or to locate genes orthologous to human genes. If a human disease gene has an ortholog in another organism such as yeast, it is possible to study the molecular function of the gene and the biological process that causes the disease in this organism and then transfer the knowledge to human. The comparison of different genomes provides deeper insight into the underlying molecular biology.

Comparative genomics can also be used to compare pathogens with non-pathogenic strains. Differences can help to understand the process of pathogenesis and thus give a better understanding of diseases caused by the pathogens. Comparing pathogens with their host shows genes unique to the pathogen. If these genes are essential for the viability of the pathogen, they are good target candidates for drug development. If they are not essential but are disease-associated, they present starting points for further experiments.

Once the genomes are annotated, it is possible to compare proteins on the basis of their functional annotations in order to find functionally equivalent proteins independent of homology. An annotation-based method has several advantages over approaches based on sequence comparison. Sequence comparison techniques do not directly provide information about the functionality of the sequences found. Proteins with significant sequence similarity that have evolved from a common ancestor can have different functions. Furthermore, proteins with different ancestors that have no significant sequence similarity can have the same function. Modern methods for orthology detection will not find sequences that have similar functions but do not have significant sequence similarity. This can be accomplished by methods that rely on the comparison of functional annotations.

Annotation-based methods can help to understand how conserved or diverse the molecular biology of the different types of organisms is, and how the different molecular functions and biological processes have evolved. Comparing the cellular functions and processes of pathogenic and non-pathogenic strains can be more informative for understanding pathogenesis than the comparison of sequences. Comparing functional annotations can also be applied to the search for new drug targets. A comparison of the pathogen and human annotations reveals additional functionalities in the pathogen. The pathogen proteins that perform these functions are natural candidates for drug targets. Furthermore, it is possible to extend this analysis to complete taxonomic groups instead of single organisms, which makes it possible to find functions common to a group of organisms but absent in other species. With this method, targets can be found selectively for one pathogen or for a whole group of organisms. One could, for example, find target molecules in human pathogens that are absent in bacteria from the human gastrointestinal tract flora.

The wealth of databases and information makes it necessary to combine the different resources.
This includes linking different databases in a query as well as comparing the information from different databases with each other. A central search tool that allows direct access to different types of information is essential for biological research. There are several query systems for accessing protein databases and functional information on the internet. Probably the two largest and best-known systems are the Sequence Retrieval System (SRS) [14] at the European Bioinformatics Institute (EBI) and Entrez at the National Center for Biotechnology Information (NCBI) [15]. Both systems are very sophisticated and allow searching in different databases to gather relevant information. They are useful for general applications but do not allow more complex queries that link different databases. It is not possible to start by querying one database and then use the results as a query for another database. One could, for example, use SRS to query UniProt [10] for all human proteins that are involved in amino acid synthesis. However, it is not possible to formulate a query that results in a list of Pfam [16] domains that appear in human proteins involved in amino acid synthesis. In order to answer this question, SRS requires the user to query for the proteins and compile the list of Pfam domains manually.

The Bio-Data Warehouse (BioDW) from the Shanghai Center for Bioinformation Technology (SCBIT) and Fudan University [17] combines information from different databases. This database connects the KEGG database [18], Gene Ontology (GO) [11], GenBank [19], Swiss-Prot [5], InterPro [20], and Enzyme [21] with each other. The links between the databases are provided either through cross-references or through GO annotations. BioDW offers a semantic similarity search to find proteins with GO annotations that are semantically similar to the annotations of the query protein. The user can search with the Gene Ontology annotations of a

selected protein and receives a list of semantically similar GO terms. Then one can search for proteins that are annotated with one of these terms.

Lord et al. [22] implemented a search tool that allows the user to identify semantically similar proteins within the subset of human proteins in Swiss-Prot. The tool uses the GO annotations of the proteins to calculate a semantic similarity value. The GO annotations of a query protein are compared to the annotations of the proteins in the database. Lord et al. use the same measure of semantic similarity as BioDW. The approaches differ in that Lord et al. use all GO annotations to determine an overall semantic similarity, whereas BioDW uses only one GO annotation.

Another approach is taken by the FungalWeb Ontology [23, 24]. This consortium creates a new ontology that covers all aspects of fungal enzyme biotechnology. It integrates organism taxonomy, enzymatic activity and properties of small chemical compounds into one ontology. It is designed as a decision-making tool for the biotechnology industry. We will investigate ontologies in general and GO, the approaches of Lord et al., and BioDW in particular in Chapter 2.

We present a new approach to comparative genomics that integrates functional information and families with the taxonomic classification. The implementation consists of a database and a client application for formulating queries and processing the results. The implementation permits the selection of custom sets of GO terms, families or taxonomic groups. For example, it is possible to compare arbitrarily selected organisms or groups of organisms from the taxonomic tree on the basis of the functionality of their genes. Furthermore, it makes it possible to determine the distribution of specific molecular functions or protein families in the taxonomy. It is also possible to compare sets of GO terms.
In order to be able to compare functional annotations, we implemented a semantic similarity measure. This makes it possible not only to compare exactly matching terms but also to assess the similarity of two GO terms. We present a method for identifying functionally equivalent and functionally related gene products from two organisms on the basis of GO annotations and a semantic similarity measure for GO. This method is independent of sequence similarity and therefore capable of finding functionally equivalent or functionally related proteins that are not homologous. Proteins that are essential for the pathogen but have no related proteins in human are possible targets for the development of new anti-infectives.

In Chapter 2 we describe the theoretical background of Gene Ontology, semantic similarity measures and their application to GO. We describe our strategy in detail in Chapter 3. There, we explain how the data is combined and present a method for assigning functionally equivalent proteins. The implementation of our strategy to integrate several data sources and to assign functionally equivalent proteins is explained in Chapter 4. In Chapter 5 we report results from some experiments we performed. We give examples for the selection of different sets, the comparison of sets of GO terms, and the identification of functionally equivalent proteins in different organisms. Chapter 6 presents the conclusions we draw from the presented approach and discusses possible further extensions.
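The linked question raised earlier in this chapter (which Pfam domains appear in human proteins involved in amino acid synthesis) can be illustrated with a small sketch. The fragment below is only meant to show what linking annotation sources in one query looks like; the schema, the table and column names, the accession numbers and the family names are all hypothetical and do not reproduce the database or the query language of this thesis.

```python
import sqlite3

# Hypothetical, minimal schema. Table names, column names and all rows are
# assumptions made for illustration; they are not the thesis database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (acc TEXT PRIMARY KEY, species TEXT);
CREATE TABLE go_annotation (acc TEXT, go_term TEXT);
CREATE TABLE pfam_hit (acc TEXT, family TEXT);

INSERT INTO protein VALUES ('P1', 'human'), ('P2', 'human'), ('P3', 'yeast');
INSERT INTO go_annotation VALUES
    ('P1', 'amino acid biosynthesis'),
    ('P2', 'translation'),
    ('P3', 'amino acid biosynthesis');
INSERT INTO pfam_hit VALUES
    ('P1', 'FamA'), ('P1', 'FamB'), ('P2', 'FamC'), ('P3', 'FamA');
""")

# One query links species, functional annotation and family membership,
# so no manual compilation of intermediate protein lists is needed.
QUERY = """
SELECT DISTINCT h.family
FROM protein p
JOIN go_annotation g ON g.acc = p.acc
JOIN pfam_hit h ON h.acc = p.acc
WHERE p.species = ? AND g.go_term = ?
ORDER BY h.family
"""

def families_for(species, go_term):
    """Families found in proteins of `species` annotated with `go_term`."""
    return [row[0] for row in conn.execute(QUERY, (species, go_term))]

print(families_for("human", "amino acid biosynthesis"))  # ['FamA', 'FamB']
```

In this toy form only exact term matches are considered. A realistic query would presumably also have to include proteins annotated with more specific terms below the selected GO term.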

Chapter 2

Annotations and ontologies

In this chapter, we describe ontologies in general and Gene Ontology in detail. We introduce similarity measures for comparing GO terms and give details on the databases used in our approach.

2.1 Ontologies

Ontologies in general

The term ontology historically arose in the field of philosophy, where it refers to the existence and properties of objects and the relations between them in every aspect of reality [25]. In computer science, the term is used in a narrower sense. Ontologies are working models of entities and interactions. They can either involve a single domain of knowledge or be generic [26]. Different groups describe the information they gather with different terms and use terms with different meanings. Information scientists refer to a taxonomy with common concepts and terms, relevant for a specific application, as an ontology. The Digital Libraries Initiative at the University of Illinois at Urbana-Champaign [27] defines ontology as "An explicit formal specification of how to represent the objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them."

Ontologies are important tools for sharing knowledge between humans and computers. They are generally hierarchically structured and can be illustrated as graphs. This organization of an ontology can help humans to understand the relationships between the concepts. The entities are represented by nodes, and the edges depict the relationships between the concepts. The structure and complexity of the graph is determined by the type of relationships. The graphs can be either directed or undirected and can contain cycles. In simple cases, the graph has the form of a tree.

Taxonomies are examples of ontologies. The taxonomy of organisms, for instance, contains all species and groups of organisms as nodes, and edges represent "is-a" relationships between groups of organisms.
In this case, the ontology graph simplifies to a directed tree. This can be

thought of as one of the oldest modern ontologies, introduced by Linnaeus in the 18th century. By taxonomy, we refer to the NCBI Taxonomy in this thesis. Another example is the WordNet taxonomy [28], a taxonomy of the English language. WordNet is an online lexical database that represents lexical concepts with sets of synonyms [29]. This taxonomy is a directed acyclic graph of words in which the concepts are connected with "is-a" relationships.

Ontologies are valuable tools for representing biological knowledge. There are many different molecular biology databases that contain, for example, genomic, structural or functional data. Differences in terminology and meaning complicate the combination of different resources in one query [30]. Biological knowledge is often represented by texts and annotations written in natural language. However, natural language uses many terms ambiguously. This vagueness can be resolved by humans but has a large impact on the usability of automatic methods. Ontologies can help in this situation by providing a structured, standardized vocabulary that can be used to describe the biological knowledge. Expressing biological knowledge with the terms of a common ontology makes it possible to combine different databases and to compare the knowledge they contain.

Many different ontologies have been developed in the fields of molecular biology and bioinformatics [26]. The RiboWeb Ontology [31] aims at helping the construction of three-dimensional models of ribosomal components.
Other examples are the Ontology for Molecular Biology [32], which is intended to clarify communication within the molecular biology database community, and the TAMBIS Ontology [33], which describes bioinformatics tasks and resources.

Gene Ontology (GO)

A comparison of the genomes of Saccharomyces cerevisiae and Caenorhabditis elegans revealed a substantial number of homologous proteins whose biological function could be inferred from this relationship [11]. Many of these proteins belong to essential processes like transcription. The Gene Ontology Consortium started with the goal of producing a structured and well-defined vocabulary for describing the roles of gene products in eukaryotes. These descriptions should be easy to transfer between sequences with high similarity in different organisms. The consortium was founded as a joint project of FlyBase [34], Mouse Genome Informatics (MGI) [35] and the Saccharomyces Genome Database (SGD) [36]. GO was initially designed to annotate gene products from a generic eukaryotic cell. As a consequence, the ontology contains no concepts of organs or body parts. By now, many databases have joined the consortium, and prokaryotic organisms are also annotated with GO terms. The vocabulary is refined continually, as missing terms are added on a daily basis.

GO design

GO is composed of three orthogonal ontologies: biological process, molecular function and cellular component. Molecular functions describe the task of a gene product, i.e. its biochemical activity. These terms are applied in protein-level annotations. Examples of molecular functions are "transporter" and "Toll receptor ligand". A biological process is an ordered assembly of several molecular functions that is accomplished by more than one gene product or a complex of gene products. Examples of biological processes are "translation" and "cAMP biosynthesis". The cellular component describes the cellular substructure where a gene

product is found and the complexes, formed of several gene products, to which it belongs. "Ribosome" is a cellular component, as is "nucleus", for example. Biological process and cellular component answer the question of how and where a gene product is active, thus providing a means for process-level annotations.

Biological process is the largest ontology of GO and contains more than 9500 terms as of July. The molecular function ontology is the second largest ontology with more than 7000 terms, and the cellular component ontology contains more than 1500 terms. The ontologies are independent of each other, and each gene product can be annotated with terms from all ontologies. Since gene products can have several functions in different processes and appear in more than one cellular component, gene products can be mapped to more than one concept from each ontology. Each GO term has a unique identifier consisting of seven digits prefixed with "GO:", e.g. "GO: " for "cell". A definition and alternative names as synonyms can be attached to each term. However, this information is not mandatory.

The structure of GO

Each ontology is a directed acyclic graph (DAG) where the nodes stand for terms and the edges, also called links, represent the relationships between different terms. Edges are directed from their source, which is called the parent node, to a target vertex, which is called the child. Nodes can have more than one child and more than one parent node. Vertices without children are called leaves; the other nodes are called inner nodes. Each ontology has a root, which is the only node without a parent. An artificial root node connecting all three ontologies has been introduced. The depth of a node is also called its rank. Most nodes cannot be assigned a unique rank because they have different parents and alternative paths from the root to a node have different lengths.

There are two kinds of relationships, "is-a" and "part-of" links.
The "is-a" relationship is used when the child is a subclass of the parent, e.g. "intracellular signaling cascade" is a subclass of "signal transduction". The "part-of" relationship is used when the child is a component of the parent; "nucleus" is a component of "intracellular", for example. Each term can have different relationships with its different parents. The number of edges in all three ontologies is about one and a half times higher than the number of nodes. By far most links are "is-a" relationships, and the highest fraction of "part-of" relationships is found in the cellular component ontology [22].

Annotations with GO

Several sequencing projects have been using the GO vocabulary to annotate their sequences. The vocabulary is designed to be used for all kinds of gene products and not only for proteins. This means that GO is also suitable for annotating the different types of non-coding RNAs in the cell. Annotations with GO terms have to be attributed to a source, and they should indicate the evidence supporting the annotation. To this end, a list of evidence codes was introduced. These codes can be used to make assumptions about the reliability of the annotations. Common evidence codes are inferred from electronic annotation (IEA), traceable author statement (TAS) and non-traceable author statement (NAS). TAS annotations are augmented with the scientific publication they are derived from. Other evidence codes, such as inferred from direct assay (IDA), indicate the experiment used to elucidate the information on the gene product.
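The graph structure described above can be made concrete with a small sketch. The following Python fragment models a toy ontology as a mapping from each child to its parents, collects all ancestors of a term, and lists the lengths of all paths from the root to a term, which shows why a node with several parents may not have a unique rank. The term names and edges are invented for illustration and are not real GO content.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy ontology: child -> list of (parent, relationship type).
# Names and edges are illustrative assumptions, not taken from GO.
PARENTS: Dict[str, List[Tuple[str, str]]] = {
    "root": [],
    "A": [("root", "is-a")],
    "B": [("root", "is-a")],
    "C": [("A", "is-a")],
    "D": [("C", "is-a"), ("B", "part-of")],  # two parents, two link types
}

def ancestors(term: str) -> Set[str]:
    """All nodes on any path from `term` up to the root (term excluded)."""
    found: Set[str] = set()
    stack = [term]
    while stack:
        for parent, _rel in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def path_lengths(term: str) -> Set[int]:
    """Lengths, in edges, of all paths from the root down to `term`."""
    if not PARENTS[term]:
        return {0}
    return {n + 1 for parent, _rel in PARENTS[term] for n in path_lengths(parent)}

def leaves() -> Set[str]:
    """Nodes that never occur as a parent, i.e. nodes without children."""
    has_child = {parent for edges in PARENTS.values() for parent, _rel in edges}
    return set(PARENTS) - has_child

print(sorted(ancestors("D")))     # ['A', 'B', 'C', 'root']
print(sorted(path_lengths("D")))  # [2, 3]
print(sorted(leaves()))           # ['D']
```

Node D is reachable from the root over two and over three edges, so neither value can serve as its unique rank.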

20 8 CHAPTER 2. ANNOTATIONS AND ONTOLOGIES The manual annotations rely on published information from scientific literature, experiments, the biological knowledge, and the level of confidence of the curator; thus, they are affected by subjective influences [37]. Automated procedures exploit mappings to other databases like cross-references to protein family databases, for example. The annotations with these secondary databases are made with bioinformatics methods like detecting sequence similarity or automatic processing of scientific literature. These automated techniques are considerably faster than manual methods. They are also more consistent than annotations made by human curators because automatic methods rely on exact rules. However, they are not as reliable as the work of human curators. A more detailed view on the process used by the GO Annotation project at EBI is given in Section The True Path rule The True Path rule reads [38]: The pathway from a child term to its top-level parent(s) must always be true. This means that an annotation of a gene product with a GO term is valid if and only if all annotations with the ancestors are also valid. The ancestors of node u are the nodes on every path from u up to the root of the graph. This rule has some important implications on the methods developed which will be discussed later. Problems with the use of GO There are some shortcomings in the use of GO [30]. The relationships defined by GO can have different meanings which leads to problems interpreting the links in the GO graphs. The "part-of" edges are used in the meaning "controls" and "subprocess of", for example. A link that stands for "controls" points to a process that is regulated by another whereas "subprocess of" hints to a component of a process. Using the same relation with different meanings is confusing. Furthermore, the definition of GO terms should standardize the usage of these terms through different communities. 
However, many GO terms still lack a proper definition.

2.2 Semantic similarity

In this section, we cover the basic concepts of semantic similarity as used in information theory and information retrieval. Then we discuss two possible measures for semantic similarity in ontologies. Finally, we present approaches for the application of such measures to GO.

A graph-based measure

Semantic similarity measures capture the similarity that is shared between two concepts in an ontology. For "is-a" ontologies, the structure of the graph can be exploited to define a similarity measure. Higher levels in the hierarchy usually represent generic terms, and the more specific a term is, the deeper it appears in the hierarchy. However, it is not easy to assign a rank to each term. Since terms often have different parents, they also have different ranks. This makes it difficult to differentiate between generic and specific terms on the basis of their ranks. Nodes that are close together in the graph represent similar concepts, but links in GO do not always represent uniform distances. There are paths that are only four terms deep and the last term is already very specific. This mostly reflects the vagaries of biological knowledge [22] and not anything related to the concepts represented by these terms.

Edge counting

The simplest distance measure is to count the number of edges between two nodes. The shorter the path between two nodes is, the more similar they are [39]. If the vertices are connected by more than one path, the shortest path is used. This approach has some problems. It assumes that the distance covered by one edge is uniform over the whole graph. This is not the case in real ontologies. Two siblings on the first rank are much less related than two siblings deeper in the graph, because concepts near the root are more generic. However, the edge counting measure would assign the same similarity to both pairs. The GO terms "cellular process" (GO: ) and "physiological process" (GO: ) are both children of the GO term "biological_process" (GO: ) (see Figure 2.1). The lower level GO terms "rRNA metabolism" (GO: ) and "mRNA metabolism" (GO: ) are also siblings. However, the latter two concepts are certainly more closely related than the first two.

2.2.2 Resnik's measure

Resnik takes a different approach to semantic similarity than simple edge counting [39]. He uses the concept of "information content" (IC) to define a semantic similarity measure. The information content is based on the probability $p(c)$ of a term and measures how much information the term carries. The probability assigned to a concept is defined as its relative frequency of occurrence. Since this measure was developed for a taxonomy of words in a natural language, the probability is based on the number of occurrences of each word in a large text. If all children are specifications of their parents, the total count for a given term is the number of its own occurrences plus the number of occurrences of its children. This leads to a monotonically increasing probability on a path from concept $c$ up to the root.
The root has probability $p(\mathrm{root}) = 1$ if it is unique. Resnik uses the negative logarithm to the base 10 of the concept's probability, $IC(c) = -\log_{10} p(c)$, as information content. A concept with high probability has low information and thus is assigned a low information content. The more information two concepts share, the higher is their similarity. The shared information is captured by the set of common ancestors in the graph. The amount of shared information and thus the similarity between the two concepts is quantified by the information content of the common ancestors. Using the maximum information content of the common ancestors amounts to taking the shortest path from each node to the lowest common ancestor in the edge counting scenario. This leads to the following formula for semantic similarity between two concepts in an ontology:

\[ \mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \bigl( -\log p(c) \bigr), \tag{2.1} \]

where $S(c_1, c_2)$ is the set of common ancestors of concepts $c_1$ and $c_2$. The minimum similarity is zero but there is no maximum for this measure. Resnik evaluated his measure using the WordNet taxonomy [28].
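As an illustration of Equation 2.1, the following minimal sketch computes Resnik's similarity on a small toy taxonomy. The term names, the `PARENTS` map and the probabilities in `P` are invented for this example and are not taken from GO; the sketch also assumes that a term counts as one of its own ancestors, so that comparing a term with itself yields its own information content.

```python
import math

# Hypothetical toy "is-a" taxonomy (assumption): term -> set of parent terms.
PARENTS = {
    "root": set(),
    "a": {"root"},
    "b": {"root"},
    "a1": {"a"},
    "a2": {"a"},
}

# Assumed probabilities p(c); they never decrease toward the root, p(root) = 1.
P = {"root": 1.0, "a": 0.1, "b": 0.5, "a1": 0.01, "a2": 0.05}


def ancestors(term):
    """Return the term itself plus every node on any path up to the root."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(c) = -log10 p(c)."""
    return -math.log10(P[term])


def sim_resnik(c1, c2):
    """Maximum information content over the common ancestors (Equation 2.1)."""
    common = ancestors(c1) & ancestors(c2)
    return max(information_content(c) for c in common)
```

In this toy data, the siblings `a1` and `a2` share the fairly specific ancestor `a`, whereas `a1` and `b` share only the root, so the second pair receives a similarity of zero.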

Figure 2.1: Pane of the GO biological process hierarchy (made with QuickGO at EBI).

2.2.3 Lin's measure

Lin uses a similar approach to semantic similarity [40]. He bases his definition upon three intuitions.

1. The similarity between two concepts relates to their commonality.
2. The similarity is inversely related to the differences between the two concepts.
3. The maximum similarity is reached when the two concepts are identical.

Comparing these intuitions to the definition by Resnik shows two differences. First of all, Resnik's measure only takes commonality into account and disregards the differences between the concepts. Secondly, the maximum according to Resnik's similarity measure is not necessarily reached if the two concepts are identical. Lin defines the similarity as the ratio of the commonality of the concepts and the information needed to fully describe the two concepts. The commonality of the concepts is again captured by their common ancestors. The information needed to fully describe both concepts is simply the sum of their information. This holds because the random selection of one concept is independent of the random selection of the second concept. This leads to the equation:

\[ \mathrm{sim}_{\mathrm{Lin}}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left( \frac{2 \cdot \log p(c)}{\log p(c_1) + \log p(c_2)} \right). \tag{2.2} \]

$S(c_1, c_2)$ again is the set of common ancestors of concepts $c_1$ and $c_2$. In contrast to $\mathrm{sim}_{\mathrm{Resnik}}$ (Equation 2.1), this measure has a well-defined maximum of 1 and the values range between 0 and 1.

Comparison of $\mathrm{sim}_{\mathrm{Resnik}}$ and $\mathrm{sim}_{\mathrm{Lin}}$

Considering the example in Figure 2.2, another difference between the two measures becomes obvious. According to Resnik, all possible pairs $(c_x, c_y)$ with $c_x \neq c_y$ and $c_x, c_y \in \{c_1, c_2, c_3, c_4\}$ have the same semantic similarity. In particular, the pairs $(c_1, c_2)$ and $(c_3, c_4)$ in Figure 2.2 have the same similarity according to Resnik. Lin's measure is better suited to distinguish between these pairs, provided that no two terms have the same information content.
It would assign a higher similarity to $(c_1, c_2)$ than to $(c_3, c_4)$. Both semantic similarity measures rely on the information content of the lowest common ancestor (lca). The less related two GO terms are, the higher their common ancestors are in the graph and hence the lower the information content and the semantic similarity. This also holds for siblings. Thus, these measures circumvent the problem of the edge counting approach. Our methods use Resnik's and Lin's measures of semantic similarity for assessing the similarity of GO terms. This will be discussed in a later chapter.

Figure 2.2: Small example for the comparison of Resnik's and Lin's measures of semantic similarity (lca is the lowest common ancestor). $\mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) = \mathrm{sim}_{\mathrm{Resnik}}(c_3, c_4)$ and $\mathrm{sim}_{\mathrm{Lin}}(c_1, c_2) > \mathrm{sim}_{\mathrm{Lin}}(c_3, c_4)$.

Applications of semantic similarity to GO

In this section, we discuss a semantic similarity measure for GO terms that is adapted from Resnik's similarity measure [39] and was first used by Lord et al. [22] in the context of semantic similarity for GO. We introduce the work of Lord et al. and the Bio-Data Warehouse (BioDW) from SCBIT [17].

Probability measure for GO

Resnik's and Lin's similarity measures both rely on the concept of information content. To be able to calculate the information content of a GO term, it is necessary to define a probability function $p : T \rightarrow [0, 1]$ for the set $T$ of GO terms. The probability of a term to occur is defined as its frequency in the annotations in a database [22]. The frequency of a term is given by

\[ \mathrm{freq}(c) = \mathrm{anno}(c) + \sum_{h \in \mathrm{children}(c)} \mathrm{freq}(h). \tag{2.3} \]

$\mathrm{anno}(c)$ is the number of gene products annotated with this term in the database. $\mathrm{children}(c)$ is the set of child nodes of term $c$. The sum $\sum_{h \in \mathrm{children}(c)} \mathrm{freq}(h)$ results from the True Path rule. The probability of term $c$ is then estimated by its relative frequency, that is, its frequency divided by the frequency of the root:

\[ p(c) = \frac{\mathrm{freq}(c)}{\mathrm{freq}(\mathrm{root})}, \tag{2.4} \]

where $c \in T$ and $T$ again is the set of GO terms. The probability is calculated for each ontology independently of the others. It is monotonically increasing as one moves up on a path from a leaf to the root.
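Equations 2.3 and 2.4 can be sketched in a few lines, and the resulting probabilities can be plugged into Lin's ratio (Equation 2.2). The ontology in `CHILDREN` and the annotation counts in `ANNO` are invented toy data, not GO or any real database. The sketch assumes a tree, in which every term has a single parent; in a graph with multiple parents, the plain recursion below would count shared descendants more than once. It also assumes every term has a non-zero frequency, and that a term counts as its own ancestor.

```python
import math

# Hypothetical toy ontology (assumption): term -> list of child terms.
CHILDREN = {"root": ["x", "y"], "x": ["x1", "x2"], "y": [], "x1": [], "x2": []}

# Assumed number of gene products annotated directly with each term, anno(c).
ANNO = {"root": 0, "x": 2, "y": 10, "x1": 3, "x2": 5}


def freq(term):
    """Equation 2.3: own annotations plus the frequencies of all children."""
    return ANNO[term] + sum(freq(child) for child in CHILDREN[term])


def prob(term):
    """Equation 2.4: frequency relative to the frequency of the root."""
    return freq(term) / freq("root")


def ancestors(term):
    """Return the term itself plus all terms above it in the toy tree."""
    result = {term}
    for parent, kids in CHILDREN.items():
        if term in kids:
            result |= ancestors(parent)
    return result


def sim_lin(c1, c2):
    """Equation 2.2, using natural logarithms (the base cancels in the ratio)."""
    denominator = math.log(prob(c1)) + math.log(prob(c2))
    if denominator == 0:
        # Both terms have probability 1 (the root compared with itself).
        return 1.0
    common = ancestors(c1) & ancestors(c2)
    return max(2 * math.log(prob(c)) / denominator for c in common)
```

With these counts, `freq("root")` is 20 and `prob("x")` is 0.5. A term compared with itself reaches the maximum of 1, while two terms whose only common ancestor is the root score 0.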

Semantic similarity search tool by Lord et al.

Lord et al. [22] implemented a search tool that allows the user to identify semantically similar proteins within the human subset of proteins in Swiss-Prot. They use Resnik's definition (Equation 2.1) but calculate the information content of the GO terms using the natural logarithm instead of the logarithm to the base 10. For two gene products $g_1$ and $g_2$ with sets of GO annotations $A_1$ and $A_2$ of size $m$ and $n$, respectively, the similarity is calculated with the following formula:

\[ \mathrm{SIM}(g_1, g_2) = \frac{1}{m \cdot n} \sum_{c_1 \in A_1,\, c_2 \in A_2} \mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2). \tag{2.5} \]

The three parts of Gene Ontology, biological process, molecular function and cellular component, are evaluated separately. Thus each protein pair is assigned three independent similarity values. Algorithm 1 is used for calculating the semantic similarity of protein A and protein B with respect to biological process. The calculation of semantic similarity for molecular function and cellular component is done the same way.

Algorithm 1: Lord et al.'s algorithm for calculating semantic similarity.

    BP_A ← select biological process terms from protein A
    BP_B ← select biological process terms from protein B
    sum ← 0
    for a ∈ BP_A do
        for b ∈ BP_B do
            sum ← sum + sim_Resnik(a, b)
        end for
    end for
    SIM(A, B) ← sum / (|BP_A| · |BP_B|)    // |BP_A| and |BP_B| denote the sizes of the sets BP_A and BP_B, respectively.

Evaluation

Lord et al. evaluated their method through a comparison of semantic similarity values with sequence similarity. They calculated the covariance between the BLAST "BitScore" and the semantic similarity. They found a covariance of 0.58 comparing molecular function and sequence identity, 0.28 for biological process and 0.36 for cellular component. The correlation is better for higher sequence similarity and higher semantic similarity values. Two important circumstances influence this correlation.
First of all, proteins that perform the same or similar functions do not necessarily have high sequence similarity. Secondly, the correlation is worse in the case of biological process which is possibly due to the fact that a biological process is composed of different molecular functions. These functions are performed by proteins that are not homologous, belong to different families and have no significant sequence similarity. The same holds for cellular component since all proteins from one component or complex will have similar cellular component annotations, but they can differ significantly in sequence and molecular function.
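The averaging scheme of Equation 2.5 and Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions, not the implementation by Lord et al.: the function name `gene_product_similarity`, the term identifiers and the similarity table `TOY_SIM` are hypothetical, and any term-level measure (for example Resnik's) can be passed in as `term_sim`.

```python
from itertools import product


def gene_product_similarity(terms_a, terms_b, term_sim):
    """Average term_sim over all pairs of annotations (Equation 2.5)."""
    if not terms_a or not terms_b:
        raise ValueError("both gene products need at least one annotation")
    total = sum(term_sim(a, b) for a, b in product(terms_a, terms_b))
    return total / (len(terms_a) * len(terms_b))


# Hypothetical pairwise term similarities (assumption), symmetric by lookup.
TOY_SIM = {("t1", "t3"): 2.0, ("t2", "t3"): 1.0}


def toy_term_sim(a, b):
    """Look up a toy similarity value; unknown pairs count as 0."""
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))
```

Calling `gene_product_similarity(["t1", "t2"], ["t3"], toy_term_sim)` averages the two pairwise values 2.0 and 1.0. Because every pair contributes equally, a single highly similar annotation pair can be diluted by many dissimilar ones.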

The GO graph contains two types of links, "is-a" links and "part-of" links. Resnik developed his similarity measure for an "is-a" taxonomy, but also states that "part-of" links can be seen as features that contribute to similarity. Lord et al. found no difference in the correlation between semantic and sequence similarity whether "part-of" links are discarded or not. This is possibly due to the fact that the number of "is-a" edges in the GO graph is much higher than the number of "part-of" links. Thus the "part-of" links do not have a high impact on the calculation of semantic similarities. The approach from Lord et al. is restricted to a search within the subset of human proteins in Swiss-Prot. The calculation of semantic similarity is only based on one of the three aspects of Gene Ontology. This means that similarity is relative only to one ontology, which does not imply that the two proteins are semantically similar overall.

BioDW

As already mentioned in the Introduction, the Bio-Data Warehouse (BioDW) at SCBIT [17] provides a semantic similarity search that can be used to search within a database. Starting with a query protein, the user can retrieve a list of corresponding GO annotations. One of these GO terms can be selected and a list of semantically similar GO terms is calculated. In the final step, the user selects some of these similar terms and a list with all proteins annotated with one of the selected terms is displayed. The result of this search is a list of proteins that are annotated with one semantically similar GO term. This search relies only on the similarity of one single annotation and makes no statement on the semantic similarity of the whole protein. The semantic similarity between GO terms is calculated with Resnik's formula (Equation 2.1).

Clustering with semantic similarity

Speer et al. developed a memetic algorithm that clusters genes based on their biological process annotations [41].
A memetic algorithm is a combination of a genetic algorithm with a local search heuristic. Speer et al. define a distance $d$ between two genes based on Lin's similarity measure for GO terms (see Section 2.2.3 for details on the measure) as

\[ d_{\mathrm{Lin}}(c_i, c_j) = 2 \cdot \bigl( 1 - \mathrm{sim}_{\mathrm{Lin}}(c_i, c_j) \bigr). \tag{2.6} \]

Speer et al. represent the genes as nodes of a tree and connect them with edges. The edges are assigned the dissimilarity between the start and end nodes as weight. An algorithm calculates a minimum spanning tree (MST) of the complete gene tree. An MST contains all nodes and has a minimal sum of edge weights. The algorithm then cuts edges in this MST in order to find optimal clusters of genes. Genes can be annotated with more than one biological process. In this case, Speer et al. use the lowest distance between the annotations of two genes as their distance and disregard the rest of the annotations. This clustering approach is useful for replacing the manual categorization of genes on a microarray into GO categories.
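The distance of Equation 2.6 and the "lowest distance" rule for genes with several annotations can be sketched as below. This only illustrates the distance computation, not the memetic clustering itself. The function names, the term identifiers and the values in `TOY_LIN` are assumptions made for this example; `sim_lin` stands for any function that returns Lin's similarity for two terms.

```python
def d_lin(similarity):
    """Equation 2.6: turn a Lin similarity in [0, 1] into a distance in [0, 2]."""
    return 2.0 * (1.0 - similarity)


def gene_distance(terms_i, terms_j, sim_lin):
    """Lowest term-to-term distance between the annotations of two genes."""
    return min(d_lin(sim_lin(a, b)) for a in terms_i for b in terms_j)


# Hypothetical Lin similarities between biological process terms (assumption).
TOY_LIN = {("p", "q"): 0.75, ("p", "r"): 0.25}


def toy_sim_lin(a, b):
    """Look up a toy similarity; identical terms score 1, unknown pairs 0."""
    if a == b:
        return 1.0
    return TOY_LIN.get((a, b), TOY_LIN.get((b, a), 0.0))
```

For a gene annotated with `p` and a gene annotated with `q` and `r`, only the closer pair `(p, q)` determines the distance, which mirrors the way the remaining annotations are disregarded.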

2.3 Data sources

There are many databases that store sequence annotations on the nucleotide, the protein, and the process level. The EMBL database at EBI [42] is one of the largest nucleotide sequence databases. Protein annotation data is provided by the UniProt database. Protein family and domain databases contain information on regions conserved in many proteins. Databases that focus on single organisms sometimes contain data on gene knock-out experiments and on interactions between different proteins. The tertiary structure of proteins is determined with X-ray crystallography or with NMR spectroscopy, and this data is deposited in the Protein Data Bank [43]. The problem with many different databases is that they use incompatible data schemes and identifiers. Sometimes, they even use the same terms with different meanings. These problems complicate cross-referencing different databases.

Pfam

The Pfam database is a collection of protein families [6, 16, 44]. There are four possible classifications for the entries: family, domain, repeat and motif. A family consists of related proteins. A domain is defined as an autonomous structural unit or as a reusable sequence unit. A repeat is not usually stable in isolation and multiple tandem repeats are generally required to form a globular domain or extended structure. Motifs describe short sequence units found outside globular domains. The entries are represented by profile hidden Markov models (profile-HMMs) that are derived from multiple sequence alignments. A seed alignment that contains representative sequences for a family is used to derive a profile-HMM. This HMM is used to scan a non-redundant subset of Swiss-Prot and Swiss-Prot TrEMBL and the sequences found are combined into the full alignment. Historically, Pfam was developed to help in the annotation process of the Caenorhabditis elegans genome.
Pfam consists of a hand-curated part with domains and families, Pfam-A, and of automatically generated families, Pfam-B. As of version 17.0 from March 2005, Pfam-A contains 7868 families and covers 75 % of the sequences from Swiss-Prot and TrEMBL. Pfam-B families are derived from ProDom [45] domains and add another 19 % sequence coverage. The construction of Pfam-B is used to detect sequences that are missed by the Pfam-A family profile-HMMs. The most important quality control in Pfam is the overlap check. No residue can belong to more than one family. If a violation of this rule is detected, the respective family alignments have to be refined. Each entry is annotated with a textual description, links to other resources and literature references. The annotation provides information on the function of proteins containing the family and on the distribution of the family in different species. Another addition has been the development of a hierarchy of clans that contains related Pfam families. Pfam families that have evolved from a common ancestor are grouped together in one clan. Pfam is a member of the InterPro consortium [20]. InterPro is an effort to integrate different protein family databases. Protein families are mapped to generic GO terms that describe proteins belonging to that family. These functional annotations of families can then be transferred to new members of a family. A hand-curated mapping for InterPro entries is available from the GO website (

SMART

SMART (Simple Modular Architecture Research Tool) has been developed to coordinate knowledge from scientific literature and sequence databases [7, 46]. It was designed to provide an annotated collection of alignments from cytoplasmic signaling domains that can be found in many eukaryotes. SMART originally aimed at providing a resource for gapped alignments of signaling domains. There are almost 700 SMART domains, as of July. SMART domains are represented with Hidden Markov Models (HMM) and multiple sequence alignments. Homologous sequences that have a statistically significant sequence similarity are used to build a phylogenetic tree with ClustalW [47]. A single sequence from every branch of the tree is included in a seed alignment. This seed alignment is used to derive a profile for database search. The sequence profiles are used to build HMMs which can be used to search for domains in new sequences. SMART is also a member of the InterPro consortium and a mapping of SMART domains to GO terms is derived from the mapping of InterPro domains to GO terms. SMART and Pfam are complementary to a large extent.

NCBI Taxonomy

The species taxonomy provides a resource for grouping related species together. The taxonomy has the form of a tree where each taxon has only one parent. The links represent "is-a" relationships. The NCBI Taxonomy [48] contains all organisms with at least one sequence in the GenBank database. The NCBI Taxonomy is an attempt to include taxonomic and phylogenetic information from different sources. Phylogenetic trees represent the evolution of organisms or single sequences. Sources for NCBI Taxonomy include scientific literature, databases and human experts.
As of July 2005, the NCBI Taxonomy contains almost species and overall more than taxa.

Gene Ontology Annotation (GOA)

The number of sequences in the Swiss-Prot and TrEMBL databases increases very fast and it is not feasible to provide manual annotation at the same speed. The Gene Ontology Annotation project at the EBI aims at providing GO annotation for gene products for all completely sequenced organisms [9, 37, 49]. GOA uses a combined approach with manual and automated annotation procedures. The automated procedure is based on existing annotations from other resources. Entries in Swiss-Prot and TrEMBL are linked to InterPro and to Enzyme Commission (EC, [50]) numbers. A manually curated mapping of InterPro entries to GO terms provides the basis for assigning GO terms to the gene products. Multifunctional proteins containing several InterPro domains are assigned multiple GO terms with this procedure. An EC number to GO mapping is used in the same manner. Swiss-Prot contains a controlled list of keywords with a definition to clarify their biological meaning and intended use. These keywords are manually mapped to detailed GO terms. This mapping is also used in the automated annotation process. The automated annotation procedures provided 69 % GO coverage for the sequences in UniProt as outlined by Camon et al. [37]. Manual annotation is more detailed but is also more labor intensive. Therefore, the proteins are prioritized for manual annotation. GOA concentrates on human proteins without GO annotation that are relevant for diseases. Manual annotation involves extensive literature search not only to find statements on the function but also the evidence for these statements. Mappings to GO terms that become obsolete are replaced manually by a suitable term. The GOA project created a benchmark for automated GO annotations for the BioCreAtIvE task 2. This task evaluates whether information retrieval methods can help biologists in annotating new sequences. In an analysis of human curator performance, they found that human curators extracted the exact GO term in 94 % of the cases and achieved a recall rate of 72 %. The recall rate is the ratio of the number of valid annotations made to the total number of possible annotations. A benchmark of the automated methods currently used by GOA showed that they achieve a precision of 91 to 100 % [37]. Annotations that are derived from EC numbers were more detailed than the annotations derived from InterPro or Swiss-Prot keyword mappings.

UniProt

The UniProt database ( [10, 51] is a comprehensive resource for information on proteins and their annotations. It is a joint effort of the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and Georgetown University, which are members of the UniProt Knowledgebase (UniProt) consortium. It combines the Swiss-Prot and TrEMBL [5] databases from EBI and SIB with the PIR-Protein Sequence Database (PIR-PSD) [52] from Georgetown University. PIR-PSD has been discontinued after release 80.0 from December 31, UniProt consists of three parts: the UniProt Archive (UniParc), the UniProt Knowledgebase (UniProt) and the UniProt Reference (UniRef). UniParc is a very comprehensive non-redundant protein sequence database that contains sequences from Swiss-Prot, TrEMBL, PIR-PSD, EMBL [42], Ensembl [53] and many more.
Entries from the source databases that represent the same sequence are merged into one single UniParc entry and assigned a unique UniParc identifier. Each UniParc entry provides cross-references to the respective entries in the source databases. UniRef is a collection of three representative protein sequence databases. It consists of NREF100, NREF90 and NREF50. In NREF100, identical sequences from UniProt and sequence fragments from the same species are merged into one entry. NREF90 and NREF50 are subsequently built from NREF100. All entries from all species are merged if they have a mutual sequence identity of at least 90 % or at least 50 %, respectively. UniProt is the central database of protein sequences with annotations and functional information. There are two parts in UniProt, Swiss-Prot and TrEMBL. The entries in Swiss-Prot are curated by hand and cross-references to other databases are added manually or reviewed by curators. Annotation by curators is based on literature and includes protein functions, posttranslational modifications, secondary structure and possible splice isoforms. TrEMBL contains all proteins that are predicted from coding genomic sequences in the EMBL database [42] and that are not yet included in Swiss-Prot. Proteins in TrEMBL are annotated automatically with cross-references to other resources. Swiss-Prot serves as the basis for the generation of rules for the automatic annotation of TrEMBL entries. The RuleBase [54] method groups the proteins according to their InterPro annotation, and common annotations of Swiss-Prot entries in one group are transferred to TrEMBL entries in the same group. In addition to this rule-based approach, the UniProt consortium implemented a fully automated approach, called Spearmint, which is based on decision trees [55]. This approach calculates decision trees from a set of reliable annotations, and new proteins are characterized with these decision trees. RuleBase and Spearmint improved the annotation in 32 % and 55 % of the UniProt entries, respectively. In order to reduce erroneous annotation from these methods, a post-processing system called Xanthippe has been implemented. This is based on exclusion mechanisms and decision trees [56]. UniProt uses standardized nomenclatures and controlled vocabularies to describe specific information on proteins. These controlled vocabularies include Swiss-Prot keywords which are also mapped to GO terms. UniProt provides cross-references to more than 60 other databases and thus represents a hub for protein information. Important cross-references include the annotations of the protein sequences with Pfam and SMART domains and families. The GO annotations provided by the GOA project are also included in UniProt. Furthermore, UniProt uses the NCBI Taxonomy to identify the organism source of a protein sequence. UniProt classifies annotations made to proteins with evidence codes which include the data source, the types of evidence and the method of the annotation. The evidence labels help users to assess the quality and reliability of annotations. UniProt contains more than 1.8 million proteins from over 85,000 different species, as of July 2005.

Chapter 3

Methodology

In this chapter, we describe in detail how we combined the different data sources. This combination allows sets of the different types of information to be selected. Furthermore, we explain how to compare the GO annotations of different proteins in order to find functionally equivalent proteins.

3.1 Combining families, taxonomy and function

Our method links the gene products with function information, protein family classification and the taxonomy of the source organism. We implemented a relational database that links the different kinds of information and allows complex queries. The information used in our approach can be visualized as a triangle (Figure 3.1). The corners of the information triangle represent the secondary information on function, family and the taxonomic tree. The gene products in the center of the triangle are directly linked to the three corners, thus connecting the other information with each other. In order to allow combinations of these different types of information, a uniform data scheme is used and relations between the different information types are defined. The database supports efficient queries that involve an arbitrary combination of the information in the triangle as input and also as output. Using data from organisms that are not completely sequenced can yield misleading results. Consider the query: "Select all Pfam families that are in species A and not in species B". If species B is not completely sequenced, it is possible that families are excluded simply because the proteins are not yet characterized. Therefore, we decided to use a database that contains only information from completely sequenced organisms. We restricted the database to proteins only.

The database implementation

Figure 3.2 illustrates the interaction of the user with the database. The input interface is used to formulate a query. This query is translated into SQL and subsequently submitted to the database.
The results returned from the database are processed and presented to the user in an output interface. Processing the results is important to help users to get an overview of their findings. Since GO can be represented by a graph and the taxonomy by a tree, the results from these sources can be visualized in the corresponding structure. Visualization of results can be very helpful in interpreting the results and drawing conclusions. A detailed explanation of the visualization can be found in Section 4. An application allows the user to easily formulate queries and format the results. In order to ease the usage of the database as much as possible, we provide a query language that is independent of the underlying data scheme. This allows users to pose queries without knowing details of the database. The query language has a simple form, consists of only a few operators and thus is easy to use. Additionally, this approach has the advantage that the user interface does not change after changes to the database are made.

Figure 3.1: Information triangle illustrating the connection between the different information sources.

Figure 3.2: Information flow in the database implementation.

Three basic types of queries are possible:

1. Selection of sets of families, taxa, functional terms or proteins
2. Comparison of two sets of GO terms
3. Assignment of functionally equivalent gene products to gene products from different species

Selection of sets

The database allows the user to search all integrated data sources with certain conditions. It is possible to freely combine different input sources to select a set of entities. This allows queries like "Select all species that have proteins with the PHP domain" or "Select all proteins with the PHP domain that are involved in DNA replication". Furthermore, this approach permits access to every rank in the taxonomy and does not restrict the user to species or to predetermined organism groups. The selection of sets of GO terms is more complex than the selection of sets of other information. Considering the simple query "Which biological processes are in fungi and not in human?", where BP1 is a GO term mapped to a fungal protein, and BP2 is a GO term mapped to a human protein, the following situations become apparent:

1. BP1 equals BP2
2. BP1 is an ancestor of BP2
3. BP1 is a descendant of BP2
4. BP1 and BP2 are descendants of the same GO term

The first case is obvious since both terms are equal. In the second case, keeping the True Path rule in mind, the situation is also evident. This rule implies that the human gene product could also be annotated with BP1. Therefore, BP1 has to be considered to appear in human and thus has to be excluded from the final result set. This case is handled by the selection of GO sets. If we consider the third case where the annotations are made the other way round, the situation is not that clear. In this case, the human gene product is annotated with the less detailed GO term. This can have several reasons. One possible explanation is that there is less knowledge about the human protein.
However, it could also be the case that the fungal protein is a subclass of the human protein and is more specific. The annotation is still quite subjective, in particular where it is based on human decisions. This can lead to an inconsistency of the annotations of different curators. Furthermore, different automated annotation procedures can make more or less detailed annotations that are hard to compare. This leads to the possibility that there is more precise knowledge about the human protein but it is not accurately annotated. However, the exact GO term can still be missing. It remains unclear whether the human gene product is involved in the same process as the gene product from fungi or not. This situation is complicated by the two relation types in GO, "is-a" and "part-of". A "part-of" relationship indicates that the child is a component of the parent. This implies that the two proteins probably perform different parts of the same process or function. The "is-a" relationship in contrast indicates that the child is one type of the parent and there could be several different types.
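The handling of the second case (a fungal term that is an ancestor of a human term is excluded because of the True Path rule) can be sketched as a set difference against the ancestor closure of the human terms. This is only an illustration, not the GOTaxExplorer implementation: the `PARENTS` map is a toy hierarchy that reuses term names from this thesis without claiming to reproduce the real GO structure, and the function names are hypothetical. As an assumption of this sketch, terms that are descendants of a human term (third case) are kept in the result.

```python
# Toy hierarchy (assumption): term -> set of parent terms.
PARENTS = {
    "biological_process": set(),
    "metabolism": {"biological_process"},
    "rRNA metabolism": {"metabolism"},
    "mRNA metabolism": {"metabolism"},
}


def with_ancestors(terms):
    """Close a set of terms under the True Path rule (add all ancestors)."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def terms_only_in(first, second):
    """Terms annotated in `first` that are not implied by any term in `second`."""
    return set(first) - with_ancestors(second)


fungal_terms = {"metabolism", "rRNA metabolism"}
human_terms = {"mRNA metabolism"}
fungal_only = terms_only_in(fungal_terms, human_terms)
```

Here "metabolism" is dropped because the human annotation "mRNA metabolism" implies it, while the sibling term "rRNA metabolism" remains in `fungal_only`.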

In the fourth case, both proteins are accurately annotated and they are related through a common ancestor. To assess their semantic similarity, the concepts of BP1 and BP2 have to be compared. The selection of GO term sets regards BP1 and BP2 as different, so we included additional functionality to compare sets of GO terms.

Comparison of two sets of GO terms

In order to handle the problems in the selection of GO term sets, we implemented an algorithm for the comparison of sets of GO terms. A possible query is: "Which biological processes occur in fungi and not in human and how similar is their most similar process in human?". First, the set of fungal processes that do not occur in human and the set of human processes are selected. Second, the GO terms in the first set are ranked according to their semantic similarity to the GO terms in the second set. The outline of this comparison is shown in Figure 3.3. Algorithm 2 is used for the comparison of the two sets of GO terms. The user can select both sets with arbitrary queries.

Figure 3.3: Diagram of the comparison of two GO term sets.

The semantic similarity can be calculated according to Resnik (see Section 2.2.2) or Lin (see Section 2.2.3). We provide both possibilities, and two lists of GO terms are calculated, one ranked according to sim_Resnik (Equation 2.1) and the other ranked according to sim_Lin (Equation 2.2).

Algorithm 2: Algorithm for comparing sets of GO terms.
    A ← select GO terms
    B ← select GO terms
    result ← {}    // sorted descending according to similarity
    for a ∈ A do
        max ← 0
        term ← null
        for b ∈ B do
            sim ← calculatesim(a, b)
            if sim > max then
                max ← sim
                term ← b
            end if
        end for
        result.add(max, term)
    end for

Assignment of functionally equivalent gene products

As more and more functional annotations become available, these annotations can be used to compare proteins in terms of their functions. Such a method makes it possible, for example, to compare two proteomes A and B and to obtain for each protein in A the functionally most similar proteins in B. This approach is capable of finding functionally equivalent proteins that are not homologous to each other and do not have significant sequence similarity. The method allows a more direct comparison of the biological processes of the organisms than a comparison of homologous relationships. The GO annotations can be used to find functionally equivalent proteins that have equivalent roles in different organisms or to find functionally related proteins. One possible query is: "For all proteins from Mycobacterium paratuberculosis find the functionally most similar proteins from human". First, the user selects two sets of proteins, set A and set B. Then, the proteins in set A are compared to the members of set B. A similarity value between the GO mappings of the genes in set A and the GO mappings of the genes in set B is calculated. These values can be used to calculate a score for the functional similarity between two proteins. The best matching hits from set B for each gene product in set A are returned. We describe the details in Section 3.2.

3.2 Functional equivalent gene products

Two gene products from different organisms perform the same molecular function and take part in the same biological processes if they are functionally equivalent. If they have similar roles, they are annotated with semantically similar GO terms. This does not necessarily imply that they have evolved from a common ancestor. The semantic similarity between two GO terms serves as similarity measure for a comparison of the GO mappings. In the following, we describe our idea of functional equivalents (Funeqs) and three scoring schemes for deriving Funeqs.

Calculating the funeqscore

Functionally equivalent gene products have similar molecular functions and take part in similar biological processes. 
This implies that they have to be annotated with similar GO terms in both categories. Gene products are mapped to a set of different GO terms in one ontology. Testing

whether two gene products are functionally equivalent or not involves a comparison of the two sets of GO mappings. It is also possible that the functions of different gene products in one organism are performed by a single gene product in a second organism. In order to compare two annotations, it is necessary to assess the relatedness of the two sets of GO terms. It often occurs that one gene product is mapped to more specific terms than the other gene product. This is sometimes due to missing GO terms but can also be caused by missing knowledge. In addition, incomplete GO annotations are a problem for this approach. There is no species whose gene products are all annotated with GO terms. It is therefore possible to miss the real functional equivalent.

A method for finding functionally equivalent gene products needs a robust measure that is capable of assessing the similarity between sets of GO terms even if they are not equally detailed. It needs to be robust to missing annotations and terms. It must handle gene products that have several molecular functions or take part in more than one biological process, as well as cases where one gene product has a more specific function than the other. Moreover, it is necessary to combine the different ontologies (molecular function, biological process, and cellular component) to obtain a complete assessment of the equivalence.

The first step in the comparison of two gene products is the pairwise comparison of their GO mappings. The mappings to the different ontologies (molecular function, biological process, and cellular component) are examined separately. Considering two gene products A and B annotated with the sets GO^A and GO^B of GO terms, a similarity matrix S is calculated. This matrix contains all possible pairwise similarity values of the mappings GO^A_i of gene product A and the mappings GO^B_j of gene product B. 
\[
S = \begin{pmatrix}
\mathrm{sim}(GO^A_1, GO^B_1) & \mathrm{sim}(GO^A_1, GO^B_2) & \cdots & \mathrm{sim}(GO^A_1, GO^B_M) \\
\mathrm{sim}(GO^A_2, GO^B_1) & \mathrm{sim}(GO^A_2, GO^B_2) & \cdots & \mathrm{sim}(GO^A_2, GO^B_M) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{sim}(GO^A_N, GO^B_1) & \mathrm{sim}(GO^A_N, GO^B_2) & \cdots & \mathrm{sim}(GO^A_N, GO^B_M)
\end{pmatrix} \tag{3.1}
\]

The matrix can be calculated either with sim_Lin (Equation 2.2), sim_Resnik (Equation 2.1), sim_Rel (see Section 3.2.2), or sim_Relevance (see Section 3.2.2). The matrix S is not necessarily square since the proteins can have different numbers of GO mappings. Because the proteins usually have different GO mappings, the matrix is mostly not symmetric. The rows and the columns of S therefore represent two different directional comparisons: the row vectors are a comparison of A to B, and the column vectors are a comparison of B to A.

Our algorithm assigns best hits from the comparison of the mappings of one gene product with the mappings of the other gene product. The best hit is identified by the highest similarity score. A high similarity value of the best hit indicates that this function is represented in both gene products. Finding the best hits for the comparison of A with B is the same as finding the

maximum values in the rows in matrix S (row maxima). The maximum values in the columns of S, referred to as column maxima, represent the best hits for the direction B to A. The averages over the row maxima and the column maxima give similarity values for the comparison of A to B and the comparison of B to A, respectively:

\[
\mathit{rowscore} = \frac{1}{N} \sum_{i=1}^{N} \max_{1 \le j \le M} \mathrm{sim}(GO^A_i, GO^B_j), \tag{3.2}
\]

\[
\mathit{columnscore} = \frac{1}{M} \sum_{j=1}^{M} \max_{1 \le i \le N} \mathrm{sim}(GO^A_i, GO^B_j). \tag{3.3}
\]

rowscore and columnscore are in the range between 0 and 1.

Average Score

One possibility to combine the scores for both directions is to calculate the average:

\[
\mathit{GOScore}_{avg} = \frac{1}{2} \left( \mathit{columnscore} + \mathit{rowscore} \right). \tag{3.4}
\]

This scoring enforces that both gene products have the same types of functionality because a high score can only be achieved if both columnscore and rowscore are high. Additional functions in either of them that have no hit with high semantic similarity in the other one have a penalizing effect. However, multi-functional gene products or partially characterized proteins present a problem for this score. Consider a multi-functional protein A that performs the same functions as a protein B but has some additional functions, or a protein B for which these functions are simply not annotated. In this setting, the columnscore will be high but the rowscore will be low. This gives a low overall GOScore_avg although A can be seen as functionally related to B.

Maximum Score

The maximum score GOScore_max is defined as:

\[
\mathit{GOScore}_{max} = \max\{\mathit{columnscore}, \mathit{rowscore}\}. \tag{3.5}
\]

This scoring addresses the problems involved with multi-functional gene products or incomplete annotation. It detects whether one gene product is functionally related to the other and does not enforce that both gene products are functionally related. 
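
As a minimal sketch, Equations 3.2 to 3.5 could be computed as follows in Python. This is not the GOTaxExplorer code; the function names and the example matrix are illustrative assumptions, and the similarity matrix is taken as already calculated.

```python
def row_score(S):
    """Average of the row maxima of S (comparison of A to B, Equation 3.2)."""
    return sum(max(row) for row in S) / len(S)


def column_score(S):
    """Average of the column maxima of S (comparison of B to A, Equation 3.3)."""
    columns = list(zip(*S))
    return sum(max(col) for col in columns) / len(columns)


def go_score(S, mode="avg"):
    """Combine both directions with the average (3.4) or the maximum (3.5)."""
    r, c = row_score(S), column_score(S)
    if mode == "avg":
        return 0.5 * (r + c)
    if mode == "max":
        return max(r, c)
    raise ValueError("mode must be 'avg' or 'max'")


# Hypothetical example: A has three GO terms, B has two. The third term of A
# has no good counterpart in B, as for a multi-functional protein A.
S = [
    [1.0, 0.2],
    [0.1, 0.9],
    [0.1, 0.2],
]
avg = go_score(S, "avg")   # rowscore = 0.7, columnscore = 0.95
best = go_score(S, "max")
```

With these made-up values the average score is pulled down by the unmatched third term of A, while the maximum score still reflects that every function of B is covered by A.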
FuneqScore

A final funeqscore is calculated after either GOScore_avg (Equation 3.4) or GOScore_max (Equation 3.5) was calculated for molecular function (referred to as MF Score in the following) and biological process (referred to as BP Score in the following). Two gene products that have a high score in one ontology but only an average score in the other one can be considered average matches. However, their score should be higher than the score of two gene products that are average matches in both categories. Simply adding MF Score and BP Score or taking the average would not distinguish between these two cases. Squaring the MF Score and the BP Score favors high similarity in one ontology and an average score in

the other over an average score in both ontologies and thus allows a distinction of these cases. Therefore, we calculate the funeqscore for two gene products as:

\[
\mathit{funeqscore} = \frac{1}{2} \left[ \left( \frac{\mathit{BPscore}}{UL_{\mathit{BPscore}}} \right)^{2} + \left( \frac{\mathit{MFscore}}{UL_{\mathit{MFscore}}} \right)^{2} \right]. \tag{3.6}
\]

UL_BPscore and UL_MFscore denote the maxima of the two GOScores. The funeqscore can be calculated with either GOScore_avg or GOScore_max and using either sim_Lin, sim_Resnik, sim_Rel or sim_Relevance. For example, the expression funeqscore_{Lin, avg} indicates that the score was computed with Lin's similarity and the GOScore_avg. We use sim_Lin and sim_Relevance for the calculation of funeqscores. The GOScore_max, the GOScore_avg, and the funeqscore range between 0 and 1 with these similarities. Lin's similarity is not a metric since it does not satisfy the triangle inequality. This holds also for the funeqscore.

The funeqscore can be applied to any type of gene product that is annotated with GO terms. Furthermore, it can be calculated with any semantic similarity measure that has a well-defined maximum. It is possible to extend the comparison to also take cellular component mappings into account.

3.2.2 Adding relevance to the FuneqScore

The funeqscore as described in the section above does not take into account how detailed the GO annotations are. The comparison of two gene products annotated with "protein binding" (GO: ) will have a high score, as will the comparison of two gene products that are annotated with "STAT protein nuclear translocation" (GO: ). However, the annotation of the latter two is much more useful because it is very detailed in comparison to the first one. Functional equivalents for proteins that are annotated only with generic terms are not very helpful. If the function of a protein is not known in detail, it is not possible to find functionally equivalent proteins based on these annotations. In this case, most proteins found will have only a slightly similar function. 
A score that depends on both the similarity of two gene products and the relevance of their annotations can better distinguish between meaningful and insignificant functional equivalents.

Deriving relevance

The definition of semantic similarity used in our approach is based on the information content of GO terms. The information content in turn is defined through the probability of the GO terms in the annotation. This probability decreases with the depth of a term, i.e. more detailed terms have a lower probability than coarser terms. This gives a measure for the relevance of a term: the lower its probability, the higher its relevance.

Adding relevance to the similarity

In order to take relevance information into account, we modify Lin's similarity measure (Equation 2.2). This leads to the following formula:

\[
\mathrm{sim}_{Rel}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left( \frac{2 \log p(c)}{\log p(c_1) + \log p(c_2)} \cdot \bigl(1 - p(c_2)\bigr) \right). \tag{3.7}
\]
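
Equation 3.7 can be tried out on a toy example. The following Python sketch is only an illustration under stated assumptions: the term names, the probabilities in `P` and the ancestor sets in `ANC` are made up, S(c1, c2) is taken to be the set of common ancestors of both terms, and returning 0 when both terms have probability 1 is a choice of this sketch.

```python
import math

# Hypothetical term probabilities p(c); more specific terms are less probable.
P = {"root": 1.0, "binding": 0.5, "protein_binding": 0.25, "dna_binding": 0.125}

# Hypothetical ancestor sets; each set includes the term itself.
ANC = {
    "root": {"root"},
    "binding": {"binding", "root"},
    "protein_binding": {"protein_binding", "binding", "root"},
    "dna_binding": {"dna_binding", "binding", "root"},
}


def sim_lin(c1, c2, p, anc):
    """Lin's similarity, maximized over the common ancestors of c1 and c2."""
    denom = math.log(p[c1]) + math.log(p[c2])
    if denom == 0.0:
        # Both terms have probability 1 (the root): no information content.
        return 0.0
    return max(2.0 * math.log(p[c]) / denom for c in anc[c1] & anc[c2])


def sim_rel(c1, c2, p, anc):
    """Equation 3.7: Lin's similarity weighted by the relevance of c2."""
    return sim_lin(c1, c2, p, anc) * (1.0 - p[c2])


forward = sim_rel("protein_binding", "dna_binding", P, ANC)
backward = sim_rel("dna_binding", "protein_binding", P, ANC)
```

Because the factor 1 - p(c2) does not depend on the common ancestor c, it can be applied after taking the maximum. The two directions give different values here since the two terms have different probabilities.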

Figure 3.4: Diagram of the calculation of the funeqscore for two proteins.

The measure sim_Rel takes the relevance of the second term into account and is not symmetric. Instead of using the probability of the second term, it is also possible to use the probability of the lowest common ancestor:

\[
\mathrm{sim}_{Relevance}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left( \frac{2 \log p(c)}{\log p(c_1) + \log p(c_2)} \cdot \bigl(1 - p(c)\bigr) \right). \tag{3.8}
\]

This amounts to assessing the relevance of the common information. Like Lin's similarity, sim_Relevance is symmetric, i.e. sim_Relevance(c_1, c_2) = sim_Relevance(c_2, c_1), and is also in the range between 0 and 1. Since the relevance of a term is higher the lower its probability is, we use 1 - p(c) for the computation of sim_Rel and sim_Relevance.

Using a similarity score, the algorithm assigns best hits, i.e. the GO term with the highest semantic similarity value, to each GO mapping and uses the similarity of these best hits to determine the rowscore (Equation 3.2) and the columnscore (Equation 3.3). Using sim_Rel or sim_Relevance, the most similar term is no longer necessarily the best hit. The relevance similarity represents a tradeoff between high similarity and high relevance, thus favoring a Pareto-optimal best hit. The rest of the procedure remains unchanged. First, the similarity matrix S is calculated with either sim_Rel or sim_Relevance. Then, the GOScores are calculated and combined via Equation 3.6 to give the overall score.

Finding Funeqs

The overall procedure for computing the funeqscore of two gene products is described in Algorithm 3 and visualized in Figure 3.4. This algorithm can be applied to any type of gene products and can be easily extended to compare two sets of gene products with each other. It is also

independent of the funeqscore variant and can be used with funeqscore_{Lin, avg}, funeqscore_{Lin, max}, funeqscore_{Rel, avg}, funeqscore_{Rel, max}, funeqscore_{Relevance, avg} or funeqscore_{Relevance, max}.

Algorithm 3: Algorithm for computing the funeqscore of two gene products A and B.
    BP_A ← select biological process terms from A
    BP_B ← select biological process terms from B
    S ← {}
    for a ∈ BP_A do
        for b ∈ BP_B do
            S_ab ← calculatesim(a, b)
        end for
    end for
    rowS ← rowscore(S)
    columnS ← columnscore(S)
    BPscore ← GOScore(rowS, columnS)    // maximum or average score
    MF_A ← select molecular function terms from A
    MF_B ← select molecular function terms from B
    S ← {}
    for a ∈ MF_A do
        for b ∈ MF_B do
            S_ab ← calculatesim(a, b)
        end for
    end for
    rowS ← rowscore(S)
    columnS ← columnscore(S)
    MFscore ← GOScore(rowS, columnS)    // same GOScore as above
    return funeqscore(BPscore, MFscore)
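
The steps of Algorithm 3 can be sketched compactly in Python. This is a hedged illustration rather than the thesis implementation: the function names, the lookup-based `toy_sim` and the term names are hypothetical, and the upper limits default to 1, which assumes a similarity measure bounded by 1 such as sim_Lin.

```python
def go_score(S, mode="avg"):
    """GOScore of a similarity matrix S given as a list of rows."""
    row = sum(max(r) for r in S) / len(S)
    col = sum(max(c) for c in zip(*S)) / len(S[0])
    return max(row, col) if mode == "max" else 0.5 * (row + col)


def combine_scores(bp_score, mf_score, ul_bp=1.0, ul_mf=1.0):
    """Equation 3.6: mean of the squared, normalized ontology scores."""
    return 0.5 * ((bp_score / ul_bp) ** 2 + (mf_score / ul_mf) ** 2)


def funeq_score(bp_a, bp_b, mf_a, mf_b, sim, mode="avg"):
    """funeqscore of two gene products from their BP and MF term lists."""
    if not (bp_a and bp_b and mf_a and mf_b):
        # Missing annotations are not handled in this sketch.
        raise ValueError("both gene products need BP and MF terms")
    bp = go_score([[sim(a, b) for b in bp_b] for a in bp_a], mode)
    mf = go_score([[sim(a, b) for b in mf_b] for a in mf_a], mode)
    return combine_scores(bp, mf)


# Hypothetical precomputed similarities for a few term pairs.
TABLE = {("atp_binding", "gtp_binding"): 0.5}


def toy_sim(a, b):
    """Toy symmetric similarity: 1 for identical terms, else table lookup."""
    if a == b:
        return 1.0
    return TABLE.get((a, b), TABLE.get((b, a), 0.0))


score = funeq_score(["dna_replication"], ["dna_replication"],
                    ["atp_binding"], ["gtp_binding"], toy_sim)
```

In this toy case the BP score is 1 and the MF score is 0.5, giving a funeqscore of 0.625, whereas two average matches with scores of 0.75 in both ontologies give `combine_scores(0.75, 0.75)` = 0.5625. Both pairs have the same mean score, so the squaring is what separates them.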

41 Chapter 4 Implementation We describe the relational database that integrates the different information sources in this chapter. We will also describe GOTaxExplorer, an application that is designed to easily access the database. GOTaxExplorer features a command line interface and a graphical user interface that provide the possibility to query the database and to present the results. 4.1 Database In order to achieve the integration of the different data sources, we implemented a relational database scheme. The database allows any combination of the different data sources in a single query, access to all levels in the taxonomy and in the Gene Ontology, adding new data and data sources easily. We implemented the relational data model with the MySQL database ( com) which provides a fast and reliable relational database server that is well suited for our purposes. We used version max of MySQL that is installed on a computer with symmetric multiprocessing with four CPUs and running Solaris 9. The intended use of a database is very important for the design. The main focus of our database is on queries and there will be no frequent changes of the data, apart from scheduled updates. This means that it is more important to have low response time to queries instead of providing for fast updates of the data. A relational data model consists of tables which represent entities with attributes. The rows stand for instances of the entity and the attributes characterize the instance. A relational database is designed to handle relations between different tables. This is particularly useful if several entity instances share information. This common information is stored in a new table. This prevents data redundancy and thus eases data updates and deletes. Relational databases are designed to easily handle 1-to-1 or 1-to-many relationships, i.e. one instance is related to one or many instances of another entity. There are many-to-many relationships between the 29

42 30 CHAPTER 4. IMPLEMENTATION different types of biological data. These relationships have to be resolved with additional tables. Database management systems store the data in unordered files on a hard disk. Queries that involve large tables are very slow because every row has to be examined in order to calculate the results. Indexes help to resolve this situation. Indexes are sorted lists of the table contents and their use can speed up queries tremendously. However, they significantly add to the space consumption of the database and slow down updates and deletes because they have to be restored after each operation that changes the data. These points need to be addressed by the database design. A universal database that contains all information available is practical for many different purposes. Despite these advantages, there are major drawbacks if one considers the comparison of species or groups of organisms. UniProt contains sequences from more than 85,000 species but only around 260 of them are fully sequenced, as of July If a query considers also information of unfinished species, the results can be misleading. The query "Select all processes from species A that are not in species B" will include some processes from A that also occur in B if the corresponding proteins from B are not known yet. Considering only the subset of completed genomes will give more accurate results. Therefore, we build two databases. The first one contains all available data and can be used in future and existing other projects, this database is called "genes". The second database contains only the data from completely available genomes and is designed for this application, it is called "completed_genomes". This database contains additional tables that improve the speed of queries available through the query language Gene database Figure 4.1(a) shows the scheme of the "genes" database. The database can be divided into three layers. 
Layer 1 is represented by the GENE and GENE2NAME tables which contain the information on the gene products. Layer 2 contains the tables for the NCBI Taxonomy, GO, Pfam and SMART, the secondary databases. The third layer connects the other two layers with primary matching tables. The cross-references between different data sources can correspond to many-to-many relationships. The primary matching tables resolve the problem that gene products can have arbitrarily many mappings to the tables from layer 2 and vice versa. The open design allows for easy addition of other tables with further information and also makes an easy automatic update of the database possible. These two properties make it an ideal starting point for future projects that rely on gene product annotation data. Since the purpose of this database is rather general, we decided to add only indexes that support the insertion and update of data. Further indexes are too specific and slow these tasks down and add to the space requirements of the database Completed genome database The "completed_genomes" database is shown in Figure 4.1(b). The overall architecture is the same as for "genes" with two exceptions. First, we added tables for storing precalculated simi-

43 4.1. DATABASE 31 (a) "genes" database (b) "completed_genomes" database Figure 4.1: Data schemes of the two databases.

44 32 CHAPTER 4. IMPLEMENTATION larity values for GO term pairs. These tables are used to speed up the comparison of GO term sets and the Funeq calculations. Secondly, we introduced secondary matching tables in order to decrease query time. In the original database there are primary matching tables connecting layer one and layer two. This results in complex queries that use tables from the second layer and the third layer for execution. The secondary matching tables directly link tables from the second layer which circumvents the usage of the primary matching tables for most standard queries and thus reduces the complexity of the queries significantly. A second advantage of this approach is that the secondary matching tables used for the query are generally smaller than the primary matching tables which further increase query speed. The downside of these tables is that they add data redundancy to the database. The redundancy increases the space consumption and makes it necessary to rebuild the secondary matching tables after every database update. The concrete application of the "completed_genomes" database allows to implement a rather exhaustive indexing scheme. Since the query language defines the possibilities for the user to access the database, it is possible to add indexes for all columns that can appear in query conditions. This improves the performance of queries tremendously. The space burden is a minor issue for this database as the query speed is of higher priority. The database "completed_genomes" contains information from 229 species which have more than 996,000 proteins. This is almost half of the number of sequences in UniProt. The relative term frequencies needed for the semantic similarity measures are based on the annotations in the "completed_genomes" database. The frequency of a GO term is the sum of its and its descendants occurrences in the "completed_genomes" database. 
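
The frequency counting described above can be sketched in a few lines of Python. The sketch makes several assumptions that are not taken from the database itself: the ontology is given as a hypothetical `CHILDREN` map, the direct annotation counts in `OCCURRENCES` are invented, and the relative frequency is obtained by dividing by the frequency of the root.

```python
# Toy ontology as {parent: [children]} and invented direct annotation counts.
CHILDREN = {
    "root": ["binding", "catalysis"],
    "binding": ["protein_binding", "dna_binding"],
}
OCCURRENCES = {
    "root": 0,
    "binding": 10,
    "protein_binding": 30,
    "dna_binding": 20,
    "catalysis": 40,
}


def descendants(term, children):
    """All descendants of `term`; a set, so shared subgraphs count only once."""
    seen = set()
    stack = list(children.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def term_frequency(term, children, occurrences):
    """Occurrences of the term itself plus those of all its descendants."""
    terms = descendants(term, children) | {term}
    return sum(occurrences.get(t, 0) for t in terms)


def relative_frequency(term, root, children, occurrences):
    """Frequency of the term divided by the frequency of the ontology root."""
    total = term_frequency(root, children, occurrences)
    return term_frequency(term, children, occurrences) / total


p_binding = relative_frequency("binding", "root", CHILDREN, OCCURRENCES)
```

Here "binding" has a frequency of 60 out of 100 occurrences in total, so its relative frequency is 0.6. Each ontology would be processed separately with its own root.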
The total number of term occurrences of an ontology is the frequency of the root. These numbers are calculated for each ontology independent of the other two ontologies. The selection of sets is in the order of seconds to minutes. Selecting all human Pfam families takes less than a second. The query "Select all GO annotations from human proteins" takes slightly over one minute for example. Comparing two sets of GO terms can range between minutes and hours depending on the size of the two sets. Calculating functional equivalents between two completed proteomes takes several hours. The running time of queries is dependent on the capacities of the database and of the database server. The calculation of functional equivalents is dependents more on the client computer running GOTaxExplorer than on the database server. We used Java TM 1.5 ( to implement a set of programs that parse the primary data files obtained from the source databases and load the information into the "genes" database. Each data source is updated with a separate program. This design makes it possible to independently update each data source as new releases become available. The "completed_genomes" database is build with an additional program that uses a list of completely sequenced organisms. We compiled a list of completely sequenced organisms from two sources. The Genomes OnLine Database (GOLD) [1] provides an excellent list with finished sequencing projects. The second resource was the Integr8 project at the EBI ( We used the data files from the source databases that were available at the beginning of May Because Java TM is platform-independent it is possible to deploy the database update programs on many platforms without making any changes to the source code. MySQL is also

45 4.2. GOTAXEXPLORER 33 available for many operating systems. Therefore, the database can be installed on a large variety of computer systems. We tried to follow the official SQL standards and added database specific information to configuration files. This should ease the migration of the database to another database management system. However, some update programs use MySQL specific features which need to be replaced in case of a migration. 4.2 GOTaxExplorer GOTaxExplorer is the main tool that provides the user with possibilities to query the database and to visualize the results. We implemented GOTaxExplorer in Java TM to be able to exploit the platform independence. Java s Swing library provides a good basis for the development of a graphical user interface. GOTaxExplorer requires Java version 1.5 or higher and should run on any operating system such a Java runtime environment is available for. We successfully tested GOTaxExplorer on Debian Linux, Red Hat Linux, Solaris 9 and Windows The database can either be installed locally or on an accessible database server. We used a MySQL database and the MySQL Connector/J version to connect to the database. GOTaxExplorer features a complete graphical user interface and a lightweight command line interface that are designed to be easy to use. Both interfaces allow the user to fully exploit the query capabilities of GOTaxExplorer and the graphical user interface additionally has some advanced input and output features. It allows to search the taxonomic tree and the GO graphs in order to select specific taxa or GO terms. The output is presented in a list and can also be highlighted in the GO graph or the taxonomic tree. Further explanation of the features of the graphical user interface can be found in Section The features of the command line interface are described in Section In addition to the query language, there is the possibility to use the interfaces to enter SQL queries. 
The comparison of GO sets and the calculation of functional equivalents can be parallelized. The number of threads can be chosen arbitrarily in our implementation. Performing these comparisons on a computer cluster leads to a tremendous speedup provided that the database server can handle a large number of queries. The calculation of functional equivalents can benefit more from multiple CPUs than the comparison of GO sets because all queries during the calculation are cached in main memory. This leads to a release of database resources and allows to deploy more threads Query language Since the approach relies primarily on a relational database, the user would be required to use SQL for the queries. In order to be able to formulate SQL queries, it is necessary to know the database scheme and to have in-depth knowledge of SQL. It is needed to formulate rather complex SQL queries to be able to fully exploit all possibilities. This is not practical neither for sporadic nor for daily use. We therefore implemented a query language that is flexible and easy to use at the same time. One advantage of a custom query language is that it greatly simplifies the use of the database and the query tools through placing an abstraction layer on top of the

46 34 CHAPTER 4. IMPLEMENTATION database. The disadvantage of a query language is that the user needs to learn the language to be able to use the program. Therefore, we implemented facilities in the GUI to help the user to perform queries. These will be explained in the next section. In order to allow queries that are not possible with the query language, GOTaxExplorer provides a SQL mode that allows to enter SQL queries directly. Entries from the GENE, the SMART, the PFAM, the TERM and the TAXON tables will be referred as entities. The query language supports all three types of possible queries: 1. Selection of sets of entities 2. Pairwise semantic similarity between sets of GO terms 3. Pairwise functional similarity between sets of proteins The selection of sets with the query language follows a very easy pattern: <result> WHERE <condition> RESTRICT <limit> This simple query allows to define arbitrary sets of any information contained in the database. The user selects the results to receive with <result>. It is possible to select more than one result class at once. The <condition> is composed of the entities and boolean operators. Taxonomic groups present a problem for query evaluation. Every species resolves into a set of entities. Since a taxonomic group is a set of species, a group evaluates to a set of sets of entities. The set of entities of a taxonomic group can be defined as the union or the intersection of the species sets. We use the union of the sets of the different species as default. However, the user can enforce the intersection of the sets. An entity is defined by its domain, "PFAM:", "SMART:", "GO:", or "TAX:", and the accession number from the source database, e.g. "GO:GO: " or "PFAM:PF02811". The condition can contain any combination of the different information sources that are connected with boolean operators. 
The <limit> allow the user to restrict the computation of GO results to annotations with certain evidence codes or to receive only GO terms from specific parts of GO. For taxonomy, the user can restrict the results to species in order to exclude all upper-level taxa. Limits are only valid for GO and taxonomy results and will be ignored if used in other cases. An example for the selection of sets is: GO WHERE TAX:1770 AND NOT TAX:9606 RESTRICT TT:biological_process This query selects all GO terms from Mycobacterium paratuberculosis that are not in human and limits the evaluation to biological processes. In order to support the semantic comparison of two sets of GO terms, we added the "SIM" operator. A similarity query looks like: GO WHERE <condition> RESTRICT <limit> SIM GO WHERE <condition> RESTRICT <limit>

The user can freely define the two sets of GO terms with two subqueries. The only restriction is that both subqueries must return GO terms only. An example of such a query is:

    GO WHERE TAX:1770 AND NOT TAX:9606 SIM GO WHERE TAX:9606

This query calculates semantic similarity scores between all GO annotations from M. paratuberculosis that are not in human and all human GO annotations. We have taken the same approach to support the search for functionally equivalent proteins. In this case, the query has the form:

    GENE WHERE <condition> FUNEQ GENE WHERE <condition>

Again, the language allows a completely arbitrary selection of the gene sets to be compared. This query type makes it possible to compare two different organisms or groups of organisms in order to find functional equivalents. Furthermore, the freely definable sets make it possible to focus the comparison on specific functionalities, for example on proteins with specific GO terms or protein families from two organisms. This makes an analysis of single interesting proteins possible. The query

    GENE WHERE TAX:1770 AND GO:GO: FUNEQ GENE WHERE TAX:9606 AND GO:GO:

assigns functionally equivalent proteins between proteins from M. paratuberculosis involved in DNA replication and human proteins involved in DNA replication. All options of the query language are explained in Appendix A.

Graphical user interface (GUI)

The main window of GOTaxExplorer is divided into three parts (see Figure 4.2): the menu bar on top, the workspace in the middle and a status bar at the bottom. The status bar shows information on lengthy tasks such as running queries or loading graphs. The menu bar gives access to the different options with the mouse. All options are also accessible through keyboard shortcuts so that experienced users can navigate quickly. The menu provides options to show and hide different frames in the workspace. It also provides the means to select query options. The "Results" menu allows the user to save or delete single results and to select the graph representations for visualization.

Figure 4.2: Main window of GOTaxExplorer with query and result frame visible.

The workspace contains the main query interface and the results area. The query interface is divided into three parts: the query field on top, the facilities for building a query in the middle and, finally, the buttons to submit the query and to clear it. The query field contains a drop-down list with all queries issued, which serves as a query history. This is particularly useful if several queries have to be executed in turn, because the user does not need to type in the whole query again. The middle of the interface provides drop-down lists and selection buttons to conveniently build the query. This makes it possible to use GOTaxExplorer without knowing the precise syntax of the query language. The user can decide to write the query directly into the field or to build it incrementally with the facilities provided. It is also possible to use the selection buttons for parts of the query and to insert the rest manually.

A pane on the right shows a two-dimensional tree representation of either a part of GO or the taxonomic tree. We chose a tree representation because the GO graph is very dense. It contains almost one and a half times as many edges as nodes, which makes it rather hard to locate a specific node in the graph. The tree representation is very easy to browse and thus eases the search for specific nodes. Since GO is a DAG, we had to convert its structure into a tree. To this end, we duplicated all sub-graphs that have more than one parent. Only one ontology is displayed at a time to further simplify the search. The tree can be used to find entities and to insert them directly into the query. The tree visualization also provides access to the external information from the source databases through an installed internet browser. An additional search frame makes it possible to search the database by the name of the entity to be added to the condition. This helps in finding the accession number of this entity.

The results of a query are shown as a table in the results frame. Each table is placed in a new tab and also offers the possibility to open a web browser with the information from the source database. Results can be saved in tab-delimited text files that can easily be imported by other programs for further processing.
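The conversion of a DAG into a tree by duplicating shared sub-graphs can be sketched as follows. This is a minimal Python illustration on a hypothetical toy graph; the `unfold` and `count_nodes` helpers and the node names are assumptions made for the example and do not reproduce the data structures used in GOTaxExplorer.

```python
def unfold(dag, node):
    """Unfold a DAG (parent -> list of children) into a nested tree.

    A node that is reachable through several parents is copied once per
    path, so every node in the returned tree has exactly one parent.
    """
    return (node, [unfold(dag, child) for child in dag.get(node, [])])


def count_nodes(tree):
    """Count the nodes of a nested (name, children) tree."""
    _, children = tree
    return 1 + sum(count_nodes(child) for child in children)


# Hypothetical toy DAG: node "c" has two parents, "a" and "b".
dag = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"]}
tree = unfold(dag, "root")
```

The toy DAG has five nodes, while the unfolded tree has seven, because the sub-graph below "c" appears once under "a" and once under "b". This growth is the price paid for the simpler browsing that a tree offers.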

Figure 4.3: Window with the 2D view on the hierarchies. All three ontologies and the taxonomic tree are shown simultaneously.

Figure 4.4: 2D view of the molecular function ontology showing results from a selection of GO terms. Hits are colored red and the number in brackets indicates the number of hits in the subtree.

GOTaxExplorer can visualize results from the taxonomic tree and the GO graph in a two-dimensional tree or a three-dimensional graph. Both output variants can be selected from the menu. The two-dimensional trees are shown in a separate program window (see Figures 4.3 and 4.4). Hits and their ancestors are colored red, and the number of hits in the subtree beginning at a node is indicated in parentheses after its name. This gives the user a general idea of the results and makes it possible to quickly browse the hierarchy to determine the regions with hits. To allow further analysis of the hits, the program can look up the selected entry in the source database on the internet. The three GO ontologies and the taxonomic tree are shown in one window. The trees are created after each database update and then stored in a file on hard disk. This is faster than creating them every time and reduces the load on the database.

We implemented a three-dimensional representation of the graphs with WilmaScope. For detailed system requirements see Appendix A.1. WilmaScope provides classes to construct a graph representation in memory, to write this graph to a file and to read it back. Additionally, the library provides several layout algorithms. We create the different graphs after each database update and use a spring-embedder layout algorithm. The graph is then written to a file and loaded from hard disk by GOTaxExplorer. A screenshot of a graph representation can be seen in Figure 4.5. Nodes are colored green, and hits and their ancestors are colored red. The size of a node increases with the number of hits in the subgraph induced by this node. Nodes without hits are hidden by default, which significantly reduces the computational load and thus improves usability. In addition, we hide the labels of nodes without hits, which further improves working speed.
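The marking of hits and their ancestors, together with the hit count per subtree, can be sketched in a few lines of Python. The nested tuple representation, the `annotate` and `render` helpers and the toy tree are assumptions made for this illustration; the sketch also assumes that a node counts as a hit in its own subtree.

```python
def annotate(tree, hits):
    """Return (name, hit_count, is_red, children) for a (name, children) tree.

    hit_count is the number of hits in the subtree starting at the node.
    A node is marked red if it is a hit or an ancestor of a hit.
    """
    name, children = tree
    annotated = [annotate(child, hits) for child in children]
    count = (1 if name in hits else 0) + sum(child[1] for child in annotated)
    return (name, count, count > 0, annotated)


def render(node, depth=0):
    """Render the annotated tree as indented lines, with counts on red nodes."""
    name, count, is_red, children = node
    label = f"{name} ({count})" if is_red else name
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical toy tree with a single hit on node "c".
toy_tree = ("root", [("a", [("c", [])]), ("b", [])])
lines = render(annotate(toy_tree, hits={"c"}))
```

In the toy example the hit "c" and its ancestors "a" and "root" carry a count, while "b" stays unmarked, which mirrors the way the tree view guides the user towards regions with hits.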
To ease the work with the graph and to improve usability, we implemented some features that help the user to navigate quickly within the graph. A pop-up menu provides information on the selected node as well as access to the source database on the internet. The user can decide which nodes to display and which ones to hide. To this end, we included options in the pop-up menu to display and hide whole subgraphs and the children of a node. Furthermore, it is possible to hide labels or to make them visible. This helps to show only the interesting parts of the graph and improves the speed of the application and the readability of the graph. An edge pop-up menu offers the possibility to center the view directly on the edge's start or end node. A hit can also be centered directly from the result table in the main window. This helps in localizing hits and exploring their neighborhood. The user can zoom in and out to get a detailed view of some part of the graph or an overview of the complete graph.

The visualization of the GO graphs is a valuable tool that helps users to interpret the results. The structure of the GO graph, however, presents a problem for this tool. Since the GO graph is rather complex, it is hard to arrange the nodes in a way that allows for easy interpretation of the results. The three-dimensional approach makes it possible to spread the graph out more but makes it less concise. Improving this visualization will also help to interpret the results gathered with the application.

Command line interface (CLI)

The command line interface supports all query types, and all options are available through text commands. The system requirements are much lower than the requirements of the GUI version.

Figure 4.5: 3D output frame for GO and Taxonomy results. The picture shows a part of cellular component with hits colored red and the node options menu.

It should run on any computer that can run a suitable Java virtual machine. The CLI is designed for use in automated scripts or when no graphical environment is available. Therefore, it provides neither help on entering a query nor any visualization of the results. It is possible to use a file with commands for GOTaxExplorer as input and to redirect the output of the program to a second file. This makes it possible to generate queries automatically with another program and to pass the results from GOTaxExplorer on to programs for automatic processing. It is also possible to add support for GOTaxExplorer to other programs, which can use the CLI to perform queries and to receive the results for further processing. All query modes that are available from the menu in the GUI are available through simple text commands in the CLI version. The CLI provides a help command that explains all available options.

Using GOTaxExplorer

We present examples of using GOTaxExplorer for different tasks. We explain in detail how to use the graphical user interface for the selection of sets, for a comparison of GO sets and for deriving functionally equivalent proteins. A detailed user manual is given in Appendix A. The entities in Table 4.1 are used in the following examples.

Table 4.1: Entities used in the examples.

    Entity          Description
    PFAM:PF02811    PHP domain from Pfam
    GO:GO:          DNA recombination from GO biological process
    TAX:4751        Fungi from NCBI Taxonomy
    TAX:40674       Mammals from NCBI Taxonomy

Investigating the PHP domain

In order to query for information associated with the PHP domain, it is necessary to know the Pfam accession number of this domain. The graphical user interface provides a full text search to find this number (see Figure 4.6). The search frame makes it possible to search the database with keywords for the accession numbers of entities. The results are presented in a table in the same frame, and a double-click on the correct result inserts the entity into the query. The query can be completed using the selection buttons or manually. The complete query for species that have proteins with the PHP domain (Pfam acc: PF02811) and for the biological processes of these proteins reads

    TAX, GO WHERE PFAM:PF02811 RESTRICT TT:biological_process, SPECIES

This query combines queries for taxa and for GO terms. It limits the GO results to terms from the biological process ontology and the taxonomy results to species. The results are shown in tabs in a separate frame (see Figure 4.7); a new tab is created for each table. The tables contain all relevant data and provide access to the online source database through a pop-up menu.

Figure 4.6: Searching for the accession number of the PHP domain.

Figure 4.7: Biological processes that are associated with proteins that contain the PHP domain.

Both results are also highlighted in the respective tree views (see Figure 4.8). The four panes can be resized to view larger parts of one tree. The tree representation is very useful for browsing the hierarchy and locating specific nodes. A double-click on a node of a tree inserts this entity into the query field. This can be used to easily add a GO term to the query for further analysis. The query

    GENE WHERE PFAM:PF02811 AND GO:GO:

returns all proteins that have the PHP domain and are involved in "DNA recombination". The results are again added in a new tab to the results frame. It is possible to directly open a browser with the UniProt entry page for a protein from the list of results. Hovering over any tab with the mouse displays the query that returned this result as a tooltip, i.e. the query is shown beside the mouse pointer. GOTaxExplorer can save single result sets to a tab-delimited text file. Furthermore, it is possible to delete single result tabs or to delete them all at once.

Figure 4.8: This window shows tree views of all three ontologies and the taxonomic tree. Hits and their ancestors are highlighted in red. The tree in the upper left corner is the biological process ontology and the lower right corner contains the taxonomy.

GO results can contain both detailed and more generic terms. GOTaxExplorer filters out generic terms if descendants of these terms are also in the result set. However, it is possible to receive all terms. Another option prints all possible paths from the result terms to the root.
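Both result options mentioned above, the removal of generic terms and the listing of all paths to the root, can be illustrated with a short Python sketch. The toy hierarchy, the term names and the helper functions `ancestors`, `most_specific` and `paths_to_root` are hypothetical and chosen only for this example.

```python
# Hypothetical toy hierarchy: term -> list of direct parents.
parents = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}


def ancestors(term):
    """Return all proper ancestors of a term."""
    found = set()
    stack = list(parents[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(parents[parent])
    return found


def most_specific(results):
    """Drop every term that is an ancestor of another term in the result set."""
    generic = set()
    for term in results:
        generic |= ancestors(term)
    return {term for term in results if term not in generic}


def paths_to_root(term):
    """List every path from a term up to the root."""
    if not parents[term]:
        return [[term]]
    return [[term] + path
            for parent in parents[term]
            for path in paths_to_root(parent)]


specific = most_specific({"A", "C"})
paths = paths_to_root("C")
```

Here "A" is dropped from the result set because its descendant "C" is also present, and "C" has two paths to the root because it has two parents.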

Comparing biological processes from fungi and mammals

In order to compare the biological processes of two species groups, it is necessary to find their taxonomy identifiers. The taxonomic tree can be loaded into the tree pane on the right of the query frame, which makes it easy to find the accession numbers for mammals and fungi. A double-click on the tree inserts the respective entity into the query. The full query reads

    GO WHERE TAX:4751 RESTRICT TT:biological_process SIM GO WHERE TAX:40674 RESTRICT TT:biological_process

The first part selects all biological process terms from fungi (set 1), which includes all fungal species in the database, and the second part returns the set with all processes from mammalian species (set 2). The table with the results can be separated into three parts (see Figure 4.9). The first part contains the accession number and the name of the GO term in set 1. The second part gives the most similar term from set 2 according to Lin, with similarity, accession number and name. The last part contains the same information for the most similar term according to Resnik.

Figure 4.9: Result frame showing a table with the results from a GO comparison. The table contains the information on the most similar term from set 2, according to Resnik and Lin, for each term from set 1.

The visualizations of the graph also contain the similarity values. The 2D tree view includes them in parentheses after the name (see Figure 4.10). In the 3D view of the graph, the values are added to the node pop-up menu. This makes it possible to browse the graph without jumping between two windows.

Finding functionally equivalent proteins

The calculation of functionally equivalent proteins is very similar to the comparison of GO sets. The selection buttons can be used to build the first query and, after adding the "FUNEQ" operator, they can be used to build the second query.
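To make the GO set comparison described above more concrete, the following Python sketch computes the similarity measures of Resnik and Lin in their commonly cited forms: the Resnik similarity is the information content (IC) of the most informative common ancestor, and the Lin similarity divides twice that value by the sum of the ICs of the two terms. The toy hierarchy, the term probabilities and the helper names are assumptions for illustration only; the exact variants and probability estimates used in this thesis are defined in its methods chapter and may differ.

```python
import math

# Hypothetical toy hierarchy (term -> direct parents) and term probabilities.
parents = {"root": [], "A": ["root"], "B": ["A"], "C": ["A"], "D": ["root"]}
prob = {"root": 1.0, "A": 0.5, "B": 0.25, "C": 0.25, "D": 0.5}
ic = {term: -math.log2(p) for term, p in prob.items()}  # information content


def ancestors_or_self(term):
    """Return the term together with all of its ancestors."""
    found = {term}
    for parent in parents[term]:
        found |= ancestors_or_self(parent)
    return found


def sim_resnik(c1, c2):
    """IC of the most informative common ancestor of c1 and c2."""
    common = ancestors_or_self(c1) & ancestors_or_self(c2)
    return max(ic[term] for term in common)


def sim_lin(c1, c2):
    """Resnik similarity normalized by the ICs of both terms."""
    denominator = ic[c1] + ic[c2]
    if denominator == 0:  # both terms are the root; handled as 0 in this sketch
        return 0.0
    return 2 * sim_resnik(c1, c2) / denominator


def best_match(term, candidates, sim):
    """Most similar term from the second set for a term of the first set."""
    return max(candidates, key=lambda other: sim(term, other))
```

For each term of set 1, a call such as `best_match("B", ["C", "D"], sim_lin)` picks the most similar term of set 2. Running it once with `sim_lin` and once with `sim_resnik` corresponds to the two result parts of the table.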
The query to calculate functionally related and functionally equivalent proteins between two sets of proteins with the PHP domain reads:

    GENE WHERE PFAM:PF02811 FUNEQ GENE WHERE PFAM:PF02811

This query uses the maximum score (Equation 3.5) and the average score (Equation 3.4) for the comparison. The number of threads to be used for the comparison can be entered in a configuration file. Comparing two proteomes with each other is a lengthy operation, since all proteins from the first proteome have to be compared to all proteins from the second. If one is interested in a specific biological process, molecular function or protein family, the query can be refined to return only the proteins of interest. This reduces the computation time tremendously.

Figure 4.10: The window shows the tree view of the biological process ontology with results from a GO set comparison. Each term in set 1 is compared to each term in set 2. All terms from set 1 and their ancestors are colored red, and the highest similarity to any term in set 2 is added to the name of the hits in parentheses. The first number is the similarity according to Lin, the second according to Resnik.

By default, the results contain the 10 functionally most similar proteins for each query protein; this number can be changed in the configuration. The results from such a query can be split into three parts (see Figure 4.11). The first part contains the accession number of the query gene product. The second part consists of the rank, the accession number and the average scores for the target gene product. The table contains the funeqscore_{Lin,avg}, the MFScore_{Lin,avg}, the BPScore_{Lin,avg} and the funeqscore_{Rel,avg}. The last part contains the same information as the second, but with all scores calculated with GOScore_max.
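The selection of the best-scoring targets for each query protein can be sketched with the Python standard library. The `top_k` helper, the protein names and the toy scores are hypothetical; the score function is passed in as a parameter because the actual scores (Equations 3.4 and 3.5) are not reproduced in this sketch.

```python
import heapq


def top_k(query, targets, score, k=10):
    """Return (rank, target, score) for the k best-scoring targets of a query."""
    ranked = heapq.nlargest(k, targets, key=lambda target: score(query, target))
    return [(rank, target, score(query, target))
            for rank, target in enumerate(ranked, start=1)]


# Hypothetical precomputed scores between one query and three targets.
toy_scores = {("q1", "t1"): 0.9, ("q1", "t2"): 0.4, ("q1", "t3"): 1.0}


def toy_score(query, target):
    return toy_scores[(query, target)]


best_two = top_k("q1", ["t1", "t2", "t3"], toy_score, k=2)
```

Keeping only the k best targets per query protein bounds the size of the result table, although every pair of proteins still has to be scored once.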

Figure 4.11: Results from a query for functionally equivalent proteins.

Using the SQL mode

GOTaxExplorer provides an SQL mode for direct access to the database. This mode makes it possible to query the database with SQL. The SQL query

    SELECT pfam.acc, pfam.name
    FROM pfam
    INNER JOIN pfam2go ON pfam.id=pfam2go.pfam_id
    INNER JOIN term ON term.id=pfam2go.go_id
    WHERE term.acc= GO:

selects all Pfam families that can be found in proteins that are involved in DNA recombination. It is equivalent to the query:

    PFAM WHERE GO:GO:

This mode is available from the menu in the GUI and with a text command in the CLI. In SQL mode (see Figure A.6 in the Appendix), the user can enter an arbitrary SQL query, and the results are added to a new tab in the results frame. This can be used, for example, to get a full list of SMART domains in the database. It can also be used to combine data in ways that are not possible with the query language.
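The join used in the SQL example above can be tried out on a small in-memory database. The following Python sketch uses the standard `sqlite3` module and a minimal schema that only mirrors the table and column names appearing in the query; all other columns, the rows and the accession strings are hypothetical placeholders, and the real GOTaxExplorer database and its database system are not reproduced here.

```python
import sqlite3

# Minimal stand-in schema; only the names used in the query above are taken
# from the text, everything else is an assumption for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pfam (id INTEGER PRIMARY KEY, acc TEXT, name TEXT);
    CREATE TABLE term (id INTEGER PRIMARY KEY, acc TEXT, name TEXT);
    CREATE TABLE pfam2go (pfam_id INTEGER, go_id INTEGER);
    INSERT INTO pfam VALUES (1, 'PF_EXAMPLE_1', 'family_one'),
                            (2, 'PF_EXAMPLE_2', 'family_two');
    INSERT INTO term VALUES (10, 'GO:EXAMPLE_A', 'process a'),
                            (11, 'GO:EXAMPLE_B', 'process b');
    INSERT INTO pfam2go VALUES (1, 10), (2, 11);
""")


def families_for_term(go_acc):
    """Return (accession, name) of all Pfam families linked to a GO term."""
    query = """
        SELECT pfam.acc, pfam.name
        FROM pfam
        INNER JOIN pfam2go ON pfam.id = pfam2go.pfam_id
        INNER JOIN term ON term.id = pfam2go.go_id
        WHERE term.acc = ?
    """
    return conn.execute(query, (go_acc,)).fetchall()


rows = families_for_term("GO:EXAMPLE_A")
```

The accession is passed as a bound parameter rather than pasted into the SQL string, which is the usual way to avoid quoting problems when queries are generated by a program.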

Chapter 5

Results

We used GOTaxExplorer to test some applications investigating the distribution of families and functional terms in taxonomic groups, with a focus on the selection of possible drug targets. Good targets for drug design are proteins that contain families that occur in the pathogen but not in its host, or that take part in biological processes in the pathogen that do not appear in the host. First, we created some queries to select sets of different information. Then, we used GOTaxExplorer to select all biological processes in fungi that do not appear in mammals and sorted these processes according to their similarity to any process occurring in mammals. Additionally, we looked for functionally equivalent and functionally related proteins between Saccharomyces cerevisiae and human, and between Mycobacterium paratuberculosis and human proteins.

5.1 Selecting families from alveolata

Some important pathogens belong to the taxonomic group of alveolata (NCBI Taxonomy-Id 33630). The most important human pathogen in this group is Plasmodium falciparum, which causes human malaria, a major threat to public health according to the WHO [57]. Cryptosporidium hominis is another alveolate important for public health. It causes acute gastroenteritis and diarrheal disease and accounts for a significant number of deaths of humans and animals [58]. There are other medically important alveolata such as Cyclospora, Toxoplasma and Babesia species. The "completed_genomes" database contains data from five alveolata: two Cryptosporidium species, two Plasmodium species and Paramecium tetraurelia. We selected all Pfam families from alveolata that do not appear in mammals with the query

    PFAM WHERE TAX:33630 AND NOT TAX:40674

This query took less than one second and returned a list of 121 families. We summarize some of them in Table 5.1.

Table 5.1: Pfam families in alveolata that do not appear in mammals.

    Pfam identifier    Name
    PF03989            DNA_gyraseA_C
    PF04364            DNA_pol3_chi
    PF05096            Glu_cyclase_2
    PF03746            LamB_YcsF
    PF00830            Ribosomal_L28

DNA gyrases can be found in bacteria, archaea, plants and alveolata and are also called topoisomerase II. Two proteins from P. falciparum contain this domain. These enzymes introduce supercoils into the DNA [59]. The DNA gyrase C-terminal domain (PF03989) forms 4 beta-strands, and the repeats of this domain form a beta-propeller (see Figure 5.1, [60]). It has been shown that this domain binds DNA unspecifically and may stabilize the DNA-topoisomerase complex.

Figure 5.1: Picture of DNA gyrase subunit A (UniProt acc: O51396) with 6 copies of the DNA gyrase C-terminal domain, (a) side view, (b) top view. Each repeat forms a blade of the 6-bladed beta-propeller. Both pictures were created with Jmol.

The DNA polymerase III chi subunit (PF04364) is part of the DNA polymerase III holoenzyme. The domain is found in a protein from Plasmodium yoelii yoelii and also in many bacteria. The chi subunit forms a complex with the psi subunit that is important for the formation of the holoenzyme. DNA polymerase III in general is responsible for the replication of the chromosome in bacteria. It attaches new nucleotides to the new DNA strand and also has a proofreading function, thus assuring the correct replication of the DNA [59].

Glutamine cyclotransferases (PF05096) catalyze the cyclization of free L-glutamine and N-terminal glutaminyl residues in proteins to pyroglutamate and pyroglutamyl residues, respectively. The proteins in this family are involved in the formation of thyrotropin-releasing hormone and other biologically active peptides [61]. This protein family is found in bacteria, plants and Plasmodium species, but not in mammalian proteins.

The exact molecular function of the LamB/YcsF family (PF03746) is still unknown. However, it includes the lam locus of Aspergillus nidulans, which consists of two genes involved in the utilization of lactams such as 2-pyrrolidinone.

The Ribosomal L28 family (PF00830) contains ribosomal L28 proteins from bacteria and chloroplasts and can be found, for example, in the L28 protein from P. falciparum. The ribosome consists of a large and a small subunit and is made up of several rRNA molecules and many proteins. The proteins mainly stabilize the structure of the complex, and many of them have additional functions independent of the ribosome. Targeting the ribosome with new drugs can prove very efficient, since it is necessary for the translation of mRNA into proteins. These drugs could target the ribosomal proteins as well as the rRNA molecules.

5.2 Investigating the PHP domain

The PHP domain (PF02811) is a putative phosphoesterase domain and belongs to the Pfam clan "Amidohydrolase superfamily". This family includes bacterial DNA polymerase III proteins as well as histidinol phosphatases and uncharacterized proteins. One member of this family is the hypothetical protein YcdX from E. coli, the only member of this family with known 3D structure. It has been shown that the active site of this protein contains three zinc ions in the middle of a barrel of seven alpha helices and seven beta sheets [62]. The putative function of this domain is the hydrolysis of pyrophosphate during DNA synthesis, which facilitates DNA polymerization. A comparison of M. tuberculosis proteins with human proteins has shown that proteins from M. tuberculosis with the PHP domain have no significant sequence similarity to human proteins. One example is the DNA polymerase III alpha subunit. Because this protein is essential for the replication of the chromosome, it is a good target for drug design [63].

Based on these previous results, we used GOTaxExplorer to investigate the PHP domain in greater detail. First, we looked at the distribution of the domain over the taxonomic tree. Figure 5.2 shows the 2D view of the Taxonomy in GOTaxExplorer. As can be seen from this figure, the domain is widespread over all superkingdoms, i.e. Archaea, Bacteria and Eukaryota. However, human belongs to the group of metazoans, and the domain does not occur in metazoans. Eukaryotic proteins with this domain belong to the class of phosphatases, and some are involved in DNA replication. Table 5.2 summarizes important human pathogens that contain proteins with this domain.

Figure 5.2: Distribution of the PHP domain over the taxonomic tree.

Table 5.2: Pathogens with proteins belonging to PHP domain.

NCBI taxonomy Species Proteins (UniProt accession) identifier 210 Helicobacter pylori P Neisseria meningitidis Q9JXZ2 727 Haemophilus influenzae P43743, P Streptococcus pneumoniae 1423 Bacillus subtilis Q9AHD4, Q54518, Q54518, Q54518, O86886, P72510, Q6UVK6, Q6UVL3, Q6UVL7, Q7WVX2, Q8KWQ2, Q8VU33, Q9AH97, Q9AHA6, Q9RIN2, Q9ZFU0, Q9ZII8, Q9AHB9, Q8RLQ2 O34623, P13267, O34411, P96717, P94544, O Mycobacterium tuberculosis P63977, Q7D5L9, P96221, O Staphylococcus aureus subsp. aureus Mu50 P63981, Q99QX6, Q99UW2, Q99X65

The next step in our analysis was to identify all biological processes in which proteins with the PHP domain participate. We found nine different processes, which are summarized in Table 5.3. The suggested function of the domain conveys the idea that "DNA replication" (GO: ), "DNA repair" (GO: ) and "DNA recombination" (GO: ) are vital processes this domain is involved in. We took a closer look at "DNA replication" (GO: ) and used GOTaxExplorer to get a list of proteins that are annotated with this process and contain the PHP domain. Table 5.4 summarizes some of these proteins. The error-prone DNA polymerase from M. tuberculosis (UniProt accession: O50399) is probably not essential for the replication of the chromosome but is involved in damage-induced mutagenesis and translesion synthesis and thus probably contributes to the development of antibiotic resistance.

Table 5.3: Biological processes of proteins with PHP domain.

    GO identifier    Name
    GO:              histidine biosynthesis
    GO:              intein-mediated protein splicing
    GO:              two-component signal transduction system (phosphorelay)
    GO:              DNA replication
    GO:              nitrogen compound metabolism
    GO:              DNA repair
    GO:              polysaccharide biosynthesis
    GO:              DNA recombination
    GO:              electron transport

Table 5.4: Proteins from different pathogens with the PHP domain that are involved in "DNA replication" (GO: ).

    UniProt accession    Name            Description                                Source organism
    O50399               DNAE2           Error-prone DNA polymerase                 Mycobacterium tuberculosis
    P96221               P96221_MYCTU    Hypothetical protein                       Mycobacterium tuberculosis
    O31902               YorL            YORL protein                               Bacillus subtilis
    P94544               yshc            Hypothetical protein YSHC                  Bacillus subtilis
    Q99UW2               SAV1143         DNA-dependent DNA polymerase beta chain    Staphylococcus aureus (strain Mu50/ATCC )

The hypothetical protein P96221 is the other protein from M. tuberculosis that has this domain and is involved in DNA replication. This protein shows DNA polymerase activity. The YorL protein from Bacillus subtilis (UniProt accession: O31902) is a DNA polymerase III alpha subunit. This subunit is a DNA polymerase and also possesses a 3'-5' exonuclease activity. DNA polymerase III is the major replication enzyme in bacteria and is responsible for the replication of the chromosome. The hypothetical protein yshc (UniProt accession: P94544) from B. subtilis is also involved in DNA replication, contains a PHP domain and has a DNA-directed DNA polymerase activity. S. aureus causes many different diseases such as pneumonia, mastitis and meningitis and is responsible for the majority of hospital-acquired infections [64]. This makes it important for public health, and it is already resistant to most available antibiotics. The DNA-dependent DNA polymerase beta chain (UniProt accession: Q99UW2) is the only protein from S. aureus involved in DNA replication, and this domain is probably a good target for the development of new anti-infective drugs.

All these proteins are potentially good targets for the development of new drugs. The distribution of the PHP domain in bacteria and fungi and its absence in human make it a good potential target. However, further analysis is required to elucidate the exact function of the domain and its mechanism.

5.3 Comparing biological processes from fungi and mammals

Fungi cause many important diseases in humans, and the threat to public health is increasing. People with a compromised immune system are especially susceptible to fungal infections. Drug resistance is starting to be a problem, since some Candida species are already resistant to established drugs such as azoles and triazoles [65]. We compared the biological processes from fungi that do not appear in mammals to the processes in mammals with the query

    GO WHERE TAX:4751 AND NOT TAX:40674 RESTRICT TT:biological_process SIM GO WHERE TAX:40674 RESTRICT TT:biological_process

The query took approximately two minutes. Table 5.5 summarizes the four processes from fungi with the highest sim_Lin that, although absent in mammals, have very similar processes in mammals. The fungal GO term is a child of the mammalian GO term in all these cases.

Table 5.5: The 4 biological processes from fungi that do not appear in mammals but have the highest similarity to mammalian processes according to sim_Lin.

    Fungal term (GO identifier, Name)                    sim_Lin   sim_Resnik   Human term (GO id, Name)
    GO:  biotin biosynthesis                                                    GO:  biotin metabolism
    GO:  thiamin biosynthesis                                                   GO:  thiamin metabolism
    GO:  peroxisome matrix protein import, docking                              GO:  peroxisome matrix protein import
    GO:  cell wall chitin biosynthesis                                          GO:  cell wall chitin metabolism

Biotin is an essential vitamin (vitamin H) that cannot be produced by humans and animals [66]. It is an important coenzyme and prosthetic group that is involved in many different reactions. Since mammals cannot produce biotin, the annotation of mammalian proteins with "biotin metabolism" is not detailed enough. There is a GO term for biotin catabolism but no term for processes that require biotin as a cofactor. This is probably a problem for annotating the mammalian proteins. The processes from fungi are related but not identical in this case. Thiamin is also an essential vitamin (vitamin B1). This suggests that the annotations of mammalian proteins are not detailed enough in this case either.

"Peroxisome matrix protein import, docking" is linked with a "part-of" edge to its parent term "peroxisome matrix protein import". This suggests that the fungal proteins and the mammalian proteins act in the same process. However, the annotation of the mammalian proteins is not detailed enough to determine the exact part of the process they take part in.

"Cell wall chitin biosynthesis" is a process that does not take place in mammals. Chitin is an important component of fungal cell walls, and this process is therefore essential for fungi. There are two mammalian proteins that are annotated with "cell wall chitin metabolism"; both are acidic mammalian chitinase precursors, Q9BZP6 from human and Q91XA9 from mouse. The mammalian proteins degrade chitin and are possibly involved in the defense against pathogens. The mammalian proteins could probably also be annotated with "cell wall chitin catabolism" (GO: ), which would make their real function clearer.

The five processes that have the lowest similarity to mammalian processes according to sim_Lin are summarized in Table 5.6. Resnik's and Lin's measures often return different processes as the most similar process. However, all of these are very dissimilar and are not related to the fungal processes. "Chitin localization" is a vital process for fungi, since the chitin has to be transported to the place of cell wall assembly. The cell division control protein 24 (UniProt acc: P11433) has been shown to be vital for baker's yeast (SGD accession: S ). Since this process does not occur in mammals, this protein is a putative drug target.
The other four processes do not appear to be essential for the viability of fungi. However, fruiting body formation is essential for forming spores, and spores are the most important form of asexual reproduction for fungi. This analysis revealed some processes that are important for fungi and cannot be found in mammals. This knowledge can lead to new strategies and therapies for curing infections with fungi. The proteins that are involved in these processes are possible targets for new drugs. 5.4 Finding functionally equivalent proteins The analysis of two organisms for finding functionally equivalent proteins is limited to their GO coverage. As of May 2005, Saccharomyces cerevisiae is the organism with the highest GO coverage in UniProt of about 70 % of its proteins Saccharomyces cerevisiae We performed a comparison of the proteomes of Saccharomyces cerevisiae (baker s yeast) and human in order to assign functionally equivalent and functionally related proteins to all baker s yeast proteins. The query for calculating this comparison took approximately eleven hours on a computer cluster using ten threads. Figure 5.3 shows the histogram of the score distribution

using funeqscore_Lin,avg and funeqscore_Lin,max. Table 5.7 summarizes the numbers for both scores.

Table 5.6: The five biological processes from fungi that do not appear in mammals and have the lowest similarity to mammalian processes (calculated with sim_Lin).

GO identifier  Name                      sim_Lin  GO identifier  Name
GO:            plasmid partitioning               GO:            two-component signal transduction system (phosphorelay)
GO:            chitin localization                GO:            phosphoenolpyruvate-dependent sugar phosphotransferase system
GO:            lactate metabolism                 GO:            methionine biosynthesis
GO:            fruiting body formation            GO:            eye morphogenesis (sensu Endopterygota)
GO:            propionate metabolism              GO:            methionine biosynthesis

Figure 5.3: Distribution of the funeqscore_Lin,avg and funeqscore_Lin,max for the comparison of S. cerevisiae with human.

Table 5.7 shows that there are many yeast proteins that lack proper GO annotation and therefore cannot be assigned functional equivalents. The maximum score tends to be higher than the average score. Using the average scoring, more proteins get low and medium scores. One

can also see that there are many proteins that have a functional equivalent with a score of 1.0, which means that they have exactly the same function, but do not belong to the same families. These proteins are functionally equivalent but did not evolve from a common ancestor. This relationship would be missed by homology-based methods.

Table 5.7: Number of S. cerevisiae proteins with functional equivalents in the corresponding data set.

Data set                                                          No. of sequences       No. of sequences
                                                                  funeqscore_Lin,avg     funeqscore_Lin,max
Proteins without GO annotation (No GO)
Proteins without Biological Process annotation (No BP)
Proteins without Molecular Function annotation (No MF)
Proteins with score in [0.0, 0.2[ (S0.0)                          0                      0
Proteins with score in [0.2, 0.4[ (S0.2)
Proteins with score in [0.4, 0.6[ (S0.4)
Proteins with score in [0.6, 0.8[ (S0.6)
Proteins with score in [0.8, 1.0[ (S0.8)
Proteins with score = 1.0 (S1.0)
Proteins with score = 1.0 and no matching Pfam (No Pfam)

We analyze some examples from different ranges of the funeqscore, going from high to low scores. The results illustrate that the score separates functionally equivalent proteins from proteins with similar functions. The examples also show that it is possible to identify proteins that have no functionally equivalent protein in the other proteome.

Ribonucleoside-diphosphate reductase

The ribonucleoside-diphosphate reductase in yeast contains two different subunits, a large chain and a small chain. In yeast there are two types of small chains (UniProt acc: P09938 and P49723) and two types of large chains (UniProt acc: P21524 and P21672). One of the large chains (P21524) is essential for mitotic viability, whereas the other large chain is DNA damage-induced. The small chains can function in the dimer with both large chains. All four UniProt entries are annotated to the same GO terms. The human ribonucleoside-diphosphate reductase also consists of a large and a small subunit.
The human UniProt entries are annotated with the same GO terms as the yeast entries. They are the functionally equivalent proteins to the yeast proteins. All four yeast proteins return the two human proteins as their only functional equivalents with a funeqscore_Lin,avg of 1.0. The funeqscore clearly separates the two hits from all other proteins. The large subunits from yeast and from human show more than 60 % sequence identity. The small subunits also show high sequence similarity.

Ubiquinol-cytochrome-c reductase complex core protein I

The ubiquinol-cytochrome-c reductase complex in yeast consists of ten subunits. The core protein I (UniProt acc: P07256) of this complex has two hits with funeqscore_Lin,avg = 0.899, the human proteins P07919 and P . Both human proteins with this score are also members

of this complex. P07919 shows no significant sequence similarity and does not share Pfam families with P . P31930 has a sequence identity of approximately 30 % and shares both Pfam families with P . The hits on ranks three to nine are also members of the human complex but have slightly different GO annotations and therefore a smaller score.

Pre-mRNA splicing factor RNA helicase

The pre-mRNA splicing factor RNA helicase (UniProt acc: P23394) is probably involved in unwinding the U4/U6 base-pairing interaction in the U4/U6/U5 snRNP, thus facilitating the first covalent step in RNA splicing. The best human hit is the U5 small nuclear ribonucleoprotein 200 kDa helicase (UniProt acc: O75643), which is also putatively involved in the second step of RNA splicing. Both have the molecular function of an ATP-dependent RNA helicase and both are involved in RNA splicing, but the yeast protein is involved in the assembly of the U2-type pre-catalytic spliceosome. This is translated into the following scores: funeqscore_Lin,avg = and funeqscore_Lin,max = . The sequences have no significant sequence similarity.

Hypothetical oxidoreductase

The hypothetical oxidoreductase in NUP120-CSE4 (UniProt acc: P35731) is involved in aerobic respiration and fatty acid metabolism. It shows 3-oxoacyl-[acyl-carrier protein] reductase activity and has no real functional equivalent in human. The two functionally most similar proteins in human have a funeqscore_Lin,avg of 0.474 and are clearly not functionally related.

Mycobacterium paratuberculosis

We compared the proteins of Mycobacterium paratuberculosis with the human proteins to assign functional equivalents to the proteins from Mycobacterium. Figure 5.4 shows the histogram of the score distribution and Table 5.8 summarizes the numbers.

Table 5.8: Number of M. paratuberculosis proteins with functional equivalents in the corresponding data set.

Data set                                                          No. of sequences       No. of sequences
                                                                  funeqscore_Lin,avg     funeqscore_Lin,max
Proteins without GO annotation (No GO)
Proteins without Biological Process annotation (No BP)
Proteins without Molecular Function annotation (No MF)
Proteins with score in [0.0, 0.2[ (S0.0)                          0                      0
Proteins with score in [0.2, 0.4[ (S0.2)                          9                      0
Proteins with score in [0.4, 0.6[ (S0.4)
Proteins with score in [0.6, 0.8[ (S0.6)
Proteins with score in [0.8, 1.0[ (S0.8)
Proteins with score = 1.0 (S1.0)
Proteins with score = 1.0 and no matching Pfam (No Pfam)
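The difference between the avg and max columns in these tables can be illustrated with a small sketch. This section does not restate how the two variants are computed, so the sketch rests on an assumption: each protein's GO terms are matched to the best-scoring term of the other protein, the best matches are averaged per direction, and the two directional values are then combined either by their average or by their maximum. The function names and the term similarities are hypothetical.

```python
def best_match_average(sim, rows, cols):
    """Average, over the terms in rows, of the best similarity to any term in cols."""
    return sum(max(sim[r][c] for c in cols) for r in rows) / len(rows)


def combined_scores(sim, terms_a, terms_b):
    """Return the (avg, max) combination of the two directional best-match averages.

    Assumption: this mirrors the avg and max score variants only in spirit.
    """
    a_to_b = best_match_average(sim, terms_a, terms_b)
    # Transpose the similarity table to score the opposite direction.
    sim_t = {b: {a: sim[a][b] for a in terms_a} for b in terms_b}
    b_to_a = best_match_average(sim_t, terms_b, terms_a)
    return (a_to_b + b_to_a) / 2, max(a_to_b, b_to_a)


# Hypothetical example: protein A carries one term, protein B carries two.
# A's term is found in B, but B has an additional, unrelated term.
sim = {"t1": {"u1": 1.0, "u2": 0.2}}
avg_score, max_score = combined_scores(sim, ["t1"], ["u1", "u2"])
```

Under this assumption the maximum variant reaches 1.0 as soon as all terms of one protein are found in the other, while the average variant is lowered by the extra annotation of the second protein.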


Figure 5.4: Distribution of the funeqscore_Lin,avg and funeqscore_Lin,max for the comparison of M. paratuberculosis with human.

Approximately one half of the proteins from M. paratuberculosis lack a GO annotation, and many proteins have a functional equivalent with a score of 1.0. The maximum score is clearly biased towards perfect matches. We show an example of an almost perfect match and two M. paratuberculosis proteins that have no functionally related human proteins.

The sigma translation initiation factor

The sigma translation initiation factor (SIGF) (UniProt acc: Q73ZX7) from M. paratuberculosis is involved in the recognition of the promoter elements of a gene. SIGF matches three Pfam families, Sigma-70 region 2 (Pfam acc: PF04542), Sigma-70 region 3 (Pfam acc: PF04539) and Sigma-70 region 4 (Pfam acc: PF04545). Region 2 is the most conserved part of the protein and contains the -10 promoter recognition helix and the primary core RNA polymerase binding determinant. Region 3 forms a discrete domain consisting of three helices within the sigma factor and is also involved in binding the core RNA polymerase. Region 4 contains a helix-turn-helix motif and is involved in binding the -35 promoter region of the DNA. There is no human protein that exactly matches the GO annotations of SIGF. However, there are two proteins with funeqscore_Lin,avg above 0.9, TAF4 (UniProt acc: Q5TBP5) with and Hypothetical protein DKFZp781E21155 (UniProt acc: Q6AI29) with . Both human proteins are involved in transcription initiation and function as transcription initiation factors. The funeqscore_Lin,max for both proteins is 1.0 because all GO annotations from SIGF can also be found in the human proteins. The human proteins share no Pfam families with SIGF and

have no significant sequence similarity to SIGF. SIGF is important for the transcription of DNA into mRNA. Its three domains do not occur in human and there is no significant sequence similarity to the two human proteins. These facts suggest that it is a putative target for new anti-infective drugs.

The BioA protein

BioA (UniProt acc: Q740R8) from M. paratuberculosis belongs to the class-III pyridoxal-phosphate-dependent aminotransferase family (Pfam acc: PF00202). BioA catalyzes an intermediate step in the synthesis of biotin. There is no functionally equivalent protein in human. The highest scoring protein with funeqscore_Lin,avg = is the methylcrotonoyl-CoA carboxylase alpha chain (UniProt acc: Q96RQ3), which uses biotin as cofactor.

The MurB protein

The UDP-N-acetylmuramate dehydrogenase MurB (UniProt acc: Q73SU8) is involved in the synthesis of bacterial peptidoglycan. Peptidoglycan is an important constituent of bacterial cell walls. MurB belongs to the MurB_C family (Pfam acc: PF02873) that is found only in bacteria. The protein shows no sequence similarity to any human protein, and the functionally most similar protein, a UDP-glucose dehydrogenase (UniProt acc: Q9NY20), has a funeqscore_Lin,avg of . MurB could be a good target for new anti-infectives because it has a very important function and is not similar to any human protein.

Relevance score

Table 5.9 and Figure 5.5 summarize the results of the comparison of human with M. paratuberculosis using sim_Relevance. Using sim_Relevance leads to lower scores in the comparison than the other similarity measures. There is no term with probability zero and hence there is no protein that has a perfect match with score 1.0. Using the maximum score still leads to higher scores than the average score.

The hypothetical protein MAP1681C (UniProt acc: Q73ZC2) from M. paratuberculosis is annotated with the molecular function "catalytic activity" (GO: ) and the biological process "metabolism" (GO: ). These are very general annotations that do not allow a conclusion to be drawn on the exact role of this protein. However, there are many human proteins that have the same annotation and therefore get a funeqscore_Lin,avg or funeqscore_Lin,max of 1.0; one example is LOC (UniProt acc: Q6P2C7). Although their annotations match perfectly, it is not possible to say whether these proteins are really functionally equivalent. Using funeqscore_Relevance,avg or funeqscore_Relevance,max they have a score of . This low score reflects the uncertainty in the annotations of the proteins. This example shows that taking the relevance into account with sim_Rel or sim_Relevance helps to distinguish between meaningful Funeqs and insignificant assignments.
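The effect described in this example can be reproduced with a minimal sketch. It assumes that the relevance measure weights Lin's similarity by one minus the probability of the common ancestor term; this form is an assumption here, since the definition is not restated in this section. The function names and the term probabilities are hypothetical, and all probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def sim_lin(p_c1, p_c2, p_ancestor):
    """Lin's similarity from the probabilities of two terms and their common ancestor."""
    return 2 * math.log(p_ancestor) / (math.log(p_c1) + math.log(p_c2))


def sim_relevance(p_c1, p_c2, p_ancestor):
    """Lin's similarity weighted by (1 - p) of the common ancestor (assumed form)."""
    return sim_lin(p_c1, p_c2, p_ancestor) * (1 - p_ancestor)


# Two proteins annotated with the same term: the term is its own common ancestor.
p_generic = 0.5     # hypothetical probability of a very general term
p_specific = 0.001  # hypothetical probability of a specific term

lin_generic = sim_lin(p_generic, p_generic, p_generic)
rel_generic = sim_relevance(p_generic, p_generic, p_generic)
rel_specific = sim_relevance(p_specific, p_specific, p_specific)
```

With identical annotations Lin's measure returns 1.0 regardless of how general the term is, whereas the weighted measure stays well below 1.0 for the generic term and close to 1.0 for the specific one.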


Table 5.9: Number of M. paratuberculosis proteins with functional equivalents in the corresponding data set computed with sim_Relevance.

Data set                                                          No. of sequences             No. of sequences
                                                                  funeqscore_Relevance,avg     funeqscore_Relevance,max
Proteins without GO annotation (No GO)
Proteins without Biological Process annotation (No BP)
Proteins without Molecular Function annotation (No MF)
Proteins with score in [0.0, 0.2[ (S0.0)                          3                            0
Proteins with score in [0.2, 0.4[ (S0.2)
Proteins with score in [0.4, 0.6[ (S0.4)
Proteins with score in [0.6, 0.8[ (S0.6)
Proteins with score in [0.8, 1.0[ (S0.8)
Proteins with score = 1.0 (S1.0)                                  0                            0
Proteins with score = 1.0 and no matching Pfam (No Pfam)          0                            0

Figure 5.5: Distribution of the funeqscore_Relevance,avg and funeqscore_Relevance,max for the comparison of M. paratuberculosis with human.
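The score classes used in the tables of this section (half-open intervals of width 0.2 plus a separate class for a score of exactly 1.0) can be computed as sketched below. The function name and the example scores are illustrative assumptions; proteins without GO annotation would have to be counted separately, which the sketch does not cover.

```python
from collections import Counter

# Lower bounds of the half-open intervals [0.8, 1.0[, [0.6, 0.8[, ..., [0.0, 0.2[.
LOWER_BOUNDS = (0.8, 0.6, 0.4, 0.2, 0.0)


def score_class(score):
    """Map a score in [0, 1] to the class label used in the tables (S0.0 ... S1.0)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score == 1.0:
        return "S1.0"  # perfect matches form their own class
    for lower in LOWER_BOUNDS:
        if score >= lower:
            return f"S{lower:.1f}"


# Illustrative scores; 0.899 and 0.474 are values quoted in the text.
example_scores = [1.0, 0.899, 0.474, 0.2]
histogram = Counter(score_class(s) for s in example_scores)
```

Counting the labels with a Counter gives the per-class numbers that such a table or histogram reports.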

Chapter 6 Conclusions

We presented an approach for the integration of different biological data sources into one unified resource. Our method makes it possible to select sets of proteins, families or taxa according to different criteria. The results provide a cross-taxonomic view of the biological processes and the underlying molecular biology. The comparison of molecular functions and biological processes in diverse organisms shows differences and common themes along the taxonomic tree. Processes that are found in pathogens but not in human present possible targets for future drug development.

It is possible to select two sets of GO terms and to perform a pairwise comparison of the terms in these sets to assess their semantic similarity. The funeqscore is a similarity measure that assesses the functional relatedness of two gene products by comparing their GO annotations. It lies in the range from 0 to 1. It allows finding functionally equivalent or related gene products in two organisms and detecting gene products that have no functionally related counterparts in other related organisms. The similarity measures sim_Rel and sim_Relevance for GO terms additionally allow the significance of the similarity to be assessed. They enable a discrimination between gene products annotated with generic terms and gene products annotated with specific terms.

Comparing GO annotations is problematic for several reasons. First of all, human curators have different criteria for making these annotations. Missing GO terms also present a problem. Since the exact terms are missing in such cases, curators need to annotate the gene products with more generic terms until the correct term is available. Automatic annotations differ in the level of detail depending on the method used and the primary information exploited. These problems have an impact on the comparison of sets of GO terms and the assignment of functional equivalents.

The distinction between "is-a" and "part-of" links in GO needs to be addressed by a future extension of the comparison of GO terms. The coverage of gene products with GO annotations presents a problem for the functional equivalent comparison. Since many proteins are still unannotated, it is possible that some functionally equivalent proteins have not been detected yet.

Integrating more data into the database will improve the usefulness of our approach. An important extension might be the inclusion of RNA, since non-coding RNAs have important functions in all cells. There are some attempts to divide RNAs into families as is done with proteins. Rfam, in analogy to Pfam, is a database that collects multiple sequence alignments

and covariance models that describe RNA families [67]. If more non-coding RNAs are annotated, this data can be included in the analysis and can give a more complete result from the comparison of organisms or groups of species. It is also possible to include the RNAs in the computation of functionally equivalent gene products since GO is also suited to annotate RNAs. Gene knockout experiments yield valuable data on the dependence of different organisms on their gene products. This information can be added to validate possible drug targets.

There is no cut-off for the funeqscore that provides a separation between functionally equivalent and functionally related gene products. Such a cut-off is desirable for the interpretation of the results. It is possible to store the results of functional equivalent comparisons in the database for later retrieval. This allows building a database of functionally equivalent gene products.

The funeqscore can be used to cluster gene products according to their GO annotations into functional groups. Since clustering algorithms are usually based on distance measures, it is necessary to define a distance measure based on the funeqscore. A straightforward and simple choice is (1 - funeqscore). A possible application is to cluster functionally equivalent proteins into functional groups. A combined approach with functional and sequence similarity could then be used to find orthologous genes and to distinguish them from non-homologous functionally related genes. Such a clustering method can also be used to group genes on a microarray.

As GO evolves, it can be expected that the annotations with GO will become more accurate and that the coverage of these annotations will increase. This will lead to an improvement in the identification of functionally related proteins based on the GO annotations. We expect that the impact of our approach on concrete applications, like drug development, will increase with the availability and quality of the functional annotations of proteins.
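The clustering idea mentioned above can be sketched as follows, using the distance 1 - funeqscore with a simple single-linkage procedure. The choice of single linkage, the merge threshold, the protein labels and the scores are all assumptions made for illustration; the text leaves the clustering algorithm open and notes that no established cut-off exists.

```python
from itertools import combinations


def distance(funeqscore):
    """Distance derived from a similarity score in [0, 1]."""
    return 1.0 - funeqscore


def single_linkage(items, scores, max_distance):
    """Merge clusters while the closest pair of clusters is within max_distance.

    scores maps frozenset({a, b}) to the similarity score of a and b.
    """
    clusters = [{item} for item in items]

    def cluster_distance(a, b):
        return min(distance(scores[frozenset((x, y))]) for x in a for y in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance (i < j).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        if cluster_distance(clusters[i], clusters[j]) > max_distance:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Hypothetical pairwise scores for three gene products.
scores = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "C")): 0.30,
    frozenset(("B", "C")): 0.25,
}
groups = single_linkage(["A", "B", "C"], scores, max_distance=0.2)
```

With these example values, A and B end up in one functional group and C remains on its own; a different threshold or linkage rule would change the grouping.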

73 Bibliography [1] Bernal A., Ear U., and Kyrpides N. Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res, 29(1):126 7, Jan [2] Stein L. Genome annotation: from sequence to biology. Nat Rev Genet, 2(7): , Jul [3] Fitch W.M. Homology a personal view on some of the problems. Trends Genet, 16(5):227 31, May [4] Storm C.E.V. and Sonnhammer E.L.L. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics, 18(1):92 9, Jan [5] Boeckmann B., Bairoch A., Apweiler R., Blatter M.C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O Donovan C., Phan I., Pilbout S., and Schneider M. The SWISS- PROT protein knowledgebase and its supplement TrEMBL in Nucleic Acids Res, 31(1):365 70, Jan [6] Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L., and Sonnhammer E.L. The Pfam protein families database. Nucleic Acids Res, 28(1):263 6, Jan [7] Schultz J., Milpetz F., Bork P., and Ponting C.P. SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A, 95(11): , May [8] Koonin E.V. and Galperin M.Y. Sequence-Evolution-Function: Computational Approaches in Comparative Genomics. Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts USA, [9] Camon E., Magrane M., Barrell D., Lee V., Dimmer E, Maslen J., Binns D., Harte N., Lopez R., and Apweiler R. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res, 32(Database issue):d262 6, Jan [10] Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O Donovan C., Redaschi N., and Yeh L.L. The Universal Protein Resource (UniProt). Nucleic Acids Res, 33(Database issue):d154 9, Jan

74 62 BIBLIOGRAPHY [11] Ashburner M., Ball C.A, Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., Harris M.A., Hill D.P., Issel-Tarver L., Kasarskis A., Lewis S., Matese J.C., Richardson J.E., Ringwald M., Rubin G.M., and Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25 9, May [12] Brown T.A. Genomes. BIOS Scientific Publishers Ltd, [13] Strachan T. and Read A.P. Human Molecular Genetics 2. John Wiley & Sons, Inc., [14] Zdobnov E.M., Lopez R., Apweiler R., and Etzold T. The EBI SRS server recent developments. Bioinformatics, 18(2):368 73, Feb [15] Schuler G.D., Epstein J.A., Ohkawa H., and Kans J.A. Entrez: molecular biology database and retrieval system. Methods Enzymol, 266:141 62, [16] Bateman A., Coin L., Durbin R., Finn R.D., Hollich V., Griffiths-Jones S., Khanna A., Marshall M., Moxon S., Sonnhammer E.L.L., Studholme D.J., Yeats C., and Eddy S.R. The Pfam protein families database. Nucleic Acids Res, 32(Database issue):d138 41, Jan [17] Cao S.L., Qin L., He W.Z., Zhong Y., Zhu Y.Y., and Li Y.X. Semantic search among heterogeneous biological databases based on gene ontology. Acta Biochim Biophys Sin (Shanghai), 36(5):365 70, May [18] Kanehisa M. and Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 28(1):27 30, Jan [19] Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., and Wheeler D.L. GenBank. Nucleic Acids Res, 31(1):23 7, Jan [20] Mulder N.J., Apweiler R., Attwood T.K., Bairoch A., Bateman A., Binns D., Bradley P., Bork P., Bucher P., Cerutti L, Copley R., Courcelle E., Das U., Durbin R., Fleischmann W., Gough J., Haft D., Harte N., Hulo N., Kahn D., Kanapin A., Krestyaninova M., Lonsdale D., Lopez R., Letunic I., Madera M., Maslen J., McDowall J., Mitchell A., Nikolskaya A.N., Orchard S., Pagni M., Ponting C.P., Quevillon E., Selengut J., Sigrist C.J.A., Silventoinen V., Studholme D.J., Vaughan R., and Wu C.H. 
InterPro, progress and status in Nucleic Acids Res, 33(Database issue):d201 5, Jan [21] Bairoch A. The ENZYME database in Nucleic Acids Res, 28(1):304 5, Jan [22] Lord P.W., Stevens R.D., Brass A., and Goble C.A. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 19(10): , Jul [23] Shaban-Nejad A., Baker C. J. O., Butler G., and Haarslev V. The fungalweb ontology the core of a semantic web application for fungal genomics. In 1st Canadian Semantic Web Interest Group Meeting (SWIG 04) Montreal, Quebec, Canada, 2004.

75 BIBLIOGRAPHY 63 [24] Baker C.J.O., Witte R., and Haarslev V. Shaban-Nejad A., Butler G. The FungalWeb Ontology: Application Scenarios, available at: [25] Ontology, an introduction, available at [26] Stevens R., Goble C.A., and Bechhofer S. Ontology-based knowledge representation for bioinformatics. Brief Bioinform, 1(4): , Nov [27] University of Illinois at Urbana-Champaign Digital Libraries Initiative. Ontology definition, available at [28] Christiane Fellbaum, editor. WordNet, An Electronic Lexical Database. MIT Press, May [29] Miller G.A. Wordnet: a lexical database for english. Communications of the ACM 38, 11:39 41, [30] Schulze-Kremer S. Ontologies for molecular biology and bioinformatics. In Silico Biol, 2(3):179 93, [31] Chen R.O., Felciano R., and Altman R.B. RIBOWEB: linking structural computations to a knowledge base of published experimental data. Proc Int Conf Intell Syst Mol Biol, 5:84 7, [32] Schulze-Kremer S. Ontologies for molecular biology. Pac Symp Biocomput, pages , [33] TAMBIS. [34] FlyBase Consortium. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res, 30(1):106 8, Jan [35] Blake J.A., Eppig J.T., Richardson J.E., and Davisson M.T. The Mouse Genome Database (MGD): expanding genetic and genomic resources for the laboratory mouse. The Mouse Genome Database Group. Nucleic Acids Res, 28(1):108 11, Jan [36] Ball C.A., Dolinski K., Dwight S.S., Harris M.A., Issel-Tarver L., Kasarskis A., Scafe C.R., Sherlock G., Binkley G., Jin H., Kaloper M., Orr S.D., Schroeder M., Weng S., Zhu Y., Botstein D., and Cherry J.M. Integrating functional genomic information into the Saccharomyces genome database. Nucleic Acids Res, 28(1):77 80, Jan [37] Camon E.B., Barrell D.G., Dimmer E.C., Lee V., Magrane M., Maslen J., Binns D., and Apweiler R. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics, 6 Suppl 1:S17, [38] Gene Ontology Consortium. 
Creating the gene ontology resource: design and implementation. Genome Res, 11(8): , Aug 2001.

76 64 BIBLIOGRAPHY [39] Resnik P. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res, 11:95 130, Jul [40] Lin D. An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98), [41] Speer N., Spieth C., and Zell A. A memetic clustering algorithm for the functional partition of genes based on the gene ontology. In Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2004), [42] Kanz C., Aldebert P., Althorpe N., Baker W., Baldwin A., Bates K., Browne P., van den Broek A., Castro M., Cochrane G., Duggan K., Eberhardt R., Faruque N., Gamble J., Diez F.G., Harte N., Kulikova T., Lin Q., Lombard V., Lopez R., Mancuso R., McHale M., Nardone F., Silventoinen V., Sobhany S., Stoehr P., Tuli M.A., Tzouvara K., Vaughan R., Wu D., Zhu W., and Apweiler R. The EMBL Nucleotide Sequence Database. Nucleic Acids Res, 33(Database issue):d29 33, Jan [43] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Res, 28(1): , Jan [44] Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy S.R., Griffiths-Jones S., Howe K.L., Marshall M., and Sonnhammer E.L.L. The Pfam protein families database. Nucleic Acids Res, 30(1):276 80, Jan [45] Bru C., Courcelle E., Carrère S., Beausse Y., Dalmar S., and Kahn D. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res, 33(Database issue):d212 5, Jan [46] Letunic I., Copley R.R., Schmidt S., Ciccarelli F.D., Doerks T., Schultz J., Ponting C.P., and Bork P. SMART 4.0: towards genomic data integration. Nucleic Acids Res, 32(Database issue):d142 4, Jan [47] Thompson J.D., Higgins D.G., and Gibson T.J. 
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22): , Nov [48] Wheeler D.L., Chappey C., Lash A.E., Leipe D.D., Madden T.L., Schuler G.D., Tatusova T.A., and Rapp B.A. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 28(1):10 4, Jan [49] Camon E., Magrane M., Barrell D., Binns D., Fleischmann W., Kersey P., Mulder N., Oinn T., Maslen J., Cox A., and Apweiler R. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res, 13(4):662 72, Apr 2003.

77 BIBLIOGRAPHY 65 [50] Enzyme Nomenclature. Academic Press, San Diego, CA, USA, [51] Apweiler R., Bairoch A., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O Donovan C., Redaschi N., and Yeh L.S.L. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res, 32(Database issue):d115 9, Jan [52] Barker W.C., Garavelli J.S., Huang H., McGarvey P.B., Orcutt B.C., Srinivasarao G.Y., Xiao C., Yeh L.S., Ledley R.S., Janda J.F., Pfeiffer F., Mewes H.W., Tsugita A., and Wu C. The protein information resource (PIR). Nucleic Acids Res, 28(1):41 4, Jan [53] Hubbard T., Andrews D., Caccamo M., Cameron G., Chen Y., Clamp M., Clarke L., Coates G., Cox T., Cunningham F., Curwen V., Cutts T., Down T., Durbin R., Fernandez- Suarez X.M., Gilbert J., Hammond M., Herrero J., Hotz H., Howe K., Iyer V., Jekosch K., Kahari A., Kasprzyk A., Keefe D., Keenan S., Kokocinsci F., London D., Longden I., McVicker G., Melsopp C., Meidl P., Potter S., Proctor G., Rae M., Rios D., Schuster M., Searle S., Severin J., Slater G., Smedley D., Smith J., Spooner W., Stabenau A., Stalker J., Storey R., Trevanion S., Ureta-Vidal A., Vogel J., White S., Woodwark C., and Birney E. Ensembl Nucleic Acids Res, 33(Database issue):d447 53, Jan [54] Fleischmann W., Möller S., Gateau A., and Apweiler R. A novel method for automatic functional annotation of proteins. Bioinformatics, 15(3):228 33, Mar [55] Kretschmann E., Fleischmann W., and Apweiler R. Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics, 17(10):920 6, Oct [56] Wieser D., Kretschmann E., and Apweiler R. Filtering erroneous protein annotation. Bioinformatics, 20 Suppl 1:I342 I347, Aug [57] World Health Organisation. Fact sheet no. 94. malaria. available at [58] Virginia Commonwealth University. Center for the study of biological complexity. 
available at [59] Madigan M.T., Martinko J.M., and Parker J., editors. Brock Biology of Microorganisms. Prentice Hall, Upper Saddle River, NJ 07458, 9th edition edition, [60] Qi Y., Pei J., and Grishin N.V. C-terminal domain of gyrase A is predicted to have a beta-propeller structure. Proteins, 47(3): , May [61] Information on ec , available at [62] Teplyakov A., Obmolova G., Khil P.P., Howard A.J., Camerini-Otero R.D., and Gilliland G.L. Crystal structure of the Escherichia coli YcdX protein reveals a trinuclear zinc active site. Proteins, 51(2): , May 2003.

78 66 BIBLIOGRAPHY [63] Schlicker A., Domingues F., Kaemper A., Sommer I., and Lengauer T. AutoDart - Automated Differential Analysis of Proteomes, [64] Todar K. Todar s online textbook of bacteriology available at [65] National Institute of Allergy and Division of Microbiology & Infectious Diseases Infectious Diseases. Fungal infections available at [66] Biochemicalien Lexikon. Biotin (vitamin h) available at [67] Griffiths-Jones S., Moxon S., Marshall M., Khanna A., Eddy S.R., and Bateman A. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33(Database issue):d121 4, Jan [68] Wang H., Azuaje F., Bodenreider O., and Dopazo J. Gene expression correlation and gene ontology-based similarity: An assessment of quantitative relationships. In Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2004), [69] Bodenreider O. and Azuaje F. Incorporating ontology-driven similarity knowledge into functional genomics: An exploratory study. In Proceedings of the 2004 IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE-2004), 2004.

Appendix A User Manual

This manual explains all options available in GOTaxExplorer and the update programs for the database. Documentation of the source code is available for developers as HTML and is bundled with the program.

A.1 System requirements

GOTaxExplorer requires Java 1.5 or later. It requires a database that is accessible either locally or over a network link. GOTaxExplorer is configured to use a MySQL database. If another database is used, it will very likely be necessary to rewrite some SQL commands used by GOTaxExplorer. The database access information and driver can be changed in the configuration file.

GOTaxExplorer should run reasonably fast on a computer with a CPU with more than 2 GHz. The amount of main memory required depends on the size of the data sets returned by the queries. Performance is largely limited by the database server since database queries are the most common and most costly operations. Because the comparison of sets of GO terms and the calculation of functional equivalents are multi-threaded, they can benefit from a client computer with more than one CPU. However, the database will be the limiting factor in such a setting. Using more than one thread on a single CPU increases execution speed because it helps to operate the database at full capacity. The database load generated by GOTaxExplorer can be adjusted with the number of threads used for a GO set comparison or a functional equivalence calculation. Since all queries are cached during a functional equivalence calculation, the database is relieved by the client. However, the caching uses large amounts of memory, and at least 1024 MB of main memory or virtual memory should be available.

In order to use the 3D visualization of graphs, the Java3D libraries and the WilmaScope libraries have to be installed. GOTaxExplorer needs up to 1024 MB of memory for this visualization. Furthermore, it needs a high-performance graphics card with OpenGL rendering capabilities.

The command line interface is faster than the GUI. It also requires 512 MB of memory for the comparison of sets of GO terms or a functional equivalence calculation.

80 68 APPENDIX A. USER MANUAL A.2 GOTaxExplorer GOTaxExplorer is available with a command line interface and a graphical user interface. Both interfaces can be used to issue queries using the query language or SQL and to see the results in a table. A configuration file is used to provide details on the database access and to change program behavior. A.2.1 Configuration file GOTaxExplorer uses a configuration file to configure parameters for the database access and for important files. The configuration file is given with the command line option "-c". The following options are available: url The database uniform resource locator (url). user The username for the database. password The password for the database access. driver Determines the database driver class, this class has to be in the classpath, default: com.mysql.jdbc.driver. default-browser Command to open the browser that should be used to access online databases on Unix systems. On other systems, the system default browser is used. Either the absolute path is required or the browser has to be in the user s path, default: firefox. term-matcher Determines the number of threads to be used for a comparison of sets of GO terms, default: 1. funeqs The number of functionally most similar gene products that are displayed for each entity. If a single gene product is compared to a set of gene products, all members of the set are shown with the scores. If two sets are compared this number defaults to 10. protein-matcher The number of threads to be used for a functional equivalence comparison. bp-graph The file with the WilmaScope graph for biological process, the absolute path is required, default:./bp.xwg.

mf-graph
    The file with the WilmaScope graph for molecular function. The absolute path is required. Default: ./mf.xwg.
cc-graph
    The file with the WilmaScope graph for cellular component. The absolute path is required. Default: ./cc.xwg.
tax-graph
    The file with the WilmaScope graph for the taxonomy. The absolute path is required. Default: ./tax.xwg.
bp-tree
    The file with the tree representation for biological process. The absolute path is required. Default: ./bp.root.
mf-tree
    The file with the tree representation for molecular function. The absolute path is required. Default: ./mf.root.
cc-tree
    The file with the tree representation for cellular component. The absolute path is required. Default: ./cc.root.
tax-tree
    The file with the tree representation for the taxonomy. The absolute path is required. Default: ./tax.root.

The default-browser option and all parameters for graphs and trees are not needed by the command line interface version of GOTaxExplorer and are therefore skipped. All options can also be given on the command line as parameters. The command line parameters override the options in the configuration file. The parameter name is composed of and the key name. Exceptions are: url is replaced by "-u", user is replaced by "-l", password is replaced by "-p".

Syntax of the configuration file

A sample configuration file is given below. The parameters are written as key-value pairs separated by "=". Whitespace around the "=" is ignored. Comments can be included by adding a "#" at the beginning of a line.

    driver = com.mysql.jdbc.Driver
    #driver=com.p6spy.engine.spy.P6SpyDriver
    url = jdbc:mysql://server:3306/completed_genomes
    user=me
    password=passwd
    bp-tree=/path/to/bp.root
    protein-matcher=5
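The syntax rules above are simple enough to sketch in a few lines of Java. The following is only an illustration under stated assumptions, not the parser used by GOTaxExplorer: the class and method names are hypothetical, lines without "=" are silently skipped, and a "#" preceded by whitespace is also treated as a comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ConfigSketch {
    /** Parses "key = value" lines; lines starting with '#' are comments. */
    static Map<String, String> parse(String text) {
        Map<String, String> options = new LinkedHashMap<String, String>();
        for (String rawLine : text.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a key-value pair; ignored in this sketch
            }
            // Whitespace around the "=" is dropped by trimming both sides.
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            options.put(key, value);
        }
        return options;
    }

    /** Command line parameters take precedence over the file options. */
    static Map<String, String> withOverrides(Map<String, String> fromFile,
                                             Map<String, String> fromCommandLine) {
        Map<String, String> merged = new LinkedHashMap<String, String>(fromFile);
        merged.putAll(fromCommandLine);
        return merged;
    }

    public static void main(String[] args) {
        String sample = "driver = com.mysql.jdbc.Driver\n"
                + "#driver=com.p6spy.engine.spy.P6SpyDriver\n"
                + "url = jdbc:mysql://server:3306/completed_genomes\n"
                + "user=me\n"
                + "protein-matcher=5\n";
        System.out.println(parse(sample));
    }
}
```

Splitting at the first "=" only keeps values that themselves contain "=" intact, which can matter for database URLs with query parameters.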

A.2.2 Command Line Interface (CLI)

The command line interface can be used for scripting or when no graphical environment is available. It is possible to feed in a file with query commands and to redirect the output to another file. Furthermore, the CLI can be used in scripts so that other programs can start GOTaxExplorer with automatically generated queries and pipe the query results to post-processing programs. This helps to automate recurring processes and simplifies the handling of results. The CLI provides textual commands to select the different query options. The commands are:

QUIT Q EXIT
    Exit GOTaxExplorer.
HELP? <return>
    Print a help message with the available commands, details on the query language and example queries.
SQL 1
    Enter SQL mode, in which queries are entered in SQL.
SQL 0
    Leave SQL mode and enter query language mode (default).
GOPATHS 1
    Print the paths to the GO graph root for all matching GO terms.
GOPATHS 0
    Do not print the paths to the root (default).
GOFILTER 1
    Filter GO results so that only the lowest hits are shown (default).
GOFILTER 0
    Show all GO results.

Queries have to be terminated with a semicolon ";". This makes it possible to enter queries that span multiple lines. Commands can be typed in either upper or lower case but must not be terminated with a ";". This interface provides no input and output facilities. In order to search for a specific entity, it is necessary to enter the SQL mode and to use an SQL query. It is not possible to load a new configuration file. This file has to be loaded at program startup with the "-c" command line parameter.

A.2.3 Query language

The query language provides support for three basic query types.

Selection of sets
    <result> WHERE <condition> RESTRICT <limit>
Comparison of GO sets
    GO WHERE <condition> RESTRICT <limit> SIM GO WHERE <condition> RESTRICT <limit>
Computation of functional equivalents
    GENE WHERE <condition> FUNEQ GENE WHERE <condition>

The <result> can be composed of one or more of the following types: GO, PFAM, SMART, TAX, GENE. Different results have to be separated with a comma ",". The condition is composed of boolean operators and entity definitions. The operators "AND", "OR" and "NOT" are available. Round brackets "(...)" can be used to group parts of the condition. An entity is always defined with a keyword that indicates the data source, followed by a colon ":" and the accession number from the source database, e.g. "GO:GO: " or "PFAM:PF02811". The following keywords are possible:

GO
    for GO terms.
DGO
    for GO terms but use the smart2go or pfam2go mappings; this is only possible for SMART and PFAM results.
PFAM
    for PFAM families.
DPFAM
    for PFAM families but use the pfam2go mapping for GO results.
SMART
    for SMART domains.
DSMART
    for SMART domains but use the smart2go mapping for GO results.
TAX
    for a taxonomic group or species; each species in the group is resolved to a set of result entities and the final result is calculated as the union of these sets.
ATAX
    for a taxonomic group or species; each species in the group is resolved to a set of result entities and the final result is calculated as the intersection of these sets.

GENE
    for a protein.

Limits are only applicable to GO and taxonomy results. They are ignored in all other cases. Taxonomy queries can be limited to return only species by adding "SPECIES" to the <limit>. GO results can be limited to ontologies using "TT:ontology". Here, "ontology" can be "biological_process", "molecular_function" or "cellular_component". Adding "EXTT" to the <limit> excludes the specified ontologies instead. Similarly, it is possible to restrict the results to mappings with selected evidence codes. The codes are added with "EC:ec", where "ec" is a valid evidence code. The specified evidence codes can be excluded from the computation by adding "EXEC" to the <limit>.

Sample queries

A query that selects all human proteins is

    GENE WHERE TAX:9606

A more complex query that selects all species with proteins belonging to the PHP domain (Pfam accession: PF02811) and involved in "DNA repair" (GO: ) reads

    TAX WHERE PFAM:PF02811 AND GO:GO: RESTRICT SPECIES

The query

    GO WHERE TAX:4751 RESTRICT TT:biological_process SIM GO WHERE TAX:9606 RESTRICT TT:biological_process

compares all GO biological processes from fungi (Taxonomy accession: 4751) with the processes from human (Taxonomy accession: 9606). Modifying the query to compare only processes from fungi that are not in human leads to

    GO WHERE TAX:4751 AND NOT TAX:9606 RESTRICT TT:biological_process SIM GO WHERE TAX:9606 RESTRICT TT:biological_process

A functional equivalence comparison between Mycobacterium paratuberculosis and human can be performed with the following query

    GENE WHERE TAX:1770 FUNEQ GENE WHERE TAX:9606

A.2.4 Graphical User Interface (GUI)

The graphical user interface can be divided into three parts (see Figure 4.2 in Chapter 4). The status bar at the bottom of the window provides information on running tasks such as queries. The menu bar is used to show the frames in the workplace and to select query options. The workplace contains the different frames, which can be resized and minimized.


The menu bar

The menu bar provides options for displaying input and result views and for choosing query options. All options are accessible with keyboard shortcuts.

The File menu provides three options (see Figure A.1).

Figure A.1: The File menu.

Open file
    Opens a file dialog to open a new configuration file (shortcut: Ctrl+O).
Memory consumption
    Opens a dialog showing the amount of memory currently used by GOTaxExplorer (shortcut: Ctrl+E).
Close
    Exits GOTaxExplorer (shortcut: Alt+F4).

The Frames menu is used to show the different frames in the main window (see Figure A.2). Furthermore, it can be used to change the tree displayed in the query frame.

Figure A.2: The Frames menu.

Show input graph
    Displays one of the tree views in the query frame. The available views are: Biological Process (shortcut: Ctrl+B), Molecular Function (shortcut: Ctrl+M), Cellular Component (shortcut: Ctrl+C), and Taxonomy (shortcut: Ctrl+T).

Show query frame
    Makes the query frame visible (shortcut: Ctrl+Q).
Show result frame
    Makes the result frame visible (shortcut: Ctrl+R).
Show name search frame
    Opens the frame for searching the database with the name of an entity (shortcut: Ctrl+N).
Show 2D trees frame
    Opens a new window with all 2D output views of the trees (shortcut: Ctrl+A).

The Query menu is used to change query options that influence the results shown (see Figure A.3).

Figure A.3: The Query menu.

SQL mode
    Toggles the SQL mode (shortcut: Ctrl+S). If SQL mode is selected, the query frame is hidden and the SQL query frame is made visible.
Filter GO results
    Toggles whether ancestors of hits should be filtered out (shortcut: Ctrl+F). If this option is selected (default), all nodes that have children in the result set are removed and only the lowest possible terms are displayed.
Show complete GO paths
    If this option is selected, GOTaxExplorer resolves all possible paths from the result terms to the root (shortcut: Ctrl+P). This can be a lengthy operation if the result set is large, since terms usually have more than one possible path to the root.

The Results menu is used to save results to a file and to delete results from the result frame (see Figure A.4). The 2D and 3D views of the result sets can be toggled from this menu.

Save result
    Selects one result and saves it to a text file. A file dialog is used to choose the file name. The sorting of the results in the table is preserved.


Figure A.4: The Results menu.

Delete result
    Deletes a result set from the result frame. It is also possible to delete all results (shortcut: Ctrl+D).
Show results in 3D graph
    Visualizes the GO or taxonomy results from the last query in a 3D graph. Available 3D views: Biological Process (shortcut: Ctrl+I), Molecular Function (shortcut: Ctrl+L), Cellular Component (shortcut: Ctrl+U) and taxonomy tree (shortcut: Ctrl+X).
Show results in 2D trees
    If this option is selected, the results from GO and taxonomy queries are automatically shown in the 2D trees.

The workplace

The workplace contains the query and the result frames. All frames can be hidden and made visible with options in the menu. All buttons and fields provide a tooltip that describes their function.

The query frame provides facilities to build a query and to submit it (see Figure A.5). The query can either be entered manually into the query field at the top or built with the buttons provided. The query field also serves as the query history and contains all former queries in a drop-down list. The query building facilities are divided into three parts. The results can be selected with the drop-down list and added to the query with the "Add result" button. The part for building the condition contains entity and operator selectors. An entity is specified with its database and its accession number. "DGO", "DPFAM" and "DSMART" specify that the direct mappings from the "pfam2go" or "smart2go" files are used to compute the results. The accession number can be entered in the text field next to the database drop-down list, and "Add entity" adds the entity to the query. "Add operator" adds the selected operator to the query. After either the "SIM" or the "FUNEQ" operator has been entered, the selection buttons can be used to build the second query. GO results can be limited to an ontology or to specific evidence codes. It is also possible to exclude the selected ontology or the evidence codes. To this end, it is necessary to press the corresponding "exclude" button and to add the limits to the query.

The tree view on the right can be used to browse an ontology or the taxonomy for an entity. The menu can be used to select the tree to be displayed. A tooltip with the accession number is displayed when the mouse hovers over a node. Right-clicking a node opens a pop-up menu that provides access to the online databases. The "search" button at the bottom submits the query, and the "clear" button deletes the query and resets the query frame.

Figure A.5: The query frame. This frame contains facilities to incrementally build a query.

The SQL query frame is shown after entering the SQL mode through the menu. It contains a text area for entering a query (see Figure A.6). An arbitrary SQL query can be entered in this field. A button for submitting the query and a button for clearing the field are located to the right of the text field. The results of SQL queries are also presented in a table in the results frame.

Figure A.6: Frame for entering a SQL query.


The results frame contains a tab for each query result (see Figure A.7). The query that returned the result is displayed in a tooltip when the mouse hovers over the tab. The results are formatted in tables that can be sorted by clicking on a column header. Holding the "Ctrl" key and clicking a column header specifies a secondary sorting that preserves the first order. The sorting is retained if the results are saved to text files. Right-clicking a row opens a pop-up menu that can be used to open a browser with the corresponding entry of the online database. If the results are from GO or the taxonomy and the 3D view is active, the corresponding node can be centered in the graph with an option in the pop-up menu. It is also possible to search the tree representation for the results. The pop-up menu contains an option to search for the next occurrence of a node in the tree. This makes it possible to find all occurrences of a node in the tree.

Figure A.7: The results frame contains all query results in tabs.

The name search frame is used to search the database with the name of an entity (see Figure A.8). The search keyword is entered into the text field at the top. The domain to be searched is selected with a drop-down list below the text field. The query is submitted with the "Search" button and the results are shown in the result area at the bottom. A double-click on a row of the results table inserts the entity into the condition of the current query.

The 2D tree window

This window is divided into four panes which contain the three ontologies and the taxonomic tree (see Figure 4.3). The panes can be resized by dragging the dividing bar between them. The hits and all possible paths to the root are colored red. A node can be expanded with a single click, and the number in brackets after the name indicates the number of hits in the subtree below this node. After a comparison of sets of GO terms, the terms in the first set are highlighted and their highest similarity to a term in the second set is shown in parentheses after the name. The accession number of the node is displayed in a tooltip, and a pop-up menu provides access to the entity in the online database. An entity can be inserted directly into the query with a double-click. The pop-up menu in the results frame can be used to search for the next occurrence of a hit and to make it visible in the tree window.
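The hit counts and the red coloring described for the 2D trees can be sketched as a recursive walk over a tree. The Java code below is an illustration only; the class names, the accession strings and the decision not to count the node itself are assumptions, not details taken from GOTaxExplorer. Because a GO term with several parents appears several times in a tree representation, each occurrence is counted separately in this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TreeHitsSketch {
    /** Minimal tree node identified by an accession string. */
    static class Node {
        final String accession;
        final List<Node> children = new ArrayList<Node>();

        Node(String accession, Node... kids) {
            this.accession = accession;
            children.addAll(Arrays.asList(kids));
        }
    }

    /** Counts the hits strictly below the given node (the node itself is excluded). */
    static int hitsBelow(Node node, Set<String> hits) {
        int count = 0;
        for (Node child : node.children) {
            if (hits.contains(child.accession)) {
                count++;
            }
            count += hitsBelow(child, hits);
        }
        return count;
    }

    /** A node is colored if it is a hit or lies on a path from the root to a hit. */
    static boolean isHighlighted(Node node, Set<String> hits) {
        return hits.contains(node.accession) || hitsBelow(node, hits) > 0;
    }

    public static void main(String[] args) {
        // Placeholder accessions, not real GO identifiers.
        Node root = new Node("root",
                new Node("inner", new Node("leaf1"), new Node("leaf2")),
                new Node("other"));
        Set<String> hits = new HashSet<String>(Arrays.asList("leaf1", "inner"));
        System.out.println("hits below root: " + hitsBelow(root, hits));
        System.out.println("root highlighted: " + isHighlighted(root, hits));
    }
}
```

Recomputing the count for every node is quadratic in the worst case; a real viewer would more likely cache the count per node after each query.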

Figure A.8: The name search frame provides the possibility to search for an entity with a part of its name.

The 3D graph window

The window shows one ontology or the taxonomic tree (see Figure 4.5 in Section 4). The mouse can be used to rotate, zoom and translate the graph. Holding the left mouse button and moving the mouse rotates the graph view. Moving the mouse back while holding the middle mouse button zooms in, and moving it forth zooms out. Moving the mouse while pressing the right button translates the graph in the view. All hits and their ancestors are colored red as in the 2D view. The size of a node increases with the number of hits in the subgraph of this node.

A right-click on a node opens a pop-up menu with several options (see Figure A.9). First, the node's name, accession number and number of hits are given. The pop-up menu also contains links to the online database. If two GO term sets are compared, the pop-up menu shows the highest similarity values for each term in the first set. The pop-up menu can also be used to center the selected node in the view.

Several functions are available to show and hide nodes and their labels. After results have been loaded into the view, only the hits, their ancestors and the nodes on the second rank are visible. The "Show children" option makes all children of a node visible. If this option is greyed out, all children are already visible. The "Expand subgraph" option makes all nodes in the subgraph rooted at the selected node visible. "Collapse subgraph" hides all descendants of the selected node. The label of the selected node, the children's labels, the ancestors' labels and the descendants' labels can be toggled with the options in the "Toggle Label" submenu. Hiding uninteresting parts of the graph and their labels improves the performance of this view considerably.

A right-click on an edge opens a menu with the options to center the edge's start node or end node. This makes it possible to browse the hierarchy quickly.
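Since the 3D view shows a graph in which a term can have several parents, the set of nodes to color (hits and all their ancestors) can be obtained with a breadth-first walk along the parent links. The Java sketch below shows this idea under assumptions: the class name, the parent map and the placeholder accessions are hypothetical and do not come from the GOTaxExplorer source.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AncestorSketch {
    /**
     * Returns the hits together with every term that can be reached from
     * a hit by following parent links. Terms without an entry in the map
     * are treated as roots.
     */
    static Set<String> hitsWithAncestors(Map<String, List<String>> parents,
                                         Collection<String> hits) {
        Set<String> marked = new HashSet<String>(hits);
        Deque<String> queue = new ArrayDeque<String>(hits);
        while (!queue.isEmpty()) {
            String term = queue.poll();
            List<String> termParents = parents.get(term);
            if (termParents == null) {
                continue; // root or unknown term
            }
            for (String parent : termParents) {
                // add() returns false for already marked terms, so each
                // term is visited once even if several paths lead to it.
                if (marked.add(parent)) {
                    queue.add(parent);
                }
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        // Placeholder accessions: "d" has two parents, as GO terms often do.
        Map<String, List<String>> parents = new HashMap<String, List<String>>();
        parents.put("d", Arrays.asList("b", "c"));
        parents.put("b", Arrays.asList("a"));
        parents.put("c", Arrays.asList("a"));
        parents.put("e", Arrays.asList("a"));
        System.out.println(hitsWithAncestors(parents, Arrays.asList("d")));
    }
}
```

Marking each term only once keeps the walk linear in the number of parent links, even when terms have many alternative paths to the root.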

Figure A.9: Pop-up menu displayed for a node in the 3D graph view.

A.3 Helper programs for database maintenance

The "genes" and the "completed_genomes" databases are constructed and updated with a set of Java™ programs. All of these programs use a configuration file that specifies the database details and the data files. The possible parameters are:

url
    The database uniform resource locator (URL).
user
    The username for the database.
password
    The password for the database access.
driver
    The database driver class. This class has to be in the classpath. Default: com.mysql.jdbc.Driver.
bp-graph
    The file with the WilmaScope graph for biological process. The absolute path is required. Default: ./bp.xwg.
mf-graph
    The file with the WilmaScope graph for molecular function. The absolute path is required. Default: ./mf.xwg.
