LinkDB: A Database of Cross Links between Molecular Biology Databases

Susumu Goto, Yutaka Akiyama, Minoru Kanehisa
Institute for Chemical Research, Kyoto University

Introduction

We have developed a molecular biology database retrieval system, DBGET, which allows users to retrieve entries by keywords or entry names among sixteen databases. Many of these databases have cross references to other databases, so related entries can be retrieved easily by following the cross references. Figure 1 shows the databases currently supported in the DBGET system and the cross references among them.

Figure 1

Related entries are especially easy to retrieve with the WWW version of the DBGET system, which we call WebDBget [1], because WWW provides cross references to EMBL, PROSITE, PDB, and so on, and they are highlighted as clickable items in WebDBget.

However, some databases have few or no cross references to other databases. Even when a database has references to external databases, the user often must search several databases in turn to obtain the required information. Some examples follow.

OMIM does not have cross references to external databases, though it has internal references. It is therefore difficult to retrieve related information, such as the amino acid sequence and the nucleotide sequence of the gene responsible for a disease.

If users want to retrieve the related amino acid sequence data from a GenBank entry, one possible way is first to retrieve the EMBL entry that has the same entries described in the cross reference field of the EMBL entry.

The same situation occurs when literature information is needed from the LIGAND enzyme reaction database. If literature on the structure of the enzyme is needed, a PDB search from LIGAND followed by a MEDLINE search is required. If the user needs literature on the amino acid sequence or the nucleotide sequence, the situation is the same. It is not reasonable to describe all literature information on enzymes in the LIGAND entries, because there are different kinds of information, such as protein structures, amino acid sequences, and nucleotide sequences.

Therefore the management of cross references (cross links) between several databases is indispensable, and we constructed a database of cross link information among sixteen databases. We call this LinkDB.

Method

We constructed LinkDB in the following three steps. This construction is similar to that of link information in the SRS system [2,3]. The differences between

1. Extraction of original links

Many molecular biology databases have cross links to external databases. First, we extracted those cross links and constructed original link tables for each database. The extraction was done on the following three kinds of information.

A. Links explicitly specified in the database entries
Most databases have cross links, such as MEDLINE IDs in GenBank, explicitly defined as a pair of destination database and entry number.

B. Links to LIGAND enzyme reaction database
Many databases describe E.C. numbers in their DEFINITION or TITLE lines. These are used to establish links to the LIGAND enzyme reaction database.

C. Links by same accession numbers
GenBank and EMBL have corresponding entries, which describe the same sequences and have the same accession numbers. We also extracted this information as cross links between GenBank and EMBL.

2. Construction of reverse links

We constructed reverse links by inverting the original links constructed in step 1, so that link information can be retrieved in both directions. These links enable users to easily so forth.

3. Construction of indirect links

be reached in one step. We constructed indirect links by combining these direct links. The indirect links enable users to retrieve, for example, All the indirect links in LinkDB are precomputed; that is, we (database construct them. For now, we do not provide users with a query interface to specify paths and compute indirect links interactively.

Result

We constructed links for about a million entries from the sixteen databases. LinkDB can be accessed through WebDBget. After an entry is retrieved, its name is highlighted in the entry window. Clicking the highlighted entry name displays the result of the LinkDB search. Figure 2 shows part of an OMIM entry in a WebDBget search result, with the entry name highlighted.

Click here!
MIM Entry: 308000
TITLE: *308000 HYPOXANTHINE GUANINE PHOSPHORIBOSYLTRANSFERASE [HPRT; HGPRT;
TEXT: spastic cerebral palsy, choreoathetosis, uric acid urinary stones, anemia has been found by some (van der Zee et al., 1968).

Figure 2.

The result of a LinkDB search is a list of tuples consisting of the database name, the entry name, the type of the link (original, reverse, or indirect), and the path information if the link is indirect. The path information is important and useful because it can be a key to understanding the information contained in the destination entry. For example, the indicates that the entry includes nucleotide sequence. We are planning to add the definition (or title) of each destination entry to LinkDB.

Discussion

LinkDB is a database of link information among sixteen databases. The SRS system by Etzold et al. [2,3] also provides link information between several databases. Because SRS has a query language for flexibly constructing indirect links, users can retrieve link information along arbitrary paths. Instead of providing such a query language, we precomputed useful path information and constructed LinkDB to include indirect path information.
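The three construction steps of the Method section, together with the tuple format of a search result, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the entry names (`G1`, `E1`, and so on), the function `build_linkdb`, the `max_steps` cutoff, and the representation of the path as a chain of database names are all assumptions made for illustration.

```python
from collections import defaultdict

# Step 1 (assumed input): toy "original" links, as if already extracted
# from the cross-reference fields of entries. All entry names are invented.
original = [
    (("genbank", "G1"), ("medline", "M1")),
    (("genbank", "G1"), ("embl", "E1")),       # e.g. same accession number
    (("embl", "E1"), ("swissprot", "S1")),
    (("swissprot", "S1"), ("pdb", "P1")),
]


def build_linkdb(original, max_steps=3):
    """Precompute, for every entry, rows of
    (destination db, destination entry, link type, path)."""
    # Step 2: direct links are the original links plus their inverses.
    direct = defaultdict(dict)
    for src, dst in original:
        direct[src][dst] = "original"
        direct[dst].setdefault(src, "reverse")

    # Step 3: indirect links, built by chaining direct links breadth-first.
    linkdb = {}
    for start in list(direct):
        rows = [(d[0], d[1], kind, None) for d, kind in direct[start].items()]
        seen = {start, *direct[start]}
        frontier = [(d, [start[0], d[0]]) for d in direct[start]]
        for _ in range(max_steps - 1):
            next_frontier = []
            for node, path in frontier:
                for nxt in direct[node]:
                    if nxt in seen:
                        continue  # keep only the first (shortest) path found
                    seen.add(nxt)
                    new_path = path + [nxt[0]]
                    rows.append((nxt[0], nxt[1], "indirect", "->".join(new_path)))
                    next_frontier.append((nxt, new_path))
            frontier = next_frontier
        linkdb[start] = rows
    return linkdb


# Lookup is then a plain dictionary access for the specified entry.
for row in build_linkdb(original)[("genbank", "G1")]:
    print(row)
```

Because every row is computed ahead of time, retrieving the links of one entry needs no path search at query time, which is the trade-off against an interactive query language described above.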
Precomputing possible paths has the following two advantages.

It often takes much time to compute links, especially when they are long. LinkDB is precomputed and can therefore be retrieved efficiently when an entry is specified.

It is useful for users who are not familiar with the information about links, especially indirect links. It can also notify even expert users of link information about newly added databases.

LinkDB should be updated right after any of the sixteen databases is updated, because LinkDB is constructed from the cross links described in the underlying databases. Checking for their updates and updating LinkDB are done fully automatically. Currently, it is done by checking release update of each underlying When we augment the underlying databases, the LinkDB constructor must specify the path from and to the newly augmented database. End users, however, do not have to consider this specification.

Using the LinkDB, we can retrieve the databases constructed in our laboratory and Database) and AAindex (Amino Acid Index Database). As shown in the example in the previous section, i.e. if we click the highlighted entry name in Figure 2, we can easily retrieve the relationship between a genetic disease and the related mutants.

The most critical limitation is that there is the case that the LinkDB does not preserve each other (probably related to the same gene).

We do not use a database management system for constructing and maintaining LinkDB, because we constructed all links beforehand and LinkDB is used only to retrieve the information for a specified entry. When we provide a query interface to specify paths interactively, however, a database management system with indexing for the original and reverse links will be indispensable. One possible solution is the use of deductive databases. Functions of deductive databases that process recursive queries with various conditions can be useful for computing specified paths with conditions.
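To make the idea of a recursive query with a condition concrete, here is a small sketch in plain Python rather than in a deductive database language. It enumerates paths from a start entry to a target database and keeps only those that pass through a required intermediate database. The adjacency list, the entry names, and the function `find_paths` with its `via_db` and `max_steps` parameters are hypothetical.

```python
# Invented toy adjacency list of direct (original and reverse) links.
direct = {
    ("genbank", "G1"): [("medline", "M1"), ("embl", "E1")],
    ("embl", "E1"): [("genbank", "G1"), ("swissprot", "S1")],
    ("swissprot", "S1"): [("embl", "E1"), ("pdb", "P1")],
    ("pdb", "P1"): [("swissprot", "S1"), ("medline", "M2")],
    ("medline", "M1"): [("genbank", "G1")],
    ("medline", "M2"): [("pdb", "P1")],
}


def find_paths(direct, start, target_db, via_db=None, max_steps=5):
    """Return all cycle-free paths from start to entries of target_db.

    If via_db is given, keep only paths that visit that database.
    """
    found = []

    def walk(node, path):
        if node[0] == target_db and len(path) > 1:
            if via_db is None or any(db == via_db for db, _ in path):
                found.append(path)
            return
        if len(path) - 1 >= max_steps:  # number of links followed so far
            return
        for nxt in direct.get(node, []):
            if nxt not in path:  # never revisit an entry on the same path
                walk(nxt, path + [nxt])

    walk(start, [start])
    return found


# All literature reachable from the GenBank entry ...
print(find_paths(direct, ("genbank", "G1"), "medline"))
# ... and only the literature reached through a PDB entry.
print(find_paths(direct, ("genbank", "G1"), "medline", via_db="pdb"))
```

In a deductive database the same restriction would be stated declaratively as a recursive rule with a condition; the explicit depth-first search here only shows what such a rule has to compute.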
An example of such a query is the retrieval, from a GenBank entry, of all related literature restricted to that about protein structures. The functions are also useful to databases in the LinkDB is one of the most important future directions as well as the extraction and representation of the biological data for guaranteeing the biological meaning of the links.

Acknowledgement

Priority Area Genome Informatics from the Ministry of Education, Science and Culture in Japan.

References

1. Akiyama, Y., Goto, S., Uchiyama, I. and Kanehisa, M.: WebDBGET: an
2. related database entries, MIMBD 95 (1995).
3. Etzold, T. and Argos, P.: Transforming a set of biological flat file libraries to