Software review. Biomolecular Interaction Network Database


Keywords: protein interactions, visualisation, biology data integration, web access

Abstract
This software review looks at the utility of the Biomolecular Interaction Network Database (BIND) as a web database. BIND offers methods common to related biology databases and specialisations for its protein interaction data. Searching and browsing this database is easy and well integrated with the underlying data and the needs of scientists. Interaction networks are visualised with software that offers many useful options. The innovative ontoglyphs are used throughout to provide visual cues to protein functions, localisation and other aspects one needs to know for this data set. One can expect to get useful results that may be well integrated with one's research needs.

INTRODUCTION
Web-accessible databases are increasingly the primary sources of data and information for many biologists in their research. These cover a gamut of bioinformation from literature (PubMed, journals), sequences (NCBI, EBI), genomes (model organism and other databases), gene expression, molecular dynamics and proteomics, taxonomic and phylogenetics databases and others. To support access and use of these databases, the database maintainers develop or deploy web software ("webware") to let you search, retrieve and analyse their data. Currently much of this database webware is developed for each individual project. There are efforts to produce and use among bio-databases a set of common methods, database tools and webware, such as the Generic Model Organism Database [1] and Ensembl/BioMart [2].

The Biomolecular Interaction Network Database (BIND) [3] is one web database with important information used by many bioscientists. It contains a wealth of protein interaction data with supporting curated literature and molecular function information.
This includes data automatically captured from high-throughput projects, human-curated information from the scientific literature, as well as data integrated from other biology databases. Started in 1998, BIND is a project of the Blueprint Initiative [4] for public biomolecular data. This resource includes over 175,000 interactions, nearly 3,000 protein complexes and several pathways, drawn from publications and high-throughput experiments.

How well does BIND support access to its data via web functions? Does a biologist find common access methods when using BIND and other web databases? Does BIND offer methods uniquely suited to its data?

WEB DATABASE COMPONENTS
Biology web databases share common components that are needed to make them most useful to scientists. The usefulness of a database depends as much on how fully its webware implements such components as on the value of the underlying data. Table 1 indicates the major components of BIND's web database. This web database was updated significantly in 2005, and offers improved browsing and search features, a new downloading interface, taxonomy identifier searching and expanded use of ontoglyphs.

© HENRY STEWART PUBLICATIONS. BRIEFINGS IN BIOINFORMATICS. VOL 6. NO JUNE 2005

Table 1: BIND database web components
Component | Options | Comments
Browsing | All or limited by types, taxonomy | Usable access to all interactions for discovery
Searches | Any text, fields, IDs, BLAST | Extensive, appropriate choices
Results | View formats, export formats | Filter results; missing refine-search
Interaction reports | Pairs, complexes, graphs | Highly interactive and customisable
Submit data | | Not reviewed
Data exports | All data contents, multiple formats | Easy to get any or all data in useful formats
Documents | Tutorials, FAQ, publications | Good selection, more complete user manual would help
Related data (BIND) | PreBIND, SMID | Well integrated, supports main data
External data | Imports from sequence, molecular and genome databases, gene ontologies, external IDs | Strong external integration

Searching and browsing
BIND offers several search options: simple text searches, searches using IDs from several databases, and searches by specific data fields. BLAST searches of proteins in the database are also a choice. One can also start without search questions, by browsing through all data. This is handy for new customers who want to see what is available. The searches allow you to focus on all aspects of this data set: literature information, molecule structure, gene information including functions, IDs and sequences, and taxonomy.

After a search, one can get hundreds or possibly many thousands of interaction results. There are methods to limit or filter these to a smaller interesting set, or one can repeat a search with other terms. A method to refine a search by adding new limiting terms, as one finds in other web databases, would be a useful addition.

A specialised PreBIND data set is available for searching. It uses a supervised learning algorithm (SVM) to search for interactions described in literature. This will find papers that may not have obvious simple interaction references.
This is an example of applying newer data-mining techniques to biological data to yield more useful results than common, simpler database and text search methods.

Visualising molecular interactions
The BIND Interaction viewer shows interaction networks for complexes. The basic display is a network graph of relations between molecules, with glyphs for each molecule designed from the common ontoglyph components. This viewer offers numerous options for customising the view based on protein binding, function and localisation, and for manipulating the network graph. Such customisability allows one to better dissect and extract useful information from complex networks. This viewer comes in two versions, both Java driven. The newer version 3 is a Java application that will run on most computers with a Java runtime system installed. Version 2 is designed as an applet whose function is dependent on quirks of web browser versions.

Ontoglyphs
Ontoglyphs are a special feature of BIND that provides pictorial representations of gene ontology and product information, as shown in Figure 1. These can be used in filtering results, data summaries and other parts of the web access. Some 83 primitive symbols (glyphs) are used to represent function, binding and cellular localisation. Several of these glyphs are combined to represent a molecule's overall function in a pictorial way that is rather intuitive. It allows one to select and see related molecular interactions more readily than reading text descriptions.

Figure 1: Ontoglyph example. The example protein (CD28) ontoglyph at bottom has functions, binding and cell location properties as identified by several component glyphs at top. [Glyph labels in the figure: Protein binding; Protein synthesis, processing; Viral life cycle; Signal transduction; Death and regulation; Defence/immune response; Cell multiplication; Cell periphery.]

Overall usability
The many components of this web database system are well integrated with each other. These include common user-interface components, such as paged results and listings, common export-this-view sections, drop-down expansion of information, and related information links. The overall web page design is clean, not cluttered with extraneous information, and focuses on the presentation of interactions, which can be rather complex. The graphic ontoglyphs are a common feature that, once learned, is an invaluable aid for reducing complexity and focusing on those components a scientist is most interested in.

Integration with other databases
BIND imports data from several other databases with protein information, including model organism genome projects for mouse, fruit fly and yeast. High-throughput experiments with hundreds or thousands of protein interactions are targeted by BIND engineers, who developed the necessary tools to import each data set. Primary databases with protein-related information that are drawn on include NCBI sequences and structure, PubMed literature, Gene Ontology and others. Molecular structures from the MMDB [5] protein structure database are imported. Identifiers from several chemical and molecular registries are included, such as CAS and Beilstein.

The Small Molecule Interaction Database (SMID) is a related database at Blueprint that focuses on small molecule components of proteins.
The recently released SMID-Genomes section provides a highly useful integration of genome sequence data and small molecules from the PDB [6]. With this web database, a researcher can find molecules unique to a given organism genome, or molecules shared among various species. One can select up to five organisms, and find all the small molecules that are either unique to them, or shared in common among them. The ability to screen molecules by species through this web database has numerous uses. One important use is identifying pharmaceutical and agricultural target molecules. For instance, insect pesticides benign to a crop plant may be identified by finding molecules used in Drosophila that are not found in Arabidopsis.

Documentation and help
The web site provides a range of documentation on using BIND, how BIND data are collected and curated, lists of data sources and how these are integrated. This includes tutorial documents for getting started, and for getting the most use out of this database. Descriptions and schema for the data structures, database and software tools are available. Publications on this database and answers to frequently asked questions are available. The tutorials provided did not appear to cover all current features, results, viewers and functions of this web database. Though many aspects can be learned by use, a more complete user manual would be a welcome addition.

DISCUSSION
The BIND web database offers a well-integrated, fully featured portal into an important resource for understanding biomolecular interactions. Its innovative features and common components for finding, using and exporting its data provide a useful example for bioinformatics web databases. A scientist can expect to find what they are looking for if it exists in this database, and also to get data out of it into their spreadsheet or other analysis tools with little problem. There are features that make it possible to start from other databases and get the results one wants, or to take results from this database to another database. Support for such cross-database travel is essential to many researchers today.

Web databases that centre on interactions, whether of molecules, genes or other factors, need good methods to visualise the interactions, as well as options for selecting, filtering and focusing on those portions of an interaction network that the scientist is most interested in. BIND works towards that need with interaction visualisation software, and many customising options. The Interaction visualiser works well (version 2 failed for this reviewer's web browser, but version 3 worked properly), and allows one to view and select from individual interactions.
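
As a rough illustration of what selecting and filtering portions of an interaction network involves, the sketch below treats interactions as pairs of molecules and shows two operations: taking the neighbourhood of one molecule, and keeping only interactions whose partners share an annotation. It is not BIND's viewer code or data model. The molecule names, the localisation labels and the function names are all hypothetical and were chosen only for this example.

```python
from collections import defaultdict

# Hypothetical interaction pairs; the molecule names are placeholders,
# not records taken from BIND.
INTERACTIONS = [
    ("P1", "P2"),
    ("P1", "P3"),
    ("P2", "P4"),
    ("P5", "P6"),
]

# Hypothetical localisation labels, standing in for the kind of
# annotation an ontoglyph conveys.
LOCALISATION = {
    "P1": "cell periphery",
    "P2": "cell periphery",
    "P3": "nucleus",
    "P4": "cell periphery",
    "P5": "nucleus",
    "P6": "nucleus",
}


def build_network(pairs):
    """Build an undirected adjacency map from interaction pairs."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def neighbourhood(network, start, max_steps):
    """Molecules reachable from `start` in at most `max_steps` interactions."""
    seen = {start}
    frontier = {start}
    for _ in range(max_steps):
        # Expand one interaction outwards, skipping molecules already seen.
        frontier = {n for m in frontier for n in network.get(m, ())} - seen
        seen |= frontier
    return seen


def filter_pairs(pairs, keep):
    """Keep interactions whose partners both satisfy the predicate `keep`."""
    return [(a, b) for a, b in pairs if keep(a) and keep(b)]


network = build_network(INTERACTIONS)
near_p1 = neighbourhood(network, "P1", max_steps=1)
periphery = filter_pairs(
    INTERACTIONS, lambda m: LOCALISATION[m] == "cell periphery"
)
print(sorted(near_p1))
print(periphery)
```

Keeping the filter as a predicate means the same function can select by localisation, function or any other annotation, which is roughly the flexibility a customisable viewer has to offer.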
The web page results and reporting methods provide good support for finding and understanding interaction information. This project's innovation with ontoglyphs, for indicating components of molecules and interactions, is one that makes this service especially useful. This innovation could be adopted by other biology web databases to their advantage. One aspect that is missing or still in development at BIND is a method for searching graphically based on molecular structures in a more extensive way than the ontoglyphs and related methods allow. BIND's related small molecule SMID-Genomes database offers exceptional promise for species-oriented molecule and interaction discovery, and one can expect to see future integrations.

Related web databases for protein information include the Protein Data Bank (PDB) [6], NCBI's MMDB [5] and related protein databases, along with numerous others. KEGG [7] is a widely used database that integrates genes and protein pathways, along with chemical compound structures. The BRITE subsidiary of KEGG is a useful protein interaction data set that is integrated with other web search and protein data-reporting functions at KEGG.

IntAct [8] is a protein interaction database with a similar basic goal to BIND, with interactions derived from literature curation and user submissions. This is a newer and smaller database, with some 50,000 pairs of interactions, and operates as a collaboration of several European database groups, including Max-Planck-Institut, Swiss Institute of Bioinformatics and the EBI. The web access to IntAct centres on database searches by several attributes: gene names, InterPro, SwissProt, Gene Ontology and PubMed identifiers. Results include tabular lists of paired interactions, along with descriptions, curated annotations and links to related protein interactions. Interaction networks are visualised with graphs that are similar in design to BIND.
This service uses plain web images (GIF, JPEG) instead of a Java application as at BIND. This offers usability to a broader range of customers, while sacrificing the user interactivity and customisability that the BIND viewer provides. Both of these services provide basic data output in XML formats. The BIND service provides a broader range of result listings and data export options, including ID lists, tabular forms and FASTA sequences. Software of both projects is available as open source, although possibly not in full. Both projects provide all data they curate and collate for public reuse without restrictions.

The Blueprint Initiative and BIND have grown out of work by principal investigator Christopher Hogue, which includes related databases and software tools for biomolecular data. SeqHound is one of the popular adjunct programs from this group. It is a database of common public biological sequences and structures, along with software for efficient updating and rapid access to this collection. NBLAST, for network/cluster BLAST analyses, and a Distributed Folding Project are some of the other useful bioinformatics tools from this group.

Acknowledgments
This work is supported in part by NIH grant 1R01HG to the author.

Don Gilbert
Biology Department, Indiana University, Bloomington, Indiana 47405, USA
gilbertd@indiana.edu

References
1. Stein, L. D., Mungall, C., Shu, S. et al. (2002), 'The generic genome browser: A building block for a model organism system database', Genome Res., Vol. 12.
2. Kasprzyk, A., Keefe, D., Smedley, D. et al. (2004), 'EnsMart: A generic system for fast and flexible access to biological data', Genome Res., Vol. 14.
3. Alfarano, C., Andrade, C. E., Anthony, K. et al. (2005), 'The Biomolecular Interaction Network Database and related tools 2005 update', Nucleic Acids Res., Vol. 33, pp. D418-D424 (Database issue).
4. URL:
5. Chen, J., Anderson, J. B., DeWeese-Scott, C. et al. (2003), 'MMDB: Entrez's 3D-structure database', Nucleic Acids Res., Vol. 31 (URL: Structure/MMDB/mmdb.shtml).
6. Berman, H. M., Westbrook, J., Feng, Z. et al. (2000), 'The Protein Data Bank', Nucleic Acids Res., Vol. 28.
7. Kanehisa, M. and Goto, S. (2000), 'KEGG: Kyoto Encyclopedia of Genes and Genomes', Nucleic Acids Res., Vol. 28.
8. Hermjakob, H., Montecchi-Palazzi, L., Lewington, C. et al. (2004), 'IntAct: an open source molecular interaction database', Nucleic Acids Res., Vol. 32, pp. D452-D455.

Software review. Shopping in the genome market with EnsMart

Keywords: genome databases, human genome, comparative genomics, data mining, open source software

Abstract
Life scientists who work with the supermarket of genome