Down with Species-Specific Database Projects, Up with Data Services


Lincoln D. Stein, Cold Spring Harbor Laboratory

This whitepaper begins with an illustration drawn from a database that has nothing to do with the plant kingdom. Figure 1 shows a Cell page from WormBase, a web site devoted to the genome and biology of C. elegans. Among the information available on this page are the cell's role in the organism's life cycle, its lineage, its fate, a diagram of its anatomy, information on the genes that are known to be turned on in this cell, and links to citations that refer to the cell.

Figure 1: The Standard HTML Display of a C. elegans Cell

Figure 2 shows what is displayed when the user clicks on the link labeled "XML Display" at the top of the first page. This displays the raw database record in XML (eXtensible Markup Language) form. It is the same information as was displayed on the HTML page, but freed of extraneous formatting and easily parsed by standard software.

Figure 2: The XML Representation of a C. elegans Cell

Every single data object in WormBase, whether it be a cell, a gene, a sequence, a protein, a genetic map, a mutant, or an allele, is available in XML form. To fetch the data, there is a simple published URL: where N and C are the name and class of the object to fetch in XML form. For the page shown earlier, the object name is ADAR, and the class is Cell. The schema for the class can be fetched using a variant on this URL: where N is the name of the class, for example Cell.
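To make the name-and-class lookup concrete, here is a minimal sketch of how a client might build such a request and parse the returned record. The endpoint, the query parameter names (`name`, `class`) and the sample record are placeholders invented for illustration; they are not the actual WormBase URL or XML layout.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; stands in for the published URL, which is not shown here.
BASE_URL = "https://db.example.org/xml"


def object_url(name: str, cls: str) -> str:
    """Build a request URL for one object, identified by its name and class."""
    # Parameter names are assumptions made for this sketch.
    return f"{BASE_URL}?{urlencode({'name': name, 'class': cls})}"


# Made-up record used only to demonstrate parsing; the real schema may differ.
SAMPLE_RECORD = """<Cell name="ADAR">
  <Summary>placeholder text</Summary>
  <Reference>placeholder citation</Reference>
</Cell>"""


def parse_record(xml_text: str) -> dict:
    """Turn a flat XML record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "class": root.tag,
        "name": root.get("name"),
        "fields": {child.tag: (child.text or "").strip() for child in root},
    }


if __name__ == "__main__":
    print(object_url("ADAR", "Cell"))
    print(parse_record(SAMPLE_RECORD))
```

In a real client the text passed to `parse_record` would come from an HTTP GET on the URL returned by `object_url`; the network call is left out so the sketch runs offline.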

At first this service might seem extraneous. After all, the primary target audience for WormBase, the bench biologist, has no need for XML. What he wants is a browsable HTML page, or a downloadable Excel spreadsheet. In fact, the XML service is provided for the benefit of computer-savvy biologists and bioinformatics professionals, both those directly involved with WormBase and those outside the project.

For the WormBase insiders, the XML interface allows us to decouple our HTML pages from the underlying database. The core WormBase sequence annotation viewer is driven entirely off XML, and can be moved from data source to data source without modification.

More importantly, the XML service is one of the ways that we make WormBase "sticky" for other bioinformatics efforts. Should a bioinformaticist need to download a genetic map, a cell lineage or a set of sequence annotations from WormBase, he need only generate a request for the URL given above and the database will obligingly yield up all it knows in a predictable, easily-parsed format. Other ways that we have endeavoured to make WormBase sticky are:

1) a universal incoming link format that allows an external web page to link to WormBase using the database object's name and class
2) ad hoc queries on the database using an HTTP request
3) direct ad hoc queries on the database via a command-line Perl API
4) flatfile dumps of sequence annotations using HTTP requests
5) structured dumps of sequence annotations using the distributed annotation system (DAS) protocol
6) FTPable tab-delimited tables containing frequently-requested extracts of the database
7) the entire database (in ACEDB format) can be downloaded and installed
8) all the software for WormBase is open source and available on the WormBase FTP site. There is no restriction on who uses the software and what purpose they use it for.

Why Biological Databases Should be Sticky

Stickiness means having hooks for third parties to connect to.
It is a hallmark of good biological web sites. NCBI's Entrez is sticky by virtue of having a published URL for incoming links, and the LinkOut system for outgoing connections. The UCSC genome browser provides flat file dumps and a call interface for dumping out selected annotation tracks. Ensembl provides a Perl command-line interface for accessing its resources at a high level of abstraction and a SQL interface for issuing low-level queries.

Why is this important? Because the era in which biological databases could stand alone has departed, if indeed it existed at all. First is the breakdown of the one-species/one-database model. We are moving away from a world of single-species databases towards a situation in which biological databases become specialists for a particular type of information across a wide variety of species. Prominent examples include TIGR's TOGA groups, which cross species boundaries, InterPro's protein families, and the Gene Ontology Consortium's process terms.

Another factor is the increasing sophistication of users. As a new generation of computer-savvy biologists appears, biological databases are now called upon to serve the needs of researchers who want to do more than browse web pages. They want to extract the information, transform it, compare it to their own data, and integrate it into data services that they have built themselves.

Sticky databases provide the hooks needed to integrate data. They provide the stable interfaces needed to link data sources together, to extract and transform information, and possibly even to submit new data. They get used in creative ways that their inventors did not envision, enriching the community and enlivening the scientific discourse. Non-sticky databases are pretty web sites: very useful in their own right, but with no potential to grow beyond the immediate goals of their designers.

Biological Databases as Service Providers

The ultimate sticky database is one that is organized around the idea of a set of data services. Instead of offering a few hooks to the community, a biological data service is nothing but hooks. I'll give a concrete example of what I mean.
Imagine a database that contains a number of genetic maps, each of which is composed of a set of markers. A genetic map service would have a published interface which accepts requests for genetic maps and returns lists of markers and their positions. Other interfaces would allow the user to retrieve the list of maps available, and to select certain genetic maps based on the map type and species.

Now imagine a marker information service that contains molecular information for genetic markers: the primer pairs, assay conditions, and so forth. It would respond to requests for markers by returning the associated information, and provide a query interface for selecting markers by their type, polymorphism rate, and so forth.

Using a combination of these two services, a programmer could still put together the classic browsable genetic map interface which draws genetic maps and responds to clicks on markers by returning the corresponding molecular information. By breaking the information into discrete services with a published interface, the data provider has opened this information to the community to use for diverse purposes such as comparative map analysis.

There are other benefits. The service model makes it possible for the two databases to be physically separate, and possibly under different administrative control. Data visualization and query tools can now be written to a stable interface, providing modularity. This modularity, in turn, encourages code reuse and sharing, and allows one user interface to run on top of many data sources.

The DAS Experience

The prototype for this type of biological information service is DAS, the Distributed Annotation System. DAS is an experimental client/server protocol designed by Sean Eddy and myself which allows biological databases to become service providers for sequence annotation information. In the core of the protocol, an information consumer asks the server for all or a subset of its annotations in a particular region of the genome, and the server responds by returning a list of its annotations. It is then up to the client to store, analyze or display the information.

The protocol deliberately limits the amount of information that can be transmitted by the data source to the bare essentials of a sequence annotation: a reference point on the genome, the sequence range that the annotation covers, and a brief description of the annotation type. For further information on how the annotation was made and its significance, the client is referred to a URL provided by the data source.
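The core exchange just described can be sketched in a few lines. This is an in-memory illustration of the idea only, not the DAS wire format or any real DAS library: the `Annotation` fields mirror the "bare essentials" listed above, while the class name, function name, coordinate convention and sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Annotation:
    reference: str  # reference point on the genome, e.g. a chromosome name
    start: int      # first base covered (assumed 1-based, inclusive)
    end: int        # last base covered (inclusive)
    type: str       # brief description of the annotation type
    link: str       # URL at the data source for further detail


def annotations_in_region(
    store: Iterable[Annotation],
    reference: str,
    start: int,
    end: int,
    types: Optional[Set[str]] = None,
) -> List[Annotation]:
    """Return all (or a typed subset of) annotations overlapping a region."""
    hits = [
        a for a in store
        if a.reference == reference
        and a.start <= end and a.end >= start        # interval overlap test
        and (types is None or a.type in types)       # optional subset filter
    ]
    return sorted(hits, key=lambda a: (a.start, a.end))


# Made-up annotations, for illustration only.
STORE = [
    Annotation("I", 100, 200, "exon", "https://source.example.org/a1"),
    Annotation("I", 500, 900, "repeat", "https://source.example.org/a2"),
    Annotation("II", 150, 250, "exon", "https://source.example.org/a3"),
]

if __name__ == "__main__":
    for ann in annotations_in_region(STORE, "I", 150, 600):
        print(ann)
```

The server side is reduced here to a filter over a list; what the client does with the returned annotations (store, analyze, display) is deliberately left open, as in the protocol description.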
The DAS protocol allows the same visualization and analysis tool to run on top of any database that provides a DAS service. Data providers can retrofit their databases to provide the DAS service by creating a relatively thin DAS compatibility layer.

So far, the DAS experiment seems successful. In recent months, it has proven to be very popular among the model organism databases, and is now used by EBI Ensembl, the human genome browser at UCSC, TIGR, WormBase, University of Cambridge, the Berkeley Drosophila Sequencing Project, and others. There is considerable enthusiasm among the developer community, and an ever-increasing number of data providers have indicated their intent to build or install DAS servers.
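The same "write the tool once against a published interface" idea applies to the genetic map and marker information services imagined earlier. The sketch below is one hypothetical way to express it: the interface names, method names and sample data are invented for illustration, and the in-memory classes stand in for what would be two physically separate services.

```python
from typing import Dict, List, Protocol, Tuple


class GeneticMapService(Protocol):
    """Published interface of a (hypothetical) genetic map service."""
    def list_maps(self) -> List[str]: ...
    def get_map(self, map_name: str) -> List[Tuple[str, float]]: ...


class MarkerInfoService(Protocol):
    """Published interface of a (hypothetical) marker information service."""
    def get_marker(self, marker_name: str) -> Dict[str, str]: ...


class InMemoryMapService:
    def __init__(self, maps: Dict[str, List[Tuple[str, float]]]) -> None:
        self._maps = maps

    def list_maps(self) -> List[str]:
        return sorted(self._maps)

    def get_map(self, map_name: str) -> List[Tuple[str, float]]:
        # Markers are returned ordered by map position.
        return sorted(self._maps[map_name], key=lambda m: m[1])


class InMemoryMarkerService:
    def __init__(self, markers: Dict[str, Dict[str, str]]) -> None:
        self._markers = markers

    def get_marker(self, marker_name: str) -> Dict[str, str]:
        return self._markers.get(marker_name, {})


def describe_map(maps: GeneticMapService, markers: MarkerInfoService,
                 map_name: str) -> List[str]:
    """Client tool: combines both services, knowing only their interfaces."""
    lines = []
    for marker, position in maps.get_map(map_name):
        info = markers.get_marker(marker)
        lines.append(f"{position:.1f} {marker} type={info.get('type', 'unknown')}")
    return lines


# Made-up data, for illustration only.
DEMO_MAPS = InMemoryMapService({"demo_map": [("m2", 12.5), ("m1", 3.0)]})
DEMO_MARKERS = InMemoryMarkerService({"m1": {"type": "SSR"}})

if __name__ == "__main__":
    print("\n".join(describe_map(DEMO_MAPS, DEMO_MARKERS, "demo_map")))
```

Because `describe_map` depends only on the two interfaces, either in-memory class could be swapped for a client that talks to a remote provider without changing the tool itself.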

Modularizing Biological Databases

In addition to the genetic mapping and sequence annotation services described earlier, I see many opportunities for reorganizing biological databases as sets of discrete services. Here are a few ideas for standard services:

- A comparative genetic mapping service, which, given coordinates on one map, translates those coordinates to another map. This could be used to compare different genetic maps in the same species as well as those in different, but substantially syntenic, species.
- Along the same lines, a genome assembly translation service, which translates coordinates from one version of a sequence assembly to another.
- A gene ontology service, which, given a protein identifier, returns the gene ontology assignments for that protein.
- A protein family service, which, given a protein identifier, returns its domains, families and superfamilies according to one or more protein classification systems.
- A mutant strain service, which, given a phenotype and a species, returns all strains that express that phenotype (this presupposes a phenotype ontology, such as several groups are developing).
- A sequence similarity service, which runs searches for nearly identical sequences using one of the new fast algorithms (e.g. SSAHA or BLAT).

I envision these services being implemented on top of an industry-standard communications protocol. I lean towards SOAP/XML because of the predominant industry trend in this direction, but other protocols (e.g. CORBA) should be taken under consideration.

WormBase as an Anachronism

Let's return to WormBase for a moment. Although I think WormBase has done well in serving the needs of the C. elegans community, its original goal to be the authoritative one-stop shop for all C. elegans information is an increasingly unrealistic one. We do very well at presenting the C. elegans genome, genetic map and mutants, less well with the proteome and cellular anatomy, and very poorly when it comes to microarray data or transposon-mediated knockouts. The fact is that our expertise is strong in some areas but weak in others, and that the ability of the community to develop new types of information outstrips the WormBase curators' ability to classify and incorporate it.

In recognition of this, WormBase has made alliances with other data providers so that we can draw on each other's strengths. The oldest of our alliances is with WormPD, the database of the C. elegans proteome provided by Proteome, Inc. (now a division of Incyte). We have agreed upon a common nomenclature for the proteins and developed a simple calling scheme that allows WormBase to link to Proteome for protein information, and for Proteome to link back to WormBase for genetic mapping and genomic information. More recently, we have made similar arrangements with SwissProt for Gene Ontology and protein family information, and with EuGenes for orthologue clustering. We are using DAS to exchange information with TIGR, and are currently working on data exchange protocols with the C. elegans Microarray project at Stanford, the Orfeome project at Dana Farber, and the Transcriptome project at NCBI.

We realize that becoming a component in a network of data service providers allows us to play to our strengths. Ultimately both WormBase and the community benefit.

Challenges and Opportunities

Transitioning from a species-oriented to a service-oriented mission presents both challenges and opportunities for providers of biological data. As described earlier, this transition would allow a data source that has garnered expertise in, say, the storage and analysis of microarray data from Arabidopsis, to establish itself as a provider of microarray data services for a large number of plant species. However, the reemphasis would also benefit newcomers, who could now focus on setting up a discrete service rather than making the much more challenging leap to become a complete source for species-specific information.

The major challenge is integration and standardization.
It is very good for a data source to provide external hooks into its database, but the real benefits only kick in when several data sources settle on a standard interface. This allows the same software tools to be used across all data sources, and encourages new sources to create compliant interfaces, a phenomenon known as the "network effect."

However, it is notoriously difficult to develop standards among biological databases. There are several reasons for this. One is simply that standardization is hard. There are many potential technical approaches, and reasonable people will reasonably disagree. However, as the Internet sector has shown, it is possible to overcome these technical barriers by adopting well-tested standardization practices. I happen to favor the IETF (Internet Engineering Task Force) model, in which proposals for standardization are accompanied by reference implementations that can then be tested head-to-head, but other types of standardization processes work as well.

What the NSF Can Do

Service standardization won't occur unless there is a strong incentive to do so, and to date the funding practices of NSF and other agencies have discouraged evolution in this direction. By focusing efforts on species-specific databases, and by insisting that these projects become self-sufficient (i.e. profitable) after the initial development is finished, funding agencies encourage data providers to build proprietary, non-portable systems. The next time a database is needed, groups need to start again from scratch. This is a wasteful and inefficient practice.

A more parsimonious approach would be to take the long view and provide funding directly for the development of biological data service infrastructure. The deliverables for such projects would be portable, general-purpose software and standards that are made freely available to the academic community as well as industry. The projects should not be tied to a particular species, and should not be piggybacked on top of a database delivery project; when a data release deadline looms, the need to push the data out the door always wins out over the portability of the underlying software.

The flip side of funding for infrastructure development is funding for infrastructure operations. I feel strongly that biological data services are no different from the stock centers in their need for stable, reliable, long-term funding. In many fields, the data services have become indispensable to the practice of biological research. Let's figure out a way to ensure the ongoing growth and availability of this infrastructure.

Finally, it is important for the funding agencies to coordinate their efforts.
The NHGRI and NIGMS have recently announced their intention to jointly fund the development of a "model organism database toolkit." The USDA ARS has called together a working group to create a common set of services for agricultural sequencing projects. The DOE has organized workshops to develop standard XML formats for exchanging biological data. Working together with a consistent vision of the goal, the funding agencies can transform the bioinformatics landscape from a small number of insular database projects to a large number of open, interoperable data services, together forming the fabric of a new biological data infrastructure.
