Paper ID# SACBIO-129

HAVING A BLAST: ANALYZING GENE SEQUENCE DATA WITH BLASTQUEST - WHERE DO WE GO FROM HERE?

Abstract

In this paper, we pursue two main goals. First, we describe a new tool called BlastQuest for managing BLAST query results. BlastQuest provides interactive, Web-enabled query, analysis, and visualization facilities beyond what is possible with current BLAST interfaces. Specifically, the BLAST results, which are in XML format, are extracted, structured, and stored persistently in a relational database to support a series of built-in analysis operations that can be used to select, filter, and order data from multiple BLAST results efficiently and without referring to the original result files. In addition, users have the option to interact with the BLAST data through a mask-oriented, non-SQL query interface. Despite BlastQuest's recognized benefits for biologists, its functionality is limited in several important ways. The second goal of this paper is to analyze these shortcomings and describe a new concept based on two main pillars: (1) a Genomics Algebra, which provides an extensible set of high-level genomic data types (GDTs) together with a comprehensive collection of appropriate genomic functions, and (2) a Unifying Database, which allows us to integrate and manage the semi-structured contents of publicly available genomic repositories and to transfer these data into GDT values.

1. Introduction

Biologists are nowadays confronted with two main problems, namely the exponentially growing volume of biological data of high variety, heterogeneity, and semi-structured nature, and the increasing complexity of biological applications and methods afflicted with an inherent lack of biological knowledge. As a result, many very important challenges in biology and genomics are now challenges in computing, especially in advanced information management and algorithmic design.
The currently most widely used and accepted tool for conducting similarity searches on gene sequences is BLAST (Basic Local Alignment Search Tool) [1]. BLAST comprises a set of similarity search programs that employ heuristic algorithms and techniques to detect relationships between gene sequences and rank the computed hits statistically. An essential problem for the biologist is currently the processing and evaluation of BLAST query results, since a BLAST search yields its result exclusively in a textual format (e.g., ASCII, HTML, XML). This format has the benefit of being application-neutral but at the same time impedes direct analysis. In this paper, we describe a powerful new tool, called BlastQuest, for managing BLAST results stemming from multiple individual queries. This tool provides the biologist with interactive and Web-enabled query, analysis, and visualization facilities beyond what is possible with current BLAST interfaces. In particular, BLAST results from multiple queries are imported, structured, and stored in a relational database to support a series of built-in analysis operations that can be used to select, filter, group, and order these data efficiently and without referring to the original BLAST result files. In addition, users have the option to interact with the data through a user-tailored, screen-mask-oriented, non-SQL query interface that is based, at a deeper, hidden level, on a well-defined subset of SQL.

Section 2 elaborates on the current main challenges in genomics and emphasizes the need for tools capable of processing BLAST results. In Section 3, we describe our BlastQuest system from the system architecture and user interface perspectives. Section 4 describes desired improvements to BlastQuest and explains why new, sophisticated concepts, tools, and non-standard database technology, which together should lead us far beyond BLAST technology, are indispensable for advancing biological and genomic research. Finally, Section 5 draws some conclusions.

2. The Challenge of Genomics and Its Effect on Computer Science

Genomics is a biological discipline focused on understanding living organisms at the level of the whole genome. It goes beyond a gene-by-gene approach and instead takes a global view of the complete genetic system. Genomic scientists examine the full catalog of genes, the processes that control them, gene interrelationships and inter-dependencies, and how the organism responds to changes in environment through the expression of genetic information. In order to illustrate the challenges faced by scientists in this field, we first review the most important concepts underlying gene sequencing.

2.1 Gene Sequencing

DNA is an information storage macromolecule that encodes all of the heritable information passed from generation to generation of living organisms. In biological systems, genetic information flows from DNA (genes) to proteins, which are the molecules responsible for mediating or catalyzing biological processes.
In other words, inherited information is selectively converted into active biomolecules in response to changing environmental conditions or demands. The molecular information pathway from gene to protein goes through an intermediate class of molecules known as messenger RNA (mRNA). The synthesis of mRNA is known as transcription, and the conversion of mRNA into protein is a process known as translation. Both transcription and translation are important regulatory steps used to control which genetic information is expressed, and when and where protein molecules will be made by the cell. The constellation of mRNA molecules in a cell at any moment represents the expressed genome, which is also referred to as the transcriptome. Identifying all the genes present in the transcriptome effectively infers the proteins being utilized by the cell (also known as the proteome) and essentially defines the current biochemical process of the cell.

While characterizing the global cellular proteome would be most direct and informative, this is not possible using currently available technology. Instead, genomics scientists use high-throughput DNA sequencing to characterize the genome and the transcriptome. Genome sequencing involves determining the nucleotide sequence of extensive chromosomal regions or, in some cases, a complete nucleotide sequence of the whole genome. Characterization of the transcriptome, on the other hand, involves full or partial sequence characterization of mRNA molecules. Partial sequences of mRNA molecules are known as Expressed Sequence Tags (ESTs). While the process of DNA sequencing is routine, nucleotide sequences do not directly reveal their biological meaning or function. The possible biological function of a gene sequence must be determined either through direct empirical experimentation or, more often, by inferring gene function using nucleotide sequence homology searches of gene databases such as GenBank [5].

2.2 Gene Homology Searches

Gene homology searches most often use the BLAST algorithm [1]. The BLAST search engine takes a query nucleotide sequence and searches it against the database for entries matching the query. The BLAST algorithm calculates statistical scores (bit scores and e-values), making real sequence homology matches easier to distinguish from matches that might happen by chance.
Other information in the BLAST result includes a short text string summarizing the biological properties of the database match and several unique identification numbers, namely the GI Number (unique ID for GenBank records) and the Accession Number, which link the matched sequence back to the GenBank database and to additional information stored in the full database record. Each nucleotide query sequence submitted to the BLAST search engine returns anywhere from zero (no matching homologous sequence) to hundreds of matching database records. Results of BLAST searches are usually interpreted by reviewing the text output. However, large-scale genomics projects often generate tens of thousands of nucleotide sequences, and the prospect of manually manipulating, summarizing, and interpreting the thousands of BLAST output files is impractical at best. Scientists facing this informatics challenge may become discouraged or might overlook important information because they simply cannot find it. Clearly, methods or tools are needed to help manage the process of identifying and evaluating unknown nucleotide sequences and the sometimes overwhelming information obtained in large-scale nucleotide sequence homology searches.

2.3 BlastQuest as an Answer to Tool Requirements from a Biologist's Perspective

Genomics requires an information technology infrastructure on a scale previously unheard of and specifically adapted to the unique data collection and analysis demands of biomedical science. The BlastQuest system we describe demonstrates our current approach to management and visualization of genomics information. It is by no means a complete biological data management solution, but our first attempt to develop a prototype tool that can help us manage BLAST results through well-established relational database principles. We are using BlastQuest to test new functionalities and evaluate the strengths and limitations of relational databases as support tools for genomics research. Most important, we believe BlastQuest will lead us to a new integrating data model, language, and tool for processing and querying genomic information, enabling scientists to synthesize biological insights through transparent access to genomics information. We have more to say about these planned improvements in Section 4.

The BlastQuest Project began with several modest goals:

- A BLAST results viewing tool accessible to research groups at remote locations. Users should have access to their BLAST results from anywhere on the Web, including the ability to share results with colleagues in other locations.

- Selective browsing of BLAST homology search results. As a first step, biologists want a broad overview of the possible biological functions of the many gene sequences represented in their DNA sequence data. The ability to reduce and summarize BLAST data to only the most significant results is initially very informative.
- Search capability on a variety of criteria, such as text terms on biological properties or gene functions. As biological scientists identify their most interesting gene sequences, they need a way to focus on and retrieve only those search results related to the precise topic of interest.

- Selective data filtering on various BLAST statistical criteria such as e-value or bit score. These statistical parameters help discriminate between real sequence homology matches and matches that might happen by chance. There are no hard limits to the significance of these statistical parameters. The user will choose parameters giving either a more relaxed or a more restricted view as needed.

- Selective data grouping on criteria such as GI number or a defined number of top-scoring results. For example, viewing the three statistically best-scoring results for each query sequence is a convenient way to summarize and browse BLAST results for many query sequences. Grouping query sequences by GI number collects all of the query sequences having sequence homology matches with the same sequences from the database. Two or more query sequences sharing the same database homology match imply that the query sequences are related to each other and suggest that additional analysis of the relationship is warranted.

- Privacy-constrained sharing of results among the scientists. DNA sequence data is often proprietary and may constitute intellectual property. Such data should not be made public until properly protected.

- A convenient interface for getting queries into and BLAST results out of the system. The interface must be attractive and logically implemented so users will be able to find and use the tools the system provides.

We are unaware of an existing BLAST results management system incorporating all the goals stated above. To the best of our knowledge, the functionalities of WebBLAST 2.0 [3] and the Ontario Center for Genomic Computing OCGC BLAST [2] match many of our requirements but fall short in several important aspects. For example, there is no provision in WebBLAST for applying global filtering and grouping operations, nor a mechanism for searching all BLAST results on user-supplied text terms. The OCGC BLAST results manager appears closest to BlastQuest in functionality, allowing selected viewing and data filtering on up to five criteria. However, OCGC BLAST is not available to genomics scientists outside of the Province of Ontario, Canada.
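The filtering and grouping goals listed above can be illustrated with a small Python sketch. It is not BlastQuest code: the record layout (dictionaries with `query`, `gi`, `evalue`, and `bit_score` keys), the function names, and the sample values are assumptions made only for this illustration.

```python
from collections import defaultdict

def top_hits_per_query(hits, max_evalue, n):
    """Keep hits with e-value below max_evalue, then the n best per query."""
    groups = defaultdict(list)
    for hit in hits:
        if hit["evalue"] < max_evalue:
            groups[hit["query"]].append(hit)
    # A smaller e-value means a statistically better match.
    return {q: sorted(hs, key=lambda h: h["evalue"])[:n] for q, hs in groups.items()}

def queries_sharing_match(hits):
    """Group query names by GI number; keep GI numbers matched by 2+ queries."""
    by_gi = defaultdict(set)
    for hit in hits:
        by_gi[hit["gi"]].add(hit["query"])
    return {gi: sorted(qs) for gi, qs in by_gi.items() if len(qs) > 1}

# Hypothetical sample data; the GI numbers and scores are invented.
hits = [
    {"query": "q1", "gi": 101, "evalue": 1e-30, "bit_score": 120.0},
    {"query": "q1", "gi": 102, "evalue": 1e-10, "bit_score": 60.0},
    {"query": "q1", "gi": 103, "evalue": 0.5, "bit_score": 20.0},
    {"query": "q2", "gi": 101, "evalue": 1e-20, "bit_score": 90.0},
]

summary = top_hits_per_query(hits, max_evalue=0.05, n=3)
shared = queries_sharing_match(hits)
print(summary)
print(shared)
```

In this sample, `q1` and `q2` both match GI 101, which is the kind of shared database match that suggests the two query sequences are related.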
The BlastQuest Project is designed to meet our immediate specific requirements but, most important, to provide a platform we can freely modify to test our notions of a Genomics Algebra, an advanced query language for biological information.

3. The BlastQuest System

BlastQuest simplifies large-scale analysis in gene sequencing projects by providing scientists with a means to filter, summarize, sort, group, and search BLAST data. BlastQuest extracts gene data from XML files, which are returned as the result of homology searches from BLAST engines, and stores them in an underlying relational database. This allows the user to benefit from well-known relational concepts like transactions, controlled sharing, and query optimization. The most frequently used user operations are hard-wired in the user interface and accessible via command buttons. Their execution rests on SQL that is hidden from the user. To enable data analysis that is not directly supported by the built-in user interface operations, BlastQuest offers a more flexible, mask-oriented, and, in particular, non-SQL query interface, since biologists object to SQL due to its complexity and low level of abstraction (see Section 4). This interface essentially allows the user to construct complex Boolean expressions as selection conditions, which may include logical operators and substring search predicates. The underlying query execution is based on parameterized SQL queries, which are instantiated and automatically translated into executable SQL code by the DBMS.

Another interesting feature of BlastQuest is that it can be linked to SMART (Simple Modular Architecture Research Tool [6]) (see Section 3.1). The integration of BlastQuest output into SMART for querying is in direct response to the desire by scientists for new tools and interfaces capable of accessing and integrating external resources into one system. In Section 4, we describe our plans to develop Genomics Algebra query software that operates on a unifying database whose contents can include data from existing genomics repositories. Finally, BlastQuest makes it possible to manage BLAST data on a per-project or per-user basis using the security features of the underlying database, while at the same time allowing controlled sharing of this data in order to support collaboration.

3.1 Architectural Overview

Figure 1 depicts a conceptual overview of the 3-tiered BlastQuest system architecture.
Tier 1 contains the database backend, which is implemented using an instance of the MySQL^1 RDBMS. Since BlastQuest is mainly a proof-of-concept prototype rather than a production-strength system, our choice of DBMS was governed by availability of source code and platform compatibility rather than by performance and richness of features. The database backend stores and manages BLAST and PHRAP (Phragment Assembly Program) [4] results, which are represented as XML and ACE^2 (ArChivE) documents and whose structure has been mapped into the relations Hit, NoHit, and Assemble shown in Figure 2.

^1 See
^2 See for an example and documentation on the format.

[Figure 1 components: Tier 3: Web browser (client-side GUI). Tier 2: Web server with client interface module, XML loader (input: BLAST XML document), ACE loader (input: assembly ACE file), and SQL constructor, connected through JDBC to Tier 1. Tier 1: MySQL DBMS.]

Figure 1: Conceptual overview of the BlastQuest system architecture.

For each gene sequence that produced a match during the BLAST search, the relation Hit stores the XML file name where the original query sequence can be found as well as detailed hit information, such as hit definition, expect value, bit score, and so forth. The relation NoHit stores information about those sequences that have no database match by the homology search criteria. From a biological point of view, sequences with no homologous sequence match often lead to new genes and are analyzed in a different manner (outside of BlastQuest). In addition, the database also stores information about how related gene segments are assembled into single consensus DNA sequences by PHRAP, which is external to BlastQuest and invoked before the DNA sequence results are submitted to BLAST. PHRAP outputs its results in an ACE file, which is mapped into the relation called Assemble. By querying the Assemble relation with a specific consensus sequence name, one can retrieve all segments that are clustered into that consensus sequence.
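As a rough illustration of the relations just described, the following Python sketch creates the Project, Hit, NoHit, and Assemble relations and retrieves all reads clustered into one consensus sequence. It is a sketch under stated assumptions: the in-memory SQLite database stands in for the MySQL backend, the column names are those listed in Figure 2, and the column types, the sample rows, and the helper name `reads_in_contig` are invented for this example. Columns that Figure 2 elides are not guessed.

```python
import sqlite3

# Column names follow Figure 2; the types are assumptions.
SCHEMA = """
CREATE TABLE Project  (PID INTEGER PRIMARY KEY, ProjectName TEXT, Path TEXT);
CREATE TABLE Hit      (Hit_ID INTEGER PRIMARY KEY,
                       PID INTEGER REFERENCES Project(PID),
                       File_Name TEXT, Hit_Def TEXT, Evalue REAL,
                       Bit_Score REAL, HSP_query_seq TEXT);
CREATE TABLE NoHit    (Hit_ID INTEGER PRIMARY KEY,
                       PID INTEGER REFERENCES Project(PID),
                       FileName TEXT);
CREATE TABLE Assemble (AID INTEGER PRIMARY KEY,
                       PID INTEGER REFERENCES Project(PID),
                       ContigName TEXT, ReadFile TEXT);
"""

def reads_in_contig(conn, pid, contig_name):
    """Return the read files assembled into one consensus sequence."""
    rows = conn.execute(
        "SELECT ReadFile FROM Assemble "
        "WHERE PID = ? AND ContigName = ? ORDER BY ReadFile",
        (pid, contig_name),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Hypothetical sample rows.
conn.execute("INSERT INTO Project VALUES (1, 'demo', 'demo/')")
conn.executemany(
    "INSERT INTO Assemble (PID, ContigName, ReadFile) VALUES (?, ?, ?)",
    [(1, "Contig1", "read_a.seq"),
     (1, "Contig1", "read_b.seq"),
     (1, "Contig2", "read_c.seq")],
)
contig1_reads = reads_in_contig(conn, 1, "Contig1")
print(contig1_reads)
```

Restricting every query by PID, as done here, mirrors the project-based organization of the sequence data.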

User (UID, First, Last, Password, ...)
Project (PID, ProjectName, Path, ...)
UserProj (UID, PID)
Hit (Hit_ID, PID, File_Name, Hit_Def, Evalue, Bit_Score, HSP_query_seq)
NoHit (Hit_ID, PID, FileName, ...)
Assemble (AID, PID, ContigName, ReadFile, ...)

Figure 2: Relational schema of the BlastQuest database.

The database also maintains information about users and their corresponding gene sequencing projects, which is stored in the three remaining relations, User, Project, and UserProj. The relation UserProj represents the many-to-many relationship between scientists and the projects to which they belong. Since all sequence data is organized by project (using the PID foreign key in each of the relations Hit, NoHit, and Assemble), BlastQuest provides control over who has access to which data.

Tier 2 contains the multi-threaded BlastQuest application program, which is divided into four modules: the client interface module, which handles communication with the Web clients in tier 3; the two loader modules for extracting and loading data from the XML and ACE input files into the database; and the SQL constructor for assembling the queries and record sets to be sent to the database. The client interface module is implemented as a series of JavaServer Pages (JSPs) that execute inside a Tomcat server. The remaining three modules are implemented as Java classes.

The XML loader parses each BLAST result file into a Document Object Model (DOM) representation using the Xerces Java Parser. The XML loader then extracts the relevant data items needed to populate the Hit and NoHit tables. Specifically, the loader module contains two classes whose structures correspond to the Hit and NoHit tables in the database schema. When the loader collects data from an XML file, it populates the appropriate class objects with the extracted values. At the end, the objects are passed to the SQL manager, which creates the SQL commands to insert the values into the relational database. The ACE loader works in a similar fashion.
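The extraction step of the XML loader can be sketched as follows. This is an illustration, not the authors' Java implementation: it uses Python's `xml.etree.ElementTree` in place of the Xerces DOM parser, the element names (`Iteration`, `Hit_def`, `Hsp_evalue`, and so on) follow the NCBI BLAST XML output format as we understand it and should be treated as assumptions, and the sample document, the `Query` key, and the function name are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened BLAST XML result.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>reverse transcriptase [example organism]</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>52.8</Hsp_bit-score>
              <Hsp_evalue>1e-05</Hsp_evalue>
              <Hsp_qseq>ATGGCC</Hsp_qseq>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query_2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

def extract_records(xml_text, file_name):
    """Split one BLAST XML result into Hit-like and NoHit-like records."""
    root = ET.fromstring(xml_text)
    hits, no_hits = [], []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        found = False
        for hit in iteration.iter("Hit"):
            hsp = hit.find("Hit_hsps/Hsp")  # first HSP only, for brevity
            if hsp is None:
                continue
            found = True
            hits.append({
                "File_Name": file_name,
                "Query": query,  # hypothetical extra field
                "Hit_Def": hit.findtext("Hit_def"),
                "Evalue": float(hsp.findtext("Hsp_evalue")),
                "Bit_Score": float(hsp.findtext("Hsp_bit-score")),
                "HSP_query_seq": hsp.findtext("Hsp_qseq"),
            })
        if not found:
            no_hits.append({"FileName": file_name, "Query": query})
    return hits, no_hits

hits, no_hits = extract_records(SAMPLE, "sample.xml")
print(hits)
print(no_hits)
```

The dictionaries play the role of the loader's Hit and NoHit class objects, which would then be handed on for insertion into the database. The ACE loader follows the same pattern of filling record objects before insertion.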
However, since there was no standard ACE parser available, we created our own. Our event-based parser detects the presence of certain keywords in the ACE input file and extracts the information associated with each keyword. It is important to note that other, more efficient loading options are possible, for example by using the bulk loading utilities of the DBMS. However, by making our loader modules part of the Web-based middleware, users can load BLAST results into their BlastQuest accounts from anywhere on the Web as long as they have access to a Web browser.

The SQL manager module is the gateway between the database (via the JDBC driver) and the middleware. In addition to creating the SQL load commands, it translates commands from the user interface into SQL queries, which can be executed by the DBMS. Analogously, it processes the resulting record sets and creates the Java objects that are used by the client interface to generate the Web pages.

Tier 3 is a (thin) client interface, which is implemented as dynamic Web pages displayed inside a Web browser. Client-side processing is limited to validating user input, submitting requests to the BlastQuest application, and displaying HTML results.

3.2 Sample BlastQuest Session

A sample data analysis session illustrates some main features of BlastQuest. A page (not shown) of the Web-browser component in BlastQuest facilitates the extraction of gene data from original, external BLAST files into a MySQL database. Due to the large volume of data, simple page-by-page viewing is not helpful to the user; instead, selection mechanisms are needed to find the data of interest. The overall user interface strategy is to apply a sequence of consecutive operations to the data in order to gradually approach the data of interest. In the following, we describe the main user interface features for doing this.

The first feature is to let BlastQuest create a summary page for selected sequence segments.
For each query DNA sequence, only the sequence database match with the best statistical score calculated by BLAST is displayed, together with a summary of important biological information, usually text terms describing a gene or protein name and sometimes possible biological functions. The summary page also contains, for each matching sequence, the GenBank sequence ID, gene definition, and expect value.

The second feature is user-controlled selection. Unfortunately, the statistically calculated ranking of matching sequences provided by BLAST does not necessarily correspond to the biological knowledge and experience of the user. Users may apply their biological knowledge or insight to tag a different result as better for expressing the possible function of the query sequence. By manually selecting a specific query result, the user can get additional information such as the percentage of identity or the alignment of the query sequence and the matching sequence. Even a detailed display of sequence alignments is available, which is identical to the free-text formatted BLAST result to which most BLAST users are accustomed.

The third feature is related to built-in selection facilities, which can be activated by a mouse click and operate on all query sequences and their query results. Examples are displaying only hits with expect values less than a particular threshold selected from a pull-down menu (Figure 3), or restricting the display to the best n database matches for each query sequence. Together, the filtering facilities give researchers the ability to adjust their analysis process to the particular research focus, project status, and prior knowledge of query sequences, to reduce the original BLAST result to a manageable size, and especially to remove results of low quality.

Figure 3: User-defined query construction tool.

The fourth feature comprises ordering and grouping functions. These help the user to discover relationships among genes or expression patterns. For example, if the user asks for grouping on GI number or query sequence, related sequences and their BLAST results are grouped together rather than appearing randomly or out of context. This is also a proven method to identify EST sequences that come from different regions of the same mRNA, gene orthologs, or gene paralogs^3.

The fifth feature enables user-defined, mask-oriented, non-SQL queries. This feature addresses the problem that the built-in functionality of BlastQuest is sometimes insufficient for specific analysis tasks. BlastQuest provides a special Web page that allows the user to click on particular buttons and to manually insert text, and in this way to interactively construct complex Boolean filter expressions, which may include logical operators like AND and OR as well as substring search predicates like Contains or Not Contains (Figure 3). A search field (like Hit Definition in our example) to which the Boolean expression is applied can be selected from a drop-down menu. Figure 3 shows two textual representations of the same Boolean expression under construction. The second representation expresses the condition in a way nearer to natural language. The first representation is a test mode translating the natural language condition into SQL. In a later version, the SQL test mode will disappear. The construction of the Boolean expression, and hence of the query, is completed by clicking the Commit button. BlastQuest assembles the SQL query, sends it to the MySQL driver, receives the results, and displays them. In the example in Figure 3, the user is specifying a query that focuses on matches that contain the word reverse but not hypothetical.

The sixth and last main feature to be mentioned is interoperability between BlastQuest and other biological information systems. Creating links to other systems in order to make use of their specific functionality is becoming more and more important for the biologist. In the context of BlastQuest, after having examined the query sequences and their probable identities, we wish to derive the protein sequences encoded by the nucleotide sequence.
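As an aside on the fifth feature, the translation of a mask-built Boolean expression into a parameterized SQL query might look roughly like the Python sketch below. It is a sketch under assumptions, not the BlastQuest SQL constructor: the in-memory SQLite table stands in for the MySQL Hit relation, and the mapping tables, the function name `build_filter`, and the sample rows are hypothetical. Only whitelisted field labels and operators reach the SQL text; the user's search terms are passed as parameters.

```python
import sqlite3

# Hypothetical mappings from mask labels to SQL fragments.
SEARCH_FIELDS = {"Hit Definition": "Hit_Def"}
PREDICATES = {"Contains": "LIKE", "Not Contains": "NOT LIKE"}
CONNECTIVES = {"AND", "OR"}

def build_filter(field_label, expression):
    """Translate e.g. [("Contains", "a"), "AND", ("Not Contains", "b")]."""
    if len(expression) % 2 == 0:
        raise ValueError("expression must alternate predicate, connective, predicate")
    column = SEARCH_FIELDS[field_label]
    parts, params = [], []
    for position, item in enumerate(expression):
        if position % 2 == 0:  # a substring predicate
            predicate, text = item
            parts.append(f"{column} {PREDICATES[predicate]} ?")
            params.append(f"%{text}%")
        else:  # a logical connective between predicates
            if item not in CONNECTIVES:
                raise ValueError(f"unknown connective: {item}")
            parts.append(item)
    sql = "SELECT Hit_Def FROM Hit WHERE " + " ".join(parts) + " ORDER BY Hit_ID"
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hit (Hit_ID INTEGER PRIMARY KEY, Hit_Def TEXT)")
conn.executemany(
    "INSERT INTO Hit (Hit_Def) VALUES (?)",
    [("reverse transcriptase",), ("hypothetical reverse gyrase",), ("protein kinase",)],
)

# The query from Figure 3: contains "reverse" but not "hypothetical".
sql, params = build_filter(
    "Hit Definition",
    [("Contains", "reverse"), "AND", ("Not Contains", "hypothetical")],
)
matches = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(matches)
```

Because no parentheses are generated, mixed AND/OR expressions follow SQL's default precedence (AND binds tighter than OR); a fuller version would need explicit grouping. The discussion of the sixth feature continues below.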
Rather than translate the nucleotide sequence directly, BlastQuest takes the best match, which represents a homologous gene closely related to the unknown query sequence, and retrieves the corresponding protein sequence as translated by BLAST. After grouping search results by query sequence (e.g., the best five statistical matches), the user is presented with the screen shown in the top half of Figure 4. Next, the user checks the amino conversion box at the top right of the screen and the check box adjacent to the query sequence they wish to translate into an amino acid sequence. When the user clicks the Details button, the Sequence Analysis screen shown in the bottom half of Figure 4 appears. The user may submit the derived protein sequence to the SMART protein analysis Web site by simply clicking on the amino acid sequence. Results of the SMART analysis will appear in the browser window.

^3 Gene orthologs are genes that are derived by divergent evolution, such as the α-hemoglobin gene from human and from mouse. Gene paralogs are genes that are duplications, such as α-hemoglobin and β-hemoglobin.

Figure 4: Filtering and grouping BLAST results on a project basis.

All described operations can be combined to analyze data generated in a project. For example, the user may ask BlastQuest to retrieve hits with an expect value lower than 0.05, followed by grouping on gene ID, and to display only the top five matching hits per GI number. The screen snapshot in Figure 4 shows this result.

4. Evaluation and Planned Improvements

The BlastQuest system described above has been used successfully by scientists in a gene-sequencing lab at a university for over six months, and the feedback from users has been positive. However, we also received important feedback regarding the limitations of the current system. For example, there is a desire for additional, more sophisticated analysis functionality, the ability to integrate data from external repositories, etc. As a starting point for the development of a more sophisticated management system for genomics data, we have identified all of the biological needs that are currently not supported in BlastQuest^4. In the interest of space, we provide an overview of the most important ones:

1. The ability to query, search, and analyze data from external genomics repositories (in addition to those accessible through BLAST). An extension of this is the ability to integrate related results from multiple repositories in a meaningful manner, for example, to fill in missing values or correct inconsistencies that exist across different repositories.

2. A representation of the genomics data that is semantically richer than the current textual representation provided by BLAST and most other repositories. For example, BLAST query results are more or less collections of textual strings and numerical values and are not expressed in biological terms such as genes, proteins, and nucleotide sequences. As a result, BLAST and BlastQuest operations are limited to basic string manipulation (e.g., shortest common substring) rather than high-level, gene-specific operations such as transcribe, translate, etc.

3. Integration of new specialty evaluation functions. Being able to evaluate data from BLAST results as well as self-generated data only with publicly available methods is insufficient. Thus, it must be possible to create, use, and integrate user-defined functions that are capable of operating on both kinds of data into the analysis interface of the tool. However, this requires an extensible database management system, query language, and user interface, which are currently not part of BlastQuest.

4. The ability to create and store new knowledge. Biologists generate new biological data from their own research or experimental work, for example, by analyzing BLAST results. Hence, scientists have expressed a strong desire to store and manage this newly created knowledge together with the source data.
For example, there is a need to annotate data in BLAST results and to store the annotations persistently so that they can be re-used (e.g., by linking a record in a new BLAST result to an existing annotation in the repository).

5. Support for controlled collaboration among multiple scientists. It is of great value for scientists to share some of their findings in a controlled manner with colleagues. For example, it should be possible for the users of a genomics repository system to grant write access to some of the annotations but read-only or no access to others.

6. The ability to connect DNA sequence identities inferred from BLAST results with gene-associated biological functions described through the efforts of the Gene Ontology (GO) Consortium [7]. This type of cross-referencing is the best way to describe the functionality of a newly discovered gene. This functionality will help biologists to annotate and catalog the genes by universally accepted GO IDs and hence help them to discover new genes.

^4 In fact, based on a survey of the related literature, we have found that most of the existing integration and management systems for genomics data, such as K2/KLEISLI, Tambis, and SRS, only support some of the functionality described in this list.

Based on this list, which illustrates the complexity of the information-related challenges that confront biologists and computer scientists, we decided to redesign our current system from the ground up. For example, providing users with a semantically rich representation of the genomics data as well as support for specialty functions (needs 2 and 3 above) requires the design of a new data type system and operations, which must be integrated with the underlying database management system for efficient query processing and persistence. As another example, access to multiple genomics repositories (need 1) requires the ability to extract, translate, and reconcile heterogeneous data from multiple sources and to store the integrated result using a global schema, which has been constructed either from the local schemas of the sources or based on general knowledge of the domain.

In response to our requirements analysis, we are developing a new genomics integration and management system that is based on two fundamental pillars:

(1) A Genomics Algebra software system to provide an extensible set of high-level genomic data types (GDTs) (e.g., genome, gene, chromosome, protein, nucleotide) together with a comprehensive collection of appropriate genomic functions (e.g., translate, transcribe, decode).
(2) A Unifying Database, which allows us to manage the semi-structured or, ideally, structured contents of publicly available genomic repositories and to transfer these data into GDT values. These values then serve as arguments of Genomics Algebra operations, which can be embedded into a DBMS query language.

We believe our new approach will cause a fundamental change in the way biologists analyze genomic data. No longer will biologists be forced to interact with hundreds of independent data repositories, each with its own interface. Instead, biologists will work with a unified database through a single user interface specifically designed for biologists. Our high-level Genomics Algebra will allow biologists to pose questions using biological terms, not SQL statements. Managing user data will also become much simpler for biologists, since their data can also be stored in the Unifying Database and they will no longer have to prepare a custom database for each data collection. Biologists should, and indeed want to, invest their time in being biologists, not computer scientists.

From a computer science perspective, our project leverages and extends the benefits and possibilities of current database technology. In particular, we demonstrate the elegance and expressive power of modeling non-standard and extremely complex data and integrating them into databases and query languages through the concept of abstract data types. In addition, our approach is independent of a specific underlying DBMS data model. That is, the Genomics Algebra can be embedded in a relational, object-relational, or object-oriented DBMS as long as it is equipped with the appropriate extensibility mechanisms. Moreover, we believe we will gain valuable knowledge about the design and implementation of new, sophisticated data structures and efficient algorithms in the non-standard application field of biology and bioinformatics.

5 Conclusion

In this paper we have described BlastQuest, a Web-based and interactive tool for importing and persistently storing genomic data from multiple BLAST queries in a relational database, applying DBMS functionality for processing and querying these data, and visualizing them appropriately. Limitations of the underlying concept, which will inevitably be reached even with some meaningful improvements, require new concepts and advanced tools. The Genomics Algebra briefly sketched at the end is a promising approach in this direction.

References

[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, Basic local alignment search tool, Journal of Molecular Biology, 215:2, pp ,
[2] J. Cuticchia, S. Parameswaran, R. Alexandrova, and E. Crowdy, OCGC Blast, Web site,
[3] E. S. Ferlanti, J. F. Ryan, I. Makalowska, and A. D. Baxevanis, WebBLAST 2.0: an integrated solution for organizing and analyzing sequence data, Bioinformatics, 15:5, pp ,
[4] P. Green, PHRAP sequence-assembly program, Web site,
[5] National Center for Biotechnology Information (NCBI), GenBank, Web site,
[6] J. Schultz, R. R. Copley, T. Doerks, C. P. Ponting, and P. Bork, SMART: A Web-based tool for the study of genetically mobile domains, Nucleic Acids Research, 28:1, pp ,
[7] The Gene Ontology Consortium, Gene ontology: tool for the unification of biology, Nature Genetics, 25:1, pp ,
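
Illustrative sketch of the Genomics Algebra idea. The paper describes the Genomics Algebra only at the conceptual level: genomic data types (GDTs) such as gene and protein, plus functions such as transcribe and translate. It gives no signatures or implementation. The following Python sketch is therefore an assumption-laden illustration, not the authors' design. The class names `Gene`, `MRNA`, and `Protein`, their fields, and the function signatures are all hypothetical. The sketch also simplifies the biology: it treats the stored sequence as the coding strand, ignores splicing, and starts translation at the first AUG.

```python
from dataclasses import dataclass
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered U, C, A, G.
_BASES = "UCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


@dataclass(frozen=True)
class Gene:
    """Hypothetical GDT: coding (sense) strand of a gene, written 5' to 3'."""
    name: str
    sequence: str

    def __post_init__(self):
        # Reject anything that is not an unambiguous DNA base.
        if set(self.sequence) - set("ACGT"):
            raise ValueError("Gene sequence must contain only A, C, G, T")


@dataclass(frozen=True)
class MRNA:
    """Hypothetical GDT: messenger RNA sequence."""
    sequence: str


@dataclass(frozen=True)
class Protein:
    """Hypothetical GDT: amino-acid sequence in one-letter codes."""
    sequence: str


def transcribe(gene: Gene) -> MRNA:
    """Map a gene to its mRNA (simplified: replace T with U, no splicing)."""
    return MRNA(gene.sequence.replace("T", "U"))


def translate(mrna: MRNA) -> Protein:
    """Translate from the first AUG up to the first stop codon."""
    start = mrna.sequence.find("AUG")
    if start < 0:
        return Protein("")
    residues = []
    for i in range(start, len(mrna.sequence) - 2, 3):
        amino_acid = CODON_TABLE[mrna.sequence[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the protein
            break
        residues.append(amino_acid)
    return Protein("".join(residues))


# Operations compose like algebra terms: gene -> mRNA -> protein.
demo = Gene("demo", "ATGGCCTAA")
print(translate(transcribe(demo)).sequence)  # prints "MA"
```

Because each function takes and returns a typed value, expressions such as `translate(transcribe(g))` compose in the way the paper envisions for operations embedded in a DBMS query language. How such types would actually be registered with a particular DBMS is not specified in the source and is not shown here.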


More information

TITLE OF COURSE SYLLABUS, SEMESTER, YEAR

TITLE OF COURSE SYLLABUS, SEMESTER, YEAR TITLE OF COURSE SYLLABUS, SEMESTER, YEAR Instructor Contact Information Jennifer Weller Jweller2@uncc.edu Office Hours Time/Location of Course Mon 9-11am MW 8-9:15am, BINF 105 Textbooks Needed: none required,

More information

When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame

When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame 1 When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from

More information

W3P: A Portable Presentation System for the World-Wide Web

W3P: A Portable Presentation System for the World-Wide Web W3P: A Portable Presentation System for the World-Wide Web Christopher R. Vincent Intelligent Information Infrastructure Project MIT Artificial Intelligence Laboratory cvince@ai.mit.edu http://web.mit.edu/cvince/

More information

Chapter 4 Research Prototype

Chapter 4 Research Prototype Chapter 4 Research Prototype According to the research method described in Chapter 3, a schema and ontology-assisted heterogeneous information integration prototype system is implemented. This system shows

More information

Database Searching Using BLAST

Database Searching Using BLAST Mahidol University Objectives SCMI512 Molecular Sequence Analysis Database Searching Using BLAST Lecture 2B After class, students should be able to: explain the FASTA algorithm for database searching explain

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

Supporting Bioinformatic Experiments with A Service Query Engine

Supporting Bioinformatic Experiments with A Service Query Engine Supporting Bioinformatic Experiments with A Service Query Engine Xuan Zhou Shiping Chen Athman Bouguettaya Kai Xu CSIRO ICT Centre, Australia {xuan.zhou,shiping.chen,athman.bouguettaya,kai.xu}@csiro.au

More information

FusionDB: Conflict Management System for Small-Science Databases

FusionDB: Conflict Management System for Small-Science Databases Project Number: MYE005 FusionDB: Conflict Management System for Small-Science Databases A Major Qualifying Project submitted to the faculty of Worcester Polytechnic Institute in partial fulfillment of

More information

ChIP-Seq Tutorial on Galaxy

ChIP-Seq Tutorial on Galaxy 1 Introduction ChIP-Seq Tutorial on Galaxy 2 December 2010 (modified April 6, 2017) Rory Stark The aim of this practical is to give you some experience handling ChIP-Seq data. We will be working with data

More information

Core Engine. R XML Specification. Version 5, February Applicable for Core Engine 1.5. Author: cappatec OG, Salzburg/Austria

Core Engine. R XML Specification. Version 5, February Applicable for Core Engine 1.5. Author: cappatec OG, Salzburg/Austria Core Engine R XML Specification Version 5, February 2016 Applicable for Core Engine 1.5 Author: cappatec OG, Salzburg/Austria Table of Contents Cappatec Core Engine XML Interface... 4 Introduction... 4

More information

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model

A Study of Future Internet Applications based on Semantic Web Technology Configuration Model Indian Journal of Science and Technology, Vol 8(20), DOI:10.17485/ijst/2015/v8i20/79311, August 2015 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 A Study of Future Internet Applications based on

More information

WebSphere 4.0 General Introduction

WebSphere 4.0 General Introduction IBM WebSphere Application Server V4.0 WebSphere 4.0 General Introduction Page 8 of 401 Page 1 of 11 Agenda Market Themes J2EE and Open Standards Evolution of WebSphere Application Server WebSphere 4.0

More information

3DProIN: Protein-Protein Interaction Networks and Structure Visualization

3DProIN: Protein-Protein Interaction Networks and Structure Visualization Columbia International Publishing American Journal of Bioinformatics and Computational Biology doi:10.7726/ajbcb.2014.1003 Research Article 3DProIN: Protein-Protein Interaction Networks and Structure Visualization

More information

Appendix A - Glossary(of OO software term s)

Appendix A - Glossary(of OO software term s) Appendix A - Glossary(of OO software term s) Abstract Class A class that does not supply an implementation for its entire interface, and so consequently, cannot be instantiated. ActiveX Microsoft s component

More information