The Galaxy Track Browser: Transforming the Genome Browser from Visualization Tool to Analysis Tool


Jeremy Goecks *, Kanwei Li Ω, Dave Clements ℵ, The Galaxy Team, James Taylor ℇ
Emory University

ABSTRACT

The proliferation of next-generation sequencing (NGS) technologies and analysis tools presents new challenges to genome browsers. These challenges include supporting very large datasets, integrating analysis tools with data visualization to help reason about and improve analyses, and sharing or publishing fully interactive visualizations. The Galaxy Track Browser (GTB) is a Web-based genome browser integrated into the Galaxy platform that addresses these challenges. GTB is the first Web-based genome browser to provide a full multi-resolution data model; this model supports efficient data retrieval from very large datasets. GTB leverages the Galaxy platform to combine data visualization and data analysis; users can specify parameter values and run tools to produce new data, all within GTB. GTB also provides interactive filters that dynamically show and hide data and can be used to identify data for further investigation. GTB is available on every Galaxy server, and visualizations can be created for both standard and custom genome builds. Fully interactive GTB visualizations can be shared with colleagues and published on the Web using a simple graphical user interface.

KEYWORDS: Galaxy Track Browser, genomics, genome browser, visual analytics.

INDEX TERMS: H5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous; J.3 Life and Medical Sciences: biology and genetics.

1 INTRODUCTION

Genomics, the study of DNA and related molecules, their functions, and their impact on health, is a growing biomedical science that is highly reliant on computational tools and methods.
Visualizations play an important role in genomics by helping scientists understand the numerical and textual data that many genomics tools produce. Using visualizations, scientists can find problems and debug an analysis, identify results for subsequent investigation, and communicate findings, both to colleagues and in publications. Genome browsers are a powerful visualization tool that enables scientists to map numerical and textual data onto a visual representation of the genome [1, 2]. Genome browsers have a long history and remain an active area of research and development.

* jeremy.goecks@emory.edu; Ω kanwei@gmail.com; ℵ clements@galaxyproject.org; outreach@galaxyproject.org; ℇ james.taylor@emory.edu
IEEE Symposium on Biological Data Visualization, October 23-24, Providence, Rhode Island, USA

Early genome browsers such as the UCSC genome browser [3], Ensembl [4], and GBrowse [5] are regularly updated and extended. Recently, new genome browsers, such as IGB [6], IGV [7], JBrowse [8], and Savant [9], have been developed and demonstrate new models and techniques for genome browsing. Recent activity in genome browser development is motivated in large part by the adoption of next-generation sequencing (NGS) technologies [10] and analysis tools [11] in genomic experiments. NGS experiments produce very large datasets, use complex analyses, and require significant collaboration to complete. These demands have driven recent genome browser research.

Despite much progress, challenges remain that limit the usefulness of genome browsers for NGS data. Current desktop browsers use full multi-resolution data models for viewing and customizing data, but Web-based browsers lack this needed functionality. NGS analyses are often long and complex, but it is difficult to use genome browsers to reason about and improve an analysis because analysis tools and genome browsers are not integrated.
Finally, sharing and publishing fully interactive visualizations is difficult in current browsers, limiting the usefulness of visualizations for collaboration.

The Galaxy Track Browser (GTB) is a Web-based genome browser integrated into the Galaxy platform [12-14] that addresses these challenges. GTB utilizes a full multi-resolution, client-server data model to support gigabyte-sized datasets. Each dataset is a track in the browser, and all tracks are fully interactive and customizable. GTB leverages the Galaxy platform to combine data visualization and data analysis. Using GTB, users can set parameter values and run tools to produce and visualize new data. Dynamic filters can be used to interactively explore data and find data that meets particular criteria. GTB can be used to visualize genomic data for both standard and custom genome builds. Fully interactive GTB visualizations can be shared with particular individuals or published on the Web through a simple graphical user interface (GUI). While GTB leverages many of Galaxy's features, it is modular and can be configured to work outside of Galaxy and with different data providers.

This paper's contributions include: (1) a discussion of three challenges that NGS tools and data pose to current genome browsers; (2) a description of the Galaxy Track Browser and how it addresses these challenges; (3) a comparison between the Galaxy Track Browser and other recent genome browsers.

2 THREE CURRENT CHALLENGES FOR GENOME BROWSERS

Genome browsers are an important tool for helping scientists visualize and understand their data [1, 2]. The proliferation of next-generation sequencing (NGS) technologies [10] and analysis tools [11] presents new challenges to genome browsers. In this section, we discuss three challenges that genome browsers must address to support scientists doing NGS experiments and analyses and how current genome browsers address these challenges.

2.1 Challenge: Supporting Very Large Datasets in a Web-based Browser

NGS technologies and analysis pipelines often produce data that is tens or hundreds of gigabytes in size or larger. For instance, consider a gene expression experiment using RNA-seq [15]. Currently, an RNA-seq dataset obtained from a single lane of an Illumina HiSeq 2000 system is approximately ten to twenty gigabytes in size. Following a simple pipeline to map reads, assemble transcripts, and perform differential expression testing amongst different RNA-seq samples will often generate many datasets that are 500 to 1000 megabytes in size. A genome browser for NGS experiments, then, must efficiently scale to render large datasets and enable users to customize data display to meet their needs.

Scaling to support visualization of gigabyte-sized datasets requires a multi-resolution data model that provides access to data at multiple levels of resolution. To display large genomic regions with many data elements, aggregation is done ahead of time and the aggregated data are used for rendering. To display individual features in a small region, indices such as bigwig [16] and tabix [17] are used to quickly find the region's data in a large file. A multi-resolution data model provides efficient access to a dataset but does not influence how a dataset is rendered.

Genome browsers that support a complete multi-resolution data model include the desktop-based browsers IGV [7] and Savant [9]. Desktop-based genome browsers are convenient when there is no access to the Web or when datasets are stored locally and uploading them is undesirable. Web-based browsers complement desktop-based browsers by providing access to data and visualizations over the Web via a standard Web browser and do not require users to download or install any software. No Web-based browser, however, implements a complete multi-resolution model.
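The two access paths of a multi-resolution data model described above (precomputed aggregates for wide views, indexed lookup for narrow ones) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class name, the `bin_size` and `max_features` parameters, and the in-memory sorted list that stands in for an on-disk index such as tabix or bigwig. It is not the data model of any particular browser.

```python
from bisect import bisect_left
from collections import Counter


class MultiResolutionTrack:
    """Toy model: precomputed bins for wide views, sorted lookup for narrow ones."""

    def __init__(self, starts, bin_size=1000):
        # Sorted start positions stand in for an on-disk index (hypothetical).
        self.starts = sorted(starts)
        self.bin_size = bin_size
        # Aggregation is done ahead of time: count features per fixed-width bin.
        self.bins = Counter(s // bin_size for s in self.starts)

    def query(self, start, end, max_features=5):
        """Return ('features', [...]) for sparse regions, ('summary', [...]) otherwise.

        The region is half-open: start <= position < end.
        """
        lo = bisect_left(self.starts, start)
        hi = bisect_left(self.starts, end)
        if hi - lo <= max_features:
            # Few elements: return the individual features found via the index.
            return "features", self.starts[lo:hi]
        # Many elements: return precomputed (bin_start, count) pairs instead.
        first, last = start // self.bin_size, (end - 1) // self.bin_size
        return "summary", [(b * self.bin_size, self.bins.get(b, 0))
                           for b in range(first, last + 1)]
```

With nine features and `bin_size=1000`, a query over `[0, 100)` returns the two individual features, while a query over `[0, 3000)` returns three bin counts. Note that the summary covers whole bins, so bins at the edges of a region may include features just outside it; a real implementation would have to decide how to treat those edges.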
Anno-J [18] supports a multi-resolution model but does not implement one, and JBrowse [8] uses a partial multi-resolution data model. Hence, there is a need for a Web-based browser that supports a full multi-resolution data model.

2.2 Challenge: Integrating Data Visualization and Data Analysis Tools

NGS analysis tools and pipelines are often complex and highly parameter-dependent, and it can be difficult to determine how changes in parameter values will impact the output of a tool or pipeline. Using genome browsers to debug or tune parameter values can be useful because it is often straightforward to visually confirm that a tool's output is acceptable or needs improvement. However, moving between running analyses and visualizing output can be a tedious process because genome browsers act as endpoints for an analysis pipeline: a series of tools are applied to produce outputs, which are then visualized in a browser.

Genome browsers would be more useful if they were integrated with analysis tools. For example, a browser might enable a user to change a tool's parameter value and then, through the browser, rerun the tool and observe how the change impacts the tool's output. If a user could repeatedly change tool parameter values and visualize the corresponding output, she could tune a tool's parameter values to produce the desired output. Similarly, others have argued that more on-the-fly computation is needed in browsers [2]. Integrating analyses with visualization is a foundation of visual analytics; visual analytics is the science of using interactive visualizations to support analytic reasoning [19].

There are examples of limited visual analytics functionality in current browsers. In the UCSC genome browser [3], a user can run a BLAT [20] query and then visualize the results as a track in the current browser. Savant provides a plug-in framework that developers can use to prototype and run analysis tools.
Savant's framework is limited to tools specifically written for Savant because there is no method for incorporating existing tools as plug-ins. Principles from visual analytics have been applied to develop Hawkeye, a visualization for performing genome assemblies that uses dynamic filtering and automated clustering to help users identify problematic areas of an assembly [21]. These examples illustrate the value of integrating analysis tools into genomic visualizations. The challenge, then, is developing a genome browser that integrates with an analysis framework so that users can run and rerun many different tools to produce novel output, all while in the browser.

2.3 Challenge: Sharing and Publishing Interactive Custom Visualizations

At the leading edge of biomedical research are large, collaborative experiments that employ NGS technologies and analysis tools to explore complex biomedical questions. Custom genome browser visualizations (visualizations of experimental data, often coupled with public data) are useful for collaboration and communication because they can efficiently convey information.

Despite the value of custom visualizations, they are currently quite difficult to share. The UCSC genome browser supports sharing custom tracks via URL, but there are limitations, including the deletion of tracks 48 hours after they were last accessed. IGB [6] and IGV support data sharing, but users must set up their own visualization to see the data. Using JBrowse, a fully interactive visualization can be shared via the Web; however, JBrowse requires that a server be configured to support the visualization, a task that can be difficult for scientists without programming experience.

Custom visualizations are often prominently featured in publications, yet little attention is paid to their reproducibility.
Reproducing experimental results, including visualizations, is an essential facet of scientific inquiry, providing the foundation for understanding, integrating, and extending results toward new discoveries. Reproducibility of genomics experiments has been shown to be limited [22]; NGS technologies, due to their complexity, have exacerbated problems with experimental reproducibility [23]. It is also often difficult to reproduce custom visualizations because essential details of the analysis or the visualization are lost.

There are, then, significant issues that limit the usefulness of custom visualizations for scientific collaboration and public communication. A framework is needed that facilitates sharing and publication of fully interactive and reproducible custom genome browser visualizations.

3 GALAXY TRACK BROWSER

The Galaxy Track Browser (Figure 1) is a Web-based browser integrated into the Galaxy platform that addresses the three challenges discussed in the previous section. Hence, it is designed for NGS data and tools.

Galaxy is an open Web-based platform for accessible, reproducible, and transparent genomic research [12-14]. The public Galaxy service makes analysis tools, genomic data, tutorial demonstrations, persistent graphical workspaces, and publication services available to any scientist who has access to the Internet. Local Galaxy servers can be set up by downloading the Galaxy application and customizing it to meet particular needs. Galaxy has established a significant community of users and developers.

Like Galaxy, the Galaxy Track Browser (GTB) is designed to be usable by all scientists, especially those without programming experience. GTB is written in JavaScript, and all of its functionality is available using only a Web browser. As is done in most genome browsers, GTB's horizontal axis denotes genome coordinates and each dataset is displayed as a track; individual data elements are drawn at their genome coordinates. GTB supports four types of NGS datasets: mapped reads (SAM/BAM file formats), features/annotations/intervals (BED, GFF), variants (VCF), and continuously-valued data (WIG). GTB is available on every Galaxy server, and Galaxy users can create, save, and share any number of visualizations. GTB visualizations can be created for both standard and custom genome builds.

Figure 1. Using the Galaxy Track Browser to analyze ENCODE RNA-seq data. From top to bottom: (1) UCSC knowngene gene annotation track; (2) UCSC all mrna annotation track; (3) UCSC vertebrate quantitative conservation track; (4) mapped RNA-seq reads from ENCODE cell line h1-hesc; (5) a form for running Cufflinks [24], a tool for assembling mapped reads into transcripts; (6) first attempt at transcript assembly; (7-9) improving the assembly using different parameter values for Cufflinks; (10) filtering assembled transcripts from the GM12878 cell line using transcript attributes.

3.1 Adding, Navigating, and Customizing Data

Adding a dataset to a GTB visualization can be done within a visualization, from a user's Galaxy workspace, or from Galaxy data libraries. To support visualization of very large datasets, GTB automatically creates multiple indices for each dataset added to a visualization and manages the connections between datasets and indices so that indices can be reused if a dataset is used in more than one visualization. Automatically managing indices is an important usability feature because many users may have difficulty creating indices.

GTB's use of multiple indices enables smooth user interactions and complete customization of data: users can freely and continuously pan, zoom, and navigate, and GTB fetches data as needed from the server. When GTB is showing a large genomic region, the data obtained is aggregated and is shown as a coverage histogram or feature spans. As a user zooms in, details for individual data points or features become available and are shown (Figure 2). GTB always obtains data from the server, not precomputed image tiles, so that a user can completely customize each data track's display as needed. For instance, users can view a continuously-valued track as a histogram, line graph, filled line graph, or heatmap; users can further adjust the maximum, minimum, and total track height as well. For mapped read tracks and feature/annotation tracks, display modes include histogram, dense, squished, and packed.

Figure 2.
GTB renders data differently depending on whether a large or small region is being viewed. Mapped RNA-seq reads are visible as a histogram when viewing a large region (top); zooming in to view a smaller region shows the read structure (middle) and, finally, the read labels (bottom).

Users can set a desired level of detail as well.

3.2 Integrating Analysis and Visualization

GTB provides access to analysis tools in its visualizations, enabling users to run tools on currently visualized tracks to produce new datasets and tracks. GTB enables users to rerun the tool used to create a track using different parameter values for the visible genomic region. Rerunning a tool produces new output, and GTB automatically renders the new output when the tool is finished (Figure 3). Parameter values can be repeatedly changed and a tool can be rerun many times (see Figure 1 for an example). Running a tool and viewing its output can be done interactively (i.e., quickly) because a tool runs only on the subset of data visible to the user.

This functionality is useful for seeing how parameter choices impact tool output, visually comparing output for different parameter values, and tuning values to return desired results. Once a user has chosen a set of parameter values, he can run the tool on the entire dataset, and Galaxy puts the tool output in his Galaxy analysis workspace. For instance, transcript assembly via the Cufflinks tool [24] is highly parameter dependent, generating different numbers of transcripts with different characteristics depending on parameter values. Rerunning Cufflinks in GTB using different parameter values can help make clear how parameters influence the assembly process. Also, a user can quickly generate different Cufflinks assemblies using different parameter values and visually compare the assemblies to choose the best one.

Figure 3. Analysis tools can be run in GTB: (1) UCSC knowngene annotation track; (2) assembled transcripts from RNA-seq data; (3) intersection of tracks 1 and 2, performed in Galaxy (not GTB) with the "for at least" (overlap) parameter set to 1 base; (4) interface for setting parameters and rerunning the Intersect tool; (5) the "for at least" (overlap) parameter is set to 4000 and the tool is run; (6) the tool is run with the new parameter values on the visible data and a new track is rendered.

GTB provides a generic framework for integrating tools that requires little or no additional work from tool developers. Using this framework, a tool's GTB configuration is specified in its Galaxy definition file. After a tool's GTB configuration is specified, all tracks generated using the tool will automatically provide the option to rerun the tool. This approach does not require any setup or configuration by GTB users, and it ensures that users only run tools that are compatible with GTB. However, it does limit users to running tools that have been explicitly configured to work with GTB and, for security reasons, prevents users from running arbitrary tools within GTB. GTB's approach to tool integration is detailed in Section 4.2.

The following tools use GTB's integration framework and are currently available in GTB: (i) genomic interval operations such as intersecting, clustering, or subtracting interval sets; (ii) Cufflinks; and (iii) the Unified Genotyper, a SNP and indel caller [25]. GTB's tool integration framework makes it possible to use any Galaxy tool in a visualization provided it meets the following criteria: (a) it produces output that GTB can render and (b) it produces correct output when a subset of the input dataset is provided. Some bioinformatics tools require the complete input dataset to produce correct output; Section 6 discusses this issue in detail.

GTB provides dynamic filters that show and hide data elements in real time as users adjust them (Figure 4). Each filter specifies a range for a particular attribute value; a user can set a filter's range by using a two-handle slider or by clicking on its text label and typing in a new range. Data elements with attribute values outside the specified range are hidden. Filters are additive so that multiple filters can be applied simultaneously. GTB creates filters based on a track's type. Read tracks can be filtered by quality scores. Feature/annotation/interval tracks can be filtered by their score column. Tracks in GFF/GTF format can be filtered by score and by attributes that have numerical values.
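The range-based, additive filtering described above can be sketched in a few lines of Python. This is only an illustration of the idea: the `RangeFilter` class, the `visible_elements` function, and the `score` and `fpkm` attribute names are hypothetical, and hiding elements that lack the filtered attribute is an assumption of this sketch, not documented GTB behavior.

```python
from dataclasses import dataclass


@dataclass
class RangeFilter:
    """One filter: keep elements whose attribute lies within [low, high]."""
    attribute: str
    low: float
    high: float

    def keeps(self, element):
        value = element.get(self.attribute)
        # Assumption: elements without the attribute are hidden.
        return value is not None and self.low <= value <= self.high


def visible_elements(elements, filters):
    """Filters are additive: an element is shown only if every filter keeps it."""
    return [e for e in elements if all(f.keeps(e) for f in filters)]


# Hypothetical transcripts with a score column and a numerical attribute.
transcripts = [
    {"name": "t1", "score": 900, "fpkm": 50.0},
    {"name": "t2", "score": 300, "fpkm": 80.0},
    {"name": "t3", "score": 950, "fpkm": 2.0},
]
filters = [RangeFilter("score", 800, 1000), RangeFilter("fpkm", 10.0, 1000.0)]
shown = visible_elements(transcripts, filters)  # only t1 passes both filters
```

Because each filter is an independent predicate, adjusting one slider only requires re-evaluating the predicates over data already held by the client, which is consistent with showing and hiding elements in real time.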
Filters are useful for visually identifying data elements that meet certain criteria and for understanding the distribution of attribute values in a dataset. At any time, a user can create a Galaxy dataset of the filtered data that is visible or create a dataset by applying the filter to the whole track. Newly created datasets are available in the user's Galaxy workspace for use or download. Figure 4 demonstrates how a set of transcripts might be meaningfully filtered. First, transcripts are filtered by score, which is a measure of relative expression amongst a set of isoforms; next, transcripts are filtered by FPKM, a measure of overall transcript expression. The remaining transcripts are dominant, highly expressed isoforms.

Figure 4. GTB dynamic filters applied to a feature/annotation track: (1) filtered for higher scoring features and (2) filtered for features with higher FPKM (an abundance measure).

3.3 Sharing and Publishing Visualizations

GTB visualizations are Galaxy objects and, like all Galaxy objects, can be shared or published to the Web using Galaxy's sharing and publication features [14]. Users share GTB visualizations via a GUI; no programming or server configuration is needed to share a GTB visualization.

GTB visualizations can be shared in multiple ways. Visualizations can be shared with individuals or can be made accessible on the Web via a URL. A visualization can also be published in Galaxy's public repository, where it is browsable and searchable. GTB visualizations can also be embedded in Galaxy Pages. Pages are custom Web-based documents that enable users to communicate an entire computational experiment using standard document elements such as text, tables, and figures as well as interactive embedded datasets, workflows, and visualizations. Pages are ideal for an online publication supplement.

A shared GTB visualization can be viewed using only a Web browser and is fully functional.
A colleague or guest viewing a shared visualization can scroll, zoom, run tools, and dynamically filter data. No configuration is necessary, nor does any data or software need to be downloaded.

4 IMPLEMENTATION

The Galaxy Track Browser uses an asynchronous HTTP client-server model where the server is a Galaxy instance. The GTB client communicates with the server using seven distinct actions: (a) get chromosome lengths; (b) get available tracks; (c) get track definition; (d) get data; (e) get reference genome data; (f) run a tool on a track subset; and (g) save. Each action corresponds to an asynchronous HTTP request, and all data exchanged between client and server is encoded in JSON format. Both GTB and the Galaxy platform are open source under the Academic Free License [26].

4.1 Client

The GTB client is written entirely in object-oriented JavaScript. The client leverages JavaScript's ecosystem of libraries, APIs, and tools. The GTB client uses several jQuery libraries and adheres to CommonJS encapsulation and modularity principles so that it can be repurposed by other JavaScript applications.

The GTB client's most frequent action is drawing tracks. Hence, the client is optimized to draw tracks as quickly as possible while ensuring that each track is completely customizable. To meet these goals, the client fetches and caches track data from the server and draws tracks itself. As discussed below, drawing track data is very fast. By starting with the track data, the client can draw a track using any configuration specified by the user. Redrawing a track due to a configuration change is also fast because cached data can very often be reused and data need not be fetched from the server.

The client renders genomic tracks as a set of adjacent tiles. This is advantageous because as the user zooms, pans, and scrolls, only new tiles need to be drawn; the client caches and reuses existing tiles when possible. Track tiles are drawn in the background and in parallel. Each tile drawn uses its own request to obtain the data to be drawn on the tile. Drawing tiles in the background ensures that the GTB client is responsive to user interaction while drawing. Drawing tiles in parallel ensures that delays encountered when drawing a tile, such as network latency or drawing a large number of elements, do not impact the drawing of other tiles or tracks. However, additional code is needed to determine when all of a track's tiles have been drawn so that post-draw actions can be taken. Post-draw actions are used to animate the showing and hiding of data elements when a user is filtering data and to quickly fetch data when running tools.

The GTB client renders each tile as an HTML <canvas> element. Canvas elements provide the ability to dynamically and precisely draw using a 2D API; the canvas element is supported by recent versions of all major Web browsers. The main advantages of using the canvas element are the speed, precision, and scale at which data elements can be drawn as compared to using HTML elements to represent data elements. Using JavaScript, the GTB client can render up to 5000 elements per tile while supporting smooth navigation throughout the visualization.
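The tile reuse described above can be illustrated with a small sketch. It is written in Python for brevity rather than in the client's JavaScript, it is synchronous (the background, parallel drawing is not modeled), and the `TileCache` class, its `draw_tile` callback, and the fixed `tile_width_bp` are all assumptions made for illustration.

```python
class TileCache:
    """Draw each tile at most once; reuse cached tiles when the view pans."""

    def __init__(self, tile_width_bp, draw_tile):
        self.tile_width_bp = tile_width_bp
        # Hypothetical callback: draw_tile(tile_index) -> rendered tile.
        self.draw_tile = draw_tile
        self.tiles = {}  # tile_index -> rendered tile

    def tiles_for_view(self, start, end):
        """Return the tiles covering the half-open region [start, end)."""
        first = start // self.tile_width_bp
        last = (end - 1) // self.tile_width_bp
        rendered = []
        for index in range(first, last + 1):
            if index not in self.tiles:
                # Only tiles not seen before are drawn.
                self.tiles[index] = self.draw_tile(index)
            rendered.append(self.tiles[index])
        return rendered
```

With 1000-base tiles, viewing `[0, 3000)` draws tiles 0 to 2, and panning to `[1000, 4000)` draws only tile 3. A cache that also had to survive zooming would presumably need the resolution as part of the key, since a tile covers a different span at each zoom level.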
The GTB client is modular and can be used with any server that implements the seven actions required by the client (listed previously). Tracks are added to a client by specifying a track definition that includes its name, dataset id, dataset type, filters, and tool (if there is one). Track filters and tools are structured so that the client can render and use them without requiring any a priori knowledge. Each filter includes an index into the track's data that denotes the value to use for filtering. A tool's parameters and inputs are encapsulated in an HTML form that the client can use as an interface that enables a user to set parameter values and run the tool.

4.2 GTB Server

The GTB server is integrated into the Galaxy framework as a controller in Galaxy's model-view-controller architecture and implements the actions needed to support the client. In order to provide data quickly to the client, the GTB server creates and uses multiple indices for each dataset in a GTB visualization. The GTB server manages index creation and associations between datasets and indices so that neither users nor the client are burdened.

Indices for a dataset are created when a client requests them or when a client requests data from the dataset. BAI indices are created for SAM/BAM datasets using SAMtools [27], tabix [17] indices are created for interval files (including BED, GFF, and VCF), and bigwig [16] indices are created for continuously-valued datasets. In addition, a summary tree index is created for all visualized datasets; a summary tree is a custom Galaxy format used to provide aggregated data over large genomic regions. The server stores associations between datasets and their indices so that, regardless of how often a dataset is used in visualizations, indices are created only once. For library datasets shared amongst all users of a Galaxy instance, indices are created once and shared as well.
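The create-once bookkeeping for indices can be sketched as a small registry. The mapping from dataset type to index type follows the text above; everything else is hypothetical, including the `IndexRegistry` class and the `build_index` callback, which stands in for the real indexing step and does not call SAMtools, tabix, or any Galaxy code.

```python
# Dataset type -> index type, following the description above.
INDEX_TYPES = {
    "sam": "bai", "bam": "bai",
    "bed": "tabix", "gff": "tabix", "vcf": "tabix",
    "wig": "bigwig",
}


class IndexRegistry:
    """Remember dataset-to-index associations so each index is built only once."""

    def __init__(self, build_index):
        # Hypothetical callback: build_index(dataset_id, index_type) -> index handle.
        self.build_index = build_index
        self.indices = {}  # (dataset_id, index_type) -> index handle

    def indices_for(self, dataset_id, dataset_type):
        """Return the indices for a dataset, creating any that do not exist yet."""
        # Every visualized dataset also gets a summary tree for aggregated data.
        wanted = [INDEX_TYPES[dataset_type], "summary_tree"]
        result = {}
        for index_type in wanted:
            key = (dataset_id, index_type)
            if key not in self.indices:
                self.indices[key] = self.build_index(dataset_id, index_type)
            result[index_type] = self.indices[key]
        return result
```

Keying the registry on the dataset rather than on the visualization is what lets a dataset appear in any number of visualizations, or be shared from a library, without triggering another indexing run.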
The GTB server takes multiple actions so that running a tool on a subset of data can be done quickly. The first time a tool is rerun, the server identifies all input datasets needed to produce the alternative dataset and creates indices for them. Indices are used to quickly extract data from the input datasets whenever they are used as inputs for a tool and hence reduce a tool's total execution time. Oftentimes indices will already be present because it is common to visualize both a tool's inputs and outputs, and the server will have created indices for all visualized datasets. To run a tool, the server uses indices to create small input datasets that contain only the data in the visible genomic region, runs the tool on these datasets via the Galaxy framework, and returns a track definition for the new output to the client.

A tool integrated with Galaxy can be used in GTB with minimal additional work. Any tool run from the command line can be integrated into Galaxy via a tool wrapper; a tool wrapper is an XML file that specifies a tool's parameters, inputs, and outputs. Adding the <trackster-conf> tag to a tool's wrapper indicates that it works correctly in GTB and will make the tool available in GTB. Tools compatible with GTB produce the same output for a particular genomic region regardless of whether the input is the subset of data from the region or the complete dataset. Section 6 discusses this issue in more detail.

Each track definition that the GTB server sends to the client includes both the track's filters and its tool. The server determines filters based on the dataset's type; for GTF datasets, the server includes all attribute values that are numerical and hence are filterable. The server reuses the Galaxy framework to generate tool definitions, including an interface for specifying parameters.

4.3 Performance

GTB's performance is dependent on numerous factors: client Web browser speed and memory, network latency, and server load and speed.
Performance profiling and analysis of GTB has largely focused on the client because Galaxy, acting as GTB's server, can be scaled as needed to effectively support significant Web traffic. A full evaluation of the GTB client's performance will be undertaken soon. However, informal evaluation indicates that (a) rendering takes less than 0.5 seconds per track for all track types at all levels of detail, including data-intensive tracks such as mapped reads and ESTs; and (b) data transfer time is significantly larger than rendering time. These observations suggest that optimizing GTB to mitigate data transfers may lead to large performance gains.

5 CONTRASTING GENOME BROWSERS

The Galaxy Track Browser complements other genome browser research by exploring an alternative approach that leverages Galaxy to provide novel genome browser functionality. This section contrasts GTB with six popular genome browsers that are compatible with NGS data: Ensembl [4], IGB [6], IGV [7], JBrowse [8], Savant [9], and UCSC [3]. Table 1 summarizes the functionality of GTB and these six browsers along several important dimensions.

The similarities amongst all browsers are indicative of core browser functionality. All browsers can obtain data both from a user's local computer and from remote sources via HTTP. Ensembl, IGB, and IGV also support visualizing data via the DAS protocol [28]. All browsers provide support for viewing mapped reads in BAM format, intervals or annotations in BED and GFF formats, and continuously-valued data in WIG format.

Full multi-resolution data models that support complete display customization are available in GTB, IGV, and Savant. IGB and JBrowse both use precomputed image data for some track types. IGB uses precomputed data to render tracks and does not provide customization options. JBrowse uses precomputed images rather than data for quantitative (WIG) tracks. The three most recently developed browsers, GTB, IGV, and Savant, all use full multi-resolution data models. This trend is driven by user demand for more interactive visualization tools and by technological advances in data indices.

GTB's integration of analysis tools and its dynamic filters are novel features not present in other genome browsers. Savant's plug-in framework has the potential to produce user interactions similar to GTB's tool use and filtering. Using Savant's framework, developers can write plug-ins that operate on visualized data; an example plug-in is a SNP finder that highlights potential SNPs based on mapped reads.

Both GTB and Savant aim to integrate analysis tools into a browser environment, and each approach has advantages. Savant's plug-in framework is useful for prototyping new functionality because plug-in development is straightforward and requires little programming. Integrating tools into GTB/Galaxy is equally simple and potentially no programming is needed, but Galaxy requires that a tool run from the command line.

GTB's strength is its integration into Galaxy, a fully functional analysis environment. Integration with Galaxy enables GTB to automatically leverage the tools already available in Galaxy without additional work by users or tool developers. Galaxy has a significant community of developers that have integrated many tools into Galaxy [12]. Making tools available in GTB is beneficial to tool developers because GTB provides access to their tools in a visualization environment. Users benefit because GTB is very usable: users can run tools and visualize output datasets without using a command line or installing any software. As tools are increasingly integrated into Savant and GTB, it may be appropriate to compare them not only with genome browsers but also with analysis environments that incorporate visualization components.

Many genome browsers enable users to share data or even complete visualizations. The UCSC genome browser supports sharing custom tracks (but not complete visualizations) via URL; however, there are limitations, including the deletion of tracks 48 hours after they were last accessed. IGB and IGV servers can be set up to make data publicly available or accessible via password. To use this data, IGB and IGV users must download and install the genome browser software and then find and add data from a public server. JBrowse and GTB make sharing data simpler because they are Web-based, and hence fully interactive visualizations can be shared via URL.

However, GTB's approach for sharing and modifying visualizations is more flexible and user-friendly than JBrowse's. JBrowse visualizations are publicly accessible once they are created; GTB visualizations can be shared with individual users, made accessible via URL, or published to Galaxy's repository. Adding datasets to a JBrowse visualization requires that a user access the server to install and preprocess the new data. GTB users add datasets to a visualization using a GUI. GTB users need no programming experience to share and modify visualizations.

Table 1. Comparing genome browsers. Columns: Ensembl, GTB, IGB, IGV, JBrowse, Savant, UCSC. Rows: (1) can obtain data locally and from public sources via HTTP and/or DAS; (2) can view mapped reads (BAM), interval files (BED, GFF), and continuously-valued data (WIG); (3) multi-resolution data model for all datatypes that supports complete display customization; (4) run bioinformatics tools to produce and visualize new data; (5) filters for dynamically showing and hiding data based on attribute values; (6) share and customize fully interactive visualizations.

In summary, GTB is the only Web-based browser to implement current best practices amongst genome browsers for fetching, representing, and displaying data.
Also, GTB s interface provides unique features, enabling users to run tools to produce new data, dynamically filter data, and share fully interactive and customizable visualizations. 6 DISCUSSION Savant UCSC 6.1 Making Visualization-compatible Analysis Tools Interactive visualizations such as genome browsers are useful primarily because users can, in real time, manipulate data and receive feedback. Analysis tools, then, must run quickly if they are to be useful additions to genome browsers. Currently, many bioinformatics analysis tools can run for minutes or hours, especially when processing large datasets, and are unsuitable for an interactive visualization. GTB addresses this issue by running an analysis tool on the subset of data for the genomic region being viewed. Seeing a tool s output for a chosen genomic region provides useful information while minimizing a tool s execution time. For many analysis tools, this approach is sufficient to ensure that runtime is on the order of seconds. However, this approach is not compatible with all analysis tools. Some tools require not only input data for a particular genomic region but also data from other regions or from all regions. There are two common reasons that data outside a region of interest may be needed. First, a tool may build a global model as part of its execution; for instance, both the transcript assembly tool Cufflinks [24] and the ChIP-seq analysis program MACS [29] use global models to perform normalization. Second, a tool may require a complete input dataset because it is not possible to identify, prior to runtime, a subset of input data needed to produce correct output in a particular genomic region. This is true for the NGS read mapping tools Bowtie [30] and Tophat [31]. We are exploring two approaches to address this issue. One approach is to augment tools to store a global model for an input dataset and use the stored model when processing a subset of the input. 
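As a rough illustration of the stored-global-model idea, the Python sketch below computes one global quantity over a full dataset once (here, the total number of mapped reads), stores it, and then reuses it to produce normalized coverage for only the region being viewed. It is a minimal sketch under stated assumptions, not the way Cufflinks, MACS, or GTB work internally: the `Read` and `GlobalModel` types, the function names, and the reads-per-million scaling are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Read:
    """A mapped read (hypothetical record type); coordinates are half-open [start, end)."""
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class GlobalModel:
    """Illustrative stand-in for a tool's stored global model."""
    total_reads: int


def build_global_model(reads: Iterable[Read]) -> GlobalModel:
    """Make one full pass over the dataset; the result is computed once and stored."""
    return GlobalModel(total_reads=sum(1 for _ in reads))


def region_coverage_rpm(
    reads: Iterable[Read], model: GlobalModel, chrom: str, start: int, end: int
) -> List[float]:
    """Per-base coverage in [start, end), scaled to reads per million.

    Only reads overlapping the region contribute to the depth; the scaling
    factor comes from the stored model, so no second full pass is needed.
    A real browser would fetch the region's reads through an index rather
    than scanning a list as this sketch does.
    """
    depth = [0] * (end - start)
    for read in reads:
        if read.chrom != chrom or read.end <= start or read.start >= end:
            continue  # read lies outside the viewed region
        for pos in range(max(read.start, start), min(read.end, end)):
            depth[pos - start] += 1
    if model.total_reads == 0:
        return [0.0] * len(depth)
    scale = 1_000_000 / model.total_reads
    return [d * scale for d in depth]


# Example: build the model once, then query only the viewed region.
example_reads = [
    Read("chr1", 0, 5),
    Read("chr1", 3, 8),
    Read("chr2", 0, 4),
    Read("chr1", 100, 105),
]
example_model = build_global_model(example_reads)
example_coverage = region_coverage_rpm(example_reads, example_model, "chr1", 2, 6)
```

The design point is that the expensive, dataset-wide step is separated from the cheap, region-specific step, so the second can be repeated as the user navigates.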
We have successfully applied this approach to modify Cufflinks to work with subsets of input data. As the use of analysis tools in genome browsers becomes more common, tool developers are likely to support stored global models and other approaches that enable tools to be run on subsets of data or particular genomic regions.

For tools that cannot be modified to run on subsets of input data, an alternative approach is to use dynamic filtering to simulate running a tool using different parameters. In this approach, a tool's parameters are relaxed so that all potential output is produced and attribute values are attached to output data. A user can then use attribute values to filter data and observe which data points would be produced for particular parameter values. This approach is ideal for tools that use parameter settings to omit data from their output, including Bowtie and Tophat.

6.2 Towards a Visual Analysis Environment

Currently, GTB users can rerun tools both to understand how tool parameter values influence tool output and to refine their selection of parameter values to achieve a desired output. This is a first step toward creating a general-purpose visual analysis environment. GUI environments for bioinformatics analyses such as Galaxy and GenePattern [32] have proven very popular because they make it easier for non-programmers to run bioinformatics tools and perform complete analyses. GUI environments represent inputs, tools, and outputs using text and utilize GUI widgets such as drop-down boxes and text fields that enable users to choose inputs and tools. A complementary approach to GUI environments is a visual analysis environment where a user performs the same actions visually. A visual analysis environment is another approach for making bioinformatics tools, and especially outputs, more accessible. We are extending GTB so that users can run any compatible tool on any set of input tracks. For example, to create a track of the intersecting regions for two annotation tracks, a user might select the intersect tool and then drag two tracks onto the new track. GTB would then run the tool and render the tool output. Repeated use of tools would produce more tracks, and tracks could be organized so that a user could scroll through the tracks to see the steps in her analysis.
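To make the intersect example concrete, the following Python sketch computes the regions shared by two annotation tracks on a single chromosome. It is an illustration only, not the intersect tool that Galaxy provides: the `Interval` alias and the `intersect_tracks` function are hypothetical names, intervals are assumed to be half-open, and intervals within each track are assumed not to overlap one another.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def intersect_tracks(track_a: List[Interval], track_b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both tracks, using a two-pointer sweep."""
    a = sorted(track_a)
    b = sorted(track_b)
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))  # the two current intervals overlap
        # Advance whichever interval ends first; it cannot overlap anything later.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Example: two small annotation tracks and the new track derived from them.
genes = [(100, 200), (300, 400)]
peaks = [(150, 320), (390, 500)]
shared = intersect_tracks(genes, peaks)
```

In a visual analysis environment, the list returned here would correspond to the new track rendered beneath the two input tracks.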
When inputs and tools can be selected and used in GTB, GTB will function very much like a standard GUI bioinformatics environment.

7 CONCLUDING THOUGHTS

Next-generation sequencing technologies and analysis tools present new challenges for genome browsers. NGS datasets are extremely large and growing, and full multi-resolution data models are needed to enable smooth interaction with such large datasets. NGS tools and analyses are complicated and often require use of both tools and visualizations to understand, debug, and improve. To support use of analysis and visualization together, genome browsers need to integrate tools and enable tools to be used to produce new data, all within a browser. NGS experiments are highly collaborative, and genome browsers should facilitate fast and simple sharing of visualizations; shared visualizations can also play a large role when publishing analyses.

The Galaxy Track Browser is a Web-based genome browser integrated into the Galaxy platform that addresses these challenges. GTB is the only Web-based genome browser that provides a full multi-resolution data model. GTB provides tools and dynamic filters that enable users to produce, visualize, and analyze data all within GTB. Using the Galaxy framework, fully functional GTB visualizations can be shared with colleagues or published to the Web. GTB is user-friendly; GTB requires no programming experience to use, and all GTB functionality is available using only a Web browser.

REFERENCES

[1] M. S. Cline and W. J. Kent, "Understanding genome browsing," Nat. Biotechnol., vol. 27.
[2] C. B. Nielsen, et al., "Visualizing genomes: techniques and challenges," Nature Methods, vol. 7, pp. S5-S15.
[3] W. J. Kent, "The human genome browser at UCSC," Genome Res., vol. 12.
[4] P. Flicek, et al., "Ensembl 2011," Nucleic Acids Research, vol. 39, pp. D800-D806, January 1.
[5] L. D. Stein, et al., "The Generic Genome Browser: A Building Block for a Model Organism System Database," Genome Research, vol. 12.
[6] J. W. Nicol, et al., "The Integrated Genome Browser: free software for distribution and exploration of genome-scale data sets," Bioinformatics, vol. 25.
[7] J. T. Robinson, et al., "Integrative genomics viewer," Nat Biotech, vol. 29.
[8] M. E. Skinner, et al., "JBrowse: A next-generation genome browser," Genome Research, vol. 19.
[9] M. Fiume, et al., "Savant: genome browser for high-throughput sequencing data," Bioinformatics, vol. 26.
[10] E. R. Mardis, "Next-generation DNA sequencing methods," Annu. Rev. Genomics Hum. Genet., vol. 9.
[11] S. Pepke, et al., "Computation for ChIP-seq and RNA-seq studies," Nat Meth, vol. 6, pp. S22-S32.
[12] D. Blankenberg, et al., "A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly," Genome Research, vol. 17.
[13] B. Giardine, "Galaxy: a platform for interactive large-scale genome analysis," Genome Res., vol. 15.
[14] J. Goecks, et al., "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biology, vol. 11, p. R86.
[15] Z. Wang, et al., "RNA-Seq: a revolutionary tool for transcriptomics," Nat Rev Genet, vol. 10, Jan.
[16] W. J. Kent, et al., "BigWig and BigBed: enabling browsing of large distributed datasets," Bioinformatics, vol. 26.
[17] H. Li, "Tabix: fast retrieval of sequence features from generic TAB-delimited files," Bioinformatics, vol. 27.
[18] J. Tonti-Filippini. (April 21, 2011). Anno-J: Annotation Browsing 2.0.
[19] K. A. Cook and J. J. Thomas, Illuminating the Path: The Research and Development Agenda for Visual Analytics.
[20] W. J. Kent, "BLAT--The BLAST-Like Alignment Tool," Genome Research, vol. 12.
[21] M. C. Schatz, et al., "Hawkeye: an interactive visual analytics tool for genome assemblies," Genome Biol., vol. 8, p. R34.
[22] J. P. A. Ioannidis, et al., "Repeatability of published microarray gene expression analyses," Nat Genet, vol. 41.
[23] "Devil in the details," Nature, vol. 470.
[24] C. Trapnell, et al., "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation," Nat Biotech, vol. 28.
[25] A. McKenna, et al., "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Res, vol. 20, Sep.
[26] Open Source Initiative. (April 21, 2011). The Academic Free License 3.0.
[27] H. Li, et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25.
[28] R. Dowell, et al., "The Distributed Annotation System," BMC Bioinformatics, vol. 2, p. 7.
[29] Y. Zhang, et al., "Model-based Analysis of ChIP-Seq (MACS)," Genome Biology, vol. 9, p. R137.
[30] B. Langmead, et al., "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, p. R25.
[31] C. Trapnell, et al., "TopHat: discovering splice junctions with RNA-Seq," Bioinformatics, vol. 25.
[32] M. Reich, et al., "GenePattern 2.0," Nature Genetics, vol. 38.


More information

GALAXY BIOINFORMATICS WORKFLOW ENVIRONMENT. Rutger Vos, 3 April 2012

GALAXY BIOINFORMATICS WORKFLOW ENVIRONMENT. Rutger Vos, 3 April 2012 GALAXY BIOINFORMATICS WORKFLOW ENVIRONMENT Rutger Vos, 3 April 2012 Overview Informatics in the post-genomic era The past (?) Analyses glued together using scripting languages, directly on the CLI or in

More information

Exercise 2: Browser-Based Annotation and RNA-Seq Data

Exercise 2: Browser-Based Annotation and RNA-Seq Data Exercise 2: Browser-Based Annotation and RNA-Seq Data Jeremy Buhler July 24, 2018 This exercise continues your introduction to practical issues in comparative annotation. You ll be annotating genomic sequence

More information

How to store and visualize RNA-seq data

How to store and visualize RNA-seq data How to store and visualize RNA-seq data Gabriella Rustici Functional Genomics Group gabry@ebi.ac.uk EBI is an Outstation of the European Molecular Biology Laboratory. Talk summary How do we archive RNA-seq

More information

Aligners. J Fass 21 June 2017

Aligners. J Fass 21 June 2017 Aligners J Fass 21 June 2017 Definitions Assembly: I ve found the shredded remains of an important document; put it back together! UC Davis Genome Center Bioinformatics Core J Fass Aligners 2017-06-21

More information

Tutorial. RNA-Seq Analysis of Breast Cancer Data. Sample to Insight. November 21, 2017

Tutorial. RNA-Seq Analysis of Breast Cancer Data. Sample to Insight. November 21, 2017 RNA-Seq Analysis of Breast Cancer Data November 21, 2017 Sample to Insight QIAGEN Aarhus Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.qiagenbioinformatics.com AdvancedGenomicsSupport@qiagen.com

More information

Illumina Next Generation Sequencing Data analysis

Illumina Next Generation Sequencing Data analysis Illumina Next Generation Sequencing Data analysis Chiara Dal Fiume Sr Field Application Scientist Italy 2010 Illumina, Inc. All rights reserved. Illumina, illuminadx, Solexa, Making Sense Out of Life,

More information

UCSC Genome Browser ASHG 2014 Workshop

UCSC Genome Browser ASHG 2014 Workshop UCSC Genome Browser ASHG 2014 Workshop We will be using human assembly hg19. Some steps may seem a bit cryptic or truncated. That is by design, so you will think about things as you go. In this document,

More information

RNA-seq Data Analysis

RNA-seq Data Analysis Seyed Abolfazl Motahari RNA-seq Data Analysis Basics Next Generation Sequencing Biological Samples Data Cost Data Volume Big Data Analysis in Biology تحلیل داده ها کنترل سیستمهای بیولوژیکی تشخیص بیماریها

More information

Introduction to Genome Browsers

Introduction to Genome Browsers Introduction to Genome Browsers Rolando Garcia-Milian, MLS, AHIP (Rolando.milian@ufl.edu) Department of Biomedical and Health Information Services Health Sciences Center Libraries, University of Florida

More information

Supplementary Figure 1. Fast read-mapping algorithm of BrowserGenome.

Supplementary Figure 1. Fast read-mapping algorithm of BrowserGenome. Supplementary Figure 1 Fast read-mapping algorithm of BrowserGenome. (a) Indexing strategy: The genome sequence of interest is divided into non-overlapping 12-mers. A Hook table is generated that contains

More information

Genome Environment Browser (GEB) user guide

Genome Environment Browser (GEB) user guide Genome Environment Browser (GEB) user guide GEB is a Java application developed to provide a dynamic graphical interface to visualise the distribution of genome features and chromosome-wide experimental

More information

Expression Analysis with the Advanced RNA-Seq Plugin

Expression Analysis with the Advanced RNA-Seq Plugin Expression Analysis with the Advanced RNA-Seq Plugin May 24, 2016 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com support-clcbio@qiagen.com

More information

Using Galaxy to provide a NGS Analysis Platform GTC s NGS & Bioinformatics Summit Europe October 7-8, 2013 in Berlin, Germany.

Using Galaxy to provide a NGS Analysis Platform GTC s NGS & Bioinformatics Summit Europe October 7-8, 2013 in Berlin, Germany. Using Galaxy to provide a NGS Analysis Platform GTC s NGS & Bioinformatics Summit Europe October 7-8, 2013 in Berlin, Germany. (public version) Hans-Rudolf Hotz ( hrh@fmi.ch ) Friedrich Miescher Institute

More information

Galaxy. Data intensive biology for everyone. / #usegalaxy

Galaxy. Data intensive biology for everyone. / #usegalaxy Galaxy Data intensive biology for everyone. www.galaxyproject.org @jxtx / #usegalaxy Engineering Dannon Baker Dan Blankenberg Dave Bouvier Nate Coraor Carl Eberhard Jeremy Goecks Sam Guerler Greg von Kuster

More information

BovineMine Documentation

BovineMine Documentation BovineMine Documentation Release 1.0 Deepak Unni, Aditi Tayal, Colin Diesh, Christine Elsik, Darren Hag Oct 06, 2017 Contents 1 Tutorial 3 1.1 Overview.................................................

More information

m6aviewer Version Documentation

m6aviewer Version Documentation m6aviewer Version 1.6.0 Documentation Contents 1. About 2. Requirements 3. Launching m6aviewer 4. Running Time Estimates 5. Basic Peak Calling 6. Running Modes 7. Multiple Samples/Sample Replicates 8.

More information

Introduction to Galaxy

Introduction to Galaxy Introduction to Galaxy Saint Louis University St. Louis, Missouri April 30, 2013 Dave Clements, Emory University http://galaxyproject.org/ Agenda 9:00 Welcome 9:20 Basic Analysis with Galaxy 10:30 Basic

More information

Mapping NGS reads for genomics studies

Mapping NGS reads for genomics studies Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization

More information

Anaquin - Vignette Ted Wong January 05, 2019

Anaquin - Vignette Ted Wong January 05, 2019 Anaquin - Vignette Ted Wong (t.wong@garvan.org.au) January 5, 219 Citation [1] Representing genetic variation with synthetic DNA standards. Nature Methods, 217 [2] Spliced synthetic genes as internal controls

More information

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines: Illumina MiSeq,

More information

SolexaLIMS: A Laboratory Information Management System for the Solexa Sequencing Platform

SolexaLIMS: A Laboratory Information Management System for the Solexa Sequencing Platform SolexaLIMS: A Laboratory Information Management System for the Solexa Sequencing Platform Brian D. O Connor, 1, Jordan Mendler, 1, Ben Berman, 2, Stanley F. Nelson 1 1 Department of Human Genetics, David

More information

GenomeStudio Software Release Notes

GenomeStudio Software Release Notes GenomeStudio Software 2009.2 Release Notes 1. GenomeStudio Software 2009.2 Framework... 1 2. Illumina Genome Viewer v1.5...2 3. Genotyping Module v1.5... 4 4. Gene Expression Module v1.5... 6 5. Methylation

More information

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment An Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at https://blast.ncbi.nlm.nih.gov/blast.cgi

More information

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013 ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013 1. Data and objectives We will use the data from GEO (GSE35368, Toedling, Servant et al. 2011). Two samples were

More information

INTRODUCTION TO BIOINFORMATICS

INTRODUCTION TO BIOINFORMATICS Molecular Biology-2017 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain

More information

W ASHU E PI G ENOME B ROWSER

W ASHU E PI G ENOME B ROWSER W ASHU E PI G ENOME B ROWSER Keystone Symposium on DNA and RNA Methylation January 23 rd, 2018 Fairmont Hotel Vancouver, Vancouver, British Columbia, Canada Presenter: Renee Sears and Josh Jang Tutorial

More information

High-throughout sequencing and using short-read aligners. Simon Anders

High-throughout sequencing and using short-read aligners. Simon Anders High-throughout sequencing and using short-read aligners Simon Anders High-throughput sequencing (HTS) Sequencing millions of short DNA fragments in parallel. a.k.a.: next-generation sequencing (NGS) massively-parallel

More information

Tutorial 1: Exploring the UCSC Genome Browser

Tutorial 1: Exploring the UCSC Genome Browser Last updated: May 12, 2011 Tutorial 1: Exploring the UCSC Genome Browser Open the homepage of the UCSC Genome Browser at: http://genome.ucsc.edu/ In the blue bar at the top, click on the Genomes link.

More information

Distributed Visualization for Genomic Analysis

Distributed Visualization for Genomic Analysis Distributed Visualization for Genomic Analysis Alyssa Morrow Anthony D. Joseph, Ed. Nir Yosef, Ed. Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No.

More information

Our typical RNA quantification pipeline

Our typical RNA quantification pipeline RNA-Seq primer Our typical RNA quantification pipeline Upload your sequence data (fastq) Align to the ribosome (Bow>e) Align remaining reads to genome (TopHat) or transcriptome (RSEM) Make report of quality

More information

Using Galaxy to Perform Large-Scale Interactive Data Analyses

Using Galaxy to Perform Large-Scale Interactive Data Analyses Using Galaxy to Perform Large-Scale Interactive Data Analyses Jennifer Hillman-Jackson, 1 Dave Clements, 2 Daniel Blankenberg, 1 James Taylor, 2 Anton Nekrutenko, 1 and Galaxy Team 1,2 UNIT 10.5 1 Penn

More information

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu Matt Huska Freie Universität Berlin Computational Methods for High-Throughput Omics

More information

ViTraM: VIsualization of TRAnscriptional Modules

ViTraM: VIsualization of TRAnscriptional Modules ViTraM: VIsualization of TRAnscriptional Modules Version 1.0 June 1st, 2009 Hong Sun, Karen Lemmens, Tim Van den Bulcke, Kristof Engelen, Bart De Moor and Kathleen Marchal KULeuven, Belgium 1 Contents

More information

Biosphere: the interoperation of web services in microarray cluster analysis

Biosphere: the interoperation of web services in microarray cluster analysis Biosphere: the interoperation of web services in microarray cluster analysis Kei-Hoi Cheung 1,2,*, Remko de Knikker 1, Youjun Guo 1, Guoneng Zhong 1, Janet Hager 3,4, Kevin Y. Yip 5, Albert K.H. Kwan 5,

More information