The Galaxy Track Browser: Transforming the Genome Browser from Visualization Tool to Analysis Tool

Jeremy Goecks *, Kanwei Li Ω, Dave Clements ℵ, The Galaxy Team, James Taylor ℇ
Emory University

ABSTRACT

The proliferation of next-generation sequencing (NGS) technologies and analysis tools presents new challenges to genome browsers. These challenges include supporting very large datasets, integrating analysis tools with data visualization to help reason about and improve analyses, and sharing or publishing fully interactive visualizations. The Galaxy Track Browser (GTB) is a Web-based genome browser integrated into the Galaxy platform that addresses these challenges. GTB is the first Web-based genome browser to provide a full multi-resolution data model; this model supports efficient data retrieval from very large datasets. GTB leverages the Galaxy platform to combine data visualization and data analysis; users can specify parameter values and run tools to produce new data, all within GTB. GTB also provides interactive filters that dynamically show and hide data and can be used to identify data for further investigation. GTB is available on every Galaxy server, and visualizations can be created for both standard and custom genome builds. Fully interactive GTB visualizations can be shared with colleagues and published on the Web using a simple graphical user interface.

KEYWORDS: Galaxy Track Browser, genomics, genome browser, visual analytics.

INDEX TERMS: H5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous, J.3 Life and Medical Sciences: biology and genetics.

1 INTRODUCTION

Genomics, the study of DNA and related molecules, their functions, and their impact on health, is a growing biomedical science that is highly reliant on computational tools and methods.
Visualizations play an important role in genomics by helping scientists understand the numerical and textual data that many genomics tools produce. Using visualizations, scientists can find problems and debug an analysis, identify results for subsequent investigation, and communicate findings, both to colleagues and in publications. Genome browsers are a powerful visualization tool that enables scientists to map numerical and textual data onto a visual representation of the genome [1, 2]. Genome browsers have a long history and remain an active area of research and development. Early genome browsers such as the UCSC genome browser [3], Ensembl [4], and GBrowse [5] are regularly updated and extended. Recently, new genome browsers, such as IGB [6], IGV [7], JBrowse [8], and Savant [9], have been developed and demonstrate new models and techniques for genome browsing.

* jeremy.goecks@emory.edu, Ω kanwei@gmail.com, ℵ clements@galaxyproject.org, outreach@galaxyproject.org, ℇ james.taylor@emory.edu
IEEE Symposium on Biological Data Visualization, October 23-24, Providence, Rhode Island, USA

Recent activity in genome browser development is motivated in large part by the adoption of next-generation sequencing (NGS) technologies [10] and analysis tools [11] in genomic experiments. NGS experiments produce very large datasets, use complex analyses, and require significant collaboration to complete. These demands have driven recent genome browser research. Despite much progress, challenges remain that limit the usefulness of genome browsers for NGS data. Current desktop browsers use full multi-resolution data models for viewing and customizing data, but Web-based browsers lack this needed functionality. NGS analyses are often long and complex, but it is difficult to use genome browsers to reason about and improve an analysis because analysis tools and genome browsers are not integrated.
Finally, sharing and publishing fully interactive visualizations is difficult in current browsers, limiting the usefulness of visualizations for collaboration.

The Galaxy Track Browser (GTB) is a Web-based genome browser integrated into the Galaxy platform [12-14] that addresses these challenges. GTB utilizes a full multi-resolution, client-server data model to support gigabyte-sized datasets. Each dataset is a track in the browser, and all tracks are fully interactive and customizable. GTB leverages the Galaxy platform to combine data visualization and data analysis. Using GTB, users can set parameter values and run tools to produce and visualize new data. Dynamic filters can be used to interactively explore data and find data that meets particular criteria. GTB can be used to visualize genomic data for both standard and custom genome builds. Fully interactive GTB visualizations can be shared with particular individuals or published on the Web through a simple graphical user interface (GUI). While GTB leverages many of Galaxy's features, it is modular and can be configured to work outside of Galaxy and with different data providers.

This paper's contributions include: (1) a discussion of three challenges that NGS tools and data pose to current genome browsers; (2) a description of the Galaxy Track Browser and how it addresses these challenges; (3) a comparison between the Galaxy Track Browser and other recent genome browsers.

2 THREE CURRENT CHALLENGES FOR GENOME BROWSERS

Genome browsers are an important tool for helping scientists visualize and understand their data [1, 2]. The proliferation of next-generation sequencing (NGS) technologies [10] and analysis tools [11] presents new challenges to genome browsers. In this section, we discuss three challenges that genome browsers must address to support scientists doing NGS experiments and analyses, and how current genome browsers address these challenges.

2.1 Challenge: Supporting Very Large Datasets in a Web-based Browser

NGS technologies and analysis pipelines often produce data that is tens or hundreds of gigabytes in size or larger. For instance, consider a gene expression experiment using RNA-seq [15]. Currently, an RNA-seq dataset obtained from a single lane of an Illumina HiSeq 2000 system is approximately ten to twenty gigabytes in size. Following a simple pipeline to map reads, assemble transcripts, and perform differential expression testing amongst different RNA-seq samples will often generate many datasets that are 500 to 1000 megabytes in size. A genome browser for NGS experiments, then, must efficiently scale to render large datasets and enable users to customize data display to meet their needs.

Scaling to support visualization of gigabyte-sized datasets requires a multi-resolution data model that provides access to data at multiple levels of resolution. To display large genomic regions with many data elements, aggregation is done ahead of time and the aggregated data is used for rendering. To display individual features in a small region, indices such as bigWig [16] and tabix [17] are used to quickly find the region's data in a large file. A multi-resolution data model provides efficient access to a dataset but does not influence how a dataset is rendered. Genome browsers that support a complete multi-resolution data model include the desktop-based browsers IGV [7] and Savant [9]. Desktop-based genome browsers are convenient when there is no access to the Web or when datasets are stored locally and uploading them is undesirable. Web-based browsers complement desktop-based browsers by providing access to data and visualizations over the Web via a standard Web browser, and they do not require users to download or install any software. No Web-based browser, however, implements a complete multi-resolution model.
Anno-J [18] supports a multi-resolution model but does not implement one, and JBrowse [8] uses a partial multi-resolution data model. Hence, there is a need for a Web-based browser that supports a full multi-resolution data model.

2.2 Challenge: Integrating Data Visualization and Data Analysis Tools

NGS analysis tools and pipelines are often complex and highly parameter-dependent, and it can be difficult to determine how changes in parameter values will impact the output of a tool or pipeline. Using genome browsers to debug or tune parameter values can be useful because it is often straightforward to visually confirm that a tool's output is acceptable or needs improvement. However, moving between running analyses and visualizing output can be a tedious process because genome browsers act as endpoints for an analysis pipeline: a series of tools is applied to produce outputs, which are then visualized in a browser.

Genome browsers would be more useful if they were integrated with analysis tools. For example, a browser might enable a user to change a tool's parameter value and then, through the browser, rerun the tool and observe how the change impacts the tool's output. If a user could repeatedly change tool parameter values and visualize the corresponding output, she could tune a tool's parameter values to produce the desired output. Similarly, others have argued that more on-the-fly computation is needed in browsers [2]. Integrating analyses with visualization is a foundation of visual analytics; visual analytics is the science of using interactive visualizations to support analytic reasoning [19].

There are examples of limited visual analytics functionality in current browsers. In the UCSC genome browser [3], a user can run a BLAT [20] query and then visualize the results as a track in the current browser. Savant provides a plug-in framework that developers can use to prototype and run analysis tools.
Savant's framework is limited to tools specifically written for Savant because there is no method for incorporating existing tools as plug-ins. Principles from visual analytics have been applied to develop Hawkeye, a visualization for performing genome assemblies that uses dynamic filtering and automated clustering to help users identify problematic areas of an assembly [21]. These examples illustrate the value of integrating analysis tools into genomic visualizations. The challenge, then, is developing a genome browser that integrates with an analysis framework so that users can run and rerun many different tools to produce novel output, all while in the browser.

2.3 Challenge: Sharing and Publishing Interactive Custom Visualizations

At the leading edge of biomedical research are large, collaborative experiments that employ NGS technologies and analysis tools to explore complex biomedical questions. Custom genome browser visualizations (visualizations of experimental data, often coupled with public data) are useful for collaboration and communication because they can efficiently convey information. Despite the value of custom visualizations, they are currently quite difficult to share. The UCSC genome browser supports sharing custom tracks via URL, but there are limitations, including the deletion of tracks 48 hours after they were last accessed. IGB [6] and IGV support data sharing, but users must set up their own visualization to see the data. Using JBrowse, a fully interactive visualization can be shared via the Web; however, JBrowse requires that a server be configured to support the visualization, a task that can be difficult for scientists without programming experience.

Custom visualizations are often prominently featured in publications, yet little attention is paid to their reproducibility.
Reproducing experimental results, including visualizations, is an essential facet of scientific inquiry, providing the foundation for understanding, integrating, and extending results toward new discoveries. Reproducibility of genomics experiments has been shown to be limited [22]; NGS technologies, due to their complexity, have made experimental reproducibility even harder [23]. It is also often difficult to reproduce custom visualizations because essential details of the analysis or the visualization are lost.

There are, then, significant issues that limit the usefulness of custom visualizations for scientific collaboration and public communication. A framework is needed that facilitates sharing and publication of fully interactive and reproducible custom genome browser visualizations.

3 GALAXY TRACK BROWSER

The Galaxy Track Browser (Figure 1) is a Web-based browser integrated into the Galaxy platform that addresses the three challenges discussed in the previous section. Hence, it is designed for NGS data and tools. Galaxy is an open Web-based platform for accessible, reproducible, and transparent genomic research [12-14]. The public Galaxy service makes analysis tools, genomic data, tutorial demonstrations, persistent graphical workspaces, and publication services available to any scientist who has access to the Internet. Local Galaxy servers can be set up by downloading the Galaxy application and customizing it to meet particular needs. Galaxy has established a significant community of users and developers.

Like Galaxy, the Galaxy Track Browser (GTB) is designed to be usable by all scientists, especially those without programming experience. GTB is written in JavaScript, and all of its functionality is available using only a Web browser. As is done in most genome browsers, GTB's horizontal axis denotes genome coordinates and each dataset is displayed as a track; individual

Figure 1. Using the Galaxy Track Browser to analyze ENCODE RNA-seq data. From top to bottom: (1) UCSC knowngene gene annotation track; (2) UCSC all mrna annotation track; (3) UCSC vertebrate quantitative conservation track; (4) mapped RNA-seq reads from ENCODE cell line h1-hesc; (5) a form for running Cufflinks [24], a tool for assembling mapped reads into transcripts; (6) first attempt at transcript assembly; (7-9) improving the assembly using different parameter values for Cufflinks; (10) filtering assembled transcripts from the GM12878 cell line using transcript attributes.

data elements are drawn at their genome coordinates. GTB supports four types of NGS datasets: mapped reads (SAM/BAM file format), features/annotations/intervals (BED, GFF), variants (VCF), and continuously valued data (WIG). GTB is available on every Galaxy server, and Galaxy users can create, save, and share any number of visualizations. GTB visualizations can be created for both standard and custom genome builds.

3.1 Adding, Navigating, and Customizing Data

Adding a dataset to a GTB visualization can be done from within a visualization, from a user's Galaxy workspace, or from Galaxy data libraries. To support visualization of very large datasets, GTB automatically creates multiple indices for each dataset added to a visualization and manages the connections between datasets and indices so that indices can be reused if a dataset is used in more than one visualization. Automatically managing indices is an important usability feature because many users may have difficulty creating indices.

GTB's use of multiple indices enables smooth user interactions and complete customization of data: users can freely and continuously pan, zoom, and navigate, and GTB fetches data as needed from the server. When GTB is showing a large genomic region, the data obtained is aggregated and is shown as a coverage histogram or feature spans. As a user zooms in, details for individual data points or features become available and are shown (Figure 2). GTB always obtains data from the server, not precomputed image tiles, so that a user can completely customize each data track's display as needed. For instance, users can view a continuously valued track as a histogram, line graph, filled line graph, or heatmap; users can further adjust the maximum, minimum, and total track height as well. For mapped read tracks and feature/annotation tracks, display modes include histogram, dense, squished, and packed.

Figure 2. GTB renders data differently depending on whether a large or small region is being viewed. Mapped RNA-seq reads are visible as a histogram when viewing a large region (top); zooming in to view a smaller region shows the read structure (middle) and, finally, the read labels (bottom). Users can set a desired level of detail as well.

3.2 Integrating Analysis and Visualization

GTB provides access to analysis tools in its visualizations, enabling users to run tools on currently visualized tracks to produce new datasets and tracks. GTB enables users to rerun the tool used to create a track using different parameter values for the visible genomic region. Rerunning a tool produces new output, and GTB automatically renders the new output when the tool is finished (Figure 3). Parameter values can be repeatedly changed and a tool can be rerun many times (see Figure 1 for an example). Running a tool and viewing its output can be done interactively (i.e., quickly) because a tool runs only on the subset of data visible to the user. This functionality is useful for seeing how parameter choices impact tool output, visually comparing output for different parameter values, and tuning values to return desired results. Once a user has chosen a set of parameter values, he can run the tool on the entire dataset, and Galaxy puts the tool output in his Galaxy analysis workspace.

For instance, transcript assembly via the Cufflinks tool [24] is highly parameter-dependent, generating different numbers of transcripts with different characteristics depending on parameter values. Rerunning Cufflinks in GTB using different parameter values can help make clear how parameters influence the assembly process. Also, a user can quickly generate different Cufflinks assemblies using different parameter values and visually compare the assemblies to choose the best one.

Figure 3. Analysis tools can be run in GTB: (1) UCSC knowngene annotation track; (2) assembled transcripts from RNA-seq data; (3) intersection of tracks 1 and 2, performed in Galaxy (not GTB) with the "for at least" (overlap) parameter set to 1 base; (4) interface for setting parameters and rerunning the Intersect tool; (5) the "for at least" (overlap) parameter is set to 4000 and the tool is run; (6) the tool is run with the new parameter values on the visible data and a new track is rendered.

GTB provides a generic framework for integrating tools that requires little or no additional work from tool developers. Using

this framework, a tool's GTB configuration is specified in its Galaxy definition file. After a tool's GTB configuration is specified, all tracks generated using the tool will automatically provide the option to rerun the tool. This approach does not require any setup or configuration by GTB users, and it ensures that users only run tools that are compatible with GTB. However, it does limit users to running tools that have been explicitly configured to work with GTB and, for security reasons, prevents users from running arbitrary tools within GTB. GTB's approach to tool integration is detailed in Section 4.2.

The following tools use GTB's integration framework and are currently available in GTB: (i) genomic interval operations such as intersecting, clustering, or subtracting interval sets; (ii) Cufflinks; and (iii) the Unified Genotyper, a SNP and indel caller [25]. GTB's tool integration framework makes it possible to use any Galaxy tool in a visualization provided it meets the following criteria: (a) it produces output that GTB can render and (b) it produces correct output when a subset of the input dataset is provided. Some bioinformatics tools require the complete input dataset to produce correct output; Section 6 discusses this issue in detail.

GTB provides dynamic filters that show and hide data elements in real time as users adjust them (Figure 4). Each filter specifies a range for a particular attribute value; a user can set a filter's range by using a two-handle slider or by clicking on its text label and typing in a new range. Data elements with attribute values outside the specified range are hidden. Filters are additive so that multiple filters can be applied simultaneously. GTB creates filters based on a track's type. Read tracks can be filtered by quality scores. Feature/annotation/interval tracks can be filtered by their score column. Tracks in GFF/GTF format can be filtered by score and by attributes that have numerical values.
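The range filters described above can be sketched in a few lines. This is a minimal illustration and not GTB's code (the GTB client is written in JavaScript); the names `RangeFilter` and `visible_elements`, the dictionary layout of a data element, and the choice to hide elements that lack the filtered attribute are all assumptions made for the example. "Additive" is interpreted here as: an element is shown only if it passes every active filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeFilter:
    """Hypothetical filter: keep elements whose attribute lies in [low, high]."""
    attribute: str
    low: float
    high: float

    def accepts(self, element: dict) -> bool:
        value = element.get(self.attribute)
        # Assumption: an element without the attribute is hidden.
        return value is not None and self.low <= value <= self.high


def visible_elements(elements, filters):
    """Apply all filters together: show an element only if every filter accepts it."""
    return [e for e in elements if all(f.accepts(e) for f in filters)]


# Made-up transcripts with a score and an FPKM value.
transcripts = [
    {"name": "t1", "score": 950, "FPKM": 42.0},
    {"name": "t2", "score": 300, "FPKM": 55.0},
    {"name": "t3", "score": 990, "FPKM": 1.5},
]
filters = [RangeFilter("score", 900, 1000), RangeFilter("FPKM", 10.0, 100.0)]
shown = visible_elements(transcripts, filters)
print([t["name"] for t in shown])  # ['t1']
```

Because filtering is a pure function of the cached elements and the current ranges, moving a slider only requires re-evaluating the predicate, with no new request to the server.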
Filters are useful for visually identifying data elements that meet certain criteria and for understanding the distribution of attribute values in a dataset. At any time, a user can create a Galaxy dataset of the filtered data that is visible or create a dataset by applying the filter to the whole track. Newly created datasets are available in the user's Galaxy workspace for use or download. Figure 4 demonstrates how a set of transcripts might be meaningfully filtered. First, transcripts are filtered by score, which is a measure of relative expression amongst a set of isoforms; next, transcripts are filtered by FPKM, a measure of overall transcript expression. The remaining transcripts are dominant, highly expressed isoforms.

Figure 4. GTB dynamic filters applied to a feature/annotation track: (1) filtered for higher scoring features and (2) filtered for features with higher FPKM (an abundance measure).

3.3 Sharing and Publishing Visualizations

GTB visualizations are Galaxy objects and, like all Galaxy objects, can be shared or published to the Web using Galaxy's sharing and publication features [14]. Users share GTB visualizations via a GUI; no programming or server configuration is needed to share a GTB visualization. GTB visualizations can be shared in multiple ways. Visualizations can be shared with individuals or can be made accessible on the Web via a URL. A visualization can also be published in Galaxy's public repository, where it is browsable and searchable. GTB visualizations can also be embedded in Galaxy Pages. Pages are custom Web-based documents that enable users to communicate an entire computational experiment using standard document elements such as text, tables, and figures as well as interactive embedded datasets, workflows, and visualizations. Pages are ideal for an online publication supplement. A shared GTB visualization can be viewed using only a Web browser and is fully functional.
A colleague or guest viewing a shared visualization can scroll, zoom, run tools, and dynamically filter data. No configuration is necessary, nor does any data or software need to be downloaded.

4 IMPLEMENTATION

The Galaxy Track Browser uses an asynchronous HTTP client-server model where the server is a Galaxy instance. The GTB client communicates with the server using seven distinct actions: (a) get chromosome lengths; (b) get available tracks; (c) get track definition; (d) get data; (e) get reference genome data; (f) run a tool on a track subset; and (g) save. Each action corresponds to an asynchronous HTTP request, and all data exchanged between client and server is encoded in JSON format. Both GTB and the Galaxy platform are open source under the Academic Free License [26].

4.1 Client

The GTB client is written entirely in object-oriented JavaScript. The client leverages JavaScript's ecosystem of libraries, APIs, and tools. The GTB client uses several jQuery libraries and adheres to CommonJS encapsulation and modularity principles so that it can be repurposed by other JavaScript applications.

The GTB client's most frequent action is drawing tracks. Hence, the client is optimized to draw tracks as quickly as possible while ensuring that each track is completely

customizable. To meet these goals, the client fetches and caches track data from the server and draws tracks itself. As discussed below, drawing track data is very fast. By starting with the track data, the client can draw a track using any configuration specified by the user. Redrawing a track due to a configuration change is also fast because cached data can very often be reused and data need not be fetched from the server.

The client renders genomic tracks as a set of adjacent tiles. This is advantageous because as the user zooms, pans, and scrolls, only new tiles need to be drawn; the client caches and reuses existing tiles when possible. Track tiles are drawn in the background and in parallel. Each tile uses its own request to obtain the data to be drawn on it. Drawing tiles in the background ensures that the GTB client is responsive to user interaction while drawing. Drawing tiles in parallel ensures that delays encountered when drawing a tile, such as network latency or drawing a large number of elements, do not impact the drawing of other tiles or tracks. However, additional code is needed to determine when all of a track's tiles have been drawn so that post-draw actions can be taken. Post-draw actions are used to animate the showing and hiding of data elements when a user is filtering data and to quickly fetch data when running tools.

The GTB client renders each tile as an HTML <canvas> element. Canvas elements provide the ability to dynamically and precisely draw using a 2D API; the canvas element is supported by recent versions of all major Web browsers. The main advantages of using the canvas element are the speed, precision, and scale at which data elements can be drawn as compared to using HTML elements to represent data elements. Using JavaScript, the GTB client can render up to 5000 elements per tile while supporting smooth navigation throughout the visualization.
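The tile reuse described above can be illustrated with a small sketch. It is written in Python for brevity, whereas the actual client is JavaScript drawing onto canvas elements, and it is synchronous, whereas GTB draws tiles in the background and in parallel. The tile width, the cache key, and the names `tiles_for_view` and `TileCache` are assumptions chosen for the example, not details taken from GTB.

```python
TILE_WIDTH_PX = 400  # hypothetical tile width in pixels


def tiles_for_view(start, end, bases_per_px, tile_width_px=TILE_WIDTH_PX):
    """Return (index, tile_start, tile_end) for every tile overlapping [start, end)."""
    tile_span = bases_per_px * tile_width_px  # genome bases covered by one tile
    first = start // tile_span
    last = (end - 1) // tile_span
    return [(i, i * tile_span, (i + 1) * tile_span) for i in range(first, last + 1)]


class TileCache:
    """Draw each tile at most once per track and resolution, then reuse it."""

    def __init__(self, draw_tile):
        self._draw_tile = draw_tile  # callback: (track, tile_start, tile_end) -> tile
        self._tiles = {}
        self.draw_count = 0

    def render_view(self, track, start, end, bases_per_px):
        drawn = []
        for index, tile_start, tile_end in tiles_for_view(start, end, bases_per_px):
            key = (track, bases_per_px, index)
            if key not in self._tiles:  # only tiles not seen before are drawn
                self._tiles[key] = self._draw_tile(track, tile_start, tile_end)
                self.draw_count += 1
            drawn.append(self._tiles[key])
        return drawn


# A stand-in "drawing" function that just labels the tile.
cache = TileCache(lambda track, s, e: f"{track}:{s}-{e}")
first_view = cache.render_view("reads", 0, 10_000, bases_per_px=10)
draws_after_first_view = cache.draw_count   # 3 tiles drawn
second_view = cache.render_view("reads", 4_000, 14_000, bases_per_px=10)  # pan right
print(draws_after_first_view, cache.draw_count)  # 3 4
```

Panning by one tile width triggers a single new draw; the two tiles that remain in view come from the cache. Zooming changes `bases_per_px`, so tiles at the new resolution are drawn afresh under a different key.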
The GTB client is modular and can be used with any server that implements the seven actions required by the client (listed previously). Tracks are added to a client by specifying a track definition that includes its name, dataset id, dataset type, filters, and tool (if there is one). Track filters and tools are structured so that the client can render and use them without requiring any a priori knowledge. Each filter includes an index into the track's data that denotes the value to use for filtering. A tool's parameters and inputs are encapsulated in an HTML form that the client can use as an interface that enables a user to set parameter values and run the tool.

4.2 GTB Server

The GTB server is integrated into the Galaxy framework as a controller in Galaxy's model-view-controller architecture and implements the actions needed to support the client. In order to provide data quickly to the client, the GTB server creates and uses multiple indices for each dataset in a GTB visualization. The GTB server manages index creation and associations between datasets and indices so that neither users nor the client are burdened. Indices for a dataset are created when a client requests them or when a client requests data from the dataset. BAI indices are created for SAM/BAM datasets using SAMtools [27], tabix [17] indices are created for interval files (including BED, GFF, and VCF), and bigWig [16] indices are created for continuously valued datasets. In addition, a summary tree index is created for all visualized datasets; a summary tree is a custom Galaxy format used to provide aggregated data over large genomic regions. The server stores associations between datasets and their indices so that, regardless of how often a dataset is used in visualizations, indices are created only once. For library datasets shared amongst all users of a Galaxy instance, indices are created once and shared as well.
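To make the multi-resolution idea concrete, the following sketch answers a region request either from precomputed aggregates (for large regions) or from the individual features (for small regions). It is an illustration only: the class `MultiResolutionTrack`, the fixed-size bins, and the span threshold are assumptions, and the in-memory dictionary and linear scan merely stand in for Galaxy's summary tree and for on-disk indices such as tabix or BAI.

```python
from bisect import bisect_left


class MultiResolutionTrack:
    """Hypothetical data provider with a summary level and a detail level."""

    def __init__(self, features, bin_size=1000, max_detail_span=5000):
        # features: (start, end, name) tuples with half-open coordinates
        self.features = sorted(features)
        self.starts = [f[0] for f in self.features]
        self.bin_size = bin_size
        self.max_detail_span = max_detail_span
        # Aggregation is done ahead of time: count features per bin (by start).
        self.summary = {}
        for start, _end, _name in self.features:
            b = start // bin_size
            self.summary[b] = self.summary.get(b, 0) + 1

    def get_data(self, start, end):
        if end - start > self.max_detail_span:
            # Large region: return (bin_start, count) pairs for a histogram.
            bins = range(start // self.bin_size, (end - 1) // self.bin_size + 1)
            data = [(b * self.bin_size, self.summary.get(b, 0)) for b in bins]
            return {"mode": "summary", "data": data}
        # Small region: return every feature overlapping [start, end).
        # The scan below is linear for simplicity; a real index avoids it.
        hi = bisect_left(self.starts, end)
        data = [f for f in self.features[:hi] if f[1] > start]
        return {"mode": "detail", "data": data}


track = MultiResolutionTrack(
    [(100, 200, "a"), (1500, 1600, "b"), (1550, 1700, "c"), (9000, 9100, "d")]
)
zoomed_out = track.get_data(0, 10_000)
zoomed_in = track.get_data(1400, 1650)
print(zoomed_out["mode"], zoomed_out["data"][:2])  # summary [(0, 1), (1000, 2)]
print(zoomed_in["mode"], zoomed_in["data"])
```

Both levels return data rather than images, which is what allows a client to redraw the same response as a histogram, line graph, or packed features without asking the server again.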
The GTB server takes multiple actions so that running a tool on a subset of data can be done quickly. The first time a tool is rerun, the server identifies all input datasets needed to produce the alternative dataset and creates indices for them. Indices are used to quickly extract data from the input datasets whenever they are used as inputs for a tool and hence reduce a tool's total execution time. Oftentimes indices will already be present because it is common to visualize both a tool's inputs and outputs, and the server will have created indices for all visualized datasets. To run a tool, the server uses indices to create small input datasets that contain only the data in the visible genomic region, runs the tool on these datasets via the Galaxy framework, and returns a track definition for the new output to the client.

A tool integrated with Galaxy can be used in GTB with minimal additional work. Any tool run from the command line can be integrated into Galaxy via a tool wrapper; a tool wrapper is an XML file that specifies a tool's parameters, inputs, and outputs. Adding the <trackster-conf> tag to a tool's wrapper indicates that it works correctly in GTB and will make the tool available in GTB. Tools compatible with GTB produce the same output for a particular genomic region regardless of whether the input is the subset of data from the region or the complete dataset. Section 6 discusses this issue in more detail.

Each track definition that the GTB server sends to the client includes both the track's filters and its tool. The server determines filters based on the dataset's type; for GTF datasets, the server includes all attribute values that are numerical and hence are filterable. The server reuses the Galaxy framework to generate tool definitions, including an interface for specifying parameters.

4.3 Performance

GTB's performance is dependent on numerous factors: client Web browser speed and memory, network latency, and server load and speed.
Performance profiling and analysis of GTB has largely focused on the client because Galaxy, acting as GTB's server, can be scaled as needed to effectively support significant Web traffic. A full evaluation of the GTB client's performance will be undertaken soon. However, informal evaluation indicates that (a) rendering takes less than 0.5 seconds per track for all track types at all levels of detail, including data-intensive tracks such as mapped reads and ESTs; and (b) data transfer time is significantly larger than rendering time. These observations suggest that optimizing GTB to mitigate data transfers may lead to large performance gains.

5 CONTRASTING GENOME BROWSERS

The Galaxy Track Browser complements other genome browser research by exploring an alternative approach that leverages Galaxy to provide novel genome browser functionality. This section contrasts GTB with six popular genome browsers that are compatible with NGS data: Ensembl [4], IGB [6], IGV [7], JBrowse [8], Savant [9], and UCSC [3]. Table 1 summarizes the functionality of GTB and these six browsers along several important dimensions.

The similarities amongst all browsers are indicative of core browser functionality. All browsers can obtain data both from a user's local computer and from remote sources via HTTP. Ensembl, IGB, and IGV also support visualizing data via the DAS protocol [28]. All browsers provide support for viewing mapped reads in BAM format, intervals or annotations in BED and GFF format, and continuously valued data in WIG format.

Full multi-resolution data models that support complete display customization are available in GTB, IGV, and Savant. IGB and JBrowse both use precomputed image data for some track types. IGB uses precomputed data to render tracks and does not provide customization options. JBrowse uses precomputed images rather

7 than data for quantitative (WIG) tracks. The three most recently developed browsers GTB, IGV, Can obtain data locally and from public sources and Savant all use full via HTTP and/or DAS multi-scale resolution data models. This trend Can view mapped reads (BAM), interval files is driven by user (BED, GFF), and continuously-valued data (WIG) demand for more Multi-resolution data model for all datatypes that interactive visualization supports complete display customization tools and by technological advances Run bioinformatics tools to produce and visualize in data indices. new data GTB s integration of analysis tools and its Filters for dynamically showing and hiding data dynamic filters are based on attribute values novel features not Share and customize fully interactive present in other genome visualizations browsers. Savant s plug-in framework has the potential to produce user interactions similar to GTB s tool use and filtering. Using Savant s framework, developers can write plug-ins that operate on visualized data; an example plugin is a SNP finder that highlights potential SNPs based on mapped reads. Both GTB and Savant aim to integrate analysis tools into a browser environment and each approach has advantages. Savant s plug-in framework is useful for prototyping new functionality because plug-in development is straightforward and requires little programming. Integrating tools into GTB/Galaxy is equally simple and potentially no programming is needed, but Galaxy requires that a tool run from the command line. GTB s strength is its integration into Galaxy, a fully functional analysis environment. Integration with Galaxy enables GTB to automatically leverage the tools already available in Galaxy without additional work by users or tool developers. Galaxy has a significant community of developers that have integrated many tools into Galaxy [12]. 
Making tools available in GTB is beneficial to tool developers because GTB provides access to their tools in a visualization environment. Users benefit because GTB is very usable: users can run tools and visualize output datasets without using a command line or installing any software. As tools are increasingly integrated into Savant and GTB, it may be appropriate to compare them not only with genome browsers but also with analysis environments that incorporate visualization components.

Many genome browsers enable users to share data or even complete visualizations. The UCSC genome browser supports sharing custom tracks, but not complete visualizations, via URL; its limitations include the deletion of tracks 48 hours after they were last accessed. IGB and IGV servers can be set up to make data publicly available or accessible via password. To use this data, IGB and IGV users must download and install the genome browser software and then find and add data from a public server. JBrowse and GTB make sharing data simpler because they are Web-based, and hence fully interactive visualizations can be shared via URL. However, GTB's approach to sharing and modifying visualizations is more flexible and user-friendly than JBrowse's. JBrowse visualizations are publicly accessible once they are created; GTB visualizations can be shared with individual users, made accessible via URL, or published to Galaxy's repository. Adding datasets to a JBrowse visualization requires that a user access the server to install and preprocess the new data. GTB users add datasets to a visualization using a GUI. GTB users need no programming experience to share and modify visualizations.

Table 1. Comparing genome browsers. Columns: Ensembl, GTB, IGB, IGV, JBrowse, Savant, UCSC.

In summary, GTB is the only Web-based browser to implement current best practices amongst genome browsers for fetching, representing, and displaying data. Also, GTB's interface provides unique features, enabling users to run tools to produce new data, dynamically filter data, and share fully interactive and customizable visualizations.

6 DISCUSSION

6.1 Making Visualization-compatible Analysis Tools

Interactive visualizations such as genome browsers are useful primarily because users can, in real time, manipulate data and receive feedback. Analysis tools, then, must run quickly if they are to be useful additions to genome browsers. Currently, many bioinformatics analysis tools can run for minutes or hours, especially when processing large datasets, and are therefore unsuitable for an interactive visualization. GTB addresses this issue by running an analysis tool on the subset of data for the genomic region being viewed. Seeing a tool's output for a chosen genomic region provides useful information while minimizing the tool's execution time. For many analysis tools, this approach is sufficient to ensure that runtime is on the order of seconds.

However, this approach is not compatible with all analysis tools. Some tools require not only input data for a particular genomic region but also data from other regions or from all regions. There are two common reasons that data outside a region of interest may be needed. First, a tool may build a global model as part of its execution; for instance, both the transcript assembly tool Cufflinks [24] and the ChIP-seq analysis program MACS [29] use global models to perform normalization. Second, a tool may require a complete input dataset because it is not possible to identify, prior to runtime, the subset of input data needed to produce correct output in a particular genomic region. This is true for the NGS read mapping tools Bowtie [30] and TopHat [31].

We are exploring two approaches to address this issue. One approach is to augment tools to store a global model for an input dataset and use the stored model when processing a subset of the input.
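
The stored-global-model idea can be illustrated with a minimal sketch. It assumes a hypothetical normalization tool whose global model is just the total read count of the dataset; the names `GlobalModel`, `build_global_model`, `reads_in_region`, and `run_on_region` are illustrative only and are not part of GTB, Galaxy, or Cufflinks. The model is computed once over the full input and stored; later runs see only the reads in the viewed region but normalize with the stored model.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A read is (chrom, start, end); coordinates are 0-based and half-open.
Read = Tuple[str, int, int]


@dataclass(frozen=True)
class GlobalModel:
    """Hypothetical stored global model: total reads in the full dataset."""
    total_reads: int


def build_global_model(all_reads: List[Read]) -> GlobalModel:
    """Run once over the complete input and keep the result for reuse."""
    return GlobalModel(total_reads=len(all_reads))


def reads_in_region(reads: List[Read], chrom: str, start: int, end: int) -> List[Read]:
    """Select the reads that overlap the genomic region being viewed."""
    return [r for r in reads if r[0] == chrom and r[1] < end and r[2] > start]


def run_on_region(region_reads: List[Read], model: GlobalModel) -> float:
    """Reads per million for a region, normalized with the stored model.

    Only the region's reads are processed; the global quantity comes from
    the stored model rather than from rescanning the whole dataset.
    """
    if model.total_reads == 0:
        return 0.0
    return len(region_reads) * 1_000_000 / model.total_reads


reads = [("chr1", 0, 50), ("chr1", 100, 150), ("chr1", 120, 170), ("chr2", 0, 50)]
model = build_global_model(reads)             # computed once, then stored
visible = reads_in_region(reads, "chr1", 90, 130)
rpm = run_on_region(visible, model)           # fast: touches only two reads
```

In this sketch the expensive step (scanning all reads) happens once, while each interactive request costs time proportional to the data in the viewed region.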
We have successfully applied this approach to modify Cufflinks to work with subsets of input data. As the use of analysis tools in genome browsers becomes more common, tool developers are likely to support stored global models and other approaches that enable tools to be run on subsets of data or particular genomic regions.

For tools that cannot be modified to run on subsets of input data, an alternative approach is to use dynamic filtering to simulate running a tool using different parameters. In this approach, a tool's parameters are relaxed so that all potential output is produced and attribute values are attached to output data. A user can then use attribute values to filter data and observe which data points would be produced for particular parameter values. This approach is ideal for tools that use parameter settings to omit data from their output, including Bowtie and TopHat.

6.2 Towards a Visual Analysis Environment

Currently, GTB users can rerun tools both to understand how tool parameter values influence tool output and to refine their selection of parameter values to achieve a desired output. This is a first step toward creating a general-purpose visual analysis environment. GUI environments for bioinformatics analyses such as Galaxy and GenePattern [32] have proven very popular because they make it easier for non-programmers to run bioinformatics tools and perform complete analyses. GUI environments represent inputs, tools, and outputs using text and utilize GUI widgets such as drop-down boxes and text fields that enable users to choose inputs and tools. A complementary approach to GUI environments is a visual analysis environment where a user performs the same actions visually. A visual analysis environment is another approach for making bioinformatics tools, and especially their outputs, more accessible.

We are extending GTB so that users can run any compatible tool on any set of input tracks. For example, to create a track of the intersecting regions for two annotation tracks, a user might select the intersect tool and then drag two tracks onto the new track. GTB would then run the tool and render the tool output. Repeated use of tools would produce more tracks, and tracks could be organized so that a user could scroll through the tracks to see the steps in her analysis.
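
As an illustration of what the intersect example computes, the following sketch derives the regions covered by both of two annotation tracks. It is not Galaxy's intersect tool; the function names are hypothetical, and intervals are assumed to be BED-like `(chrom, start, end)` tuples with 0-based, half-open coordinates. Overlapping intervals within a track are merged first so that a simple two-pointer sweep is sufficient.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def _merged_by_chrom(track: List[Interval]) -> Dict[str, List[List[int]]]:
    """Group a track by chromosome and merge overlapping or abutting intervals."""
    grouped: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in track:
        grouped[chrom].append((start, end))
    merged: Dict[str, List[List[int]]] = {}
    for chrom, spans in grouped.items():
        out: List[List[int]] = []
        for start, end in sorted(spans):
            if out and start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged


def intersect_tracks(track_a: List[Interval], track_b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both tracks, sorted by chromosome and start."""
    a_by_chrom = _merged_by_chrom(track_a)
    b_by_chrom = _merged_by_chrom(track_b)
    result: List[Interval] = []
    for chrom in sorted(a_by_chrom.keys() & b_by_chrom.keys()):
        a, b = a_by_chrom[chrom], b_by_chrom[chrom]
        i = j = 0
        while i < len(a) and j < len(b):
            start = max(a[i][0], b[j][0])
            end = min(a[i][1], b[j][1])
            if start < end:
                result.append((chrom, start, end))
            # Advance whichever interval ends first.
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
    return result


genes = [("chr1", 10, 50), ("chr1", 40, 80), ("chr2", 0, 10)]
peaks = [("chr1", 30, 60), ("chr1", 70, 100), ("chr3", 0, 5)]
new_track = intersect_tracks(genes, peaks)
```

The output list plays the role of the new track: each element is a region that GTB could render alongside the two input tracks.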
When inputs and tools can be selected and used in GTB, GTB will function very much like a standard GUI bioinformatics environment.

7 CONCLUDING THOUGHTS

Next-generation sequencing technologies and analysis tools present new challenges for genome browsers. NGS datasets are extremely large and growing, and full multi-resolution data models are needed to enable smooth interaction with such large datasets. NGS tools and analyses are complicated and often require use of both tools and visualizations to understand, debug, and improve. To support use of analysis and visualization together, genome browsers need to integrate tools and enable tools to be used to produce new data, all within a browser. NGS experiments are highly collaborative, and genome browsers should facilitate fast and simple sharing of visualizations; shared visualizations can also play a large role when publishing analyses.

The Galaxy Track Browser is a Web-based genome browser integrated into the Galaxy platform that addresses these challenges. GTB is the only Web-based genome browser that provides a full multi-resolution data model. GTB provides tools and dynamic filters that enable users to produce, visualize, and analyze data all within GTB. Using the Galaxy framework, fully functional GTB visualizations can be shared with colleagues or published to the Web. GTB is user friendly; it requires no programming experience to use, and all GTB functionality is available using only a Web browser.

REFERENCES

[1] M. S. Cline and W. J. Kent, "Understanding genome browsing," Nat. Biotechnol., vol. 27.
[2] C. B. Nielsen, et al., "Visualizing genomes: techniques and challenges," Nature Methods, vol. 7, pp. S5-S15.
[3] W. J. Kent, "The human genome browser at UCSC," Genome Res., vol. 12.
[4] P. Flicek, et al., "Ensembl 2011," Nucleic Acids Research, vol. 39, pp. D800-D806, January 1.
[5] L. D. Stein, et al., "The Generic Genome Browser: A Building Block for a Model Organism System Database," Genome Research, vol. 12.
[6] J. W. Nicol, et al., "The Integrated Genome Browser: free software for distribution and exploration of genome-scale data sets," Bioinformatics, vol. 25.
[7] J. T. Robinson, et al., "Integrative genomics viewer," Nat Biotech, vol. 29.
[8] M. E. Skinner, et al., "JBrowse: A next-generation genome browser," Genome Research, vol. 19.
[9] M. Fiume, et al., "Savant: genome browser for high-throughput sequencing data," Bioinformatics, vol. 26.
[10] E. R. Mardis, "Next-generation DNA sequencing methods," Annu. Rev. Genomics Hum. Genet., vol. 9.
[11] S. Pepke, et al., "Computation for ChIP-seq and RNA-seq studies," Nat Meth, vol. 6, pp. S22-S32.
[12] D. Blankenberg, et al., "A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly," Genome Research, vol. 17.
[13] B. Giardine, "Galaxy: a platform for interactive large-scale genome analysis," Genome Res., vol. 15.
[14] J. Goecks, et al., "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biology, vol. 11, p. R86.
[15] Z. Wang, et al., "RNA-Seq: a revolutionary tool for transcriptomics," Nat Rev Genet, vol. 10, Jan.
[16] W. J. Kent, et al., "BigWig and BigBed: enabling browsing of large distributed datasets," Bioinformatics, vol. 26.
[17] H. Li, "Tabix: fast retrieval of sequence features from generic TAB-delimited files," Bioinformatics, vol. 27.
[18] J. Tonti-Filippini. (April 21, 2011). Anno-J: Annotation Browsing 2.0. Available:
[19] K. A. Cook and J. J. Thomas, Illuminating the Path: The Research and Development Agenda for Visual Analytics.
[20] W. J. Kent, "BLAT--The BLAST-Like Alignment Tool," Genome Research, vol. 12.
[21] M. C. Schatz, et al., "Hawkeye: an interactive visual analytics tool for genome assemblies," Genome Biol., vol. 8, p. R34.
[22] J. P. A. Ioannidis, et al., "Repeatability of published microarray gene expression analyses," Nat Genet, vol. 41.
[23] "Devil in the details," Nature, vol. 470.
[24] C. Trapnell, et al., "Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation," Nat Biotech, vol. 28.
[25] A. McKenna, et al., "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Res, vol. 20, Sep.
[26] Open Source Initiative. (April 21, 2011). The Academic Free License 3.0. Available:
[27] H. Li, et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, vol. 25.
[28] R. Dowell, et al., "The Distributed Annotation System," BMC Bioinformatics, vol. 2, p. 7.
[29] Y. Zhang, et al., "Model-based Analysis of ChIP-Seq (MACS)," Genome Biology, vol. 9, p. R137.
[30] B. Langmead, et al., "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, p. R25.
[31] C. Trapnell, et al., "TopHat: discovering splice junctions with RNA-Seq," Bioinformatics, vol. 25.
[32] M. Reich, et al., "GenePattern 2.0," Nature Genetics, vol. 38.
