A short Introduction to UCSC Genome Browser

Elodie Girard, Nicolas Servant
Institut Curie/INSERM U900
Bioinformatics, Biostatistics, Epidemiology and computational Systems Biology of Cancer

Why use the UCSC Genome Browser?
- It hosts genomes from a variety of organisms, with several assembly versions for each genome.
- It gives access to a huge database of annotations and public data.
- You can load your own data from your computer or from a URL, and then make your tracks publicly available or keep them private.
- You can create your own session on UCSC, which lets you save your loaded tracks and share them easily with colleagues.

How to access the UCSC Genome Browser?
Open your web browser and go to: http://genome.ucsc.edu/index.html

How to learn more about the UCSC Genome Browser?
Go to: http://genome.ucsc.edu/training.html

How to create an account?
Go to: http://genome.ucsc.edu/cgi-bin/hgsession

Creating your session
If you upload custom tracks to the UCSC Genome Browser, you might want to save them on UCSC or share them with colleagues. In that case, registration is required: log in each time and save your current session under an appropriate name. This is not mandatory: you can use UCSC without having a session.

Signing in: in the menu bar, click on My Data, then on Session. Click on Create an account and follow the instructions. Once your account is activated, log in. You'll be redirected to the Session page.

Once you're logged in: the same page now shows two sections, My Sessions, where all saved sessions are listed, and Save Settings.
- To save your current session (public tracks and custom tracks, with their visibility status and order), enter a name, uncheck "allow this session to be loaded by others" if you want it to be private, and submit. The session then appears under My Sessions with its name, a creation date and options such as use and delete.
- To share a session with a colleague, change its sharing property by checking the option "shared with others?". You can then click on send email, and an email will be created with the URL of your session.
- To reset the browser to its default parameters, click on "Click here to reset".

UCSC: a powerful Genome Browser
In the left menu, click on Genome Browser. You can then choose the reference genome by selecting a group (e.g. Mammal), a genome (e.g. Human) and an assembly (e.g. Feb. 2009, also known as hg19). The position field shows the default viewpoint of the browser for this genome. To change it, enter either a position or a gene symbol in the search term box (e.g. chrX:200000-210000, or Xist). Click on Submit to access the browser.

Available tracks
Scroll down the page to see all tracks available by default.

Navigation
Many tracks are shown by default, and many others are hidden. Each track has a visibility status that determines how it is displayed:
- hide: the track is invisible
- dense: all features are collapsed into a single line
- squish: each feature is drawn separately, at 50% of the height of full mode and without labels
- pack: several features are displayed on each line, with labels
- full: each feature is shown on a separate line, with labels

A word of advice: if you want to view a large range of coordinates, first set the visibility of your tracks to dense; otherwise the display holds too much data for anything relevant to stand out.

Scroll down under the browser to see the list of tracks you can choose to display (for more information on a track, click on its name). For each track, select the visibility status, then hit the Refresh button. You can also right-click on a track directly in the browser and choose the visibility you want.

A few additional buttons are available under the browser frame:
- Track search: go to a search page to select additional public tracks.
- Default tracks: display only the default tracks.
- Default order: display the current tracks in their default order. You can change the order of the tracks by clicking on the name of the track you want to move, on the left side of the browser, then dragging it up or down.
- Hide all: set the visibility of all current tracks to hide.
- Add custom tracks: upload your own data or a public track of your choice. How to create suitable files and upload them is explained in detail later.
- Track Hubs: track data hubs are collections of tracks from outside UCSC that can be imported into the Genome Browser. For example, you can add the ENCODE Analysis Hub to your current tracks by checking it and then clicking on Use selected Hubs. Many tracks will be added to your browser; click on Default tracks to go back to the default parameters.

Several options exist to zoom in or out and to move along the chromosome.
- Above the image, use the arrow buttons to shift the view to the left or to the right by a fraction of the current range (<<<: 95%; <<: 50%; <: 10%). For example, with a range of 1 kb, the view shifts by 950 bp (95%), 500 bp (50%) or 100 bp (10%).
- You can also click inside the image and drag to the left or right to move.
- To zoom in or out, use the buttons above the image, or click on a base position to zoom in on it.
- Finally, you can hold the Shift key, click on the tracks and drag to the left or right to zoom in on the selected region.
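The shift arithmetic of the arrow buttons can be sketched in a few lines of Python. This is only an illustration of the percentages quoted above, not the browser's actual implementation: the function name shift_window, the half-open window convention and the clamping at position 0 are assumptions made for the example.

```python
def shift_window(start, end, percent, direction):
    """Shift the window [start, end) by `percent` of its width.

    `direction` is "left" or "right". The window keeps its width and is
    clamped so that it never starts before position 0 (an assumption of
    this sketch).
    """
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    width = end - start
    offset = width * percent // 100  # integer arithmetic, no rounding surprises
    if direction == "left":
        offset = -offset
    new_start = max(0, start + offset)
    return new_start, new_start + width


# A 1 kb window moved to the left with each of the three buttons.
window = (10_000, 11_000)
for label, percent in (("<<<", 95), ("<<", 50), ("<", 10)):
    print(label, shift_window(*window, percent, "left"))
```

With a 1 kb window, the three buttons move the view by 950, 500 and 100 bases, as in the example above.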

UCSC: Accessing the ENCODE data
Track search: enter the keyword H3K27me3 to see all related ENCODE tracks. You can then choose the tracks to visualize by checking the box at the beginning of the corresponding line(s).
Go back to the home page and click on the ENCODE link to explore the available ENCODE data.

UCSC: Loading custom tracks
In the menu below the browser, click on Add Custom Track. The page links to the description of all supported formats. The first thing to do is to pick a reference genome (clade: Mammal, genome: Human, assembly: Feb. 2009). You can then see the list of formats supported by UCSC: BED, bigBed, bedGraph, GFF, WIG, bigWig. Note that these formats are now standard for NGS data analysis.

All example files used in this section can be downloaded from Galaxy: in the menu bar, click on Shared data > Published Histories > UCSC visualization.

Files can be created in Excel or with the text editor of your choice. If you use Excel, remember to save your file as tab-delimited text. In a text editor, always separate the columns with a tab and not a space.

Warning: in the BED format, the first base of the genome is numbered 0 (and not 1 as in the GFF or GTF formats). The end position remains 1-based, so a feature covering the first 100 bases of a chromosome has start 0 and end 100.

BED format:
With a BED file, you can draw lines that are displayed as in an annotation track. The BED format requires a minimum of 3 columns:
- Chromosome name
- Starting position
- Ending position
Each row draws one line from the starting position to the base before the ending position. You can create your own rows or use these (file basic.bed):

chr9	20000	20100
chr9	20050	20080
chr9	24040	24077

The data can either be pasted directly into the box or uploaded from a local file. Your added track then appears in Custom tracks. The name and description are the default ones: User Track and User Supplied Track. Any additional track will therefore overwrite the current one. To create another custom track, the name and description have to be specified.
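The coordinate warning above is a frequent source of off-by-one errors, so here is a minimal Python sketch of the conversion from 1-based inclusive coordinates (as in GFF) to BED coordinates, followed by the tab-separated rows of basic.bed. The helper names to_bed_interval and bed_line are hypothetical and only serve the illustration.

```python
def to_bed_interval(first_base, last_base):
    """Convert 1-based inclusive coordinates (GFF style) to BED coordinates.

    BED numbers the first base 0 and the end position is not part of the
    feature, so only the start changes.
    """
    if first_base < 1 or last_base < first_base:
        raise ValueError("expected 1 <= first_base <= last_base")
    return first_base - 1, last_base


def bed_line(chrom, start, end):
    """Format one minimal BED row: 3 columns separated by tabs, not spaces."""
    return f"{chrom}\t{start}\t{end}"


# The three rows of basic.bed, already expressed in BED coordinates.
rows = [("chr9", 20000, 20100), ("chr9", 20050, 20080), ("chr9", 24040, 24077)]
bed_text = "\n".join(bed_line(*row) for row in rows) + "\n"
print(bed_text, end="")

# The first 100 bases of a chromosome, given as bases 1 to 100.
print(to_bed_interval(1, 100))
```

One convenient consequence of this convention is that the length of a feature is simply end minus start.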

The browser goes to the coordinates of the first interval (i.e. chr9:20000-20100), and the visualization parameters are the default ones: dense visibility and black color. You can navigate in the browser to have a better look at your data, or you can add some optional columns to your BED file.

Adding a header to the BED file:
- browser position chr9:19500-25000: defines the coordinates to focus on
- browser hide all: hides every other track (optional)
- browser pack refGene encodeRegions: lists, separated by spaces, all the tracks you want to see packed (you can add one such line for every visibility status)
- track type=bed name="Example" description="NGS training example" visibility=full itemRgb=On: the name lets you identify your track. The description is optional but allows you to add more information about the track. Set the visibility to the one you want (for example full), and set the color parameter itemRgb to On to add color to your track.

Adding an itemRgb column: the UCSC browser then expects a BED file composed of 9 columns:
- Chromosome name
- Starting position
- Ending position
- Name of the line
- Score (0 to 1000)
- Strand
- thickStart
- thickEnd
- itemRgb

Name of the line: displayed when the visibility is set to pack or full.
Score: use it if you don't want color but only levels of gray (higher numbers = darker gray). This requires adding useScore=1 to the header line starting with track. Set this column to 0 by default.
thickStart / thickEnd: positions at which the thick drawing of the line begins and ends. By default, set them to the starting and ending positions.
itemRgb: an RGB value of the form Red,Green,Blue (each from 0 to 255). For example, Red=255,0,0; Green=0,255,0; Blue=0,0,255; Black=0,0,0; White=255,255,255. You can mix the 3 colors with different values: Orange=255,102,51.

Here's a more complete BED file (called elaborate.bed):

browser position chr9:19500-25000
track type=bed itemRgb=On name="Example" description="NGS training example" visibility=full
chr9	20000	20100	Line1	0	+	20000	20100	255,0,0
chr9	20050	20080	Line2	0	-	20050	20080	0,255,0
chr9	24040	24077	Line3	0	+	24040	24077	0,0,255
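When a file has many rows, the 9-column layout is easier to generate than to type. The following Python sketch builds a track line and 9-column rows such as those of elaborate.bed. The helpers track_line and bed9_row are hypothetical names, and two choices are assumptions of this sketch: values containing spaces are wrapped in double quotes, and thickStart/thickEnd default to the start and end positions as advised above.

```python
def track_line(**attributes):
    """Build a 'track' header line; values containing spaces are quoted."""
    parts = ["track"]
    for key, value in attributes.items():
        text = str(value)
        if " " in text:
            text = f'"{text}"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


def bed9_row(chrom, start, end, name, strand, rgb, score=0):
    """Build one 9-column BED row with thickStart/thickEnd set to start/end."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if len(rgb) != 3 or not all(0 <= channel <= 255 for channel in rgb):
        raise ValueError("rgb must hold three values between 0 and 255")
    fields = [chrom, start, end, name, score, strand, start, end,
              ",".join(str(channel) for channel in rgb)]
    return "\t".join(str(field) for field in fields)


lines = [
    "browser position chr9:19500-25000",
    track_line(name="Example", description="NGS training example",
               visibility="full", itemRgb="On"),
    bed9_row("chr9", 20000, 20100, "Line1", "+", (255, 0, 0)),
    bed9_row("chr9", 20050, 20080, "Line2", "-", (0, 255, 0)),
    bed9_row("chr9", 24040, 24077, "Line3", "+", (0, 0, 255)),
]
print("\n".join(lines))
```

The printed text can be pasted into the custom track box or saved to a file before upload.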

BedGraph format:
This display type is useful to represent quantitative data such as expression levels. It draws a bar graph over the specified chromosome segments. A bedGraph file is composed of 4 columns:
- Chromosome name
- Starting position
- Ending position
- Data value (+ or -)
A header is mandatory to specify at least the bedGraph track type:

browser position chr9:19500-25000
browser hide all
browser pack refGene encodeRegions
track type=bedGraph name="Example bedgraph" description="NGS training example bedgraph" visibility=full color=255,0,0 altColor=125,125,125

Note that only two colors can be specified for a bedGraph: color for the positive values and altColor for the negative ones.
As an example, open the file elaborate.bedgraph:

browser position :49302001-49304701
browser hide all
browser pack refGene encodeRegions
track type=bedGraph name="Example bedgraph" description="NGS training example bedgraph" visibility=full color=255,0,0 altColor=125,125,125
49302000	49302300	-1.0
49302300	49302600	-0.75
49302600	49302900	-0.50
49302900	49303200	-0.25
49303200	49303500	0.0
49303500	49303800	0.25
49303800	49304100	0.50
49304100	49304400	0.75
49304400	49304700	1.00

Wiggle format (Wig):
It is useful to display dense, continuous data such as GC percent, probability scores and coverage data. It draws a bar graph similarly to the bedGraph format but has one constraint: all data elements must have the same size (the width of the bar is fixed). There are two options to format wiggle data: variableStep and fixedStep.

variableStep is used for data with irregular intervals between bars. It is composed of one header line specifying the chromosome and the span (width of the bar), followed by 2 columns:

variableStep chrom= span=150
Starting position of the bar	Data value (+ or -)

fixedStep is used for data with regular intervals between bars. It is composed of one header line specifying the chromosome, the starting coordinate, the step and the span (width of the bar), followed by 1 column:

fixedStep chrom= start=40000 step=300 span=150
Data value (+ or -)

As for bedGraph, you have to add a header specifying at least the track type:

browser position :40000-50000
browser hide all
browser pack refGene encodeRegions
track type=wiggle_0 name="Example wig" description="NGS training example wig" visibility=full color=255,0,0 altColor=125,125,125

Two examples: fixedstep.wig and variablestep.wig

fixedstep.wig:

browser position :40000-46000
browser hide all
browser pack refGene encodeRegions
track type=wiggle_0 name="Example fixedstep wig" description="NGS training example fixedstep wig" visibility=full color=255,0,0 altColor=125,125,125
fixedStep chrom= start=40000 step=300 span=200
30
25
18
2
0
-10
-2
0
20

variablestep.wig:

browser position :40000-46000
browser hide all
browser pack refGene encodeRegions
track type=wiggle_0 name="Example variablestep wig" description="NGS training example variablestep wig" visibility=full color=255,0,0 altColor=125,125,125
variableStep chrom= span=200
40150	10
40666	-12.5
40963	15
42598	17.5
42892	20
45000	-17.5
45200	-15

[Figure: browser views of the fixedStep and variableStep example tracks]
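To see how a fixedStep declaration maps to bars on the genome, the following Python sketch expands the fixedStep example (start=40000, step=300, span=200) into bedGraph-style intervals. It relies on the fact that wiggle positions are 1-based whereas bedGraph starts are 0-based, hence the shift by one. The function name fixed_step_to_intervals is hypothetical, and the chromosome column is deliberately left out: it would have to be prepended to each row to obtain a valid bedGraph file.

```python
def fixed_step_to_intervals(start, step, span, values):
    """Expand fixedStep wiggle data into (bed_start, bed_end, value) tuples.

    `start` is the 1-based position of the first bar, as in the wiggle
    header. The returned intervals use the 0-based, end-exclusive
    convention of BED and bedGraph.
    """
    if span > step:
        raise ValueError("span larger than step: bars would overlap")
    intervals = []
    for index, value in enumerate(values):
        first_base = start + index * step  # 1-based first base of this bar
        bed_start = first_base - 1         # 0-based start
        intervals.append((bed_start, bed_start + span, value))
    return intervals


# Data values of the fixedStep example.
values = [30, 25, 18, 2, 0, -10, -2, 0, 20]
intervals = fixed_step_to_intervals(40000, 300, 200, values)
for bed_start, bed_end, value in intervals:
    print(f"{bed_start}\t{bed_end}\t{value}")
```

Each bar is 200 bases wide and a new bar begins every 300 bases, which leaves a 100-base gap between consecutive bars.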

bigBed and bigWig formats:
If you have a very large dataset, uploading it to UCSC can take too long and can even fail. A more practical way to add data to UCSC is to use the indexed binary formats bigBed and bigWig. The data remain on your web server (http, https or ftp), and only the portions of the file needed to display a particular region are transferred to UCSC. Command-line tools and Galaxy utilities are usually available to generate these files. Once the files are available through a web server, you simply have to paste a track line containing the file URL:

track type=bigBed name="bigbed ex" description="bigbed" visibility=full bigDataUrl=http://mywebsite/wgencodehaibtfbsa549atf3v0422111etoh02pkrep1.bb

UCSC: Extracting information
- Using the download website: http://hgdownload.soe.ucsc.edu/downloads.html
- Using the Table Browser

A concrete example
You are going to use data from an ENCODE ChIP-seq experiment to observe the binding sites of GATA-3, the same transcription factor visualized in IGV, but in a different cell line, T-47D. These binding sites are determined by first mapping the reads to the genome, then finding the enriched regions of high read density (peaks).

- The ENCODE tracks need to be loaded: go to My Data > Track Hubs and select the ENCODE Analysis Hub.
- Scroll down under the browser and set all tracks to hide. Make sure the RefSeq Genes and ENCODE Transcription Factors tracks are set to dense.
- Click on ENCODE Transcription Factors in the ENCODE Analysis Hub section. Just under "Select subtracks by factor and cell line", click on the minus icon before All. Check the box corresponding to the GATA-3 factor and the T-47D cell line. Scroll down, set the visibility of the Signal to full and click on Submit.
- 3 tracks are now visible in your browser: 2 of peaks and 1 of signal. Go to chr2:190351700-190353200 to zoom in on 2 nice peaks. Peaks characterized as enriched regions have a bar in the peaks tracks.
- Both peaks seem to be cut off. You can change the parameters of the track by right-clicking on it, then on Configure T-47D GATA3 Sg, and adjusting the track height and vertical range.

As you can see, these peaks do not intersect with genes. To extract all peaks from this track that overlap RefSeq genes by at least 80% on chromosome 2, use Tools > Table Browser:
- First, select the right track in the ENCODE Analysis Hub group (all parameters are shown below if you need some help). Scroll down at least halfway through the list to find the track corresponding to the GATA-3 peaks in the T-47D cell line.
- Create an intersection: choose the track corresponding to the RefSeq genes and an overlap of at least 80%.
- Choose the output format "hyperlinks to Genome Browser". This way, every intersection found is shown as a hyperlink leading directly to the right position in the genome browser.
- Then click on Get output.

The peaks we found earlier at position chr2:190351700-190353200 are not present in the list of coordinates: they do not overlap a RefSeq gene by at least 80%. Click on any entry to open it in the genome browser, and play with the parameters to get a better visualization (vertical range, zoom, track visibility...).

Exercise: open the intersection at position chr2:86693491-96693887 and find which gene intersects with this peak.
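The 80% criterion used in the Table Browser step can be illustrated with a short Python sketch. It assumes that the percentage is measured as the fraction of the peak's bases covered by genes, which may not match the Table Browser computation in every detail. The function name overlap_fraction and all coordinates below are hypothetical and are not taken from the ENCODE tracks.

```python
def overlap_fraction(feature, others):
    """Fraction of `feature` covered by the union of the intervals in `others`.

    All intervals are (start, end) pairs in the 0-based, end-exclusive
    convention of BED. Overlapping intervals in `others` are counted once.
    """
    start, end = feature
    if end <= start:
        raise ValueError("feature must have a positive length")
    covered = 0
    cursor = start  # everything before the cursor is already accounted for
    for other_start, other_end in sorted(others):
        low = max(other_start, cursor)
        high = min(other_end, end)
        if high > low:
            covered += high - low
            cursor = high
    return covered / (end - start)


# Hypothetical peak and genes, for illustration only.
peak = (100, 200)
half_covered = overlap_fraction(peak, [(150, 400)])
mostly_covered = overlap_fraction(peak, [(0, 120), (110, 190)])
print(half_covered >= 0.8, mostly_covered >= 0.8)
```

A peak that sits in an intergenic region, like the two peaks examined earlier, has a fraction of 0 and is therefore filtered out.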