Relationship Between BED and WIG Formats


Pete E. Pascuzzi
July 2, 2015

This example illustrates the similarities and differences between the various ways to represent ranged data in R. In bioinformatics, ranged data is exemplified by any feature that can be mapped to a position in a specific genome, e.g. annotated genes, DNA methylation sites, or mapped RNA-seq data. Ranged data should have start and end coordinates in a specific genome and a chromosome number. Ideally, the specific build of that genome should be indicated as well, perhaps in the leading lines of a file or in a data field reserved for metadata.

Below, I am going to create a simple data frame that holds some ranged data similar to short reads from an NGS experiment. I will then make several data sets that are derived from the original data frame. Why should you do this? Because, to take advantage of some functions in Bioconductor, the data will often have to be stored as a particular class. For example, coverage will only work on data stored as ranges, in classes like IRanges or Views. Similarly, slice will only work on an Rle object. Finally, there are several standard file formats that are used to display data on genome browsers, including BED and WIG. I will illustrate how you can create files in these formats as well.

> ## Load the IRanges package (Bioconductor), which provides IRanges,
> ## coverage, slice and countOverlaps
> library(IRanges)
> ## Create data frame with short reads in 2 steps
> shortread.df <- data.frame(seq="chr4", start=c(6, 11, 21, 26, 41, 51, 56, 61,
+                            66, 76, 151, 156))
> ## make them all the same length
> shortread.df$end <- shortread.df$start + 49
> print(shortread.df)
    seq start end
1  chr4     6  55
2  chr4    11  60
3  chr4    21  70
4  chr4    26  75
5  chr4    41  90
6  chr4    51 100
7  chr4    56 105
8  chr4    61 110
9  chr4    66 115
10 chr4    76 125
11 chr4   151 200
12 chr4   156 205
> ## can we calculate the coverage from the data frame? No!
> ## shortread.cov <- coverage(shortread.df$start, shortread.df$end)
> ## create IRanges object from short read and give a name that
> ## indicates chr and read number.

> shortread.ir <- IRanges(start=shortread.df$start, end=
+                         shortread.df$end, names=paste("chr4",
+                         1:nrow(shortread.df), sep="_"))
> print(shortread.ir)
IRanges of length 12
     start end width   names
[1]      6  55    50  chr4_1
[2]     11  60    50  chr4_2
[3]     21  70    50  chr4_3
[4]     26  75    50  chr4_4
[5]     41  90    50  chr4_5
[6]     51 100    50  chr4_6
[7]     56 105    50  chr4_7
[8]     61 110    50  chr4_8
[9]     66 115    50  chr4_9
[10]    76 125    50 chr4_10
[11]   151 200    50 chr4_11
[12]   156 205    50 chr4_12
> ## Coverage will work now
> shortread.cov <- coverage(shortread.ir)
> print(shortread.cov)
integer-Rle of length 205 with 18 runs
  Lengths:  5  5 10  5 15 10 15  5 20 10  5  5  5 10 25  5 45  5
  Values :  0  1  2  3  4  5  6  7  6  5  4  3  2  1  0  1  2  1
> ## we can slice out regions in our data where the coverage
> ## is above, below or between a specified range. The result is an
> ## IRanges object.
> my.peaks <- slice(shortread.cov, lower=2, rangesOnly=TRUE)
> names(my.peaks) <- paste("chr4_peak", 1:length(my.peaks), sep="_")
> print(my.peaks)
IRanges of length 2
    start end width       names
[1]    11 115   105 chr4_peak_1
[2]   156 200    45 chr4_peak_2
> ## What if we wanted to count the number of reads mapping to a
> ## specific gene?
> my.gene <- IRanges(start=30, end=50, names="my.gene")
> print(my.gene)
IRanges of length 1
    start end width   names
[1]    30  50    21 my.gene
> my.gene.count <- countOverlaps(my.gene, shortread.ir)
> print(my.gene.count)
my.gene
      5
> ## You could also calculate the coverage across the gene,
> ## but coverage is NOT the way you should handle RNA-seq data
> ## for differential expression!
> my.gene.cov <- mean(shortread.cov[my.gene])
> print(my.gene.cov)
[1] 4.47619
> ## Can we do this on a vector of gene start and end positions?
> another.gene <- IRanges(start=100, end=200, names="another_gene")
> my.genes <- c(my.gene, another.gene)
> print(my.genes)
IRanges of length 2
    start end width        names
[1]    30  50    21      my.gene
[2]   100 200   101 another_gene
> my.genes.cov <- mean(shortread.cov[my.genes[1:2]])
> print(my.genes.cov)
[1] 2.040984
> ## No, we only get a single value, not two values. The value is
> ## different from the coverage of the first gene, so this must
> ## be the average coverage for both genes combined.
> ## To do multiple genes, we need to use lapply. We can split
> ## the IRanges object based on its names
> genes.ir.list <- split(my.genes, f=names(my.genes))
> my.genes.cov <- unlist(lapply(my.genes, function(x)(mean(shortread.cov[x]))))
> print(my.genes.cov)
     my.gene another_gene
    4.476190     1.534653
> ## Now, format data for BED file so that the reads can be
> ## displayed on a genome browser. Remember the zero-based
> ## indexing and remember to convert back if you want to use data
> ## derived from a BED file in R!
> shortreads.bed <- data.frame(chrom="chr4", chromStart=
+                              shortread.df$start - 1, chromEnd=
+                              shortread.df$end)
> bed.header <- "track type=bed name=short_reads description=simulated"
> write(bed.header, file="shortreads.bed", ncolumns=1)
> write.table(shortreads.bed, file="shortreads.bed", append=TRUE, sep="\t",
+             quote=FALSE, row.names=FALSE, col.names=FALSE)
> peaks.bed <- data.frame(chrom="chr4", chromStart=
+                         start(my.peaks) - 1, chromEnd=
+                         end(my.peaks))
> bed.header <- "track type=bed name=short_reads description=simulated"
> write(bed.header, file="peaks.bed", ncolumns=1)
> write.table(peaks.bed, file="peaks.bed", append=TRUE, sep="\t",
+             quote=FALSE, row.names=FALSE, col.names=FALSE)
> ## We can also produce a WIG file easily from the Rle object
> ## produced by coverage. Probably some way to do this from the
> ## data frame but this way is easy. We can calculate the positions
> ## where our values change from the run lengths, but we need to add
> ## the first position to the run length vector to get accurate
> ## positions on the chromosome. Also need to add a run value of zero
> ## to the end of the run values vector to let the WIG track fall
> ## back to zero.
> shortreads.wig <- data.frame(position=cumsum(c(1, runLength(
+                              shortread.cov))), values=c(runValue(
+                              shortread.cov), 0))
> wig.header <- "track type=wiggle_0 name=shortreads description=simulated"
> write(wig.header, file="shortreads.wig", ncolumns=1)
> my.declare <- "variableStep chrom=chr4"
> write(my.declare, file="shortreads.wig", ncolumns=1, append=TRUE)
> write.table(shortreads.wig, file="shortreads.wig", append=TRUE, sep="\t",
+             quote=FALSE, row.names=FALSE, col.names=FALSE)
> ## Now we can plot some of this data to see how they compare
> #pdf(file="example.pdf",w=6,h=6)
> par(lend="butt")
> plot(c(1,250), y=c(0,10), type="n", main="", xlab="position",
+      ylab="coverage")
> segments(start(shortread.ir), c(1:7,1:3,1:2), end(shortread.ir),
+          c(1:7,1:3,1:2), lwd=4)
> segments(peaks.bed$chromStart + 1, 9, peaks.bed$chromEnd, 9, lwd=4, col="blue")
> points(shortreads.wig, type="s", lwd=2, col="red")
> abline(v=seq(from=0, to=250, by=10), lty=2)
> legend("right", legend=c("read", "wig", "peak"), horiz=FALSE, lwd=2,
+        col=c("black", "red", "blue"), bty="o")
> #dev.off()
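
The WIG comments above mention that there is probably some way to get the same positions and values directly from the data frame. The following is a minimal base-R sketch of one such way, offered as an illustration rather than as the author's method. It needs no Bioconductor packages. The names reads, chrom.len, delta, cov.vec, cov.rle and wig.df are introduced here for the example, and the sketch assumes the region of interest ends at the last covered base.

```r
## Base-R sketch: per-base coverage from start/end columns (1-based, closed).
## All object names below are hypothetical and used only for this example.
reads <- data.frame(start = c(6, 11, 21, 26, 41, 51, 56, 61,
                              66, 76, 151, 156))
reads$end <- reads$start + 49

## Assumption: the region ends at the last covered base
chrom.len <- max(reads$end)

## +1 at every read start, -1 one base past every read end
delta <- tabulate(reads$start, nbins = chrom.len + 1) -
         tabulate(reads$end + 1, nbins = chrom.len + 1)

## The running sum of the changes is the coverage at each position
cov.vec <- cumsum(delta)[seq_len(chrom.len)]

## Run-length encode with base rle(), then build variableStep-style rows:
## each row is the position where a new run starts and the value of that run,
## with a trailing zero so the track falls back to zero after the last read
cov.rle <- rle(cov.vec)
wig.df <- data.frame(position = cumsum(c(1, cov.rle$lengths)),
                     value = c(cov.rle$values, 0))
print(wig.df)
```

The base rle() function plays the role that the Rle object plays in the IRanges version, so the position column is built with the same cumsum() step as in the transcript.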

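[Figure: coverage (y-axis) versus position (x-axis), showing the individual reads (black), the WIG coverage track (red) and the peaks (blue).]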
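
The BED export subtracts 1 from each start and leaves each end unchanged, because BED starts are zero-based and BED ends are exclusive, while R ranges are one-based with inclusive ends. The sketch below, in base R only, writes a few reads to a temporary BED file, reads them back, and converts the coordinates back again. It is an illustration under stated assumptions: only the first three reads are used, and the names reads, bed, bed.file, bed.in and back are hypothetical.

```r
## Round trip between R coordinates (1-based, closed) and BED coordinates
## (0-based start, exclusive end). Object names are hypothetical.
reads <- data.frame(start = c(6, 11, 21), end = c(55, 60, 70))

## R -> BED: subtract 1 from the start only
bed <- data.frame(chrom = "chr4",
                  chromStart = reads$start - 1,
                  chromEnd = reads$end)

## Write a track line followed by the tab-separated records
bed.file <- tempfile(fileext = ".bed")
write("track type=bed name=short_reads description=simulated",
      file = bed.file, ncolumns = 1)
write.table(bed, file = bed.file, append = TRUE, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

## Read the records back, skipping the single track line
bed.in <- read.table(bed.file, skip = 1, sep = "\t",
                     col.names = c("chrom", "chromStart", "chromEnd"),
                     stringsAsFactors = FALSE)

## BED -> R: add 1 back to the start only
back <- data.frame(start = bed.in$chromStart + 1, end = bed.in$chromEnd)

## The round trip recovers the original coordinates, and the BED width
## (chromEnd - chromStart) equals the R width (end - start + 1)
stopifnot(all(back$start == reads$start), all(back$end == reads$end))
stopifnot(all(bed.in$chromEnd - bed.in$chromStart ==
              reads$end - reads$start + 1))
print(back)
```

Forgetting the conversion in either direction shifts every feature by one base, which is easy to miss on a genome browser but changes overlap counts at feature boundaries.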
