IRanges and GenomicRanges An introduction

Similar documents
A quick introduction to GRanges and GRangesList objects

Handling genomic data using Bioconductor II: GenomicRanges and GenomicFeatures

Ranges (and Data Integration)

Outline Introduction Sequences Ranges Views Interval Datasets. IRanges. Bioconductor Infrastructure for Sequence Analysis.

Range-based containers in Bioconductor

High-level S4 containers for HTS data

Using the GenomicFeatures package

10 things (maybe) you didn t know about GenomicRanges, Biostrings, and Rsamtools

Representing sequencing data in Bioconductor

Saying Hello to the Bioconductor Ranges Infrastructure

Some Basic ChIP-Seq Data Analysis

Useful software utilities for computational genomics. Shamith Samarajiwa CRUK Autumn School in Bioinformatics September 2017

featuredb storing and querying genomic annotation

Package MethylSeekR. January 26, 2019

segmentseq: methods for detecting methylation loci and differential methylation

segmentseq: methods for detecting methylation loci and differential methylation

Working with aligned nucleotides (WORK- IN-PROGRESS!)

ExomeDepth. Vincent Plagnol. May 15, What is new? 1. 4 Load an example dataset 4. 6 CNV calling 5. 9 Visual display 9

Counting with summarizeoverlaps

Package bsseq. April 9, 2015

An Introduction to the GenomicRanges Package

Package rgreat. June 19, 2018

An Introduction to the genoset Package

ChIP-Seq Tutorial on Galaxy

Package DASiR. R topics documented: October 4, Type Package. Title Distributed Annotation System in R. Version

Bioconductor packages for short read analyses

Package dmrseq. September 14, 2018

An Overview of the S4Vectors package

Practical: Read Counting in RNA-seq

Visualisation, transformations and arithmetic operations for grouped genomic intervals

Using generxcluster. Charles C. Berry. April 30, 2018

Triform: peak finding in ChIP-Seq enrichment profiles for transcription factors

BadRegionFinder an R/Bioconductor package for identifying regions with bad coverage

Package CAGEr. September 4, 2018

Biostrings Lab (BioC2008)

An Introduction to VariantTools

Biostrings/BSgenome Lab (BioC2009)

Package ChIPseeker. July 12, 2018

ChIP-seq hands-on practical using Galaxy

Package seqcat. March 25, 2019

Prepare input data for CINdex

Package DMCHMM. January 19, 2019

Package BubbleTree. December 28, 2018

Package roar. August 31, 2018

Package EnrichedHeatmap

Relationship Between BED and WIG Formats

Introduction to Biocondcutor tools for second-generation sequencing analysis

Introduction to GenomicFiles

Advanced UCSC Browser Functions

An Introduction to IRanges

Package MIRA. October 16, 2018

Package gwascat. December 30, 2018

Package bsseq. April 14, 2017

Package genomeintervals

Overview of the intervals package

Overview of ensemblvep

Adam Mark, Ryan Thompson, Chunlei Wu

Working with ChIP-Seq Data in R/Bioconductor

panda Documentation Release 1.0 Daniel Vera

Package HTSeqGenie. April 16, 2019

Package GreyListChIP

epigenomegateway.wustl.edu

Using the Streamer classes to count genomic overlaps with summarizeoverlaps

PICS: Probabilistic Inference for ChIP-Seq

Package metagene. November 1, 2018

riboseqr Introduction Getting Data Workflow Example Thomas J. Hardcastle, Betty Y.W. Chung October 30, 2017

The rtracklayer package

Annotation resources - ensembldb

Package valr. November 18, 2018

Package FunciSNP. November 16, 2018

Package ChIC. R topics documented: May 4, Title Quality Control Pipeline for ChIP-Seq Data Version Author Carmen Maria Livi

When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame

Package methyanalysis

PODKAT. An R Package for Association Testing Involving Rare and Private Variants. Ulrich Bodenhofer

Rsubread package: high-performance read alignment, quantification and mutation discovery

fastseg An R Package for fast segmentation Günter Klambauer and Andreas Mitterecker Institute of Bioinformatics, Johannes Kepler University Linz

From genomic regions to biology

Rsubread package: high-performance read alignment, quantification and mutation discovery

Overview of ensemblvep

Package bnbc. R topics documented: April 1, Version 1.4.0

Copy number variant detection in exome sequencing data using exomecopy

Goldmine: Integrating information to place sets of genomic ranges into biological contexts Jeff Bhasin

MMDiff2: Statistical Testing for ChIP-Seq Data Sets

regioner: Association analysis of genomic regions based on permutation tests

Package TxRegInfra. March 17, 2019

Package DNAshapeR. February 9, 2018

Making and Utilizing TxDb Objects

Package Rsubread. July 21, 2013

Package MEAL. August 16, 2018

Adam Mark, Ryan Thompson, Chunlei Wu

Simple karyotypes visualization using chromdraw Jan Janečka Research group Plant Cytogenomics CEITEC, Masaryk University of Brno

MethylSeekR. Lukas Burger, Dimos Gaidatzis, Dirk Schübeler and Michael Stadler. Modified: July 25, Compiled: April 30, 2018.

Package VSE. March 21, 2016

Package GenomicRanges

Package Organism.dplyr

The UCSC Gene Sorter, Table Browser & Custom Tracks

Building and Using Ensembl Based Annotation Packages with ensembldb

Package TnT. November 22, 2017

CS313 Exercise 4 Cover Page Fall 2017

Package cleanupdtseq

Transcription:

IRanges and GenomicRanges An introduction Kasper Daniel Hansen <khansen@jhsph.edu> CSAMA, Brixen 2011 1 / 28

Why you should care IRanges and GRanges are data structures I use often to solve a variety of problems, typically related to annotating the genome. These data structures are very fast and efficient. Every R user who deals with genomic data should master these packages. These two packages are themselves compelling arguments for learning R! 2 / 28

Aims We provide an overview of the two packages IRanges and GenomicRanges. At first, these two packages can appear daunting, with many classes and functions. Luckily, things are not as hard as the first appear. Both packages now contain very extensive and useful vignettes. Read them! 3 / 28

Idea A surprising amount of objects/tasks in computational biology can be formulated in terms of integer intervals, manipulation of integer intervals and overlap of integer intervals. Objects: A transcript (a union of integer intervals), a collection of SNPs (intervals of width 1), TF binding sites, a collection of aligned short reads. Tasks: Which TF binding sites hit the promoter of genes (overlap between two sets of intervals), which SNPs hit a collection of exons, which short reads hit a predetermined set of exons. IRanges are collections of integer intervals. GRanges are like IRanges, but with an associated chromosome and strand, taking care of some book keeping. 4 / 28

Basic IRanges Specify IRanges by 2 of start, end, width (SEW). > library(iranges) > ir1 <- IRanges(start = c(1,3,5), end = c(3,5,7)) > ir1 IRanges of length 3 start end width [1] 1 3 3 [2] 3 5 3 [3] 5 7 3 > ir2 <- IRanges(start = c(1,3,5), width = 3) > all.equal(ir1, ir2) [1] TRUE Assessor methods: start(), end(), width() and also replacement methods > width(ir2) <- 1 > ir2 5 / 28

Basic IRanges IRanges of length 3 start end width [1] 1 1 1 [2] 3 3 1 [3] 5 5 1 They may have names > names(ir1) <- paste("a", 1:3, sep = "") > ir1 IRanges of length 3 start end width names [1] 1 3 3 A1 [2] 3 5 3 A2 [3] 5 7 3 A3 and has a single dimension > dim(ir1) NULL > length(ir1) [1] 3 6 / 28

Aside: GRanges GRanges are like IRanges but with a strand and a chromosome (seqnames) > gr <- GRanges(seqnames = "chr1", strand = c("+", "-", "+"), ran > gr GRanges with 3 ranges and 0 elementmetadata values seqnames ranges strand <Rle> <IRanges> <Rle> A1 chr1 [1, 3] + A2 chr1 [3, 5] - A3 chr1 [5, 7] + seqlengths chr1 NA There are some additional differences, we will return to GRanges later. 7 / 28

Normal IRanges A normal IRanges is a minimal representation of the IRanges viewed as a set. Each integer only occur in a single range and there are as few ranges as possible. In addition, it is ordered. Created by reduce(). ir 2 4 6 8 10 reduce(ir) 2 4 6 8 10 8 / 28

Normal IRanges > ir IRanges of length 4 start end width [1] 1 4 4 [2] 3 4 2 [3] 7 8 2 [4] 9 10 2 > reduce(ir) IRanges of length 2 start end width [1] 1 4 4 [2] 7 10 4 Which bases belong to an exon? 9 / 28

Disjoin ir 2 4 6 8 10 disjoin(ir) 2 4 6 8 10 Which bases belong to the same set of exons? 10 / 28

Manipulating IRanges shift(), narrow(), flank(), shift(), narrow(), restrict() union(), intersect(), setdiff() gaps() (returns normalized IRanges) punion(), pintersect(), psetdiff(), pgap() 11 / 28

Finding Overlaps Finding (pairwise) overlaps between two IRanges is done by findoverlaps. This function is amazingly fast! > ir1 <- IRanges(start = c(1,4,8), end = c(3,7,10)) > ir2 <- IRanges(start = c(3,4), width = 3) > ov <- findoverlaps(ir1, ir2) > ov An object of class "RangesMatching" Slot "matchmatrix": query subject [1,] 1 1 [2,] 2 1 [3,] 2 2 Slot "DIM": [1] 3 2 > matchmatrix(ov) 12 / 28

Finding Overlaps query subject [1,] 1 1 [2,] 2 1 [3,] 2 2 queryhits(), subjecthits() (often used with unique()) For efficiency, there is also countoverlaps() > countoverlaps(ir1, ir2) [1] 1 2 0 > args(findoverlaps) function (query, subject, maxgap = 0L, minoverlap = 1L, type = c( "start", "end", "within", "equal"), select = c("all", "first" "last", "arbitrary"),...) NULL 13 / 28

Finding nearest IRanges nearest(), precede(), follow(). Watch out for ties! > ir1 IRanges of length 3 start end width [1] 1 3 3 [2] 4 7 4 [3] 8 10 3 > ir2 IRanges of length 2 start end width [1] 3 5 3 [2] 4 6 3 > nearest(ir1, ir2) [1] 1 1 2 14 / 28

IRangesList An IRangesList is simply a list of IRanges. Why a separate class for this? Methods dispatching, we know what we get. Easy to specify input and output. Many XXList exists. One biology use case: the set of transcripts of a gene. Each transcript is an IRanges, but we need something to glue the transcripts together. 15 / 28

Rle Rle are run-length encoded vectors, which are very useful for representing coverage vectors efficiently. ir 2 4 6 8 10 > coverage(ir) 'integer' Rle of length 10 with 4 runs Lengths: 2 2 2 4 Values : 1 2 0 1 > as.vector(coverage(ir)) [1] 1 1 2 2 0 0 1 1 1 1 runlengths(), runvalues() 16 / 28

Computing on Rle Rle s can be extremely efficient, especially for really long vectors. They support are a wide variety of operations > coverage(ir1) + coverage(ir2) 'integer' Rle of length 10 with 7 runs Lengths: 2 1 2 1 2 1 1 Values : 1 2 3 2 1 2 3 > range(coverage(ir2)) [1] 0 2 > log(coverage(ir2)) 'numeric' Rle of length 6 with 4 runs Lengths: 2... Values : -Inf... aggregate() allows you to apply functions to the Rle inside an IRanges > aggregate(coverage(ir2), IRanges(start = 3, width = 4), + FUN = mean) [1] 1.5 17 / 28

Views Often we have a big object (think genome) and we are interested in subsets of this object. These subsets may be thought of as IRanges inside the object. Think exons and genome sequence. Key idea: store big object and just keep the ranges. > library(bsgenome.scerevisiae.ucsc.saccer1) > vi <- Views(Scerevisiae$chr1, + start = seq(1, by = 10000, length = 20), + width = 1000) > head(vi, n = 4) Views on a 230208-letter DNAString subject subject: CCACACCACACCCACACAC...TGGGTGTGGTGTGTGTGGG views: start end width [1] 1 1000 1000 [CCACACCACACC...GGGTGCTATGA] [2] 10001 11000 1000 [AGAATGAATCGG...TGGATATAAAT] [3] 20001 21000 1000 [ATTATAATCCTC...CGGTCTGGTGT] [4] 30001 31000 1000 [GATTTGTTCAAA...TTTTTTTTGTT] 18 / 28

Views Views can also be made on coverage vectors > coverage(ir) 'integer' Rle of length 10 with 4 runs Lengths: 2 2 2 4 Values : 1 2 0 1 > Views(coverage(ir), start = 3, width = 4) Views on a 10-length Rle subject views: start end width [1] 3 6 4 [2 2 0 0] 19 / 28

IRanges: important types of objects IRanges (and normal IRanges), IRangesList Rle Views RangedData: IRanges with metadata, think IRanges + data.frame These types of objects allow you to accomplish many useful tasks in a very efficient manner, but they sometimes require a bit creativity. 20 / 28

Return to GRanges GRanges are like IRanges with strand and chromosome. Strand can be +, - and * (unknown strand or unstranded). > gr GRanges with 3 ranges and 0 elementmetadata values seqnames ranges strand <Rle> <IRanges> <Rle> A1 chr1 [1, 3] + A2 chr1 [3, 5] - A3 chr1 [5, 7] + seqlengths chr1 NA strand(), seqnmes(), ranges(), start(), end(), width(). They operate within a universe of seqlengths. > seqlengths(gr) <- c("chr1" = 10) > gaps(gr) 21 / 28

Return to GRanges GRanges with 5 ranges and 0 elementmetadata values seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr1 [4, 4] + [2] chr1 [8, 10] + [3] chr1 [1, 2] - [4] chr1 [6, 10] - [5] chr1 [1, 10] * seqlengths chr1 10 And because the have strand, some operations are now relative to the direction of transcription (upstream(), downstream()): > flank(gr, 2, start = FALSE) 22 / 28

Return to GRanges GRanges with 3 ranges and 0 elementmetadata values seqnames ranges strand <Rle> <IRanges> <Rle> A1 chr1 [4, 5] + A2 chr1 [1, 2] - A3 chr1 [8, 9] + seqlengths chr1 10 Finally, GRanges may have associated metadata. > values(gr) <- DataFrame(score = c(0.1, 0.5, 0.3)) > gr 23 / 28

Return to GRanges GRanges with 3 ranges and 1 elementmetadata value seqnames ranges strand score <Rle> <IRanges> <Rle> <numeric> A1 chr1 [1, 3] + 0.1 A2 chr1 [3, 5] - 0.5 A3 chr1 [5, 7] + 0.3 seqlengths chr1 10 values(), elementmetadata() 24 / 28

Usecase I Suppose we want to identify TF binding sites that overlaps known SNPs. snps: GRanges (of width 1) TF: GRanges > findoverlaps(snps, TF) (watch out for strand) 25 / 28

Usecase II Suppose we have a set of DMRs (think genomic regions) and a set of CpG Islands and we want to find all DMRs within 10kb of a CpG Island. dmrs: GRanges islands: GRanges > findoverlaps(dmrs, resize(islands, + width = 20000 + width(islands), fix = "center")) (watch out for strand) 26 / 28

Usecase III Suppose we want to compute the average coverage of bases belonging to (known) exons. reads: GRanges exons: GRanges > bases <- reduce(exons) > nbases <- sum(width(bases)) > ncoverage <- sum(views(coverage(reads), bases)) > ncoverage / nbases (watch out for strand) 27 / 28

Session Info R version 2.14.0 Under development (unstable) (2011-06-20 r56188), x86_64-apple-darwin10.7.4 Locale: en_us.utf-8/en_us.utf-8/en_us.utf-8/c/en_us.utf- 8/en_US.utf-8 Base packages: base, datasets, graphics, grdevices, methods, stats, utils Other packages: Biostrings 2.21.6, BSgenome 1.21.3, BSgenome.Scerevisiae.UCSC.sacCer1 1.3.17, GenomicRanges 1.5.12, IRanges 1.11.10 Loaded via a namespace (and not attached): tools 2.14.0 28 / 28