panda Documentation Release 1.0 Daniel Vera


panda Documentation, Release 1.0
Daniel Vera
February 12, 2014

Contents

1 mat.make
    Usage and option summary
    Arguments
    Dependencies
    Examples
2 mat.heatmap
    Usage and option summary
    Default behavior - plot scores aligned at the 5' end of intervals in featurefiles
    count Report the count of hits from the annotation files
    both Report both the count of hits and the fraction covered from the annotation files
    s Restrict the reporting to overlaps on the same strand
    S Restrict the reporting to overlaps on the opposite strand
3 Indices and tables


CHAPTER 1
mat.make

mat.make creates two-dimensional matrices of scores aligned at user-specified genomic intervals. Each row in a matrix corresponds to a genomic interval, and each column corresponds to a distance from the aligned features. An example of a matrix created by this function is a matrix of ChIP-seq signals relative to transcription start sites. The matrices produced by this function can be used to generate heatmaps (mat.heatmap) and aggregate plots (mat.plotaverages), and for other useful tasks. A matrix is a convenient way of storing genomic data when the analysis of such data can be focused on particular regions.

When genomic intervals are used to generate scores, mat.make can, apart from calculating their density, also create a two-dimensional matrix listing the sizes of these intervals. This type of matrix is useful with paired-end MNase-seq or DNase-seq data, which can be used to create fragment size vs. distance plots (V-plots, Henikoff et al., 2009).

Input: The scores can be supplied in bedgraph, wiggle, or bigwig format. Alternatively, mat.make can create scores from genomic intervals in the form of interval densities (for example, the density of reads supplied in bed or bam format). Feature densities can be calculated from bed, bigbed, sam, bam, narrowpeak, or broadpeak files.

Output: Each matrix is named with the following convention: scorefile_featurefile.mat[windowsize]. For example, if the scores in H3K36me3signal.bw are aligned relative to TSSs.bed (the feature file) and the window size is 10 bp, the file for this matrix is named H3K36me3signal_TSSs.mat10. For a given set of features, the matrix for every dataset (scores) has identical dimensions, so different scores aligned at the same features can be compared easily. Score matrices aligned at a given set of features are saved in a directory bearing the name of the feature, with a suffix of _mat[windowsize].
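The naming convention above can be sketched in R as follows. This is only an illustration: mat_output_path is a hypothetical helper, not a panda function, and it assumes that the "name" of a score or feature file is its base name with the final extension removed.

```r
# Hypothetical helper (not part of panda): predicts the output path for one
# score file / feature file pair under the naming convention described above.
# Assumption: the name of a file is its base name without the last extension.
mat_output_path <- function(scorefile, featurefile, windowsize = 10) {
  # Drop the directory and the final extension from each input name.
  score_name   <- tools::file_path_sans_ext(basename(scorefile))
  feature_name <- tools::file_path_sans_ext(basename(featurefile))
  # Directory: feature name plus the suffix _mat[windowsize].
  out_dir  <- paste0(feature_name, "_mat", windowsize)
  # File: scorefile_featurefile.mat[windowsize].
  out_file <- paste0(score_name, "_", feature_name, ".mat", windowsize)
  file.path(out_dir, out_file)
}

mat_output_path("H3K36me3signal.bw", "TSSs.bed", windowsize = 10)
# returns "TSSs_mat10/H3K36me3signal_TSSs.mat10"
```

A helper like this can be handy for locating the matrices written by an earlier run before passing them on to a plotting step.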
1.1 Usage and option summary

Usage:

mat.make (scorefiles, features, closest = NULL, cores = "max", meta = FALSE, metaflank = 1000, maskbe

1.2 Arguments

Main options

scorefiles: A list of file names from which to calculate scores. Each file can be in one of the following formats: bed, bigbed, narrowpeaks, broadpeaks, bedgraph, wig, bigwig, bam.
featurefiles: A list of file names from which to create matrices. Each file can be in one of the following formats: bed, bigbed, narrowpeaks.
regionsize: Distance (in bp) around each feature from which to create the matrix. When meta = TRUE, size (in bp) of the meta-feature. Defaults to
windowsize: Size (in bp) of the nonoverlapping windows into which data are binned. Increasing windowsize proportionally decreases the number of columns in the matrix. Must be a factor of regionsize. Defaults to 10 (bp).

Alignment options

strand: When TRUE, reverses scores in rows of minus-stranded features.
start: Column in featurefiles on which to align the center of the matrix. When strand = TRUE, minus-stranded intervals are aligned by the column defined by stop. Ignored when meta = TRUE or narrowpeak = TRUE. Defaults to 2.
stop: When strand = TRUE, column in featurefiles on which to align the center of the matrix for minus-stranded features. Ignored when meta = TRUE or narrowpeak = TRUE.
prunefeaturesto: bed file name of intervals with which to intersect featurefiles before creating the matrix. Useful to remove features outside microarray or sequence-capture regions. Defaults to NULL.
narrowpeak: When TRUE, if a featurefile is a narrowpeak file, aligns the center of the matrix on the peak summit defined by column 10 in the narrowpeak file. Defaults to FALSE.
featurecenter: When TRUE, aligns the center of the matrix on the center of the intervals in featurefiles. Ignored when meta = TRUE or narrowpeak = TRUE. Defaults to FALSE.

Score options

maskbed: bed file name whose intervals are the only intervals that should have scores. Assigns NAs to all regions outside these intervals in the matrix. Used to prevent regions with no information from artificially being assigned zeroes (e.g., regions without microarray probes or sequence-capture probes). Defaults to NULL.
bgfiller: When a scorefile is a bedgraph, wig, or bigwig file, the value assigned in the matrix to windows that have no overlapping scores in the scorefile. Useful to prevent windows outside probed regions from artificially being assigned zeroes. Defaults to 0. NA is suggested for probe-specific data.
prunescores: When TRUE, removes scores in scorefiles that do not overlap with regions included in matrices. May speed up matrix creation for very large scorefiles. Defaults to FALSE.
rpm: When a scorefile is a bed, bigbed, or bam file, adjusts the calculated coverage in the matrix to RPM defined by the number of lines in the scorefile (RPM = coverage * / file lines). Defaults to FALSE.
scoremat: When TRUE, creates score matrices. Set to FALSE to create only fragment-size matrices (fragmats). Defaults to TRUE.

Misc. options

fragmats: Indices of the scorefiles for which to create fragment-size matrices for v-plots. Defaults to NULL.
closest: bed file name whose intervals are used to assign names to rows in the matrix, based on the interval nearest to each interval in featurefiles. Used, for example, to assign the closest gene's name to each transcription-factor binding site defined in featurefiles. Defaults to NULL.
cores: Number of scorefiles to process simultaneously for each featurefile. Defaults to max, or all but one core.
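Two of the arguments above imply simple arithmetic, sketched here in R. Both helpers are hypothetical and are not functions exported by panda. The sketch assumes that regionsize is the total width covered by the matrix, so the column count is regionsize / windowsize, and that the scaling constant missing from the RPM formula is one million, as the name reads-per-million suggests. The regionsize of 2000 is an arbitrary illustrative value.

```r
# Hypothetical helpers for illustration; not functions exported by panda.

# Number of matrix columns implied by regionsize and windowsize.
# Assumption: regionsize is the total width (in bp) covered by the matrix.
n_columns <- function(regionsize, windowsize = 10) {
  # windowsize must be a factor of regionsize, per the argument description.
  if (regionsize %% windowsize != 0) {
    stop("windowsize must be a factor of regionsize")
  }
  regionsize / windowsize
}

# RPM scaling. Assumption: the constant missing from the formula is 1e6.
to_rpm <- function(coverage, file_lines) {
  coverage * 1e6 / file_lines
}

n_columns(2000, 10)    # 200 columns
n_columns(2000, 20)    # 100 columns: doubling windowsize halves the columns
to_rpm(c(5, 10), 2e6)  # 2.5 5.0
```

Checking the factor rule before calling mat.make avoids discovering an incompatible windowsize only after a long run has started.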

Metafeature options

meta: Creates a meta-matrix, a matrix that aligns features by their 5' and 3' ends by scaling all features to the same size, as defined by regionsize (bp). Defaults to FALSE.
metaflank: When meta = TRUE, determines the distance from the meta-features (in bp) that defines the matrix boundary. Defaults to 1000.

1.3 Dependencies

bedtools 2.18: mat.make relies heavily on bedtools, a suite of genomic calculation software created by Aaron Quinlan.
kent source utilities: mat.make requires the kent source utilities if wiggle or bigwig formats are used. Written by Jim Kent of UCSC.

1.4 Examples

make a list of files which you would like to plot signal from
> scorefiles <- c( "h3k27me3-signal.bw", "ctcf-chipseq-signal.bg", "polii-chipseq-reads.bed", "htt

make a list of files which specify where to align scores at
> featurefiles <- c( "protein-coding-genes.bed", "start-codons.bed", "ctcf-binding-sites.narrowpeak

make matrix of data in scorefiles aligned at featurefiles
> mat.make ( scorefiles, featurefiles )

$ cat variants.bed
chr nasty 1 -
chr ugly 2 +
chr big 3 -

$ cat genes.bed
chr genea 1 +
chr geneb 2 +
chr genec 3 -

$ cat conserve.bed
chr cons1 1 +
chr cons2 2 -
chr cons3 3 +

$ cat known_var.bed
chr known1 -
chr known2 -
chr known3 +

$ bedtools annotate -i variants.bed -files genes.bed conserve.bed known_var.bed
chr nasty
chr ugly
chr big


CHAPTER 2
mat.heatmap

mat.heatmap draws heatmaps of matrices, sorted and/or grouped by user-defined criteria. It also plots aggregate profiles and, optionally, fragment-size plots.

2.1 Usage and option summary

Usage:

make a list of files which you would like to plot signal from
scorefiles <- c( "h3k27me3-signal.bw", "ctcf-chipseq-signal.bg", "polii-chipseq-reads.bed", "http:

make a list of files which specify where to align scores at
featurefiles <- c( "protein-coding-genes.bed", "start-codons.bed", "ctcf-binding-sites.narrowpeaks"

make matrix of data in scorefiles aligned at featurefiles
mat.make ( scorefiles, featurefiles )

Main options

mats: a vector of matrix file names from which to draw heatmaps.
sorting: a vector defining how to sequentially group and/or sort the data and, when applicable, which portion of the matrix to use in determining how to sort the data. Methods include kmeans, mean, median, min, max, minloc, maxloc, sd, and chrom. Each string can also give the left and right distances (in bp) from the center of the matrix to which the sorting/clustering method is limited. In each string in sorting, the sorting/clustering method, left distance, and right distance must be separated by commas and must not contain spaces. Clustering/grouping methods include kmeans and chromosome. Defaults to none (no sorting/clustering).
numgroups: a numeric vector that defines how many groups or clusters the rows are sequentially divided into, corresponding to the sorting/clustering methods defined in sorting. For kmeans, defines how many kmeans clusters are created. For all sorting methods, divides genes into equally sized groups based on the corresponding value in numgroups.
genegroups: a list of character vectors of gene names, defining the genes belonging to each group. Only used when sorting[1] is genelist.
normalize: logical. When TRUE, normalizes each score matrix row. For 1-tailed data, each row is divided by the mean of the row, and the entire matrix is multiplied by the mean of the matrix. For 2-tailed data, the row mean is subtracted from the scores in each row and the result is divided by the row standard deviation (z-score normalization). Defaults to FALSE.
cores: a natural number defining the number of scorefiles to process simultaneously for each featurefile. Defaults to max, or all but one core.
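The two row-normalization modes described for normalize can be sketched in R as below. This is a hypothetical re-implementation for illustration, not panda's own code; normalize_rows and its two_tailed argument are invented names. It assumes that "the mean of the matrix" refers to the mean of the matrix before the rows are divided by their means.

```r
# Hypothetical re-implementation for illustration; not panda's own code.
normalize_rows <- function(mat, two_tailed = FALSE) {
  if (two_tailed) {
    # z-score each row: (score - row mean) / row standard deviation.
    centered <- mat - rowMeans(mat, na.rm = TRUE)
    centered / apply(mat, 1, sd, na.rm = TRUE)
  } else {
    # Divide each row by its mean, then rescale by the mean of the
    # input matrix (assumed to be the mean before normalization).
    mat / rowMeans(mat, na.rm = TRUE) * mean(mat, na.rm = TRUE)
  }
}

m <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 2, byrow = TRUE)
normalize_rows(m)                     # both rows become 5.5 11.0 16.5
normalize_rows(m, two_tailed = TRUE)  # both rows become -1 0 1
```

In both modes the two example rows end up identical, which shows the purpose of the option: rows that differ only in overall magnitude become directly comparable in a heatmap.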

View options

defaultlims: a character string or vector of character strings that defines the range of values corresponding to the color gradient edges in heatmaps defined in plotcolors. Defaults to c("auto", "auto"), which uses the 3rd and 97th percentiles of each data set.
forcescore: logical. When TRUE, NAs are converted to zeroes before the heatmap is drawn. This prevents regions with no scores from showing as white, which may hinder or distract from the visualization of other colors in the heatmap. Defaults to TRUE.
plotcolors: a character string defining the colors used to create the color gradient of the heatmap (from low scores to high scores). Colors must be separated by spaces. Defaults to "white black", where white is the lowest score and black is the highest score.
centername: a character string naming the features to which the matrices are aligned, used to label the x-axis on aggregate plots.

V-plot options

fragmats: a vector of fragment-matrix file names from which to draw v-plots.
fragrange: range of fragment sizes that defines the y-axis in v-plots.
vdefaultlims: numeric vector of length 2 that defines the range of values corresponding to the color gradient edges in v-plots defined in vplotcolors. Defaults to c("auto", "auto"), which uses the 3rd and 97th percentiles of each data set.
vplotcolors: string defining the colors used to create the color gradient of the v-plot (from low scores to high scores). Colors must be separated by spaces. Defaults to "black blue yellow red", where black is the lowest score and red is the highest score.
rpm: logical. When TRUE, normalizes v-plot scores to RPM.

2.2 Default behavior - plot scores aligned at the 5' end of intervals in featurefiles

By default, the fraction of each feature covered by each annotation file is reported after the complete feature in the file to be annotated.
$ cat variants.bed
chr nasty 1 -
chr ugly 2 +
chr big 3 -

$ cat genes.bed
chr genea 1 +
chr geneb 2 +
chr genec 3 -

$ cat conserve.bed
chr cons1 1 +
chr cons2 2 -
chr cons3 3 +

$ cat known_var.bed
chr known1 -
chr known2 -
chr known3 +

$ bedtools annotate -i variants.bed -files genes.bed conserve.bed known_var.bed
chr nasty
chr ugly
chr big

count Report the count of hits from the annotation files

$ bedtools annotate -counts -i variants.bed -files genes.bed conserve.bed known_var.bed
chr nasty
chr ugly
chr big

both Report both the count of hits and the fraction covered from the annotation files

$ bedtools annotate -both -i variants.bed -files genes.bed conserve.bed known_var.bed
#chr start end name score +/- cnt1 pct1 cnt2 pct2 cnt3 pct3
chr nasty
chr ugly
chr big

s Restrict the reporting to overlaps on the same strand.

$ bedtools annotate -s -i variants.bed -files genes.bed conserve.bed known_var.bed
chr nasty
chr ugly
chr big

S Restrict the reporting to overlaps on the opposite strand.

$ bedtools annotate -S -i variants.bed -files genes.bed conserve.bed known_var.bed
chr nasty
chr ugly
chr big


CHAPTER 3
Indices and tables

genindex
modindex
search
