CNV-seq Manual. Xie Chao. May 26, 2011

Similar documents
HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)

seqcna: A Package for Copy Number Analysis of High-Throughput Sequencing Cancer DNA

Ensembl RNASeq Practical. Overview

Running SNAP. The SNAP Team October 2012

An Introduction to Linux and Bowtie

Handling sam and vcf data, quality control

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

ChIP-Seq Tutorial on Galaxy

Running SNAP. The SNAP Team February 2012

Practical Bioinformatics for Life Scientists. Week 4, Lecture 8. István Albert Bioinformatics Consulting Center Penn State

Tiling Assembly for Annotation-independent Novel Gene Discovery

CONSERTING User Documentation

Part 1: How to use IGV to visualize variants

Assignment 7: Single-cell genomics. Bio /02/2018

Identiyfing splice junctions from RNA-Seq data

Sequence Analysis Pipeline

Genomic Files. University of Massachusetts Medical School. October, 2014

Advanced UCSC Browser Functions

Our data for today is a small subset of Saimaa ringed seal RNA sequencing data (RNA_seq_reads.fasta). Let s first see how many reads are there:

4.1. Access the internet and log on to the UCSC Genome Bioinformatics Web Page (Figure 1-

1. mirmod (Version: 0.3)

Bioinformatics in next generation sequencing projects

EPITENSOR MANUAL. Version 0.9. Yun Zhu

m6aviewer Version Documentation

Analysis of ChIP-seq data

protrac version Documentation -

INTRODUCTION AUX FORMATS DE FICHIERS

RNA-seq. Manpreet S. Katari

NGS Data Analysis. Roberto Preste

metilene - a tool for fast and sensitive detection of differential DNA methylation

Rsubread package: high-performance read alignment, quantification and mutation discovery

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

MetaPhyler Usage Manual

Introduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

ChIP-seq (NGS) Data Formats

panda Documentation Release 1.0 Daniel Vera

EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1

Programming Languages and Uses in Bioinformatics

Rsubread package: high-performance read alignment, quantification and mutation discovery

User's guide to ChIP-Seq applications: command-line usage and option summary

Annotating a single sequence

SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012

Easy visualization of the read coverage using the CoverageView package

Helpful Galaxy screencasts are available at:

ASAP - Allele-specific alignment pipeline

protrac version Documentation -

Read Naming Format Specification

cgatools Installation Guide

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub

Copy Number Variations Detection - TD. Using Sequenza under Galaxy

Triform: peak finding in ChIP-Seq enrichment profiles for transcription factors

Lecture 12. Short read aligners

Genomic Files. University of Massachusetts Medical School. October, 2015

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Browser Exercises - I. Alignments and Comparative genomics

v0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting

Package GOTHiC. November 22, 2017

RNA-Seq data analysis software. User Guide 023UG050V0200

Online Backup Client User Manual

Welcome to GenomeView 101!

Using Pipeline Output Data for Whole Genome Alignment

Genetics 211 Genomics Winter 2014 Problem Set 4

Supplementary Figure 1. Fast read-mapping algorithm of BrowserGenome.

replace my_user_id in the commands with your actual user ID

Package conumee. February 17, 2018

ChIP-seq hands-on practical using Galaxy

Package saascnv. May 18, 2016

SAM and VCF formats. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

Genome Browsers - The UCSC Genome Browser

Mar. Guide. Edico Genome Inc North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

Variant calling using SAMtools

RNA Alternative Splicing and Structures

Huber & Bulyk, BMC Bioinformatics MS ID , Additional Methods. Installation and Usage of MultiFinder, SequenceExtractor and BlockFilter

This module contains three plugins: Decouple.pl, Add.pl and Delete.pl.

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Tn-seq Explorer 1.2. User guide

Sep. Guide. Edico Genome Corp North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

User Guide for Tn-seq analysis software (TSAS) by

2 Algorithm. Algorithms for CD-HIT were described in three papers published in Bioinformatics.

Read mapping with BWA and BOWTIE

v0.3.0 May 18, 2016 SNPsplit operates in two stages:

Circ-Seq User Guide. A comprehensive bioinformatics workflow for circular RNA detection from transcriptome sequencing data

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

CLC Server. End User USER MANUAL

RNA-Seq data analysis software. User Guide 023UG050V0100

KGBassembler Manual. A Karyotype-based Genome Assembler for Brassicaceae Species. Version 1.2. August 16 th, 2012

Exome sequencing. Jong Kyoung Kim

CGA Tools User Guide Software Version 1.8

Next-Generation Sequencing applied to adna

Data Walkthrough: Background

How to use the DEGseq Package

Release Note. Agilent Genomic Workbench 6.5 Lite

User Manual of ChIP-BIT 2.0

Under the Hood of Alignment Algorithms for NGS Researchers

Sequence Mapping and Assembly

The analysis of acgh data: Overview

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

Introduction to Matlab. Sasha Lukyanov, 2018 Xenopus Bioinformatics Workshop, MBL, Woods Hole

A short Introduction to UCSC Genome Browser

Practical Course in Genome Bioinformatics

Transcription:

CNV-seq Manual Xie Chao May 26, 20 Introduction acgh CNV-seq Test genome X Genomic fragments Reference genome Y Test genome X Genomic fragments Reference genome Y 2 Sampling & sequencing Whole genome microarray Hybridization 3 Mapping Template genome Test Fluorescence measurement in each array feature Ref 4 Mapped read count in each sliding window Test Ref Data analysis 5 Data analysis log 2 ratio 0 Genomic Position 6 log 2 ratio 0 Genomic Position CNV-seq is a method for detecting DNA copy number variation (CNV) using highthroughput sequencing, described by []. This method provides a statistical framework to estimate parameters (more details can be found in []). The package here is an xie at nus.edu.sg or xiechaos at gmail.com

implementation of the CNV-seq method. configurations: We tested this package in the following OS: Mac OS X leopard, Ubuntu Hardy, CentOS 4.3, Debian 5 Perl: 5.8.8, 5.0.0 R: 2.0. ggplot2: 0.8.5 2 Install This package contains two Perl scripts and one R package. To use the CNV-seq package, you need to install Perl and R first. Then you can download the CNV-seq package from http://tiger.dbs.nus.edu.sg/cnv-seq. After downloading the package, open a command line console and run: $ tar xzf cnv-seq.tar.gz $ cd cnv-seq $ ls There are several Perl scripts (best-hit.*.pl and cnv-seq.pl) and one R package cnv in this directory. To install the two Perl scripts, simply move or copy the them to your desired location. To install the R package run: $ R CMD INSTALL cnv/ Please note that the R package cnv requires package ggplot2 for plotting CNV graphs, you may want to install ggplot2 if you need the plotting function in the cnv package. 3 Usage 3. best-hit.*.pl The only requirement for CNV-seq is best-hit location files for each mapped sequence read, in the following format: 234 23456 chr2 999 chr2 999 X 234 2

We provide best-hit.*.pl for obtaining the best mapping locations for several alignment tools. Currently it only support BLAT psl file and SOLiD matching pipeline (in corona_lite) output as input. To extract map locations from a *.bam file (by SAM tools), try this command: samtools view -F 4 my.bam perl -lane 'print "$F[2]\t$F[3]"' > my.hits More input formats from various alignment or mapping programs will be included in the future. You need to generate your own best-hit file if you are using other alignment tools. 3.2 cnv-seq.pl cnv-seq.pl is used to calculate sliding window size, to count number of mapped hits in each window, and to call cnv R package to calculate log2 ratios and annotate CNV. You can get the usage of cnv-seq.pl by run the script without any arguments: $./cnv-seq.pl usage: cnv-seq.pl [options] --test = test.hits.file --ref = ref.hits.file --log2-threshold = number (default=0.6) --p-value = number (default=0.00) --bigger-window = number (default=2, in order to use larger window than minimum) --genome = (human, chicken, chrom, autosome, sex, chromx, chromy) (higher priority than --genome-size) --genome-size = number (in bases, overwiten by --genome option) --window-size = number (overwrite window size calculation) --global-normalization (if used, normalization on whole genome, instead of on each chromosome) --annotate --no-annotate 3

(default do annotation) --minimum-windows-required = number (default=4; only for annotation of CNV) --Rexe = path to your R program (default=r) --help The option --test and --ref are required. Either --genome or --genome-size option must be specified too. All other options have default values as shown above. The test and ref options accept the best hit files outputed from best-hit.*.pl for the test and reference individuals. 3.3 R package cnv The cnv-seq.pl will call the R package cnv by default, and a tab-delimited file containing the log2 ratios and (optionally) CNV annotation will be ouputed. However, in order to achieve the full power of the cnv package, you are strongly recommended to run the cnv package from R by yourself. The cnv package contains several functions: cnv.cal <- function (file, log2.threshold = NA, chromosomal.normalization = TRUE, annotate = FALSE, minimum.window = 4) cnv.print <- function (cnv, file = "") cnv.summary <- function (cnv) plot.cnv <- function (data, chromosome = NA, CNV = NA,...) plot.cnv.all <- function (data, chrom.gap = 2e+07, colour = 5, title = NA, ylim = c(-2,2), xlabel = "Chromosome") plot.cnv.chr <- function (data, chromosome = NA, from = NA, to = NA, title = NA, ylim = c(-4, 4), glim = c(na, NA), xlabel = "Position (bp)") plot.cnv.cnv <- function (data, CNV, upstream = NA, downstream = NA,...) 4

4 Demonstration We provide some sample data for demonstration. You can download Sample.tar.gz from our website. Sample.tar.gz contains BLAT output for simulated Solexa reads with X coverage on human chromosome. (Warning: samlple.tar.gz is huge) After download the file, open command line console, and run: $ cd DOWNLOAD_DIR $ tar xzf Sample.tar.gz $ PATH-TO/best-hit.BLAT.pl ref.psl > ref.hits $ PATH-TO/best-hit.BLAT.pl test.psl > test.hits $ ### NOTE: there are several other versions of $ ### best-hit.*.pl for different input formats The above lines will generate two output files: test.hits and ref.hits, which are the genomic locations of the best BLAT hits. We also provided the two files (test.hits and ref.hits) on our website as Sample2.tar.gz, which is much smaller than Sample.tar.gz. After obtaining the two hits files, you can run cnv-seq.pl: $ cnv-seq.pl --test test.hits --ref ref.hits --genome chrom --log2 0.6 --p 0.00 --bigger-window.5 #default --annotate --minimum-windows 4 #default This will give you output like this: genome size used for calculation is 24724979 test.hits: 874797 reads ref.hits: 878852 reads The minimum window size for detecting log2>= 0.6 should be 7676.9728733869 The minimum window size for detecting log2<=-0.6 should be 7692.00645466 window size to use is 7692.00645466 x.5 = 26538 window size to be used: 26538 read 874797 test reads, out of 874797 lines read 878852 ref reads, out of 878852 lines write read-counts into file: test.hits-vs-ref.hits.log2-0.6.pvalue-0.00.count R package cnv output: test.hits-vs-ref.hits.log2-0.6.pvalue-0.00.minw-4.cnv... [] "chromosome: " 5

[] "cnv_id: of 50" [] "cnv_id: 2 of 50" [] "cnv_id: 3 of 50" [] "cnv_id: 4 of 50"... The sliding window size used is 26.5Kb. This will give a tab delimited file test.hits-vs-ref.hits.log2-0.6.pvalue-0.00.minw- 4.cnv. This file contains all information about CNV prediction from the analysis. In order to plot the log2 CNV graph: $ R # in R command prompt > library(cnv) > data <- read.delim("test.hits-vs-ref.hits.log2-0.6.pvalue-0.00.cnv") > cnv.print(data) # output... > cnv.summary(data) # output... > plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6) > ggsave("sample.pdf") The plot.cnv command will generate a plot like this picture: 6

References [] Chao Xie and Martti Tammi. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 0():80, Mar 2009. 7