Next Generation Sequencing

Size: px
Start display at page:

Download "Next Generation Sequencing"

Transcription

1 Next Generation Sequencing Cavan Reilly November 21, 2014

2 Table of contents Next generation sequencing NGS and microarrays Study design Quality assessment Burrows Wheeler transform BWT example Introduction to UNIX-like operating systems Installing programs Bowtie SAMtools RNA-Seq Aligning reads to the transcriptome TopHat Cufflinks RNA-Seq using TopHat and Cufflinks RNAseq and count models DNA-Seq BCFtools Variant Pattern Analyzer Genome Analysis Toolkit Microbiomics

3 Next generation sequencing Over the last 7 years or so there has been rapid development of methods for next generation sequencing. Here is the process for the Illumina technology (one of the 3 major producers of platforms for next generation sequencing). The biological sample (e.g. a sample of mrna molecules) is first randomly fragmented into short molecules. Then the ends of the fragments are adenylated and adaptor oligos are attached to the ends.

4 Next generation sequencing The fragments are then size selected, purified and put on a flow cell. An Illumina flow cell has 8 lanes and is covered with other oligonucleotides that bind to the adaptors that have been ligated to the fragmented nucleotide molecules from the sample. The bound fragments are then extended to make copies and these copies bind to the surface of the flow cell. This is continued until there are many copies of the original fragment resulting in a collection of hundreds of millions of clusters.

5

6

7 Next generation sequencing The reverse strands are then cleaved off and washed away and sequencing primer is hybridized to the bound DNA. The individual clusters are then sequenced in parallel, base by base, by hybridizing fluorescently labeled nucleotides. After each round of extension of the nucleotides a laser excites all of the clusters and a read is made of the base that was just added at each cluster. If a very short sequence is bound to the flow cell it is possible that the machine will sequence the adaptor sequence-this is referred to as adaptor contamination. There is also a measure of the quality of the read that is saved along with the read itself.

8 Next generation sequencing These quality measures are on the PHRED scale, so if there is an estimated probability of an error of p, the PHRED based score is 10 log 10 p. If we fragment someone s DNA we can then sequence the fragments, and if we can then put the fragments back together we can then get the sequence of that person s genome or transcriptome. This would allow us to determine what alleles this subject has at every locus that displays variation among humans (e.g. SNPs).

9 Next generation sequencing There are a number of popular algorithms for putting all of the fragments back together: BWA, Maq, SOAP, ELAND and Bowtie. We ll discuss Bowtie in some detail later (BWA uses the same ideas as Bowtie). ELAND is a proprietary algorithm from Illumina and Maq and SOAP use hash tables and are considerably slower than Bowtie.

10 Applications There are many applications of this basic idea: 1. resequencing (DNA-seq) 2. gene expression (RNA-seq) 3. mirna discovery and quantitation 4. DNA methylation studies 5. ChIP-seq studies 6. metagenomics 7. ultra-deep sequencing of viral genomes We will focus on resequencing (and SNP calling) and gene expression in this course.

11 NGS and microarrays Some have talked about the death of microarrays due to the use of NGS in the context of gene expression studies, but this hasn t happened yet. Compared to microarrays, RNA-seq 1. higher sensitivity 2. higher dynamic range 3. lower technical variation 4. don t need a sequenced genome (but it helps, a lot) 5. more information about different exon uses (more than 90% of human genes with more than 1 exon have alternative isoforms)

12 NGS and microarrays In one head to head comparison, 30% more genes are found to be differentially expressed with the same FDR (Marioni, Mason, Mane, et al. (2008)). These authors also found that the correlation between normalized intensity values from a microarray and the logarithm of the counts for the same transcript were (this is similar to what others have reported). The largest differences occurred when the read count was low and the microarray intensity was high which probably reflects cross-hybridization in the microarray.

13 NGS and microarrays Others have found slightly higher correlations, here is a figure from a study by Su, Li, Chen, et al. (2011).

14 Study design Before addressing the technical details, we will outline some considerations regarding study design specific to NGS technology. There are several major manufacturers of the technology necessary to generate short reads, and they all have different platforms with resulting differences in the data and processing methods. A few of the major vendors are 1. Illumina, formerly known as Solexa sequencing, they now have MiSeq and HiSeq (the Genome analyzer is the old platform) 2. Roche, which owns 454 Life Sciences, which supports GS FLX+ system 3. Life technologies, which supports Ion-Torrent

15 Study design We will focus on the Illumina technology as the University of Minnesota has invested in this platform (the HiSeq 2000 and MiSeq platforms is available in the UMN Genomics Center). The MiSeq machine produces longer reads than the HiSeq 2000 We also have some capabilities to do 454 sequencing here. The 454 platform produces longer reads (up to 1000 bp now), and is very popular in the microbiomics literature.

16 Study design By sequencing larger numbers of fragments one can estimate the sequence more accurately. The goal of using high sequence coverage is that by using more fragments it is more likely that one fragment will cover any random location in the genome. Clearly one wants to cover the entire genome if the goal is to sequence it, hence the depth of coverage is how many overlapping fragments will cover a random location. By deep sequencing, we mean coverage of 30X to 40X, whereas coverage of 4X or so would be considered low coverage.

17 Study design The rule for determining coverage is: coverage = (length of reads number of reads)/ (length of genome) For a given sequencing budget, there is a tradeoff between the depth of coverage and the number of subjects one can sequence. The objectives of the study should be considered when determining the depth of coverage. If the goal is to accurately sequence a tumor cell then deep coverage is appropriate, however if the goal is to identify rare variants, then lower coverage is acceptable.

18 Study design For example, the 1000 Genomes project is using about 4X coverage. A reasonable rule of thumb is that one probably needs 20-30X coverage to get false negatives less than 1% in nonrepeat regions of a duploid genome (Li, Ruan and Durbin, 2008). If the organism one is sequencing does not have a sequenced genome, the calculations are considerably more complex and one would want greater coverage in this case. Throughout this treatment we will suppose that the organism under investigation has a sequenced genome.

19 Study design Another choice when designing your study regards the use of mate-pairs. With the Illumina system one can sequence both ends of a fragment simultaneously. This greatly improves the accuracy of the method and is highly recommended. If the goal is to characterize copy number variations then getting paired end sequences is crucial.

20 Quality assessment We can use the ShortRead package in R to conduct quality assessment and get rid of data that doesn t meet quality criteria. To explore this we will use some data from dairy cows. You can find a description of the data along with the actual data in sra format at this site. While the NCBI uses the sra format to store data, the standard format for storing data that arises from a sequencing experiment is the fastq format. We can convert these files to fastq files using a number of tools-here is how to use one supported at NCBI called fastq-dump.

21 Quality assessment To use this you need to download the compressed file, unpack it and then enter the directory that holds the executable and issue the command (more on these steps later). sratoolkit centos linux64/bin/fastq-dump SRR sra Then this will generate fastq files which hold the data. If we open the fastq files we see they hold identifiers and information about the reads followed by the actual reads:

22 Quality HWI-EAS283:8:1:9:19 length=36 ACGAGCTGANGCGCCGCGGGAAGAGGGCCTACGGGG +SRR HWI-EAS283:8:1:9:19 length=36 HWI-EAS283:8:1:9:1628 length=36 CCTAATAATNTTTTCCTTTACCCTCTCCACATTAAT +SRR HWI-EAS283:8:1:9:1628 HWI-EAS283:8:1:9:198 length=36 ATTGAATTTGAGTGTAAACATTCACATAAAGAGAGA +SRR HWI-EAS283:8:1:9:198 length=36 The string of letters and symbols encodes quality information for each base in the sequence.

23 Quality assessment Then we define the directory where the data is and read in all of the filenames that are there > library(shortread) > datadir <- "/home/cavan/documents/ngs/bovinerna-seq" > fastqdir <- file.path(datadir, "fastq") > fls <- list.files(fastqdir, "fastq$", full=true) > names(fls) <- sub(".fastq", "", basename(fls))

24 Quality assessment Then we apply the quality assessment function, qa, to each of the files after we read it into R using the readfastq function > qas <- lapply(seq along(fls), function(i, fls) + qa(readfastq(fls[i]), names(fls)[i]), fls) This can take a long time as it has to read all of these files into R, but note that we do not attempt to read the contents of the files into the R session simultaneously (a couple of objects generated by the call to readfastq stored inside R will really bog it down). Then we rbind the quality assessments together and save to an external location. > qa <- do.call(rbind, qas) > save(qa, file=file.path("/home/cavan/documents/ngs/bovinerna-seq","qa.rda")) > report(qa, dest="/home/cavan/documents/ngs/bovinerna-seq/reportdir") [1] "/home/cavan/documents/ngs/bovinerna-seq/reportdir/index.html"

25 Quality assessment This generates a directory called reportdir and this directory will hold images and an html file that you can open with a browser to inspect the quality report. So we now use the browser to do this-we see the data from run 12 is of considerably higher quality than the others but nothing is too bad. One can implement soft trimming of the sequences to get rid of basecalls that are of poor quality. A function to do this is as follows (thanks to Jeremy Leipzig who posted this on his blog). Note that this can require a lot of computational resources for each fastq file.

26 Quality assessment # softtrim # trim first position lower than minquality and all subsequent positions # omit sequences that after trimming are shorter than minlength # left trim to firstbase, (1 implies no left trim) # input: ShortReadQ reads # integer minquality # integer firstbase # integer minlength # output: ShortReadQ trimmed reads library("shortread") softtrim <- function(reads, minquality, firstbase=1, minlength=5){ qualmat <- as(fastqquality(quality(quality(reads))), matrix ) quallist <-split(qualmat, row(qualmat)) ends <- as.integer(lapply(quallist, function(x){which(x < minquality)[1]-1})) # length=end-start+1, so set start to no more than length+1 to avoid # negative-length starts <-as.integer(lapply(ends,function(x){min(x+1,firstbase)})) # use whatever QualityScore subclass is sent newq <-ShortReadQ(sread=subseq(sread(reads),start=starts,end=ends), quality=new(class=class(quality(reads)), quality=subseq(quality(quality(reads)), start=starts,end=ends)),id=id(reads))

27 Quality assessment } # apply minlength using srfilter lengthcutoff <- srfilter(function(x) { width(x)>=minlength}, name="length cutoff") newq[lengthcutoff(newq)] To use: > library("shortread") > source("softtrimfunction.r") > reads <- readfastq("myreads.fq") > trimmedreads <- softtrim(reads=reads,minquality=5,firstbase=4,minlength=3) > writefastq(trimmedreads,file="trimmed.fq")

28 Bowtie background The keys to understanding how the algorithm implemented by Bowtie are the Burrows Wheeler transform (BWT) and the FM index. 1. the FM index: see Ferragina, P. and Manzini, G. (2000), Opportunistic data structures with applications 2. the Burrows Wheeler transform (BWT): see Burrows, M. and Wheeler, D.J. (1994), A Block-sorting lossless data compression algorithm. The Burrows Wheeler transform is an invertible transformation of a string (i.e. a sequence of letters).

29 Computing the BWT To compute this transform, one 1. determines all rotations of the string 2. sorts the rotations lexicographically to generate an array M 3. saves the last letter of each sorted string in addition to the row of M that corresponds to the original string.

30 Computing the BWT: an example Some descriptions of the algorithm introduce a special character (e.g. $) that is lexicographically smaller than all other letters, then append this letter to the string before starting the rest of the transform. This just guarantees that the first row of M is the original string so that one needn t store the row of M that is the original string. To see how the BWT works, let s consider the consensus sequence just upstream of the transcription start site in humans (from Zhang 1998): GCCACC We will use the modification that introduces the special character.

31 BWT example First add the special character to the end of the sequence we want to transform, denoted S with length N. GCCACC$ then get all of the rotations: $GCCACC C$GCCAC CC$GCCA ACC$GCC CACC$GC CCACC$G GCCACC$

32 BWT example then sort them (i.e. alphabetize these sequences) $GCCACC ACC$GCC C$GCCAC CACC$GC CC$GCCA CCACC$G GCCACC$ so the BWT is just the last column, typically denoted L, and we will refer to element i of this array as L[i] for i = 1,..., N and ranges of elements via L[1,..., k]. L=CCCCAG$

33 BWT example Importantly one can invert this transformation of the string to obtain the original string. This was first developed in the context of data compression, so the idea was to transform the data, then compress it (which can be done very efficiently after applying this transformation), then one can invert the transformation after uncompressing.

34 BWT example Suppose we have the BWT of a genome and want to count the number of occurrences of some pattern, e.g. AC, our query sequence. We can do this as follows. First, create a table that for each letter has the count of the number of occurrences of letters that are lexically smaller than that letter. For us this is just x $ A C G C(x)

35 BWT example Next we need to be able to determine the number of times that a letter x occurs in L[1,..., k] for any k, which we will denote O(x, k). Note that all of these calculations can be done once and stored for future use. Now search for the last letter of our query sequence, which is C, over the range of rows given by [C(x) + 1,..., C(x + 1)] where x is C so the range is [3,6]. Note that these are the rows of M that start with a C although we don t actually need to store all of M to determine this (it just follows from the alphabetization).

36 BWT example Then process the second to last letter, which for our example is A. To determine the range of rows of M to search we look at [C(x) + O(x, s 1 1) + 1,..., C(x) + O(x, s 2 )] where x is A and s 1 and s 2 are the endpoints of the interval from the previous step. Now and O(A, 3 1) = O(A, 2) = 0 O(A, 6) = 1 so the interval is [ ,..., 1 + 1] = [2, 2] then since this is the end of the query sequence we determine the number of occurrences by looking at the size of the range of this interval, which is here just 1.

37 BWT example If the range becomes empty or if the lower end exceeds the upper end the query sequence is not present. For example if we query on CCC and proceed as previously, the first step is the same (since this also ends in C), so then searching for C again we find [C(C) + O(C, s 1 1) + 1, C(x) + O(C, s 2 )] but O(C, 2) = 2 and O(C, 6) = 4 so the interval is

38 BWT example [ , 2 + 4] = [5, 6] which has size 2 (as there are 2 instances of CC in the sequence), so then proceed as in step 2 again to get which is [C(C) + O(C, 5 1) + 1, C(C) + O(C, 6)] [ , 2 + 4] = [7, 6] indicating that this sequence is not present.

39 BWT example For aligning short sequences to a genome we need to know where the short segments align, hence we need to determine the location of the start of the sequence in the genome. If the locations of some of the letters are known we can just trace back to those locations as follows. First, note that we know the first column of the array M. We know this due to 2 facts: 1. the first column will be strictly alphabetized 2. every letter in the sequence is represented in the last column L.

40 BWT example For our problem we have L=CCCCAG$ and so the first column is just F =$ACCCCG. As our sequence is of length N, then if we knew the permutation of 1,..., N that mapped F to L we would know the entire sequence (as we will see shortly). That one can determine this mapping is the key ingredient in computing the inverse BWT. Call this map T where we have F ( T (i) ) = L(i).

41 BWT example To find T, first consider M and a new array, call it M, that is obtained from M by taking the last element from each row and making it the first element but otherwise changing nothing. Here is M : C$GCCAC CACC$GC CC$GCCA CCACC$G ACC$GCC GCCACC$ $GCCACC

42 Inverting the BWT Note that every row of M has a corresponding row in M and like M M has all cyclic rotations of the original sequence S. Also note that the relative order of the rows with the same first letter are the same in both arrays: rows 3, 4, 5 and 6 in M become rows 1, 2, 3 and 4 in M. We can use this insight to determine the mapping from F to L, i.e. T. To determine T, we proceed sequentially as follows. L(1) = C and this is the first C in L so T (1) = i where F (i) is the first C in F so looking at F we see that i = 3.

43 BWT example Then continue with this thinking: T (2) = i and L(2) is the second C in L, but then looking at F we see that the second C in F occurs at position 4, so that T (2) = 4. Continuing in this fashion we find that T = (3, 4, 5, 6, 2, 7, 1).

44 BWT example We can then reconstruct the sequence in reverse order. We know the last element of the sequence is the last element of the first row of M which is the first element of L, which is here C. Then we proceed sequentially using the mapping T as follows S(N i) = L ( T i (1) ) where T i (x) = T i 1( T (x) ), i.e. repeatly apply the mapping T to itself.

45 BWT example so S(N 1) = L ( T (1) ) = L(3) = C S(N 2) = L ( T (3) ) = L(5) = A S(N 3) = L ( T (5) ) = L(2) = C S(N 4) = L ( T (2) ) = L(4) = C S(N 5) = L ( T (4) ) = L(6) = G If we were just trying to find the location of a particular match, like AC we could just stop at that point rather than recover the entire sequence.

46 FM index The FM index is a compressed version of L, C and O along with a compressed version of a set of indices with known locations in S. This allows one to trace back to a known location. So one could align short reads to a genome by first computing the BWT and compressing the results. Unfortunately many of the reads would not align due to sequencing errors and polymorphisms. Hence one needs to allow some discrepancies between the genome and the short reads. While this is adjustable, the default specifications for the program Bowtie are that there is a maximum of 2 mismatches in the first 28 bases and a maximum total quality score of 70 for mismatches on the PHRED scale.

47 Introduction to Linux In order to use the latest tools in bioinformatics you need access to a linux based operating system (or Mac OS X). So we will give some background on how to use a linux based system. The linux operating system is based on the UNIX operating system (by a guy named Linus Torvalds), but is freely available (UNIX was developed by a private company). There are a number of implementations of linux that go by different names: e.g. Ubuntu, Debian, Fedora, and OPENsuse.

48 Introduction to Linux Most beginners choose Ubuntu (or one of its offshoots, like Mint). I use Ubuntu or Fedora on my computers and centos is used on the server we will use for this course. You can install it for free on any PC. Such systems are based on a hierarchical structure for organizing files into directories. Much of the time we will use plain text files: these are files that have no formatting, unlike the files you use in a typical word processing program. To edit such files you need to use a text editor. I like vi and vim (which has capabilities to interact with large files).

49 Introduction to Linux The editor pico is simple to use and installed on the server we will use for this class. To use this you type pico followed by the file name-if the file doesn t exist it will create the file otherwise it will open up the file so that you can edit it. For example [user0001@boris ~]$ pico temp1 will open the file temp1 for editing. Commands at the bottom of the window tell you how to conduct various operations.

50 Introduction to Linux You can create subdirectories in your home directory by typing ~]$ mkdir NGS This would create a NGS subdirectory where you can put files related to your NGS project. Then to keep your microarray data separate you could do [user0001@boris ~]$ mkdir microarray To move around hierarchical directory structure you use the command cd. We enter the NGS directory with [user0001@boris ~]$ cd NGS

51 Introduction to Linux To go back to our home location (i.e. the directory you start from when you log on) you just type cd and if you want to go up one level in the hierarchy you type cd... To list all of the files and directories in your current location you type ls. To copy a file you use cp, to move a file from one location to another you use mv and to remove a file you use rm. One can modify the behavior of these commands by using options.

52 Introduction to Linux For example, if I want to see the size of the files (and some additional information) rather than just list them I could issue the command [user0001@boris ~]$ ls -l Other useful commands include more and less for viewing the contents of files and man for information about a command (from the manual). For instance [user0001@boris ~]$ man ls Will open a help file with information about the options that are available for the ls command.

53 Introduction to Linux Wild cards are characters that operating systems use to hold a variable value. These are extremely useful as it lets the user specify many tasks without having to do a bunch of repetitive typing. For example, suppose you had a bunch of files that ended with.txt and you wanted to delete them. You could issue commands like [user0001@boris ~]$ rm file1.txt [user0001@boris ~]$ rm file2.txt [user0001@boris ~]$ rm file3.txt... Or you could just use [user0001@boris ~]$ rm file*txt or even [user0001@boris ~]$ rm *txt

54 Introduction to Linux Most linux configurations will check that you really want to remove the files, but if you are removing many files, the process of typing yes repeatly is unnecessary. Just type ~]$ rm -f *txt When you obtain programs from the internet that run on linux systems these programs will typically install many subdirectories, so if you want to get rid of all the contents of a directory, first enter that directory, [user0001@boris ~]$ cd olddir

55 Introduction to Linux then issue the command olddir]$ rm -rf * then go back up one level [user0001@boris olddir]$ cd.. and get rid of the directory using the rmdir command [user0001@boris ~]$ rmdir olddir If you issue a command to run a program but wish you hadn t and want it to stop you push the Ctrl and C keys simultaneously.

56 Introduction to Linux To connect to a linux server from a Windows machine you need a Windows program that allows you to log on to a terminal and the ability to transfer files to and from the remote server and your Windows machine. WinSCP is a free program that allows one to do this. You can download the data from the WinSCP website and it installs like a regular Windows program: just allow your computer to install it and go with the default configuration. To get a terminal window you can use a program called PuTTY. So download the program PuTTY (which is just the file called PuTTY.exe) and create a subdirectory in your Program Files directory called PuTTY.

57 Introduction to Linux Then save PuTTY.exe in the directory you just created. Once installed start WinSCP and use boris.biostat.umn.edu as the hostname along with your user name and password. You can then get a terminal window using putty: just click on the box that shows 2 computers that are connected right below the session tab. Then disconnect when you are done (using the option in the session tab).

58 Installation of Bowtie Recall: Bowtie is a program that aligns short reads to an indexed genome. Bowtie2 is a more up to date version that allows gapped alignments and Smith Waterman type scoring, Bowtie just allows a couple of nucleotides to disagree between the read and the genome. To use Bowtie we need a set of short reads (in fastq format) and the index of a genome and of course the program itself. We will download a compressed version of the program (an executable) and after we uncompress it, we can use the executable, i.e. a program that does something. Bowtie has the capability to build indexes if we have the genome sequence.

59 Installation of Bowtie First set up a directory to hold all of this work [user0001@boris ~]$ mkdir RNAseq then create a directory to hold all of the programs we will download and set up, we will call it bin (for binaries). [user0001@boris ~]$ mkdir bin Then enter this directory so we can install some programs. [user0001@boris ~]$ cd bin

60 Installation of Bowtie In order to know what version of software you need to install you need to check the architecture of the computer. To do so we issue the following command bin]$ uname -m x86 64 So this is a 64 bit machine (32 bit machines will report something like i386). So get the appropriate version of Bowtie using the wget command [user0001@boris bin]$ wget 64.zip

61 Installation of Bowtie then unpack it bin]$ unzip bowtie linux-x86 64.zip Archive: bowtie linux-x86 64.zip creating: bowtie / creating: bowtie /scripts/ inflating: bowtie /scripts/build test.sh inflating: bowtie /scripts/make a thaliana tair.sh... This will generate a subdirectory with the name bowtie , so enter that subdirectory so that you can have access to the executable called bowtie in this subdirectory.

62 Aligning reads to an indexed genome The program ships with some data from the bacteriophage Enterobacteria phage lambda, so we will check the installation with this. We will first use a fasta file that ships with Bowtie to construct the index files for this organism. To use the program, you type the name of the executable, then give the location and name of the fasta file, then give the index name. So to test the installation type the following [user0001@boris bowtie ]$./bowtie2-build example/reference/lambda virus.fa lambda virus

63 Aligning reads to an indexed genome Then you get a bunch of output written to the screen about the index creation process and it generates 6 new files: lambda virus.1.bt2, lambda virus.2.bt2, lambda virus.3.bt2, lambda virus.4.bt2, lambda virus.rev.1.bt2 and lambda virus.rev.2.bt2. After that we are ready to align reads to the reference genome using the index we constructed. Here the syntax is to give the name of the index after -x, the files with the reads after -U and here we specify that we want the output in the sam file format using -S.

64 Aligning reads to an indexed genome bowtie ]$./bowtie2 -x lambda virus -U example/reads/reads 1.fq -S eg1.sam reads; of these: (100.00%) were unpaired; of these: 596 (5.96%) aligned 0 times 9404 (94.04%) aligned exactly 1 time 0 (0.00%) aligned >1 times 94.04% overall alignment rate We can also look at the output. [user0001@boris bowtie ]$ head eg1.sam

65 Aligning reads to an indexed genome Illumina stores prebuilt indices for many commonly studied organisms at the location software/igenome.ilmn If one has a set of index files one can use the program bowtie2-inspect to recreate the fasta files. The Illumina site has the fasta files for these organisms in the same location.

66 Altering your $PATH Note that we need to be in the same directory level as our executable in order to use that executable thus far-this is not necessary. To have access to an executable anywhere in your system you should alter your PATH variable in the linux environment to include where you keep your executables then write the executables to your path.

67 Altering your $PATH If you look at your path variable you will see that the bin directory we created is already on the path, so we can just copy executables to that location to use them. Here is how to examine the PATH variable. bowtie ]$ echo $PATH /usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin: /usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin: /sbin:/export/home/courses/ph7445/user0001/bin

68 Copying executables to bin Many NGS programs use bowtie2, and for this to work they need to find the program in their path. So here we copy all of the executables to the bin directory: bowtie ]$ cp bowtie2 $HOME/bin/bowtie2 bowtie ]$ cp bowtie2-align-l $HOME/bin/bowtie2-align-l bowtie ]$ cp bowtie2-align-s $HOME/bin/bowtie2-align-s bowtie ]$ cp bowtie2-build $HOME/bin/bowtie2-build bowtie ]$ cp bowtie2-build-l $HOME/bin/bowtie2-build-l

69 Copying executables to bin and copy the rest too. bowtie ]$ cp bowtie2-build-s $HOME/bin/bowtie2-build-s bowtie ]$ cp bowtie2-inspect $HOME/bin/bowtie2-inspect bowtie ]$ cp bowtie2-inspect-l $HOME/bin/bowtie2-inspect-l bowtie ]$ cp bowtie2-inspect-s $HOME/bin/bowtie2-inspect-s

70 Altering your $PATH Bowtie 2 uses environmental variables to find the executables-just set these as follows. [user0001@boris bowtie ]$ export BT2 HOME=/export/home/courses/ph7445/user0001/bin/bowtie and check this with [user0001@boris bowtie ]$ echo $BT2 HOME

71 Altering environmental variables To test that this works let s go back to our home directory and issue our Bowtie2 commands again (note that we need to tell Bowtie where to find the fastq and index files) [user0001@boris ]$ bowtie2-build $BT2 HOME/example/reference/lambda virus.fa lambda virus [user0001@boris ]$ bowtie2 -x lambda virus -U $BT2 HOME/example/reads/reads 1.fq -S eg1.sam and it works fine.

72 Producing SAM files As we have seen, Bowtie2 can also produce a special file type that is widely used in the NGS world, a SAM file. If we open our eg1.sam file we see that the first 3 rows are the header (since they start Then there is a row for each read and 11 mandatory tab delimited fields per line that provide information about the alignment, for example the 4th column gives the location where the read maps. For the full specification consult samtools.sourceforge.net/samv1.pdf.

73 SAMtools The SAM (for sequence alignment/map) file type has a very strict format. These files can be processed in a highly efficient manner using a suite of programs called SAMtools. Here is how to setup these tools Download the latest version in your bin directory samtools-1.1.tar.bz2. [user0001@boris bin]$ wget

74 SAMtools then unpack bin]$ tar -jxvf samtools-1.1.tar.bz2 then there is a directory full of C programs called samtools-1.1, so go in there and type [user0001@boris samtools-1.1]$ make and write the executable to your path [user0001@boris samtools-1.1]$ cp samtools $HOME/bin/samtools

75 SAMtools example To use this for variant calling, we need an example with at least 2 subjects, so let s use the example that ships with SAMtools. We will return to this example latter, but for now we will just take a look at the pileup which represents the extent of coverage. Here is how to use SAMtools to index a FASTA file: [user0001@boris samtools-1.1]$ samtools faidx examples/ex1.fa

76 BAM and SAM files then create a BAM file from the SAM file (BAM files are binary, so they are compressed but not readable by eye) [user0001@boris samtools-1.1]$ samtools import examples/ex1.fa.fai examples/ex1.sam.gz ex1.bam should get [sam header read2] 2 sequences loaded

77 BAM and SAM files then index the BAM file samtools-1.1]$ samtools index ex1.bam then we can view the alignment to visually examine the extent of coverage and the depth of coverage samtools-1.1]$ samtools tview ex1.bam examples/ex1.fa and hit q to leave that visualization.

78 Visualization of the pileup CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTGGCT A...C C T.T.T..T...T..TC C T......,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,......T T

79 RNA-Seq As was mentioned previously, we can use next generation sequencing to generate sequences from samples of RNA. The transcriptome is the collection of all transcripts that can be generated from a genome. Current estimates indicate that 92-94% of human transcripts with more than one exon have alternatively spliced isoforms (Wang, Sandberg, Luo, et al. Nature, 2008). This means that the same gene can give rise to multiple mrna molecules. Some will differ because they have have different exons assembled into an mrna, but other possibilities include differences in use of the transcription start site (TSS), and differences in the 5 and 3 untranslated regions (UTRs).

80 Aligning reads to the transcriptome Differences in exon usage will result in mrnas from the same gene leading to different proteins, hence we speak of differential coding DNA sequences (CDS). One of the differences that arises due to differential 3 UTR usage is alternative polyadenelation patterns. These differences in 5 and 3 UTR usage lead to mrna molecules that have differential affinity for proteins that bind to the mature mrna molecule and we are just starting to understand how biological processes are regulated by these subtle mechanisms.

81 RNA-Seq Note that there is a difference in aligning a set of reads from an RNA sample and a DNA sample: the RNA molecules will have undergone removal of introns hence they won t align directly to the genome. If there is an annotated transcriptome for an organism you should start out trying to use this to analyze your data. However most investigations find evidence for transcripts that are not currently annotated, even for well studied organisms like mice and humans.

82 Duplicates One of the differences between using next generation sequencing for resequencing and profiling RNA levels is that with libraries prepared from RNA, one often finds duplicate reads. In fact it is not unusual to find that 60-80% of the reads are duplicates.

83 Duplicates In a resequencing study the chances of finding duplicates is quite low: for example, the human haploid genome is about 3 billion bps so the probability of observing a duplicate is about 1 in 3 billion, so with million reads, duplicate reads seem unlikely. Duplicate reads can arise as a PCR artifact, hence in resequencing studies duplicate reads should probably be treated as artifacts and removed. There is a SAMtools function called rmdup that can be used to do this, but read the manual on its use: for example, it is designed for paired end reads by default.

84 Library complexity However, when we consider RNA-Seq libraries, duplicates might arise just due to low complexity of our sample. By low complexity we mean there are not many distinct RNA molecules. The median length of a human transcript is 2186 nucleotides (according to the UCSC hg19 annotation), so if one has a library with only 1 type of transcript and generates 20 million reads from it one is guaranteed to find duplicates. So for analysis of RNA-Seq experiments it is probably unwise to remove duplicates.

85 Alignment algorithms There are a number of algorithms that are designed to align a set of short reads to the transcriptome, here is a short non-exhaustive list. 1. TopHat 2. GSNAP 3. MapSplice 4. SpliceMap 5. HMMSplicer However it is not clear at this point if any of these really dominate the others in terms of performance. However, TopHat is perhaps the most popular currently, and as it builds off of Bowtie, we will use this program in this course.

86 TopHat The basic idea behind TopHat is that if 2 segments of a read map to different locations on the genome then they are likely mapping to 2 different exons. To use TopHat you need fastq files for all of our samples and the genome sequence for the organism. Note that sometimes the data for a single sample will be in more than one fastq file. While we can use Bowtie to align short reads to a genome, reads that span an intron, or genes that span a splice junction, will not align directly to the genome.

87 TopHat If a read contains an internal segment that should map to an exon, Bowtie will fail to map this read to its correct genomic location as it won t match the genome outside of this exon. For this reason, Tophat initially splits the reads up into 25 base units, does the read mapping, then reassembles reads. Note: if you are using a set of reads that are shorter than 50 bases you should instruct TopHat to break the reads into fragments that are about half the length of your reads. For example, if you have a set of Illumina reads that are 36 bases you should instruct TopHat to break them into fragments of length 18 (using the --segment-length options, like --segment-length 18), otherwise it may not detect any splice junctions.

88 TopHat TopHat starts out by mapping the reads to the genome using Bowtie and sets aside those reads that don t map. It then constructs a set of disjoint covered regions (it no longer uses the program Maq as it did at the time of the publication): these are taken to be the initial set of exons observed in that sample. The program determines a consensus sequence using the reads, however in low coverage regions it will use the reference genome to determine the sequence. Also, as the ends of exons are likely only mapped by reads that span the splice junction, the disjoint regions covered by reads are expanded (45 bases by default) using the reference sequence.

89 TopHat For genes that are transcribed at low levels there could be gaps in the exons after the assembly step, hence exons that are very close are merged into a single exon. Introns of length less than 70 bases are rare in mammalian genomes, hence by default we combine exons that are separated by less than this many bases (this can be adjusted). The resulting set of putative exons (with their extensions from the reference genome) we will call islands. The 5 end of an intron is called a donor site and the 3 end is called the acceptor site.

90 TopHat This donor site almost always has the 2 bases GU at the end while the acceptor site is almost always AG, but these are sometimes GC-AG and AU-AC so current versions of TopHat also look for these acceptor and donor sites. TopHat uses these sequence characteristics to find potential donor and acceptor site pairs in the island sequences (not just adjacent ones, but nearby ones to allow for alternative splicing). It then looks at the set of reads that were initially unmapped and examines if any of these map to the sequence that spans the splice junction. To conduct this mapping TopHat uses the fact that most introns are between 70 and 500,000 bases (the maximum differs from the publication): these intron sizes are based on mammals, so the user should specify values for these if using a different organism.

91 TopHat It also requires that at least one read span at least 8 bases on each side of the splice junction by default, but this value can be adjusted. It also only uses the first 28 bases from the 5 end of the read to ensure that the read is of high quality, but this too can be adjusted. Finally the program reports a splice junction only if it occurs in more than 15% of the depth of coverage of the flanking exons (the 15% value comes from the Wang et al. 2008, Nature paper), but this is also adjustable as this is based on human data. If one supplies a transcriptome to TopHat via a GTF file, the program will first map reads to the transcriptome then try to identify splice junctions for the unmapped reads.

92 TopHat One can disable the subsequent search for splice junctions by specifying the -T option if you have provided a transcriptome. If you are going to use an annotated transcriptome then TopHat will need to construct an index for this and if you are processing multiple samples TopHat will redo exactly the same calculation for each. To save time, you should have TopHat do these calculations for the first sample and save the results. You can do this using the --transcriptome-index option. This option expects a directory and a name to use in this directory.

93 TopHat The exact syntax is like this: note that the --transcriptome-index option must be used right after the -G option. tophat -o out sample1 -G known genes.gtf \ --transcriptome-index=transcriptome data/known \ hg19 sample1 1.fq.z Then for processing the other samples we just tell TopHat where these indices are and don t specify the -G option tophat -o out sample2 \ --transcriptome-index=transcriptome data/known \ hg19 sample2 1.fq.z

94 TopHat To use the program first get the latest version bin]$ wget x86 64.tar.gz Then uncompress the binaries, enter the directory that is generated and copy the executables to your path. bin]$ tar zxvf tophat linux x86 64.tar.gz tophat linux x86 64]$ cp tophat2 $HOME/bin/tophat2 tophat linux x86 64]$ cp * $HOME/bin/

95 TopHat Then you should test the installation, so get some test data. tophat linux x86 64]$ wget data.tar.gz Then uncompress, go into that directory and run a test [user0001@boris tophat linux x86 64]$ tar zxvf test data.tar.gz [user0001@boris tophat linux x86 64]$ cd test data Then test by issuing the command (the -r 20 option indicates that we think there are 20 bases between read pairs) [user0001@boris test data]$ tophat2 -r 20 test ref reads 1.fq reads 2.fq

96 TopHat and if it is working correctly this yields the following [ :54:03] Beginning TopHat run (v2.0.13) [ :54:03] Checking for Bowtie Bowtie version: [ :54:03] Checking for Bowtie index files (genome).. Found both Bowtie1 and Bowtie2 indexes [ :54:08] Run complete: 00:00:05 elapsed

97 Cufflinks While TopHat is useful for aligning a set of reads to the transcriptome, one must use some other program to generate transcript level summaries of gene expression and test for differences between groups. One such program that is supported by the same group of researchers that develop TopHat is Cufflinks. Cufflinks uses the output from TopHat and we will see that one can use functionality of the R package cummerbund to examine the output and generate reports and publication quality graphs. It is designed for paired end reads but can be used for unpaired reads.

98 Cufflinks Cufflinks uses an expression for the likelihood of a set of paired end reads that depends on the length of the fragments: with paired end reads one can estimate the distribution of the lengths. Note that there is a fundamental identifiability problem with transcript assembly that is only mitigated by the presence of reads that span splice junctions. Suppose there are 2 exons and some reads map to both exons but we are unable to identify any reads that span splice junctions. If there are 2 exons there are 3 possible transcripts (2 with one of each exon and a third that uses both) so if we allow for all 3 transcripts there are an infinite collection of combinations of the 3 transcript levels that are consistent with observed read counts that hit both exons.

99 Cufflinks For this reason one must enforce a constraint on the transcripts: for example, we attempt to find the minimal set of transcripts that explain the exon counts. Cufflinks quantifies the level of expression of a transcript via the number of fragments per kilobase of exon per million fragments mapped (FPKM). This attempts to account for the fact that longer transcripts will generate more reads than shorter transcripts due to their length. Cuffdiff is a program that is bundled with Cufflinks that allows one to test for differences between experimental conditions.

100 Cufflinks The log of the ratio of FPKM values are approximately normal random variables: this allows one to define a test statistic once one determines an approximation to the standard error of this log ratio. This standard error depends on the number of isoforms for the transcript: if there is a single isoform there is no uncertainty in the expression estimate produced by the algorithm and the only source of variability is from biological replicates. If there are no replicates within conditions the standard error is computed by examining the variance across the 2 conditions. If there are not many transcripts that differ in their level of expression this will provide a reasonable standard error, but if there are replicates one can obtain better estimates of the variance of the log ratio by using the negative binomial distribution.

101 Cufflinks If there is only a single isoform then an approach similar to that used in the R package DESeq is used, but if there are multiple isoforms then uncertainty in the isoform expression estimates must be taken into account: this is done using a generalization of the negative binomial distribution called the beta negative binomial distribution. Some useful details: While both BAM and SAM files can be used for input the program requires that if chromosome names are present in records in the header of the file then the reads must be sorted in the same order.

102 RNA-Seq using TopHat and Cufflinks We ve already installed and tested the latest version of TopHat, so now let s install Cufflinks using what is now the familiar strategy: download the tarball, unpack it and write the executables to our path by copying them to bin. [user0001@boris bin]$ wget x86 64.tar.gz [user0001@boris bin]$ tar zxvf cufflinks linux x86 64.tar.gz [user0001@boris bin]$ cd cufflinks linux x86 64 [user0001@boris cufflinks linux x86 64]$ cp * $HOME/bin/

103 RNA-Seq using TopHat and Cufflinks Then test the installation using some test data, so download and run. cufflinks linux x86 64]$ wget data.sam cufflinks linux x86 64]$ cufflinks test data.sam and you should see

104 RNA-Seq using TopHat and Cufflinks You are using Cufflinks v2.2.1, which is the most recent release. [bam header read] EOF marker is absent. [bam header read] invalid BAM binary header (this is not a BAM file). File test data.sam doesn t appear to be a valid BAM file, trying SAM [11:36:11] Assembling transcripts and estimating abundances. > Processed 1 loci. [*************************] 100% Which indicates that this is working properly. This generates a GTF file called transcripts.gtf which has information on transcripts that were identified and their level of expression.

105 RNA-Seq using TopHat and Cufflinks As an example of using the TopHat/Cufflinks programs we will analyze the data that was used to illustrate the protocol developed by the authors of the programs (this is from the Nature Protocols paper from March 2012). The data is already loaded on the server and we ve installed all of the software, so we are almost ready to go. We also need a set of Bowtie2 indexes and if we want to use an annotated transcriptome we need a GTF file with that information.

106 RNA-Seq using TopHat and Cufflinks These have already been set up in the subdirectory /export/home/courses/ph7445/data The indexes are in the files with names that start with genome located at /export/home/courses/ph7445/data/drosophila melanogaster/ensembl/bdgp5/sequence/bowtie2index (there are 6 of these and I got them from the Illumina website). The file Drosophila melanogaster.bdgp5.68.gtf was obtained from ensembl.org: ftp://ftp.ensembl.org/pub/release-73/gtf/drosophilia melanogaster.

107 RNA-Seq using TopHat and Cufflinks First we must use TopHat to map the reads in the fastq files to the transcriptome. Note that due to all of the options we are going to use we need to break the command up into multiple lines: we do this by ending each line with a back-slash. [user0001@boris RNAseq]$ tophat2 -p 2 -G \ /export/home/courses/ph7445/data/drosophila melanogaster.bdgp5.68.gtf \ -o C1 R1 thout \ /export/home/courses/ph7445/data/drosophila melanogaster/ensembl/bdgp5/sequence/bowtie2index/genome \ /export/home/courses/ph7445/data/gsm C1 R1 1.fq \ /export/home/courses/ph7445/data/gsm C1 R1 2.fq then you should see something like this

An Introduction to Linux and Bowtie

An Introduction to Linux and Bowtie An Introduction to Linux and Bowtie Cavan Reilly November 10, 2017 Table of contents Introduction to UNIX-like operating systems Installing programs Bowtie SAMtools Introduction to Linux In order to use

More information

RNA-seq. Manpreet S. Katari

RNA-seq. Manpreet S. Katari RNA-seq Manpreet S. Katari Evolution of Sequence Technology Normalizing the Data RPKM (Reads per Kilobase of exons per million reads) Score = R NT R = # of unique reads for the gene N = Size of the gene

More information

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines 454 GS Junior,

More information

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines: Illumina MiSeq,

More information

Sequence Analysis Pipeline

Sequence Analysis Pipeline Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation

More information

RNA-seq Data Analysis

RNA-seq Data Analysis Seyed Abolfazl Motahari RNA-seq Data Analysis Basics Next Generation Sequencing Biological Samples Data Cost Data Volume Big Data Analysis in Biology تحلیل داده ها کنترل سیستمهای بیولوژیکی تشخیص بیماریها

More information

Illumina Next Generation Sequencing Data analysis

Illumina Next Generation Sequencing Data analysis Illumina Next Generation Sequencing Data analysis Chiara Dal Fiume Sr Field Application Scientist Italy 2010 Illumina, Inc. All rights reserved. Illumina, illuminadx, Solexa, Making Sense Out of Life,

More information

High-throughout sequencing and using short-read aligners. Simon Anders

High-throughout sequencing and using short-read aligners. Simon Anders High-throughout sequencing and using short-read aligners Simon Anders High-throughput sequencing (HTS) Sequencing millions of short DNA fragments in parallel. a.k.a.: next-generation sequencing (NGS) massively-parallel

More information

Bioinformatics in next generation sequencing projects

Bioinformatics in next generation sequencing projects Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet March 2011 Once sequenced the problem becomes computational

More information

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012) USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are

More information

Mapping NGS reads for genomics studies

Mapping NGS reads for genomics studies Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization

More information

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Analysis of RNA sequencing data sets using the Galaxy environment Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Microarray and Deep-sequencing core facility 30.10.2017 RNA-seq workflow I Hypothesis

More information

Aligners. J Fass 21 June 2017

Aligners. J Fass 21 June 2017 Aligners J Fass 21 June 2017 Definitions Assembly: I ve found the shredded remains of an important document; put it back together! UC Davis Genome Center Bioinformatics Core J Fass Aligners 2017-06-21

More information

Exercise 2: Browser-Based Annotation and RNA-Seq Data

Exercise 2: Browser-Based Annotation and RNA-Seq Data Exercise 2: Browser-Based Annotation and RNA-Seq Data Jeremy Buhler July 24, 2018 This exercise continues your introduction to practical issues in comparative annotation. You ll be annotating genomic sequence

More information

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi Although a little- bit long, this is an easy exercise

More information

NGS Data and Sequence Alignment

NGS Data and Sequence Alignment Applications and Servers SERVER/REMOTE Compute DB WEB Data files NGS Data and Sequence Alignment SSH WEB SCP Manpreet S. Katari App Aug 11, 2016 Service Terminal IGV Data files Window Personal Computer/Local

More information

Introduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

Introduction to Read Alignment. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 Introduction to Read Alignment UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 From reads to molecules Why align? Individual A Individual B ATGATAGCATCGTCGGGTGTCTGCTCAATAATAGTGCCGTATCATGCTGGTGTTATAATCGCCGCATGACATGATCAATGG

More information

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems. Sequencing Short Alignment Tobias Rausch 7 th June 2010 WGS RNA-Seq Exon Capture ChIP-Seq Sequencing Paired-End Sequencing Target genome Fragments Roche GS FLX Titanium Illumina Applied Biosystems SOLiD

More information

Long Read RNA-seq Mapper

Long Read RNA-seq Mapper UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGENEERING AND COMPUTING MASTER THESIS no. 1005 Long Read RNA-seq Mapper Josip Marić Zagreb, February 2015. Table of Contents 1. Introduction... 1 2. RNA Sequencing...

More information

Galaxy Platform For NGS Data Analyses

Galaxy Platform For NGS Data Analyses Galaxy Platform For NGS Data Analyses Weihong Yan wyan@chem.ucla.edu Collaboratory Web Site http://qcb.ucla.edu/collaboratory Collaboratory Workshops Workshop Outline ü Day 1 UCLA galaxy and user account

More information

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your

More information

version /1/2011 Source code Linux x86_64 binary Mac OS X x86_64 binary

version /1/2011 Source code Linux x86_64 binary Mac OS X x86_64 binary Cufflinks RNA-Seq analysis tools - Getting Started 1 of 6 14.07.2011 09:42 Cufflinks Transcript assembly, differential expression, and differential regulation for RNA-Seq Site Map Home Getting started

More information

SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012

SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012 SAM / BAM Tutorial EMBL Heidelberg Course Materials Tobias Rausch September 2012 Contents 1 SAM / BAM 3 1.1 Introduction................................... 3 1.2 Tasks.......................................

More information

Short Read Alignment. Mapping Reads to a Reference

Short Read Alignment. Mapping Reads to a Reference Short Read Alignment Mapping Reads to a Reference Brandi Cantarel, Ph.D. & Daehwan Kim, Ph.D. BICF 05/2018 Introduction to Mapping Short Read Aligners DNA vs RNA Alignment Quality Pitfalls and Improvements

More information

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units Abstract A very popular discipline in bioinformatics is Next-Generation Sequencing (NGS) or DNA sequencing. It specifies

More information

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF Acknowledgments Genomics Virtual Lab: gvl.org.au Galaxy for tutorials: galaxy-tut.genome.edu.au Galaxy Australia: galaxy-aust.genome.edu.au

More information

QIAseq Targeted RNAscan Panel Analysis Plugin USER MANUAL

QIAseq Targeted RNAscan Panel Analysis Plugin USER MANUAL QIAseq Targeted RNAscan Panel Analysis Plugin USER MANUAL User manual for QIAseq Targeted RNAscan Panel Analysis 0.5.2 beta 1 Windows, Mac OS X and Linux February 5, 2018 This software is for research

More information

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment An Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at https://blast.ncbi.nlm.nih.gov/blast.cgi

More information

Ensembl RNASeq Practical. Overview

Ensembl RNASeq Practical. Overview Ensembl RNASeq Practical The aim of this practical session is to use BWA to align 2 lanes of Zebrafish paired end Illumina RNASeq reads to chromosome 12 of the zebrafish ZV9 assembly. We have restricted

More information

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST A Simple Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at http://www.ncbi.nih.gov/blast/

More information

Running SNAP. The SNAP Team October 2012

Running SNAP. The SNAP Team October 2012 Running SNAP The SNAP Team October 2012 1 Introduction SNAP is a tool that is intended to serve as the read aligner in a gene sequencing pipeline. Its theory of operation is described in Faster and More

More information

Aligners. J Fass 23 August 2017

Aligners. J Fass 23 August 2017 Aligners J Fass 23 August 2017 Definitions Assembly: I ve found the shredded remains of an important document; put it back together! UC Davis Genome Center Bioinformatics Core J Fass Aligners 2017-08-23

More information

Maize genome sequence in FASTA format. Gene annotation file in gff format

Maize genome sequence in FASTA format. Gene annotation file in gff format Exercise 1. Using Tophat/Cufflinks to analyze RNAseq data. Step 1. One of CBSU BioHPC Lab workstations has been allocated for your workshop exercise. The allocations are listed on the workshop exercise

More information

CLC Server. End User USER MANUAL

CLC Server. End User USER MANUAL CLC Server End User USER MANUAL Manual for CLC Server 10.0.1 Windows, macos and Linux March 8, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej 2 Prismet DK-8000 Aarhus C Denmark

More information

Goal: Learn how to use various tool to extract information from RNAseq reads. 4.1 Mapping RNAseq Reads to a Genome Assembly

Goal: Learn how to use various tool to extract information from RNAseq reads. 4.1 Mapping RNAseq Reads to a Genome Assembly ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2014 UNIVERSITY OF KENTUCKY AGTC Class 4 RNAseq Goal: Learn how to use various tool to extract information from RNAseq reads. Input(s): magnaporthe_oryzae_70-15_8_supercontigs.fasta

More information

ChIP-seq (NGS) Data Formats

ChIP-seq (NGS) Data Formats ChIP-seq (NGS) Data Formats Biological samples Sequence reads SRA/SRF, FASTQ Quality control SAM/BAM/Pileup?? Mapping Assembly... DE Analysis Variant Detection Peak Calling...? Counts, RPKM VCF BED/narrowPeak/

More information

From Smith-Waterman to BLAST

From Smith-Waterman to BLAST From Smith-Waterman to BLAST Jeremy Buhler July 23, 2015 Smith-Waterman is the fundamental tool that we use to decide how similar two sequences are. Isn t that all that BLAST does? In principle, it is

More information

Running SNAP. The SNAP Team February 2012

Running SNAP. The SNAP Team February 2012 Running SNAP The SNAP Team February 2012 1 Introduction SNAP is a tool that is intended to serve as the read aligner in a gene sequencing pipeline. Its theory of operation is described in Faster and More

More information

NGS Analysis Using Galaxy

NGS Analysis Using Galaxy NGS Analysis Using Galaxy Sequences and Alignment Format Galaxy overview and Interface Get;ng Data in Galaxy Analyzing Data in Galaxy Quality Control Mapping Data History and workflow Galaxy Exercises

More information

Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014

Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014 Review of Recent NGS Short Reads Alignment Tools BMI-231 final project, Chenxi Chen Spring 2014 Deciphering the information contained in DNA sequences began decades ago since the time of Sanger sequencing.

More information

Genomic Files. University of Massachusetts Medical School. October, 2014

Genomic Files. University of Massachusetts Medical School. October, 2014 .. Genomic Files University of Massachusetts Medical School October, 2014 2 / 39. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further

More information

Analysis of ChIP-seq data

Analysis of ChIP-seq data Before we start: 1. Log into tak (step 0 on the exercises) 2. Go to your lab space and create a folder for the class (see separate hand out) 3. Connect to your lab space through the wihtdata network and

More information

ASAP - Allele-specific alignment pipeline

ASAP - Allele-specific alignment pipeline ASAP - Allele-specific alignment pipeline Jan 09, 2012 (1) ASAP - Quick Reference ASAP needs a working version of Perl and is run from the command line. Furthermore, Bowtie needs to be installed on your

More information

m6aviewer Version Documentation

m6aviewer Version Documentation m6aviewer Version 1.6.0 Documentation Contents 1. About 2. Requirements 3. Launching m6aviewer 4. Running Time Estimates 5. Basic Peak Calling 6. Running Modes 7. Multiple Samples/Sample Replicates 8.

More information

Lecture 12. Short read aligners

Lecture 12. Short read aligners Lecture 12 Short read aligners Ebola reference genome We will align ebola sequencing data against the 1976 Mayinga reference genome. We will hold the reference gnome and all indices: mkdir -p ~/reference/ebola

More information

Read mapping with BWA and BOWTIE

Read mapping with BWA and BOWTIE Read mapping with BWA and BOWTIE Before We Start In order to save a lot of typing, and to allow us some flexibility in designing these courses, we will establish a UNIX shell variable BASE to point to

More information

Services Performed. The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples.

Services Performed. The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples. Services Performed The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples. SERVICE Sample Received Sample Quality Evaluated Sample Prepared for Sequencing

More information

The software and data for the RNA-Seq exercise are already available on the USB system

The software and data for the RNA-Seq exercise are already available on the USB system BIT815 Notes on R analysis of RNA-seq data The software and data for the RNA-Seq exercise are already available on the USB system The notes below regarding installation of R packages and other software

More information

Genomic Files. University of Massachusetts Medical School. October, 2015

Genomic Files. University of Massachusetts Medical School. October, 2015 .. Genomic Files University of Massachusetts Medical School October, 2015 2 / 55. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further

More information

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub trinityrnaseq / RNASeq_Trinity_Tuxedo_Workshop Trinity De novo Transcriptome Assembly Workshop Brian Haas edited this page on Oct 17, 2015 14 revisions De novo RNA-Seq Assembly and Analysis Using Trinity

More information

TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq

TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq TECH NOTE Improving the Sensitivity of Ultra Low Input mrna Seq SMART Seq v4 Ultra Low Input RNA Kit for Sequencing Powered by SMART and LNA technologies: Locked nucleic acid technology significantly improves

More information

Identiyfing splice junctions from RNA-Seq data

Identiyfing splice junctions from RNA-Seq data Identiyfing splice junctions from RNA-Seq data Joseph K. Pickrell pickrell@uchicago.edu October 4, 2010 Contents 1 Motivation 2 2 Identification of potential junction-spanning reads 2 3 Calling splice

More information

Analyzing ChIP- Seq Data in Galaxy

Analyzing ChIP- Seq Data in Galaxy Analyzing ChIP- Seq Data in Galaxy Lauren Mills RISS ABSTRACT Step- by- step guide to basic ChIP- Seq analysis using the Galaxy platform. Table of Contents Introduction... 3 Links to helpful information...

More information

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14) BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14) Genome Informatics (Part 1) https://bioboot.github.io/bggn213_f17/lectures/#14 Dr. Barry Grant Nov 2017 Overview: The purpose of this lab session is

More information

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu Matt Huska Freie Universität Berlin Computational Methods for High-Throughput Omics

More information

Tutorial 1: Exploring the UCSC Genome Browser

Tutorial 1: Exploring the UCSC Genome Browser Last updated: May 12, 2011 Tutorial 1: Exploring the UCSC Genome Browser Open the homepage of the UCSC Genome Browser at: http://genome.ucsc.edu/ In the blue bar at the top, click on the Genomes link.

More information

Unix tutorial, tome 5: deep-sequencing data analysis

Unix tutorial, tome 5: deep-sequencing data analysis Unix tutorial, tome 5: deep-sequencing data analysis by Hervé December 8, 2008 Contents 1 Input files 2 2 Data extraction 3 2.1 Overview, implicit assumptions.............................. 3 2.2 Usage............................................

More information

Galaxy workshop at the Winter School Igor Makunin

Galaxy workshop at the Winter School Igor Makunin Galaxy workshop at the Winter School 2016 Igor Makunin i.makunin@uq.edu.au Winter school, UQ, July 6, 2016 Plan Overview of the Genomics Virtual Lab Introduce Galaxy, a web based platform for analysis

More information

RNA-Seq Analysis With the Tuxedo Suite

RNA-Seq Analysis With the Tuxedo Suite June 2016 RNA-Seq Analysis With the Tuxedo Suite Dena Leshkowitz Introduction In this exercise we will learn how to analyse RNA-Seq data using the Tuxedo Suite tools: Tophat, Cuffmerge, Cufflinks and Cuffdiff.

More information

Easy visualization of the read coverage using the CoverageView package

Easy visualization of the read coverage using the CoverageView package Easy visualization of the read coverage using the CoverageView package Ernesto Lowy European Bioinformatics Institute EMBL June 13, 2018 > options(width=40) > library(coverageview) 1 Introduction This

More information

Goal: Learn how to use various tool to extract information from RNAseq reads.

Goal: Learn how to use various tool to extract information from RNAseq reads. ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2017 Class 4 RNAseq Goal: Learn how to use various tool to extract information from RNAseq reads. Input(s): Output(s): magnaporthe_oryzae_70-15_8_supercontigs.fasta

More information

ChIP-seq Analysis Practical

ChIP-seq Analysis Practical ChIP-seq Analysis Practical Vladimir Teif (vteif@essex.ac.uk) An updated version of this document will be available at http://generegulation.info/index.php/teaching In this practical we will learn how

More information

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010 Next generation sequencing: assembly by mapping reads Laurent Falquet, Vital-IT Helsinki, June 3, 2010 Overview What is assembly by mapping? Methods BWT File formats Tools Issues Visualization Discussion

More information

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013 ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013 1. Data and objectives We will use the data from GEO (GSE35368, Toedling, Servant et al. 2011). Two samples were

More information

Single/paired-end RNAseq analysis with Galaxy

Single/paired-end RNAseq analysis with Galaxy October 016 Single/paired-end RNAseq analysis with Galaxy Contents: 1. Introduction. Quality control 3. Alignment 4. Normalization and read counts 5. Workflow overview 6. Sample data set to test the paired-end

More information

ChIP-Seq Tutorial on Galaxy

ChIP-Seq Tutorial on Galaxy 1 Introduction ChIP-Seq Tutorial on Galaxy 2 December 2010 (modified April 6, 2017) Rory Stark The aim of this practical is to give you some experience handling ChIP-Seq data. We will be working with data

More information

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data Table of Contents Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification

More information

Advanced UCSC Browser Functions

Advanced UCSC Browser Functions Advanced UCSC Browser Functions Dr. Thomas Randall tarandal@email.unc.edu bioinformatics.unc.edu UCSC Browser: genome.ucsc.edu Overview Custom Tracks adding your own datasets Utilities custom tools for

More information

Eval: A Gene Set Comparison System

Eval: A Gene Set Comparison System Masters Project Report Eval: A Gene Set Comparison System Evan Keibler evan@cse.wustl.edu Table of Contents Table of Contents... - 2 - Chapter 1: Introduction... - 5-1.1 Gene Structure... - 5-1.2 Gene

More information

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Data used in the exercise We will use D. melanogaster WGS paired-end Illumina data with NCBI accessions

More information

Under the Hood of Alignment Algorithms for NGS Researchers

Under the Hood of Alignment Algorithms for NGS Researchers Under the Hood of Alignment Algorithms for NGS Researchers April 16, 2014 Gabe Rudy VP of Product Development Golden Helix Questions during the presentation Use the Questions pane in your GoToWebinar window

More information

NGS Data Visualization and Exploration Using IGV

NGS Data Visualization and Exploration Using IGV 1 What is Galaxy Galaxy for Bioinformaticians Galaxy for Experimental Biologists Using Galaxy for NGS Analysis NGS Data Visualization and Exploration Using IGV 2 What is Galaxy Galaxy for Bioinformaticians

More information

Tiling Assembly for Annotation-independent Novel Gene Discovery

Tiling Assembly for Annotation-independent Novel Gene Discovery Tiling Assembly for Annotation-independent Novel Gene Discovery By Jennifer Lopez and Kenneth Watanabe Last edited on September 7, 2015 by Kenneth Watanabe The following procedure explains how to run the

More information

Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat a.tgz. Software:

Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat a.tgz. Software: A Tutorial: De novo RNA- Seq Assembly and Analysis Using Trinity and edger The following data and software resources are required for following the tutorial: Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat

More information

Bioinformatics for High-throughput Sequencing

Bioinformatics for High-throughput Sequencing Bioinformatics for High-throughput Sequencing An Overview Simon Anders EBI is an Outstation of the European Molecular Biology Laboratory. Overview In recent years, new sequencing schemes, also called high-throughput

More information

Rsubread package: high-performance read alignment, quantification and mutation discovery

Rsubread package: high-performance read alignment, quantification and mutation discovery Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For

More information

Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6

Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6 Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6 The goal of this exercise is to retrieve an RNA-seq dataset in FASTQ format and run it through an RNA-sequence analysis

More information

Package Rsubread. July 21, 2013

Package Rsubread. July 21, 2013 Package Rsubread July 21, 2013 Type Package Title Rsubread: an R package for the alignment, summarization and analyses of next-generation sequencing data Version 1.10.5 Author Wei Shi and Yang Liao with

More information

INTRODUCTION AUX FORMATS DE FICHIERS

INTRODUCTION AUX FORMATS DE FICHIERS INTRODUCTION AUX FORMATS DE FICHIERS Plan. Formats de séquences brutes.. Format fasta.2. Format fastq 2. Formats d alignements 2.. Format SAM 2.2. Format BAM 4. Format «Variant Calling» 4.. Format Varscan

More information

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013 RNAseq analysis: SNP calling BTI bioinformatics course, spring 2013 RNAseq overview RNAseq overview Choose technology 454 Illumina SOLiD 3 rd generation (Ion Torrent, PacBio) Library types Single reads

More information

File Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

File Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 File Formats: SAM, BAM, and CRAM UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 / BAM / CRAM NEW! http://samtools.sourceforge.net/ - deprecated! http://www.htslib.org/ - SAMtools 1.0 and

More information

Rsubread package: high-performance read alignment, quantification and mutation discovery

Rsubread package: high-performance read alignment, quantification and mutation discovery Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For

More information

Differential gene expression analysis using RNA-seq

Differential gene expression analysis using RNA-seq https://abc.med.cornell.edu/ Differential gene expression analysis using RNA-seq Applied Bioinformatics Core, September/October 2018 Friederike Dündar with Luce Skrabanek & Paul Zumbo Day 3: Counting reads

More information

Our data for today is a small subset of Saimaa ringed seal RNA sequencing data (RNA_seq_reads.fasta). Let s first see how many reads are there:

Our data for today is a small subset of Saimaa ringed seal RNA sequencing data (RNA_seq_reads.fasta). Let s first see how many reads are there: Practical Course in Genome Bioinformatics 19.2.2016 (CORRECTED 22.2.2016) Exercises - Day 5 http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2016/ Answer the 5 questions (Q1-Q5) according

More information

NGS Data Analysis. Roberto Preste

NGS Data Analysis. Roberto Preste NGS Data Analysis Roberto Preste 1 Useful info http://bit.ly/2r1y2dr Contacts: roberto.preste@gmail.com Slides: http://bit.ly/ngs-data 2 NGS data analysis Overview 3 NGS Data Analysis: the basic idea http://bit.ly/2r1y2dr

More information

Aligning reads: tools and theory

Aligning reads: tools and theory Aligning reads: tools and theory Genome Sequence read :LM-Mel-14neg :LM-Mel-42neg :LM-Mel-14neg :LM-Mel-14pos :LM-Mel-42neg :LM-Mel-14neg :LM-Mel-42neg :LM-Mel-14neg chrx: 152139280 152139290 152139300

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3

More information

How to store and visualize RNA-seq data

How to store and visualize RNA-seq data How to store and visualize RNA-seq data Gabriella Rustici Functional Genomics Group gabry@ebi.ac.uk EBI is an Outstation of the European Molecular Biology Laboratory. Talk summary How do we archive RNA-seq

More information

USING BRAT UPDATES 2 SYSTEM AND SPACE REQUIREMENTS

USING BRAT UPDATES 2 SYSTEM AND SPACE REQUIREMENTS USIN BR-1.1.17 1 UPDES In version 1.1.17, we fixed a bug in acgt-count: in the previous versions it had only option -s to accept the file with names of the files with mapping results of single-end reads;

More information

Variant calling using SAMtools

Variant calling using SAMtools Variant calling using SAMtools Calling variants - a trivial use of an Interactive Session We are going to conduct the variant calling exercises in an interactive idev session just so you can get a feel

More information

NGS FASTQ file format

NGS FASTQ file format NGS FASTQ file format Line1: Begins with @ and followed by a sequence idenefier and opeonal descripeon Line2: Raw sequence leiers Line3: + Line4: Encodes the quality values for the sequence in Line2 (see

More information

USING BRAT ANALYSIS PIPELINE

USING BRAT ANALYSIS PIPELINE USIN BR-1.2.3 his new version has a new tool convert-to-sam that converts BR format to SM format. Please use this program as needed after remove-dupl in the pipeline below. 1 NLYSIS PIPELINE urrently BR

More information

Performing a resequencing assembly

Performing a resequencing assembly BioNumerics Tutorial: Performing a resequencing assembly 1 Aim In this tutorial, we will discuss the different options to obtain statistics about the sequence read set data and assess the quality, and

More information

Peter Schweitzer, Director, DNA Sequencing and Genotyping Lab

Peter Schweitzer, Director, DNA Sequencing and Genotyping Lab The instruments, the runs, the QC metrics, and the output Peter Schweitzer, Director, DNA Sequencing and Genotyping Lab Overview Roche/454 GS-FLX 454 (GSRunbrowser information) Evaluating run results Errors

More information

ALGORITHM USER GUIDE FOR RVD

ALGORITHM USER GUIDE FOR RVD ALGORITHM USER GUIDE FOR RVD The RVD program takes BAM files of deep sequencing reads in as input. Using a Beta-Binomial model, the algorithm estimates the error rate at each base position in the reference

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the

More information

Mapping Reads to Reference Genome

Mapping Reads to Reference Genome Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene

More information

HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)

HIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu) HIPPIE User Manual (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu) OVERVIEW OF HIPPIE o Flowchart of HIPPIE o Requirements PREPARE DIRECTORY STRUCTURE FOR HIPPIE EXECUTION o

More information

The preseq Manual. Timothy Daley Victoria Helus Andrew Smith. January 17, 2014

The preseq Manual. Timothy Daley Victoria Helus Andrew Smith. January 17, 2014 The preseq Manual Timothy Daley Victoria Helus Andrew Smith January 17, 2014 Contents 1 Quick Start 2 2 Installation 3 3 Using preseq 4 4 File Format 5 5 Detailed usage 6 6 lc extrap Examples 8 7 preseq

More information

EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1

EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1 EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1 Introduction This guide contains data analysis recommendations for libraries prepared using Epicentre s EpiGnome Methyl Seq Kit, and sequenced on

More information