Seminar III: R/Bioconductor


Leonardo Collado Torres, Bachelor in Genomic Sciences, August - December

Class outline
Working with HTS data: a simulated case study
Sections: Intro, R for scripts, BLAST, Velvet, Bowtie, Work

Intro
:O Prepare yourself!

Intro: About
The idea is to learn how to use R as a scripting language to call external programs such as BLAST, Velvet and Bowtie. We'll run these programs with as many default options as we can :)

Intro: For today you'll need
formatdb
blastall [1]
Velvet
Bowtie
Of course, some Bioconductor:
> source("
> biocLite(c("chipseq"))
And a Linux or UNIX environment :)
[1] With Ubuntu use: sudo apt-get install blast2 and voilà :)

Intro: External programs
As you know, BLAST is widely used and very useful for finding local alignments. Velvet is a great program for assembling short reads into contigs. Bowtie is great for aligning short reads to a reference genome.

R for scripts: Running R scripts
Thanks to functions like system, you can use R as your scripting language. Of course, a lot of people prefer to use the shell directly. Using R can be useful for making some plots on the fly, and proc.time helps us track the time spent running our script.
You can either use:
R CMD BATCH file.r
or, as a shortcut:
Rscript file.r > file.log
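A minimal sketch of the idea above, assuming a UNIX-like shell where echo is available. The echo commands are only stand-ins for a real external program.

```r
# Record the time at the start of the script.
start_time <- proc.time()

# system() runs a shell command. With intern = TRUE the command's output
# is returned as a character vector instead of being printed.
# "echo" is only a placeholder for a real external program.
output <- system("echo hello from the shell", intern = TRUE)
print(output)

# With intern = FALSE, system() returns the exit status (0 means success).
status <- system("echo done", intern = FALSE)

# Wall-clock seconds spent since the start of the script.
elapsed <- (proc.time() - start_time)[["elapsed"]]
print(elapsed)
```

Saved as file.r, a script like this could be run with either of the two commands shown on the slide.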

R for scripts: Use paste
To build system calls, the paste function with the sep or collapse arguments is quite useful:
> args <- c(1, 2)
> call <- paste("-arg1", args[1],
+     "-arg2", args[2], sep = " ")
> print(call)
[1] "-arg1 1 -arg2 2"
> call2 <- paste(c("-args", args),
+     collapse = " ")
> print(call2)
[1] "-args 1 2"
Using a print coupled to a system.time can be useful for slow commands.
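One way to couple print and system.time, as suggested above. This is only a sketch: "echo run" is a hypothetical stand-in for a slow external program, and a UNIX-like shell is assumed.

```r
args <- c(1, 2)

# Build the full command; "echo run" is a placeholder for a real program.
call <- paste("echo", "run", "-arg1", args[1], "-arg2", args[2], sep = " ")

# Print the call first, so the log shows what is about to run.
print(call)

# system.time() reports how long the wrapped expression took.
timing <- system.time(output <- system(call, intern = TRUE))
print(timing)
```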

BLAST: Command line
You all know how it works, and have run it through the web interface.
To run BLAST on the command line you mainly need two programs:
1. formatdb: builds the database (targets)
2. blastall: actually runs BLAST

BLAST: formatdb
Main arguments:
-i: the input file
-p: the type of database. Use T for proteins or F for nucleotides.
-n: the output name, meaning the name of the database.
Optional ones I use:
-t: the title
-l: the log file name
-V: to check the names of the targets use V
For more info check:
formatdb --help on the terminal
formatdb.shtml
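The arguments above can be assembled from R with paste. This is a sketch under assumptions: build_formatdb_call is a hypothetical helper, and genomes.fa, genomes_db and formatdb.log are made-up file names.

```r
# Hypothetical helper: assemble a formatdb call from the flags listed above.
build_formatdb_call <- function(input, db_name, protein = FALSE,
                                title = NULL, log_file = NULL) {
  call <- paste("formatdb", "-i", input,
                "-p", if (protein) "T" else "F",
                "-n", db_name)
  # Optional flags are appended only when a value is supplied.
  if (!is.null(title)) call <- paste(call, "-t", shQuote(title))
  if (!is.null(log_file)) call <- paste(call, "-l", log_file)
  call
}

# Made-up file names for a nucleotide database.
formatdb_call <- build_formatdb_call("genomes.fa", "genomes_db",
                                     log_file = "formatdb.log")
print(formatdb_call)

# Run it only when formatdb is installed and the input file exists.
if (nzchar(Sys.which("formatdb")) && file.exists("genomes.fa")) {
  system(formatdb_call)
}
```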

BLAST: blastall
Main arguments:
-p: the type of BLAST to be run: blastp, blastn, ...
-d: the database name [2]
-i: the input file name
-o: the output file name
Optional ones I use:
-e: the maximum e-value allowed for the output file
-m: the format of the output file. I like format 8 :)
For more info check:
blastall --help on the terminal
BLAST_blastall.shtml
[2] For custom dbs, use the path to the db.
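A matching sketch for blastall, again with a hypothetical helper and made-up file names. The e-value cutoff shown is only an illustration, not a recommendation from the slides.

```r
# Hypothetical helper: assemble a blastall call from the flags listed above.
build_blastall_call <- function(program, db, input, output,
                                evalue = NULL, format = NULL) {
  call <- paste("blastall", "-p", program, "-d", db,
                "-i", input, "-o", output)
  # Optional flags are appended only when a value is supplied.
  if (!is.null(evalue)) call <- paste(call, "-e", evalue)
  if (!is.null(format)) call <- paste(call, "-m", format)
  call
}

# Made-up file names; the e-value is passed as text to keep it as typed.
blastall_call <- build_blastall_call("blastn", "genomes_db", "contigs.fa",
                                     "contigs_vs_genomes.txt",
                                     evalue = "1e-5", format = 8)
print(blastall_call)

# Run it only when blastall is installed and the query file exists.
if (nzchar(Sys.which("blastall")) && file.exists("contigs.fa")) {
  system(blastall_call)
}
```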

Velvet: Quick intro
Published in 2008, Velvet is the most popular de novo genome assembler for short reads such as those generated by Illumina. It's based on de Bruijn graphs and its most important parameter is the k-mer length, which is similar to the BLAST word size.
For more info check the paper.
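To make the k-mer length concrete: a de Bruijn graph assembler works on the overlapping words of length k found in each read. A small base R sketch follows; kmers is a hypothetical helper and is not part of Velvet.

```r
# Hypothetical helper: list every overlapping k-mer of a sequence.
kmers <- function(sequence, k) {
  n <- nchar(sequence)
  # A sequence shorter than k contains no k-mers.
  if (k > n) return(character(0))
  starts <- seq_len(n - k + 1)
  substring(sequence, starts, starts + k - 1)
}

# A 6 bp toy read contains 6 - 3 + 1 = 4 k-mers of length 3.
print(kmers("ACGTAC", 3))
```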

Velvet: velveth
In order to use Velvet we first need to run velveth and specify the:
output dir: first value (without any flag)
k-mer length: an integer, up to a maximum value [3]
input file format: main options are fasta and fastq
type of data: mainly either short or long
input file name
For more info type velveth or check the Velvet manual.
[3] The lower the value, the slower it runs.
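The velveth values listed above go in that order on the command line. A sketch with a hypothetical helper follows; the directory name, file name and k-mer length of 21 are assumptions chosen only for illustration.

```r
# Hypothetical helper: output dir, k-mer length, file format, read type, file.
build_velveth_call <- function(out_dir, kmer, input,
                               format = "fasta", read_type = "short") {
  paste("velveth", out_dir, kmer,
        paste0("-", format), paste0("-", read_type), input)
}

# Made-up directory and file names.
velveth_call <- build_velveth_call("velvet_out", 21, "reads.fa")
print(velveth_call)

# Run it only when velveth is installed and the reads file exists.
if (nzchar(Sys.which("velveth")) && file.exists("reads.fa")) {
  system(velveth_call)
}
```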

Velvet: velvetg
After running velveth we can run velvetg one or more times on the same directory. Velvetg actually runs Velvet and creates the contigs. To run it type velvetg specifying:
the output dir: again, the first unflagged value
some filtering or output options such as min_contig_lgth
For more info type velvetg or check the Velvet manual.
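Since velvetg can be run several times on the same directory, the calls can be generated in a loop. This is a sketch with a hypothetical helper; the directory name and the two cutoffs are made up.

```r
# Hypothetical helper: the output dir first, then optional filtering flags.
build_velvetg_call <- function(out_dir, min_contig_lgth = NULL) {
  call <- paste("velvetg", out_dir)
  if (!is.null(min_contig_lgth)) {
    call <- paste(call, "-min_contig_lgth", min_contig_lgth)
  }
  call
}

# One call per minimum contig length; the cutoffs are illustrative only.
cutoffs <- c(100, 200)
velvetg_calls <- vapply(cutoffs, function(cutoff) {
  build_velvetg_call("velvet_out", min_contig_lgth = cutoff)
}, character(1))
print(velvetg_calls)

# Velvet writes the assembled contigs to contigs.fa inside the output dir.
contigs_file <- file.path("velvet_out", "contigs.fa")
```

Each run overwrites contigs.fa, so copy the file elsewhere before trying the next cutoff if you want to compare results.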

Bowtie: Quick intro
Bowtie is a second generation [4] short read aligner that is VERY fast. It's based on the Burrows-Wheeler Transform (BWT), like other fast aligners. It therefore builds an index [5] of the reference genome, which speeds up the process. It's very well maintained; for more info check the homepage and the related paper :)
[4] If you consider MAQ to be the first generation.
[5] Similar to the BLAST database.

Bowtie: bowtie-build
It's very simple to use :) Just specify the input file [6] and the output name for the index. After building the index, move the output files [7] into PathToBowtie/indexes/
For more info type: bowtie-build -h
[6] In FASTA format.
[7] Yup, a few are created.
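Both steps above can be scripted. This is a sketch under assumptions: move_index_files is a hypothetical helper, ref.fa and IndexName are placeholder names, and the index files are assumed to end in .ebwt as Bowtie 1 writes them.

```r
# Build the bowtie-build call: the input FASTA file, then the index name.
bowtie_build_call <- paste("bowtie-build", "ref.fa", "IndexName")
print(bowtie_build_call)

# Hypothetical helper: move all index files of one index to another directory.
# Assumption: index files are named <index_name>.<something>.ebwt
move_index_files <- function(index_name, from_dir, to_dir) {
  files <- list.files(from_dir,
                      pattern = paste0("^", index_name, "\\..*ebwt$"))
  moved <- file.rename(file.path(from_dir, files), file.path(to_dir, files))
  # Return the names of the files that were actually moved.
  files[moved]
}

# Run only when bowtie-build is installed and the input file exists.
if (nzchar(Sys.which("bowtie-build")) && file.exists("ref.fa")) {
  system(bowtie_build_call)
  move_index_files("IndexName", ".", "PathToBowtie/indexes")
}
```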

Bowtie: bowtie
After building your index, a quick way to check it is to type:
bowtie -c IndexName GCGTGAGCTATGAGAAAGCGCCACGCTTCC
Then to run Bowtie I normally use the following arguments:
-f: the input file is in FASTA format (give its name after the index name)
--all [8]: to force Bowtie to find all the alignments
--al: the output name for the FASTA file with the reads that were aligned
--un: the output name for the reads that did not align
Other useful arguments are -m and --max.
For more info type bowtie -h or check the manual.
[8] Obviously increases the time quite a bit on real cases.
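The arguments above can be collected in a vector and joined with paste and collapse. This is a sketch: build_bowtie_call is a hypothetical helper and every file name is made up. The index name, the reads file and the output file come after the options.

```r
# Hypothetical helper: assemble a bowtie call for FASTA reads.
build_bowtie_call <- function(index, reads, output,
                              aligned = NULL, unaligned = NULL,
                              all = FALSE) {
  options <- "-f"
  if (all) options <- c(options, "--all")
  if (!is.null(aligned)) options <- c(options, "--al", aligned)
  if (!is.null(unaligned)) options <- c(options, "--un", unaligned)
  # Options first, then the index name, the reads file and the output file.
  paste(c("bowtie", options, index, reads, output), collapse = " ")
}

# Made-up file names.
bowtie_call <- build_bowtie_call("IndexName", "reads.fa", "alignments.txt",
                                 aligned = "aligned.fa",
                                 unaligned = "unaligned.fa",
                                 all = TRUE)
print(bowtie_call)

# Run it only when bowtie is installed and the reads file exists.
if (nzchar(Sys.which("bowtie")) && file.exists("reads.fa")) {
  print(system.time(system(bowtie_call)))
}
```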

Work: Data and problem to solve
I generated 18 sets of 70 thousand 50 bp reads. One set per student ;) [9]
Imagine that these sequences come from a genome related to our species of interest. We want to find variation signatures such as deletions, inversions and duplications. Always be open to fishy stuff!
[9] To find out which one is yours, use the order from Usuarios at Cursos. For example, Fonseca is number 4 and Zepeda Martinez is ...

Work: Part I
We don't know the name of our species of interest!!! Find it out by building contigs and aligning them versus all known genomes (nucleotides). Explore [10] the reads that were not used to build the contigs. Conclude, remark, etc.
[10] Check the files, check the alphabet by cycle frequency, ...
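One way to look at the alphabet by cycle frequency is to count each base at each read position. The base R sketch below uses a hypothetical helper and three made-up 4 bp reads, and it assumes all reads have the same length.

```r
# Hypothetical helper: rows are letters, columns are cycles (read positions).
alphabet_by_cycle <- function(reads, alphabet = c("A", "C", "G", "T", "N")) {
  # All reads must have the same length.
  stopifnot(length(unique(nchar(reads))) == 1)
  # One row per read, one column per cycle.
  bases <- do.call(rbind, strsplit(reads, ""))
  counts <- vapply(seq_len(ncol(bases)), function(cycle) {
    as.integer(table(factor(bases[, cycle], levels = alphabet)))
  }, integer(length(alphabet)))
  rownames(counts) <- alphabet
  counts
}

# Three made-up reads.
counts <- alphabet_by_cycle(c("ACGT", "ACGA", "TCGT"))
print(counts)
```

A strong bias towards one letter at some cycle, or many N calls, would be the kind of fishy stuff worth a closer look.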

Work: Part II
How many protein coding genes did we cover at 90% or greater identity and 90% or greater query coverage? You will need to download the FASTA file with the sequences of those genes. Easy to do with the GenBank identifier :) Conclude, remark, etc.
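A possible way to apply both cutoffs is sketched below. It assumes the genes were used as BLAST queries with tabular output (-m 8, twelve tab-separated columns) and that the gene lengths are known. The three hits and the lengths are toy data, and the column names are my own labels.

```r
# Labels for the twelve columns of the -m 8 tabular output.
m8_columns <- c("query", "subject", "identity", "length", "mismatches",
                "gaps", "q_start", "q_end", "s_start", "s_end",
                "evalue", "bitscore")

# Toy hits, made up for illustration only.
toy_output <- c(
  "gene1\tcontig1\t95.00\t95\t5\t0\t1\t95\t10\t104\t1e-40\t150",
  "gene2\tcontig2\t99.00\t100\t1\t0\t1\t100\t1\t100\t1e-50\t190",
  "gene3\tcontig3\t85.00\t150\t22\t0\t1\t150\t5\t154\t1e-30\t120"
)
hits <- read.table(text = toy_output, sep = "\t",
                   col.names = m8_columns, stringsAsFactors = FALSE)

# Toy gene lengths, which in practice come from the genes' FASTA file.
query_lengths <- c(gene1 = 100, gene2 = 200, gene3 = 150)

# Query coverage: percentage of the gene spanned by the alignment.
hits$coverage <- 100 * (abs(hits$q_end - hits$q_start) + 1) /
  query_lengths[hits$query]

# Genes with at least one hit passing both cutoffs.
covered <- unique(hits$query[hits$identity >= 90 & hits$coverage >= 90])
print(covered)
print(length(covered))
```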

Work: Part III
Align the reads versus the reference genome of our species of interest. Explore and compare the reads that align more than once and those that align only once. Identify the number of deletions, duplications and inversions. Plots like coverageplot, densityplot and stripplot will be most useful. To use them, re-check the chipseq workflow :) Make some example plots, and for the latter two try to make plots spanning the whole genome [11]. Conclude, remark, etc.
[11] Only where you have reads mapped to it.
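Splitting reads into those that align once and those that align more than once can be done by counting the alignments per read name. This is a sketch on a made-up table; the column names read and position are assumptions about how you load your alignments.

```r
# Made-up alignments: one row per reported alignment.
alignments <- data.frame(
  read = c("read1", "read2", "read2", "read3", "read4", "read4", "read4"),
  position = c(10, 250, 900, 42, 5, 300, 700),
  stringsAsFactors = FALSE
)

# Count how many alignments each read has.
hits_per_read <- table(alignments$read)

# Reads aligned exactly once versus reads aligned more than once.
unique_reads <- names(hits_per_read)[hits_per_read == 1]
multi_reads <- names(hits_per_read)[hits_per_read > 1]
print(unique_reads)
print(multi_reads)

# Keep the alignments of each group for separate exploration.
unique_alignments <- alignments[alignments$read %in% unique_reads, ]
multi_alignments <- alignments[alignments$read %in% multi_reads, ]
```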

Work: Optional parts
Using the chipseq workflow, explore only those reads that map to more than one spot. Plot the reads using GenomeGraphs and add boxes for every known gene. Try to pinpoint the exact deleted, duplicated and/or inverted bases, especially the breakpoints.

Work: Time to work!
Once you are done, let me know and I'll upload all files related to your case :) Compare your conclusions with files such as segments.txt and explore the fig folder. The ref.fa file is the actual reference genome from which I got the 70k reads. Feel free to map your reads to it; some cannot be uniquely aligned! Once everyone is done, I'll upload the fastagen.r script that created all the data.

Work: SessionInfo
> sessionInfo()
R version ( )
i686-pc-linux-gnu
locale:
 [1] LC_CTYPE=en_US.UTF-8
 [2] LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8
 [4] LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C
 [6] LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8
 [8] LC_NAME=C
 [9] LC_ADDRESS=C
[10] LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8
[12] LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices
[4] utils datasets methods
[7] base
