Next-Generation Sequencing applied to aDNA

Hands-on session
June 13, 2014

Ludovic Orlando - Lorlando@snm.ku.dk
Mikkel Schubert - MSchubert@snm.ku.dk
Aurélien Ginolhac - AGinolhac@snm.ku.dk
Hákon Jónsson - jonsson.hakon@gmail.com

Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen

1 Introduction and outline

The exercise will consist of the following parts:

1. Run the example PALEOMIX projects
   (a) Process and align synthetic data against the human mitochondrial genome
   (b) Generate a phylogeny based on protein-coding genes
2. Set up and run the BAM pipeline for modern and ancient horses
   (a) Examine the results from the mapping
   (b) Examine patterns of post-mortem damage in the ancient sample
   (c) Optimize the procedure for the ancient sample
3. Set up and run the Phylogenetic pipeline for modern and ancient horses
   (a) Visualize the resulting phylogeny

2 Pre-requisites

Before attempting to complete any of the analyses in this exercise, it is necessary to source the following script in order to set up the environment variables needed by the pipelines:

    source /home/local/27626/exercises/paleomix/setup_env
    set MY_HOME="/home/local/ngs_course/${USER}"

If using BASH, instead use the following commands (if in doubt, just do both):

    source /home/local/27626/exercises/paleomix/setup_env.sh
    MY_HOME="/home/local/ngs_course/${USER}"

3 Running the example projects

The example project consists of synthetic data generated using primate mitochondrial sequences, which is mapped onto the revised Cambridge Reference Sequence (rCRS). Copy the example project to your home folder and run the read processing and mapping steps using the following commands:

    mkdir $MY_HOME/my_ancient_dna
    cd $MY_HOME/my_ancient_dna
    cp -a /home/local/27626/exercises/paleomix/examples/phylo_pipeline .
    cd phylo_pipeline/alignment
    nice bam_pipeline run 000_makefile.yaml

The resulting files are located in the same folder; consider the following files (for the Orangutan mitochondrial genome):

sumatran_orangutan.rcrs.realigned.bam
    The Orangutan reads aligned against the human mitochondrial genome; the .realigned postfix signifies that local realignment has been carried out around indels.

sumatran_orangutan.rcrs.coverage
    Table of average coverages for each chromosome / contig.

sumatran_orangutan.rcrs.depths
    Table of depth-of-coverage histograms for each chromosome / contig.

sumatran_orangutan.summary
    Summary of the alignment(s), including information about read trimming, the percentage of reads mapped, and the fraction of reads filtered as PCR duplicates.

sumatran_orangutan/
    This folder contains trimmed reads (organized as in the makefile) and other intermediate files; these are often useful in other analyses, but may be deleted to save space.

Next, genotype the coding genes on the mitochondrial genome and build a phylogeny using the following commands. This will generate a maximum likelihood phylogeny with 10 bootstraps (reduced to save time for the exercise).

    cd $MY_HOME/my_ancient_dna/phylo_pipeline/phylogeny
    nice phylo_pipeline genotype+msa+phylogeny 000_makefile.yaml

The results are located in subfolders of the results/exampleproject folder:

genotypes/
    (Filtered) genotyping calls in VCF format, and FASTA sequences generated for each region of interest (i.e. gene), for each sample.

alignments/
    Multiple sequence alignments for each region of interest (i.e. gene); the folder contains both the unaligned sequences (*.fasta) and the aligned sequences (*.afa).

phylogenies/
    Super-matrices (combined multiple-sequence alignments) used in, and Newick trees resulting from, the phylogenetic inference.

The resulting phylogeny can be visualized using the nw_display command:

    cd results/exampleproject/phylogenies/proteincodinggenes/
    nw_display replicates.support.newick

4 Set up and run the BAM pipeline

In the following, we will map the sequencing data derived from four horses; three modern and one pre-historic. To keep things manageable, we are restricting ourselves to 1 Mb of horse

chromosome 1 (namely chr1:10,000,000-11,000,000), and have generated a set of FASTQ files containing only reads that map to this region, to avoid having to map hundreds of GB of reads for this exercise.

Create a project directory for the analyses; the following file structure is used as it makes subsequent steps simpler:

    cd $MY_HOME/my_ancient_dna
    mkdir -p horses/alignments
    cd horses/alignments

Copy the data used for this exercise (sequencing reads and a FASTA sequence for part of chromosome 1):

    cp -a /home/local/27626/exercises/paleomix/data/alignments/* .

Create an empty makefile:

    bam_pipeline mkfile > makefile.yaml

The makefile is specified using YAML, a human-readable markup language that is visually similar to Python code. In other words, the structure is defined using indentation. Note that tabs cannot be used when editing this file; always use spaces! A copy of the final makefile (final_makefile.yaml) was included in the data you copied, which may be used for comparison with your own.

Open the makefile.yaml file in your favorite editor and add the following lines to the end of the file:

    Przewalski:
      Przewalski:
        Library1:
          Lane1: reads/przewalski_r{Pair}.fastq.gz

The first line specifies that the name of this project is Przewalski. This means that all resulting files will start with Przewalski. The second line defines the sample name; this is used to tag the resulting alignment data, and is typically the same as the project name. The third line names a single library, which we have chosen to call Library1, while the fourth line names a single lane (i.e. one run on an NGS machine) and specifies the location of the de-multiplexed files. The {Pair} part of the path signifies that these are paired-end

reads, and is replaced with 1 and 2 by the pipeline, representing mate 1 and mate 2 reads respectively. A library can contain any number of lanes, and a sample can contain any number of libraries, and so forth, but in this example we have chosen to limit ourselves to a single lane of a single library.

Add each of the following samples below the entry for the Przewalski's horse, using the same structure and the given path for the lane:

    Standardbred:
      Standardbred:
        Library1:
          Lane1: reads/standardbred_r{Pair}.fastq.gz

    Donkey:
      Donkey:
        Library1:
          Lane1: reads/donkey_r{Pair}.fastq.gz

Next, add the following two lanes for the ancient horse (ThistleCreek); one paired-end and one single-end lane:

    ThistleCreek:
      ThistleCreek:
        Library1:
          Lane1: reads/thistlecreek_r{Pair}.fastq.gz
          Lane2: reads/thistlecreek.fastq.gz

Finally, update the Prefixes section earlier in the file, in order to specify which FASTA files we will be mapping against. Here we map against a fragment of chromosome 1, which we choose to call EquCab20Chr1frag; this name will be used in the resulting files. The pipeline will take care of indexing the reference using the chosen aligner (BWA by default). The FASTA file was copied along with the data above. Note that we intentionally use the name of the FASTA file (without the .fasta extension) for the prefix; this is expected by the second pipeline:

    Prefixes:
      EquCab20Chr1frag:
        Path: prefixes/equcab20chr1frag.fasta
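Before running the alignment, it can be worth checking the two things in the makefile that most easily go wrong: stray tab characters, and lane paths that match no file once {Pair} has been replaced. The following is a minimal sketch of such a check in plain shell. It is not part of PALEOMIX; the function names has_tabs and expand_pair are made up for this illustration, and expand_pair only mimics the 1 / 2 substitution described above.

```shell
#!/bin/sh
# Hypothetical helpers for checking a makefile by hand (not provided by PALEOMIX).

# has_tabs FILE
# Succeeds (exit status 0) if FILE contains at least one tab character.
has_tabs() {
    tab=$(printf '\t')
    grep -q "$tab" "$1"
}

# expand_pair TEMPLATE
# Prints the two paths obtained by replacing {Pair} with 1 and 2,
# i.e. the mate 1 and mate 2 files the text says the pipeline will look for.
expand_pair() {
    for mate in 1 2; do
        printf '%s\n' "$1" | sed "s/{Pair}/$mate/g"
    done
}

# Show which files a lane entry refers to (path taken from the makefile above).
expand_pair 'reads/przewalski_r{Pair}.fastq.gz'
```

From the folder containing makefile.yaml, the helpers could then be used along these lines: `has_tabs makefile.yaml && echo "tabs found, use spaces instead"`, and `for f in $(expand_pair 'reads/przewalski_r{Pair}.fastq.gz'); do [ -e "$f" ] || echo "missing: $f"; done`. The loop assumes paths without spaces, which holds for the files used in this exercise.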

Run the alignment:

    nice bam_pipeline run makefile.yaml

A mapDamage profile is generated for each library in each sample by default, including a plot of the post-mortem damage patterns (Fragmisincorporation_plot.pdf); compare the ancient sample with any of the modern samples, for example:

    ThistleCreek.EquCab20Chr1frag.mapDamage/Library1/Fragmisincorporation_plot.pdf
    Standardbred.EquCab20Chr1frag.mapDamage/Library1/Fragmisincorporation_plot.pdf

Also compare the read length distributions:

    ThistleCreek.EquCab20Chr1frag.mapDamage/Library1/Length_plot.pdf
    Standardbred.EquCab20Chr1frag.mapDamage/Library1/Length_plot.pdf

Finally, while working on the ancient ThistleCreek sample we noticed a high proportion of DNA originating from a Pseudomonas bacterium; therefore, add the P. fluorescens genome to the list of prefixes, to also map against that genome:

    Prefixes:
      EquCab20Chr1frag:
        Path: prefixes/equcab20chr1frag.fasta
      Pseudomonas:
        Path: prefixes/pfluorescens.fasta

Run the alignment again, in order to also map against the new Pseudomonas genome:

    nice bam_pipeline run makefile.yaml

Compare the coverage for this mapping (using either the *.summary or *.Pseudomonas.coverage tables) between the modern and ancient samples, and compare the mapDamage plots between the horse genome and the Pseudomonas genome for the ancient horse (ThistleCreek):

    Standardbred.Pseudomonas.coverage

    ThistleCreek.Pseudomonas.coverage

The DNA for the Thistle Creek sample is highly fragmented, so we expect that any fragment which cannot be collapsed is modern in origin; let us therefore exclude paired-end reads where both mates passed the quality filters but were not collapsed (i.e. did not overlap). This is done simply by repeating the Options section from the top of the file, but writing only the part that needs to be overwritten (namely the ExcludeReads section):

    ThistleCreek:
      Options:
        ExcludeReads:
          - Paired
      ThistleCreek:
        Library1:
          Lane1: reads/thistlecreek_r{Pair}.fastq.gz
          Lane2: reads/thistlecreek.fastq.gz

Remove the old ThistleCreek results and run the alignment again:

    rm -rv ThistleCreek*
    nice bam_pipeline run makefile.yaml

If you look at the *.summary or *.Pseudomonas.coverage files, you'll find that this has excluded ~90% of the modern DNA from this particular bacterium.

Finally, a problem with using BWA is that it assumes that the 5' region of reads (the first 32 bp by default) contains fewer mismatches than the rest of the read. This speeds up the alignment, but at the cost of losing some genuine alignments for ancient DNA, as damage tends to be localized to the 5' region. To disable the use of the seed region, override the UseSeed option for that project.

    ThistleCreek:
      Options:
        Aligners:
          BWA:
            UseSeed: no
        ExcludeReads:
          - Paired
      ThistleCreek:
        Library1:
          Lane1: reads/thistlecreek_r{Pair}.fastq.gz
          Lane2: reads/thistlecreek.fastq.gz

First, make a note of the current number of hits in ThistleCreek.EquCab20Chr1frag.coverage, and then remove the old results and run the alignment again:

    rm -rv ThistleCreek*
    nice bam_pipeline run makefile.yaml

Inspect the newly generated ThistleCreek.EquCab20Chr1frag.coverage file; you should see a gain of about 2.5%. That is not a lot, but every bit matters when only a couple of percent of the DNA sequenced belongs to the sample itself.

5 Set up and run the Phylogenetic pipeline

Create a project directory for the analyses:

    cd $MY_HOME/my_ancient_dna
    mkdir -p horses/phylogeny
    cd horses/phylogeny

Copy the data files (a BED file containing the coordinates of genes) and set up symbolic links to the previous analyses. We create one link to the folder containing the BAM files and one link to the folder containing the FASTA files; this will allow the pipeline to (semi-)automatically locate the files we used:

    mkdir data
    cd data
    cp -a /home/local/27626/exercises/paleomix/data/phylogeny/* .
    ln -s ../../alignments samples
    ln -s ../../alignments/prefixes prefixes

A copy of the final makefile (final_makefile.yaml) was included in the data you copied above, which may be used for comparison with your own. Now create an empty makefile:

    cd $MY_HOME/my_ancient_dna/horses/phylogeny
    phylo_pipeline mkfile > makefile.yaml

Open this makefile (makefile.yaml), update the project title (which determines the folder in which results are placed), and list the four samples we used before:

    Project:
      Title: my_horses
      Samples:
        Donkey:
          Gender: Male
        Standardbred:
          Gender: Male
        Przewalski:
          Gender: Male
        ThistleCreek:
          Gender: Male
          GenotypingMethod: Random Sampling

Due to the low coverage (1x), it is not possible to genotype the ThistleCreek sample in the normal way (here, using SAMtools), so instead we will simply sample a random base at each site.

Next, update the RegionsOfInterest section just below the Samples section, to specify which parts of the genome we want to genotype:

    RegionsOfInterest:
      ProteinCodingGenes:
        Prefix: EquCab20Chr1frag
        Realigned: yes
        ProteinCoding: yes
        IncludeIndels: yes
        HomozygousContigs:
          Female:
            - chrm
          Male:
            - chrx
            - chry
            - chrm

The name ProteinCodingGenes together with the Prefix (EquCab20Chr1frag) tells the pipeline to look for a BED file containing the coordinates of the regions we are interested in at ./data/regions/equcab20chr1frag.proteincodinggenes.bed (by default).

Finally, we need to specify that we wish to build a phylogeny using this set of genes; find the PhylogeneticInference section, replace PHYLOGENY_NAME with my_phylogeny, replace REGIONS_NAME with ProteinCodingGenes (which we specified above), and add Donkey to RootTreesOn (see comments in makefile or below):

    PhylogeneticInference:
      my_phylogeny:
        RootTreesOn:
          - Donkey
        PerGeneTrees: no
        RegionsOfInterest:
          ProteinCodingGenes:
            Partitions: "111"
        ExaML:
          Replicates: 1
          Bootstraps: 100
          Model: GAMMA

Run the genotyping, multiple sequence alignment, and phylogeny:

    nice phylo_pipeline genotype+msa+phylogeny makefile.yaml

Visualize the resulting phylogeny using Newick utils:

    cd results/my_horses/phylogenies/my_phylogeny/
    nw_display replicates.support.newick

Notice the extreme length of the branch leading to the ThistleCreek sample; this is a result of the higher error rate caused not only by the post-mortem damage, but also by the fact that we randomly sampled bases at each site.

References

[1] Aurélien Ginolhac, Morten Rasmussen, M. Thomas P. Gilbert, Eske Willerslev, and Ludovic Orlando. mapDamage: testing for damage patterns in ancient DNA sequences. Bioinformatics, 27(15), Aug.

[2] Hákon Jónsson, Aurélien Ginolhac, Mikkel Schubert, Philip L. F. Johnson, and Ludovic Orlando. mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics, 29(13), Jul.

[3] Stinus Lindgreen. AdapterRemoval: easy cleaning of next-generation sequencing reads. BMC Res Notes, 5:337.

[4] Mikkel Schubert, Luca Ermini, Clio Der Sarkissian, Hákon Jónsson, Aurélien Ginolhac, Robert Schaefer, Michael D. Martin, Ruth Fernández, Martin Kircher, Molly McCue, Eske Willerslev, and Ludovic Orlando. Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX. Nat Protoc, 9(5), May.
