Exome sequencing. Jong Kyoung Kim

1 Exome sequencing Jong Kyoung Kim

2 Genome Analysis Toolkit The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic variant calling tools, and to tackle copy number (CNV) and structural variation (SV). These tools were primarily designed to process exomes and whole genomes generated with Illumina sequencing technology, but they can be adapted to handle a variety of other technologies and experimental designs. Although it was originally developed for human genetics, the GATK has since evolved to handle genome data from any organism, with any level of ploidy.

3 GATK Best Practices

4 GATK Best Practices

5 File formats

6 Pre-processing

7 Overview

8 Overview When you receive sequence data from your sequencing provider, the data is typically in a raw state (one or several FASTQ files) that is not immediately usable for variant discovery analysis. Even if you receive a BAM file (i.e. a file in which the reads have been aligned to a reference genome) you still need to apply some data processing in order to maximize the technical correctness of the data. This first phase of the GATK workflow describes the pre-processing steps that are necessary in order to prepare your data for analysis, starting with FASTQ files and ending in an analysis-ready BAM file. We begin by mapping the sequence reads to the reference genome to produce a file in SAM/BAM format sorted by coordinate. Next, we mark duplicates to mitigate biases introduced by data generation steps such as PCR amplification. Finally, we recalibrate the base quality scores, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read.

9 Raw reads
sample="srr " Illumina"
REF="/home/jkkim/reference/ebola/1976.fa"
fastq-dump --split-files $sample

10 Map to reference
bwa mem -t 4 -R $RG $REF ${sample}_1.fastq ${sample}_2.fastq | samtools sort > ${sample}.sorted.bam
samtools index ${sample}.sorted.bam
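The `$RG` variable passed to `bwa mem -R` holds a read-group header line, which GATK tools rely on to tell samples apart. The following is a minimal sketch of how such a string might be built. It assumes that the sample name doubles as both the read-group ID and the SM tag and that the platform is Illumina. `make_read_group` and the sample name `sampleA` are hypothetical and are not taken from the slides.

```shell
# Hypothetical helper: build a read-group header line for `bwa mem -R`.
# bwa itself converts the literal two-character sequence \t into a TAB,
# so the backslashes have to survive in the string.
make_read_group() {
    # Assumption: ID and SM are both set to the sample name.
    printf '@RG\\tID:%s\\tSM:%s\\tPL:ILLUMINA\n' "$1" "$1"
}

sample="sampleA"   # placeholder, not the accession used in the slides
RG=$(make_read_group "$sample")
printf '%s\n' "$RG"
```

When the string is handed to bwa it is safest to quote it, as in `-R "$RG"`, so that the shell passes it through as a single argument.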

11 Mark duplicates

12 Mark duplicates Once your data has been mapped to the reference genome, you can proceed to mark duplicates. The idea here is that during the sequencing process, the same DNA fragments may be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process does not remove the reads, but identifies them as duplicates by adding a flag in the read's SAM record. Most GATK tools will then ignore these duplicate reads by default, through the internal application of a read filter.
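The flag mentioned above is bit 0x400 (decimal 1024) of the SAM FLAG field, which the SAM specification reserves for PCR or optical duplicates. As a small illustration, the sketch below tests that bit on a few example FLAG values. `is_duplicate` is a hypothetical helper and the FLAG values are made up for the example.

```shell
# Hypothetical helper: succeed if the duplicate bit (0x400 = 1024)
# is set in the SAM FLAG value given as the first argument.
is_duplicate() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# 1123 = 99 + 1024 and 1171 = 147 + 1024, i.e. the same reads
# after duplicate marking has set the extra bit.
for flag in 99 1123 147 1171; do
    if is_duplicate "$flag"; then
        echo "$flag duplicate"
    else
        echo "$flag unique"
    fi
done
```

On a real BAM file, `samtools view -c -f 1024 file.bam` counts the reads that carry this bit.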

13 Mark duplicates
java -jar /home/jkkim/bin/picard-2.9.2/picard.jar MarkDuplicates \
    INPUT=${sample}.sorted.bam \
    OUTPUT=${sample}.sorted.markduplicated.bam \
    METRICS_FILE=${sample}.markduplicatedmetrics.txt \
    CREATE_INDEX=true TMP_DIR=/tmp

14 Recalibrate bases Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic technical error, leading to over- or underestimated base quality scores in the data. Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants (which you can bootstrap if there is none available for your organism), then it adjusts the base quality scores in the data based on the model. Note that this base recalibration process should not be confused with variant recalibration, which is a sophisticated filtering technique applied on the variant callset produced in a later step.
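To make the idea of an "empirical" quality concrete: a Phred score Q corresponds to an error probability p through Q = -10 log10(p). The sketch below computes the quality implied by an observed mismatch count. It is a deliberately simplified illustration: the actual BaseRecalibrator groups bases by covariates such as read group, reported quality, machine cycle and sequence context, and uses a more careful estimate. `phred` is a hypothetical helper and the counts are invented.

```shell
# Hypothetical helper: Phred-scaled quality implied by observing
# `errors` mismatches among `total` bases (Q = -10 * log10(errors/total)).
phred() {
    awk -v e="$1" -v n="$2" 'BEGIN { printf "%.1f\n", -10 * log(e / n) / log(10) }'
}

# Suppose the machine reported Q30 (1 error expected per 1000 bases) ...
phred 1 1000
# ... but 10 mismatches per 1000 bases were observed away from known variant sites.
phred 10 1000
```

In this made-up case the reported Q30 would be an overestimate, and recalibration would pull such bases toward Q20.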

15 Recalibrate bases

16 Recalibrate bases: Create a sequence dictionary for a reference sequence.
java -jar /home/jkkim/bin/picard /picard.jar CreateSequenceDictionary R=$REF O=/home/jkkim/reference/ebola/1976.dict

17 Recalibrate bases
Four steps:
1. Analyze patterns of covariation in the sequence dataset
2. Do a second pass to analyze covariation remaining after recalibration
3. Generate before/after plots
4. Apply the recalibration to your sequence data

18 Recalibrate bases Analyze patterns of covariation in the sequence dataset: This creates a GATKReport file called recal_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data. It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities.

19 Recalibrate bases
If we do not have a set of known sites:
1. First do an initial round of SNP calling on your original, unrecalibrated data.
2. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
3. Finally, do a real round of SNP calling with the recalibrated data.
These steps could be repeated several times until convergence.
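One simple way to "take the SNPs that you have the highest confidence in" is to keep only records above a QUAL cutoff. The sketch below does this with awk on VCF text. `select_confident`, `toy_vcf` and the cutoff of 500 are all assumptions chosen for illustration. The slides themselves use the hard-filtered SNP and indel call sets as known sites.

```shell
# Hypothetical helper: keep VCF header lines plus records whose QUAL
# (column 6) is at least the cutoff given as the first argument.
select_confident() {
    awk -F'\t' -v min="$1" '/^#/ || ($6 + 0) >= min'
}

# Tiny made-up VCF used only to demonstrate the filter.
toy_vcf() {
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t950.5\t.\t.\n'
    printf 'chr1\t200\t.\tC\tT\t35.2\t.\t.\n'
    printf 'chr1\t300\t.\tG\tA\t1200.0\t.\t.\n'
}

# Keeps the header and the records at positions 100 and 300.
toy_vcf | select_confident 500
```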

20 Recalibrate bases: Do a second pass to analyze covariation remaining after recalibration. This creates another GATKReport file, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.

21 Recalibrate bases: Generate before/after plots. This generates a document called recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data.

22 Recalibrate bases: Apply the recalibration to your sequence data. This creates a file called recal_reads.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down.

23 Variant discovery

24 Overview

25 Overview You are ready to undertake the variant discovery process, i.e. identify the sites where your data displays variation relative to the reference genome, and calculate genotypes for each sample at that site. Unfortunately some of the variation you might observe is caused by mapping and sequencing artifacts, so the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). We have found that it is very difficult to reconcile these objectives in a single step, so instead we decompose the variant discovery process into two separate steps: variant calling and variant filtering. The first step is designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.
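The trade-off can be made concrete with a few counts. Because true negatives are hard to enumerate for variant calls, the sketch below reports precision (the fraction of calls that are real) as a stand-in for specificity. That substitution is an assumption of this example, and `callset_metrics` and the counts are invented for illustration.

```shell
# Hypothetical helper: sensitivity = TP / (TP + FN), precision = TP / (TP + FP).
# Arguments: true positives, false positives, false negatives.
callset_metrics() {
    awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
        printf "sensitivity=%.2f precision=%.2f\n", tp / (tp + fn), tp / (tp + fp)
    }'
}

# A lenient raw call set: few real variants missed, many artifacts kept.
callset_metrics 90 30 10
# The same data after filtering: some real variants lost, far fewer artifacts.
callset_metrics 80 5 20
```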

26 Call variants The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other.

27 Call variants
java -jar /opt/genomics/tools/gatk-3.7/GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R $REF -I ${sample}.sorted.markduplicated.bam \
    --genotyping_mode DISCOVERY \
    -stand_call_conf 30 \
    -ploidy 1 \
    -o ${sample}.raw.vcf

28 Filter variants The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it does mean that we need to filter the raw callset they produce in order to reduce the number of false positives, which can be quite large. The best way to filter the raw variant callset is to use variant quality score recalibration (VQSR), which uses machine learning to identify annotation profiles of variants that are likely to be real, and assigns a VQSLOD score to each variant that is much more reliable than the QUAL score calculated by the caller. In the first step of this two-step process, the program builds a model based on training variants, then applies that model to the data to assign a well-calibrated probability to each variant call. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.

29 Filter variants The downside of how variant recalibration works is that the algorithm requires high-quality sets of known variants to use as training and truth resources, which for many organisms are not yet available. It also requires quite a lot of data in order to learn the profiles of good vs. bad variants, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, on RNA-seq, and on non-model organisms. If for any of these reasons you find that you cannot perform variant recalibration on your data (after having tried the workarounds that we recommend, where applicable), you will need to use hard-filtering instead. This consists of setting flat thresholds for specific annotations and applying them to all variants equally.

30 Apply hard filters
Apply hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.
Steps:
1. Extract the SNPs from the call set
2. Determine parameters for filtering SNPs
3. Apply the filter to the SNP call set
4. Extract the Indels from the call set
5. Determine parameters for filtering indels
6. Apply the filter to the Indel call set
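SNPs and indels are filtered separately because they are different kinds of records. As a rough illustration of the distinction, the sketch below classifies VCF records by comparing the lengths of the REF and ALT alleles. It is a simplification that sets multi-allelic records aside as OTHER, and it is not how SelectVariants is implemented. `variant_type` and the example records are hypothetical.

```shell
# Hypothetical helper: label each non-header VCF record read from stdin.
# Column 4 is REF, column 5 is ALT.
variant_type() {
    awk -F'\t' '!/^#/ {
        if ($5 ~ /,/)                                print "OTHER"   # multi-allelic, not handled here
        else if (length($4) == 1 && length($5) == 1) print "SNP"     # single base substituted
        else if (length($4) != length($5))           print "INDEL"   # bases inserted or deleted
        else                                         print "OTHER"   # e.g. multi-nucleotide change
    }'
}

# Made-up records: substitution, deletion, insertion, multi-allelic site.
printf 'chr1\t100\t.\tA\tG\nchr1\t200\t.\tAT\tA\nchr1\t300\t.\tC\tCGG\nchr1\t400\t.\tA\tG,T\n' |
    variant_type
```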

31 Extract the SNPs from the call set
java -jar /opt/genomics/tools/gatk-3.7/GenomeAnalysisTK.jar -T SelectVariants \
    -R $REF -V ${sample}.raw.vcf \
    -selectType SNP -o ${sample}.raw.snp.vcf

32 SRR raw.snp.vcf

33 Determine parameters for filtering SNPs SNPs matching any of these conditions will be considered bad and filtered out, i.e. marked FILTER in the output VCF file. The program will specify which parameter was chiefly responsible for the exclusion of the SNP using the culprit annotation. SNPs that do not match any of these conditions will be considered good and marked PASS in the output VCF file.
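The "any of these conditions" rule is a logical OR across the individual annotation tests. The sketch below applies that rule to three annotations only (QD, FS and MQ), read as whitespace-separated values. The cutoffs are assumptions taken from GATK's generic hard-filtering recommendations for SNPs, and `classify_snp` and the input values are hypothetical.

```shell
# Hypothetical helper: read "QD FS MQ" triples, one per line, and print
# SNP_FILTER if ANY condition matches, PASS otherwise.
classify_snp() {
    awk '{
        # Assumed cutoffs: QD < 2.0, FS > 60.0, MQ < 40.0
        if ($1 < 2.0 || $2 > 60.0 || $3 < 40.0) print "SNP_FILTER"
        else                                    print "PASS"
    }'
}

# Three made-up SNPs: clean, low QD, high strand bias (FS).
printf '%s\n' "25.3 1.2 60.0" "1.5 0.0 60.0" "30.0 75.4 59.8" | classify_snp
```

A single failing annotation is enough to filter a record, which is why the second and third SNPs are rejected even though their other values look fine.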

34 Apply the filter to the SNP call set
java -jar /opt/genomics/tools/gatk-3.7/GenomeAnalysisTK.jar -T VariantFiltration \
    -R $REF -V ${sample}.raw.snp.vcf \
    --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    --filterName "SNP_FILTER" \
    -o ${sample}.filtered.snp.vcf

35 SRR filtered.snp.vcf

36 Extract the Indels from the call set
java -jar /opt/genomics/tools/gatk-3.7/GenomeAnalysisTK.jar -T SelectVariants \
    -R $REF -V ${sample}.raw.vcf \
    -selectType INDEL -o ${sample}.raw.indel.vcf

37 SRR raw.indel.vcf

38 Apply the filter to the Indel call set
java -jar /opt/genomics/tools/gatk-3.7/GenomeAnalysisTK.jar -T VariantFiltration \
    -R $REF -V ${sample}.raw.indel.vcf \
    --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
    --filterName "INDEL_FILTER" \
    -o ${sample}.filtered.indel.vcf

39 SRR filtered.indel.vcf

40 Callset refinement

41 Overview

42 Overview Once you have generated and filtered your callset according to our recommendations, you have several options for evaluating and refining the variant and genotype calls further, before moving on with your study. We perform some refinement steps on the genotype calls based on population frequencies and pedigree information if available, add functional annotations related to predicted biological effect, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow should be adapted to each project's requirements.
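One basic form of "comparing the callset to known resources" is to ask what fraction of the called sites already appear in a known-variant file. The sketch below does this by matching on chromosome and position only, ignoring alleles, which is a simplifying assumption. `known_fraction` and the two small input files are hypothetical.

```shell
# Hypothetical helper: print "<calls found in known set>/<total calls>".
# $1 = known-sites VCF, $2 = callset VCF; header lines are skipped.
known_fraction() {
    awk -F'\t' '
        /^#/      { next }
        NR == FNR { known[$1 ":" $2] = 1; next }          # first file: remember known sites
                  { total++; if (($1 ":" $2) in known) hit++ }
        END       { if (total > 0) printf "%d/%d\n", hit, total }
    ' "$1" "$2"
}

# Made-up inputs: two known sites, four called sites.
known=$(mktemp); calls=$(mktemp)
printf 'chr1\t100\nchr1\t300\n' > "$known"
printf 'chr1\t100\nchr1\t200\nchr1\t300\nchr1\t400\n' > "$calls"
known_fraction "$known" "$calls"
rm -f "$known" "$calls"
```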

43 Recalibrate bases: Analyze patterns of covariation in the sequence dataset
java -jar /opt/genomics/tools/gatk-3.7/GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R $REF -I ${sample}.sorted.markduplicated.bam \
    -knownSites ${sample}.filtered.snp.vcf \
    -knownSites ${sample}.filtered.indel.vcf \
    -o ${sample}.recal.table

44 Recalibrate bases: Do a second pass to analyze covariation remaining after recalibration
java -jar /opt/genomics/tools/gatk-3.7/GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R $REF -I ${sample}.sorted.markduplicated.bam \
    -knownSites ${sample}.filtered.snp.vcf \
    -knownSites ${sample}.filtered.indel.vcf \
    -BQSR ${sample}.recal.table \
    -o ${sample}.postrecal.table

45 Recalibrate bases: Generate before/after plots
java -jar /opt/genomics/tools/gatk-3.7/GenomeAnalysisTK.jar -T AnalyzeCovariates \
    -R $REF \
    -before ${sample}.recal.table \
    -after ${sample}.postrecal.table \
    -plots ${sample}.recalibration.pdf

46 SRR recalibration.pdf

47 Recalibrate bases: Apply the recalibration
java -jar /opt/genomics/tools/gatk-3.7/GenomeAnalysisTK.jar -T PrintReads \
    -R $REF -I ${sample}.sorted.markduplicated.bam \
    -BQSR ${sample}.recal.table \
    -o ${sample}.sorted.markduplicated.recalibrated.bam
