WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder

1 WM2 Bioinformatics ExomeSeq data analysis part 1 Dietmar Rieder

2 RAW data
Use PuTTY to log on to cluster.i-med.ac.at
In your home directory, make a directory to store the raw data:
$ mkdir 00_RAW
Copy the raw fastq files into 00_RAW:
$ cp /export/data/wm2/raw/hmads-1_zz.fastq 00_RAW
$ cp /export/data/wm2/raw/jurkat-1_zz.fastq 00_RAW
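A quick sanity check after copying is to count the reads in each file. The sketch below assumes the usual FASTQ layout of exactly four lines per read (no wrapped sequence lines); `count_fastq_reads` is a hypothetical helper name and the demo file contains made-up reads.

```shell
# Count reads in a FASTQ file, assuming 4 lines per record.
# count_fastq_reads is a hypothetical helper, not part of any tool.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Tiny demo file with two made-up reads
demo_fastq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n' > "$demo_fastq"

count_fastq_reads "$demo_fastq"
rm -f "$demo_fastq"
```

On the course data this would be called as `count_fastq_reads 00_RAW/jurkat-1_zz.fastq`.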

3 Quality assessment
Analyze sequencing read quality using fastqc
Copy adapters.txt to 00_RAW:
$ cp /export/data/wm2/adapters/adapters.txt 00_RAW
Change to 00_RAW:
$ cd 00_RAW
fastqc interactive mode (select the fastq file via the File menu):
$ fastqc -a adapters.txt
fastqc command line mode:
$ fastqc -a adapters.txt <fastq filename>
Inspect and discuss the results!

4 Clean read artifacts (cutadapt)
Read length
Adapters
Low quality
Filter short reads
$ fastx_trimmer -h
$ cutadapt -h
$ fastqc

5 Clean read artifacts (cutadapt)
Read length: 200
Adapters: 3' adapter ATCACCGACTGCCCATAGAGAGGCTGAGAC
Low quality: < Q15
Filter short reads: < 25
(Attention: use single line per command)
$ cd; mkdir 01_Clean; cd 01_Clean
$ fastx_trimmer -l 200 -i ../00_RAW/jurkat.fastq -o jurkat_l200.fastq
$ fastqc
$ cutadapt -a ATCACCGACTGCCCATAGAGAGGCTGAGAC -q 15 -m 25 -o jurkat_l_25_l200_cutadapt.fastq jurkat_l200.fastq > jurkat_l_25_l200_cutadapt.out
$ fastqc
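To get a feeling for the Q15 cutoff, it helps to translate a Phred score into an error probability, P = 10^(-Q/10). The following is a minimal sketch of that arithmetic only; `phred_to_error_prob` is a hypothetical helper name, and it does not reproduce how cutadapt decides where to trim.

```shell
# Convert a Phred quality score Q into the base-call error probability
# P = 10^(-Q/10). phred_to_error_prob is a hypothetical helper name.
phred_to_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", exp(-q / 10 * log(10)) }'
}

phred_to_error_prob 15   # the threshold passed to cutadapt -q
phred_to_error_prob 30
```

Q15 therefore corresponds to roughly a 3% chance that a base call is wrong, and Q30 to 0.1%.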

6 Align clean reads to the reference genome (hg19)
Aligner: BWA (command: bwa mem)
Reads: jurkat_l_25_l200_cutadapt.fastq
Reference: /export/data/wm2/genome/hg19/hg19
Output (SAM file): ~/02_MAPPED/jurkat.sam
Read group option (-R): '@RG\tID:Jurkat\tSM:Tumor\tPL:IONTORRENT'
Use multiple CPUs (-t): -t 8
(Attention: use single line per command)
$ cd; mkdir 02_MAPPED; cd 02_MAPPED
$ bwa mem -t 8 -R '@RG\tID:Jurkat\tSM:Tumor\tPL:IONTORRENT' /export/data/wm2/genome/hg19/hg19 ../01_Clean/jurkat_l_25_l200_cutadapt.fastq > jurkat.sam
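Because the same alignment has to be repeated for the normal sample, it can be convenient to generate the command line from a few parameters. The sketch below only prints the commands (a dry run) and does not call bwa. The helper name `bwa_mem_cmd`, the read-group values for hmads, and the hmads input file name are assumptions made for illustration; only the jurkat values come from the slide.

```shell
# Print (not run) the bwa mem command for one sample.
# bwa_mem_cmd is a hypothetical helper; the hmads read-group values
# and hmads file name used below are assumptions, not from the slides.
REF=/export/data/wm2/genome/hg19/hg19

bwa_mem_cmd() {
    # $1 = file prefix, $2 = read-group ID, $3 = read-group SM value
    # "\\t" keeps a literal backslash-t, which bwa itself expands to a tab
    rg="@RG\\tID:$2\\tSM:$3\\tPL:IONTORRENT"
    printf "bwa mem -t 8 -R '%s' %s ../01_Clean/%s_l_25_l200_cutadapt.fastq > %s.sam\n" \
        "$rg" "$REF" "$1" "$1"
}

bwa_mem_cmd jurkat Jurkat Tumor
bwa_mem_cmd hmads hMADS Normal   # assumed values for the normal sample
```

Printing the command first makes it easy to check the quoting of the read-group string before starting a long-running alignment.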

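7 Make BAM file from SAM
Tool: samtools
SAM file: jurkat.sam
BAM file: jurkat.bam
$ samtools view -b jurkat.sam > jurkat.bam
Sort BAM by coordinate: samtools sort
$ samtools sort jurkat.bam jurkat_sorted
(This is the legacy syntax, where the second argument is an output prefix; samtools 1.3 and later expect: samtools sort -o jurkat_sorted.bam jurkat.bam)
Index BAM: samtools index
$ samtools index jurkat_sorted.bam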

8 Get alignment statistics
Tool: samtools
BAM file: jurkat_sorted.bam
Count all reads:
$ samtools view -c jurkat_sorted.bam
Count aligned reads:
$ samtools view -c -F 0x4 jurkat_sorted.bam
Count uniquely aligned reads:
$ samtools view -c -q 1 jurkat_sorted.bam
Calculate the alignment rate!
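The alignment rate is simply the aligned count divided by the total count. A minimal sketch of that calculation follows; `alignment_rate` is a hypothetical helper name and the counts in the demo call are made up.

```shell
# Alignment rate in percent from two read counts.
# alignment_rate is a hypothetical helper name.
alignment_rate() {
    # $1 = aligned read count, $2 = total read count
    awk -v a="$1" -v t="$2" 'BEGIN {
        if (t == 0) { print "NA"; exit }   # avoid division by zero
        printf "%.2f\n", 100 * a / t
    }'
}

# Made-up counts for illustration
alignment_rate 950000 1000000

# With the real data the counts would come from samtools, e.g.:
# alignment_rate "$(samtools view -c -F 0x4 jurkat_sorted.bam)" "$(samtools view -c jurkat_sorted.bam)"
```

Note that `samtools view -c` counts alignment records, so supplementary alignments written by bwa mem are included in both numbers.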

9 Plot target coverage
Tools: bedtools, R
BAM files: jurkat_sorted.bam, hmads_sorted.bam
Generate coverage histogram:
(Attention: use single line per command)
$ bedtools coverage -hist -abam hmads_sorted.bam -b /export/data/wm2/amplicon-regions/panel.bed | grep ^all > hmads_align.hist.txt
$ bedtools coverage -hist -abam jurkat_sorted.bam -b /export/data/wm2/amplicon-regions/panel.bed | grep ^all > Jurkat_align.hist.txt
Use Rscript to create plot:
$ Rscript /export/data/wm2/r/plotcoverage.r
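Besides plotting, the histogram files can be summarized numerically. The sketch below assumes the "all" lines have the columns: label, depth, number of bases at that depth, total target size, and fraction of bases. `coverage_summary` is a hypothetical helper, the 20x threshold is only an example, and the demo histogram is made up.

```shell
# Summarize the "all" lines of a bedtools coverage -hist output.
# Assumed columns: all, depth, bases at depth, total bases, fraction.
# coverage_summary is a hypothetical helper name.
coverage_summary() {
    # $1 = histogram file, $2 = minimum depth of interest
    awk -v min="$2" '
        $1 == "all" {
            mean += $2 * $5              # depth weighted by fraction of bases
            if ($2 >= min) frac += $5    # fraction of bases at or above min
        }
        END { printf "mean_depth=%.1f frac_ge_%dx=%.2f\n", mean, min, frac }
    ' "$1"
}

# Made-up histogram: 10% of bases at 0x, 40% at 10x, 50% at 30x
demo_hist=$(mktemp)
printf 'all\t0\t100\t1000\t0.1\nall\t10\t400\t1000\t0.4\nall\t30\t500\t1000\t0.5\n' > "$demo_hist"
coverage_summary "$demo_hist" 20
rm -f "$demo_hist"
```

For the course data the call would be `coverage_summary hmads_align.hist.txt 20`.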

10 Recalibrate Base Quality Scores
Generate a base recalibration table to compensate for systematic errors in base-calling confidences
Tool: GenomeAnalysisTK (GATK) BaseRecalibrator
BAM file: jurkat_sorted.bam
Regions: /export/data/wm2/amplicon-regions/panel.bed
SNPdb: /export/data/wm2/dbsnp/hg19/common_all_ vcf
Recalibration table: jurkat_recal_data.table
Reference sequence: /export/data/wm2/genome/hg19/hg19.fa

11 Recalibrate Base Quality Scores
Analyze patterns of covariation in the sequence dataset
Run command:
$ java -jar /usr/local/bioinf/gatk/gatk/genomeanalysistk.jar \
    -T BaseRecalibrator \
    -R /export/data/wm2/genome/hg19/hg19.fa \
    -I jurkat_sorted.bam \
    -L /export/data/wm2/amplicon-regions/panel.bed \
    -knownSites /export/data/wm2/dbsnp/hg19/common_all_ vcf \
    -o jurkat_recal_data.table
Expected Result
This creates jurkat_recal_data.table. This file contains the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

12 Recalibrate Base Quality Scores
Do a second pass to analyze covariation remaining after recalibration
Run command:
$ java -jar /usr/local/bioinf/gatk/gatk/genomeanalysistk.jar \
    -T BaseRecalibrator \
    -R /export/data/wm2/genome/hg19/hg19.fa \
    -I jurkat_sorted.bam \
    -L /export/data/wm2/amplicon-regions/panel.bed \
    -knownSites /export/data/wm2/dbsnp/hg19/common_all_ vcf \
    -BQSR jurkat_recal_data.table \
    -o jurkat_post_recal_data.table
Expected Result
This creates another GATKReport file, jurkat_post_recal_data.table, which we will use in the next step to generate plots. Note the use of the -BQSR flag, which tells the GATK engine to perform on-the-fly recalibration based on the first recalibration data table.

13 Recalibrate Base Quality Scores
Generate before/after plots
Run command:
$ java -jar /usr/local/bioinf/gatk/gatk/genomeanalysistk.jar \
    -T AnalyzeCovariates \
    -R /export/data/wm2/genome/hg19/hg19.fa \
    -L /export/data/wm2/amplicon-regions/panel.bed \
    -before jurkat_recal_data.table \
    -after jurkat_post_recal_data.table \
    -o jurkat_recalibration_plots.pdf
Expected Result
This generates a document called jurkat_recalibration_plots.pdf containing plots that show how the reported base qualities match up to the empirical qualities calculated by the BaseRecalibrator. Comparing the before and after plots allows you to check the effect of the base recalibration process before you actually apply the recalibration to your sequence data.
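The idea of an "empirical quality" can be illustrated with a small calculation: if bases in some group mismatch the reference at non-variant sites at a certain rate, that rate can be converted back to a Phred score with Q = -10 * log10(mismatches / observations). This is a simplified sketch of the concept only and is not claimed to be GATK's exact estimator; `empirical_quality` is a hypothetical helper and the counts are made up.

```shell
# Empirical Phred quality from observed mismatch counts:
# Q = -10 * log10(mismatches / observations).
# Simplified illustration of the concept, not GATK's exact method.
empirical_quality() {
    # $1 = mismatching bases, $2 = total bases observed in the group
    awk -v e="$1" -v n="$2" 'BEGIN {
        if (e <= 0 || n <= 0) { print "NA"; exit }   # undefined for zero counts
        printf "%.1f\n", -10 * log(e / n) / log(10)
    }'
}

# Made-up example: bases reported as Q30 that mismatch 1 time in 100
empirical_quality 100 10000
```

In this made-up case the bases were reported as Q30 but behave like Q20, which is the kind of discrepancy the before/after plots are meant to reveal.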

14 Recalibrate Base Quality Scores
Apply the recalibration to your sequence data
Run command:
$ java -jar /usr/local/bioinf/gatk/gatk/genomeanalysistk.jar \
    -T PrintReads \
    -R /export/data/wm2/genome/hg19/hg19.fa \
    -I jurkat_sorted.bam \
    -L /export/data/wm2/amplicon-regions/panel.bed \
    -BQSR jurkat_recal_data.table \
    -o jurkat_recal_sorted.bam
Expected Result
This creates a file called jurkat_recal_sorted.bam containing all the original reads, but now with recalibrated base substitution, insertion and deletion quality scores.

15 Call somatic SNPs and indels
MuTect2 is a somatic SNP and indel caller
Tool: GenomeAnalysisTK (GATK) MuTect2
BAM files:
  Tumor: jurkat_recal_sorted.bam
  Normal: hmads_recal_sorted.bam (obtained by running the previous steps on the hmads sample)
Regions: /export/data/wm2/amplicon-regions/panel.bed
SNPdb: /export/data/wm2/cosmic/hg19/common_all_ vcf
COSMICdb:
  /export/data/wm2/cosmic/hg19/v77/cosmiccodingmuts.hg19.v77.vcf
  /export/data/wm2/cosmic/hg19/v77/cosmicnoncodingmuts.hg19.v77.vcf
Reference sequence: /export/data/wm2/genome/hg19/hg19.fa

16 Call somatic SNPs and indels
Run command:
$ java -jar /usr/local/bioinf/gatk/gatk/genomeanalysistk.jar \
    -nct 16 -T MuTect2 \
    -R /export/data/wm2/genome/hg19/hg19.fa \
    -I:tumor jurkat_recal_sorted.bam \
    -I:normal hmads_recal_sorted.bam \
    --dbsnp /export/data/wm2/cosmic/hg19/common_all_ vcf \
    --cosmic \
    /export/data/wm2/cosmic/hg19/v77/cosmiccodingmuts.hg19.v77.vcf \
    --cosmic \
    /export/data/wm2/cosmic/hg19/v77/cosmicnoncodingmuts.hg19.v77.vcf \
    -L /export/data/wm2/amplicon-regions/panel.bed \
    -o jurkat_mutations.vcf
Expected Result
This creates a file called jurkat_mutations.vcf containing all called SNPs and indels.
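A first look at the resulting VCF can be taken with a few lines of awk. The sketch below counts calls whose FILTER column is PASS and splits them into SNVs and indels by REF/ALT length. It assumes one ALT allele per line (a comma-separated ALT would be counted as an indel); `count_pass_variants` is a hypothetical helper, and the demo VCF, including the filter name `demo_filter`, is made up.

```shell
# Count PASS calls in a VCF, split into SNVs and indels by allele length.
# Assumes a single ALT allele per line. count_pass_variants is a
# hypothetical helper name.
count_pass_variants() {
    awk -F '\t' '
        /^#/ { next }                                   # skip header lines
        $7 != "PASS" { next }                           # keep unfiltered calls only
        length($4) == 1 && length($5) == 1 { snv++; next }
        { indel++ }
        END { printf "snv=%d indel=%d\n", snv, indel }
    ' "$1"
}

# Made-up VCF with one passing SNV, one passing deletion, one filtered call
demo_vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tT\t.\tPASS\t.\n'
    printf 'chr1\t200\t.\tAC\tA\t.\tPASS\t.\n'
    printf 'chr2\t300\t.\tG\tC\t.\tdemo_filter\t.\n'
} > "$demo_vcf"
count_pass_variants "$demo_vcf"
rm -f "$demo_vcf"
```

For the course data the call would be `count_pass_variants jurkat_mutations.vcf`.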

17 Visualize results on IGV
Integrative Genomics Viewer (IGV)
High-performance visualization tool
Enables interactive exploration of large, integrated genomic datasets
It supports a wide variety of data types:
  array-based data
  next-generation sequence data
  genomic annotations