Bioinformatics and genome analysis


1 Bioinformatics and genome analysis. Academic year 2016/2017. Pierpaolo Maisano Delser, email: maisanop@tcd.ie

2 Background. Bachelor's degree (Laurea Triennale): Biological Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli; Master's degree (Laurea Specialistica): Biomolecular and Cellular Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli; PhD in Genetics, University of Leicester, Prof. Mark A. Jobling; Post-doctoral fellow, EPHE-MNHN, Paris, Dr. Stefano Mona; Research Fellow, Trinity College Dublin, Prof. Daniel Bradley. Cusco, March 2009

3 Muséum national d'histoire naturelle - Paris

4 Trinity College Dublin - Ireland

5 Practical information: theory + practice; software and tools; files; slides on the website; new topics / topics already covered.

6 Practical information. Working directory (fastq files): /home/bioinfo_file/ Reference files (reference, intervals for coverage, genome for IGV): /home/bioinfo_file/reference/ Remember the file paths! pwd: shows your current location

7 Programme. Next-generation sequencing (NGS): how, when, why? An example of NGS data management and analysis: type of data; files and formats; programs; interpretation of the results; error estimation; when to stop? Applications and/or projects on different organisms.

8 NGS: how, when, why? [Workflow diagram. Steps: sequencing; unaligned reads; QC; reads trimming; de-novo assembly with assembly QC, or mapping to a reference genome with mapping refinement and mapping QC; filtering; validation. Applications: DNA-seq (whole genome; capture: exome/custom/cancer; amplicon sequencing) and RNA-seq (whole transcriptome; amplicon sequencing: fixed/custom).]

9 NGS: how, when, why? Question: when? Answer: when it makes sense! Question: why? Answer: your idea for a project! A 400 bp amplicon in 100 individuals? Sanger sequencing. 50 amplicons in 100 individuals? NGS + target capture. Gene conversion, repeated elements, recombination breakpoints? NGS + Sanger sequencing.

10 An example of NGS data management and analysis. Sequencing platforms: Roche 454; Illumina MiSeq/HiSeq; Ion Torrent PGM/Proton; Pacific Biosciences (PacBio); Nanopore MinION/GridION.

11 An example of NGS data management and analysis. [Workflow diagram repeated from slide 8.]

12 An example of NGS data management and analysis. Project; project: application (whole genomes? exomes? target capture? amplicon sequencing?); project: application: aim (SNPs, indels, repeated elements, CNVs); project: application: aim: coverage (SNPs, indels, repeated elements, CNVs).

13 Project: Carcharodon carcharias - the great white shark;

14 Project: Carcharodon carcharias; Diploid organism; 82 chromosomes (41 pairs); Genome size ~5.2 Gb not fully sequenced yet; Target capture experiment; Paired-end reads (250bp).

15 Project: Target capture experiment; Meyerson M et al., 2010, NatRevGenetics

16 Single-end (SE) or paired-end (PE) sequencing.
fragment:             ========================================
fragment + adaptors:  ~~~========================================~~~
SE read:              ------->
PE reads:             R1 ----->   ... unknown gap ...   <----- R2

17 Workflow overview: from raw reads (.fastq) to variants (.vcf)
1. Fastq quality control + trimming: adapters? low-quality bases?
2. Alignment to a reference genome: close reference or limited time? bwa; distant reference? stampy. Output: aligned reads (.sam/.bam)
3. BAM refinement: local realignment (GATK); base recalibration (GATK); duplicate removal (picard)
4. BAM check and visualization: duplicate metrics (picard); flagstat (samtools); coverage distribution (GATK); visualization with samtools, IGV/tablet. Output: aligned reads (.sam/.bam), ready to use
5. Variant calling: SNPs/indels; single/multi-sample; samtools. Output: raw variants (.vcf)
6. Variant filtering and validation: variant score recalibration (needs known SNPs/indels and big datasets); vcftools; in silico vs in vitro validation. Output: variants (.vcf)

18 File formats
.fa/.fasta    sequences
.fastq        read data
.sam (.sai)   mapped reads
.bam (.bai)   mapped reads (binary)
.vcf          variant information

19 raw reads 1:N:0:1 AGTCAACAACGGGAACAAAATCCTGAAGGTCATGGTATGTGTANNNNTNTTNNNNNCCNNNNNNA TGTGTCNNNNNNNTNNNNNNTCTGAGTNNNNNNNCTCTCTTNNNNNNNAGTGGGTNNNNNNN GCATCCANNNAGCACGATTTTNNNNNNNTATTCAGGAGACAANNNNNNNGTGGGCANNNNNNN GTGTTGGNNNNNNNNNNNNNNGGAGAGANAAAAAANNNNNNNTGAAGTCNNNNNNNNNNN NAGCGNNANNNNNNNTCNNNNNNNNNNNNNNATCANNNNNNNNNNGGTG + 8ACCFGFGGGCDGGGGCFGGGGGGGFGGGGGFEFFGGGFFEGG####9#::#####:9######::CD@ FG#######:######,:99CF?#######::DBFDE#######4::DFG>#######+9A=D@F###88=+<FFF FGG#######++8@8;EEFG8>DG#######+6@DEFF#######*44D=,:##############*/**2:* #212/8C#######*.*2:/9############)-))##0#######,(##############0((,##########-((-

20 raw reads (.fastq) In the terminal: more cc_gn2_r1_trimmed.fastq OR head cc_gn2_r1_trimmed.fastq

22 raw reads (.fastq) Fields of the header line, in order. Before the space: instrument ID; run ID; flowcell ID; lane; tile; coordinates of the cluster (x and y). After the space, here 1:N:0:1: mate in the pair (1 = first mate, paired-end reads); is the read filtered? No (N) or Yes (Y); control included? 0 = No; index. The sequence line and the + separator follow; the last line holds the quality values for each nucleotide (same record as on slide 19).
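The header layout described above can be sketched in a few lines of Python. This is only an illustration: the header string is hypothetical (the real instrument, run and flowcell IDs are not fully visible in the slides), and the function name and field names are my own.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    mate: int        # 1 = first mate, 2 = second mate
    filtered: bool   # True when the flag is "Y"
    control: int     # 0 = no control bits set
    index: str


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Split an Illumina 1.8+ style FASTQ header into named fields."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ header must start with '@'")
    left, right = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    mate, filtered, control, index = right.split(":")
    return IlluminaHeader(instrument, run_id, flowcell, int(lane), int(tile),
                          int(x), int(y), int(mate), filtered == "Y",
                          int(control), index)


# Hypothetical header, made up for illustration only.
example = "@INSTR01:28:FLOWCELLX:1:1101:18215:1600 1:N:0:1"
header = parse_illumina_header(example)
print(header.lane, header.tile, header.mate, header.filtered)
```

Comparing the `mate` field of the two fastq files is one quick way to check which file holds the first and which the second read of each pair.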

23 raw reads (.fastq) Quality values for each nucleotide (last line of the record shown on slide 19). Characters from lowest to highest ASCII code: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Illumina 1.8+: Phred+33, raw reads typically (0, 41)

24 raw reads (.fastq) Illumina 1.8+: Phred+33, raw reads typically (0, 41). Phred-scale value: $Q = -10 \log_{10} P$, equivalently $P = 10^{-Q/10}$.
Phred quality score (Q) | Probability of incorrect base call (P) | Base call accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1,000 | 99.9%
40 | 1 in 10,000 | 99.99%
50 | 1 in 100,000 | 99.999%
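As a small worked example of the Phred scale, the sketch below converts between Q and P and decodes a quality string under the Phred+33 convention. The function names are illustrative; the short quality string is taken from the start of the record shown on slide 19 plus one `#`.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> List[int]:
    """Turn FASTQ quality characters into Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in qual]


scores = decode_quality_string("8ACCFG#")
print(scores)                   # [23, 32, 34, 34, 37, 38, 2]
print(phred_to_error_prob(20))  # 0.01, i.e. 1 error in 100 calls
```

The `#` character decodes to Q2, which is why long runs of `#` in the example record coincide with the stretches of N bases.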

25 raw reads (.fastq) Open cc_gn2_r2_trimmed.fastq. In the terminal: more cc_gn2_r2_trimmed.fastq OR head cc_gn2_r2_trimmed.fastq. Do cc_gn2_r1_trimmed.fastq and cc_gn2_r2_trimmed.fastq come from two different lanes? What is the difference between the two fastq files?

26 raw reads (.fastq) cc_gn2_r1_trimmed.fastq Same lane, different read mate in the 1:N:0:1 AGTCAACAACGGGAACAAAATCCTGAAGGTCATGGTATGTGTANNNNTNTTNNNNNCCNNNNNNA TGTGTCNNNNNNNTNNNNNNTCTGAGTNNNNNNNCTCTCTTNNNNNNNAGTGGGTNNNNNNN GCATCCANNNAGCACGATTTTNNNNNNNTATTCAGGAGACAANNNNNNNGTGGGCANNNNNNN GTGTTGGNNNNNNNNNNNNNNGGAGAGANAAAAAANNNNNNNTGAAGTCNNNNNNNNNNN NAGCGNNANNNNNNNTCNNNNNNNNNNNNNNATCANNNNNNNNNNGGTG 2:N:0:1 CCATTTCTNNNNNNNAGGACCTNNNNNNNAGCCCTNNNNNNNNNNNNNAGNATATGANNNNN NNTCTTATTNANCCANNNTCTAGNNNNNNNCTTTCCTNNNNNNNTCTCTGANNNNNNNNNNNN NNCCCTTCCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTNTCTCNTNNNNNNNN NNNNAAAATCCNNNNNNNNNNNNNNCCACTAANNNNNNNNNNNNNNAAGAAATAACACACNN NNNNNACAAAAANNNNNNNACAACACNNNNNNNGCATAAANNNA

27 Workflow overview (repeated from slide 17).

28 1- Fastq quality control + trimming. FastQC: quality control of the raw data coming out of the sequencer. Evaluation of the quality of the generated data; basic summary statistics of the raw data; several modules to evaluate different features (e.g. adapters, base quality, etc.). Feedback (green, orange, red): do not rely on it blindly, think about what it means!

29 1- Fastq quality control + trimming

30 1- Fastq quality control + trimming Per base sequence quality: warning

31 1- Fastq quality control + trimming. What can we do to improve the quality at the end of the reads? Read trimming: removal of the lower-quality 3' ends. [Figure: 3' ends with low quality scores; positions in bp.]
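The idea of 3' trimming can be illustrated with a deliberately naive rule: drop bases from the 3' end for as long as their quality is below a threshold. This is a sketch under that assumption, not the algorithm of any particular trimming tool (real trimmers typically use sliding windows or running sums); the threshold of Q20 and the function name are my own choices.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20,
                offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    end = len(seq)
    # Walk back from the 3' end while the base quality is too low.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "GGGGG###")
print(trimmed_seq, trimmed_qual)  # ACGTA GGGGG
```

Note that this rule leaves low-quality bases in the middle of a read untouched, which is one reason to re-run the quality report after trimming.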

32 1- Fastq quality control + trimming Per sequence quality score: pass

33 1- Fastq quality control + trimming Sequence length: pass

34 1- Fastq quality control + trimming Adapters removal Failed Warning

35 1- Fastq quality control + trimming Adapters removal Pass

36 1- Fastq quality control + trimming Overrepresented sequences Removal of overrepresented sequences (PCR primers).

37 FASTQC references: Software website: Manual:

39 2- Alignment to a reference genome. Alignment: the process of determining the most likely location within the genome for each observed DNA read. [Figure: raw reads placed on the reference genome.]

40 2- Alignment to a reference genome. The shorter the read, the harder it is to find its location in the genome. Large amounts of data: computationally challenging for memory and speed. Trade-off between speed and sensitivity: the higher the accuracy, the longer the alignment run. Two classes of methods. Burrows-Wheeler: fast; less robust at high divergence from the reference genome; e.g. bwa. Hashing: slow (needs more memory); robust at high divergence from the reference genome; e.g. stampy.
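To make the hashing idea concrete, here is a toy seed-and-vote sketch: every k-mer of the reference is stored in a hash table, and each k-mer of the read votes for the start position it implies. This is an illustration of the general principle only, not how stampy or any other aligner is actually implemented; the sequences, k = 4 and the function names are made up.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read: str, index: Dict[str, List[int]],
                        k: int) -> Dict[int, int]:
    """Count, for each implied read start, how many seeds support it."""
    votes: Dict[int, int] = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    return dict(votes)


reference = "GATTACAGATTTCA"   # toy reference
read = "ACAGATT"               # occurs at position 4 (0-based)
index = build_kmer_index(reference, k=4)
votes = candidate_positions(read, index, k=4)
best = max(votes, key=votes.get)
print(votes, best)
```

The seed GATT occurs twice in the toy reference, so it also casts a stray vote for a second location; with longer repeats such ties are what make a read ambiguous to place. The table of k-mer positions also grows with the genome, which fits the slide's remark that hashing needs more memory.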

41 2- Alignment to a reference genome. What if there are several possible places to align your sequencing read? This may be due to: repeated elements in the genome; low-complexity sequences; reference errors and gaps. MQ (mapping quality) is a Phred-scaled score of the quality of the alignment. High MQ: a single match. Low MQ: the probability of mapping to different locations is high, but there are no perfect multiple matches. MQ0: a perfect multiple match.

42 2- Alignment to a reference genome. This may be due to: repeated elements in the genome; low-complexity sequences; reference errors and gaps. [Figure labels: Reference sequence with Element 1 and Element 2; Sample_1. Reference sequence: 1 copy, 1 copy. Sample_1: 1 copy, 1 copy.]

43 2- Alignment to a reference genome. This may be due to: repeated elements in the genome; low-complexity sequences; reference errors and gaps. [Figure labels: Reference sequence with Element 1, Element 1 and Element 2; Sample_1. Perfect multiple matches: MQ0. Not a perfect match: low MQ.]

44 2- Alignment to a reference genome. This may be due to: repeated elements in the genome; low-complexity sequences; reference errors and gaps. [Figure labels: Reference sequence with Element 1, Element 1 and Element 2; Sample_1. Perfect multiple matches: MQ0. Not a perfect match: low MQ. Reference sequence: 2 copies, 1 copy. Sample_1: 1 copy, 1 copy.]

45 2- Alignment to a reference genome. This may be due to: repeated elements in the genome; low-complexity sequences; reference errors and gaps. [Figure labels: Reference sequence with Element 1 and Element 2; Sample_1. False heterozygous call; cluster of heterozygotes. Copy labels, in the order shown: 1 copy, 2 copies, 1 copy, 1 copy.]

46 2- Alignment to a reference genome This may be due to: - Repeated elements in the genome - Low complexity sequences - Reference errors and gaps AluSg7

47 2- Alignment to a reference genome This may be due to: - Repeated elements in the genome - Low complexity sequences - Reference errors and gaps

48 2- Alignment to a reference genome: mapping with bwa-mem. Three different algorithms: 1. BWA-backtrack: for Illumina reads up to 100 bp; 2. BWA-SW: long-read support, split alignment; 3. BWA-MEM: long-read support, split alignment, faster, more accurate. The fastq files are already trimmed and the adapters removed.

49 2- Alignment to a reference genome: mapping with bwa-mem paired-end alignment; it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file; Option to mark shorter split hits as secondary (not supplementary). Split read: Karacok E et al., 2012

50 2- Alignment to a reference genome: mapping with bwa-mem. Paired-end alignment; it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file; option to mark shorter split hits as secondary (not supplementary). Usage: bwa mem [options] [RefSeq] [fastq1] [fastq2] > cc_gn2_r12.sam To check the options, type: bwa mem

51 2- Alignment to a reference genome: mapping with bwa-mem. Paired-end alignment; it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file; option to mark shorter split hits as secondary (not supplementary). Command: bwa mem -M cc_ref.fa cc_gn2_r1_trimmed.fastq cc_gn2_r2_trimmed.fastq > cc_gn2_r12.sam where -M marks shorter split hits as secondary, cc_ref.fa is the reference genome, cc_gn2_r1_trimmed.fastq is fastq 1 and cc_gn2_r2_trimmed.fastq is fastq 2.

52 2- Alignment to a reference genome: from sam to bam. Convert sam to bam with samtools view, using: an option to declare that the input is a sam file; an option to write the output in bam format; an option to define the output file.

53 2- Alignment to a reference genome: from sam to bam. Command: samtools view -Sb cc_gn2_r12.sam -o cc_gn2_r12.bam where -S declares that the input is a sam file, -b requests output in bam format, cc_gn2_r12.sam is the input file (sam) and -o cc_gn2_r12.bam defines the output file (bam).

55 aligned reads (.sam/.bam) SAM/BAM format. SAM: Sequence Alignment/Map; BAM: Binary Alignment/Map. Standard formats for alignments. BAM is the binary version of SAM: it is smaller and easier to store and access, but it is not human-readable.

56 aligned reads (.sam/.bam) more cc_gn2_r12.bam

57 aligned reads (.sam/.bam) BAM format

58 aligned reads (.sam/.bam) more cc_gn2_r12.sam

59 aligned reads (.sam/.bam) SAM format

60 aligned reads (.sam/.bam) SAM: Sequence Alignment/Map; BAM: Binary Alignment/Map. They consist of two parts: 1. Header: contains information about the sample. 2. Alignment: contains location and qualities for all the reads. You can find a detailed explanation in the SAM/BAM format specification.

61 aligned reads (.sam/.bam) SAM format Header Alignment

62 aligned reads (.sam/.bam) SAM: Sequence Alignment/Map; BAM: Binary Alignment/Map. They consist of two parts: 1. Header: contains information about the sample. Header record types: @HD: header line; @SQ: reference sequence dictionary, one line per chromosome, with SN (reference sequence name) and LN (reference sequence length); @RG: read group; @PG: program, with its ID; @CO: comment.

63 aligned reads (.sam/.bam) They consist of two parts: SAM sequence alignment map BAM binary alignment map 1. Header: contains information about the sample Reference Sequence Name Reference Sequence Length

64 aligned reads (.sam/.bam) SAM format Header Alignment

65 aligned reads (.sam/.bam) SAM: Sequence Alignment/Map; BAM: Binary Alignment/Map. They consist of two parts: 2. Alignment: contains location and qualities for all the reads. The alignment section contains one line per read, and each line contains 11 mandatory columns, optionally followed by extra tag fields:

66 aligned reads (.sam/.bam) SAM sequence alignment map BAM binary alignment map 2. Alignment: contains location and qualities for all the reads M00725:28: AJ72K:1:1101:18215: cc_ref M = CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAG TAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+

67 aligned reads (.sam/.bam) SAM sequence alignment map BAM binary alignment map 2. Alignment: contains location and qualities for all the reads QNAME M00725:28: AJ72K:1:1101:18215: cc_ref M = FLAG CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAG TAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+

68 aligned reads (.sam/.bam) Bitwise FLAG: it is a single integer, but it represents the sum of different values, one for each property that holds for the read. QNAME FLAG M00725:28: AJ72K:1:1101:18215: Open Firefox > google.co.uk > search for "bitwise flag broad". An online tool provides a quick translation.

69 aligned reads (.sam/.bam) Bitwise FLAG: it is a single integer, but it represents the sum of different values, one for each property that holds for the read. QNAME FLAG M00725:28: AJ72K:1:1101:18215: An online tool provides a quick translation.
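The translation done by the online tool can also be sketched in a few lines: each bit of the FLAG integer is tested separately. The bit values follow the SAM specification; the example value 99 is hypothetical (the FLAG of the read on the slide is not visible in this transcript), and the short descriptions are paraphrased.

```python
from typing import List

# Bit values as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> List[str]:
    """List the properties whose bit is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))  # 99 = 1 + 2 + 32 + 64
```

The duplicate bit (0x400) is the one set later by the duplicate-marking step, and bwa mem -M is what makes shorter split hits carry 0x100 instead of 0x800.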

70 aligned reads (.sam/.bam) SAM sequence alignment map BAM binary alignment map 2. Alignment: contains location and qualities for all the reads QNAME FLAG RNAME POS MAPQ M00725:28: AJ72K:1:1101:18215: cc_ref CIGAR 300M = CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAG TAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+

71 aligned reads (.sam/.bam) CIGAR string: a compact representation of the sequence alignment. It includes: M match or mismatch; I insertion; D deletion.
read:  ACTCA-TGCAGT
ref:   ACTCAGTG--GT
cigar: 5M1D2M2I2M
So, what is the CIGAR string of the following alignment?
read:  ACGTCATG---CAGT
ref:   ACG-CATGCGGCAGT
cigar: 3M1I4M3D4M
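A quick way to check a CIGAR string is to count how many read bases and how many reference bases it consumes. The sketch below does this for the operations defined in the SAM specification; the function names are illustrative.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar: str) -> Tuple[int, int]:
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in "MIS=X")
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return read_len, ref_len


print(cigar_lengths("5M1D2M2I2M"))  # (11, 10)
print(cigar_lengths("3M1I4M3D4M"))  # (12, 14)
```

Both results agree with the slide: the first read has 11 bases over 10 reference bases, the second 12 bases over 14 reference bases.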

72 aligned reads (.sam/.bam) SAM sequence alignment map BAM binary alignment map 2. Alignment: contains location and qualities for all the reads QNAME FLAG RNAME POS MAPQ M00725:28: AJ72K:1:1101:18215: cc_ref CIGAR MRNM 300M = CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAG TAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+

73 aligned reads (.sam/.bam) SAM sequence alignment map BAM binary alignment map 2. Alignment: contains location and qualities for all the reads QNAME FLAG RNAME POS MAPQ M00725:28: AJ72K:1:1101:18215: cc_ref CIGAR MRNM MPOS ISIZE 300M = SEQ CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAG TAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT QUAL CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG GGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+

75 3- Bam refinement: before starting. The BAM is missing an @RG (read group) line. What is an @RG line? You can find a detailed explanation in the SAM/BAM format specification.

76 3- Bam refinement: before starting. picard-tools AddOrReplaceReadGroups INPUT=cc_gn2_r12.bam OUTPUT=cc_gn2_r12_rg.bam RGLB=cc_gn2 RGPL=Illumina RGPU=01 RGSM=shark

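77 3- Bam refinement: before starting. Sort the bam by coordinate (with this legacy samtools 0.1.x syntax the second argument is an output prefix, and the .bam extension is added automatically): samtools sort cc_gn2_r12_rg.bam cc_gn2_r12_rg_sorted (with samtools 1.3 or later the equivalent is: samtools sort -o cc_gn2_r12_rg_sorted.bam cc_gn2_r12_rg.bam). Index the sorted bam file: samtools index cc_gn2_r12_rg_sorted.bam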

78 aligned reads (.sam/.bam) BAM format

79 aligned reads (.sam/.bam) We can read the header of the BAM file with samtools: samtools view -H cc_gn2_r12_rg_sorted.bam Questions: 1. How many chromosomes are present in your header? 2. Which version of the format is it? 3. Is it sorted?

80 aligned reads (.sam/.bam) Answers: 1. One chromosome ("artificial"): SN:cc_ref. 2. Format version: VN:1.4. 3. Yes, sorted by coordinate. Read group: ID:1 PU:01 LB:cc_gn2 SM:shark PL:Illumina

81 aligned reads VN:1.3 SN:I SN:II SN:III SN:IV SN:IX SN:Mito SN:V SN:VI SN:VII SN:VIII SN:X SN:XI SN:XII SN:XIII SN:XIV SN:XV SN:XVI ID:bwa PN:bwa VN: r789 CL:bwa mem -M

82 3- Bam refinement Input: BAM Three main steps: 1. Local realignment 2. Base quality recalibration 3. Duplicate removal Output: BAM

83 3- Bam refinement: local realignment. Problem: short indels in the sample relative to the reference sequence can pose difficulties for alignment programs. Indels occurring towards the ends of the reads are often not aligned correctly, introducing an excess of SNPs. [Alignment figure in three panels; the horizontal offsets of the reads were lost in extraction.]
Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT ACTTTCGGATGCTGATCGGGATGCTTTAGCTGA TCGGATGCTGATCGGGATGCTTTAGCTGATGCT CTGATCGGGATGCTTTAGCTGATGCTGATGG
Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT ACTTTCGGATGCTGATC ATGCTTTAGCTGA TCGGATGCTGATC T_GCTTTAGCTGATGCT CTGATC ATGCTTTAGCTGATGCTGATGG TC T_GCTTTAGCTGATGCTGATGGGCTT
Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT ACTTTCGGATGCTGATC ATGCTTTAGCTGA TCGGATGCTGATC ATGCTTTAGCTGATGCT CTGATC ATGCTTTAGCTGATGCTGATGG TC ATGCTTTAGCTGATGCTGATGGGCTT

84 3- Bam refinement Local realignment It uses the full alignment context to determine whether the indel exists. Two-step process: 1. RealignerTargetCreator: it determines the small suspicious intervals which are likely in need of realignment 2. IndelRealigner: it runs the realignment on those intervals notes: - having a list of known indels helps

87 3- Bam refinement: Base Quality Recalibration. Each base call has an associated base call quality (Phred scale). Rule of thumb: anything below Q20 is not useful data. The quality of a call depends on multiple factors (e.g. position in the read, sequence context). In addition, the alignment can provide useful information: mismatches to the reference are considered errors, unless they are known polymorphisms. It therefore requires a catalogue of variable sites! How does it work?

88 3- Bam refinement: Base Quality Recalibration. [Figure: mismatches found in the BAM file (e.g. 123 C/A, 145 G/A, 1345 C/T, and others such as G/T and C/G), each with its base quality (BQ) and covariates (cov1, cov2, cov3), are compared with a list of known variants (123 C/A, 145 G/A, 1345 C/T). Sites in the list are considered real variants; for the others, BQ is recalibrated using the different covariates.] BQ: base quality. Covariates: position in the read, sequencing cycle, dinucleotide, ...

89 3- Bam refinement: Base Quality Recalibration. [Figure: same comparison as on slide 88, with a read drawn from high quality to low quality and bases labelled by position (123 A, 145 A, 1298 T, 1789 G).] BQ: base quality. Covariates: position in the read; the ends of the reads have worse calls!

90 3- Bam refinement: Base Quality Recalibration. [Figure: same as slide 89; the mismatches that are not known variants (G/T, C/G) now carry a recalibrated quality, BQ_1.] BQ: base quality. Covariates: position in the read; the ends of the reads have worse calls!

91 3- Bam refinement Base Quality Recalibration Covariate: cycle number First cycle: higher quality Last cycles: lower quality
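The principle behind recalibration can be sketched as follows: group base calls by their covariates, count mismatches to the reference while skipping known variant sites, and turn the observed error rate back into a Phred score. This is a strongly simplified illustration and not the GATK implementation: the only covariates are the reported quality and the cycle, the "+1 / +2" smoothing is my own assumption to avoid taking the log of zero, and the observations are invented.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (reported quality, cycle, is mismatch to reference, is known variant site)
Observation = Tuple[int, int, bool, bool]


def empirical_qualities(observations: Iterable[Observation]
                        ) -> Dict[Tuple[int, int], float]:
    """Estimate an empirical Phred score per (reported quality, cycle) bin."""
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for reported_q, cycle, is_mismatch, is_known_site in observations:
        if is_known_site:
            continue  # known variants are not counted as sequencing errors
        entry = counts[(reported_q, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    return {key: -10 * math.log10((mm + 1) / (total + 2))
            for key, (mm, total) in counts.items()}


# Invented data: all bases are reported as Q30, but late cycles err more often.
obs = ([(30, 1, False, False)] * 998
       + [(30, 250, False, False)] * 90
       + [(30, 250, True, False)] * 8
       + [(30, 250, True, True)])   # mismatch at a known variant: ignored
recal = empirical_qualities(obs)
print(round(recal[(30, 1)], 1), round(recal[(30, 250)], 1))
```

In this toy data the first cycle keeps roughly its reported Q30, while cycle 250 drops to about Q10, mirroring the "first cycle: higher quality, last cycles: lower quality" pattern.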

92 3- Bam refinement: Base Quality Recalibration. We will not run the Base Quality Recalibration, because of time constraints and because it depends on an available list of known variants. A few more details: it supports several platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences (stated on the website) and IonTorrent (stated in the GATK forum). You can find how to do it at: ers_bqsr_baserecalibrator.php

94 3- Bam refinement Duplicate removal PCR is used during library preparation. This can result in duplicate DNA fragments in the final library prep. What we want: information from independent fragments What we do not want: copies of the same information coming from one fragment Reference Sequence

96 3- Bam refinement Duplicate removal C C C C A Ref call Possible heterozygote, SNP call

97 3- Bam refinement: Duplicate removal. Why is it important to do it? Duplicates can result in false SNP calls: they mimic high coverage and so give undue support to some variants. PCR-free protocols exist but require a large amount of DNA.

98 3- Bam refinement: Duplicate removal. The number of duplicates varies with the complexity of the library: whole-genome experiments (<5%); custom enrichment experiments (<30%). It must be done after alignment and at the library level. How does it work? It identifies read pairs whose outer ends map to the same position on the genome and removes all but one copy.
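The rule just described can be sketched as a grouping problem: read pairs with the same outer coordinates form one group, and all but one member of each group are flagged. This is a toy illustration, not Picard's code: strand and library are ignored, keeping the pair with the highest score (for example the sum of base qualities) is an assumption, and the read names and coordinates are invented.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class ReadPair(NamedTuple):
    name: str
    chrom: str
    start: int   # leftmost outer coordinate of the pair
    end: int     # rightmost outer coordinate of the pair
    score: int   # e.g. sum of base qualities


def find_duplicates(pairs: Iterable[ReadPair]) -> Set[str]:
    """Return the names of pairs flagged as duplicates."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair.chrom, pair.start, pair.end)].append(pair)
    duplicates: Set[str] = set()
    for group in groups.values():
        best = max(group, key=lambda p: p.score)  # the copy that is kept
        duplicates.update(p.name for p in group if p is not best)
    return duplicates


pairs = [
    ReadPair("a", "cc_ref", 100, 400, 900),
    ReadPair("b", "cc_ref", 100, 400, 950),
    ReadPair("c", "cc_ref", 100, 400, 800),
    ReadPair("d", "cc_ref", 150, 420, 700),
]
dups = find_duplicates(pairs)
print(sorted(dups), len(dups) / len(pairs))  # ['a', 'c'] 0.5
```

Because the grouping relies on mapped coordinates, the step only makes sense after alignment, as the slide points out.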

99 3- Bam refinement: Duplicate removal. Module used: picard-tools MarkDuplicates INPUT=cc_gn2_r12_rg_sorted.bam OUTPUT=cc_gn2_r12_rg_sorted_rmdup.bam METRICS_FILE=dupl_metrics.txt where INPUT is the input file, OUTPUT the output file and METRICS_FILE the metrics file. Note: with the default setting REMOVE_DUPLICATES=false, duplicates are flagged in the output BAM rather than deleted.

100 3- Bam refinement: Duplicate removal. picard-tools MarkDuplicates INPUT=cc_gn2_r12_rg_sorted.bam OUTPUT=cc_gn2_r12_rg_sorted_rmdup.bam METRICS_FILE=dupl_metrics.txt What do we have to do after each step? Sort and index the newly generated BAM file.

101 3- Bam refinement: Duplicate removal. Sort the bam: samtools sort cc_gn2_r12_rg_sorted_rmdup.bam cc_gn2_r12_rg_sorted_rmdup_sorted Index the sorted bam file: samtools index cc_gn2_r12_rg_sorted_rmdup_sorted.bam

104 3- Bam refinement: BAM check. The BAM check answers several questions. Duplicate removal: how many duplicates do I have? Is that reasonable for my experiment? Alignment stats: how many of my reads mapped back to the reference? How many of these are paired in mapping? How many pairs are mapped to different chromosomes? Coverage: how much average coverage do I have? Is the coverage evenly distributed along my region?
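The two coverage questions can be illustrated with a minimal sketch that takes one depth value per position of the target region. The depth values, the threshold of 10x and the function name are invented for illustration; in practice the per-position depths would come from a coverage tool.

```python
from typing import Sequence, Tuple


def coverage_summary(depths: Sequence[int],
                     min_depth: int = 10) -> Tuple[float, float]:
    """Return (mean depth, fraction of positions with depth >= min_depth)."""
    if not depths:
        raise ValueError("no positions given")
    mean_depth = sum(depths) / len(depths)
    breadth = sum(d >= min_depth for d in depths) / len(depths)
    return mean_depth, breadth


# Invented depths for a five-position region.
mean_depth, breadth = coverage_summary([0, 10, 20, 30, 40])
print(mean_depth, breadth)  # 20.0 0.8
```

A good mean with a low fraction of well-covered positions is a hint that the coverage is not evenly distributed along the region.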

105 3- Bam refinement: BAM check: duplicate removal. picard-tools MarkDuplicates INPUT=cc_gn2_r12_rg_sorted.bam OUTPUT=cc_gn2_r12_rg_sorted_rmdup.bam METRICS_FILE=dupl_metrics.txt

106 3- Bam refinement BAM check: duplicate removal Open the file dupl_metrics.txt (more/cat/gedit). 1) Check the % of duplicates: ~20.3%

107 3- Bam refinement BAM check: duplicate removal
## net.sf.picard.metrics.stringheader
# net.sf.picard.sam.markduplicates INPUT=[/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/dd10/final_files/cc_gn2_r12_rg_sorted.bam] OUTPUT=/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/dd10/final_files/cc_gn2_r12_rg_sorted_rmdup.bam METRICS_FILE=/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/dd10/final_files/dupl_metrics.txt PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM= CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.stringheader
# Started on: Thu Nov 10 18:37:26 GMT 2016
## METRICS CLASS net.sf.picard.sam.duplicationmetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
cc_gn
We have 20.3% PCR duplicates in our experiment.


109 3- Bam refinement BAM check: alignment stats
Run flagstat on the BAM file before and after BAM refinement. Can you see any difference?
BEFORE: samtools flagstat cc_gn2_r12_rg_sorted.bam
AFTER: samtools flagstat cc_gn2_r12_rg_sorted_rmdup.bam

110 3- Bam refinement BAM check: alignment stats
BEFORE: in total (QC-passed reads + QC-failed reads); duplicates; mapped (25.51%:-nan%); paired in sequencing; read1; read2; properly paired (19.71%:-nan%); with itself and mate mapped; singletons (2.91%:-nan%); with mate mapped to a different chr; with mate mapped to a different chr (mapQ>=5)
AFTER: in total (QC-passed reads + QC-failed reads); duplicates; mapped (25.51%:-nan%); paired in sequencing; read1; read2; properly paired (19.71%:-nan%); with itself and mate mapped; singletons (2.91%:-nan%); with mate mapped to a different chr; with mate mapped to a different chr (mapQ>=5)

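111 3- Bam refinement BAM check: alignment stats
The duplicates reported by flagstat match the Picard metrics (columns LIBRARY, UNPAIRED_READ_DUPLICATES, READ_PAIR_DUPLICATES, PERCENT_DUPLICATION): each duplicated pair counts as two reads.
For library cc_gn: UNPAIRED_READ_DUPLICATES + (8687*2) = 24400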
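The relation between the Picard columns and the flagstat duplicate count can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and the unpaired count is back-calculated from the two numbers that survive on the slide (8687 and 24400), so it is an assumption rather than a value read from dupl_metrics.txt.

```python
def duplicate_reads(unpaired_read_duplicates: int, read_pair_duplicates: int) -> int:
    """Total duplicate reads: each duplicated pair contributes two reads."""
    return unpaired_read_duplicates + 2 * read_pair_duplicates


# 8687 duplicated pairs and 24400 duplicate reads are the values shown on the slide.
# The unpaired count is inferred from them (assumption: the slide value was lost).
inferred_unpaired = 24400 - 2 * 8687
total_duplicates = duplicate_reads(inferred_unpaired, 8687)
print(inferred_unpaired, total_duplicates)
```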


113 3- Bam refinement BAM check: coverage estimation Depth of Coverage: the number of reads that span a given DNA sequence of interest. This is commonly expressed as Yx, where Y is the number of reads and x is the unit of the depth-of-coverage metric (e.g. 5x, 10x, 20x, 100x). Depths shown in the figure: 7x, 11x, 9x
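To make the definition concrete, the following Python sketch counts how many reads span each position of a small region. The read coordinates are invented for illustration and the function name is hypothetical; real depths would come from a tool such as GATK or samtools rather than from this loop.

```python
from collections import Counter


def per_base_depth(reads, start, end):
    """Count how many reads cover each position in [start, end] (1-based, inclusive)."""
    depth = Counter()
    for read_start, read_end in reads:
        for pos in range(max(read_start, start), min(read_end, end) + 1):
            depth[pos] += 1
    return [depth[pos] for pos in range(start, end + 1)]


# Hypothetical reads given as (first base, last base) on the reference.
reads = [(1, 5), (3, 8), (4, 6)]
depths = per_base_depth(reads, 1, 8)
mean_depth = sum(depths) / len(depths)
print(depths, mean_depth)
```

A position covered by three reads has a depth of 3x; the mean over the region is what is usually quoted as the average coverage.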

114 3- Bam refinement BAM check: coverage estimation GATK, coverage per base
java -jar /opt/gatk-3.5-0/GenomeAnalysisTK.jar -I cc_gn2_r12_rg_sorted_rmdup_sorted.bam -R cc_ref.fa -T DepthOfCoverage -o cc_gn2_r12_rg_sorted_rmdup_coverage -L coverage.intervals
-I: input file; -R: reference sequence; -T: module used; -o: output; -L: list of intervals

115 3- Bam refinement BAM check: coverage estimation GATK, coverage per base -L coverage.intervals List of intervals, cc_ref:

116 3- Bam refinement BAM check: coverage estimation gedit cc_gn2_r12_rg_sorted_rmdup_coverage.sample_summary OR more/cat/head
sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_15
shark
Total N/A N/A N/A

117 3- Bam refinement BAM check: coverage estimation Per base coverage gedit cc_gn2_r12_rg_sorted_rmdup_coverage OR More/Cat/Head Locus Total_Depth Average_Depth_sample Depth_for_shark cc_ref: cc_ref: cc_ref: cc_ref: cc_ref: cc_ref: cc_ref: cc_ref: cc_ref: cc_ref: cc_ref:

118 3- Bam refinement BAM check: coverage estimation Open R: Open a terminal; Type R;

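119 3- Bam refinement BAM check: coverage estimation
# read the per-base coverage table written by DepthOfCoverage
data <- read.table("cc_gn2_r12_rg_sorted_rmdup_coverage", sep="\t", header=TRUE)
names(data)
# strip the "cc_ref:" prefix from the Locus column and keep the position as a number
old_col <- data$Locus
new_col <- as.numeric(gsub("cc_ref:", "", as.character(old_col)))
data["pos"] <- new_col
# coverage along the region
plot(data$pos, data$Depth_for_shark, type="l")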

120 3- Bam refinement BAM check: coverage estimation [Plot: coverage (y axis) against position in bp (x axis)]

121 3- Bam refinement BAM check: coverage estimation

122 3- Bam refinement BAM check: coverage estimation
Why is it important to check the coverage?
- To check how your experiment performed (one of the ways to assess the quality of your experiment);
- To understand how confident you can be in your data;
- To decide on filtering after variant calling;
- To look for structural variation.
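A quick numerical summary of the per-base table supports these checks. The Python sketch below assumes a tab-separated file with the columns shown earlier (Locus, Total_Depth, Average_Depth_sample, Depth_for_shark); the rows are invented, the function name is hypothetical, and counting a depth equal to the cut-off as passing is a choice made here, not a statement about how GATK computes %_bases_above_15.

```python
import csv
import io
import statistics


def summarise_coverage(handle, depth_column="Depth_for_shark", min_depth=15):
    """Return (mean depth, median depth, % of bases with depth >= min_depth)."""
    depths = [int(row[depth_column]) for row in csv.DictReader(handle, delimiter="\t")]
    above = sum(d >= min_depth for d in depths)
    return statistics.mean(depths), statistics.median(depths), 100 * above / len(depths)


# Invented example rows; a real file would be passed as open(path).
example = io.StringIO(
    "Locus\tTotal_Depth\tAverage_Depth_sample\tDepth_for_shark\n"
    "cc_ref:1\t10\t10.00\t10\n"
    "cc_ref:2\t20\t20.00\t20\n"
    "cc_ref:3\t30\t30.00\t30\n"
    "cc_ref:4\t20\t20.00\t20\n"
)
mean_depth, median_depth, pct_above = summarise_coverage(example)
print(mean_depth, median_depth, pct_above)
```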


125 4- BAM check: visualisation Different tools to visualise aligned NGS data: IGV: Tablet: Samtools tview

126 4- BAM check: visualisation Samtools tview: Basic visualization tool; Terminal based; No need of RAM/Java/extra packages;

127 4- BAM check: visualisation samtools tview cc_gn2_r12_rg_sorted_rmdup_sorted.bam -p cc_ref:

128 4- BAM check: visualisation Input: bam file samtools tview cc_gn2_r12_rg_sorted_rmdup_sorted.bam -p cc_ref: Position (Chr:bp)

129 4- BAM check: visualisation Reference sequence Consensus Position Read

130 4- BAM check: visualisation
. : base matching the reference, positive strand;
, : base matching the reference, negative strand;
underlined: secondary or orphan;
Uppercase letters: mismatch, positive strand;
Lowercase letters: mismatch, negative strand;

131 4- BAM check: visualisation Press.

132 4- BAM check: visualisation Reverse strand Positive strand

133 4- BAM check: visualisation ?: menu; m: mapping quality; n: nucleotide; b: base quality; .: on/off dots; q: exit. Underlined: secondary or orphan. Colours for quality.

134 4- BAM check: visualisation Press q, then m. Press ? to see the colour legend, from MQ >= 30 down to 0 <= MQ <= 9.

135 4- BAM check: visualisation Press b. Press ? to see the colour legend: BQ >= 30; 20 <= BQ <= 29; 10 <= BQ <= 19; 0 <= BQ <= 9.

136 4- BAM check: visualisation Press n. What happens? The four nucleotides are highlighted in four different colours; mapping quality and base quality are no longer shown.

137 4- BAM check: visualisation

138 4- BAM check: visualisation
1. Is there any polymorphic site between positions 1,174,361 and 1,174,391?
2. Is there any polymorphic site between positions 1,233,201 and 1,233,221?
If so, state the reference and alternative allele, the average base quality (BQ), the average mapping quality of the reads (MQ) and how you would call that site (i.e. homozygous reference, heterozygous, homozygous alternative).

139 4- BAM check: visualisation

140 4- BAM check: visualisation
1. Is there any polymorphic site between positions 1,174,361 and 1,174,391? No polymorphic sites between positions 1,174,361 and 1,174,391.
2. Is there any polymorphic site between positions 1,233,201 and 1,233,221?
If so, state the reference and alternative allele, the average base quality (BQ), the average mapping quality of the reads (MQ) and how you would call that site (i.e. homozygous reference, heterozygous, homozygous alternative).

141 4- BAM check: visualisation

142 4- BAM check: visualisation Yes, there is one polymorphic site. Reference allele: G; Alternative allele: T; Call: possible homozygous alternative (T/T); Average BQ: BQ >= 30 (white); Average MQ: MQ >= 30 (white)

143 4- BAM check: visualisation
1. Is there any polymorphic site between positions 4761 and 4771? No polymorphic sites between positions 4761 and 4771.
2. Is there any polymorphic site between positions 1781 and 1791? If so, state the reference and alternative allele, the average base quality (BQ), the average mapping quality of the reads (MQ) and how you would call that site (i.e. homozygous reference, heterozygous, homozygous alternative).
Yes, there is one polymorphic site. Reference allele: G; Alternative allele: T; Call: possible homozygous alternative (T/T); Average BQ: BQ >= 30; Average MQ: MQ >= 30

144 4- BAM check: visualisation IGV

145 4- BAM check: visualisation IGV


148 5- Variant calling
SNPs: samtools; GATK (1. Unified Genotyper, 2. Haplotype caller)
indels: samtools; GATK (1. Unified Genotyper, 2. Haplotype caller); Dindel
SV: SVMerge (pipeline combining many different tools)
SNPs: Single Nucleotide Polymorphisms; Indels: insertions/deletions; SV: Structural Variation

149 5- Variant calling
Variant calling: examine the bases aligned to each position and look for differences.
What we are looking for: polymorphic sites / monomorphic sites.
Factors to consider:
- Base call qualities of each supporting base
- Proximity to indels and homopolymer runs
- Mapping qualities of the reads supporting the SNP (increased read length or paired-end reads help MQ scores)
- Sequencing depth
- Individual vs multi-sample calling
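The first two factors can be illustrated with a toy example in Python. This is not how samtools or GATK call variants; it only shows the idea of discarding observations with low base or mapping quality before looking at which alleles remain at a position. The observations and the function name are invented, and the cut-offs (BQ 20, MQ 40) simply mirror the values used later in this practical.

```python
from collections import Counter


def tally_alleles(pileup, min_bq=20, min_mq=40):
    """Count bases at one position, keeping only observations that pass both quality cut-offs."""
    return Counter(base for base, bq, mq in pileup if bq >= min_bq and mq >= min_mq)


# Hypothetical observations at one site: (base, base quality, mapping quality).
pileup = [("G", 35, 60), ("T", 32, 60), ("T", 30, 47), ("T", 12, 60), ("G", 33, 20)]
counts = tally_alleles(pileup)
print(counts)
```

Two of the five observations are dropped (one for low base quality, one for low mapping quality), which changes the allele balance seen by any downstream genotype call.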


152 5- Variant calling About the variant file: raw variants (.vcf) = filtered variants (.vcf) variants (.vcf)

153 variants (.vcf)
Standardised format for storing DNA polymorphism data:
- SNPs, indels, SV
- Rich annotations
Can be indexed for fast retrieval of variants from a range of positions.
Can store variant information over many samples.
Records meta-data about the site:
- dbSNP accession, filter status
Very flexible:
- Tags can be introduced to describe new types of variants
- Different VCF files may contain different information/annotations

154 variants (.vcf) Data Header

155 variants (.vcf) Header Something specific to the header lines???

156 variants (.vcf) Header
Lines starting with ##: arbitrary number of meta-information lines
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
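Because the meta-information lines follow a regular KEY=<field=value,...> layout, they are easy to read programmatically. The Python sketch below parses one of the ##INFO lines shown above; the regular expressions and the function name are illustrative assumptions, not part of any VCF library, and they only handle the structured form of meta lines.

```python
import re

META_PATTERN = re.compile(r'^##(\w+)=<(.*)>$')
# A field value is either a double-quoted string (which may contain commas) or runs up to the next comma.
FIELD_PATTERN = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta_line(line):
    """Split a structured '##KEY=<...>' line into (KEY, {field: value})."""
    match = META_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a structured meta-information line")
    key, body = match.groups()
    fields = {name: value.strip('"') for name, value in FIELD_PATTERN.findall(body)}
    return key, fields


line = '##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">'
key, fields = parse_meta_line(line)
print(key, fields)
```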

157-169 variants (.vcf) Header
Line starting with #: column definition. Mandatory columns include:
- CHROM: chromosome
- POS: position of the start of the variant
- ID: unique identifier of the variant (e.g. rs number for SNPs)
- REF: reference allele
- ALT: comma-separated list of alternate non-reference alleles
- QUAL: phred-scaled quality score
- FILTER: site filtering information
- INFO: user-extensible annotation (e.g. samtools and GATK may differ in this)
- FORMAT: how the information for each sample is presented (e.g. GT:DP:DV:SP:DPR)
- samples follow

170 variants (.vcf) Data Header

171 variants (.vcf) Data one line per site (all columns described above per line); useful information per site and per sample;

172 5- Variant calling Software and tools: samtools and bcftools. Input: cc_gn2_r12_rg_sorted_rmdup_sorted.bam ????? You have to build the command yourselves, depending on what we want to have in the final vcf file.

173-177 5- Variant calling
samtools mpileup cc_gn2_r12_rg_sorted_rmdup_sorted.bam | bcftools view > cc_gn2_bq20_mq40.vcf
How do we find the options corresponding to the requirements below? Type samtools mpileup in the terminal; type bcftools view in the terminal.
samtools mpileup:
1. No indel calling: -I
2. Parameter for adjusting mapQ (use 50 as value): -C 50
3. Base quality minimum 20: -Q 20
4. Mapping quality minimum 40: -q 40
5. Output per-sample strand bias P-value (SP): -S
6. Output per-sample Depth of Coverage (DP): -D
7. Generate BCF output: -g
8. Specify reference sequence file: -f ref_seq
bcftools view:
1. Call genotypes at variant sites: -g
2. Variant sites only: -v

178 5- Variant calling Variant calling command: samtools mpileup -I -C 50 -Q 20 -q 40 -S -D -g -f cc_ref.fa cc_gn2_r12_rg_sorted_rmdup_sorted.bam | bcftools view -gv - > cc_gn2_bq20_mq40.vcf

179 5- Variant calling gedit cc_gn2_bq20_mq40.vcf OR More/Cat/Head #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT shark cc_ref G A 140. DP=10;VDB= e- 02;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48 GT:PL:DP:SP:GQ 1/1:173,21,0:7:0:39


182 5- Variant calling #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT shark cc_ref G A 140. DP=10;VDB= e 02;AF1=1;AC1=2; DP4=0,0,6,1;MQ=47;FQ=-48 GT:PL:DP:SP:GQ 1/1:173,21,0:7:0:39
INFO: user extensible annotation (e.g. samtools and GATK may differ in this)
INFO: DP=10;VDB= e 02;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48
DP: "Raw read depth"
DPR: "Number of high-quality bases observed for each allele"
DP4: "Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases"
MQ: "Average mapping quality"
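The INFO column is a semicolon-separated list of KEY=value entries, so it can be split into a dictionary with a few lines of Python. The sketch below uses the INFO string from the slide with the VDB entry left out, because its value did not survive in the slide text; the function name is hypothetical.

```python
def parse_info(info):
    """Turn 'KEY=value;FLAG;...' into a dict; flags without a value map to True."""
    parsed = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        parsed[key] = value if sep else True
    return parsed


# INFO string from the slide, without VDB (assumption: its value is unknown here).
info = parse_info("DP=10;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48")
ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in info["DP4"].split(","))
print(info["DP"], info["MQ"], alt_fwd + alt_rev)
```

Reading DP4 this way shows that all seven high-quality bases at this site support the alternative allele (six on the forward strand, one on the reverse).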

183 5- Variant calling #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT shark cc_ref G A 140. DP=10;VDB= e 02;AF1=1;AC1=2; DP4=0,0,6,1;MQ=47;FQ=-48 GT:PL:DP:SP:GQ 1/1:173,21,0:7:0:39 The INFO field describes the SITE and NOT the individual samples in multi-sample calling!


185 5- Variant calling #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT shark cc_ref G A 140. DP=10;VDB= e 02;AF1=1;AC1=2; DP4=0,0,6,1;MQ=47;FQ=-48 GT:PL:DP:SP:GQ 1/1:173,21,0:7:0:39 FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ). In multi-sample calling each sample has its own column, laid out as described by FORMAT; the information is sample-specific! GT:PL:DP:SP:GQ

186 5- Variant calling FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ)
GT: genotype. 0: reference allele; 1: alternative allele.
0/0: homozygote reference; 0/1: heterozygote; 1/1: homozygote alternative.
1/1:173,21,0:7:0:39 is homozygote alternative.

187 5- Variant calling FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ)
PL: list of Phred-scaled genotype likelihoods; approximate likelihood = 10^(-PL/10)
1/1:173,21,0:7:0:39
P(0/0) = 10^(-173/10) = 10^(-17.3) ≈ 5.0e-18
P(0/1) = 10^(-21/10) = 10^(-2.1) ≈ 0.0079
P(1/1) = 10^(-0/10) = 10^(0) = 1
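The same conversion can be done for any PL field with a short Python function. This is a minimal sketch of the formula above (likelihood = 10^(-PL/10)); the function name is hypothetical and the values are relative likelihoods, not probabilities that sum to one.

```python
def pl_to_likelihoods(pl_field):
    """Convert a comma-separated PL string into approximate relative likelihoods 10^(-PL/10)."""
    return [10 ** (-int(pl) / 10) for pl in pl_field.split(",")]


# PL values from the slide, in the order 0/0, 0/1, 1/1.
hom_ref, het, hom_alt = pl_to_likelihoods("173,21,0")
print(hom_ref, het, hom_alt)
```

The genotype with PL = 0 is always the most likely one, which is why the call here is 1/1.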

188 5- Variant calling FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ) GT:PL:DP:SP:GQ DP: Number of high-quality bases; 1/1:173,21,0:7:0:39 DP=7

189 5- Variant calling FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ)
SP: Phred-scaled strand bias P-value; strand bias p-value = 10^(-SP/10)
0.05 = 10^(-SP/10), so SP = -10*log10(0.05) = 13.01
Would you keep sites with SP > 13 or sites with SP < 13?
SP > 13: p-value < 0.05, there is a statistically significant difference in the number of alleles coming from the two strands;
SP < 13: p-value > 0.05, there is NOT a statistically significant difference in the number of alleles coming from the two strands.

190 5- Variant calling FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ)
SP: Phred-scaled strand bias P-value; strand bias p-value = 10^(-SP/10)
0.05 = 10^(-SP/10), so SP = -10*log10(0.05) = 13.01
If we assume a p-value threshold of 0.05, we will keep all sites with SP < 13.
1/1:173,21,0:7:0:39 SP=0
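The threshold calculation generalises to any p-value. The Python sketch below converts between a probability and its Phred-scaled value and applies the resulting cut-off to an SP value; the function names are hypothetical and the 0.05 threshold is the one assumed on the slide, not a recommendation.

```python
import math


def p_to_phred(p_value):
    """Phred-scale a probability: -10 * log10(p)."""
    return -10 * math.log10(p_value)


def phred_to_p(phred):
    """Inverse transform: 10^(-phred/10)."""
    return 10 ** (-phred / 10)


sp_cutoff = p_to_phred(0.05)  # about 13.01


def passes_strand_bias(sp, cutoff=sp_cutoff):
    """Keep a site when its strand-bias p-value is above the threshold, i.e. SP is below the cut-off."""
    return sp < cutoff


print(round(sp_cutoff, 2), passes_strand_bias(0), passes_strand_bias(20))
```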

191 5- Variant calling FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ)
GQ: conditional genotype quality, encoded as a phred quality; GQ = -10*log10 P(genotype call is wrong)
P(genotype call is wrong) = 10^(-GQ/10)
1/1:173,21,0:7:0:39
P(genotype call is wrong) = 10^(-39/10) = 10^(-3.9) ≈ 0.00013

192 5- Variant calling #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT shark cc_ref G A 140. DP=10;VDB= e 02;AF1=1;AC1=2; DP4=0,0,6,1;MQ=47;FQ=-48 GT:PL:DP:SP:GQ 1/1:173,21,0:7:0:39
FORMAT: how the information for each sample is presented (i.e. GT:PL:DP:SP:GQ)
GT: Genotype;
PL: List of Phred-scaled genotype likelihoods;
DP: Number of high-quality bases;
SP: Phred-scaled strand bias P-value;
GQ: Genotype quality.
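Since the FORMAT column lists the keys and each sample column lists the values in the same order, the two can be paired directly. The following Python sketch does this for the shark sample shown above; the function name and the label dictionary are illustrative, and only the three diploid genotypes from the slides are covered.

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the value in the same position of the sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


sample = parse_sample("GT:PL:DP:SP:GQ", "1/1:173,21,0:7:0:39")

# Labels for the genotype codes described in the slides.
genotype_labels = {
    "0/0": "homozygous reference",
    "0/1": "heterozygous",
    "1/1": "homozygous alternative",
}
call = genotype_labels[sample["GT"]]
print(sample, call)
```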

193 5- Variant calling Useful references and resources: at-is-a-vcf-and-how-should-i-interpret-it

194 5- Variant calling We have our raw variants, but now we need to refine our dataset. How? Variant score recalibration; variant filtering; validation.


197 5- Variant calling: Variant quality score recalibration Available in GATK. It aims to produce well-calibrated probabilities for the variants called. It develops a continuous, covarying estimate of the relationship between SNP call annotations (e.g. MQ, QD) and the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact. It needs true sites for training (e.g. HapMap Phase 3 data, OMNI 2.5M, etc.). We are not going to use it, because it needs big datasets (either many samples or whole-genome data) to work properly. You can find more information at


200 5- Variant filtering and validation
WHY filtering?
- Each caller tends to call as many sites as it can, but it provides useful information on several parameters;
- Many calls are guesses;
- Artifacts;
- Presence of sequencing errors;
- To create a high quality dataset.
WHY validation?
- An error-free dataset is unrealistic;
- To estimate the % of error within our high quality dataset.

201-204 5- Variant filtering and validation: FILTERING
Filtering is all about finding the right balance between stringent filters and enough information (sites):
- Less stringent filters: lots of sites, but with many possible errors;
- Very stringent filters: not many sites left.

205 5- Variant filtering and validation: FILTERING General approach: Try different thresholds for each parameter and draw a distribution; If possible, look for sudden changes in the distribution and set a threshold; No fixed rule on which thresholds you should use; It is data and project specific; It is a crucial step to create a high quality dataset.
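Trying different thresholds amounts to counting how many sites survive each one. The Python sketch below does this for a minimum-depth filter; the depth values are invented, the function name is hypothetical, and in practice the values would be extracted from the VCF (for example from the per-sample DP field) and the counts plotted.

```python
def sites_retained(values, thresholds):
    """For each minimum threshold, count how many sites have a value at or above it."""
    return {t: sum(v >= t for v in values) for t in thresholds}


# Invented per-site depths, for illustration only.
depths = [7, 21, 23, 23, 12, 35, 18, 9]
retained = sites_retained(depths, thresholds=[5, 10, 20, 30])
print(retained)
```

A sharp drop between two neighbouring thresholds is the kind of sudden change in the distribution that can suggest where to set the cut-off.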

206 5- Variant filtering and validation: FILTERING Common cautions: - Base quality BQ20 - Depth (min and max) very dependent on your average - Mapping quality MQ40/50 (minimum MQ30) - Strand-bias p-value> SNP density dependent on the genome [e.g. no more than 1 SNP/4bp] - Indel proximity not closer than 10bp to an indel - Missing data? - Some filters may be applied during the variant calling while others are applied afterwards.

207 5- Variant filtering and validation: FILTERING MQ distribution


209 5- Variant filtering and validation: FILTERING Coverage: distribution per run

210 5- Variant filtering and validation: FILTERING Coverage: distribution per population

211 5- Variant filtering and validation: FILTERING VCFtools
--min-meanDP <float> --max-meanDP <float>
Includes only sites with mean depth values (over all included individuals) greater than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP" value. One of these options may be used without the other. These options require that the "DP" FORMAT tag is included for each site. GT:PL:DP:SP:GQ

212 5- Variant filtering and validation: FILTERING VCFtools
vcftools --vcf cc_gn2_bq20_mq40.vcf --min-meanDP 20 --recode --recode-INFO-all --out cc_gn2_bq20_mq40_dp20.vcf
--vcf: input file (vcf); --min-meanDP 20: DP threshold; --recode: create new vcf file ("recode" in the file name); --recode-INFO-all: keep all the INFO; --out: output file
INFO: DP=10;VDB= e 02;AF1=1;AC1=2;DP4=0,0,6,1;MQ=47;FQ=-48
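The rule behind the mean-depth options (keep a site when the mean DP over all individuals lies within the inclusive bounds) can be written out in a few lines of Python. This is a sketch of the rule as described in the slides, not VCFtools' own code: the function names and the DP values are hypothetical, and missing per-sample values are not handled.

```python
def mean_depth(sample_depths):
    """Mean of the per-sample DP values at one site."""
    return sum(sample_depths) / len(sample_depths)


def keep_site(sample_depths, min_mean_dp=None, max_mean_dp=None):
    """Apply inclusive lower and upper bounds on the mean depth; either bound may be omitted."""
    m = mean_depth(sample_depths)
    if min_mean_dp is not None and m < min_mean_dp:
        return False
    if max_mean_dp is not None and m > max_mean_dp:
        return False
    return True


# Illustrative sites with one sample each (hypothetical labels and depths).
sites = {"site_a": [7], "site_b": [21], "site_c": [23]}
kept = [name for name, dps in sites.items() if keep_site(dps, min_mean_dp=20)]
print(kept)
```

With a single sample the mean depth equals that sample's DP, so a threshold of 20 removes every site whose DP is below 20.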

213 5- Variant filtering and validation: FILTERING gedit cc_gn2_bq20_mq40_dp20.vcf.recode.vcf OR cat/more cc_gn2_bq20_mq40_dp20.vcf.recode.vcf

214 5- Variant filtering and validation: FILTERING cc_ref T G 182. DP=26;VDB= e- 02;RPB= e+00;AF1=0.5;AC1=1;DP4=5,6,5,5;MQ=48;FQ=184;PV4=1,1,1,1 GT:PL:DP:SP:GQ 0/1:212,0,217:21:0:99 cc_ref N C 222. DP=24;VDB= e- 01;AF1=1;AC1=2;DP4=0,0,13,10;MQ=48;FQ=-96 GT:PL:DP:SP:GQ 1/1:255,69,0:23:0:99 cc_ref G T 222. DP=24;VDB= e- 01;AF1=1;AC1=2;DP4=0,0,11,12;MQ=46;FQ=-96 GT:PL:DP:SP:GQ 1/1:255,69,0:23:0:99

215 5- Variant filtering and validation: FILTERING
cc_ref T G 182. GT:PL:DP:SP:GQ 0/1:212,0,217:21:0:99
cc_ref N C 222. GT:PL:DP:SP:GQ 1/1:255,69,0:23:0:99
cc_ref G T 222. GT:PL:DP:SP:GQ 1/1:255,69,0:23:0:99
Does 1,233,213 G/T look familiar? You discovered it in tview!
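The genotype columns can be read programmatically as well as by eye. The sketch below pairs each key in the FORMAT column (GT:PL:DP:SP:GQ) with the matching value in the sample column and prints a one-line summary per site. The positions (100, 200), the reference name toy_ref and the sample name ind1 are invented; the genotype strings follow the pattern shown on the slide.

```shell
#!/bin/sh
# Toy single-sample VCF; positions and names are invented for illustration.
workdir=$(mktemp -d)
{
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\n'
    printf 'toy_ref\t100\t.\tT\tG\t182\t.\tDP=26\tGT:PL:DP:SP:GQ\t0/1:212,0,217:21:0:99\n'
    printf 'toy_ref\t200\t.\tA\tC\t222\t.\tDP=24\tGT:PL:DP:SP:GQ\t1/1:255,69,0:23:0:99\n'
} > "$workdir/toy.vcf"

# Match FORMAT keys to sample values, then label the genotype.
summary=$(awk -F'\t' '
/^#/ { next }
{
    split($9, fmt, ":"); split($10, val, ":")
    gt = ""; dp = ""
    for (i in fmt) {
        if (fmt[i] == "GT") gt = val[i]
        if (fmt[i] == "DP") dp = val[i]
    }
    if (gt == "0/0") call = "hom_ref"
    else if (gt == "0/1" || gt == "1/0") call = "het"
    else if (gt == "1/1") call = "hom_alt"
    else call = "other"
    print $2, $4 "/" $5, gt, call, "DP=" dp
}' "$workdir/toy.vcf")
echo "$summary"
```

A 0/1 genotype is reported as heterozygous and 1/1 as homozygous for the alternative allele, together with the per-sample depth.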

216 5- Variant filtering and validation: FILTERING Visualisation of the filtered variant sites in IGV IGV needs: A genome: reference sequence; BAM files: sorted and indexed.
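A BAM file that is not yet sorted and indexed can be prepared along the following lines. This is a sketch that assumes the samtools 1.x option syntax (older 0.1.x releases use a different form of "sort"); the function name prepare_bam_for_igv and the file names in the usage comment are placeholders, not part of the course material.

```shell
#!/bin/sh
# Sketch: coordinate-sort and index a BAM so that IGV can load it.
# Assumes samtools 1.x syntax; names are placeholders.
prepare_bam_for_igv() {
    in_bam=$1
    out_bam=$2
    # Fail early if the input file does not exist.
    if [ ! -f "$in_bam" ]; then
        echo "input BAM not found: $in_bam" >&2
        return 1
    fi
    # Fail early if samtools is not installed.
    if ! command -v samtools >/dev/null 2>&1; then
        echo "samtools is not on the PATH" >&2
        return 1
    fi
    # Sort by coordinate, then write the .bai index next to the sorted file.
    samtools sort -o "$out_bam" "$in_bam" && samtools index "$out_bam"
}

# Usage (hypothetical file names):
#   prepare_bam_for_igv sample.bam sample_sorted.bam
```

The index file (sample_sorted.bam.bai in the usage example) has to stay in the same directory as the sorted BAM for IGV to find it.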

217 5- Variant filtering and validation: FILTERING Visualisation of the filtered variant sites in IGV
./igv.sh
Genomes > Load Genome > white_shark.genome
Files > Load > cc_gn2_r12_rg_sorted_rmdup_sorted.bam

218 5- Variant filtering and validation: FILTERING Visualisation of the filtered variant sites in IGV cc_ref:39327

219 5- Variant filtering and validation: FILTERING Visualisation of the filtered variant sites in IGV cc_ref:381078

220 5- Variant filtering and validation: FILTERING Visualisation of the filtered variant sites in IGV cc_ref:

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010 Next Generation Sequence Alignment on the BRC Cluster Steve Newhouse 22 July 2010 Overview Practical guide to processing next generation sequencing data on the cluster No details on the inner workings

More information

INTRODUCTION AUX FORMATS DE FICHIERS

INTRODUCTION AUX FORMATS DE FICHIERS INTRODUCTION AUX FORMATS DE FICHIERS Plan. Formats de séquences brutes.. Format fasta.2. Format fastq 2. Formats d alignements 2.. Format SAM 2.2. Format BAM 4. Format «Variant Calling» 4.. Format Varscan

More information

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines 454 GS Junior,

More information

NGS Data Analysis. Roberto Preste

NGS Data Analysis. Roberto Preste NGS Data Analysis Roberto Preste 1 Useful info http://bit.ly/2r1y2dr Contacts: roberto.preste@gmail.com Slides: http://bit.ly/ngs-data 2 NGS data analysis Overview 3 NGS Data Analysis: the basic idea http://bit.ly/2r1y2dr

More information

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your

More information

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines: Illumina MiSeq,

More information

NGS Sequence data. Jason Stajich. UC Riverside. jason.stajich[at]ucr.edu. twitter:hyphaltip stajichlab

NGS Sequence data. Jason Stajich. UC Riverside. jason.stajich[at]ucr.edu. twitter:hyphaltip stajichlab NGS Sequence data Jason Stajich UC Riverside jason.stajich[at]ucr.edu twitter:hyphaltip stajichlab Lecture available at http://github.com/hyphaltip/cshl_2012_ngs 1/58 NGS sequence data Quality control

More information

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional. Alignment of NGS reads, samtools and visualization Hands-on Software used in this practical BWA MEM : Burrows-Wheeler Aligner. A software package for mapping low-divergent sequences against a large reference

More information

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory Pre-processing and quality control of sequence data Barbera van Schaik KEBB - Bioinformatics Laboratory b.d.vanschaik@amc.uva.nl Topic: quality control and prepare data for the interesting stuf Keep Throw

More information

Lecture 12. Short read aligners

Lecture 12. Short read aligners Lecture 12 Short read aligners Ebola reference genome We will align ebola sequencing data against the 1976 Mayinga reference genome. We will hold the reference gnome and all indices: mkdir -p ~/reference/ebola

More information

Exome sequencing. Jong Kyoung Kim

Exome sequencing. Jong Kyoung Kim Exome sequencing Jong Kyoung Kim Genome Analysis Toolkit The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic

More information

The SAM Format Specification (v1.3-r837)

The SAM Format Specification (v1.3-r837) The SAM Format Specification (v1.3-r837) The SAM Format Specification Working Group November 18, 2010 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited

More information

SAMtools. SAM BAM. mapping. BAM sort & indexing (ex: IGV) SNP call

SAMtools.   SAM BAM. mapping. BAM sort & indexing (ex: IGV) SNP call SAMtools http://samtools.sourceforge.net/ SAM/BAM mapping BAM SAM BAM BAM sort & indexing (ex: IGV) mapping SNP call SAMtools NGS Program: samtools (Tools for alignments in the SAM format) Version: 0.1.19

More information

Analysing re-sequencing samples. Anna Johansson WABI / SciLifeLab

Analysing re-sequencing samples. Anna Johansson WABI / SciLifeLab Analysing re-sequencing samples Anna Johansson Anna.johansson@scilifelab.se WABI / SciLifeLab Re-sequencing Reference genome assembly...gtgcgtagactgctagatcgaaga... Re-sequencing IND 1 GTAGACT AGATCGG GCGTAGT

More information

The SAM Format Specification (v1.3 draft)

The SAM Format Specification (v1.3 draft) The SAM Format Specification (v1.3 draft) The SAM Format Specification Working Group July 15, 2010 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text

More information

File Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

File Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 File Formats: SAM, BAM, and CRAM UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 / BAM / CRAM NEW! http://samtools.sourceforge.net/ - deprecated! http://www.htslib.org/ - SAMtools 1.0 and

More information

Read Mapping and Variant Calling

Read Mapping and Variant Calling Read Mapping and Variant Calling Whole Genome Resequencing Sequencing mul:ple individuals from the same species Reference genome is already available Discover varia:ons in the genomes between and within

More information

Mapping NGS reads for genomics studies

Mapping NGS reads for genomics studies Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization

More information

Under the Hood of Alignment Algorithms for NGS Researchers

Under the Hood of Alignment Algorithms for NGS Researchers Under the Hood of Alignment Algorithms for NGS Researchers April 16, 2014 Gabe Rudy VP of Product Development Golden Helix Questions during the presentation Use the Questions pane in your GoToWebinar window

More information

High-throughout sequencing and using short-read aligners. Simon Anders

High-throughout sequencing and using short-read aligners. Simon Anders High-throughout sequencing and using short-read aligners Simon Anders High-throughput sequencing (HTS) Sequencing millions of short DNA fragments in parallel. a.k.a.: next-generation sequencing (NGS) massively-parallel

More information

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF Acknowledgments Genomics Virtual Lab: gvl.org.au Galaxy for tutorials: galaxy-tut.genome.edu.au Galaxy Australia: galaxy-aust.genome.edu.au

More information

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection Computational Biology Service Unit (CBSU) Cornell Center for Comparative and Population Genomics (3CPG) Center for

More information

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline

Supplementary Information. Detecting and annotating genetic variations using the HugeSeq pipeline Supplementary Information Detecting and annotating genetic variations using the HugeSeq pipeline Hugo Y. K. Lam 1,#, Cuiping Pan 1, Michael J. Clark 1, Phil Lacroute 1, Rui Chen 1, Rajini Haraksingh 1,

More information

Sequence Mapping and Assembly

Sequence Mapping and Assembly Practical Introduction Sequence Mapping and Assembly December 8, 2014 Mary Kate Wing University of Michigan Center for Statistical Genetics Goals of This Session Learn basics of sequence data file formats

More information

Bioinformatics in next generation sequencing projects

Bioinformatics in next generation sequencing projects Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet March 2011 Once sequenced the problem becomes computational

More information

SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012

SAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012 SAM / BAM Tutorial EMBL Heidelberg Course Materials Tobias Rausch September 2012 Contents 1 SAM / BAM 3 1.1 Introduction................................... 3 1.2 Tasks.......................................

More information

From fastq to vcf. NGG 2016 / Evolutionary Genomics Ari Löytynoja /

From fastq to vcf. NGG 2016 / Evolutionary Genomics Ari Löytynoja / From fastq to vcf Overview of resequencing analysis samples fastq fastq fastq fastq mapping bam bam bam bam variant calling samples 18917 C A 0/0 0/0 0/0 0/0 18969 G T 0/0 0/0 0/0 0/0 19022 G T 0/1 1/1

More information

MPG NGS workshop I: Quality assessment of SNP calls

MPG NGS workshop I: Quality assessment of SNP calls MPG NGS workshop I: Quality assessment of SNP calls Kiran V Garimella (kiran@broadinstitute.org) Genome Sequencing and Analysis Medical and Population Genetics February 4, 2010 SNP calling workflow Filesize*

More information

Analysing re-sequencing samples. Malin Larsson WABI / SciLifeLab

Analysing re-sequencing samples. Malin Larsson WABI / SciLifeLab Analysing re-sequencing samples Malin Larsson Malin.larsson@scilifelab.se WABI / SciLifeLab Re-sequencing Reference genome assembly...gtgcgtagactgctagatcgaaga...! Re-sequencing IND 1! GTAGACT! AGATCGG!

More information

Calling variants in diploid or multiploid genomes

Calling variants in diploid or multiploid genomes Calling variants in diploid or multiploid genomes Diploid genomes The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen: 1. 2.

More information

CORE Year 1 Whole Genome Sequencing Final Data Format Requirements

CORE Year 1 Whole Genome Sequencing Final Data Format Requirements CORE Year 1 Whole Genome Sequencing Final Data Format Requirements To all incumbent contractors of CORE year 1 WGS contracts, the following acts as the agreed to sample parameters issued by NHLBI for data

More information

Galaxy Platform For NGS Data Analyses

Galaxy Platform For NGS Data Analyses Galaxy Platform For NGS Data Analyses Weihong Yan wyan@chem.ucla.edu Collaboratory Web Site http://qcb.ucla.edu/collaboratory Collaboratory Workshops Workshop Outline ü Day 1 UCLA galaxy and user account

More information

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Data used in the exercise We will use D. melanogaster WGS paired-end Illumina data with NCBI accessions

More information

DNA Sequencing analysis on Artemis

DNA Sequencing analysis on Artemis DNA Sequencing analysis on Artemis Mapping and Variant Calling Tracy Chew Senior Research Bioinformatics Technical Officer Rosemarie Sadsad Informatics Services Lead Hayim Dar Informatics Technical Officer

More information

Reads Alignment and Variant Calling

Reads Alignment and Variant Calling Reads Alignment and Variant Calling CB2-201 Computational Biology and Bioinformatics February 22, 2016 Emidio Capriotti http://biofold.org/ Institute for Mathematical Modeling of Biological Systems Department

More information

Introduction to NGS analysis on a Raspberry Pi. Beta version 1.1 (04 June 2013)

Introduction to NGS analysis on a Raspberry Pi. Beta version 1.1 (04 June 2013) Introduction to NGS analysis on a Raspberry Pi Beta version 1.1 (04 June 2013)!! Contents Overview Contents... 3! Overview... 4! Download some simulated reads... 5! Quality Control... 7! Map reads using

More information

REPORT. NA12878 Platinum Genome. GENALICE MAP Analysis Report. Bas Tolhuis, PhD GENALICE B.V.

REPORT. NA12878 Platinum Genome. GENALICE MAP Analysis Report. Bas Tolhuis, PhD GENALICE B.V. REPORT NA12878 Platinum Genome GENALICE MAP Analysis Report Bas Tolhuis, PhD GENALICE B.V. INDEX EXECUTIVE SUMMARY...4 1. MATERIALS & METHODS...5 1.1 SEQUENCE DATA...5 1.2 WORKFLOWS......5 1.3 ACCURACY

More information

NA12878 Platinum Genome GENALICE MAP Analysis Report

NA12878 Platinum Genome GENALICE MAP Analysis Report NA12878 Platinum Genome GENALICE MAP Analysis Report Bas Tolhuis, PhD Jan-Jaap Wesselink, PhD GENALICE B.V. INDEX EXECUTIVE SUMMARY...4 1. MATERIALS & METHODS...5 1.1 SEQUENCE DATA...5 1.2 WORKFLOWS......5

More information

Variant calling using SAMtools

Variant calling using SAMtools Variant calling using SAMtools Calling variants - a trivial use of an Interactive Session We are going to conduct the variant calling exercises in an interactive idev session just so you can get a feel

More information

Cycle «Analyse de données de séquençage à haut-débit» Module 1/5 Analyse ADN. Sophie Gallina CNRS Evo-Eco-Paléo (EEP)

Cycle «Analyse de données de séquençage à haut-débit» Module 1/5 Analyse ADN. Sophie Gallina CNRS Evo-Eco-Paléo (EEP) Cycle «Analyse de données de séquençage à haut-débit» Module 1/5 Analyse ADN Sophie Gallina CNRS Evo-Eco-Paléo (EEP) (sophie.gallina@univ-lille1.fr) Module 1/5 Analyse DNA NGS Introduction Galaxy : upload

More information

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM). Release Notes Agilent SureCall 4.0 Product Number G4980AA SureCall Client 6-month named license supports installation of one client and server (to host the SureCall database) on one machine. For additional

More information

Decrypting your genome data privately in the cloud

Decrypting your genome data privately in the cloud Decrypting your genome data privately in the cloud Marc Sitges Data Manager@Made of Genes @madeofgenes The Human Genome 3.200 M (x2) Base pairs (bp) ~20.000 genes (~30%) (Exons ~1%) The Human Genome Project

More information

Sequence Analysis Pipeline

Sequence Analysis Pipeline Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation

More information

ADNI Sequencing Working Group. Robert C. Green, MD, MPH Andrew J. Saykin, PsyD Arthur Toga, PhD

ADNI Sequencing Working Group. Robert C. Green, MD, MPH Andrew J. Saykin, PsyD Arthur Toga, PhD ADNI Sequencing Working Group Robert C. Green, MD, MPH Andrew J. Saykin, PsyD Arthur Toga, PhD Why sequencing? V V V V V V V V V V V V V A fortuitous relationship TIME s Best Invention of 2008 The initial

More information

Handling sam and vcf data, quality control

Handling sam and vcf data, quality control Handling sam and vcf data, quality control We continue with the earlier analyses and get some new data: cd ~/session_3 wget http://wasabiapp.org/vbox/data/session_4/file3.tgz tar xzf file3.tgz wget http://wasabiapp.org/vbox/data/session_4/file4.tgz

More information

SNP Calling. Tuesday 4/21/15

SNP Calling. Tuesday 4/21/15 SNP Calling Tuesday 4/21/15 Why Call SNPs? map mutations, ex: EMS, natural variation, introgressions associate with changes in expression develop markers for whole genome QTL analysis/ GWAS access diversity

More information

AgroMarker Finder manual (1.1)

AgroMarker Finder manual (1.1) AgroMarker Finder manual (1.1) 1. Introduction 2. Installation 3. How to run? 4. How to use? 5. Java program for calculating of restriction enzyme sites (TaqαI). 1. Introduction AgroMarker Finder (AMF)is

More information

Falcon Accelerated Genomics Data Analysis Solutions. User Guide

Falcon Accelerated Genomics Data Analysis Solutions. User Guide Falcon Accelerated Genomics Data Analysis Solutions User Guide Falcon Computing Solutions, Inc. Version 1.0 3/30/2018 Table of Contents Introduction... 3 System Requirements and Installation... 4 Software

More information

Local Run Manager Resequencing Analysis Module Workflow Guide

Local Run Manager Resequencing Analysis Module Workflow Guide Local Run Manager Resequencing Analysis Module Workflow Guide For Research Use Only. Not for use in diagnostic procedures. Overview 3 Set Parameters 4 Analysis Methods 6 View Analysis Results 8 Analysis

More information

NGS Data Visualization and Exploration Using IGV

NGS Data Visualization and Exploration Using IGV 1 What is Galaxy Galaxy for Bioinformaticians Galaxy for Experimental Biologists Using Galaxy for NGS Analysis NGS Data Visualization and Exploration Using IGV 2 What is Galaxy Galaxy for Bioinformaticians

More information

PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR

PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR GOAL OF THIS SESSION Assuming that The audiences know how to perform GWAS

More information

Sequence mapping and assembly. Alistair Ward - Boston College

Sequence mapping and assembly. Alistair Ward - Boston College Sequence mapping and assembly Alistair Ward - Boston College Sequenced a genome? Fragmented a genome -> DNA library PCR amplification Sequence reads (ends of DNA fragment for mate pairs) We no longer have

More information

Sentieon Documentation

Sentieon Documentation Sentieon Documentation Release 201808.03 Sentieon, Inc Dec 21, 2018 Sentieon Manual 1 Introduction 1 1.1 Description.............................................. 1 1.2 Benefits and Value..........................................

More information

DNA / RNA sequencing

DNA / RNA sequencing Outline Ways to generate large amounts of sequence Understanding the contents of large sequence files Fasta format Fastq format Sequence quality metrics Summarizing sequence data quality/quantity Using

More information

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA-MEM).

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA-MEM). Release Notes Agilent SureCall 3.5 Product Number G4980AA SureCall Client 6-month named license supports installation of one client and server (to host the SureCall database) on one machine. For additional

More information

Variation among genomes

Variation among genomes Variation among genomes Comparing genomes The reference genome http://www.ncbi.nlm.nih.gov/nuccore/26556996 Arabidopsis thaliana, a model plant Col-0 variety is from Landsberg, Germany Ler is a mutant

More information

NGS Analysis Using Galaxy

NGS Analysis Using Galaxy NGS Analysis Using Galaxy Sequences and Alignment Format Galaxy overview and Interface Get;ng Data in Galaxy Analyzing Data in Galaxy Quality Control Mapping Data History and workflow Galaxy Exercises

More information

Mapping reads to a reference genome

Mapping reads to a reference genome Introduction Mapping reads to a reference genome Dr. Robert Kofler October 17, 2014 Dr. Robert Kofler Mapping reads to a reference genome October 17, 2014 1 / 52 Introduction RESOURCES the lecture: http://drrobertkofler.wikispaces.com/ngsandeelecture

More information

SAM and VCF formats. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016

SAM and VCF formats. UCD Genome Center Bioinformatics Core Tuesday 14 June 2016 SAM and VCF formats UCD Genome Center Bioinformatics Core Tuesday 14 June 2016 File Format: SAM / BAM / CRAM! NEW http://samtools.sourceforge.net/ - deprecated! http://www.htslib.org/ - SAMtools 1.0 and

More information

MiSeq Reporter TruSight Tumor 15 Workflow Guide

MiSeq Reporter TruSight Tumor 15 Workflow Guide MiSeq Reporter TruSight Tumor 15 Workflow Guide For Research Use Only. Not for use in diagnostic procedures. Introduction 3 TruSight Tumor 15 Workflow Overview 4 Reports 8 Analysis Output Files 9 Manifest

More information

Genomic Files. University of Massachusetts Medical School. October, 2014

Genomic Files. University of Massachusetts Medical School. October, 2014 .. Genomic Files University of Massachusetts Medical School October, 2014 2 / 39. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further

More information

WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder

WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder WM2 Bioinformatics ExomeSeq data analysis part 1 Dietmar Rieder RAW data Use putty to logon to cluster.i med.ac.at In your home directory make directory to store raw data $ mkdir 00_RAW Copy raw fastq

More information

RNA-seq Data Analysis

RNA-seq Data Analysis Seyed Abolfazl Motahari RNA-seq Data Analysis Basics Next Generation Sequencing Biological Samples Data Cost Data Volume Big Data Analysis in Biology تحلیل داده ها کنترل سیستمهای بیولوژیکی تشخیص بیماریها

More information

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL QIAseq DNA V3 Panel Analysis Plugin USER MANUAL User manual for QIAseq DNA V3 Panel Analysis 1.0.1 Windows, Mac OS X and Linux January 25, 2018 This software is for research purposes only. QIAGEN Aarhus

More information

RPGC Manual. You will also need python 2.7 or above to run our home-brew python scripts.

RPGC Manual. You will also need python 2.7 or above to run our home-brew python scripts. Introduction Here we present a new approach for producing de novo whole genome sequences--recombinant population genome construction (RPGC)--that solves many of the problems encountered in standard genome

More information

Understanding and Pre-processing Raw Illumina Data

Understanding and Pre-processing Raw Illumina Data Understanding and Pre-processing Raw Illumina Data Matt Johnson October 4, 2013 1 Understanding FASTQ files After an Illumina sequencing run, the data is stored in very large text files in a standard format

More information

The SAM Format Specification (v1.4-r956)

The SAM Format Specification (v1.4-r956) The SAM Format Specification (v1.4-r956) The SAM Format Specification Working Group April 12, 2011 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text

More information

Sequence Alignment/Map Optional Fields Specification

Sequence Alignment/Map Optional Fields Specification Sequence Alignment/Map Optional Fields Specification The SAM/BAM Format Specification Working Group 14 Jul 2017 The master version of this document can be found at https://github.com/samtools/hts-specs.

More information

NGS Data and Sequence Alignment

NGS Data and Sequence Alignment Applications and Servers SERVER/REMOTE Compute DB WEB Data files NGS Data and Sequence Alignment SSH WEB SCP Manpreet S. Katari App Aug 11, 2016 Service Terminal IGV Data files Window Personal Computer/Local

More information

Dindel User Guide, version 1.0

Dindel User Guide, version 1.0 Dindel User Guide, version 1.0 Kees Albers University of Cambridge, Wellcome Trust Sanger Institute caa@sanger.ac.uk October 26, 2010 Contents 1 Introduction 2 2 Requirements 2 3 Optional input 3 4 Dindel

More information

The SAM Format Specification (v1.4-r994)

The SAM Format Specification (v1.4-r994) The SAM Format Specification (v1.4-r994) The SAM Format Specification Working Group January 27, 2012 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text

More information

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits Run Setup and Bioinformatic Analysis Accel-NGS 2S MID Indexing Kits Sequencing MID Libraries For MiSeq, HiSeq, and NextSeq instruments: Modify the config file to create a fastq for index reads Using the

More information

Genomic Files. University of Massachusetts Medical School. October, 2015

Genomic Files. University of Massachusetts Medical School. October, 2015 .. Genomic Files University of Massachusetts Medical School October, 2015 2 / 55. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further

More information

Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs)

Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs) next generation sequencing analysis guidelines Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs) See what more we can do for you at www.idtdna.com. For Research Use Only

More information

MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September

MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September 27 2014 Static Dynamic Static Minimum Information for Reporting

More information

Atlas-SNP2 DOCUMENTATION V1.1 April 26, 2010

Atlas-SNP2 DOCUMENTATION V1.1 April 26, 2010 Atlas-SNP2 DOCUMENTATION V1.1 April 26, 2010 Contact: Jin Yu (jy2@bcm.tmc.edu), and Fuli Yu (fyu@bcm.tmc.edu) Human Genome Sequencing Center (HGSC) at Baylor College of Medicine (BCM) Houston TX, USA 1

More information

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013 RNAseq analysis: SNP calling BTI bioinformatics course, spring 2013 RNAseq overview RNAseq overview Choose technology 454 Illumina SOLiD 3 rd generation (Ion Torrent, PacBio) Library types Single reads

More information

Peter Schweitzer, Director, DNA Sequencing and Genotyping Lab

Peter Schweitzer, Director, DNA Sequencing and Genotyping Lab The instruments, the runs, the QC metrics, and the output Peter Schweitzer, Director, DNA Sequencing and Genotyping Lab Overview Roche/454 GS-FLX 454 (GSRunbrowser information) Evaluating run results Errors

More information

Rsubread package: high-performance read alignment, quantification and mutation discovery

Rsubread package: high-performance read alignment, quantification and mutation discovery Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For

More information

Genome Assembly Using de Bruijn Graphs. Biostatistics 666

Genome Assembly Using de Bruijn Graphs. Biostatistics 666 Genome Assembly Using de Bruijn Graphs Biostatistics 666 Previously: Reference Based Analyses Individual short reads are aligned to reference Genotypes generated by examining reads overlapping each position

More information

Goal: Learn how to use various tool to extract information from RNAseq reads.

Goal: Learn how to use various tool to extract information from RNAseq reads. ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2017 Class 4 RNAseq Goal: Learn how to use various tool to extract information from RNAseq reads. Input(s): Output(s): magnaporthe_oryzae_70-15_8_supercontigs.fasta

More information

Exeter Sequencing Service

Exeter Sequencing Service Exeter Sequencing Service A guide to your denovo RNA-seq results An overview Once your results are ready, you will receive an email with a password-protected link to them. Click the link to access your

More information

Intro to NGS Tutorial

Intro to NGS Tutorial Intro to NGS Tutorial Release 8.6.0 Golden Helix, Inc. October 31, 2016 Contents 1. Overview 2 2. Import Variants and Quality Fields 3 3. Quality Filters 10 Generate Alternate Read Ratio.........................................

More information

Tumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual

Tumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual Tumor-Specific NeoAntigen Detector (TSNAD) v2.0 User s Manual Zhan Zhou, Xingzheng Lyu and Jingcheng Wu Zhejiang University, CHINA March, 2016 USER'S MANUAL TABLE OF CONTENTS 1 GETTING STARTED... 1 1.1

More information

Rsubread package: high-performance read alignment, quantification and mutation discovery

Rsubread package: high-performance read alignment, quantification and mutation discovery Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For

More information

Isaac Enrichment v2.0 App

Isaac Enrichment v2.0 App Isaac Enrichment v2.0 App Introduction 3 Running Isaac Enrichment v2.0 5 Isaac Enrichment v2.0 Output 7 Isaac Enrichment v2.0 Methods 31 Technical Assistance ILLUMINA PROPRIETARY 15050960 Rev. C December

More information

Mapping. Reference. read

Mapping. Reference. read Mapping Reference read Assembly vs mapping contig1 contig2 reads bly as s em ll v sa all ma pp all ing vs r efe ren ce Reference What s the problem? Reads differ from the genome due to evolution and sequencing

More information

BaseSpace - MiSeq Reporter Software v2.4 Release Notes

BaseSpace - MiSeq Reporter Software v2.4 Release Notes Page 1 of 5 BaseSpace - MiSeq Reporter Software v2.4 Release Notes For MiSeq Systems Connected to BaseSpace June 2, 2014 Revision Date Description of Change A May 22, 2014 Initial Version Revision History

More information

Practical exercises Day 2. Variant Calling

Practical exercises Day 2. Variant Calling Practical exercises Day 2 Variant Calling Samtools mpileup Variant calling with samtools mpileup + bcftools Variant calling with HaplotypeCaller (GATK Best Practices) Genotype GVCFs Hard Filtering Variant

More information
