Cycle «Analysis of high-throughput sequencing data», Module 1/5: DNA analysis. Sophie Gallina, CNRS Evo-Eco-Paléo (EEP)



2 Module 1/5: DNA analysis
NGS introduction
Galaxy: upload data, datasets and histories
Reads quality control
Reads cleaning
Aligning reads on a reference
Galaxy workflow and best practices
March 2017

3 NGS introduction
Sequencers
Libraries, adapters
Multiplexing, barcodes
Reads: single-end, paired-end
Encoding quality with scores
Fastq format
Cleaning reads
Analysis protocols (alignment and assembly)

4 Sequencers: Illumina

5 Libraries
DNA fragment (its size depends on the library)
Source: Thierry Grange, 5ème Ecole de bioinformatique AVIESAN-IFB

6 Adapters
DNA fragment without adapters; DNA fragment with adapters
Source: Thierry Grange, 5ème Ecole de bioinformatique AVIESAN-IFB, ements/eba216 Roscoff_216_Grange_.pdf

7 Read = end of a DNA fragment
Single-end: only 1 end of the fragment is sequenced (Read 1).
Paired-end: both ends are sequenced (Read 1 and Read 2). A pair is characterised by the orientation of its reads and by the insert size.
Depending on the DNA fragment size, several situations may arise. [Diagram: positions of Read 1 and Read 2 on DNA fragments of different sizes]

8 Multiplexing, barcode (example: Illumina HiSeq2)
Adapters attached to the DNA fragment are sequenced with it.
Adapters are then removed: they are not present in the files provided by sequencing platforms.
When barcodes are used for multiplexing, reads are demultiplexed into different files.
[Diagram: DNA fragment without adapters; DNA fragment with adapters]

9 Sequencer's output: fastq file format
1 read = 4 lines: identifier, sequence, separator (+), quality.
Example with 4 reads (the identifier lines and the first three quality lines are missing from this copy):
CGCCCGGCCAATCATTGTGGTTTTAAGTCACTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCT
+
CTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCTCTGTGCACCCAGCATTGCCCAGAACAGGGC
+
CTCCCAGCTTCCAACAGACCCTGTCCCAGCTCCCTCCAAGCTGAGTGTTGGCCTGATACCTACCAGTGGAGCGAGGGGAACCCGAGGACTGCCAAGGGCA
+
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGCCCCCCTTTCCCCCCCGGGGGGGGGACAGGGGGGGTGTTCGGGCCCCGCGCCGCCCTTGACCACGG
+
EKLMPPPPPQQQQQQQQQQQQQQQK###########################################################################
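The 4-line layout described above can be read with a few lines of Python. This is a minimal sketch, not the way Galaxy or any sequencing platform parses fastq: it assumes unwrapped records (exactly 4 lines per read), and the names `FastqRecord`, `read_fastq` and the two toy reads are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per 4 lines: @identifier, sequence, +, quality."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed fastq record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Toy data (hypothetical reads, not taken from the tutorial files).
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nTTGCA\n+\nHHGH#\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.identifier, len(record.sequence))
```

Counting the records of such an iterator answers the "how many reads?" question for a file, and `len(record.sequence)` gives the read sizes.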

10 Paired-end fastq file format
Either 2 files, forward (1) and reverse (2), or 1 interleaved file in which the two reads of each pair follow one another.
[Example listings: reads from the two separate files, then an interleaved file in which each identifier, such as ILLUMINA_13:3:111:1249:1993, ILLUMINA_13:3:111:1463:1964 and ILLUMINA_13:3:111:1366:197, appears twice, once per read of the pair]

11 Reads quality
Errors occur when reading bases.
The error rate depends on the sequencing technology and increases with read size.
For each position in the read: one base (A, T, C or G) and one error probability.

12 Phred quality score (for a base)
The Phred quality score Q is logarithmically related to the base-calling error probability P: Q = -10 log10(P). (Error probabilities depend on the sequencing technology.)
Score coding: each numeric score is stored as 1 character (more compact in a file).
Example with the coding convention 38 = G, 39 = H: bases A T C A with scores 39 39 38 39 are written ATCA with quality string HHGH.
There is more than one coding convention for the score (for historical reasons).
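The relation between score, error probability and character can be checked with a short sketch. It assumes the Phred+33 convention (score + 33 = ASCII code), which is consistent with the slide's example 38 = G, 39 = H; the function names are hypothetical.

```python
import math
from typing import List


def error_probability(q: int) -> float:
    """P = 10^(-Q/10): Q = 30 means 1 error in 1000 base calls."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> int:
    """Q = -10 * log10(P), rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def encode_quality(scores: List[int], offset: int = 33) -> str:
    """Turn numeric scores into one character per base (Phred+33 by default)."""
    return "".join(chr(score + offset) for score in scores)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Turn a quality string back into numeric scores."""
    return [ord(char) - offset for char in quality]


print(encode_quality([39, 39, 38, 39]))  # quality string for bases A T C A
print(decode_quality("HHGH"))
print(error_probability(20), error_probability(30))
```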

13 Quality score coding
For historical reasons, there is more than one way to compute the score and more than one coding convention.
ASCII code: each keyboard symbol has a number (examples: ! = 33, A = 65).
Galaxy always uses Sanger coding, hence the conversion tool (groomer).
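Because several conventions coexist, the coding of a file sometimes has to be guessed from the characters it contains. The sketch below is a common heuristic, not what the Galaxy groomer does internally: it assumes only two candidate offsets (33 for Sanger / Illumina 1.8+, 64 for older Illumina and Solexa files), and the function name `guess_offset` is hypothetical.

```python
from typing import Iterable, Optional


def guess_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Guess the ASCII offset (33 or 64) from observed quality characters.

    Returns None when the observed characters are compatible with both.
    """
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        return None
    if min(codes) < 59:   # below ';', impossible with a +64 coding
        return 33
    if max(codes) > 74:   # above 'J', beyond the usual Phred+33 range
        return 64
    return None           # ambiguous: inspect more reads


print(guess_offset(["HHGH#"]))       # '#' (ASCII 35) points to Phred+33
print(guess_offset(["ggfeggggg"]))   # lowercase letters point to +64
```

Scanning more reads reduces the chance of an ambiguous answer, since low-quality characters usually show up somewhere in a real file.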

14 Example of score interpretation using Sanger coding (Galaxy)
Bad: 0-19; correct: 20-29; good: 29-40
S - Sanger, Phred+33
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

15 Interpreting quality
Same 4 reads as in the fastq example of slide 9; only the last one has its quality line in this copy:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGCCCCCCTTTCCCCCCCGGGGGGGGGACAGGGGGGGTGTTCGGGCCCCGCGCCGCCCTTGACCACGG
+
EKLMPPPPPQQQQQQQQQQQQQQQK###########################################################################
With Sanger coding (Phred+33), # is ASCII 35, that is a score of 2: every base after the first 25 of this read has a bad quality.

16 Read cleaning: example
Before cleaning: the 4 reads of slide 15.
After cleaning: reads 1 and 2 are unchanged, read 3 is trimmed to 45 bases, read 4 is removed.
Read 3 after trimming:
CTCCCAGCTTCCAACAGACCCTGTCCCAGCTCCCTCCAAGCTGAG
+
D?KMPQEPGCPQQNPQIQIGR@DPERQHEKBED=HCHG8EHFDCD

17 Quality control examples

18 Reads cleaning
Cut adapters at read ends
Trimming: cut read ends (5' or 3'), using
  a fixed number of bases
  individual base quality
  mean quality of bases in a sliding window
Filtering: remove the whole read, using
  a size criterion (example: < 6bp)
  a criterion on the mean quality of all bases (example: < 25)
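The trimming and filtering criteria listed above can be sketched on a single read. This is a simplified illustration, not Trimmomatic's implementation: the window size, the thresholds and the function names are assumptions chosen for the example.

```python
from typing import List, Tuple


def trim_sliding_window(seq: str, quals: List[int],
                        window: int = 4, min_mean: float = 20.0) -> Tuple[str, List[int]]:
    """Cut the read at the start of the first window whose mean quality is too low."""
    if not quals:
        return seq, quals
    last_start = max(len(quals) - window + 1, 1)
    for start in range(last_start):
        scores = quals[start:start + window]
        if sum(scores) / len(scores) < min_mean:
            return seq[:start], quals[:start]
    return seq, quals


def keep_read(quals: List[int], min_length: int = 20, min_mean: float = 25.0) -> bool:
    """Filtering: keep a read only if it is long enough and good enough on average."""
    return bool(quals) and len(quals) >= min_length and sum(quals) / len(quals) >= min_mean


# Toy read (hypothetical): good start, bad 3' end.
sequence = "ACGTACGTAC"
scores = [30, 30, 30, 30, 30, 30, 5, 5, 5, 5]
trimmed_seq, trimmed_scores = trim_sliding_window(sequence, scores)
print(trimmed_seq, keep_read(trimmed_scores, min_length=3))
```

Trimming runs first and filtering second: a read that was acceptable before trimming may become too short afterwards and be removed.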

19 Reads cleaning example: protocol for de-novo transcriptome assembly
Clean adapters
Trimming 5' and 3' on base quality (> 3)
Trimming using a sliding window (4 bases, Q < 2)
Filtering on mean read quality (Q < 25)
Filtering on read size (size < 2)
Source: Erwan Core, 5ème Ecole de bioinformatique AVIESAN-IFB

20 Protocol for variant analysis
1. Reads (fastq): quality control
2. Reads cleaning: filtering adapters, filtering and trimming on base quality
3. Cleaned reads (fastq): quality control
4. Alignment of the reads on the genome (fasta)
5. Alignment (sam/bam): quality control
6. Alignment cleaning: filtering on alignment quality, marking duplicated reads, local realignment
7. Cleaned alignment (sam/bam): metrics, coverage statistics
8. Variant detection: variants (vcf)
9. Variants cleaning and annotation (Module 4 of the NGS cycle)

21 Protocol for de-novo assembly
1. Reads (fastq): quality control
2. Reads cleaning: filtering adapters, quality filtering and trimming
3. Cleaned reads (fastq): quality control
4. Assembly: contigs, scaffolds, metrics
5. Assembly cleaning: contigs, scaffolds, metrics

22 Quality scores
[Diagram: a score exists at each level of the analysis. One score per base of a read (A, C, T, A...), a mean score for the read, an alignment score of the read on the reference, a score for variant calling (A/T), and a score for the genotyping of each sample (Sample1: A/T, Sample2: NA)]

23 Quality scores
Quality = probability that no mistake was made in base calling, read alignment, variant calling or sample genotyping.
Scores depend on the algorithm or on the protocol:
  Protocol: Illumina protocol ≠ PacBio protocol; Illumina protocol v1 ≠ Illumina protocol v2 (instruments evolve); the probability depends on instrument bias.
  Computation type: base-calling algorithm ≠ alignment algorithm ≠ variant-detection algorithm.
  Different tools exist for the same operation: BWA ≠ Bowtie2 for alignment.

24 Galaxy
Connection
Upload data
Working with datasets and histories
Adding a local reference
Converting to fastqsanger format

25 Data for this tutorial
Human genome data from the HapMap project
Reference: small region of chromosome 2, 2:38-53 (assembly GRCh37), file GRCh37_region1.fasta
Reads: Illumina paired-end (2x1bp) for 3 samples (HG96, HG11 and HG13), files HGXXX_1.fastq and HGXXX_2.fastq (only the reads of this small region, for reasons of speed)
Download the files from the bilille wiki.
Main goals for this first part of the tutorial:
  Upload the reference and the reads for one sample (HG11)
  Work with histories, datasets and tools

26 Connect: Galaxy v1
1. Enter the IP number (simple IP, or IP + session)
2. Click on the Galaxy icon
3. Menu User / login
4. Username: user@galaxy.ifb.fr, Password: ifbuser
5. Menu User => check that you are connected

27 Connect: Galaxy v2
1. Enter the IP number
2. Menu User / login
3. Username: bilillen, Password: bilillen
4. Menu User => check that you are connected

28 History: «folder» containing a set of data
Default name = «Unnamed history»
1. Rename the history => TP1
2. Explore the history menu
3. Create a new history

29 List histories, go back to TP1
1. List all histories
2. Go back to the TP1 history

30 Dataset ~ «data file»: upload the reference in a dataset
1. Tools: Get Data / Upload File
2. Choose file GRCh37_regions1.fasta
3. Choose fasta format (not csfasta!)
4. Keep «Unspecified» as genome
5. Run with Start
6. Close

31 Dataset: summary, attributes, full data
1. Click on the dataset name to show a summary of attributes and data
2. Click on the eye to show the data
3. Click on the pencil to show the attributes

32 Add a local reference (TP_ref)
1. Menu User / Custom Builds
2. Choose name TP_ref
3. Choose fasta format
4. Choose dataset no. 1: GRCh37_regions1.fasta
5. Submit
The reference is now available.


33 Check / change the database attribute
1. Menu Analyze Data
2. Click on the dataset name to see the summary => the database attribute is «?»
3. Click on the pencil to change attributes
4. Choose the TP_ref database
5. Save
6. Check that the database attribute is now «TP_ref»

34 Upload reads (fastq) for sample HG11
1. Tools: Get Data / Upload File
2. Choose local files HG11_1.fastq and HG11_2.fastq
3. Choose «fastq» format
4. Choose «TP_ref» genome
5. Run with Start
6. Close
7. Check the attributes

35 Look at a fastq file
1 read = identifier, sequence, quality
What is the file size? How many reads? What are the read sizes? Which quality coding is used?
Galaxy always uses Sanger coding, hence the conversion tool: groomer.


36 Convert to Sanger format: groomer tool
1. Tools: FASTQ Groomer
2. Choose to «groom» many files
3. Choose files HG11_1.fastq and HG11_2.fastq
4. Choose «Sanger & Illumina 1.8+» format
5. Execute: this creates 2 new datasets (no. 4 and 5)
6. Check the attributes of the new datasets
What are the sizes of the new datasets? How many reads? Which quality coding is used? What are the names of the new datasets?

37 Rename datasets
For each new dataset:
1. Click on the pencil to change attributes
2. Change the name (example: HG11_OK_1.fastq)
3. Save
4. Check the new dataset names
After changing a dataset name, how can we retrieve the dataset origin?

38 Retrieve dataset origin
This dataset results from the groomer tool, applied to dataset 2 (HG11_1.fastq).

39 Reads quality control
Per base quality
Per read mean quality
Read size
Adapters
Duplicated reads


40 Reads quality control (FastQC)
Andrews, S. FastQC: A Quality Control tool for High Throughput Sequence Data.
1, 2. Choose tool: FastQC
3. Choose datasets no. 4 and 5
4. Execute
This creates 4 new datasets: for each fastq file, 1 «raw data» and 1 «Webpage».

41 Manage FastQC result datasets
1. Look quickly at the dataset content (we will look at it in depth later)
2. Remove the «RawData» datasets
3. Rename the «Webpage» datasets HG11_1.QC and HG11_2.QC

42 FastQC: Summary & Basic Statistics

43 FastQC: Per base sequence quality
[Plot legend: median, mean; zones of good quality, mean quality and bad quality]

44 FastQC: Per base sequence quality
Example OK, example KO. Source: bad_sequence_fastqc.html


45 FastQC: Per sequence quality score
Example OK, example KO. Source: bad_sequence_fastqc.html

46 FastQC: Sequence Length Distribution & Per sequence GC content

47 FastQC: Per base sequence content
Example OK, example KO. Source: bad_sequence_fastqc.html

48 FastQC: Per base N content
Example OK, example KO. Source: bad_sequence_fastqc.html


49 FastQC: Overrepresented sequences
Example OK, example KO. Source: bad_sequence_fastqc.html

50 FastQC: Sequence Duplication Levels
Example OK, example KO. Source: bad_sequence_fastqc.html

51 FastQC: Adapter Content
Example OK, example KO. Source: bad_sequence_fastqc.html


52 FastQC: Kmer Content
Example OK, example KO. Source: bad_sequence_fastqc.html

53 FastQC: example with PacBio
Source: pacbio_srr7514_fastqc.html

54 Cleaning reads
Filtering adapters
Filtering & trimming reads
Comparing quality before and after cleaning

55 Filtering & trimming
Filtering = remove whole reads, based on quality or size criteria
Trimming = remove read ends: a fixed number of bases, or the bases below a quality threshold

56 Trimming
Cut bad quality bases at the end of reads
Example OK, example KO. Source: bad_sequence_fastqc.html


57 Filtering
Remove reads with bad mean quality
Example OK, example KO. Source: bad_sequence_fastqc.html

58 Reads cleaning (Trimmomatic) 1/2
Bolger, A. M., Lohse, M. and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30 (15), pp. 2114-2120.
1. Choose files
2. Parameters for adapters

59 Reads cleaning (Trimmomatic) 2/2
Add operations (cleaning steps):
1. LEADING: cut bad quality 5' bases
2. TRAILING: cut bad quality 3' bases
3. SLIDINGWINDOW: cut bases with bad mean quality in a sliding window
4. AVGQUAL: remove reads with bad mean quality
5. MINLEN: remove small size reads


60 Trimmomatic: results
4 output datasets:
  Reads1 after cleaning
  Reads2 after cleaning
  Unpaired reads1 (the corresponding reads2 were removed during cleaning)
  Unpaired reads2 (the corresponding reads1 were removed during cleaning)
How many paired reads remain after cleaning? Is there any trace or log of what happened during cleaning?
From the summary, click on the «i» icon and look at the content of «stdout»: it contains all the messages sent by the tool during execution.
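The four outputs follow from a simple set logic on the read identifiers that survive cleaning in each file. The sketch below only illustrates that logic, under the assumption that both reads of a pair share one identifier; `split_pairs` and the identifiers are hypothetical, and this is not Trimmomatic's code.

```python
from typing import Dict, Set


def split_pairs(surviving_r1: Set[str], surviving_r2: Set[str]) -> Dict[str, Set[str]]:
    """Sort surviving read identifiers into paired and unpaired outputs."""
    return {
        "paired": surviving_r1 & surviving_r2,        # both reads kept
        "unpaired_r1": surviving_r1 - surviving_r2,   # reads2 mate was removed
        "unpaired_r2": surviving_r2 - surviving_r1,   # reads1 mate was removed
    }


# Hypothetical identifiers: pair "b" lost its read 2, pair "c" lost its read 1.
result = split_pairs({"a", "b", "d"}, {"a", "c", "d"})
print(sorted(result["paired"]), sorted(result["unpaired_r1"]), sorted(result["unpaired_r2"]))
```

Keeping the paired outputs synchronised matters because aligners expect read N of the forward file to be the mate of read N of the reverse file.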


61 Trimmomatic: 2nd try
Run this tool again after changing the AVGQUAL parameter value:
1. Use one of the datasets produced by the previous analysis
2. Click on the «Run this job again» icon: all parameters are preset with the values used in the previous execution
3. Change only the parameter in step no. 4: AVGQUAL value 25
How many paired reads remain after the new cleaning?
Work only with the paired datasets produced by the first cleaning (remove the second try).
Rename the datasets HG11_1_clean.fastq and HG11_2_clean.fastq.
Run quality control on these 2 datasets.
Compare quality control before and after cleaning.

62 Compare quality control before / after cleaning
Solution 1: open a second web browser window, connect to access the history and datasets, then visualize datasets HG11_OK_1.QC and HG11_clean_1.QC.

63 Compare quality control before / after cleaning
Solution 2: use the Galaxy «Scratchbook» to manage Galaxy windows
1. «Enable Scratchbook»
2. Visualize dataset HG11_clean_1.fastq
3. Visualize dataset HG11_OK_1.fastq

64 Quality control before & after cleaning
HG11_OK_1.QC, HG11_clean_1.QC

65 Quality control before & after cleaning
HG11_OK_1.QC, HG11_clean_1.QC

66 Reads alignment
SAM/BAM format
SAM results for single-end and paired-end reads
BWA & Bowtie2 alignment tools
Manage duplicated reads (Picard / MarkDuplicates)
Count alignments (samtools / flagstat)
Compute depth and coverage (deepTools / plotCoverage)

67 SAM - Sequence Alignment/Map
An alignment and its coding in SAM format (example from the SAM specification):
@SQ SN:ref LN:45
r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1

68 SAM file format
Header, @SQ line: reference sequence. SN:ref is the sequence name, LN:45 the sequence size.
Then 1 line per read, 12 columns (same example alignment as on slide 67).
Mapping quality (MAPQ) = -10 log10(p), where p is the error probability on the position. Mapping tools estimate it from mismatches, insertions, deletions and multiple alignments.

69 SAM - Flags
Same example alignment as on slide 67. The FLAG of the first line (read r1) is 99 = 1 + 2 + 32 + 64:
  1: template having multiple segments in sequencing
  2: each segment properly aligned according to the aligner
  32: SEQ of the next segment in the template being reverse complemented
  64: the first segment in the template
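The decomposition of a FLAG into its bits can be sketched in Python. The bit values are those of the SAM specification and the labels are the short names used in the flag tables of this course; the function name `decode_flag` is hypothetical.

```python
from typing import List

# Bit value -> meaning, in increasing bit order.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag: int) -> List[str]:
    """Return the meaning of every bit set in a SAM FLAG value."""
    return [meaning for bit, meaning in SAM_FLAGS.items() if flag & bit]


for value in (99, 147, 4):
    print(value, decode_flag(value))
```

Flags 99 and 147 are the usual values for the two reads of a properly aligned pair: they differ only in which read is reversed and which is first or second in the pair.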

70 SAM - Flags
Same example alignment as on slide 67: header line @SQ SN:ref LN:45, then reads r1 (FLAG 99), r2, r3, r4, r3 (FLAG 264) and r1 (FLAG 147), with CIGAR strings 8M2I4M1D3M, 3S6M1P1I4M, 5S6M, 6M14N5M, 6H5M and 9M.
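The CIGAR strings of the example (8M2I4M1D3M, 3S6M1P1I4M, ...) describe how each read lines up with the reference. The sketch below parses them and computes how many reference and read bases they cover, using the operation semantics of the SAM specification; the function names are hypothetical.

```python
import re
from typing import List, Tuple

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")   # operations that advance along the reference
CONSUMES_READ = set("MIS=X")        # operations that advance along the read sequence


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar: str) -> int:
    """Number of bases expected in the SEQ column."""
    return sum(length for length, op in parse_cigar(cigar) if op in CONSUMES_READ)


print(parse_cigar("8M2I4M1D3M"))
print(reference_length("8M2I4M1D3M"), read_length("8M2I4M1D3M"))
```

For the first read of the example, the read length computed from the CIGAR (17) matches the length of its sequence TTAGATAAAGGATACTG.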

71 BAM: binary SAM
Same data as in SAM, in a more compact «binary» format: smaller files, faster processing by computers.
BAI: index for a BAM file. It speeds up searching and retrieving data in a BAM file.

72 SAM: examples for single-end reads
Test reads:
  S1 perfect read, good quality
  S2 perfect read, bad quality
  S3 perfect read, good quality except on the last 2 bases
  S4 perfect read, good quality except on the last 6 bases
  S5 insertion of 5 bases at position 5, good quality
  S6 deletion of 5 bases at position 5, good quality
  S7 5 substitutions spread over the read, good quality
  S8 5 substitutions at the start of the read, good quality
  S9 5 substitutions in the middle of the read, good quality
  S10 5 substitutions at the start, bad quality
  S11 perfect read, good quality, duplication 1
  S12 perfect read, good quality, duplication 2
  S13 perfect read, good quality
  S14 perfect read, good quality, shifted by 1bp
  S15 unaligned read, good quality
  S16 read aligned 2 times, good quality
  S17 read aligned 1 time, good quality
SAM flags: read paired; read mapped in proper pair; read unmapped; mate unmapped; read reverse strand; mate reverse strand; first in pair; second in pair; not primary alignment; read fails platform/vendor quality checks; read is PCR or optical duplicate; supplementary alignment.
[Two tables, one for BWA and one for Bowtie2, with columns QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN. Most values are not recoverable from this copy. Recoverable CIGAR strings: S5 is 48M5I47M with both tools; S8 and S10 are 5S95M with BWA and 4M1I95M with Bowtie2; S14 is 99M with both tools. The unaligned read S15 has FLAG 4.]

73 SAM: examples for single-end reads
Same tables as on slide 72, annotated to point out the duplicated reads, the unaligned read (FLAG 4), the reads with multiple alignments, and the different mapping quality scales of the two tools (BWA: maximum score 60; Bowtie2: scores in [0-42]).

74 SAM: examples for paired-end reads
7 test pairs (P1 to P7, forward read _F and reverse read _R) aligned with BWA and with Bowtie2 on two reference sequences, C1 and C2.
FLAG values (forward / reverse):
  Pair   BWA        Bowtie2
  P1     97 / 145   99 / 147
  P2     97 / 145   97 / 145
  P3     73 / 133   73 / 133
  P4     97 / 145   97 / 145
  P5     97 / 145   99 / 147
  P6     97 / 145   99 / 147
  P7     65 / 129   65 / 129
[The RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT and TLEN columns and the grid of SAM flag bits are not recoverable from this copy.]

75 SAM: examples for paired-end reads
Same tables as on slide 74, annotated with the differences between the two tools:
  Difference on flags: for P1, P5 and P6, Bowtie2 reports 99 / 147 where BWA reports 97 / 145; the difference of 2 is the «read mapped in proper pair» bit.
  Difference on positions when sequences are duplicated.


76 Alignment with BWA
Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25 (14), pp. 1754-1760.
Li, H. and Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26 (5), pp. 589-595.

77 Alignment with Bowtie2
Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10 (3), R25.
Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9 (4), pp. 357-359.


78 Cleaning duplicated reads
Source: GATK, Marking duplicates

79 Picard / MarkDuplicates
Additional information about Picard tools is available from the Picard web site.
Metrics reported for the BWA and Bowtie2 alignments:
  UNPAIRED_READS_EXAMINED
  READ_PAIRS_EXAMINED
  SECONDARY_OR_SUPPLEMENTARY_RDS (3 for BWA)
  UNMAPPED_READS
  UNPAIRED_READ_DUPLICATES
  READ_PAIR_DUPLICATES (12 for BWA, 12 for Bowtie2)
  READ_PAIR_OPTICAL_DUPLICATES
  PERCENT_DUPLICATION
  ESTIMATED_LIBRARY_SIZE
[The other values of the table are not recoverable from this copy.]
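The way PERCENT_DUPLICATION relates to the other metrics can be sketched as follows. This is an assumption based on how the Picard documentation describes the metric (duplicated reads over examined reads, each pair counting for 2 reads); the function name and the numbers in the example are hypothetical, not the values of this tutorial.

```python
def percent_duplication(unpaired_reads_examined: int, read_pairs_examined: int,
                        unpaired_read_duplicates: int, read_pair_duplicates: int) -> float:
    """Fraction of examined reads marked as duplicates (a pair counts as 2 reads)."""
    examined = unpaired_reads_examined + 2 * read_pairs_examined
    if examined == 0:
        return 0.0
    duplicates = unpaired_read_duplicates + 2 * read_pair_duplicates
    return duplicates / examined


# Hypothetical counts: 100 unpaired reads and 450 pairs examined.
print(percent_duplication(100, 450, 10, 45))
```

Despite its name, the metric is a fraction between 0 and 1, not a percentage.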

80 Alignment count: samtools flagstat
Values recoverable from this copy (QC-passed reads):
  Category                               BWA      Bowtie2
  in total                               8129     8126
  supplementary                          3        -
  duplicates                             24       24
  mapped                                 99.77%   99.73%
  properly paired                        98.2%    99.36%
  singletons                             19       18
[The other flagstat lines (secondary, paired in sequencing, read1, read2, with itself and mate mapped, with mate mapped to a different chr, with mate mapped to a different chr (mapq>=5)) have lost their counts.]

81 Coverage and depth of coverage
Source: Élodie Girard, 5ème Ecole de bioinformatique AVIESAN-IFB
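The two notions can be illustrated on a toy reference: depth is the number of reads covering a position, and coverage (breadth) is the fraction of positions covered by at least one read. This is a minimal sketch with hypothetical names, using 0-based, end-exclusive alignment intervals; it is not how deepTools computes its plots.

```python
from typing import Dict, List, Tuple


def depth_per_position(reference_length: int,
                       alignments: List[Tuple[int, int]]) -> List[int]:
    """Count, for each reference position, how many alignments cover it."""
    depth = [0] * reference_length
    for start, end in alignments:  # 0-based start, end excluded
        for position in range(max(start, 0), min(end, reference_length)):
            depth[position] += 1
    return depth


def coverage_summary(depth: List[int]) -> Dict[str, float]:
    """Breadth = fraction of positions with depth > 0; mean depth over all positions."""
    covered = sum(1 for value in depth if value > 0)
    return {"breadth": covered / len(depth), "mean_depth": sum(depth) / len(depth)}


# Toy example: a 10 bp reference and two overlapping reads.
depth = depth_per_position(10, [(0, 5), (3, 8)])
summary = coverage_summary(depth)
print(depth, summary)
```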


82 Computing coverage and depth of coverage: deepTools2 / plotCoverage
Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, A. S., Heyne, S., Dündar, F. and Manke, T. (2016). deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Research, 44 (W1), pp. W160-W165.

83 deepTools / plotCoverage

84 Galaxy workflow
Extract a workflow from a history
Modify the workflow
Execute the workflow on new data
Compare results from 2 workflows (in 2 histories)

85 Extract workflow
Extract the workflow from the history of steps applied to the first sample.

86 Visualize workflow

87 Modify workflow visualisation


88 Modify the configuration of some steps
This workflow uses 3 input files. Change the name of each input box to describe which data it requires, e.g. Reference, Forward fastq, Reverse fastq.
You can also change any parameter, for example in the Trimmomatic step.


89 Enable a parameter to be set at run time
By default, the parameters of each tool keep the values predefined in the workflow. You can change this so that any parameter can be set at run time.
Modify Trimmomatic so that the adapters are set at run time.
Don't forget to save your workflow!

90 Import new data for sample HG13
Import files HG13_1.fastq and HG13_2.fastq.

91 Analyze these new data with the same workflow
Run the workflow with these new data.

92 Browse results

93 Gather BWA alignment results for the 2 samples 1/2
Create a new history named results.
From history TP1: copy the HG11_BWA_MD.bam dataset.

94 Gather BWA alignment results for the 2 samples 2/2
From history HG13: copy the dataset «MarkDuplicates on data 13: MarkDuplicates BAM output»...

95 Visualize depth of coverage for both samples
Rename the datasets
Run plotCoverage

96 Galaxy best practices
Manage disk space
Export analysis results (datasets and histories)
Export / import analysis protocols (workflows)

97 Manage disk space
Global disk space
Disk space per dataset
Disk space per history

98 Export analysis results: datasets
Image, text, BAM, HTML + files

99 Export analysis results: histories

100 Export / import analysis protocols: workflow
Export, import

101 Galaxy - Visualisation
Create a visualisation
Add datasets
Change visualisation parameters

102 Create a new visualisation
1. Menu Visualisation / New Track Browser
2. Choose name TP2_vizu
3. Choose reference TP2_ref
4. Create, then Save

103 Visualize BAM files 1/3
1. Add datasets
2. Choose BAM files
3. Add

104 Visualize BAM files 2/3
Zoom and move

105 Visualize BAM files 3/3
1. Set Display Mode
2. Coverage
