Cycle «High-throughput sequencing data analysis»
Module 1/5: DNA analysis
Sophie Gallina, CNRS Evo-Eco-Paléo (EEP) (sophie.gallina@univ-lille1.fr)

Module 1/5: DNA analysis (March 2017)
  NGS introduction
  Galaxy: upload data, datasets & histories
  Reads quality control
  Reads cleaning
  Aligning reads on a reference
  Galaxy workflow & best practices

NGS introduction
  Sequencers
  Libraries, adapters
  Multiplexing, barcode
  Reads, single-end, paired-end
  Encoding quality with scores
  Fastq format
  Cleaning reads
  Analysis protocols (alignment and assembly)

Sequencers: Illumina
Source: http://www.illumina.com/systems/sequencing.html

Libraries
DNA fragment (size depends on libraries)
Source: Thierry Grange, 5ème Ecole de bioinformatique AVIESAN-IFB 2016 https://www.francebioinformatique.fr/fr/evenements/eba216

Adapters
DNA fragment without adapters
DNA fragment with adapters
Source: Thierry Grange, 5ème Ecole de bioinformatique AVIESAN-IFB 2016
https://www.france-bioinformatique.fr/fr/evenements/eba216
https://www.francebioinformatique.fr/sites/default/files/i1_ngs_Roscoff_216_Grange_.pdf

Read = DNA fragment end
Single-end: sequencing only 1 end of the DNA fragment (Read 1)
Paired-end: sequencing both ends of the DNA fragment (Read 1 and Read 2)
  Reads orientation
  Insert size
Based on DNA fragment size, these situations may arise:
[Figure: relative positions of Read 1 and Read 2 on DNA fragments of different sizes]

Multiplexing, barcode (e.g. Illumina HiSeq2)
Adapters attached to the DNA fragment are sequenced.
Adapters are then removed: they are not present in the files provided by sequencing platforms.
When adapters are used for multiplexing, sequences are demultiplexed into different files.
[Figure: DNA fragment without adapters; DNA fragment with adapters]

Sequencer's output: fastq file format
1 read = identifier, sequence, quality

@SRR62641.6751359
CGCCCGGCCAATCATTGTGGTTTTAAGTCACTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCT
+
CBLNPGJQQQJPPQPPQPQRGPPPPRRQQRPSPGRQQQRLRRRMEPQQPMJHQQEHKMMFIIRH?SIIHKNJIKRLJJIKHEABHIFGCGGEFCGDGDCE
@SRR62634.16249693
CTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCTCTGTGCACCCAGCATTGCCCAGAACAGGGC
+
ALKMOOOOPPQJQOPPPPPQPPPPPPRJQRQQQQQRPQPRQQPFQSQQPRLIMHKSNRJQORMFELRPQNQRQJQRRPQQLIRKDMKQJPN8CFDGCCCB
@SRR62634.26465
CTCCCAGCTTCCAACAGACCCTGTCCCAGCTCCCTCCAAGCTGAGTGTTGGCCTGATACCTACCAGTGGAGCGAGGGGAACCCGAGGACTGCCAAGGGCA
+
D?KMPQEPGCPQQNPQIQIGR@DPERQHEKBED=HCHG8EHFDCD6<329@<:69A<6,;<967>;=C:>AA8BBED#######################
@SRR62635.15516129
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGCCCCCCTTTCCCCCCCGGGGGGGGGACAGGGGGGGTGTTCGGGCCCCGCGCCGCCCTTGACCACGG
+
EKLMPPPPPQQQQQQQQQQQQQQQK###########################################################################
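The four-line layout of a fastq record (identifier, sequence, "+" separator, quality) can be read with a few lines of Python. This is a minimal sketch, not a replacement for a dedicated parser: the function name `read_fastq` and the two toy records are assumptions made for illustration, and the sketch assumes that each record occupies exactly four lines.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each 4-line fastq record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed fastq record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Toy data (hypothetical reads, not taken from the tutorial files).
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGCA\n+\nHHGH#\n")
records = list(read_fastq(example))
for identifier, sequence, quality in records:
    print(identifier, len(sequence), quality)
```

The length check reflects the rule that every base has exactly one quality character.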

Paired-end fastq file format
Either 2 files, forward (1) and reverse (2), or 1 interleaved paired file

File 1 (forward):
@ERR229776.184
CTAGGAAGCGTAGTCCTGGGGTCATCTCTCCTATTAATACTGTTGGGGAATGTTTAGTA
+
BAEEAGEED96EHFE@BF><>EAAC;EBH<K<6:HJGFFHBC>DDIKG4AIHFFD@/=
@ERR229776.12365
CATTATTTCATAGTAGCCAAAAAGTGGAAACAGTCAAAATATCCGTCAGTGAATTGACC
+
1./.,/&((&3=;B@F86C>@51(3:).6GG 68C:CG)#B4/=HDJ6;79)<@C/
@ERR229776.114918
TATTTCTGGAATTTTCCATTTAATATTTTCAGACTGCAGTTGACTGCGGGTAACTGAAA
+
CEEEEEFEDAEGGGFDHGFFHGIHHHIIIIGKHBKJJIGHFHKILJKLEJLJJIFJMJK

File 2 (reverse):
@ERR229776.184
TTCTGGTCAGTAAGACCTCAAAAGGTTAAATACTAGCGATTTACACACCTTAAATGATT
+
CFIEEG@FFFGKFJHJ>HHKLLJIIJILLJIILJHKAKJKKJJJJJJLMKJMKJJJJKJ
@ERR229776.12365
CCTAAAATGGTGTGTTTTCGTATATTCACAATGCTGTGGAACCATCACCACTATCTGAT
+
4B@EDFF=(/CHBHEHCE6@ED8E@@I6HJB6E:6%@C46FFIBGCIGKD,DN=CBBE@
@ERR229776.114918
TCTTTCTTTTGTTTTTTTTTCTGAGATGTCTTTTGTTTTTGTTCTGAGGTCTTGTTATG
+
CFIGGGKHHHFHHFIJIIIJKLIIHJIIIKLJKKIJKLLKJFJJMHJJLFJMJIKKJJJ

Interleaved paired file:
@SRR531199.1 ILLUMINA_13:3:111:1249:1993 length=11
TTTTCAGAGTAGTTGGTACCCAATTGGAAGATGTGACCCACTTCGATACCGCGCTTGAG
+
dffffffffdffeffdadffffeeefdeffeffefffffffffddeefeydfefefe[e
@SRR531199.1 ILLUMINA_13:3:111:1249:1993 length=99
ANNNNNNCTTCGGTATNAACTGGGGNNNNGATGTTGAACTGGGTAAAGTCGAAGATCTG
+
BBBBBBSZTUVWO]YB_[cbabbWBBBBSVVUUgggadcdedbedcddfffdegeggef
@SRR531199.2 ILLUMINA_13:3:111:1463:1964 length=11
NTGAGTAGCTCAATGCGCTGACGCCAATAGCTATACCAACGACTGGCCAGATTATGTTT
+
BXSSRU[X[Wcc_cccccccccccc_cccccccccc_cccccccccccccccccccccc
@SRR531199.2 ILLUMINA_13:3:111:1463:1964 length=99
AAGTGACCCATCGCGATAAAGTGCTGCGCAGTAAANAGCANCTGTTNGATGCTGGCTTA
+
ggggggggggggggggggfgfggggggggggggg^bbbbabbbaz]bz[ccccfggggg
@SRR531199.3 ILLUMINA_13:3:111:1366:197 length=11
NAAGTCGCGGCGACCCCTATCGTGGCTTTCGGCGTACGCCATTTCAATGCGGCCGCCGC
+
B[[X[YY[YVcc_cccc_cc [[[V[^^^^^V[[]SXWUX[\\]]Z^^^B
@SRR531199.3 ILLUMINA_13:3:111:1366:197 length=99
TGGTCAATACAAGCCGCAATACCTGCATCATGCGGNGGAANAATTTGCGCGCCGTTTTC
+
ggfegggggggdeggggfgcgggagggggggega^bb`^]b[y[[[zffffh_afeefe

Reads quality
Errors occur when reading bases
  Depends on the sequencing technology
  Error rate increases with read size
For each position in the read:
  One base (A, T, C, G)
  One error probability

Phred quality score (for a base)
The Phred quality score Q is logarithmically related to the base-calling error probability P: Q = -10 log10(P).
(Error probabilities depend on the sequencing technology.)
Source: https://fr.wikipedia.org/wiki/score_de_qualité_phred
Score coding: number [0-60] => 1 character (more compact in a file)
Example: bases A T C A with scores 39 39 38 39; with the coding convention 38=G, 39=H, the read ATCA gets the quality string HHGH
There is more than one coding convention for the score (for historical reasons)
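The relation between a Phred score and an error probability, and the coding of a score as one character, can be checked numerically. The sketch below uses the standard definition Q = -10 log10(P) and the Sanger convention (character code = score + 33), which matches the 38=G, 39=H example; the function names are assumptions chosen for this illustration.

```python
import math

SANGER_OFFSET = 33  # Sanger convention: character code = score + 33


def phred_from_probability(p: float) -> float:
    """Phred score Q for a base-calling error probability p."""
    return -10.0 * math.log10(p)


def probability_from_phred(q: float) -> float:
    """Error probability for a Phred score q."""
    return 10.0 ** (-q / 10.0)


def encode_sanger(scores):
    """Encode a list of integer scores as a Sanger quality string."""
    return "".join(chr(q + SANGER_OFFSET) for q in scores)


print(round(phred_from_probability(0.001)))  # 30: one error in 1000 bases
print(probability_from_phred(20))            # about one error in 100 bases
print(encode_sanger([39, 39, 38, 39]))       # HHGH
```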

Quality score coding
For historical reasons:
  More than one way to compute the score
  More than one coding convention
ASCII code: each keyboard symbol has a number (examples: ! = 33, A = 65)
Source: https://fr.wikipedia.org/wiki/fastq
Galaxy always uses Sanger coding => conversion tool (groomer)

Example of score interpretation using Sanger coding (Galaxy)
Bad: 0-19; correct: 20-29; good: 29-40
S - Sanger, Phred+33: characters "!" (score 0) to "I" (score 40)
[Figure: printable ASCII characters from "!" to "~" above a score scale; the row of "S" marks the range used by Sanger coding]
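Reading a quality string back into scores is the reverse operation: subtract 33 from each character code. The sketch below also labels each score as bad, correct or good; the cut-offs used (below 20, below 30, 30 and above) are an assumption that approximates the classes of the slide, and the function names are hypothetical.

```python
from typing import List


def decode_sanger(quality: str) -> List[int]:
    """Convert a Sanger (Phred+33) quality string into integer scores."""
    return [ord(character) - 33 for character in quality]


def classify(score: int) -> str:
    """Label a score; the thresholds 20 and 30 are illustrative assumptions."""
    if score < 20:
        return "bad"
    if score < 30:
        return "correct"
    return "good"


scores = decode_sanger("HHG5#")
labels = [classify(score) for score in scores]
for character, score, label in zip("HHG5#", scores, labels):
    print(character, score, label)
```

The "#" character, which fills the end of some reads in the fastq examples, decodes to a score of 2.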

Interpreting quality score

@SRR62641.6751359
CGCCCGGCCAATCATTGTGGTTTTAAGTCACTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCT
+
CBLNPGJQQQJPPQPPQPQRGPPPPRRQQRPSPGRQQQRLRRRMEPQQPMJHQQEHKMMFIIRH?SIIHKNJIKRLJJIKHEABHIFGCGGEFCGDGDCE
@SRR62634.16249693
CTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCTCTGTGCACCCAGCATTGCCCAGAACAGGGC
+
ALKMOOOOPPQJQOPPPPPQPPPPPPRJQRQQQQQRPQPRQQPFQSQQPRLIMHKSNRJQORMFELRPQNQRQJQRRPQQLIRKDMKQJPN8CFDGCCCB
@SRR62634.26465
CTCCCAGCTTCCAACAGACCCTGTCCCAGCTCCCTCCAAGCTGAGTGTTGGCCTGATACCTACCAGTGGAGCGAGGGGAACCCGAGGACTGCCAAGGGCA
+
D?KMPQEPGCPQQNPQIQIGR@DPERQHEKBED=HCHG8EHFDCD6<329@<:69A<6,;<967>;=C:>AA8BBED#######################
@SRR62635.15516129
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGCCCCCCTTTCCCCCCCGGGGGGGGGACAGGGGGGGTGTTCGGGCCCCGCGCCGCCCTTGACCACGG
+
EKLMPPPPPQQQQQQQQQQQQQQQK###########################################################################

[Figure: Sanger (Phred+33) score scale shown under the reads, from "!" (score 0) to "I" (score 40)]

Read cleaning: goal

Before cleaning:
@SRR62641.6751359
CGCCCGGCCAATCATTGTGGTTTTAAGTCACTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCT
+
CBLNPGJQQQJPPQPPQPQRGPPPPRRQQRPSPGRQQQRLRRRMEPQQPMJHQQEHKMMFIIRH?SIIHKNJIKRLJJIKHEABHIFGCGGEFCGDGDCE
@SRR62634.16249693
CTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCTCTGTGCACCCAGCATTGCCCAGAACAGGGC
+
ALKMOOOOPPQJQOPPPPPQPPPPPPRJQRQQQQQRPQPRQQPFQSQQPRLIMHKSNRJQORMFELRPQNQRQJQRRPQQLIRKDMKQJPN8CFDGCCCB
@SRR62634.26465
CTCCCAGCTTCCAACAGACCCTGTCCCAGCTCCCTCCAAGCTGAGTGTTGGCCTGATACCTACCAGTGGAGCGAGGGGAACCCGAGGACTGCCAAGGGCA
+
D?KMPQEPGCPQQNPQIQIGR@DPERQHEKBED=HCHG8EHFDCD6<329@<:69A<6,;<967>;=C:>AA8BBED#######################
@SRR62635.15516129
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGCCCCCCTTTCCCCCCCGGGGGGGGGACAGGGGGGGTGTTCGGGCCCCGCGCCGCCCTTGACCACGG
+
EKLMPPPPPQQQQQQQQQQQQQQQK###########################################################################

After cleaning:
@SRR62641.6751359
CGCCCGGCCAATCATTGTGGTTTTAAGTCACTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCT
+
CBLNPGJQQQJPPQPPQPQRGPPPPRRQQRPSPGRQQQRLRRRMEPQQPMJHQQEHKMMFIIRH?SIIHKNJIKRLJJIKHEABHIFGCGGEFCGDGDCE
@SRR62634.16249693
CTAAGTTTGAGGCTATTTTGTTTTACAGCAAAAGCTAACTGATGCAGACAGGGACAAGTCAGTCTCATCTCTGTGCACCCAGCATTGCCCAGAACAGGGC
+
ALKMOOOOPPQJQOPPPPPQPPPPPPRJQRQQQQQRPQPRQQPFQSQQPRLIMHKSNRJQORMFELRPQNQRQJQRRPQQLIRKDMKQJPN8CFDGCCCB
@SRR62634.26465
CTCCCAGCTTCCAACAGACCCTGTCCCAGCTCCCTCCAAGCTGAG
+
D?KMPQEPGCPQQNPQIQIGR@DPERQHEKBED=HCHG8EHFDCD

Quality control examples

Reads cleaning
Cut adapters at read ends
Trimming: cut read ends (5' or 3')
  Fixed number of bases
  Individual base quality
  Mean quality of bases in a sliding window
Filtering: remove reads
  Size criterion (example: < 6bp)
  Mean quality over all bases of the read (example: < 25)
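The two quality-based trimming strategies listed above can be sketched on a list of per-base scores. This is a simplified illustration, not the exact algorithm of any particular tool: the function names, the thresholds and the toy read are assumptions, and the sliding window here simply cuts the read at the start of the first window whose mean quality is too low.

```python
from typing import List, Tuple


def trim_ends(scores: List[int], min_quality: int) -> Tuple[int, int]:
    """Return (start, end) of the part kept after removing low-quality bases at both ends."""
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return start, end


def sliding_window_keep(scores: List[int], window: int, min_mean: float) -> int:
    """Return how many 5' bases to keep: cut at the first window whose mean is below min_mean."""
    for start in range(0, len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_mean:
            return start
    return len(scores)  # no bad window (or read shorter than the window): keep everything


# Toy read (hypothetical): quality drops towards the 3' end.
sequence = "ACGTACGTAC"
scores = [35, 36, 34, 33, 30, 12, 8, 5, 2, 2]
keep = sliding_window_keep(scores, window=4, min_mean=20)
trimmed = sequence[:keep]
print(keep, trimmed)
```

Both functions return positions rather than a new read, so the same cut can be applied to the sequence and to its quality string.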

Reads cleaning example: protocol for de novo transcriptome assembly
  Clean adapters
  Trimming 5' and 3' on base quality (> 3)
  Trimming using a sliding window (4 bases, Q < 2)
  Filtering on mean read quality (Q < 25)
  Filtering on read size (size < 2)
Source: Erwan Core, 5ème Ecole de bioinformatique AVIESAN-IFB 2016 https://www.francebioinformatique.fr/fr/evenements/eba216

Protocol for variant analysis
Reads (fastq) -> Quality control
Reads cleaning: filtering adapters, filtering and trimming on base quality
Reads (fastq) -> Quality control
Alignment of the reads on the genome (fasta)
Alignment (sam/bam) -> Quality control
Alignment cleaning: filtering on alignment quality, marking duplicated reads, local realignment
Alignment (sam/bam) -> Metrics, coverage statistics
Variant detection
Variants (vcf)
(Module 4 of the NGS cycle)
Variants cleaning and annotation

Protocol for de novo assembly
Reads (fastq) -> Quality control
Reads cleaning: filtering adapters, quality filtering and trimming
Reads (fastq) -> Quality control
Assembly
Contigs, scaffolds -> Metrics
Assembly cleaning
Contigs, scaffolds -> Metrics

Quality scores
[Figure: the scores attached to each level of the analysis]
  Read (A C T A...): one score per base (score for base A, score for base C, score for base T, score for base A)
  Read: mean score for the read
  Read against the reference: alignment score
  Variant A/T across Sample1 and Sample2: score for variant calling
  Each sample: score for sample genotyping (A/T, or NA)

Quality scores
Quality = probability of no mistake in:
  Base calling
  Read alignment
  Variant calling
  Sample genotyping
Depends on the algorithm or on the protocol:
  Illumina protocol ≠ PacBio protocol
  Illumina v1 protocol ≠ Illumina v2 protocol (instrument evolution)
  Computation type: base calling algorithm ≠ alignment algorithm ≠ variant detection algorithm
  Different tools for the same operation: BWA ≠ Bowtie2 for alignment
Probability depends on instrument bias

Galaxy
  Connection
  Upload data
  Working with datasets and histories
  Adding a local reference
  Converting to fastqsanger format

Data for this tutorial
Data from the human genome, from the HapMap project: http://www.hapmap.org/
Reference: small region from chromosome 2 2:38-53 (assembly GRCh37), file GRCh37_region1.fasta
Reads: Illumina paired-end (2x1bp) for 3 samples (HG96, HG11 and HG13), files HGXXX_1.fastq, HGXXX_2.fastq (only the reads for this small region, for reasons of speed)
Download the files from the bilille wiki: https://wikis.univ-lille1.fr/bilille/formation
Main goals for this first part of the tutorial:
  Upload the reference and the reads for one sample (HG11)
  Work with histories, datasets and tools

Connect: Galaxy v1
1. Enter the IP number
2. Click on the Galaxy icon
3. Menu User / login
4. Username: user@galaxy.ifb.fr, password: ifbuser
5. Menu User => check that you are connected
[Figure labels: simple IP; IP + session; username + password]

Connect: Galaxy v2
1. Enter the IP number
2. Menu User / login
3. Username: bilillen, password: bilillen
4. Menu User => check that you are connected

History: «Folder» containing a set of data
Default name = «Unnamed history»
1. Rename the history => TP1
2. Explore the history menu
3. Create a new history

List histories, go back to TP1
1. List all histories
2. Go back to the TP1 history

Dataset ~ «Data file»
Upload the reference into a dataset:
1. Tools: Get Data / Upload File
2. Choose file GRCh37_regions1.fasta
3. Choose fasta format (! not csfasta)
4. Keep «Unspecified» as genome
5. Run with start
6. Close

Dataset: summary, attributes, full data
1. Click on the dataset name: shows a summary of attributes and data
2. Click on the eye: shows the data
3. Click on the pencil: shows the attributes

Add a local reference (TP_ref)
1. Menu User / Custom Builds
2. Choose name TP_ref
3. Choose fasta format
4. Choose dataset no. 1: GRCh37_regions1.fasta
5. Submit
The reference is now available

Check / change database attribute
1. Menu Analyze Data
2. Click on the dataset name to see its summary => database attribute is «?»
3. Click on the pencil to change attributes
4. Choose the TP_ref database
5. Save
6. Check that the database attribute is now «TP_ref»

Upload reads (fastq) for sample HG11
1. Tools: Get Data / Upload File
2. Choose local files HG11_1.fastq and HG11_2.fastq
3. Choose «fastq» format
4. Choose «TP_ref» genome
5. Run with start
6. Close
7. Check attributes

Look at a fastq file
1 read = identifier, sequence, quality
What is the file size? How many reads? What are the read sizes? Which quality coding is used?
Galaxy always uses Sanger coding => conversion tool: groomer
Source: https://fr.wikipedia.org/wiki/fastq
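The questions above (number of reads, read sizes, quality coding) can be answered from the records themselves. The sketch below relies on a common heuristic that is an assumption here, not a rule from the slides: characters with an ASCII code below 59 only occur with a +33 offset, and codes above 74 only occur with a +64 offset. The function name and the toy records are hypothetical.

```python
from typing import Dict, List, Tuple


def summarize_reads(records: List[Tuple[str, str, str]]) -> Dict[str, object]:
    """Count reads, report read sizes and guess the quality coding offset."""
    lengths = [len(sequence) for _, sequence, _ in records]
    codes = [ord(character) for _, _, quality in records for character in quality]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        coding = "phred33"       # Sanger / Illumina 1.8+
    elif highest > 74:
        coding = "phred64"       # older Illumina or Solexa conventions
    else:
        coding = "undetermined"  # the observed characters fit both offsets
    return {
        "reads": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "coding": coding,
    }


# Toy records (identifier, sequence, quality), not taken from the tutorial files.
sanger_like = [("r1", "ACGT", "HHG#"), ("r2", "ACGTA", "IIIII")]
old_illumina_like = [("r3", "ACG", "dff")]
print(summarize_reads(sanger_like))
print(summarize_reads(old_illumina_like))
```

When the guess is "undetermined", more reads are needed before choosing the input format of the groomer.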

Convert to Sanger format: groomer tool
1. Tools: FASTQ Groomer
2. Choose to «groom» many files
3. Choose files HG11_1.fastq and HG11_2.fastq
4. Choose «Sanger & Illumina 1.8+» format
5. Execute: creates 2 new datasets (no. 4 and 5)
6. Check the attributes of the new datasets
What are the sizes of the new datasets? How many reads? Which quality coding is used? What are the names of the new datasets?

Rename datasets
For each new dataset:
1. Click on the pencil to change attributes
2. Change the name (e.g. HG11_OK_1.fastq)
3. Save
4. Check the new dataset names
After changing a dataset name, how can we retrieve the dataset origin?

Retrieve dataset origin
This dataset results from the groomer tool, applied to dataset 2 (HG11_1.fastq)

Reads quality control
  Per base quality
  Per read mean quality
  Read size
  Adapters
  Duplicated reads

Reads quality control (FastQC)
Andrews, S. FastQC: A Quality Control tool for High Throughput Sequence Data.
1, 2. Choose tool: FastQC
3. Choose datasets no. 4 and 5
4. Execute: creates 4 new datasets (for each fastq file, 1 «raw data» and 1 «Webpage»)

Manage FastQC result datasets
1. Look quickly at the dataset content (we will look at it in depth later)
2. Remove the «RawData» datasets
3. Rename the «Webpage» datasets HG11_1.QC and HG11_2.QC

FastQC: Summary & Basic Statistics

FastQC: Per base sequence quality
[Figure labels: Median, Mean; Good quality, Mean quality, Bad quality]

FastQC: Per base sequence quality
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

FastQC: Per sequence quality score
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

FastQC: Sequence Length Distribution & Per sequence GC content

FastQC: Per base sequence content
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

FastQC: Per base N content
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

FastQC: Overrepresented sequences
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

FastQC: Sequence Duplication Levels
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

FastQC: Adapter Content
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

FastQC: Kmer Content
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

FastQC: example with PacBio
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/pacbio_srr7514_fastqc.html

Cleaning reads
  Filtering adapters
  Filtering & trimming reads
  Comparing quality before and after cleaning

Filtering & trimming
Filtering = remove reads, based on quality or size criteria
Trimming = remove read ends
  Fixed number of bases
  Bases below a quality threshold
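Filtering, unlike trimming, is a yes/no decision on the whole read. A minimal sketch of the two criteria named above follows; the function name and the default thresholds (minimum length 20, minimum mean quality 25) are illustrative assumptions, to be replaced by the values chosen for a given protocol.

```python
from typing import List


def keep_read(scores: List[int], min_length: int = 20, min_mean_quality: float = 25.0) -> bool:
    """Return True if the read passes both the size and the mean quality criteria."""
    if not scores or len(scores) < min_length:
        return False  # empty or too short: filtered out
    return sum(scores) / len(scores) >= min_mean_quality


print(keep_read([30] * 50))  # long enough, good mean quality
print(keep_read([30] * 10))  # too short
print(keep_read([10] * 50))  # mean quality too low
```

Filtering is usually applied after trimming, because trimming changes both the size and the mean quality of a read.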

Trimming
Cut bad quality bases at the end of reads
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Filtering
Remove reads with bad mean quality
Example OK, example KO
Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Reads cleaning (Trimmomatic) 1/2
Bolger, A. M., Lohse, M. and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30 (15), pp. 2114-2120.
1. Choose files
2. Parameters for adapters

Reads cleaning (Trimmomatic) 2/2
Add operations (cleaning steps):
1. LEADING: cut bad quality 5' bases
2. TRAILING: cut bad quality 3' bases
3. SLIDINGWINDOW: cut bases with bad mean quality in a sliding window
4. AVGQUAL: remove reads with bad mean quality
5. MINLEN: remove small reads

Trimmomatic: results
Output datasets:
  Reads1 after cleaning
  Reads2 after cleaning
  Unpaired reads1 (the corresponding reads2 were removed during cleaning)
  Unpaired reads2 (the corresponding reads1 were removed during cleaning)
How many paired reads remain after cleaning?
Is there any trace or log of what happened during cleaning?
From the summary, click on the «i» icon and look at the content of «stdout»: it contains all the messages sent by the tool during execution.
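The four outputs follow from a simple rule applied to each pair: a read goes to a paired output only if its mate also survived cleaning. The sketch below illustrates that rule only; it is not Trimmomatic code, and the function name, the output labels and the toy survival flags are assumptions.

```python
from collections import Counter


def route_pair(read1_kept: bool, read2_kept: bool) -> str:
    """Decide where a pair goes after each mate has been cleaned independently."""
    if read1_kept and read2_kept:
        return "paired"          # both mates go to the paired outputs
    if read1_kept:
        return "unpaired_reads1" # reads2 was removed
    if read2_kept:
        return "unpaired_reads2" # reads1 was removed
    return "dropped"             # neither mate survived


# Toy survival flags for five pairs (hypothetical).
survival = [(True, True), (True, False), (False, True), (True, True), (False, False)]
counts = Counter(route_pair(kept1, kept2) for kept1, kept2 in survival)
print(counts)
```

Keeping the two paired outputs in the same order is what allows alignment tools to match read 1 and read 2 of each pair afterwards.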

Trimmomatic: 2nd try
Run this tool again after changing the AVGQUAL parameter value to 3
1. Use one of the datasets produced by the previous analysis
2. Click on the «Run this job again» icon: all parameters are preset with the values used in the previous execution
3. Change only the parameter in step no. 4 (AVGQUAL, value 25)
How many paired reads remain after the new cleaning?
Work only with the paired datasets produced by the first cleaning (remove the second try)
Rename the datasets HG11_1_clean.fastq and HG11_2_clean.fastq
Run quality control on these 2 datasets
Compare quality control before and after cleaning

Compare quality control before / after cleaning
Solution 1: open a second web browser window
  Connect to access the history and datasets
  Visualize datasets HG11_OK_1.QC and HG11_clean_1.QC

Compare quality control before / after cleaning
Solution 2: use the Galaxy «Scratchbook» to manage Galaxy windows
1. «Enable Scratchbook»
2. Visualize dataset HG11_clean_1.fastq
3. Visualize dataset HG11_OK_1.fastq

Quality control before & after cleaning
HG11_OK_1.QC, HG11_clean_1.QC

Reads alignment
  SAM/BAM format
  SAM results for single-end and paired-end reads
  BWA & Bowtie2 alignment tools
  Manage duplicated reads (Picard / MarkDuplicates)
  Count alignments (samtools / flagstat)
  Compute depth and coverage (deepTools / plotCoverage)

SAM - Sequence Alignment/Map
Alignment coded in a SAM format file:
@SQ SN:ref LN:45
r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1
Source: http://samtools.github.io/hts-specs/samv1.pdf

SAM file format
Header: @SQ = reference sequence; SN:ref = sequence name; LN:45 = sequence size
1 line per read, 12 columns (11 mandatory fields plus optional tags)
@SQ SN:ref LN:45
r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1
Column 5 (mapping quality) = -10 log10(p), where p = error probability on the mapping position
Mapping tools estimate it from mismatches, insertions, deletions and multiple alignments
Source: http://samtools.github.io/hts-specs/samv1.pdf
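One alignment line can be split into its columns with plain string handling. The sketch below parses the first read of the SAM specification example, converts its mapping quality into an error probability, and uses the CIGAR string to compute how many reference bases the alignment spans. The function names are assumptions for this illustration; real analyses would normally rely on a dedicated library.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_alignment(line: str) -> dict:
    """Split one tab-separated SAM alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0], "flag": int(fields[1]), "rname": fields[2],
        "pos": int(fields[3]), "mapq": int(fields[4]), "cigar": fields[5],
        "rnext": fields[6], "pnext": int(fields[7]), "tlen": int(fields[8]),
        "seq": fields[9], "qual": fields[10], "tags": fields[11:],
    }


def mapping_error_probability(mapq: int) -> float:
    """Probability that the mapping position is wrong, from MAPQ = -10 log10(p)."""
    return 10 ** (-mapq / 10)


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: operations M, D, N, = and X consume the reference."""
    return sum(int(count) for count, op in CIGAR_OPERATION.findall(cigar) if op in "MDN=X")


line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
alignment = parse_alignment(line)
print(alignment["qname"], alignment["pos"], reference_span(alignment["cigar"]))
print(mapping_error_probability(alignment["mapq"]))
```

For this read the CIGAR 8M2I4M1D3M describes 17 read bases (8 + 2 + 4 + 3) aligned over 16 reference bases (8 + 4 + 1 + 3).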

SAM - Flags
@SQ SN:ref LN:45
r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1
Flag 99 = 1 + 2 + 32 + 64:
  1: template having multiple segments in sequencing
  2: each segment properly aligned according to the aligner
  32: SEQ of the next segment in the template being reverse complemented
  64: the first segment in the template
Source: http://samtools.github.io/hts-specs/samv1.pdf

SAM - Flags
Flag values can be decoded at: https://broadinstitute.github.io/picard/explain-flags.html
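A flag is a sum of powers of two, so decoding it amounts to testing each bit. The sketch below does this with the flag names used in this tutorial; the function name `explain_flag` is an assumption for the illustration.

```python
SAM_FLAGS = [
    (1, "read paired"),
    (2, "read mapped in proper pair"),
    (4, "read unmapped"),
    (8, "mate unmapped"),
    (16, "read reverse strand"),
    (32, "mate reverse strand"),
    (64, "first in pair"),
    (128, "second in pair"),
    (256, "not primary alignment"),
    (512, "read fails platform/vendor quality checks"),
    (1024, "read is PCR or optical duplicate"),
    (2048, "supplementary alignment"),
]


def explain_flag(flag: int) -> list:
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


for value in (99, 147, 4):
    print(value, explain_flag(value))
```

For example 99 = 1 + 2 + 32 + 64 and 147 = 1 + 2 + 16 + 128, the usual flags of the two mates of a properly aligned pair.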

BAM: binary SAM
Same data as in SAM, in a more compact «binary» format
  Smaller files
  Faster processing by computers
BAI: index for a BAM file
  Speeds up data search and retrieval in a BAM file

SAM: examples for single-end reads
Reads:
  S1: perfect read, good quality
  S2: perfect read, bad quality
  S3: perfect read, good quality except for the last 2 bases
  S4: perfect read, good quality except for the last 6 bases
  S5: insertion of 5 bases at position 5, good quality
  S6: deletion of 5 bases at position 5, good quality
  S7: 5 substitutions spread over the read, good quality
  S8: 5 substitutions at the start of the read, good quality
  S9: 5 substitutions in the middle of the read, good quality
  S10: 5 substitutions at the start, bad quality
  S11: perfect read, good quality, duplication 1
  S12: perfect read, good quality, duplication 2
  S13: perfect read, good quality
  S14: perfect read, good quality, shifted by 1bp
  S15: unaligned read, good quality
  S16: read aligned 2 times, good quality
  S17: read aligned once, good quality
SAM flags: read paired; read mapped in proper pair; read unmapped; mate unmapped; read reverse strand; mate reverse strand; first in pair; second in pair; not primary alignment; read fails platform/vendor quality checks; read is PCR or optical duplicate; supplementary alignment (flags marked on the slide: 4, 124)
BWA (columns QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, NEXT, PNEXT, TLEN):
  QNAME: S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S12 S12 S13 S14 S15 S16 S17
  FLAG: 4
  POS: 1 11 21 31 41 51 61 76 81 96 11 11 121 122 151 161
  MAPQ: 6 6 6 6 6 6 6 6 6 6 6 6 6 6
  CIGAR: 1M 1M 1M 1M 48M5I47M 5M5D45M 1M 5S95M 1M 5S95M 1M 1M 1M 99M 1M 1M
Bowtie2 (same columns):
  QNAME: S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S12 S12 S13 S14 S15 S16 S17
  FLAG: 4
  POS: 1 11 21 31 41 51 61 72 81 92 11 11 121 122 151 21
  MAPQ: 42 42 42 42 24 24 23 4 23 4 42 42 42 42 1 1
  CIGAR: 1M 1M 1M 1M 48M5I47M 5M5D45M 1M 4M1I95M 1M 4M1I95M 1M 1M 1M 99M 1M 1M

SAM: examples for single-end reads (annotations)
Same reads and result tables as on the previous slide, annotated with:
  Duplicated reads
  Unaligned reads
  Multiple alignments
  BWA: score 60 / 0
  Bowtie2: score [0-42]

SAM: examples for paired-end reads
[Figure: positions of the read pairs R1 to R7 on references C1 and C2]

BWA
QNAME  FLAG  RNAME  POS  MAPQ  CIGAR  RNEXT  PNEXT  TLEN
P1_F   97    C1     21   6     1M     =      41     3
P1_R   145   C1     41   6     1M     =      21     -3
P2_F   97    C1     31   6     1M     =      111    9
P2_R   145   C1     111  6     1M     =      31     -9
P3_F   73    C1     51   6     1M     =      51
P3_R   133   C1     51                =      51
P4_F   97    C1     61   6     1M     C2     11
P4_R   145   C2     11   6     1M     C1     61
P5_F   97    C1     71   6     1M     =      171    11
P5_R   145   C1     171        1M     =      71     -11
P6_F   97    C1     81   6     1M     =      271    2
P6_R   145   C1     271        1M     =      81     -2
P7_F   65    C1     121  6     1M     =      141    21
P7_R   129   C1     141  6     1M     =      121    -21

Bowtie2
QNAME  FLAG  RNAME  POS  MAPQ  CIGAR  RNEXT  PNEXT  TLEN
P1_F   99    C1     21   42    1M     =      41     3
P1_R   147   C1     41   42    1M     =      21     -3
P2_F   97    C1     31   42    1M     =      111    9
P2_R   145   C1     111  42    1M     =      31     -9
P3_F   73    C1     51   42    1M     =      51
P3_R   133   C1     51                =      51
P4_F   97    C1     61   42    1M     C2     11
P4_R   145   C2     11   42    1M     C1     61
P5_F   99    C1     71   42    1M     =      91     3
P5_R   147   C1     91   42    1M     =      71     -3
P6_F   99    C1     81   42    1M     =      11     3
P6_R   147   C1     11   42    1M     =      81     -3
P7_F   65    C1     121  42    1M     =      141    3
P7_R   129   C1     141  42    1M     =      121    -3

SAM flags                                   97  145  99  147  73  133  65  129
read paired                                 x   x    x   x    x   x    x   x
read mapped in proper pair                  -   -    x   x    -   -    -   -
read unmapped                               -   -    -   -    -   x    -   -
mate unmapped                               -   -    -   -    x   -    -   -
read reverse strand                         -   x    -   x    -   -    -   -
mate reverse strand                         x   -    x   -    -   -    -   -
first in pair                               x   -    x   -    x   -    x   -
second in pair                              -   x    -   x    -   x    -   x
not primary alignment                       -   -    -   -    -   -    -   -
read fails platform/vendor quality checks   -   -    -   -    -   -    -   -
read is PCR or optical duplicate            -   -    -   -    -   -    -   -
supplementary alignment                     -   -    -   -    -   -    -   -
https://broadinstitute.github.io/picard/explain-flags.html

SAM: examples for paired-end reads (differences between BWA and Bowtie2)
Same pairs and result tables as on the previous slide, annotated with:
  Difference on flags: pairs P1, P5 and P6 have flags 97 / 145 with BWA and 99 / 147 with Bowtie2
  Difference on positions when sequences are duplicated: P5_R is placed at POS 171 by BWA and 91 by Bowtie2; P6_R at POS 271 by BWA and 11 by Bowtie2
https://broadinstitute.github.io/picard/explain-flags.html

Alignment with BWA
Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25 (14), pp. 1754-1760.
Li, H. and Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26 (5), pp. 589-595.

Alignment with Bowtie2
Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10 (3), pp. R25.
Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9 (4), pp. 357-359.

Cleaning duplicated reads
Source: GATK, Marking duplicates https://software.broadinstitute.org/gatk/events/slides/1511/presentations/gatkwh9-3-marking_duplicates.pdf

Picard / MarkDuplicates
Additional information about Picard tools is available from the Picard web site: http://broadinstitute.github.io/picard/

Metric                           BWA      Bowtie2
UNPAIRED_READS_EXAMINED          18       19
READ_PAIRS_EXAMINED              443      444
SECONDARY_OR_SUPPLEMENTARY_RDS   3
UNMAPPED_READS                   22       19
UNPAIRED_READ_DUPLICATES
READ_PAIR_DUPLICATES             12       12
READ_PAIR_OPTICAL_DUPLICATES
PERCENT_DUPLICATION              ,2962    ,296
ESTIMATED_LIBRARY_SIZE           679728   6865

Alignment count: samtools flagstat

BWA
8129 + in total (QC passed reads + QC failed reads)
+ secondary
3 + supplementary
24 + duplicates
811 + mapped (99.77%: nan%)
8126 + paired in sequencing
463 + read1
463 + read2
798 + properly paired (98.2%: nan%)
888 + with itself and mate mapped
19 + singletons (.23%: nan%)
+ with mate mapped to a different chr
+ with mate mapped to a different chr (mapq>=5)

Bowtie2
8126 + in total (QC passed reads + QC failed reads)
+ secondary
+ supplementary
24 + duplicates
814 + mapped (99.73%: nan%)
8126 + paired in sequencing
463 + read1
463 + read2
874 + properly paired (99.36%: nan%)
886 + with itself and mate mapped
18 + singletons (.22%: nan%)
+ with mate mapped to a different chr
+ with mate mapped to a different chr (mapq>=5)

Coverage and depth of coverage
Source: Élodie Girard, 5ème Ecole de bioinformatique AVIESAN-IFB 2016 http://www.france-bioinformatique.fr/sites/default/files/v1_itmo_216_eg_from_fastq_to_mapping_1.pdf
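Both quantities can be computed from the positions of the aligned reads. The sketch below assumes the usual definitions, which are not spelled out on the slide: depth is the number of reads covering a given position, and coverage is the fraction of reference positions covered by at least one read. The function name and the toy alignments are hypothetical.

```python
from typing import List, Tuple


def depth_per_base(reference_length: int, alignments: List[Tuple[int, int]]) -> List[int]:
    """Per-position depth from (start, span) alignments; start is 1-based, as POS in SAM."""
    depth = [0] * reference_length
    for start, span in alignments:
        first = start - 1
        last = min(first + span, reference_length)  # do not run past the reference end
        for position in range(first, last):
            depth[position] += 1
    return depth


# Toy example: a 10-base reference and two reads of 4 bases starting at positions 1 and 3.
depth = depth_per_base(10, [(1, 4), (3, 4)])
coverage = sum(1 for d in depth if d > 0) / len(depth)
mean_depth = sum(depth) / len(depth)
print(depth)
print(coverage, mean_depth)
```

Here 6 of the 10 positions are covered, and positions 3 and 4 are covered by both reads.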

Computing coverage and depth of coverage: deepTools2 / plotCoverage
Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, A. S., Heyne, S., Dündar, F. and Manke, T. (2016). deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Research, 44 (W1), pp. W160-W165.

deepTools / plotCoverage

Galaxy workflow
  Extract a workflow from a history
  Modify a workflow
  Execute a workflow on new data
  Compare results from 2 workflows (in 2 histories)

Extract a workflow from the history of steps applied to the first sample

Visualize workflow

Modify workflow visualisation

Modify the configuration of some steps
This workflow uses 3 input files. Change the box names to describe which data is required for each input, e.g. Reference, Forward fastq, Reverse fastq.
You can also change any parameter, for example for the Trimmomatic step.

Enable a parameter to be set at run time
By default, the parameters of each tool have the predefined values set in the workflow.
You can modify this to enable any parameter to be set at run time.
Modify Trimmomatic so that the adapters are set at run time.
Don't forget to save your workflow!

Import new data for sample HG13
Import files HG13_1.fastq and HG13_2.fastq

Analyze these new data with the same workflow
Run the workflow with these new data

Browse results

Gather BWA alignment results for the 2 samples 1/2
Create a new history named results
From history TP1: copy the HG11_BWA_MD.bam dataset

Gather BWA alignment results for the 2 samples 2/2
From history HG13: copy the dataset «MarkDuplicates on data 13: MarkDuplicates BAM output»...

Visualize depth of coverage for both samples
Rename datasets
Run plotCoverage

Galaxy best practices
  Manage disk space
  Export analysis results (datasets and histories)
  Export / import analysis protocols (workflow)

Manage disk space
  Global disk space
  Disk space per dataset
  Disk space per history

Export analysis results: datasets
Image, text, BAM, HTML + files

Export analysis results: histories

Export / import analysis protocols: workflow
Export, import

Galaxy - Visualisation
  Create a visualisation
  Add datasets
  Change visualisation parameters

Create a new visualisation
1. Menu Visualisation / New Track Browser
2. Choose name TP2_vizu
3. Choose reference TP2_ref
4. Create
5. Save

Visualize BAM files 1/3
1. Add datasets
2. Choose BAM files
3. Add

Visualize BAM files 2/3
Zoom and move

Visualize BAM files 3/3
1. Set Display Mode
2. Coverage