Sequence Alignment: Mo1va1on and Algorithms

Similar documents
Sequence Alignment: Mo1va1on and Algorithms. Lecture 2: August 23, 2012

EECS 4425: Introductory Computational Bioinformatics Fall Suprakash Datta

Read Mapping and Variant Calling

Mar%n Norling. Uppsala, November 15th 2016

Module: Sequence Alignment Theory and Applica8ons Session: BLAST

NGS Data and Sequence Alignment

24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:

Basic Local Alignment Search Tool (BLAST)

BLAST MCDB 187. Friday, February 8, 13

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

File Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015

Lecture 5 Advanced BLAST

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

As of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be

Biology 644: Bioinformatics

Lecture 12. Short read aligners

The SAM Format Specification (v1.3 draft)

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

Sequence mapping and assembly. Alistair Ward - Boston College

Database Similarity Searching

Basics on bioinforma-cs Lecture 4. Concita Cantarella

Finding sequence similarities with genes of known function is a common approach to infer a newly sequenced gene s function

EECS730: Introduction to Bioinformatics

BLAST & Genome assembly

INTRODUCTION AUX FORMATS DE FICHIERS

BLAST & Genome assembly

Bioinformatics. Sequence alignment BLAST Significance. Next time Protein Structure

BLAST, Profile, and PSI-BLAST

Database Searching Using BLAST

Bioinformatics for Biologists

The SAM Format Specification (v1.3-r837)

Short Read Alignment. Mapping Reads to a Reference

NGS Data Analysis. Roberto Preste

Introduction to Computational Molecular Biology

High-throughout sequencing and using short-read aligners. Simon Anders

Computational Genomics and Molecular Biology, Fall

RNA-seq. Manpreet S. Katari

Bioinformatics explained: BLAST. March 8, 2007

Pairwise Sequence Alignment. Zhongming Zhao, PhD

Mapping NGS reads for genomics studies

Sequence alignment theory and applications Session 3: BLAST algorithm

Sequence Alignment. GS , Introduc8on to Bioinforma8cs The University of Texas GSBS program, Fall 2013

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Sequence Alignment. GBIO0002 Archana Bhardwaj University of Liege

Lecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

Sequence Alignment & Search

Similarity Searches on Sequence Databases

Computational Molecular Biology

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.

Under the Hood of Alignment Algorithms for NGS Researchers

BGGN 213 Foundations of Bioinformatics Barry Grant

Lecture 2 Pairwise sequence alignment. Principles Computational Biology Teresa Przytycka, PhD

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

Next generation sequencing: assembly by mapping reads. Laurent Falquet, Vital-IT Helsinki, June 3, 2010

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

B L A S T! BLAST: Basic local alignment search tool. Copyright notice. February 6, Pairwise alignment: key points. Outline of tonight s lecture

Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties

PPI Network Alignment Advanced Topics in Computa8onal Genomics

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

FASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.

TCCAGGTG-GAT TGCAAGTGCG-T. Local Sequence Alignment & Heuristic Local Aligners. Review: Probabilistic Interpretation. Chance or true homology?

Perl for Biologists. Session 8. April 30, Practical examples. (/home/jarekp/perl_08) Jon Zhang

Alignments BLAST, BLAT

Bioinformatics in next generation sequencing projects

RCAC. Job files Example: Running seqyclean (a module)

Heuristic methods for pairwise alignment:

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST

NGS Analysis Using Galaxy

From Smith-Waterman to BLAST

CS 284A: Algorithms for Computational Biology Notes on Lecture: BLAST. The statistics of alignment scores.

INTRODUCTION TO BIOINFORMATICS

Computational Molecular Biology

Genomic Files. University of Massachusetts Medical School. October, 2014

Genomic Files. University of Massachusetts Medical School. October, 2015

INTRODUCTION TO BIOINFORMATICS

Chapter 4: Blast. Chaochun Wei Fall 2014

COS 551: Introduction to Computational Molecular Biology Lecture: Oct 17, 2000 Lecturer: Mona Singh Scribe: Jacob Brenner 1. Database Searching

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

Today s Lecture. Multiple sequence alignment. Improved scoring of pairwise alignments. Affine gap penalties Profiles

Sequence alignment algorithms

Illumina Next Generation Sequencing Data analysis

MetaPhyler Usage Manual

Similarity searches in biological sequence databases

Read Mapping. Slides by Carl Kingsford

BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha

Dynamic Programming & Smith-Waterman algorithm

Reconstructing long sequences from overlapping sequence fragment. Searching databases for related sequences and subsequences

Algorithmic Approaches for Biological Data, Lecture #20

Introduction to Phylogenetics Week 2. Databases and Sequence Formats

Introduc)on to annota)on with Artemis. Download presenta.on and data

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Sequence Alignment (chapter 6) p The biological problem p Global alignment p Local alignment p Multiple alignment

CBSU/3CPG/CVG Joint Workshop Series Reference genome based sequence variation detection

Accurate Long-Read Alignment using Similarity Based Multiple Pattern Alignment and Prefix Tree Indexing

HISAT2. Fast and sensi0ve alignment against general human popula0on. Daehwan Kim

Algorithms in Bioinformatics: A Practical Introduction. Database Search

Read Naming Format Specification

Sequence comparison: Local alignment

Transcription:

Sequence Alignment: Mo1va1on and Algorithms

Mo1va1on and Introduc1on

Importance of Sequence Alignment For DNA, RNA and amino acid sequences, high sequence similarity usually implies significant func1onal or structural similarity. imply structure from the sequence similarity suggest gene func1on Frequently, we aren t interested in exact matches of the sequences but approximate matches.

Inexact Alignment Where is GATTACA approximately in the human genome? And how do we efficiently find them? Answers to these ques1ons depend on: 1. The defini1on of approximately Hamming Distance, Edit distance, Sequence Similarity 2. Efficiency depends on the data characteris1cs and goals

In this lecture Sequence alignment and alignment sotware are used all over bioinforma1cs for different purposes. The goal of this lecture is to understand sequence alignment, mo1vate it in the context of assembly, and seeing what sotware is out there.

Two examples where you would use sequence similarity 1. Use a known genome to help assemble an unknown genome. 2. Annota1ng a genome.

Two examples where you would use sequence similarity 1. Use a known genome to help assemble an unknown genome. 2. Annota1ng a genome.

What is a reference genome? A reference genome is a representa1ve example of a species set of genes. As they are oten assembled from the sequencing of DNA from a number of donors, reference genomes do not accurately represent the set of genes of any individual. BUT the reference is very, very, VERY similar in many loca1ons to the individual.

Sample Prepara*on 10

Sample Prepara*on Fragments 11

Sample Prepara*on Fragments Sequencing Next Generation Sequencing (NGS) ACGTAGAATCGACCATG ACGTAGAATACGTAGAA GGGACGTAGAATACGAC Reads 12

Sample Prepara*on Fragments Sequencing Reads Assembly ACGTAGAATACGTAGAA ACGTAGAATCGACCATG GGGACGTAGAATACGAC ACGTAGAATACGTAGAAACAGATTAGAGAG Contigs 13

Sample Prepara*on Fragments Sequencing Reads Assembly ACGTAGAATACGTAGAA ACGTAGAATCGACCATG GGGACGTAGAATACGAC BUT WE DON T HAVE TO BEGIN FROM SCRATCH!!! ACGTAGAATACGTAGAAACAGATTAGAGAG Contigs 14

How do we use the reference? The reads from the individual will be very similar but not iden1cal to the reference genome. Ideally, we would like to use the informa1on from the reference! Otherwise, we are throwing up valuable informa1on. Hence, the first step is to find where in the genome the sequences are similar to.

Two examples where you would use sequence similarity 1. Use a known genome to help assemble an unknown genome. 2. Annota9ng a genome.

Sequence Alignment Before we can make compara1ve statements about two sequences, we have to produce a pairwise sequence alignment What is the op1mal alignment between two sequences? Match/mismatch? Gaps/extensions? Is an op1mal alignment always significant?

Edit Distance Levenshtein (1966) introduced edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other d(v, w): the minimum number of operations required to transform v into w.

Edit Distance vs Hamming Distance Hamming distance always compares the i th le^er of v with the j th le^er of w V = ATATATAT W = TATATATA Edit distance may compare the i th le^er of v with the j th le^er of w V = -ATATATAT W = TATATATA- d(v, w)=8 computing Hamming distance is a trivial task d(v, w)=2 computing edit distance is a non-trivial task

Edit Distance vs Hamming Distance Hamming distance always compares the i th le^er of v with the j th le^er of w V = ATATATAT W = TATATATA Edit distance may compare the i th le^er of v with the j th le^er of w V = -ATATATAT W = TATATATA- d(v, w)=8 d(v, w)=2 (one insertion and one deletion) How to find what j goes with what i such that the score of the alignment is op1mized?

Edit Distance: Example TGCATAT! ATCCGAT in 5 steps TGCATAT! (delete last T) TGCATA! (delete last A) TGCAT! (insert A at front) ATGCAT! (subs1tute C for 3 rd G) ATCCAT! (insert G before last A) ATCCGAT (Done)

Compu1ng Edit Distance

Dynamic Programming Algorithm Dynamic Programming is a method for solving complex problems by breaking them down into simpler subproblems. Provides the best (op1mal) alignment between two sequences. Includes matches, mismatches, and gaps to maximize the number of matched characters. Score: match, mismatch, gap (affine vs. non- affine)

Dynamic Programming Visualized 0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G T _ A _ C 0 1 2 3 4 5 5 6 6 7 Corresponding path (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Dynamic Programming Visualized and represent indels in v and w with score 0. represent matches with score 1. The score of the alignment path is 5.

Dynamic Programming Visualized Every path in the edit graph corresponds to an alignment:

Score for match: +1 Score for mismatch: 0 Score for gap: 0 Define Rules Let s i,j be the total score at posi1on i, j then we have that it is equal to the maximum of {s i- 1, j- 1 + 1 (if v i = w j ), s i- 1, j (otherwise: gap or mismatch)}

Dynamic Programming Example Ini1alize 1 st row and 1 st column to be all zeroes. Or, to be more precise, ini1alize 0 th row and 0 th column to be all zeroes.

Dynamic Programming Example S i,j = maximum of : S i-1, j-1 S i-1, j S i, j-1 "value from NW +1, if v i = w j " value from North (top) " value from West (left)

Alignment: Backtracking Arrows show where the score originated. if from the top if from the let if v i = w j

Backtracking Example Find a match in row and column 2. i=2, j=2,5 is a match (T). j=2, i=4,5,7 is a match (T). Since v i = w j, s i,j = s i-1,j-1 +1 s 2,2 = [s 1,1 = 1] + 1 s 2,5 = [s 1,4 = 1] + 1 s 4,2 = [s 3,1 = 1] + 1 s 5,2 = [s 4,1 = 1] + 1 s 7,2 = [s 6,1 = 1] + 1

Backtracking Example Con1nuing with the dynamic programming algorithm gives this result.

Scoring Matrices The alignment score represent odds of obtaining that score between sequences known to be related to that obtained by chance alignment between unrelated sequences When the correct scoring matrix is used, alignment sta1s1cs are meaningful Different scoring schemes for DNA and protein sequences

Examples of Scoring Matrices Amino acid subs1tu1on matrices: PAM BLOSUM DNA subs1tu1on matrices: DNA is less conserved than protein sequences. Less effec1ve to compare coding regions at nucleo1de level.

Scoring Indels: Naive Approach A fixed penalty σ is given to every indel: - σ for 1 indel, - 2σ for 2 consecu1ve indels - 3σ for 3 consecu1ve indels, etc. Can be too severe penalty for a series of 100 consecu1ve indels

Affine Gap Penal1es A series of k indels oten come as a single event rather than a series of k single nucleo1de events: This is more likely. Normal scoring would give the same score for both alignments This is less likely.

Accoun1ng for Gaps Score for a gap of length x is - (ρ + σx), where ρ >0 is the penalty for introducing a gap: gap opening penalty ρ will be large rela1ve to σ: gap extension penalty because you do not want to add too much of a penalty for extending the gap.

Affine Gap Penal1es Gap penal1es: - ρ- σ when there is 1 indel - ρ- 2σ when there are 2 indels - ρ- 3σ when there are 3 indels, etc. - ρ- x σ (- gap opening - x gap extensions) Somehow reduced penal1es (as compared to naïve scoring) are given to runs of horizontal and ver1cal edges

Heuris1c Algorithms Make reasonable assump1ons about nature of sequence alignments and try out only most likely Make alignments reasonable assump1ons about nature of sequence alignments and try out only most likely alignments Full perfect match (word, k- tuple) Much faster than DP algorithms but less sensi1ve Extend alignment un1l:

Exis1ng Tools

Different Sequence Alignment Database Search: BLAST, FASTA, HMMER Mul1ple Sequence Alignment: ClustalW, FSA Genomic Analysis: BLAT Short Read Sequence Alignment: BWA, Bow9e, drfast, GSNAP, SHRiMP, SOAP, MAQ

BLAST Basic Local Alignment Search Tool BLAST is faster than Smith- Waterman (DP algorithm), it cannot guarantee the op1mal alignments of the query and database sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences and iden1fy library sequences that resemble the query sequence above a certain threshold. Different types of BLASTs are available according to the query sequences.

Blast Versions BLASTN: Compares a nucleo1de query to a nucleo1de database BLASTP: Compares a protein query to a protein database BLASTX: Compares a translated nucleo1de query to protein database TBLASTN: Compares a protein query to a translated nucleo1de database TBLASTX: Compares a translated nucleo1de query to a translated nucleo1de database

hip://blast.ncbi.nlm.nih.gov/

Typical Uses of BLAST Which bacterial species have a protein that is related in lineage to a certain protein with known amino- acid sequence? Where does a certain sequence of DNA originate? What other genes encode proteins that exhibit structures or mo1fs such as ones thtat have been determined?

Mul1ple Alignment A mul1ple sequence alignment is a sequence alignment of three or more biological sequences (protein, DNA, RNA) The query sequences are assumed to have some sort of evolu1onary rela1onship by which they share some sort of lineage Many tools are available but this is a really complex computa1onal problem Used in phylogene1cs

Short Read Alignment SW Bow9e: memory- efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35- bp reads per hours Burrows- Wheeler Aligner (BWA): an aliger that implements two algorithms: bwa- short and BWA- SW. The former works for query sequences shorter than 200bp and the la^er for longer sequences up to around 100kbp.

Sequence Alignment/Map Format Sequence Reads Alignment SoMware SAM File Resequencing RNA Seq SNPs

SAM format A tab- delimited text format consis1ng of a header sec1on, which is op1onal, and an alignment sec1on Example Headers: @HD VN:1.0 SO:coordinate @SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128 @SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e @SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5 @RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37 Example Alignments: 1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37 19:20389:F:275+18M2D19M 99 1 17644 0 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 RG:Z:UM0098:1 XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37 19:20389:F:275+18M2D19M 147 1 17919 0 18M2D19M = 17644-314 GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT ;44999;499<8<8<<<8<<><<<<><7<;<<<>><< XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19 9:21597+10M2I25M:R:-209 83 1 21678 0 8M2I27M = 21469-244 CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT <;9<<5><<<<><<<>><<><>><9>><>>>9>>><> XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35

The Alignment Column

Bitwise Flags FLAG: bitwise FLAG. Each bit is explained in the following table:

Bitwise Representa1on 1 = 00000001! paired- end read 2 = 00000010! mapped as proper pair 4 = 00000100! unmapped read 8 = 00001000! read mate unmapped 16 = 00010000! read mapped on reverse strand Example: The flag 11! 1 + 2 + 8 = 00001011 (condi1ons 1, 2, 8) Flags 0, 4, and 16 are the flags most commonly used.

Mapping Quality Phred score, iden1cal to the quality measure in the fastq file. quality Q, probability P: P = 10 ^ (- Q / 10.0) If Q=30, P=1/1000!on average, one of out 1000 alignments will be wrong As good as this sounds it is not easy to compute such a quality.

Mapping Quality The repeat structure of the reference. Reads falling in repe11ve regions usually get very low mapping quality The base quality of the read. Low quality means the observed read sequence is possibly wrong, and wrong sequence may lead to a wrong alignment The sensi1vity of the alignment algorithm. The true hit is more likely to be missed by an algorithm with low sensi1vity, which also causes mapping errors Paired end or not. Reads mapped in pairs are more likely to be correct

BWA Specific High Scores A read alignment with a mapping quality 30 or above usually implies: The overall base quality of the read is good. The best alignment has few mismatches. The read has few or just one good hit on the reference, which means the current alignment is s1ll the best even if one or two bases are actually muta1ons or sequencing errors.

BWA Specific Low Scores Surprisingly difficult to track down the exact behavior Q=0!if a read can be aligned equally well to mul1ple posi1ons, BWA will randomly pick one posi1on and give it a mapping quality zero. Q=25!the edit distance equals mismatches and is greater than zero

What to do with low quality scores? Find repeat structures in the genome/con1g. Determine if there is a problem with your alignment or data (i.e. all the reads mapped with low quality scores). Filter them out. Very common to write a perl/ python script to filter out poorly aligned reads. Many, many, many other possibili1es.

The Alignment Column

CIGAR String CIGAR string is a compact representa1on of how the read aligned to the reference genome at that exact posi1on. More specifically, the CIGAR string is a sequence of of base lengths and the associated opera1on. match/mismatch with the reference. deleted/inserted from the from the reference.

Example of CIGAR RefPos: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Reference: C C A T A C T G A A C T G A C T A A C Read: A C T A G A A T G G C T In the SAM file you will have the following fields: POS: 5 CIGAR: 3M1I3M1D5M

Final Comments BAM is a compressed version of the SAM file format. There are mul1ple programs that convert BAM files to SAM files and vice versa. Tablet (h^p://bioinf.scri.ac.uk/tablet/) is an easy to use, program that allows you to visualize an alignment. You simply give it a sam file and a fasta file and it reads the sam file and shows you the alignment.