Sequence Alignment. GS, Introduction to Bioinformatics, The University of Texas GSBS program, Fall 2013
1 Sequence Alignment. GS, Introduction to Bioinformatics, The University of Texas GSBS program, Fall 2013. Ken Chen, Ph.D., Department of Bioinformatics and Computational Biology, UT MD Anderson Cancer Center
2 Acknowledgements. Slides benefited from: Canadian Bioinformatics Workshops; Stuart M. Brown, NYU School of Medicine; Wikipedia; unknown online contributors
3 Outline. Sequence alignment (50 mins). Reading: Bioinformatics and Functional Genomics, Chapters 3-6 (page ). Break (10 mins). NGS alignment (30 mins). Lab (50 mins)
4 What is a sequence alignment?
tcctctgcctctgccatcat---caaccccaaagt
tcctgtgcatctgcaatcatgggcaaccccaaagt
Wikipedia: a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Questions: What is the process? Arrange the sequences of DNA, RNA, or protein. What is the purpose? Identify functional, structural, or evolutionary relationships among sequences.
5 Understanding our genome
6 Understanding genomic variations: SNPs, haplotypes and tag SNPs.
7 What differences are expected? Genome evolution. SNV (or SNP): ATA → AGA. Insertion: AA → AGA. Deletion: ACTG → ACG. Indel: an insertion or a deletion. DNV (or DNP): ATGA → AGCA. MNV (or MNP): ATGTA → AGCGA.
8 What differences are expected? Structural variation (SV) relative to the reference: large deletion, tandem duplication, VNTR, dispersed duplication, novel insertions, SINE/LINE insertions, inversions, translocation, complex.
9 What makes a good alignment? Two sequences: 1. TGTG 2. TAG. Four candidate alignments (subject above, query below):
TGTG    TGTG    T-AG    TA-G
T-AG    TA-G    TGTG    TGTG
Questions: Are these 4 alignments equally good? T->A, G->A: which is more likely in your study? Transition/transversion ratio? Mutation mechanism? Deamination? T-A, A-G: which is more likely? Sequence context? For two species, which of these two sequences is ancestral?
10 Mutation spectra in colon cancer cell lines
11 Why is it a problem? Why is it a challenging problem? 1. Massive amount of data. 2. The solution (hypothesis) space is huge (infinite): finding needles in a haystack. Why is it still a problem? 1. Not all possible hypotheses are examined: aligners are good at substitutions and indels, but poor at structural variations. 2. Non-unique solutions? 3. No solution (twilight zone)? This happens when sequences are very different. 4. Scoring systems not appropriate for the problem. 5. Efficiency.
12 A mathematical definition for pairwise alignment. Two sequences: $X = [x_1, x_2, \ldots, x_m]$, $Y = [y_1, y_2, \ldots, y_n]$. All possible alignments in between: $A(X,Y)$. Each alignment (hypothesis): $a \in A(X,Y)$. Has a score: $F(a)$. The alignment problem is to find: $a^* = \arg\max_{a \in A(X,Y)} F(a)$
13 How to compute scores? $a^* = \arg\max_{a \in A(X,Y)} F(a)$. Sequence 1: actaccagttcatttgatacttctcaaa. Sequence 2: taccattaccgtgttaactgaaaggacttaaagact. Example: a simple scoring system (a 4 x 4 matrix over A, G, C, T). Match: 1. Mismatch: 0. Score = 5
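The simple scoring system above can be sketched in a few lines of Python. This is an illustration only: the function name `simple_score` is hypothetical, and it assumes two already-aligned, gap-free strings of equal length (the slide does not define a gap score at this point).

```python
def simple_score(a, b, match=1, mismatch=0):
    """Score two equal-length, gap-free sequences position by position."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    # Add `match` for identical positions and `mismatch` otherwise.
    return sum(match if p == q else mismatch for p, q in zip(a, b))


# The SNV example from slide 7: ATA versus AGA agree at two of three positions.
print(simple_score("ATA", "AGA"))
```

With match = 1 and mismatch = 0 the score is simply the number of identical positions.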
14 What about indels? Questions: 1. Are shorter indels always more likely than longer indels?
TGTAG
T---G
Affine gap penalties (Gotoh, 1982): $\gamma(n) = d + (n - 1)e$, where $d$ is the gap-open penalty and $e$ is the gap-extend penalty. In a coding region? Frameshift/in-frame. [Figure: $\gamma(n)$ versus gap length $n$, with $d$ and $e$ marked.] Nature Genetics 44, (2012). Zhan et al. BMC Genomics :557 doi: /
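As a small worked example of the affine formula $\gamma(n) = d + (n - 1)e$, the sketch below compares one long gap against several short ones. The values d = 5 and e = 1 are illustrative assumptions, not values from the slides.

```python
def affine_gap_penalty(n, d, e):
    """Penalty for a single gap of length n: gap-open d plus (n - 1) gap-extends e."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return d + (n - 1) * e


# Hypothetical parameters: opening a gap is expensive, extending it is cheap.
d, e = 5, 1
one_gap_of_three = affine_gap_penalty(3, d, e)      # 5 + 2*1 = 7
three_gaps_of_one = 3 * affine_gap_penalty(1, d, e)  # 3 * 5 = 15
print(one_gap_of_three, three_gaps_of_one)
```

Under these assumed values a single 3-base gap (as in TGTAG / T---G) is penalized less than three separate 1-base gaps, which is the behavior an affine model is meant to capture.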
15 How to explore the hypothesis space? $a^* = \arg\max_{a \in A(X,Y)} F(a)$. Q: How to align TGTG and TAG? 1. This looks best: T over T. 2. Which one is better: TG over T-, or TG over TA? 3. Keep promising hypotheses only.
16 Alignment path matrix. Scoring system: match = 1, mismatch = 0, gap = -1. Columns (subject): T G T G. Rows (query): T A G. [Figure: matrix with the first row and column initialized by gap penalties (-1, -2, -3, ...).] Each cell represents the best partial alignment.
17 Alignment path matrix. Scoring system: match = 1, mismatch = 0, gap = -1. [Figure: matrix partially filled, with the candidate partial alignments shown for each cell.]
18 Fill in the matrix and remember the choices. Scoring system: match = 1, mismatch = 0, gap = -1. [Figure: fully filled matrix for subject TGTG and query TAG.]
19 Trace back from the end. Scoring system: match = 1, mismatch = 0, gap = -1. Subject: TGTG. Query: TAG. The trace-back reads the alignment in reverse (GTGT over GA-T); flip it to obtain TGTG over T-AG.
20 Dynamic Programming. Applies when a large search space can be structured into a succession of stages, such that: the initial stage contains trivial solutions to sub-problems; each partial solution can be calculated by recurring on a fixed number of partial solutions in an earlier stage; the final stage contains the overall solution. Question: Is it guaranteed to find the best solution?
21 The Needleman-Wunsch algorithm (1970, J Mol Biol. 48(3):443-53)
1. Create a table of size $(m+1) \times (n+1)$ for sequences $X$ and $Y$ of lengths $m$ and $n$.
2. Fill table entries $(m{:}1)$ and $(1{:}n)$ with the values: $F_{i,1} = \sum_{k=1}^{i} \sigma(x_k, -)$, $F_{1,j} = \sum_{k=1}^{j} \sigma(-, y_k)$
3. Starting from the top left, compute each entry using the recursive relation:
$$F_{i,j} = \max \begin{cases} F_{i-1,j-1} + \sigma(x_i, y_j) \\ F_{i-1,j} + \sigma(x_i, -) \\ F_{i,j-1} + \sigma(-, y_j) \end{cases}$$
4. Perform the trace-back procedure from the bottom-right corner.
Limitations: substitutions and indels only.
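The four steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it uses the scoring system from the alignment path matrix slides (match = 1, mismatch = 0, gap = -1, linear gaps), indexes the table from 0 rather than 1, and breaks ties in the trace-back in an arbitrary fixed order, so it reports only one of possibly several optimal alignments.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    # Step 1: table of size (m+1) x (n+1).
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Step 2: first column and first row hold cumulative gap penalties.
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # Step 3: fill each cell from its three neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align x_i with y_j
                          F[i - 1][j] + gap,    # x_i against a gap
                          F[i][j - 1] + gap)    # y_j against a gap
    # Step 4: trace back from the bottom-right corner.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    # The trace-back builds the alignment in reverse, so flip it.
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("TGTG", "TAG")
print(score)
print(top)
print(bottom)
```

For TGTG and TAG the optimal score is 1, and both TGTG/T-AG and TGTG/TA-G reach it, which illustrates the non-unique solutions mentioned earlier.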
22 Local/global alignment. Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size: Needleman-Wunsch algorithm. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context: Smith-Waterman algorithm.
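A local alignment can be sketched with a small change to the global recursion: cell values are floored at zero and the trace-back starts from the highest-scoring cell instead of the corner. The scoring values below (match = 2, mismatch = -1, gap = -2) are illustrative assumptions; a local scoring system needs negative mismatch and gap scores so that poor regions are not extended.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Local alignment of x and y; returns (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


# Two otherwise dissimilar sequences that share a short motif.
print(smith_waterman("AAAGGGTTT", "CCCGGGCCC"))
```

Only the shared GGG motif is reported; a global aligner would be forced to pair up the unrelated flanks as well.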
23 Break
24 Types of sequence alignment. blastp: protein to protein (1). blastn: DNA to DNA (1). blastx: DNA to protein (6). tblastn: protein to DNA (6). tblastx: DNA to DNA (36). Number of possible amino acid sequences in the universe? $= 20^n$?
25 BLAST algorithm
1. Make a k-letter word list of the query sequence.
2. Generate a list of high-scoring words: PQG, PEG, PAG, QPE, QAE, ...
3. Scan the database sequences for exact matches. Extend the exact matches to high-scoring segment pairs (HSPs).
4. Evaluate the significance of the HSP score.
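Steps 1 and 3 above (word list and exact-match scan) can be sketched as below. This is a simplified illustration, not the BLAST implementation: it skips step 2 (the neighborhood of high-scoring words, which needs a substitution matrix), does not extend seeds into HSPs, and the function names and example sequences are hypothetical.

```python
def words(query, k=3):
    """Step 1: every overlapping k-letter word in the query, in order."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def find_seeds(query, subject, k=3):
    """Step 3 (first half): (query_pos, subject_pos) of exact word matches."""
    # Index each query word by the positions where it occurs.
    index = {}
    for i, w in enumerate(words(query, k)):
        index.setdefault(w, []).append(i)
    seeds = []
    for sj in range(len(subject) - k + 1):
        for qi in index.get(subject[sj:sj + k], []):
            seeds.append((qi, sj))
    return seeds


print(words("PQGEF"))                  # ['PQG', 'QGE', 'GEF']
print(find_seeds("PQGEF", "AAPQGAA"))  # [(0, 2)]
```

Looking words up in a hash table is what lets the scan run in time roughly proportional to the database length instead of filling a full dynamic-programming table for every database sequence.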
26 BLAST scores. $E = K m n e^{-\lambda S}$. E value: the number of different alignments with scores equivalent to or better than $S$ that are expected to occur by chance in a database search. $m$, $n$: length of the query sequence and length of the entire database, respectively. $K$, $\lambda$: constants derived by Karlin and Altschul, 1990; they depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies). $S' = \dfrac{\lambda S - \ln K}{\ln 2}$, $E = m n \, 2^{-S'}$. $S'$: bit score, a normalized score so that alignments obtained from employing different scoring matrices in separate BLAST searches can be compared. $P = 1 - e^{-E}$. Small E values (0.05 or less) correspond closely to the P values.
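The relations between raw score, bit score, E value, and P value can be checked numerically. The sketch below is only an illustration: the values of `lam`, `K`, `m`, `n`, and `S` are made-up placeholders, since the real constants depend on the scoring system and database.

```python
import math


def e_value_from_raw(S, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * S)


def bit_score(S, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


def e_value_from_bits(S_bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-S_bits)


def p_value(E):
    """P = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-E)


# Placeholder parameters, chosen only to exercise the formulas.
lam, K, m, n, S = 0.3, 0.1, 250, 1_000_000, 80
E1 = e_value_from_raw(S, m, n, lam, K)
E2 = e_value_from_bits(bit_score(S, lam, K), m, n)
print(E1, E2, p_value(E1))
```

Both routes give the same E value, and for small E the P value is nearly identical to E, as the slide states.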
27 Basic Local Alignment Search Tools
28 Example
CCGCAGGCAACCGCCAATTTCACTGCCAAGGTTCGTTGGCAGACCGTCCTGGCTTCAAAACGACCG ATAACGGTAAGTCTTGGCACGTAGGTGGTTATTTGATCGTTGGGATGATTGTGGTTAACAACAATC TACATACACATTTTCATATGACCGCCTTCGTTAATAAGCTTATATAGACATAAATATATAAGGTGC CATGTATTTAACAGAGCAGATTATGGACAGGCCAAAACCTAGAACAGTAAAGGAACTAGCAGACAC TCTTGTGATTCCTTTAATAGATTTGTTGATACCTTGTAAATTTTGCAATAGATTTTTATCTTATTT TGAGCTACTTAATTTTGATCACAAGTGTTTACAGCTTATTTGGACAGAGGAGGATTTGGTGTATGG ACTCTGTAGTAGCTGTGCTTATGCGTCTGCACAGTTAGAATTTACACATTTTTTTCAATTTGCTGT AGTTGGAAAAGATATAGAAACTGTAGAAGGAACAGCTATTGGAAATATTTGTATTAGGTGTCGCTA CTGTTTTAAGTTATTAGACTTAGTGGAGAAGT
http://blast.ncbi.nlm.nih.gov/blast.cgi
>gi ref NC_ Human papillomavirus type 9, complete genome
29 The next generation sequencing paired-end alignment. Levy et al., 2007; Wheeler et al., 2008; Bentley et al., 2008; Ley et al., 2008. DNA Samples, NGS. Meyerson et al.,
30 Challenges in aligning NGS reads against human genomes. Large size (3 Gbp). 45% repeats: transposons, segmental duplications, simple repeats (VNTR, homopolymers). Large structural polymorphism: the human leukocyte antigen (HLA) regions on chr6. Gap or unfinished regions: peri-centromere, sub-telomere. ~5 Mb unique to ethnic groups (e.g., African, Asian). Finishing errors (1/10,000 bp). Reference represents minor alleles.
31 Challenges in aligning NGS reads. Short reads bp (versus a very long reference): non-unique alignment; sensitive to sequencing errors. Massive amount of short reads: 1.5 billion 100 bp reads to cover the human genome 50 times (50x); 1 Illumina 600 Gb 100 bp run = 6 billion reads. Small insert size bp libraries < AluY (300 bp), L1Hs (6000 bp)
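The read counts quoted above follow from simple arithmetic, sketched below. The function names are hypothetical; the inputs (3 Gbp genome, 100 bp reads, 50x coverage, 600 Gb per run) are the figures given on the slides.

```python
def reads_for_coverage(genome_bp, read_bp, coverage):
    """Number of reads needed so that total sequenced bases = coverage * genome size."""
    return genome_bp * coverage // read_bp


def reads_per_run(run_bp, read_bp):
    """Number of reads produced by a run with the given total base yield."""
    return run_bp // read_bp


genome_bp = 3_000_000_000   # ~3 Gbp human genome
read_bp = 100               # 100 bp reads

print(reads_for_coverage(genome_bp, read_bp, 50))   # 1.5 billion reads for 50x
print(reads_per_run(600_000_000_000, read_bp))      # 6 billion reads in a 600 Gb run
```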
32 Hashing (full-text search)
33 Burrows-Wheeler Transformation
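Since only the slide title survives here, the following is a generic textbook sketch of the Burrows-Wheeler transform, not a reconstruction of the slide. It builds all rotations explicitly, which is fine for a toy string but far too slow and memory-hungry for a genome; real aligners such as BWA build the transform from a suffix array and search it with an FM-index. The `$` terminator is assumed to sort before every other character.

```python
def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    if not text.endswith("$"):
        text += "$"  # unique terminator, smaller than A, C, G, T and letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last_column):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The original text is the rotation that ends with the terminator.
    return next(row for row in table if row.endswith("$"))


print(bwt("banana"))            # annb$aa
print(inverse_bwt("annb$aa"))   # banana$
```

The transform is reversible and tends to group identical characters together, which is what makes the compressed index of a 3 Gbp reference small enough to hold in memory.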
34 Illumina AB SOLiD Roche 454 Helicos gapped all alignments multithreaded Bowtie X X X X BWA X X X X X X BFAST X X X X X X X Corona Lite X X ELAND X GenomeMapper X X X X gnumap X X X X karma X X X * MAQ X X MOSAIK X X X X X X X MrFAST X X X MrsFAST X X Novoalign X X X * RMAP X X SeqMap X X X SHRiMP X X X X X X Slider X X SOAP2 X X X SSAHA2 X X X X SOCS X X SXOligoSearch X X X X Zoom X X * X 34 Slides by M. Stromberg
35 NGS sequence data: FASTQ format. Read name: uniquely identifies a read; carries sequencing information (sequencer, lane, date, etc.), paired-end status, barcode, etc. Nucleotide sequence: {A,C,G,T,N}. Base quality: ASCII symbols, each representing an integer quality score of the corresponding base, i.e., how likely the corresponding base is an error, as determined from image processing (intensity, chastity, etc.)
$ zcat G1-1_ATCACG_L001_R1_002.fastq.gz
1:N:0:ATCACG
CGGAAAAAAACGGAATTATCGAATGGAATCGAAGAGAATCATCGAATGGACCCGAATGGAATCATCTCATGGAATGGAATGGAATAATCCATGGACTCGA
+
CCCFFFFFGGHHHJJJIJIEECDGIIIIJIIIGCEHGIJJGEHCCF9FHGGGIDHFFFFDDCEDEDCCDEDDDDD@>CDDC:@CCAACD:AAC>>ACC@A
$ zcat G1-1_ATCACG_L001_R2_002.fastq.gz
2:N:0:ATCACG
ACTCGATGATTCCATTCGATTCCATTCGATGATGATTGCATTCGAGTCCATGGATTATTCCATTCCATTCCATGAGATGATTCCATTCGGGTCCATTCGA
+
CCCFFFFFHHHHHJDIJJJJJJJIIJJJIFJJJIJEIJGHIJIJIJHIJIGCEGIDIJJJIJJJJEIJJJIIIIIJJGIICHIJIIIGGGIGHHHFFFF@
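To make the quality string concrete, the sketch below reads four-line FASTQ records and decodes each quality character into an integer score and an error probability. It assumes the Phred+33 encoding (ASCII code minus 33), which is consistent with characters such as `9` and `:` appearing in the example above, and it assumes well-formed records with no blank lines. The record name `read1/1` and the function names are hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines."""
    stripped = (line.rstrip("\n") for line in lines)
    # Each record is exactly four lines: @name, sequence, +, quality.
    for header, seq, _plus, qual in zip(*[stripped] * 4):
        yield header[1:], seq, qual


def phred_scores(quality, offset=33):
    """Decode ASCII quality characters into integer Phred scores."""
    return [ord(c) - offset for c in quality]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


record = ["@read1/1\n", "CGGAA\n", "+\n", "CCCFF\n"]  # hypothetical record
for name, seq, qual in parse_fastq(record):
    scores = phred_scores(qual)
    print(name, seq, scores, [round(error_probability(q), 5) for q in scores])
```

For a gzipped file such as the one shown in the slide, the same functions could be fed from `gzip.open(path, "rt")`.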
36 Sequence Alignment/Map (SAM) format. One entry (line) per alignment; one read can have multiple entries. Flag: encodes 11 boolean values (e.g., strand of alignment; is this the unique alignment?). CIGAR: encodes the alignment operations (matches/mismatches, insertions, deletions). Fields, in order: read name, flag, ref. seq, ref. pos, map quality, CIGAR string, mate-pair info, bases, QVs. Example: SRR _2 16 chr M * 0 0 CCTTGTTTGGAAGTAGGGTTTTGCACCTGGAACC FGGEGFG??;DDDDDFAFDDFF?AFGGEEGFAGG. Module 34 bioinformatics.ca
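The flag and CIGAR fields can be decoded with a few lines of Python. This is a sketch based on the bit values and operation letters in the current SAM specification (which defines one more flag bit, 0x800, than the 11 mentioned above); the CIGAR string `30M2I2M` is a made-up example because the one in the slide's record was lost, and the function names are hypothetical.

```python
import re

# Flag bits as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "properly paired"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """List the names of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(cigar):
    """Reference bases covered: only M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(16))            # the flag value in the slide's example record
print(parse_cigar("30M2I2M"))     # hypothetical CIGAR
print(reference_span("30M2I2M"))  # 32
```

Note that `M` covers both matches and mismatches; the exact mismatched positions are recorded in optional tags such as MD and NM, which `samtools calmd` can recalculate.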
37 SAM/BAM format. SAM = text, BAM = binary. Header: reference sequence, length, alignment tool.
38 Mapping quality of reads. Multiple solutions: how likely is each one, comparatively? Is there a best one?
$$P(\text{pos} \mid \text{read}, \text{refseq}) = \frac{P(\text{read} \mid \text{refseq}, \text{pos})}{\sum_{\text{pos}} P(\text{read}, \text{pos} \mid \text{refseq})}$$
Li H., Ruan J., and Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res.
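The idea of comparing candidate positions can be sketched numerically. The example below is a simplification under stated assumptions: it takes a uniform prior over a short list of candidate positions (so the posterior is each likelihood divided by their sum), reports the Phred-scaled probability that the best position is wrong, and caps the result at an arbitrary value of 60. The likelihood values, the cap, and the function names are illustrative and are not taken from the cited paper or from any aligner's source code.

```python
import math


def position_posteriors(likelihoods):
    """Posterior of each candidate position, assuming a uniform prior."""
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


def mapping_quality(likelihoods, cap=60):
    """Phred-scaled probability that the most likely position is wrong."""
    total = sum(likelihoods)
    # Probability mass on every position other than the best one.
    p_wrong = (total - max(likelihoods)) / total
    if p_wrong <= 0:
        return cap  # a single candidate: no evidence of ambiguity
    return min(cap, -10 * math.log10(p_wrong))


# Hypothetical values of P(read | refseq, pos) at candidate positions.
print(mapping_quality([1e-2, 1e-4]))   # one clearly better hit: about 20
print(mapping_quality([1e-2, 1e-2]))   # two equally good hits: about 3
print(mapping_quality([1e-2]))         # unique hit: capped at 60
```

A read that fits two places equally well (for example inside a repeat) gets a mapping quality near 3, which downstream variant callers typically treat as unreliable.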
39 BWA: alignment via Burrows-Wheeler transformation
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: r16
Contact: Heng Li
Usage: bwa <command> [options]
Command: index       index sequences in the FASTA format
         aln         gapped/ungapped alignment
         samse       generate alignment (single ended)
         sampe       generate alignment (paired ended)
         bwasw       BWA-SW for long queries
         fa2pac      convert FASTA to PAC format
         pac2bwt     generate BWT from PAC
         pac2bwtgen  alternative algorithm for generating BWT
         bwtupdate   update .bwt to the new format
         pac_rev     generate reverse PAC
         bwt2sa      generate SA from BWT and Occ
         pac2cspac   convert PAC to color-space PAC
         stdsw       standard SW/NW alignment
40 samtools (Sequence Alignment/Map)
Program: samtools (Tools for alignments in the SAM format)
Version: dev (r982:313)
Usage: samtools <command> [options]
Command: view        SAM<->BAM conversion
         sort        sort alignment file
         mpileup     multi-way pileup
         depth       compute the depth
         faidx       index/extract FASTA
         tview       text alignment viewer
         index       index alignment
         idxstats    BAM index stats (r595 or later)
         fixmate     fix mate information
         flagstat    simple stats
         calmd       recalculate MD/NM tags and '=' bases
         merge       merge sorted alignments
         rmdup       remove PCR duplicates
         reheader    replace BAM header
         cat         concatenate BAMs
         targetcut   cut fosmid regions (for fosmid pool only)
         phase       phase heterozygotes
41 Integrative Genomics Viewer. Nature Biotechnology 29, (2011) doi: /nbt
42 Lab (50 minutes). http://odin.mdacc.tmc.edu/~kchen3/seqlab/seqlab.htm 1. Identify the sequence alterations that evolved the 2nd sequence from the 1st (10-15 min): compare 3 algorithms (BLAST, Needleman-Wunsch, and Smith-Waterman); interpret the alignment results. 2. NGS sequence alignment (25-30 min): generate synthetic NGS data using wgsim; perform reference-guided alignment using bwa; view and understand the alignment using samtools; view the alignment in IGV.
43 Lab session 1: pairwise alignment. Pairwise nucleotide sequence alignment. Compare the following 2 sequences:
ACGGGCTGAGTGGAGAGCTGTGACCCACAGAGATC
GTAATCACGGGCTGAGTGTCACAGCTCTCCCCACAGAGATCGTGAAG
using NCBI Nucleotide BLAST (bl2seq), the Needleman-Wunsch algorithm, and the Smith-Waterman algorithm. Questions: Did you get the same results from the 3 aligners? Why or why not? What events are required to evolve seq2 from seq1?
44 BLAST
45 Needleman-Wunsch
46 Smith-Waterman
47 Lab session 1
Starting sequence:
ACGGGCTGAGTGGAGAGCTGTGACCCACAGAGATC
1. Inversion
ACGGGCTGAGTGGAGAGCTGTGACCCACAGAGATC
ACGGGCTGAGTGTCACAGCTCTCCCCACAGAGATC
2. Insertions
ACGGGCTGAGTGGAGAGCTGTGACCCACAGAGATC
GTAATCACGGGCTGAGTGTCACAGCTCTCCCCACAGAGATCGTGAAG
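The two events can be verified directly from the sequences on the slide. In the sketch below, the slice boundaries (positions 12 to 23, 0-based) and the flanking inserts `GTAATC` and `GTGAAG` were worked out by comparing the two sequences by eye; they are not stated in the original slides.

```python
def reverse_complement(seq):
    """Reverse the sequence and swap A<->T, C<->G."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


seq1 = "ACGGGCTGAGTGGAGAGCTGTGACCCACAGAGATC"
seq2 = "GTAATCACGGGCTGAGTGTCACAGCTCTCCCCACAGAGATCGTGAAG"

# Event 1: invert the 11-base segment GAGAGCTGTGA in the middle of seq1.
inverted = seq1[:12] + reverse_complement(seq1[12:23]) + seq1[23:]
print(inverted)

# Event 2: insert 6 bases at each end.
rebuilt = "GTAATC" + inverted + "GTGAAG"
print(rebuilt == seq2)
```

An inversion appears to a substitution-and-indel aligner as a run of mismatches, which is one reason the three tools in the lab can report different alignments for the same pair.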
48 Session 2: Identify DNA origin
49 Which species has the following DNA sequence?
aatcttaaaagcaggttatataggctaaatagaactaatcattgttttagacatacttat
tgactctaagaggaaagatgaagtactatgttttaaagaatattatattacagaattata
gaaattagatctcttacctaaactcttcataatgcttgctctgataggaaaatgagatct
actgttttcctttacttactacacctcagatatatttcttcatgaagacctcacagtaaa
aataggtgattttggtctagctacagtgaaatctcgatggagtgggtcccatcagtttga
acagttgtctggatccattttgtggatggtaagaattgaggctatttttccactgattaa
atttttggccctgagatgctgctgagttactagaaagtcattgaaggtctcaactatagt
attttcatagttcccagtattcacaaaaatcagtgttcttattttttatgtaaatagatt
ttttaacttttttctttacccttaaaacgaatattttgaaaccagtttc
50 Session 2: NGS sequence alignment
Open a terminal (under /Applications/Utilities in Finder)
  mkdir Desktop/seqaln
Download the reference sequence http://odin.mdacc.tmc.edu/~kchen3/seqlab/hg18.chr20.10k_200k.fa by holding the Control key, clicking the hyperlink, and selecting "Save Linked File As"; save it in Desktop/seqaln
You may need to rename the downloaded file:
  mv hg18.chr20.10k_200k.fa.txt hg18.chr20.10k_200k.fa
Download bwa, save in Desktop/seqaln
Download samtools, save in Desktop/seqaln
Download wgsim, save in Desktop/seqaln
  cd Desktop/seqaln
  chmod +x bwa samtools wgsim
Produce synthetic reads from the reference sequence
  ./wgsim -N hg18.chr20.10k_200k.fa sim_1.fq sim_2.fq > sim.mutation.txt
Take a look at sim_1.fq and sim_2.fq. Can you tell which reads are paired?
Align the synthetic reads to the reference using bwa
  ./bwa index hg18.chr20.10k_200k.fa
  ./bwa aln hg18.chr20.10k_200k.fa sim_1.fq > sim_1.sai
  ./bwa aln hg18.chr20.10k_200k.fa sim_2.fq > sim_2.sai
  ./bwa sampe hg18.chr20.10k_200k.fa sim_1.sai sim_2.sai sim_1.fq sim_2.fq > sim.sam
Examine sim.sam, e.g., more sim.sam. Do you know the meaning of each column? Can you tell which reads are paired?
Prepare the BAM file
  ./samtools view -b -S sim.sam > sim.bam
  ./samtools sort sim.bam sim.sorted
  ./samtools index sim.sorted.bam
What are the file sizes of sim.bam and sim.sam? How much do they differ?
View the BAM file in IGV
  Launch IGV by double-clicking Finder/Applications/IGV 2.1
  File/Import Genome: select the FASTA file Desktop/seqaln/hg18.chr20.10k_200k.fa and give it the name hg18.chr20sim
  Load sim.sorted.bam
  Navigate in IGV. Can you save an image?
GBS Bioinformatics Pipeline(s) Overview Getting from sequence files to genotypes. Pipeline Coding: Ed Buckler Jeff Glaubitz James Harriman Presentation: Terry Casstevens With supporting information from
More informationBLAST. NCBI BLAST Basic Local Alignment Search Tool
BLAST NCBI BLAST Basic Local Alignment Search Tool http://www.ncbi.nlm.nih.gov/blast/ Global versus local alignments Global alignments: Attempt to align every residue in every sequence, Most useful when
More informationMapping and Viewing Deep Sequencing Data bowtie2, samtools, igv
Mapping and Viewing Deep Sequencing Data bowtie2, samtools, igv Frederick J Tan Bioinformatics Research Faculty Carnegie Institution of Washington, Department of Embryology tan@ciwemb.edu 27 August 2013
More informationWilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST
A Simple Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at http://www.ncbi.nih.gov/blast/
More informationBLAST & Genome assembly
BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies November 17, 2012 1 Introduction Introduction 2 BLAST What is BLAST? The algorithm 3 Genome assembly De
More informationRNA-seq Data Analysis
Seyed Abolfazl Motahari RNA-seq Data Analysis Basics Next Generation Sequencing Biological Samples Data Cost Data Volume Big Data Analysis in Biology تحلیل داده ها کنترل سیستمهای بیولوژیکی تشخیص بیماریها
More informationMapping Reads to Reference Genome
Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene
More informationFile Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015
File Formats: SAM, BAM, and CRAM UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 / BAM / CRAM NEW! http://samtools.sourceforge.net/ - deprecated! http://www.htslib.org/ - SAMtools 1.0 and
More informationMapping reads to a reference genome
Introduction Mapping reads to a reference genome Dr. Robert Kofler October 17, 2014 Dr. Robert Kofler Mapping reads to a reference genome October 17, 2014 1 / 52 Introduction RESOURCES the lecture: http://drrobertkofler.wikispaces.com/ngsandeelecture
More informationEnsembl RNASeq Practical. Overview
Ensembl RNASeq Practical The aim of this practical session is to use BWA to align 2 lanes of Zebrafish paired end Illumina RNASeq reads to chromosome 12 of the zebrafish ZV9 assembly. We have restricted
More informationDr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata
Analysis of RNA sequencing data sets using the Galaxy environment Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Microarray and Deep-sequencing core facility 30.10.2017 RNA-seq workflow I Hypothesis
More informationAligners. J Fass 23 August 2017
Aligners J Fass 23 August 2017 Definitions Assembly: I ve found the shredded remains of an important document; put it back together! UC Davis Genome Center Bioinformatics Core J Fass Aligners 2017-08-23
More informationSMALT Manual. December 9, 2010 Version 0.4.2
SMALT Manual December 9, 2010 Version 0.4.2 Abstract SMALT is a pairwise sequence alignment program for the efficient mapping of DNA sequencing reads onto genomic reference sequences. It uses a combination
More informationChapter 4: Blast. Chaochun Wei Fall 2014
Course organization Introduction ( Week 1-2) Course introduction A brief introduction to molecular biology A brief introduction to sequence comparison Part I: Algorithms for Sequence Analysis (Week 3-11)
More informationBioinformatics. Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing
Bioinformatics Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing Journal: Bioinformatics Manuscript ID: BIOINF-0-0 Category: Original Paper Date Submitted
More informationPreliminary Syllabus. Genomics. Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification
Preliminary Syllabus Sep 30 Oct 2 Oct 7 Oct 9 Oct 14 Oct 16 Oct 21 Oct 25 Oct 28 Nov 4 Nov 8 Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification OCTOBER BREAK
More informationHeuristic methods for pairwise alignment:
Bi03c_1 Unit 03c: Heuristic methods for pairwise alignment: k-tuple-methods k-tuple-methods for alignment of pairs of sequences Bi03c_2 dynamic programming is too slow for large databases Use heuristic
More informationAgroMarker Finder manual (1.1)
AgroMarker Finder manual (1.1) 1. Introduction 2. Installation 3. How to run? 4. How to use? 5. Java program for calculating of restriction enzyme sites (TaqαI). 1. Introduction AgroMarker Finder (AMF)is
More informationIntroduction to NGS analysis on a Raspberry Pi. Beta version 1.1 (04 June 2013)
Introduction to NGS analysis on a Raspberry Pi Beta version 1.1 (04 June 2013)!! Contents Overview Contents... 3! Overview... 4! Download some simulated reads... 5! Quality Control... 7! Map reads using
More informationINTRODUCTION TO BIOINFORMATICS
Molecular Biology-2017 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain
More informationNGS Sequence data. Jason Stajich. UC Riverside. jason.stajich[at]ucr.edu. twitter:hyphaltip stajichlab
NGS Sequence data Jason Stajich UC Riverside jason.stajich[at]ucr.edu twitter:hyphaltip stajichlab Lecture available at http://github.com/hyphaltip/cshl_2012_ngs 1/58 NGS sequence data Quality control
More informationChapter 4. Sequence Comparison
1896 1920 1987 2006 Chapter 4. Sequence Comparison 1 Contents 1. Sequence comparison 2. Sequence alignment 3. Sequence mapping Reading materials Required 1. A general method applicable to the search for
More informationPRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR
PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR GOAL OF THIS SESSION Assuming that The audiences know how to perform GWAS
More informationNGS Data Visualization and Exploration Using IGV
1 What is Galaxy Galaxy for Bioinformaticians Galaxy for Experimental Biologists Using Galaxy for NGS Analysis NGS Data Visualization and Exploration Using IGV 2 What is Galaxy Galaxy for Bioinformaticians
More informationHISAT2. Fast and sensi0ve alignment against general human popula0on. Daehwan Kim
HISA2 Fast and sensi0ve alignment against general human popula0on Daehwan Kim infphilo@gmail.com History about BW, FM, XBW, GBW, and GFM BW (1994) BW for Linear path Burrows M, Wheeler DJ: A Block Sor0ng
More informationShort Read Alignment. Mapping Reads to a Reference
Short Read Alignment Mapping Reads to a Reference Brandi Cantarel, Ph.D. & Daehwan Kim, Ph.D. BICF 05/2018 Introduction to Mapping Short Read Aligners DNA vs RNA Alignment Quality Pitfalls and Improvements
More informationTutorial 4 BLAST Searching the CHO Genome
Tutorial 4 BLAST Searching the CHO Genome Accessing the CHO Genome BLAST Tool The CHO BLAST server can be accessed by clicking on the BLAST button on the home page or by selecting BLAST from the menu bar
More informationWorkshop on Genomics 2015 Sequence Alignment: An brief introduction. Konrad Paszkiewicz University of Exeter, UK
Workshop on Genomics 2015 Sequence Alignment: An brief introduction Konrad Paszkiewicz University of Exeter, UK k.h.paszkiewicz@exeter.ac.uk Contents Alignment algorithms for short-reads - Background Blast
More informationAnalyzing ChIP- Seq Data in Galaxy
Analyzing ChIP- Seq Data in Galaxy Lauren Mills RISS ABSTRACT Step- by- step guide to basic ChIP- Seq analysis using the Galaxy platform. Table of Contents Introduction... 3 Links to helpful information...
More informationChIP-seq (NGS) Data Formats
ChIP-seq (NGS) Data Formats Biological samples Sequence reads SRA/SRF, FASTQ Quality control SAM/BAM/Pileup?? Mapping Assembly... DE Analysis Variant Detection Peak Calling...? Counts, RPKM VCF BED/narrowPeak/
More informationBioinformatics explained: Smith-Waterman
Bioinformatics Explained Bioinformatics explained: Smith-Waterman May 1, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com
More informationMaize genome sequence in FASTA format. Gene annotation file in gff format
Exercise 1. Using Tophat/Cufflinks to analyze RNAseq data. Step 1. One of CBSU BioHPC Lab workstations has been allocated for your workshop exercise. The allocations are listed on the workshop exercise
More informationBioinformatics explained: BLAST. March 8, 2007
Bioinformatics Explained Bioinformatics explained: BLAST March 8, 2007 CLC bio Gustav Wieds Vej 10 8000 Aarhus C Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com info@clcbio.com Bioinformatics
More informationFASTA. Besides that, FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm.
FASTA INTRODUCTION Definition (by David J. Lipman and William R. Pearson in 1985) - Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence
More informationMapping. Reference. read
Mapping Reference read Assembly vs mapping contig1 contig2 reads bly as s em ll v sa all ma pp all ing vs r efe ren ce Reference What s the problem? Reads differ from the genome due to evolution and sequencing
More informationUNIVERSITY OF OSLO. Department of informatics. Parallel alignment of short sequence reads on graphics processors. Master thesis. Bjørnar Andreas Ruud
UNIVERSITY OF OSLO Department of informatics Parallel alignment of short sequence reads on graphics processors Master thesis Bjørnar Andreas Ruud April 29, 2011 2 Table of Contents 1 Abstract... 7 2 Acknowledgements...
More information