Lecture 12. Short read aligners
|
|
- Kevin Kennedy
- 5 years ago
- Views:
Transcription
1 Lecture 12 Short read aligners
2 Ebola reference genome We will align ebola sequencing data against the 1976 Mayinga reference genome. We will hold the reference gnome and all indices: mkdir -p ~/reference/ebola Get the ebola genome in FASTA format: efetch -db=nuccore -format=fasta -id=af > ~/reference/ebola/1976.fa
3 Build an index with bwa bwa index ~/reference/ebola/1976.fa
4 Align a paired-end dataset $fastq-dump -X split-files SRR $bwa mem -t 10 ~/reference/ebola/1976.fa SRR _1.fastq SRR _2.fastq > SRR sam
5 SAM The resulting file is in a so-called SAM format. It is one of the most recent bioinformatics data formats, one that by today has become the standard method to store and represent all high-throughput sequencing results A SAM file encompasses all known information about the sample and its alignment.
6 Help on bwa mem
7 Help on bwa mem
8 Sequence Alignment Map (SAM)
9 Overview The SAM format is the standard format for storing sequencing reads mapped to a reference. Its binary analog is BAM. The official specification of the SAM format:
10 Overview The SAM format is a TAB-delimited, line oriented text format with two sections: 1. Header: each line contains some metadata 2. Alignment: each line contains information on an alignment The SAM format specification lists the required and optional content for each of these sections.
11 Installing samtools echo export 'PATH=/opt/genomics/samtools-1.3.1:$ PATH' >> ~/.bashrc source ~/.bashrc
12 SAM header Header lines contain vital metadata about the reference sequences, read and sample information, and (optionally) processing steps and comments. Each header line begins with followed by a two-letter code that distinguishes the different type of metadata records in the header. Following this two-letter code are tab-delimited key-value pairs in the format KEY:VALUE.
13 SAM header
14 SAM store information about the reference sequence. The required key-values are 1. SN: stores the sequence name 2. LN: the sequence length All separate sequences in your reference have a corresponding entry in the header.
15 SAM contain important read group and sample metadata. 1. ID: The read group identifier ID is only required and must be unique. This ID value contains information about the origin of a set of reads. Some software relies on read groups to indicate a technical groups of reads, to account for batch effects. Consequently, it s beneficial to create read groups related to the specific sequencing run. 2. SM: Sample information is the metadata about your experiment s samples. 3. PL: sequencing platform
16 SAM contain metadata about the programs used to create and process a set of SAM/BAM files. Each program must have a unique ID value, and metadata such as program version number (VN) and the exact command line (CL) can be saved in these header entries. Many programs will add these line automatically.
17 samtools The standard way of interacting with SAM/BAM files is through the samtools. samtools view is the general tool for viewing and converting SAM/BAM files. A way to look at an entire SAM/BAM header is with
18 samtools We can see all read groups with samtools view without any arguments returns the entire alignment section without header:
19 SAM alignment The alignment section contains read alignments and usually includes reads that did not align. Each alignment entry is composed of 11 required fields and optional fields after this. 1. QNAME, 2. FLAG, 3. RNAME, 4. POS, 5. MAPQ, 6. CIGAR, 7-8. RNEXT and PNEXT, 9. TLEN, 10. SEQ, 11. QUAL
20 SAM alignment 1. QNAME: the name of the read 2. FLAG: the bitwise flag, which contains information about the alignment 3. RNAME: the reference name which sequence the query aligned to. The reference name must be in the SAM header as an SQ entry. If the read is unaligned, this entry may be *. 4. POS: the position on the reference sequence using 1-based indexing of the first mapping base in the query sequence. This may be zero if the read does not align.
21 SAM alignment MAPQ: the mapping quality, which is a measure of how likely the read is to actually originate from the position it maps to. Mapping quality is estimated by the aligner. Many tools downstream of aligners filter out reads that map with low mapping quality. CIGAR: the CIGAR string, which is a specialized format for describing the alignment. RNEXT and PNEXT (on the next line): the reference name and position of a paired-end read s partner. The value * indicates RNEXT is not available, and = indicates that RNEXT is the same as RNAME. PNEXT will be 0 when not available.
22 SAM alignment TLEN: the template length for paired-end reads SEQ: the original read sequence. This sequence will always be in the orientation it aligned in and this may be the reverse complement of the original read sequence. If your read aligned to the reverse strand, which is information kept in the bitwise flag field, this sequence will be the reverse complement. QUAL: the original read base quality.
23 Bitwise flags Bitwise flags are a very space-efficient and common way to encode attributes. Bitwise flags are like a series of toggle switches, each of which can be either on or off. Each of these toggle switches values are bits (0 or 1) of a binary number. Each bit in a bit-field represents a particular attribute about an alignment, with 1 indicating that the attribute is true and 0 indicating it s false
24 Bitwise flags
25 Bitwise flags As an example, suppose you encounter the bit-flag 147 (0x93 in hexadecimal) and you want to know what this says about this alignment. In binary this number is represented as We see that this corresponds to the attributes paired-end, aligned in proper pair, the sequence is reverse complemented, and the second read in the pair.
26 Primary, secondary and supplementary Primary alignment: the best alignments by score. If there are multiple alignment that match equally well the aligner will designate one as the primary alignment. Secondary alignment: In case of multiple matches one is designated as primary and all others are secondary Supplementary alignment: a read that cannot be represented as a single linear alignment and matches two different locations without significant overlap. These are also known as chimeric alignments. What gets called as chimeric alignment is software dependent.
27 samtools flags samtools flags generates
28 samtools flags samtools flags can translate decimal and hexadecimal flags:
29 samtools flagstat To generate statistics on the flags:
30 Properly aligned pairs R2 unmapped R2 mapped to a different chromosome incorrect orientation correct orientation but in valid distance
31 CIGAR strings CIGAR (Compact Idiosyncratic Gapped Alignment Representation) strings are another specialized way to encode information about an aligned sequence. While bitwise flags store true/false properties about an alignment, CIGAR strings encode information about which bases of an alignment are matches/mismatches, insertions, deletions, soft or hard clipped, and so on.
32 CIGAR strings As an example: A basic CIGAR string contains concatenated pairs of integer lengths and character operations.
33 CIGAR strings Operation Description M Alignment match (note that this could be a sequence match or mismatch) I Insertion, the nucleotide is present in the read but not in the reference D Deletion, the nucleotide is present in the reference but not in the read N Skipped region (from reference) S Soft-clipped region (soft-clipped regions are present in sequence in SEQ filed) H Hard-clipped region (not in sequence in SEQ field) P Padding = Sequence match X Sequence mismatch
34 CIGAR strings Soft clipping is when only part of the query sequence is aligned to the reference, leaving some portion of the query sequence unaligned. It occurs when an aligner can partially map a read to a location, but the alignment at the end of the sequence is questionable. Hard clipping is similar, but hard-clipped regions are not present in the sequence stored in the SAM filed SEQ.
35 CIGAR strings Clipped alignment: In local alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. 3S8M1D6M4S
36 CIGAR strings Spliced alignment: In cdna-to-genome alignment, we want to distinguish introns from deletions in exons. 9M32N8M
37 CIGAR strings 51M: a fully aligned 51bp read without insertions or deletions. By the SAM format specification, M means there s an alignment match, not that all bases in the query and reference sequence are identical. 43S6M1I26M: the first 43bp are soft clipped, the next 6 were matches/mismatches, then a 1bp insertion to the reference, and finally, 26 matches/mismatches. The SAM format specification mandates that all M, I, S, =, X operations lengths must add to the length of the seuqnece.
38 Padded alignment Most short-read aligners do not present how inserted sequences are aligned against each other. Alignment with inserted sequences fully aligned is called padded alignment. Padded alignment is produced by de novo assemblers and is important for an alignment viewer to display the alignment properly. To store padded alignment, we introduce P which can be considered as a silent deletion from padded reference sequence.
39 Padded alignment REF: TCA--GAC R1 : TCAGAGAC R2 : TCA-AGAC R3 : TCA--GAC Padded CIGAR R1 : 3M2I3M R2 : 3M1P1I3M R3 : 3M2P3M Unpadded CIGAR R1 : 3M2I3M R2 : 3M1I3M R3 : 6M
40 samtools
41 samtools view
42 Converting between SAM and BAM samtools view allows us to convert SAM to BAM with the b option. Similarly, we can go from BAM to SAM:
43 Converting between SAM and BAM When converting BAM to SAM, samtools view will not include the SAM header by default. SAM files without headers cannot be converted into BAM files. We can include the header with h:
44 Converting between SAM and BAM We only need to convert BAM to SAM when manually inspecting files. In general, it is better to store files in BAM format, as it s more space efficient, compatible with all samtools subcommands, and faster to process because tools can directly read in binary values rather than require parsing SAM strings.
45 Sort and index We sort by alignment position and index a BAM file to allow for fast random access to reads aligned within a certain region. We sort a BAM file with samtools sort:
46 Sort and index
47 Sort and index Sorting a large number of alignment can be very computationally intensive, so samtools sort has options that allow you to increase the memory allocation and parallelize sorting across multiple threads.
48 Sort and index Position-sorted BAM files are the starting point for most later processing steps such as SNP calling and extracting alignments from specific regions. Sorted BAM files are much more disk-space efficient than unsorted BAM files. Often, we want to work with alignments within a particular region in the genome. Iterating through an entire BAM file just to work with a subset of reads at a potion would be inefficient. Consequently, BAM files can be indexed. The BAM file must be sorted first, and we cannot index SAM file.
49 Sort and index To index a position-sorted BAM file, we use: This creates a file named SRR sorted.bam.bai, which contain the index for the BAM file.
50 Extracting alignments With a position-sorted and indexed BAM file, we can extract specific regions of an alignment with samtools view As an example, get a GenBank format file for the Ebola virus: $efetch -db=nuccore -format=gb -id=af > AF gb
51 Extracting alignments Pick a gene NP: Then, let s take a look at some alignments in this region
52 Extracting alignments We can also count the number of alignments that overlap with this region:
53 Filtering alignments samtools view has options for filtering alignments based on bitwise flags, mapping quality, read group. Two options to filter based on bitwise flags: -f: only outputs reads with the specified flags -F: only outputs reads without the specified flags.
54 Filtering alignments Suppose you want to output all reads that are unmapped.
55 Filtering alignments It is also possible to output reads with multiple bitwise flags set. For example, we could find the first reads that aligned in a proper pair alignment.
56 Filtering alignments We can use F option to extract alignments that do not have any of the bits set of the supplied flag argument. For example, suppose we wanted to extract all aligned reads
57 Filtering alignments Suppose you want to extract all reads that did not align in a proper pair (the read is aligned and paired, but not aligned in a proper pair). A naïve approach with F 2 is incorrect because both unmapped reads and unpaired reads will be included.
58 Filtering alignments We do this by combining bits:
SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.
Alignment of NGS reads, samtools and visualization Hands-on Software used in this practical BWA MEM : Burrows-Wheeler Aligner. A software package for mapping low-divergent sequences against a large reference
More informationThe SAM Format Specification (v1.3 draft)
The SAM Format Specification (v1.3 draft) The SAM Format Specification Working Group July 15, 2010 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text
More informationINTRODUCTION AUX FORMATS DE FICHIERS
INTRODUCTION AUX FORMATS DE FICHIERS Plan. Formats de séquences brutes.. Format fasta.2. Format fastq 2. Formats d alignements 2.. Format SAM 2.2. Format BAM 4. Format «Variant Calling» 4.. Format Varscan
More informationThe SAM Format Specification (v1.3-r837)
The SAM Format Specification (v1.3-r837) The SAM Format Specification Working Group November 18, 2010 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited
More informationFile Formats: SAM, BAM, and CRAM. UCD Genome Center Bioinformatics Core Tuesday 15 September 2015
File Formats: SAM, BAM, and CRAM UCD Genome Center Bioinformatics Core Tuesday 15 September 2015 / BAM / CRAM NEW! http://samtools.sourceforge.net/ - deprecated! http://www.htslib.org/ - SAMtools 1.0 and
More informationGenomic Files. University of Massachusetts Medical School. October, 2014
.. Genomic Files University of Massachusetts Medical School October, 2014 2 / 39. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further
More informationGenomic Files. University of Massachusetts Medical School. October, 2015
.. Genomic Files University of Massachusetts Medical School October, 2015 2 / 55. A Typical Deep-Sequencing Workflow Samples Fastq Files Fastq Files Sam / Bam Files Various files Deep Sequencing Further
More informationSequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.
Sequencing Short Alignment Tobias Rausch 7 th June 2010 WGS RNA-Seq Exon Capture ChIP-Seq Sequencing Paired-End Sequencing Target genome Fragments Roche GS FLX Titanium Illumina Applied Biosystems SOLiD
More informationHigh-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg
High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines 454 GS Junior,
More informationHigh-throughout sequencing and using short-read aligners. Simon Anders
High-throughout sequencing and using short-read aligners Simon Anders High-throughput sequencing (HTS) Sequencing millions of short DNA fragments in parallel. a.k.a.: next-generation sequencing (NGS) massively-parallel
More informationSAM / BAM Tutorial. EMBL Heidelberg. Course Materials. Tobias Rausch September 2012
SAM / BAM Tutorial EMBL Heidelberg Course Materials Tobias Rausch September 2012 Contents 1 SAM / BAM 3 1.1 Introduction................................... 3 1.2 Tasks.......................................
More informationRead Naming Format Specification
Read Naming Format Specification Karel Břinda Valentina Boeva Gregory Kucherov Version 0.1.3 (4 August 2015) Abstract This document provides a standard for naming simulated Next-Generation Sequencing (Ngs)
More informationHigh-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg
High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg Established platforms HTS Platforms Illumina HiSeq, ABI SOLiD, Roche 454 Newcomers: Benchtop machines: Illumina MiSeq,
More informationNGS Data Analysis. Roberto Preste
NGS Data Analysis Roberto Preste 1 Useful info http://bit.ly/2r1y2dr Contacts: roberto.preste@gmail.com Slides: http://bit.ly/ngs-data 2 NGS data analysis Overview 3 NGS Data Analysis: the basic idea http://bit.ly/2r1y2dr
More informationWelcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.
Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your
More informationSequence Alignment/Map Format Specification
Sequence Alignment/Map Format Specification The SAM/BAM Format Specification Working Group 28 Feb 2014 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing
More informationSequence Alignment/Map Format Specification
Sequence Alignment/Map Format Specification The SAM/BAM Format Specification Working Group 2 Sep 2016 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing
More informationThe SAM Format Specification (v1.4-r956)
The SAM Format Specification (v1.4-r956) The SAM Format Specification Working Group April 12, 2011 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text
More informationFrom fastq to vcf. NGG 2016 / Evolutionary Genomics Ari Löytynoja /
From fastq to vcf Overview of resequencing analysis samples fastq fastq fastq fastq mapping bam bam bam bam variant calling samples 18917 C A 0/0 0/0 0/0 0/0 18969 G T 0/0 0/0 0/0 0/0 19022 G T 0/1 1/1
More informationSequence Alignment/Map Optional Fields Specification
Sequence Alignment/Map Optional Fields Specification The SAM/BAM Format Specification Working Group 14 Jul 2017 The master version of this document can be found at https://github.com/samtools/hts-specs.
More informationEnsembl RNASeq Practical. Overview
Ensembl RNASeq Practical The aim of this practical session is to use BWA to align 2 lanes of Zebrafish paired end Illumina RNASeq reads to chromosome 12 of the zebrafish ZV9 assembly. We have restricted
More informationRNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF
RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF Acknowledgments Genomics Virtual Lab: gvl.org.au Galaxy for tutorials: galaxy-tut.genome.edu.au Galaxy Australia: galaxy-aust.genome.edu.au
More informationChIP-seq (NGS) Data Formats
ChIP-seq (NGS) Data Formats Biological samples Sequence reads SRA/SRF, FASTQ Quality control SAM/BAM/Pileup?? Mapping Assembly... DE Analysis Variant Detection Peak Calling...? Counts, RPKM VCF BED/narrowPeak/
More informationGoal: Learn how to use various tool to extract information from RNAseq reads.
ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2017 Class 4 RNAseq Goal: Learn how to use various tool to extract information from RNAseq reads. Input(s): Output(s): magnaporthe_oryzae_70-15_8_supercontigs.fasta
More informationBioinformatics in next generation sequencing projects
Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet March 2011 Once sequenced the problem becomes computational
More informationThe SAM Format Specification (v1.4-r994)
The SAM Format Specification (v1.4-r994) The SAM Format Specification Working Group January 27, 2012 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text
More informationTiling Assembly for Annotation-independent Novel Gene Discovery
Tiling Assembly for Annotation-independent Novel Gene Discovery By Jennifer Lopez and Kenneth Watanabe Last edited on September 7, 2015 by Kenneth Watanabe The following procedure explains how to run the
More informationLecture 8. Sequence alignments
Lecture 8 Sequence alignments DATA FORMATS bioawk bioawk is a program that extends awk s powerful processing of tabular data to processing tasks involving common bioinformatics formats like FASTA/FASTQ,
More informationCycle «Analyse de données de séquençage à haut-débit» Module 1/5 Analyse ADN. Sophie Gallina CNRS Evo-Eco-Paléo (EEP)
Cycle «Analyse de données de séquençage à haut-débit» Module 1/5 Analyse ADN Sophie Gallina CNRS Evo-Eco-Paléo (EEP) (sophie.gallina@univ-lille1.fr) Module 1/5 Analyse DNA NGS Introduction Galaxy : upload
More informationSequence Alignment/Map Format Specification
Sequence Alignment/Map Format Specification The SAM/BAM Format Specification Working Group 14 Jul 2017 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing
More informationmerantk Version 1.1.1a
DIVISION OF BIOINFORMATICS - INNSBRUCK MEDICAL UNIVERSITY merantk Version 1.1.1a User manual Dietmar Rieder 1/12/2016 Page 1 Contents 1. Introduction... 3 1.1. Purpose of this document... 3 1.2. System
More informationSupplementary Figure 1. Fast read-mapping algorithm of BrowserGenome.
Supplementary Figure 1 Fast read-mapping algorithm of BrowserGenome. (a) Indexing strategy: The genome sequence of interest is divided into non-overlapping 12-mers. A Hook table is generated that contains
More informationRead Mapping and Variant Calling
Read Mapping and Variant Calling Whole Genome Resequencing Sequencing mul:ple individuals from the same species Reference genome is already available Discover varia:ons in the genomes between and within
More informationNGS Analyses with Galaxy
1 NGS Analyses with Galaxy Introduction Every living organism on our planet possesses a genome that is composed of one or several DNA (deoxyribonucleotide acid) molecules determining the way the organism
More informationUnder the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS Researchers April 16, 2014 Gabe Rudy VP of Product Development Golden Helix Questions during the presentation Use the Questions pane in your GoToWebinar window
More informationBenchmarking of RNA-seq aligners
Lecture 17 RNA-seq Alignment STAR Benchmarking of RNA-seq aligners Benchmarking of RNA-seq aligners Benchmarking of RNA-seq aligners Benchmarking of RNA-seq aligners Based on this analysis the most reliable
More informationSMALT Manual. December 9, 2010 Version 0.4.2
SMALT Manual December 9, 2010 Version 0.4.2 Abstract SMALT is a pairwise sequence alignment program for the efficient mapping of DNA sequencing reads onto genomic reference sequences. It uses a combination
More informationDemultiplexing Illumina sequencing data containing unique molecular indexes (UMIs)
next generation sequencing analysis guidelines Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs) See what more we can do for you at www.idtdna.com. For Research Use Only
More informationv0.3.0 May 18, 2016 SNPsplit operates in two stages:
May 18, 2016 v0.3.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.
More informationMerge Conflicts p. 92 More GitHub Workflows: Forking and Pull Requests p. 97 Using Git to Make Life Easier: Working with Past Commits p.
Preface p. xiii Ideology: Data Skills for Robust and Reproducible Bioinformatics How to Learn Bioinformatics p. 1 Why Bioinformatics? Biology's Growing Data p. 1 Learning Data Skills to Learn Bioinformatics
More informationMapping NGS reads for genomics studies
Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization
More informationBriefly: Bioinformatics File Formats. J Fass September 2018
Briefly: Bioinformatics File Formats J Fass September 2018 Overview ASCII Text Sequence Fasta, Fastq ~Annotation TSV, CSV, BED, GFF, GTF, VCF, SAM Binary (Data, Compressed, Executable) Data HDF5 BAM /
More informationWorking with aligned nucleotides (WORK- IN-PROGRESS!)
Working with aligned nucleotides (WORK- IN-PROGRESS!) Hervé Pagès Last modified: January 2014; Compiled: November 17, 2017 Contents 1 Introduction.............................. 1 2 Load the aligned reads
More informationv0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting
October 08, 2015 v0.2.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.
More informationRNA-seq. Manpreet S. Katari
RNA-seq Manpreet S. Katari Evolution of Sequence Technology Normalizing the Data RPKM (Reads per Kilobase of exons per million reads) Score = R NT R = # of unique reads for the gene N = Size of the gene
More informationUMass High Performance Computing Center
UMass High Performance Computing Center University of Massachusetts Medical School February, 2019 Challenges of Genomic Data 2 / 93 It is getting easier and cheaper to produce bigger genomic data every
More informationSequence Analysis Pipeline
Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation
More informationStandard output. Some of the output files can be redirected into the standard output, which may facilitate in creating the pipelines:
Lecture 18 RNA-seq Alignment Standard output Some of the output files can be redirected into the standard output, which may facilitate in creating the pipelines: Filtering of the alignments STAR performs
More informationNGS Data and Sequence Alignment
Applications and Servers SERVER/REMOTE Compute DB WEB Data files NGS Data and Sequence Alignment SSH WEB SCP Manpreet S. Katari App Aug 11, 2016 Service Terminal IGV Data files Window Personal Computer/Local
More informationMapping reads to a reference genome
Introduction Mapping reads to a reference genome Dr. Robert Kofler October 17, 2014 Dr. Robert Kofler Mapping reads to a reference genome October 17, 2014 1 / 52 Introduction RESOURCES the lecture: http://drrobertkofler.wikispaces.com/ngsandeelecture
More informationCORE Year 1 Whole Genome Sequencing Final Data Format Requirements
CORE Year 1 Whole Genome Sequencing Final Data Format Requirements To all incumbent contractors of CORE year 1 WGS contracts, the following acts as the agreed to sample parameters issued by NHLBI for data
More informationIdentiyfing splice junctions from RNA-Seq data
Identiyfing splice junctions from RNA-Seq data Joseph K. Pickrell pickrell@uchicago.edu October 4, 2010 Contents 1 Motivation 2 2 Identification of potential junction-spanning reads 2 3 Calling splice
More informationIntroduction to NGS analysis on a Raspberry Pi. Beta version 1.1 (04 June 2013)
Introduction to NGS analysis on a Raspberry Pi Beta version 1.1 (04 June 2013)!! Contents Overview Contents... 3! Overview... 4! Download some simulated reads... 5! Quality Control... 7! Map reads using
More informationm6aviewer Version Documentation
m6aviewer Version 1.6.0 Documentation Contents 1. About 2. Requirements 3. Launching m6aviewer 4. Running Time Estimates 5. Basic Peak Calling 6. Running Modes 7. Multiple Samples/Sample Replicates 8.
More informationRCAC. Job files Example: Running seqyclean (a module)
RCAC Job files Why? When you log into an RCAC server you are using a special server designed for multiple users. This is called a frontend node ( or sometimes a head node). There are (I think) three front
More informationNGS Analysis Using Galaxy
NGS Analysis Using Galaxy Sequences and Alignment Format Galaxy overview and Interface Get;ng Data in Galaxy Analyzing Data in Galaxy Quality Control Mapping Data History and workflow Galaxy Exercises
More informationSAMtools. SAM BAM. mapping. BAM sort & indexing (ex: IGV) SNP call
SAMtools http://samtools.sourceforge.net/ SAM/BAM mapping BAM SAM BAM BAM sort & indexing (ex: IGV) mapping SNP call SAMtools NGS Program: samtools (Tools for alignments in the SAM format) Version: 0.1.19
More informationDr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata
Analysis of RNA sequencing data sets using the Galaxy environment Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Microarray and Deep-sequencing core facility 30.10.2017 RNA-seq workflow I Hypothesis
More informationCRAM format specification (version 2.1)
CRAM format specification (version 2.1) cram-dev@ebi.ac.uk 23 Apr 2018 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version c8b9990 from that
More informationPreparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers
Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Data used in the exercise We will use D. melanogaster WGS paired-end Illumina data with NCBI accessions
More informationVariant calling using SAMtools
Variant calling using SAMtools Calling variants - a trivial use of an Interactive Session We are going to conduct the variant calling exercises in an interactive idev session just so you can get a feel
More informationGalaxy Platform For NGS Data Analyses
Galaxy Platform For NGS Data Analyses Weihong Yan wyan@chem.ucla.edu Collaboratory Web Site http://qcb.ucla.edu/collaboratory Collaboratory Workshops Workshop Outline ü Day 1 UCLA galaxy and user account
More informationHelpful Galaxy screencasts are available at:
This user guide serves as a simplified, graphic version of the CloudMap paper for applicationoriented end-users. For more details, please see the CloudMap paper. Video versions of these user guides and
More informationBioinformatics Framework
Persona: A High-Performance Bioinformatics Framework Stuart Byma 1, Sam Whitlock 1, Laura Flueratoru 2, Ethan Tseng 3, Christos Kozyrakis 4, Edouard Bugnion 1, James Larus 1 EPFL 1, U. Polytehnica of Bucharest
More informationImporting sequence assemblies from BAM and SAM files
BioNumerics Tutorial: Importing sequence assemblies from BAM and SAM files 1 Aim With the BioNumerics BAM import routine, a sequence assembly in BAM or SAM format can be imported in BioNumerics. A BAM
More informationExome sequencing. Jong Kyoung Kim
Exome sequencing Jong Kyoung Kim Genome Analysis Toolkit The GATK is the industry standard for identifying SNPs and indels in germline DNA and RNAseq data. Its scope is now expanding to include somatic
More informationDindel User Guide, version 1.0
Dindel User Guide, version 1.0 Kees Albers University of Cambridge, Wellcome Trust Sanger Institute caa@sanger.ac.uk October 26, 2010 Contents 1 Introduction 2 2 Requirements 2 3 Optional input 3 4 Dindel
More informationAn Introduction to Linux and Bowtie
An Introduction to Linux and Bowtie Cavan Reilly November 10, 2017 Table of contents Introduction to UNIX-like operating systems Installing programs Bowtie SAMtools Introduction to Linux In order to use
More informationVariation among genomes
Variation among genomes Comparing genomes The reference genome http://www.ncbi.nlm.nih.gov/nuccore/26556996 Arabidopsis thaliana, a model plant Col-0 variety is from Landsberg, Germany Ler is a mutant
More informationNext Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010
Next Generation Sequence Alignment on the BRC Cluster Steve Newhouse 22 July 2010 Overview Practical guide to processing next generation sequencing data on the cluster No details on the inner workings
More informationThe software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).
Release Notes Agilent SureCall 4.0 Product Number G4980AA SureCall Client 6-month named license supports installation of one client and server (to host the SureCall database) on one machine. For additional
More informationall M 2M_gt_15 2M_8_15 2M_1_7 gt_2m TopHat2
Pairs processed per second 6, 4, 2, 6, 4, 2, 6, 4, 2, 6, 4, 2, 6, 4, 2, 6, 4, 2, 72,318 418 1,666 49,495 21,123 69,984 35,694 1,9 71,538 3,5 17,381 61,223 69,39 55 19,579 44,79 65,126 96 5,115 33,6 61,787
More informationMapping. Reference. read
Mapping Reference read Assembly vs mapping contig1 contig2 reads bly as s em ll v sa all ma pp all ing vs r efe ren ce Reference What s the problem? Reads differ from the genome due to evolution and sequencing
More informationLong Read RNA-seq Mapper
UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGENEERING AND COMPUTING MASTER THESIS no. 1005 Long Read RNA-seq Mapper Josip Marić Zagreb, February 2015. Table of Contents 1. Introduction... 1 2. RNA Sequencing...
More informationWilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment
An Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at https://blast.ncbi.nlm.nih.gov/blast.cgi
More informationRsubread package: high-performance read alignment, quantification and mutation discovery
Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For
More informationSequence Mapping and Assembly
Practical Introduction Sequence Mapping and Assembly December 8, 2014 Mary Kate Wing University of Michigan Center for Statistical Genetics Goals of This Session Learn basics of sequence data file formats
More informationSSAHA2 Manual. September 1, 2010 Version 0.3
SSAHA2 Manual September 1, 2010 Version 0.3 Abstract SSAHA2 maps DNA sequencing reads onto a genomic reference sequence using a combination of word hashing and dynamic programming. Reads from most types
More informationMar. Guide. Edico Genome Inc North Torrey Pines Court, Plaza Level, La Jolla, CA 92037
Mar 2017 DRAGEN TM Quick Start Guide www.edicogenome.com info@edicogenome.com Edico Genome Inc. 3344 North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Notice Contents of this document and associated
More informationAgroMarker Finder manual (1.1)
AgroMarker Finder manual (1.1) 1. Introduction 2. Installation 3. How to run? 4. How to use? 5. Java program for calculating of restriction enzyme sites (TaqαI). 1. Introduction AgroMarker Finder (AMF)is
More informationAnalyzing Variant Call results using EuPathDB Galaxy, Part II
Analyzing Variant Call results using EuPathDB Galaxy, Part II In this exercise, we will work in groups to examine the results from the SNP analysis workflow that we started yesterday. The first step is
More informationSequence Alignment: Mo1va1on and Algorithms
Sequence Alignment: Mo1va1on and Algorithms Mo1va1on and Introduc1on Importance of Sequence Alignment For DNA, RNA and amino acid sequences, high sequence similarity usually implies significant func1onal
More informationSlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching
SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching Ilya Y. Zhbannikov 1, Samuel S. Hunter 1,2, Matthew L. Settles 1,2, and James
More informationRead mapping with BWA and BOWTIE
Read mapping with BWA and BOWTIE Before We Start In order to save a lot of typing, and to allow us some flexibility in designing these courses, we will establish a UNIX shell variable BASE to point to
More informationReads Alignment and Variant Calling
Reads Alignment and Variant Calling CB2-201 Computational Biology and Bioinformatics February 22, 2016 Emidio Capriotti http://biofold.org/ Institute for Mathematical Modeling of Biological Systems Department
More informationMapping and Viewing Deep Sequencing Data bowtie2, samtools, igv
Mapping and Viewing Deep Sequencing Data bowtie2, samtools, igv Frederick J Tan Bioinformatics Research Faculty Carnegie Institution of Washington, Department of Embryology tan@ciwemb.edu 27 August 2013
More informationRsubread package: high-performance read alignment, quantification and mutation discovery
Rsubread package: high-performance read alignment, quantification and mutation discovery Wei Shi 14 September 2015 1 Introduction This vignette provides a brief description to the Rsubread package. For
More informationPackage Rbowtie. January 21, 2019
Type Package Title R bowtie wrapper Version 1.23.1 Date 2019-01-17 Package Rbowtie January 21, 2019 Author Florian Hahne, Anita Lerch, Michael B Stadler Maintainer Michael Stadler
More informationHIPPIE User Manual. (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu)
HIPPIE User Manual (v0.0.2-beta, 2015/4/26, Yih-Chii Hwang, yihhwang [at] mail.med.upenn.edu) OVERVIEW OF HIPPIE o Flowchart of HIPPIE o Requirements PREPARE DIRECTORY STRUCTURE FOR HIPPIE EXECUTION o
More informationMaize genome sequence in FASTA format. Gene annotation file in gff format
Exercise 1. Using Tophat/Cufflinks to analyze RNAseq data. Step 1. One of CBSU BioHPC Lab workstations has been allocated for your workshop exercise. The allocations are listed on the workshop exercise
More informationThe software and data for the RNA-Seq exercise are already available on the USB system
BIT815 Notes on R analysis of RNA-seq data The software and data for the RNA-Seq exercise are already available on the USB system The notes below regarding installation of R packages and other software
More informationMapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6
Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6 The goal of this exercise is to retrieve an RNA-seq dataset in FASTQ format and run it through an RNA-sequence analysis
More informationOur data for today is a small subset of Saimaa ringed seal RNA sequencing data (RNA_seq_reads.fasta). Let s first see how many reads are there:
Practical Course in Genome Bioinformatics 19.2.2016 (CORRECTED 22.2.2016) Exercises - Day 5 http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2016/ Answer the 5 questions (Q1-Q5) according
More informationPerl for Biologists. Session 8. April 30, Practical examples. (/home/jarekp/perl_08) Jon Zhang
Perl for Biologists Session 8 April 30, 2014 Practical examples (/home/jarekp/perl_08) Jon Zhang Session 8: Examples CBSU Perl for Biologists 1.1 1 Review of Session 7 Regular expression: a specific pattern
More informationMIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September
MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September 27 2014 Static Dynamic Static Minimum Information for Reporting
More informationPre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory
Pre-processing and quality control of sequence data Barbera van Schaik KEBB - Bioinformatics Laboratory b.d.vanschaik@amc.uva.nl Topic: quality control and prepare data for the interesting stuf Keep Throw
More informationResequencing Analysis. (Pseudomonas aeruginosa MAPO1 ) Sample to Insight
Resequencing Analysis (Pseudomonas aeruginosa MAPO1 ) 1 Workflow Import NGS raw data Trim reads Import Reference Sequence Reference Mapping QC on reads Variant detection Case Study Pseudomonas aeruginosa
More informationSentieon Documentation
Sentieon Documentation Release 201808.03 Sentieon, Inc Dec 21, 2018 Sentieon Manual 1 Introduction 1 1.1 Description.............................................. 1 1.2 Benefits and Value..........................................
More informationSuper-Fast Genome BWA-Bam-Sort on GLAD
1 Hututa Technologies Limited Super-Fast Genome BWA-Bam-Sort on GLAD Zhiqiang Ma, Wangjun Lv and Lin Gu May 2016 1 2 Executive Summary Aligning the sequenced reads in FASTQ files and converting the resulted
More informationSep. Guide. Edico Genome Corp North Torrey Pines Court, Plaza Level, La Jolla, CA 92037
Sep 2017 DRAGEN TM Quick Start Guide www.edicogenome.com info@edicogenome.com Edico Genome Corp. 3344 North Torrey Pines Court, Plaza Level, La Jolla, CA 92037 Notice Contents of this document and associated
More information