ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2017

Class 4 RNAseq

Goal: Learn how to use various tools to extract information from RNAseq reads.

Input(s):
    magnaporthe_oryzae_70-15_8_supercontigs.fasta
    Moryzae_70-15_*_RNA_sample_{1-2}.fastq
    magnaporthe_oryzae-70-15_8_transcripts.gtf

Output(s):
    70-15_RNA_sample_{1-3}_thout directory
    70-15_RNA_sample_{1-3}_clout directory
    merged.gtf file
    gene_exp.diff file

4.1 Mapping RNAseq Reads to a Genome Assembly

We will use TopHat2 to align RNAseq reads to a genome assembly of the fungal strain from which they were derived (strain 70-15).

Trapnell et al. (2009) TopHat: discovering splice junctions with RNAseq. Bioinformatics 25.

TopHat2 uses the Bowtie2 alignment engine to map RNAseq reads to the genome assembly. Bowtie2 uses an indexed transformation of the genome assembly to perform its alignment, so the first step is to create the relevant indexes.

Usage: bowtie2-build [options] -f <reference_genome> <index_prefix>

where <reference_genome> is the path to the genome multifasta file and <index_prefix> is the name to be given to the index.

Change to the rnaseq directory. Remember, there is no need to leave this directory. All operations, such as listing of subdirectories, can be performed from this location.

Generate the bowtie index:

    bowtie2-build -f magnaporthe_oryzae_70-15_8_supercontigs.fasta \
        Moryzae

-f  tells bowtie2-build that the reference is in fasta format (this is the default). The reference itself can be a single multifasta file or a comma-separated list of fasta files.

Create a new directory called index and place the resulting index files inside it (note: the relevant files will have a .bt2 suffix).

Use TopHat2 to map each set of RNAseq reads to the bowtie index:

Usage: tophat2 [options] -o <output_dir> <path-to-indexes> <input-file(s)>

    tophat2 -p 1 -o 70-15_mycelial_RNA_sample_1_thout index/Moryzae \
        Moryzae_70-15_mycelial_RNA_sample_1.fastq

-p  number of processors to use (select 1). Note: you only have one available to you for this exercise, but normally you would run as many as are available.
-o  name of output directory

TopHat2 invoked with the above command will produce an output folder (70-15_mycelial_RNA_sample_1_thout) containing several files and a subdirectory containing log files:

accepted_hits.bam      contains alignment information for all of the reads that were successfully mapped to the genome
align_summary.txt      provides an overall summary of alignment statistics
left_kept_reads_info   minimum read length; maximum read length; total reads; successfully mapped reads
insertions.bed         lists nucleotide insertions in the input sequences
deletions.bed          lists nucleotide deletions in the input sequences
junctions.bed          lists splice junctions
logs                   records summary data from intermediate steps
prep_reads.info        provides information on filtering of reads
unmapped.bam           contains .bam entries for unmapped reads

Use a command line function to take a look at the results in the accepted_hits.bam file. Hint: to view the file, you will either need to change into the output directory created by TopHat, or specify the complete path to the file you wish to view.

Does the output make any sense? No? Let's quit the Unix command (^c) and use samtools to convert the .bam file into the human-readable .sam format:

    samtools view 70-15_mycelial_RNA_sample_1_thout/accepted_hits.bam

Whoa! Did you catch all that? Quit the process (^c) and try piping the results through the more command. Next, use redirection to write the output to a file.

Repeat the mapping process for the remaining sequence files (remember that you need to be in the rnaseq directory):

    Moryzae_70-15_mycelial_RNA_sample_2.fastq
    Moryzae_70-15_spore_RNA_sample_1.fastq

Hint: you can use the up arrow key to copy the previous command to the current command line buffer. However, you must remember to change the input and output names to prevent overwriting of previous results.

4.2 Assembling Transcripts From RNAseq Data

We will use cufflinks to build transcripts from RNAseq reads and compare expression profiles between different RNA samples.

Trapnell et al. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28.

The first step in differential gene expression analysis is to identify the gene from which each sequence read is derived. Cufflinks examines the raw RNAseq mapping results and attempts to reconstruct complete transcripts and identify transcript isoforms based on overlapping alignments.

Usage: cufflinks [options] -o <output_dir> <path/to/accepted_hits.bam>

Make sure you are in the rnaseq directory. Run cufflinks, providing a reference transcriptome in the form of a .gtf file.

All one line:

    cufflinks -p 1 -g magnaporthe_oryzae_70-15_8_transcripts.gtf \
        -o 70-15_mycelial_RNA_sample_1_clout \
        70-15_mycelial_RNA_sample_1_thout/accepted_hits.bam

-o              name of output directory
-p              number of processors to use
-g/--GTF-guide  tells cufflinks to use the provided reference annotation to guide transcript assembly but also to report novel transcripts/isoforms

Notes:

A) Omitting the -g option (and the accompanying .gtf file specification) from the above command would tell the program to generate a de novo transcript assembly. Alternatively, one can use -G/--GTF, which tells the program to consider only those reads that correspond to previously identified genes/transcripts.

B) The developers recommend that you assemble your replicates individually, i) to speed computation; and ii) to simplify junction identification. Therefore, you will need to run cufflinks separately for each of your .bam files.

With the above example, the results will be saved in a directory named 70-15_mycelial_RNA_sample_1_clout.

Re-run cufflinks using each of your accepted_hits.bam files, remembering to change the sample name (mycelial sample_1) in both the input and output folder names.

Examine one of the .gtf files produced by cufflinks. See if you can determine what information is contained in the various columns.

4.3 Merging Transcript Assemblies

We will use cuffmerge to generate a super-assembly of transcripts based on the mapping information from all three RNAseq datasets. Cuffmerge identifies overlaps between alignment data for different RNAseq datasets. In this way, it can assemble complete transcripts for genes whose expression levels are too low to allow full transcript reconstruction from a single sequencing lane.

Usage: cuffmerge [options] <list_of_gtf_files>

Make sure you are in the rnaseq directory.

Open a text editor and create a list of the .gtf files that will be incorporated into the super-assembly. The list should have the following format:

    ./70-15_mycelial_RNA_sample_1_clout/transcripts.gtf
    ./70-15_mycelial_RNA_sample_2_clout/transcripts.gtf
    ./70-15_spore_RNA_sample_1_clout/transcripts.gtf
    etc.

Here is another example of where a file created by a word processor such as Word will not be read properly by the cuffmerge program and will produce an error; use a plain-text editor.

Include the .gtf files for the three datasets and save the file using the name assemblies.txt.
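As an alternative to typing the list by hand, the same file can be produced from the shell. The following is only a sketch: the function name make_assembly_list is made up for illustration, and it assumes that every cufflinks output directory sits in the current directory and ends in _clout, as in this exercise.

```shell
# Hypothetical helper (not part of the Tuxedo tools): write one
# ./<sample>_clout/transcripts.gtf path per line into the file named in $1.
make_assembly_list() {
    for gtf in ./*_clout/transcripts.gtf; do
        # If no *_clout directory exists the pattern stays unexpanded,
        # so only paths that really exist are written.
        if [ -e "$gtf" ]; then
            printf '%s\n' "$gtf"
        fi
    done > "$1"
}

# Typical use from the rnaseq directory:
#   make_assembly_list assemblies.txt
#   cat assemblies.txt
```

Because the list is written by the shell, it is plain text by construction, which avoids the word-processor problem mentioned above.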
Run cuffmerge (changing filenames as necessary):

    cuffmerge -p 1 -s magnaporthe_oryzae-70-15_8_supercontigs.fasta \
        -g magnaporthe_oryzae-70-15_8_transcripts.gtf assemblies.txt

-s  points to the genome sequence, which is used in the classification of transfrags that do not correspond to known genes
-p  number of processors to use
-g  include the reference annotation in the merging operation

Examine the merged.gtf file produced by cuffmerge inside of merged_asm. Use command line tools to interrogate the file to identify novel transcripts that have not been previously identified. Note: these will lack MGG identifiers.

4.4 Differential Gene Expression Analysis

We will use cuffdiff to determine if any genes are differentially expressed in one of the RNAseq datasets. To compare gene expression levels, it is necessary to have a set of genes that one wants to interrogate. For our purposes, we will have cuffdiff use the merged.gtf file produced by cuffmerge, which combines existing gene annotations (if available) with new information (novel transcripts, isoforms, etc.) generated from the RNAseq data. Cuffdiff then uses the alignment data (in the .bam files) to calculate and compare abundances.

Usage: cuffdiff [options] <transcripts.gtf> <sample1.replicate1.bam,sample1.replicate2.bam,...> <sample2.replicate1.bam,sample2.replicate2.bam,...>

Note: experimental replicates are separated with commas; datasets being compared are separated by a space (i.e.: Set1_rep1,Set1_rep2 Set2_rep1,Set2_rep2).

For our experiment, we will compare transcript abundance in two replicates of mycelium versus one spore sample.

Run cuffdiff as follows (do not put a space after the comma):

    cuffdiff -o diff_out -p 1 -L mycelium,spores \
        -u merged_asm/merged.gtf \
        ./70-15_mycelial_RNA_sample_1_thout/accepted_hits.bam,./70-15_mycelial_RNA_sample_2_thout/accepted_hits.bam \
        ./70-15_spore_RNA_sample_1_thout/accepted_hits.bam

-o  output directory where results will be deposited
-p  number of processors to use
-L  labels to use for the conditions being compared (here mycelium and spores). These labels will appear at the top of the relevant columns in the various output files.
-u  tells cuffdiff to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome

Be sure not to put spaces around the comma!
By default, cuffdiff writes the gene expression differences to a file named gene_exp.diff inside your defined output folder. View the header of this file and see if you can determine what information is contained in the various columns. If necessary, look at the description of the output columns in the following Appendix, or look at the online cuffdiff manual (cufflinks.cbcb.umd.edu/manual.html).
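One way to view the header is to print each column name on its own line next to its column number, which makes later awk commands easier to write. This is a sketch using standard Unix tools only; the function name number_columns is made up for illustration, and the file is assumed to be tab-delimited with the column names on its first line.

```shell
# Hypothetical helper: print "<column number><TAB><column name>" for every
# field on the first line of the tab-delimited file given as $1.
number_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ print NR "\t" $0 }'
}

# Typical use:
#   number_columns diff_out/gene_exp.diff
```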

Use the command line to produce a list that contains the identities of the genes that show significant differences in their expression levels (only the names of the genes and nothing else). Write this list to a file. Hint: You will need to use awk.

Examine the junctions.bed file to determine if the RNAseq data support the existence of novel transcript isoforms (as evidenced by the presence of novel splice junctions). Use the command line to determine how many novel junctions are robust (supported by at least 10 sequencing reads).

How to interpret a number of useful output files

1. The Sequence Alignment/MAP (SAM) format (Tophat2/bowtie2)

This is the standard output format for most NGS alignment programs. It, or its .bam equivalent, serves as input to many tertiary analysis programs. .bam files are simply binary versions of .sam files.

a. Mandatory fields in the SAM format

No.  Name   Description
1    QNAME  Query NAME of the read or the read pair
2    FLAG   Bitwise FLAG (pairing, strand, mate strand, etc.)
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost POSition of clipped alignment
5    MAPQ   MAPping Quality (Phred-scaled)
6    CIGAR  Extended CIGAR string (operations: MIDNSHP)
7    MRNM   Mate Reference NaMe ("=" if same as RNAME)
8    MPOS   1-based leftmost Mate POSition
9    ISIZE  Inferred Insert SIZE
10   SEQ    Query SEQuence on the same strand as the reference
11   QUAL   Query QUALity (ASCII-33 = Phred base quality)

1. QNAME and FLAG are required for all alignments. If the mapping position of the query is not available, RNAME and CIGAR are set as *, and POS and MAPQ as 0. If the query is unpaired or pairing information is not available, MRNM equals *, and MPOS and ISIZE equal 0. SEQ and QUAL can both be absent, represented as a star *. If QUAL is not a star, it must be of the same length as SEQ.

2. The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in different alignment records, representing multiple or split hits. The maximum string length is given in the SAM specification.

3. If SQ is present in the header, RNAME and MRNM must appear in an SQ header record.

4. Field MAPQ considers pairing in calculation if the read is paired. Providing MAPQ is recommended. If such a calculation is difficult, 255 should be applied, indicating the mapping quality is not available.

5. If the two reads in a pair are mapped to the same reference, ISIZE equals the difference between the coordinate of the 5'-end of the mate and of the 5'-end of the current read; otherwise ISIZE equals 0 (by the 5'-end we mean the 5'-end of the original read, so for Illumina short-insert paired end reads this calculates the difference in mapping coordinates of the outer edges of the original sequenced fragment). ISIZE is negative if the mate is mapped to a smaller coordinate than the current read.

6. Color alignments are stored as normal nucleotide alignments with additional tags describing the raw color sequences, qualities and color-specific properties (see also Note 5 in section 2.2.4).

7. All mapped reads are represented on the forward genomic strand. The bases are reverse complemented from the unmapped read sequence, and the quality scores and CIGAR strings are recorded consistently with the bases. This applies to information in the mate tags (R2, Q2, S2, etc.) and any other tags that are strand sensitive. The strand bits in the flag simply indicate whether this reverse complement transform was applied from the original read sequence to obtain the bases listed in the SAM file.

b. SAM File Header Lines - Record Types and Tags

Type                      Tag  Description
HD - header               VN*  File format version.
                          SO   Sort order. Valid values are: unsorted, queryname or coordinate.
                          GO   Group order (full sorting is not imposed in a group). Valid values are: none, query or reference.
SQ - sequence dictionary  SN*  Sequence name. Unique among all sequence records in the file. The value of this field is used in alignment records.
                          LN*  Sequence length.
                          AS   Genome assembly identifier. Refers to the reference genome assembly in an unambiguous form. Example: HG18.
                          M5   MD5 checksum of the sequence in uppercase (gaps and spaces are removed).
                          UR   URI of the sequence.
                          SP   Species.
RG - read group           ID*  Unique read group identifier. The value of the ID field is used in the RG tags of alignment records.
                          SM*  Sample (use pool name where a pool is being sequenced).
                          LB   Library.
                          DS   Description.
                          PU   Platform unit (e.g. lane for Illumina or slide for SOLiD); should be a full, unambiguous identifier.
                          PI   Predicted median insert size (may be different from the actual median insert size).
                          CN   Name of sequencing center producing the read.
                          DT   Date the run was produced (ISO 8601 date or date/time).
                          PL   Platform/technology used to produce the read.
PG - program              ID*  Program name.
                          VN   Program version.
                          CL   Command line.
CO - comment                   One-line text comments.
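Going back to the mandatory fields in table (a): because SAM is tab-delimited text, individual columns are easy to pull out with awk. The sketch below tallies alignment records per reference sequence (RNAME, column 3). The function name count_per_reference is made up for illustration, and the sketch assumes that header lines are the only lines beginning with @ and that unmapped reads carry * as RNAME, as stated in note 1.

```shell
# Hypothetical helper: read SAM text on standard input and print
# "<reference name><TAB><number of alignment records>", sorted by name.
count_per_reference() {
    awk -F '\t' '
        /^@/      { next }      # skip header lines (@HD, @SQ, @RG, @PG, @CO)
        $3 == "*" { next }      # RNAME is * when no mapping position is available
        { n[$3]++ }             # tally one record for this reference sequence
        END { for (r in n) print r "\t" n[r] }
    ' | sort
}

# Typical use (samtools view -h prints the header followed by the alignments):
#   samtools view -h 70-15_mycelial_RNA_sample_1_thout/accepted_hits.bam | count_per_reference
```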

c. Interpretation of bitwise flags in .sam/.bam files

Flag    Description
0x0001  the read is paired in sequencing, no matter whether it is mapped in a pair
0x0002  the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment)
0x0004  the query sequence itself is unmapped
0x0008  the mate is unmapped
0x0010  strand of the query (0 for forward; 1 for reverse strand)
0x0020  strand of the mate
0x0040  the read is the first read in a pair
0x0080  the read is the second read in a pair
0x0100  the alignment is not primary (a read having split hits may have multiple primary alignment records)
0x0200  the read fails platform/vendor quality checks
0x0400  the read is either a PCR duplicate or an optical duplicate
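The FLAG column holds the sum of all bits that apply to a read, so a value such as 99 has to be broken back into its bits before it can be read off the table. A minimal sketch using shell arithmetic follows; the function name describe_flag and the short labels are made up for illustration and simply abbreviate the descriptions in the table above.

```shell
# Hypothetical helper: print a space-separated label for every bit of the
# table above that is set in the decimal FLAG value given as $1.
describe_flag() {
    flag=$1
    out=""
    while read -r bit name; do
        # $(( ... & ... )) is a bitwise AND; a non-zero result means the bit is set
        if [ $(( $flag & $bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done <<'EOF'
0x0001 paired
0x0002 proper_pair
0x0004 unmapped
0x0008 mate_unmapped
0x0010 reverse_strand
0x0020 mate_reverse_strand
0x0040 first_in_pair
0x0080 second_in_pair
0x0100 not_primary
0x0200 fails_qc
0x0400 duplicate
EOF
    echo "${out# }"    # drop the leading space before printing
}

# Typical use:
#   describe_flag 16
```

For the single-end reads used in this exercise, the values seen most often are 0 (no bit set, forward strand), 16 (reverse strand) and 256 or 272 (not primary, on either strand).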

d. CIGAR String Operations

The CIGAR string describes the alignment between the sequence read and the reference genome.

Operation  BAM  Description
M          0    alignment match (can be a sequence match or mismatch)
I          1    insertion to the reference
D          2    deletion from the reference
N          3    skipped region from the reference
S          4    soft clipping (clipped sequences present in SEQ)
H          5    hard clipping (clipped sequences NOT present in SEQ)
P          6    padding (silent deletion from padded reference)
=          7    sequence match
X          8    sequence mismatch

H can only be present as the first and/or last operation.
S may only have H operations between them and the ends of the CIGAR string.
For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
The sum of the lengths of the M/I/S/=/X operations shall equal the length of SEQ.
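The rules above can be checked on any CIGAR string by walking through it one operation at a time. The sketch below adds up the operations that consume read bases (M/I/S/=/X, as stated above), the operations that consume reference bases (M/D/N/=/X, which follows from the descriptions in the table), and counts N operations as introns. The function name cigar_summary and its output labels are made up for illustration.

```shell
# Hypothetical helper: summarise the CIGAR string given as $1.
cigar_summary() {
    awk -v cigar="$1" 'BEGIN {
        query = 0; ref = 0; introns = 0
        # Peel one <length><operation> pair off the front on each pass
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op  = substr(cigar, RLENGTH, 1)
            if (op ~ /[MIS=X]/) query += len    # bases present in SEQ
            if (op ~ /[MDN=X]/) ref   += len    # bases spanned on the reference
            if (op == "N")      introns++       # skipped region = intron for RNAseq
            cigar = substr(cigar, RLENGTH + 1)
        }
        print "query_len=" query, "ref_span=" ref, "introns=" introns
    }'
}

# Typical use:
#   cigar_summary 30M500N20M
```

A spliced read such as 30M500N20M therefore carries 50 read bases but spans 550 bases of the genome, which is why POS plus read length does not give the end of a spliced alignment.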

2. Junctions.bed (Tophat2)

[seqname] [start] [end] [id] [score] [strand] [thickStart] [thickEnd] [r,g,b] [block_count] [block_sizes] [block_locations]

"start" is the start position of the leftmost read that contains the junction.
"end" is the end position of the rightmost read that contains the junction.
"id" is the junction's id, e.g. JUNC0001.
"score" is the number of reads that contain the junction.
"strand" is either + or -.
"thickStart" and "thickEnd" don't seem to have any effect on display for a junctions track. TopHat sets them as equal to start and end respectively.
"r", "g" and "b" are the red, green, and blue values. They affect the color of the display in a browser.
"block_count", "block_sizes" and "block_locations": The block_count will always be 2. The two blocks specify the regions on either side of the junction. "block_sizes" tells you how large each region is, and "block_locations" tells you, relative to the "start" being 0, where the two blocks occur. Therefore, the first block_location will always be zero.

3. Insertions.bed (Tophat2)

[chrom] [chromStart] [chromEnd] [name] [score]

Example (the chromosome suffix and the coordinates are missing from this copy and are shown as …):

track name=insertions description="TopHat insertions"
Chromosome_…  …  …  G   1
Chromosome_…  …  …  C   1
Chromosome_…  …  …  C   1
Chromosome_…  …  …  G   1
Chromosome_…  …  …  C   1
Chromosome_…  …  …  A   4
Chromosome_…  …  …  C   4
Chromosome_…  …  …  A   1
Chromosome_…  …  …  A   2
Chromosome_…  …  …  GA  1
Chromosome_…  …  …  A   2
Chromosome_…  …  …  C   1
Chromosome_…  …  …  T   1
Chromosome_…  …  …  G   1
Chromosome_…  …  …  T   1
Chromosome_…  …  …  GA  37
Chromosome_…  …  …  T   1
Chromosome_…  …  …  T   1
Chromosome_…  …  …  TT  19

Notes: Track name is the name given to the relevant track in a genome browser. chromStart and chromEnd indicate the base position where the insertion occurred; the name field indicates the inserted base(s); score indicates depth of sequence coverage.
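Returning to junctions.bed (item 2 above): since start and end mark the outer edges of the two flanking blocks, the intron itself can be recovered by trimming the first block size off the start and the second block size off the end. The sketch below does this with awk. The function name junction_introns is made up for illustration, and the calculation assumes the field meanings exactly as described in item 2.

```shell
# Hypothetical helper: for each junction in the junctions.bed file given as $1,
# print seqname, intron start, intron end, id, score and strand (tab-separated).
junction_introns() {
    awk '
        $1 == "track" { next }          # skip the browser track line
        {
            split($11, size, ",")       # block_sizes, e.g. "20,30"
            printf "%s\t%d\t%d\t%s\t%s\t%s\n",
                   $1, $2 + size[1], $3 - size[2], $4, $5, $6
        }
    ' "$1"
}

# Typical use:
#   junction_introns 70-15_mycelial_RNA_sample_1_thout/junctions.bed
```

The fifth output column is still the number of reads supporting the junction, so the result can be piped into a further awk filter on that column.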

4. Deletions.bed (Tophat2)

[chrom] [chromStart] [chromEnd] [name] [score]

track name=deletions description="TopHat deletions"
(13 example records on Chromosome_…; the field values are missing from this copy)

Note: Track name is the name given to the relevant track in a genome browser. chromStart indicates the first deleted base; chromEnd indicates the first retained base; the name field indicates a deletion; score indicates depth of sequence coverage.

5. GFF/GTF File Format - Definition and supported options

The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. The following documentation is based on the Version 2 specifications. The GTF (General Transfer Format) is identical to GFF version 2.

Fields

Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'

1. Seqname    name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
2. Source     name of the program that generated this feature, or the data source (database or project name)
3. Feature    feature type name, e.g. Gene, Variation, Similarity
4. Start      start position of the feature, with sequence numbering starting at 1
5. End        end position of the feature, with sequence numbering starting at 1
6. Score      a floating point value
7. Strand     defined as + (forward) or - (reverse)
8. Frame      one of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on
9. Attribute  a semicolon-separated list of tag-value pairs, providing additional information about each feature

Example (columns 4 to 8 and the numeric attribute values are missing from this copy and are shown as …):

Chromosome_8.1  Cufflinks  transcript  …  gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "…"; frac "…"; conf_lo "…"; conf_hi "…"; cov "…"; full_read_support "yes";
Chromosome_8.1  Cufflinks  exon        …  gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "…"; frac "…"; conf_lo "…"; conf_hi "…"; cov "…";
Chromosome_8.1  Cufflinks  transcript  …  gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "…"; frac "…"; conf_lo "…"; conf_hi "…"; cov "…"; full_read_support "yes";
Chromosome_8.1  Cufflinks  exon        …  gene_id "CUFF.2"; transcript_id "CUFF.2.1"; exon_number "1"; FPKM "…"; frac "…"; conf_lo "…"; conf_hi "…"; cov "…";
Chromosome_8.1  Cufflinks  transcript  …  gene_id "CUFF.2"; transcript_id "MGG_01951T0"; FPKM "…"; frac "…"; conf_lo "…"; conf_hi "…"; cov "…"; full_read_support "no";
Chromosome_8.1  Cufflinks  exon        …  gene_id "CUFF.2"; transcript_id "MGG_01951T0"; exon_number "1"; FPKM "…"; frac "…"; conf_lo "…"; conf_hi "…"; cov "…";

Track lines

Although not part of the formal GFF specification, Ensembl will use track lines to further configure sets of features. Track lines should be placed at the beginning of the list of features they are to affect. The track line consists of the word 'track' followed by space-separated key=value pairs. Valid parameters used by Ensembl are:

name         unique name to identify this track when parsing the file
description  label to be displayed under the track in Region in Detail
priority     integer defining the order in which to display tracks, if multiple tracks are defined

More information

For more information about this file format, see the documentation on the Sanger Institute website.
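The attribute column (field 9) is the part of a GTF line that most often needs to be taken apart on the command line. As a sketch, the function below prints the transcript_id of every transcript line in a cufflinks transcripts.gtf whose identifier does not start with MGG_, i.e. the CUFF.* entries in the example above. The function name list_unannotated_transcripts is made up for illustration, and the sketch assumes the attribute layout shown in the example (transcript_id followed by a double-quoted value).

```shell
# Hypothetical helper: list transcript_id values that lack an MGG_ prefix
# in the tab-delimited GTF file given as $1.
list_unannotated_transcripts() {
    awk -F '\t' '
        $3 == "transcript" {
            # locate: transcript_id "<value>"  within the attribute column
            if (match($9, /transcript_id "[^"]+"/)) {
                # the prefix transcript_id " is 15 characters long;
                # RLENGTH also counts the closing quote
                id = substr($9, RSTART + 15, RLENGTH - 16)
                if (id !~ /^MGG_/) print id
            }
        }
    ' "$1"
}

# Typical use:
#   list_unannotated_transcripts 70-15_mycelial_RNA_sample_1_clout/transcripts.gtf
```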

6. CUFFDIFF output (.diff files)

These are the main output files that one queries to identify genes that are differentially expressed between groups. There are 14 columns containing the following information:

1.  test_id            This is an arbitrary name given to describe the test. Defined in the attributes field of the .gtf file.
2.  gene_id            This is a systematic gene identifier. Defined in the attributes field of the .gtf file.
3.  gene               This would normally be a common gene name (e.g. GAPDH, TRP1, etc.). Defined in the attributes field of the .gtf file. In the example (gene_exp.diff) below, the names under gene are actually systematic identifiers (common names have not been assigned to the genes in this annotation).
4.  locus              Position in reference genome
5.  sample_1           Common name given to test group 1
6.  sample_2           Common name given to test group 2
7.  status             Statement about the statistical test (OK: test was performed; or NOTEST)
8.  value_1            Average FPKM for samples in test group 1
9.  value_2            Average FPKM for samples in test group 2
10. log2(fold_change)  Fold-change in FPKM in group 2, relative to group 1
11. test_stat          Test statistic value
12. p_value            Corresponding p-value
13. q_value            Corresponding q-value (p-value adjusted for false discovery rate due to multiple hypothesis testing)
14. significant        Whether or not the q-value is significant

Example (gene_exp.diff; identifier suffixes, coordinates and numeric values are missing from this copy and are shown as …):

test_id  gene_id  gene  locus  sample_1  sample_2  status  value_1  value_2  log2(fold_change)  test_stat  p_value  q_value  significant
XLOC_…  XLOC_…  MGG_01946  Chromosome_8.1:…  control  expt  OK      …      no
XLOC_…  XLOC_…  MGG_15984  Chromosome_8.1:…  control  expt  NOTEST  …      no
XLOC_…  XLOC_…  MGG_01950  Chromosome_8.1:…  control  expt  NOTEST  …      no
XLOC_…  XLOC_…  MGG_01960  Chromosome_8.1:…  control  expt  NOTEST  …      no
XLOC_…  XLOC_…  MGG_01963  Chromosome_8.1:…  control  expt  NOTEST  … inf  no
XLOC_…  XLOC_…  MGG_15986  Chromosome_8.1:…  control  expt  OK      …      no
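Before looking for significant genes, it is often worth checking how many genes were tested at all. As a sketch, the function below tallies the values found in the status column (column 7) of a .diff file. The function name count_by_status is made up for illustration, and the file is assumed to be tab-delimited with a single header line, as in the example above.

```shell
# Hypothetical helper: print "<status><TAB><number of rows>" for the
# cuffdiff .diff file given as $1, sorted by status.
count_by_status() {
    awk -F '\t' '
        NR == 1 { next }        # skip the header line
        { n[$7]++ }             # column 7 is status (e.g. OK or NOTEST)
        END { for (s in n) print s "\t" n[s] }
    ' "$1" | sort
}

# Typical use:
#   count_by_status diff_out/gene_exp.diff
```

The same pattern (skip the header, test one column, print another) carries over to the other columns described in the list above.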


More information

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your

More information

Questions about Cufflinks should be sent to Please do not technical questions to Cufflinks contributors directly.

Questions about Cufflinks should be sent to Please do not  technical questions to Cufflinks contributors directly. Cufflinks RNA-Seq analysis tools - User's Manual 1 of 22 14.07.2011 09:42 Cufflinks Transcript assembly, differential expression, and differential regulation for RNA-Seq Please Note If you have questions

More information

Tiling Assembly for Annotation-independent Novel Gene Discovery

Tiling Assembly for Annotation-independent Novel Gene Discovery Tiling Assembly for Annotation-independent Novel Gene Discovery By Jennifer Lopez and Kenneth Watanabe Last edited on September 7, 2015 by Kenneth Watanabe The following procedure explains how to run the

More information

RNASeq2017 Course Salerno, September 27-29, 2017

RNASeq2017 Course Salerno, September 27-29, 2017 RNASeq2017 Course Salerno, September 27-29, 2017 RNA- seq Hands on Exercise Fabrizio Ferrè, University of Bologna Alma Mater (fabrizio.ferre@unibo.it) Hands- on tutorial based on the EBI teaching materials

More information

New releases and related tools will be announced through the mailing list

New releases and related tools will be announced through the mailing list Cufflinks Transcript assembly, differential expression, and differential regulation for RNA-Seq Please Note If you have questions about how to use Cufflinks or would like more information about the software,

More information

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems. Sequencing Short Alignment Tobias Rausch 7 th June 2010 WGS RNA-Seq Exon Capture ChIP-Seq Sequencing Paired-End Sequencing Target genome Fragments Roche GS FLX Titanium Illumina Applied Biosystems SOLiD

More information

mrna-seq Basic processing Read mapping (shown here, but optional. May due if time allows) Gene expression estimation

mrna-seq Basic processing Read mapping (shown here, but optional. May due if time allows) Gene expression estimation mrna-seq Basic processing Read mapping (shown here, but optional. May due if time allows) Tophat Gene expression estimation cufflinks Confidence intervals Gene expression changes (separate use case) Sample

More information

A Tutorial: Genome- based RNA- Seq Analysis Using the TUXEDO Package

A Tutorial: Genome- based RNA- Seq Analysis Using the TUXEDO Package A Tutorial: Genome- based RNA- Seq Analysis Using the TUXEDO Package The following data and software resources are required for following the tutorial. Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat

More information

Advanced UCSC Browser Functions

Advanced UCSC Browser Functions Advanced UCSC Browser Functions Dr. Thomas Randall tarandal@email.unc.edu bioinformatics.unc.edu UCSC Browser: genome.ucsc.edu Overview Custom Tracks adding your own datasets Utilities custom tools for

More information

TP RNA-seq : Differential expression analysis

TP RNA-seq : Differential expression analysis TP RNA-seq : Differential expression analysis Overview of RNA-seq analysis Fusion transcripts detection Differential expresssion Gene level RNA-seq Transcript level Transcripts and isoforms detection 2

More information

Maize genome sequence in FASTA format. Gene annotation file in gff format

Maize genome sequence in FASTA format. Gene annotation file in gff format Exercise 1. Using Tophat/Cufflinks to analyze RNAseq data. Step 1. One of CBSU BioHPC Lab workstations has been allocated for your workshop exercise. The allocations are listed on the workshop exercise

More information

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Analysis of RNA sequencing data sets using the Galaxy environment Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Microarray and Deep-sequencing core facility 30.10.2017 RNA-seq workflow I Hypothesis

More information

NGS Data Analysis. Roberto Preste

NGS Data Analysis. Roberto Preste NGS Data Analysis Roberto Preste 1 Useful info http://bit.ly/2r1y2dr Contacts: roberto.preste@gmail.com Slides: http://bit.ly/ngs-data 2 NGS data analysis Overview 3 NGS Data Analysis: the basic idea http://bit.ly/2r1y2dr

More information

Galaxy Platform For NGS Data Analyses

Galaxy Platform For NGS Data Analyses Galaxy Platform For NGS Data Analyses Weihong Yan wyan@chem.ucla.edu Collaboratory Web Site http://qcb.ucla.edu/collaboratory Collaboratory Workshops Workshop Outline ü Day 1 UCLA galaxy and user account

More information

Read Naming Format Specification

Read Naming Format Specification Read Naming Format Specification Karel Břinda Valentina Boeva Gregory Kucherov Version 0.1.3 (4 August 2015) Abstract This document provides a standard for naming simulated Next-Generation Sequencing (Ngs)

More information

NGS Data Visualization and Exploration Using IGV

NGS Data Visualization and Exploration Using IGV 1 What is Galaxy Galaxy for Bioinformaticians Galaxy for Experimental Biologists Using Galaxy for NGS Analysis NGS Data Visualization and Exploration Using IGV 2 What is Galaxy Galaxy for Bioinformaticians

More information

RNA Sequencing with TopHat Alignment v1.0 and Cufflinks Assembly & DE v1.1 App Guide

RNA Sequencing with TopHat Alignment v1.0 and Cufflinks Assembly & DE v1.1 App Guide RNA Sequencing with TopHat Alignment v1.0 and Cufflinks Assembly & DE v1.1 App Guide For Research Use Only. Not for use in diagnostic procedures. Introduction 3 Set Analysis Parameters TopHat 4 Analysis

More information

RNA Sequencing with TopHat and Cufflinks

RNA Sequencing with TopHat and Cufflinks RNA Sequencing with TopHat and Cufflinks Introduction 3 Run TopHat App 4 TopHat App Output 5 Run Cufflinks 18 Cufflinks App Output 20 RNAseq Methods 27 Technical Assistance ILLUMINA PROPRIETARY 15050962

More information

NGS Analysis Using Galaxy

NGS Analysis Using Galaxy NGS Analysis Using Galaxy Sequences and Alignment Format Galaxy overview and Interface Get;ng Data in Galaxy Analyzing Data in Galaxy Quality Control Mapping Data History and workflow Galaxy Exercises

More information

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub trinityrnaseq / RNASeq_Trinity_Tuxedo_Workshop Trinity De novo Transcriptome Assembly Workshop Brian Haas edited this page on Oct 17, 2015 14 revisions De novo RNA-Seq Assembly and Analysis Using Trinity

More information

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013 RNAseq analysis: SNP calling BTI bioinformatics course, spring 2013 RNAseq overview RNAseq overview Choose technology 454 Illumina SOLiD 3 rd generation (Ion Torrent, PacBio) Library types Single reads

More information

Reference guided RNA-seq data analysis using BioHPC Lab computers

Reference guided RNA-seq data analysis using BioHPC Lab computers Reference guided RNA-seq data analysis using BioHPC Lab computers This document assumes that you already know some basics of how to use a Linux computer. Some of the command lines in this document are

More information

The SAM Format Specification (v1.4-r956)

The SAM Format Specification (v1.4-r956) The SAM Format Specification (v1.4-r956) The SAM Format Specification Working Group April 12, 2011 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text

More information

RNA-seq Data Analysis

RNA-seq Data Analysis Seyed Abolfazl Motahari RNA-seq Data Analysis Basics Next Generation Sequencing Biological Samples Data Cost Data Volume Big Data Analysis in Biology تحلیل داده ها کنترل سیستمهای بیولوژیکی تشخیص بیماریها

More information

Our data for today is a small subset of Saimaa ringed seal RNA sequencing data (RNA_seq_reads.fasta). Let s first see how many reads are there:

Our data for today is a small subset of Saimaa ringed seal RNA sequencing data (RNA_seq_reads.fasta). Let s first see how many reads are there: Practical Course in Genome Bioinformatics 19.2.2016 (CORRECTED 22.2.2016) Exercises - Day 5 http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2016/ Answer the 5 questions (Q1-Q5) according

More information

Evaluate NimbleGen SeqCap RNA Target Enrichment Data

Evaluate NimbleGen SeqCap RNA Target Enrichment Data Roche Sequencing Technical Note November 2014 How To Evaluate NimbleGen SeqCap RNA Target Enrichment Data 1. OVERVIEW Analysis of NimbleGen SeqCap RNA target enrichment data generated using an Illumina

More information

Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6

Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6 Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6 The goal of this exercise is to retrieve an RNA-seq dataset in FASTQ format and run it through an RNA-sequence analysis

More information

Identiyfing splice junctions from RNA-Seq data

Identiyfing splice junctions from RNA-Seq data Identiyfing splice junctions from RNA-Seq data Joseph K. Pickrell pickrell@uchicago.edu October 4, 2010 Contents 1 Motivation 2 2 Identification of potential junction-spanning reads 2 3 Calling splice

More information

The software and data for the RNA-Seq exercise are already available on the USB system

The software and data for the RNA-Seq exercise are already available on the USB system BIT815 Notes on R analysis of RNA-seq data The software and data for the RNA-Seq exercise are already available on the USB system The notes below regarding installation of R packages and other software

More information

merantk Version 1.1.1a

merantk Version 1.1.1a DIVISION OF BIOINFORMATICS - INNSBRUCK MEDICAL UNIVERSITY merantk Version 1.1.1a User manual Dietmar Rieder 1/12/2016 Page 1 Contents 1. Introduction... 3 1.1. Purpose of this document... 3 1.2. System

More information

Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat a.tgz. Software:

Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat a.tgz. Software: A Tutorial: De novo RNA- Seq Assembly and Analysis Using Trinity and edger The following data and software resources are required for following the tutorial: Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat

More information

RASER: Reads Aligner for SNPs and Editing sites of RNA (version 0.51) Manual

RASER: Reads Aligner for SNPs and Editing sites of RNA (version 0.51) Manual RASER: Reads Aligner for SNPs and Editing sites of RNA (version 0.51) Manual July 02, 2015 1 Index 1. System requirement and how to download RASER source code...3 2. Installation...3 3. Making index files...3

More information

v0.3.0 May 18, 2016 SNPsplit operates in two stages:

v0.3.0 May 18, 2016 SNPsplit operates in two stages: May 18, 2016 v0.3.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.

More information

Exercise 2: Browser-Based Annotation and RNA-Seq Data

Exercise 2: Browser-Based Annotation and RNA-Seq Data Exercise 2: Browser-Based Annotation and RNA-Seq Data Jeremy Buhler July 24, 2018 This exercise continues your introduction to practical issues in comparative annotation. You ll be annotating genomic sequence

More information

Short Read Sequencing Analysis Workshop

Short Read Sequencing Analysis Workshop Short Read Sequencing Analysis Workshop Day 8: Introduc/on to RNA-seq Analysis In-class slides Day 7 Homework 1.) 14 GABPA ChIP-seq peaks 2.) Error: Dataset too large (> 100000). Rerun with larger maxsize

More information

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14) BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14) Genome Informatics (Part 1) https://bioboot.github.io/bggn213_f17/lectures/#14 Dr. Barry Grant Nov 2017 Overview: The purpose of this lab session is

More information

Analyzing ChIP- Seq Data in Galaxy

Analyzing ChIP- Seq Data in Galaxy Analyzing ChIP- Seq Data in Galaxy Lauren Mills RISS ABSTRACT Step- by- step guide to basic ChIP- Seq analysis using the Galaxy platform. Table of Contents Introduction... 3 Links to helpful information...

More information

RNA-Seq data analysis software. User Guide 023UG050V0200

RNA-Seq data analysis software. User Guide 023UG050V0200 RNA-Seq data analysis software User Guide 023UG050V0200 FOR RESEARCH USE ONLY. NOT INTENDED FOR DIAGNOSTIC OR THERAPEUTIC USE. INFORMATION IN THIS DOCUMENT IS SUBJECT TO CHANGE WITHOUT NOTICE. Lexogen

More information

RNA-Seq data analysis software. User Guide 023UG050V0100

RNA-Seq data analysis software. User Guide 023UG050V0100 RNA-Seq data analysis software User Guide 023UG050V0100 FOR RESEARCH USE ONLY. NOT INTENDED FOR DIAGNOSTIC OR THERAPEUTIC USE. INFORMATION IN THIS DOCUMENT IS SUBJECT TO CHANGE WITHOUT NOTICE. Lexogen

More information

Package roar. August 31, 2018

Package roar. August 31, 2018 Type Package Package roar August 31, 2018 Title Identify differential APA usage from RNA-seq alignments Version 1.16.0 Date 2016-03-21 Author Elena Grassi Maintainer Elena Grassi Identify

More information

Benchmarking of RNA-seq aligners

Benchmarking of RNA-seq aligners Lecture 17 RNA-seq Alignment STAR Benchmarking of RNA-seq aligners Benchmarking of RNA-seq aligners Benchmarking of RNA-seq aligners Benchmarking of RNA-seq aligners Based on this analysis the most reliable

More information

Analyzing Variant Call results using EuPathDB Galaxy, Part II

Analyzing Variant Call results using EuPathDB Galaxy, Part II Analyzing Variant Call results using EuPathDB Galaxy, Part II In this exercise, we will work in groups to examine the results from the SNP analysis workflow that we started yesterday. The first step is

More information

Single/paired-end RNAseq analysis with Galaxy

Single/paired-end RNAseq analysis with Galaxy October 016 Single/paired-end RNAseq analysis with Galaxy Contents: 1. Introduction. Quality control 3. Alignment 4. Normalization and read counts 5. Workflow overview 6. Sample data set to test the paired-end

More information

ASAP - Allele-specific alignment pipeline

ASAP - Allele-specific alignment pipeline ASAP - Allele-specific alignment pipeline Jan 09, 2012 (1) ASAP - Quick Reference ASAP needs a working version of Perl and is run from the command line. Furthermore, Bowtie needs to be installed on your

More information

Tutorial: RNA-Seq analysis part I: Getting started

Tutorial: RNA-Seq analysis part I: Getting started : RNA-Seq analysis part I: Getting started August 9, 2012 CLC bio Finlandsgade 10-12 8200 Aarhus N Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com support@clcbio.com : RNA-Seq analysis

More information

Genetics 211 Genomics Winter 2014 Problem Set 4

Genetics 211 Genomics Winter 2014 Problem Set 4 Genomics - Part 1 due Friday, 2/21/2014 by 9:00am Part 2 due Friday, 3/7/2014 by 9:00am For this problem set, we re going to use real data from a high-throughput sequencing project to look for differential

More information

RNA-Seq data analysis software. User Guide 023UG050V0210

RNA-Seq data analysis software. User Guide 023UG050V0210 RNA-Seq data analysis software User Guide 023UG050V0210 FOR RESEARCH USE ONLY. NOT INTENDED FOR DIAGNOSTIC OR THERAPEUTIC USE. INFORMATION IN THIS DOCUMENT IS SUBJECT TO CHANGE WITHOUT NOTICE. Lexogen

More information

The SAM Format Specification (v1.4-r994)

The SAM Format Specification (v1.4-r994) The SAM Format Specification (v1.4-r994) The SAM Format Specification Working Group January 27, 2012 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text

More information

Genome representa;on concepts. Week 12, Lecture 24. Coordinate systems. Genomic coordinates brief overview 11/13/14

Genome representa;on concepts. Week 12, Lecture 24. Coordinate systems. Genomic coordinates brief overview 11/13/14 2014 - BMMB 852D: Applied Bioinforma;cs Week 12, Lecture 24 István Albert Biochemistry and Molecular Biology and Bioinforma;cs Consul;ng Center Penn State Genome representa;on concepts At the simplest

More information

Using Galaxy: RNA-seq

Using Galaxy: RNA-seq Using Galaxy: RNA-seq Stanford University September 23, 2014 Jennifer Hillman-Jackson Galaxy Team Penn State University http://galaxyproject.org/ The Agenda Introduction RNA-seq Example - Data Prep: QC

More information

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012) USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are

More information

Sequence Alignment/Map Optional Fields Specification

Sequence Alignment/Map Optional Fields Specification Sequence Alignment/Map Optional Fields Specification The SAM/BAM Format Specification Working Group 14 Jul 2017 The master version of this document can be found at https://github.com/samtools/hts-specs.

More information

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment An Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at https://blast.ncbi.nlm.nih.gov/blast.cgi

More information

Exeter Sequencing Service

Exeter Sequencing Service Exeter Sequencing Service A guide to your denovo RNA-seq results An overview Once your results are ready, you will receive an email with a password-protected link to them. Click the link to access your

More information

Supplementary Figure 1. Fast read-mapping algorithm of BrowserGenome.

Supplementary Figure 1. Fast read-mapping algorithm of BrowserGenome. Supplementary Figure 1 Fast read-mapping algorithm of BrowserGenome. (a) Indexing strategy: The genome sequence of interest is divided into non-overlapping 12-mers. A Hook table is generated that contains

More information

Short Read Alignment. Mapping Reads to a Reference

Short Read Alignment. Mapping Reads to a Reference Short Read Alignment Mapping Reads to a Reference Brandi Cantarel, Ph.D. & Daehwan Kim, Ph.D. BICF 05/2018 Introduction to Mapping Short Read Aligners DNA vs RNA Alignment Quality Pitfalls and Improvements

More information

Briefly: Bioinformatics File Formats. J Fass September 2018

Briefly: Bioinformatics File Formats. J Fass September 2018 Briefly: Bioinformatics File Formats J Fass September 2018 Overview ASCII Text Sequence Fasta, Fastq ~Annotation TSV, CSV, BED, GFF, GTF, VCF, SAM Binary (Data, Compressed, Executable) Data HDF5 BAM /

More information

1. Quality control software FASTQC:

1. Quality control software FASTQC: ITBI2017-2018, Class-Exercise5, 1-11-2017, M-Reczko 1. Quality control software FASTQC: https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc Documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/help/

More information

Sequence Alignment/Map Format Specification

Sequence Alignment/Map Format Specification Sequence Alignment/Map Format Specification The SAM/BAM Format Specification Working Group 28 Feb 2014 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing

More information

Visualization using CummeRbund 2014 Overview

Visualization using CummeRbund 2014 Overview Visualization using CummeRbund 2014 Overview In this lab, we'll look at how to use cummerbund to visualize our gene expression results from cuffdiff. CummeRbund is part of the tuxedo pipeline and it is

More information

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST A Simple Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at http://www.ncbi.nih.gov/blast/

More information

From the Schnable Lab:

From the Schnable Lab: From the Schnable Lab: Yang Zhang and Daniel Ngu s Pipeline for Processing RNA-seq Data (As of November 17, 2016) yzhang91@unl.edu dngu2@huskers.unl.edu Pre-processing the reads: The alignment software

More information

Tutorial: RNA-Seq Analysis Part II (Tracks): Non-Specific Matches, Mapping Modes and Expression measures

Tutorial: RNA-Seq Analysis Part II (Tracks): Non-Specific Matches, Mapping Modes and Expression measures : RNA-Seq Analysis Part II (Tracks): Non-Specific Matches, Mapping Modes and February 24, 2014 Sample to Insight : RNA-Seq Analysis Part II (Tracks): Non-Specific Matches, Mapping Modes and : RNA-Seq Analysis

More information

m6aviewer Version Documentation

m6aviewer Version Documentation m6aviewer Version 1.6.0 Documentation Contents 1. About 2. Requirements 3. Launching m6aviewer 4. Running Time Estimates 5. Basic Peak Calling 6. Running Modes 7. Multiple Samples/Sample Replicates 8.

More information

ChIP-seq (NGS) Data Formats

ChIP-seq (NGS) Data Formats ChIP-seq (NGS) Data Formats Biological samples Sequence reads SRA/SRF, FASTQ Quality control SAM/BAM/Pileup?? Mapping Assembly... DE Analysis Variant Detection Peak Calling...? Counts, RPKM VCF BED/narrowPeak/

More information

Mapping NGS reads for genomics studies

Mapping NGS reads for genomics studies Mapping NGS reads for genomics studies Valencia, 28-30 Sep 2015 BIER Alejandro Alemán aaleman@cipf.es Genomics Data Analysis CIBERER Where are we? Fastq Sequence preprocessing Fastq Alignment BAM Visualization

More information

NGS Analyses with Galaxy

NGS Analyses with Galaxy 1 NGS Analyses with Galaxy Introduction Every living organism on our planet possesses a genome that is composed of one or several DNA (deoxyribonucleotide acid) molecules determining the way the organism

More information

MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September

MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September 27 2014 Static Dynamic Static Minimum Information for Reporting

More information

Aligners. J Fass 21 June 2017

Aligners. J Fass 21 June 2017 Aligners J Fass 21 June 2017 Definitions Assembly: I ve found the shredded remains of an important document; put it back together! UC Davis Genome Center Bioinformatics Core J Fass Aligners 2017-06-21

More information

Standard output. Some of the output files can be redirected into the standard output, which may facilitate in creating the pipelines:

Standard output. Some of the output files can be redirected into the standard output, which may facilitate in creating the pipelines: Lecture 18 RNA-seq Alignment Standard output Some of the output files can be redirected into the standard output, which may facilitate in creating the pipelines: Filtering of the alignments STAR performs

More information

Sequence Alignment/Map Format Specification

Sequence Alignment/Map Format Specification Sequence Alignment/Map Format Specification The SAM/BAM Format Specification Working Group 2 Sep 2016 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing

More information

v0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting

v0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting October 08, 2015 v0.2.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.

More information

How to use the DEGseq Package

How to use the DEGseq Package How to use the DEGseq Package Likun Wang 1,2 and Xi Wang 1. October 30, 2018 1 MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST /Department of Automation, Tsinghua University. 2

More information

Galaxy workshop at the Winter School Igor Makunin

Galaxy workshop at the Winter School Igor Makunin Galaxy workshop at the Winter School 2016 Igor Makunin i.makunin@uq.edu.au Winter school, UQ, July 6, 2016 Plan Overview of the Genomics Virtual Lab Introduce Galaxy, a web based platform for analysis

More information