Differential gene expression analysis
- Imogen Pearson
Overview

In this exercise, we will analyze RNA-seq data to measure changes in gene expression levels between wild-type and a mutant strain of the bacterium Listeria monocytogenes.

Learning Objectives

- Review mapping reads with an example of how to use qsub to map many data sets in parallel on TACC.
- Review samtools and SAM/BAM conversion.
- Learn how to use bedtools/HTSeq to count reads overlapping genes.
- Become familiar with basic R usage and installing Bioconductor modules.
- Learn how to use edgeR/DESeq to identify differentially expressed genes.

Table of Contents

- Overview
- Learning Objectives
- Table of Contents
- Preliminary
  - Download data files
  - If you want to skip the read alignment step...
  - Using the R environment for statistical computing
  - Hints for working with R
  - Bioconductor modules for R
- Create BAM file of mapped reads
  - Map reads using Bowtie
  - Convert Bowtie output to BAM
  - Optional Exercise
- Count reads mapping to genes
  - bedtools
  - Optional: HTSeq
- Analyze differential gene expression
  - DESeq
  - Exercises
  - Optional: edgeR
  - Comparison
- Additional Points
- From here...

Preliminary

Download data files

Copy the data files for this example into your $SCRATCH space:

    cds
    cp -r $BI/ngs_course/listeria_RNA_seq/data listeria_rna_seq

File Name  | Description                       | Sample
SRR fastq  | Single-end Illumina 36-bp reads   | wild-type, biological replicate 1
SRR fastq  | Single-end Illumina 36-bp reads   | ΔsigB mutant, biological replicate 1
SRR fastq  | Single-end Illumina 36-bp reads   | wild-type, biological replicate 2
SRR fastq  | Single-end Illumina 36-bp reads   | ΔsigB mutant, biological replicate 2
NC_ fasta  | Reference Genome sequence (FASTA) | Listeria monocytogenes strain 10403S
NC_ gff    | Reference Genome features (GFF)   | Listeria monocytogenes strain 10403S

This data was submitted to the Sequence Read Archive (SRA) to accompany this paper:

Oliver, H.F., et al. (2009) Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs. BMC Genomics 10:641. PubMed

You can view the data in the ENA SRA here:

If you want to skip the read alignment step...

To get right to the new stuff, you can copy the mapped read BAM files and the reference sequence files that you will need using these commands:

    cds
    cp -r $BI/ngs_course/listeria_RNA_seq/mapped_data listeria_rna_seq

Then, skip over the "Create BAM file of mapped reads" section below.

Using the R environment for statistical computing

Many of the modules for doing statistical tests on NGS data have been written in the "R" language for statistical computing. If you're not familiar with R, then this section is probably going to be a bit confusing. (You might be thinking "Stop with the new languages already guys! Uncle!")

To orient you, we are going to run the R command, which launches the R shell inside our terminal. Like the bash shell that we normally use, the R shell interprets commands, but now they are R commands rather than bash commands. The prompt changes from login1$ to > when you are in the R shell, to help clue you in to this fact. The R shell runs inside the bash shell, so when you quit R, you will be back where you were in the bash shell.

R is the favorite language of pirates. R is a very common scripting language used in statistics. There are whole courses on using R going on in other SSI classrooms as we speak!
Inside the R universe, you have access to an incredibly large number of useful statistical functions (Fisher's exact test, nonlinear least-squares fitting, ANOVA...). R also has advanced functionality for producing plots and graphs as output. We'll take advantage of all of this here. You are well on your way to becoming denizens of the polyglot bioinformatics community now.

Regrettably, R is a bit of its own bizarro world, as far as how its commands work. (Furthermore, Googling "R" to get help can be very frustrating.) The conventions of most other programming and scripting languages seem to have been re-invented by someone who wanted to do everything their own way in R.

Just like we wrote shell scripts in bash, you can write R scripts that carry out complicated analyses.

Do not copy the > characters in the R examples. They are the R prompt to remind you which commands are to be run inside the R shell!

Hints for working with R

- Don't forget: it's q() to quit.
- For help, type ?command. Try ?read.table. The q key gets you out of help, just like for a man page.
- The left arrow <- (less-than-dash) assigns a value just like an equals sign =. For simple assignments you can use them interchangeably.
- The prompt we will sometimes be showing for R is >. Don't type this for a command. It is like the login1$ at the beginning of the bash prompt when you log in to Lonestar. It just means that you are in the R shell.
- You can type the name of a variable to have its value displayed. Like this...

    > x <- 21
    > x
    [1] 21
Bioconductor modules for R

Like other languages, R can be expanded by loading modules. The R equivalent of Bioperl or Biopython is Bioconductor. Bioconductor can theoretically do things for you like convert sequences (none of us use it for that), but where it really shines is in doing statistical tests (where it is second to none in this list of languages). Many functions for analyzing microarray data are implemented in R, and this strength has now carried over to the analysis of RNAseq data.

Here's how you install two modules that we will need for this exercise:

The install commands may take several minutes to complete. You can read ahead while they run, or even open a new terminal window, connect it to Lonestar, and continue onward in the tutorial as you wait for R.

Starting R and loading the modules for this tutorial

    login1$ module load R
    login1$ R

    R version ( ) -- "Security Blanket"
    Copyright (C) 2013 The R Foundation for Statistical Computing
    ISBN
    Platform: x86_64-unknown-linux-gnu (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

    Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

    > source("
    Warning in install.packages("biocinstaller", repos = a["biocsoft", "URL"]) :
      'lib = "/opt/apps/r/2.15.3/lib64/r/library"' is not writable
    Would you like to use a personal library instead? (y/n) y
    Would you like to create a personal library
    ~/R/x86_64-unknown-linux-gnu-library/2.15
    to install packages into? (y/n) y
    ...
    > biocLite("DESeq")
    ...
    > biocLite("edgeR")
    ...
    > q()
    Save workspace image? [y/n/c]: n

When you start R later, you will not need to re-install the modules. You can load them with just these commands:
Starting R and loading modules after they are installed

    login1$ R
    > library("DESeq")
    > library("edgeR")

These commands will work for any Bioconductor module!

Create BAM file of mapped reads

Map reads using Bowtie

For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers.)

We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson. Be careful: we are now mapping single-end reads, so you may have to look at the bowtie help to figure out how to do that!

You will need to build the index file first. This only has to be done once, and running it in "interactive mode" is fine (it's fast, so you don't need an idev shell). Then, you will need to submit a commands file with four lines to the TACC queue using qsub.

Please give the final output files the names: SRR sam, SRR sam, SRR sam, SRR sam.

I just want a little hint

Remember, bowtie-build once, then bowtie for each separate sample.

Please take me through all of the steps...

    module load bowtie
    bowtie-build NC_ fasta NC_

Now create a bowtie_commands file that looks like this using nano or another text editor:

    bowtie -p 3 -S NC_ SRR fastq SRR sam
    bowtie -p 3 -S NC_ SRR fastq SRR sam
    bowtie -p 3 -S NC_ SRR fastq SRR sam
    bowtie -p 3 -S NC_ SRR fastq SRR sam

Remember that there are 12 processors per node on Lonestar, so we choose to use 3 for each of the 4 jobs with the -p 3 option.

Create the launcher script and run it:

Remember that you cannot qsub from within an idev shell!
    module load python
    launcher_creator.py -n bowtie -q development -j bowtie_commands -t 0:30:00
    qsub bowtie.sge

Convert Bowtie output to BAM

Create a new samtools_commands file so that you convert all of these files from SAM to sorted and indexed BAM at one time by using qsub.

Linux expert tip: you can string together several commands on one line by separating them with &&. They are sent to the same core and run one after another, and each command runs only if the one before it succeeded.

Note the use of the variable $FILE: we set its value in the first part of the line and use it over and over in the latter part of the line. This is a mini use of shell scripting.

Contents of samtools_commands file

    FILE=SRR && samtools import NC_ fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
    FILE=SRR && samtools import NC_ fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
    FILE=SRR && samtools import NC_ fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
    FILE=SRR && samtools import NC_ fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam

Create a new launcher script and submit this new job to the queue. Be sure you have samtools loaded, because the node that your job launches on will inherit your current environment, including whatever modules you have loaded:

    module load samtools

I'd like to see the commands for this qsub...

    launcher_creator.py -n samtools -q development -j samtools_commands -t 0:30:00
    qsub samtools.sge

Optional Exercise

Is this a strand-specific RNA-seq library? Try using IGV to view some of the BAM file data and examine the reads mapped to each gene.
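Returning to the samtools_commands file from the conversion step: its four lines differ only in the sample name, so they can be generated rather than typed. The following is a minimal sketch, not part of the original lesson. The helper name make_samtools_commands and the placeholder names ref.fasta, sampleA and sampleB are assumptions for illustration; substitute your own reference and SRR file prefixes. It emits the same samtools import / sort / index syntax used in this tutorial, which later samtools releases changed, so check the samtools help on your system before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one samtools command line per sample.
# Usage: make_samtools_commands REFERENCE SAMPLE [SAMPLE...]
make_samtools_commands() {
    local ref="$1"
    shift
    local sample
    for sample in "$@"; do
        # The format string is single-quoted, so $FILE stays literal here
        # and is only expanded later, when the generated line is executed.
        printf 'FILE=%s && samtools import %s $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam\n' "$sample" "$ref"
    done
}

# Example with placeholder names (replace with your real files):
make_samtools_commands ref.fasta sampleA sampleB
```

Redirecting the output (for example, make_samtools_commands ref.fasta sampleA sampleB > samtools_commands) would produce a commands file in the same shape as the one shown in the conversion step.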
Count reads mapping to genes

bedtools

bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats. We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the specific multicov command that we are going to use:

    module load bedtools
    bedtools
    bedtools multicov

The multicov command takes a feature file (GFF) and counts how many reads from each of several input files fall in each feature's region. By default it counts reads that overlap the feature on either strand, but it can be made strand-specific with the -s option. Unfortunately, this option only exists for the multicov command in a version of bedtools that is newer than the module on TACC, so we don't include it in the example command below.

Note: Remember that the chromosome names in your GFF file should match the way the chromosomes are named in the reference FASTA file used in the mapping step. For example, if the reference used for mapping names its chromosomes chr1, chrX, etc., the GFF file must also call them chr1, chrX, and so on.

Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it to just the "gene" features using grep. This is all one line...

    grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' NC_ gff > NC_ genes.gff

What is this doing? It keeps every line that starts (^) with "NC_017544", followed by any number of spaces or tabs, then "GenBank", then any number of spaces or tabs, then "gene".

Use head to see the before and after.

    head -n 50 NC_ gff
    head -n 50 NC_ genes.gff

In order to use the bedtools command on our data, do the following:

    bedtools multicov -bams SRR bam SRR bam SRR bam SRR bam -bed NC_ genes.gff > gene_counts.gff

Then take a peek at the data...

    head gene_counts.gff
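To see what the grep filter from this section keeps and discards, it can help to run it on a tiny file first. This is a sketch under stated assumptions: the three GFF lines, their coordinates and the toy_ locus tags are made up, and filter_genes is a hypothetical wrapper around the same pattern used above.

```shell
#!/usr/bin/env bash
# Toy GFF (tab-separated) with made-up coordinates: two "gene" rows, one "CDS" row.
gff="$(mktemp)"
printf 'NC_017544\tGenBank\tgene\t100\t900\t.\t+\t.\tlocus_tag=toy_0001\n'    > "$gff"
printf 'NC_017544\tGenBank\tCDS\t100\t900\t.\t+\t0\tlocus_tag=toy_0001\n'    >> "$gff"
printf 'NC_017544\tGenBank\tgene\t1200\t1500\t.\t-\t.\tlocus_tag=toy_0002\n' >> "$gff"

# Hypothetical wrapper around the tutorial's pattern: keep only "gene" features.
filter_genes() {
    grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' "$1"
}

filter_genes "$gff"           # prints the two "gene" rows, drops the CDS row
filter_genes "$gff" | wc -l   # number of rows kept
```

Note that the pattern only requires the third column to start with "gene". If your annotation contained feature types with longer names beginning with those letters, a stricter test on the whole column (for example with awk) might be the safer choice.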
Optional: HTSeq

HTSeq is another tool to count reads. bedtools has many useful functions, and counting reads is just one of them. In contrast, HTSeq is a specialized utility for counting reads, and it does not have many functions other than that. HTSeq is very slow, and you need to run multiple command lines to do the same job that bedtools multicov did in one.

Why do we learn this? Well, you may care how reads that overlap more than one feature, or that only partly overlap a feature, are counted. Please take a look at this page, and if this more sophisticated counting method looks useful for you, use HTSeq. Otherwise, use bedtools.

    grep "^NC_017544" NC_ gff > count_ref.gff
    samtools view SRR bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count1.gff
    samtools view SRR bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count2.gff
    samtools view SRR bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count3.gff
    samtools view SRR bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count4.gff
    join count1.gff count2.gff | join - count3.gff | join - count4.gff > gene_counts_htseq.gff
    # if you have many samples, use a for-loop and join

gene_counts_htseq.gff has 5 more lines than gene_counts.gff. Check out the last 5 lines. They are basic statistics.

    wc -l gene_counts_htseq.gff
    tail gene_counts_htseq.gff

The basic statistics (last 5 lines) are useful to know, but they must be removed before the file can be used as input for DESeq:

    head -n -5 gene_counts_htseq.gff > gene_counts_htseq.tab

Finally, gene_counts_htseq.tab is ready to use.

htseq-count is strand-specific by default. Therefore, the read counts for each gene in gene_counts_htseq.gff are approximately half of the counts in gene_counts.gff for the corresponding gene.

Analyze differential gene expression

DESeq

DESeq Manual and Instructions

Our data is cluttered with a lot of extra columns and one column stuffed with tag=value information (including the gene names that we want!).
Let's clean it up a bit before loading it into R, which likes to work on simple tables. GFF files are tab-delimited. We can do this cleanup many ways, but a quick one is to use the Unix stream editor sed. The sed command used here replaces the entire beginning of each line, up to and including locus_tag=, with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag value and the columns of read counts for each gene. If you were writing a real pipeline, you would probably want to use a Perl or Python script that would check to be sure that each line had the locus_tag (they do), among other things.
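Before running the substitution on real data, it can be tried on a single line. This is a sketch only: the line is made up to resemble one row of bedtools multicov output (nine GFF columns followed by four counts), and strip_to_locus_tag is a hypothetical name for a function wrapping the same sed expression the lesson uses.

```shell
#!/usr/bin/env bash
# One made-up multicov-style row: 9 GFF columns, then 4 read counts.
line="$(printf 'NC_017544\tGenBank\tgene\t100\t900\t.\t+\t.\tID=toy;locus_tag=toy_0001\t12\t7\t15\t9')"

# Delete everything from the start of the line through "locus_tag=".
strip_to_locus_tag() {
    sed 's/^.*locus_tag=//'
}

# What remains is the locus tag followed by the four tab-separated counts.
printf '%s\n' "$line" | strip_to_locus_tag
```

Because the expression only trims the front of the line, any tag=value pairs that came after locus_tag in the attribute column would be left in place. Checking the result with head, as the lesson does, is a quick way to catch that.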
Reformatting gene_counts.gff

    head gene_counts.gff
    sed 's/^.*locus_tag=//' gene_counts.gff > gene_counts.tab

After it has run, take a peek at the new file:

    head gene_counts.tab

Be very careful how you copy and paste from the example below. Do not copy the > characters. Some commands are spread across multiple lines. The > are missing at the beginning of the lines after the first one in these cases. So this:

    > y <- c(
    1:10
    )
    > y

Is the same as:

    > y <- c(1:10)
    > y

It's ok to copy across the multiple lines and paste into R as long as you get all the way to the closing parenthesis.

The commands for this example are also described in the DESeq vignette (PDF).

Using DESeq
    login1$ R
    ...
    > library("DESeq")
    # read the count table; the first column (locus_tag) becomes the row names
    > counts = read.delim("gene_counts.tab", header=FALSE, row.names=1)
    > head(counts)
    > colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
    > head(counts)
    # describe the experimental design, one row per sample
    > my.design <- data.frame(
      row.names = colnames( counts ),
      condition = c( "wt", "mut", "wt", "mut"),
      libType = c( "single-end", "single-end", "single-end", "single-end" )
    )
    > conds <- factor(my.design$condition)
    > cds <- newCountDataSet( counts, conds )
    > cds
    # normalize for sequencing depth, then estimate per-gene dispersions
    > cds <- estimateSizeFactors( cds )
    > sizeFactors( cds )
    > cds <- estimateDispersions( cds )
    # plot per-gene dispersion estimates and the fitted trend (red line)
    > pdf("deseq-dispersion_estimates.pdf")
    > plot(
      rowMeans( counts( cds, normalized=TRUE ) ),
      fitInfo(cds)$perGeneDispEsts,
      pch = '.', log="xy"
    )
    > xg <- 10^seq( -.5, 5, length.out=300 )
    > lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
    > dev.off()
    # test for differential expression and sort by p-value
    > result <- nbinomTest( cds, "wt", "mut" )
    > head(result)
    > result = result[order(result$pval), ]
    > head(result)
    > write.csv(result, "DESeq-wt-vs-mut.csv")
    # MA plot; genes with adjusted p-value < 0.1 are drawn in red
    > pdf("deseq-ma-plot.pdf")
    > plot(
      result$baseMean, result$log2FoldChange,
      log="x", pch=20, cex=.3,
      col = ifelse( result$padj < .1, "red", "black" )
    )
    > dev.off()
    > q()
    Save workspace image? [y/n/c]: n
    login1$ head DESeq-wt-vs-mut.csv
DESeq-wt-vs-mut.csv is a comma-delimited file that could be reloaded into R or viewed in Excel.

You should copy the two *.pdf files that were created back to your local computer using scp to view them.

Exercises

What are the numbers returned by sizeFactors( cds )?

Answer... They are, roughly speaking, the relative average coverage (sequencing depth) of each data set. Specifically, DESeq estimates each one as the median, across genes, of the ratio of that sample's count to the gene's geometric mean count over all samples. Dividing the counts by the size factor puts the samples on a common scale.

What are the dispersion estimates?

Answer... The model assumes that the counts for each gene vary more between replicates than Poisson sampling alone would predict, and describes them with a negative binomial (overdispersed Poisson) distribution. The dispersion is the per-gene parameter that measures this extra variance. In this model, the lower the counts are, the more dispersion relative to the mean is expected (red line in graph). Thus, larger fold changes are required in lowly expressed genes before a difference is called significant.

What was the predominant effect of the mutation on gene expression in this Listeria strain?

Optional: edgeR

edgeR is another R package that you can use to do a similar analysis.

edgeR Manual and Instructions

These commands use the negative binomial model, calculate the false discovery rate (FDR ~ adjusted p-value), and make a plot similar to the one from DESeq.
Using edgeR

    login1$ R
    ...
    > library("edgeR")
    > counts = read.delim("gene_counts.tab", header=FALSE, row.names=1)
    > colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
    > head(counts)
    > group <- factor(c("wt","mut","wt","mut"))
    > dge = DGEList(counts=counts,group=group)
    # estimate the common and per-gene (tagwise) dispersions
    > dge <- estimateCommonDisp(dge)
    > dge <- estimateTagwiseDisp(dge)
    # exact test between the two groups; n is set high to keep every gene
    > et <- exactTest(dge)
    > etp <- topTags(et, n=100000)
    # flip the sign so fold changes are mut relative to wt, as in DESeq
    > etp$table$logFC = -etp$table$logFC
    > pdf("edger-ma-plot.pdf")
    > plot(
      etp$table$logCPM, etp$table$logFC,
      xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
      col = ifelse( etp$table$FDR < .1, "red", "black" )
    )
    > dev.off()
    > write.csv(etp$table, "edger-wt-vs-mut.csv")
    > q()
    Save workspace image? [y/n/c]: n
    login1$ head edger-wt-vs-mut.csv

Note that the log fold change (logFC) that edgeR calculates here is initially the reverse of the one in the DESeq example: it is wt relative to mut. To fix this, we negate the log fold change before plotting and saving it.

Comparison

Compare the expression changes predicted by DESeq and edgeR to each other. Does edgeR or DESeq predict more significant changes?

Additional Points

- In an actual RNAseq analysis, you might want to trim stray adaptor sequences from your data using the tools discussed in Evaluating your raw sequencing data.
- You can get a lot more information from RNAseq data than you could from a microarray experiment. You can map transcriptional start sites, areas of unexpected transcription, splice sites, etc., all because you have full sequence information that we have barely used in this example.
- You can call variants from mapped RNAseq data. Just be aware that many regions will have no coverage (because they are not expressed as RNA).

From here...

- Visualize mapped reads in BAM files using IGV to manually check some of the gene counts.
- Look at the more sophisticated "Tuxedo" suite of RNAseq tools, which performs many functions that are especially useful in eukaryotic genomes.
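For the Comparison question above, one quick way to get numbers without going back into R is to count the rows in each CSV that pass a significance cutoff. The sketch below is an illustration, not the lesson's method. The helper count_below is hypothetical, it looks the column up by its header name, and it assumes a header row and no commas inside any field. The toy CSV only demonstrates the mechanics.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count rows of a CSV whose named column is below a cutoff.
# Usage: count_below FILE COLUMN_NAME CUTOFF
# Assumes a header row and no commas inside fields; NA and empty values are skipped.
count_below() {
    awk -F',' -v col="$2" -v cutoff="$3" '
        { gsub(/"/, "") }                      # drop the quotes that write.csv adds
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c && $c != "NA" && $c != "" && ($c + 0) < (cutoff + 0) { n++ }
        END { print n + 0 }
    ' "$1"
}

# Toy table with made-up values, shaped like a write.csv result.
csv="$(mktemp)"
printf '"","id","pval","padj"\n'        > "$csv"
printf '"1","toy_0001",1e-05,0.002\n'  >> "$csv"
printf '"2","toy_0002",0.4,0.8\n'      >> "$csv"
printf '"3","toy_0003",NA,NA\n'        >> "$csv"

count_below "$csv" padj 0.1   # rows with padj below 0.1
```

On the real outputs this might be called as count_below DESeq-wt-vs-mut.csv padj 0.1 and count_below edger-wt-vs-mut.csv FDR 0.1, assuming the column headers in your files are named padj and FDR as in the R code; check them with head first.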
More informationData: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat a.tgz. Software:
A Tutorial: De novo RNA- Seq Assembly and Analysis Using Trinity and edger The following data and software resources are required for following the tutorial: Data: ftp://ftp.broad.mit.edu/pub/users/bhaas/rnaseq_workshop/rnaseq_workshop_dat
More informationConnect to login8.stampede.tacc.utexas.edu. Sample Datasets
Alignment Overview Connect to login8.stampede.tacc.utexas.edu Sample Datasets Reference Genomes Exercise #1: BWA global alignment Yeast ChIP-seq Overview ChIP-seq alignment workflow with BWA Introducing
More informationChIP-seq hands-on practical using Galaxy
ChIP-seq hands-on practical using Galaxy In this exercise we will cover some of the basic NGS analysis steps for ChIP-seq using the Galaxy framework: Quality control Mapping of reads using Bowtie2 Peak-calling
More informationIntroduction to Cancer Genomics
Introduction to Cancer Genomics Gene expression data analysis part I David Gfeller Computational Cancer Biology Ludwig Center for Cancer research david.gfeller@unil.ch 1 Overview 1. Basic understanding
More informationGalaxy Platform For NGS Data Analyses
Galaxy Platform For NGS Data Analyses Weihong Yan wyan@chem.ucla.edu Collaboratory Web Site http://qcb.ucla.edu/collaboratory Collaboratory Workshops Workshop Outline ü Day 1 UCLA galaxy and user account
More informationWelcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.
Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your
More informationSAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.
Alignment of NGS reads, samtools and visualization Hands-on Software used in this practical BWA MEM : Burrows-Wheeler Aligner. A software package for mapping low-divergent sequences against a large reference
More informationWelcome to GenomeView 101!
Welcome to GenomeView 101! 1. Start your computer 2. Download and extract the example data http://www.broadinstitute.org/~tabeel/broade.zip Suggestion: - Linux, Mac: make new folder in your home directory
More informationDr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata
Analysis of RNA sequencing data sets using the Galaxy environment Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata Microarray and Deep-sequencing core facility 30.10.2017 RNA-seq workflow I Hypothesis
More informationDifferential gene expression analysis using RNA-seq
https://abc.med.cornell.edu/ Differential gene expression analysis using RNA-seq Applied Bioinformatics Core, September/October 2018 Friederike Dündar with Luce Skrabanek & Paul Zumbo Day 3: Counting reads
More informationGetting Started with R
Getting Started with R STAT 133 Gaston Sanchez Department of Statistics, UC Berkeley gastonsanchez.com github.com/gastonstat/stat133 Course web: gastonsanchez.com/stat133 Tool Some of you may have used
More informationCOMPARATIVE MICROBIAL GENOMICS ANALYSIS WORKSHOP. Exercise 2: Predicting Protein-encoding Genes, BlastMatrix, BlastAtlas
COMPARATIVE MICROBIAL GENOMICS ANALYSIS WORKSHOP Exercise 2: Predicting Protein-encoding Genes, BlastMatrix, BlastAtlas First of all connect once again to the CBS system: Open ssh shell client. Press Quick
More informationGenomics - Problem Set 2 Part 1 due Friday, 1/26/2018 by 9:00am Part 2 due Friday, 2/2/2018 by 9:00am
Genomics - Part 1 due Friday, 1/26/2018 by 9:00am Part 2 due Friday, 2/2/2018 by 9:00am One major aspect of functional genomics is measuring the transcript abundance of all genes simultaneously. This was
More informationRNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF
RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF Acknowledgments Genomics Virtual Lab: gvl.org.au Galaxy for tutorials: galaxy-tut.genome.edu.au Galaxy Australia: galaxy-aust.genome.edu.au
More informationde.nbi and its Galaxy interface for RNA-Seq
de.nbi and its Galaxy interface for RNA-Seq Jörg Fallmann Thanks to Björn Grüning (RBC-Freiburg) and Sarah Diehl (MPI-Freiburg) Institute for Bioinformatics University of Leipzig http://www.bioinf.uni-leipzig.de/
More informationBioinformatics in next generation sequencing projects
Bioinformatics in next generation sequencing projects Rickard Sandberg Assistant Professor Department of Cell and Molecular Biology Karolinska Institutet March 2011 Once sequenced the problem becomes computational
More informationDifferential Expression
Differential Expression Data In this practical, as before, we will work with RNA-Seq data from Arabidopsis seeds that matured at standard temperature (ST, 22 C day/18 C night) or at high temperature (HT,
More informationv0.2.0 XX:Z:UA - Unassigned XX:Z:G1 - Genome 1-specific XX:Z:G2 - Genome 2-specific XX:Z:CF - Conflicting
October 08, 2015 v0.2.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.
More informationScripting Languages Course 1. Diana Trandabăț
Scripting Languages Course 1 Diana Trandabăț Master in Computational Linguistics - 1 st year 2017-2018 Today s lecture Introduction to scripting languages What is a script? What is a scripting language
More informationChIP-seq practical: peak detection and peak annotation. Mali Salmon-Divon Remco Loos Myrto Kostadima
ChIP-seq practical: peak detection and peak annotation Mali Salmon-Divon Remco Loos Myrto Kostadima March 2012 Introduction The goal of this hands-on session is to perform some basic tasks in the analysis
More informationIntroduction to UNIX command-line II
Introduction to UNIX command-line II Boyce Thompson Institute 2017 Prashant Hosmani Class Content Terminal file system navigation Wildcards, shortcuts and special characters File permissions Compression
More informationHow to store and visualize RNA-seq data
How to store and visualize RNA-seq data Gabriella Rustici Functional Genomics Group gabry@ebi.ac.uk EBI is an Outstation of the European Molecular Biology Laboratory. Talk summary How do we archive RNA-seq
More informationAn Introduction to Linux and Bowtie
An Introduction to Linux and Bowtie Cavan Reilly November 10, 2017 Table of contents Introduction to UNIX-like operating systems Installing programs Bowtie SAMtools Introduction to Linux In order to use
More informationPractical Linux Examples
Practical Linux Examples Processing large text file Parallelization of independent tasks Qi Sun & Robert Bukowski Bioinformatics Facility Cornell University http://cbsu.tc.cornell.edu/lab/doc/linux_examples_slides.pdf
More informationUMass High Performance Computing Center
UMass High Performance Computing Center University of Massachusetts Medical School February, 2019 Challenges of Genomic Data 2 / 93 It is getting easier and cheaper to produce bigger genomic data every
More informationA Hands-On Tutorial: RNA Sequencing Using High-Performance Computing
A Hands-On Tutorial: RNA Sequencing Using Computing February 11th and 12th, 2016 1st session (Thursday) Preliminaries: Linux, HPC, command line interface Using HPC: modules, queuing system Presented by:
More informationGalaxy workshop at the Winter School Igor Makunin
Galaxy workshop at the Winter School 2016 Igor Makunin i.makunin@uq.edu.au Winter school, UQ, July 6, 2016 Plan Overview of the Genomics Virtual Lab Introduce Galaxy, a web based platform for analysis
More informationUnix Essentials. BaRC Hot Topics Bioinformatics and Research Computing Whitehead Institute October 12 th
Unix Essentials BaRC Hot Topics Bioinformatics and Research Computing Whitehead Institute October 12 th 2016 http://barc.wi.mit.edu/hot_topics/ 1 Outline Unix overview Logging in to tak Directory structure
More informationChIP-seq Analysis. BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute.
ChIP-seq Analysis BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute http://barc.wi.mit.edu/hot_topics/ Outline ChIP-seq overview Experimental design Quality control/preprocessing
More informationChIP-Seq Tutorial on Galaxy
1 Introduction ChIP-Seq Tutorial on Galaxy 2 December 2010 (modified April 6, 2017) Rory Stark The aim of this practical is to give you some experience handling ChIP-Seq data. We will be working with data
More informationJunctionSeq Package User Manual
JunctionSeq Package User Manual Stephen Hartley National Human Genome Research Institute National Institutes of Health v0.6.10 November 20, 2015 Contents 1 Overview 2 2 Requirements 3 2.1 Alignment.........................................
More informationRNA-Seq analysis with Astrocyte Differential expression and transcriptome assembly
RNA-Seq analysis with Astrocyte Differential expression and transcriptome assembly Beibei Chen Ph.D BICF 9/28/2016 Agenda Launch Workflows using Astrocyte BICF Workflows BICF RNA-seq Workflow Experimental
More informationFinding and Exporting Data. BioMart
September 2017 Finding and Exporting Data Not sure what tool to use to find and export data? BioMart is used to retrieve data for complex queries, involving a few or many genes or even complete genomes.
More informationAgroMarker Finder manual (1.1)
AgroMarker Finder manual (1.1) 1. Introduction 2. Installation 3. How to run? 4. How to use? 5. Java program for calculating of restriction enzyme sites (TaqαI). 1. Introduction AgroMarker Finder (AMF)is
More informationJunctionSeq Package User Manual
JunctionSeq Package User Manual Stephen Hartley National Human Genome Research Institute National Institutes of Health February 16, 2016 JunctionSeq v1.1.3 Contents 1 Overview 2 2 Requirements 3 2.1 Alignment.........................................
More informationIntroduction to Unix
Introduction to Unix Part 1: Navigating directories First we download the directory called "Fisher" from Carmen. This directory contains a sample from the Fisher corpus. The Fisher corpus is a collection
More informationContents. Note: pay attention to where you are. Note: Plaintext version. Note: pay attention to where you are... 1 Note: Plaintext version...
Contents Note: pay attention to where you are........................................... 1 Note: Plaintext version................................................... 1 Hello World of the Bash shell 2 Accessing
More information!"#$%&$'()#$*)+,-./).01"0#,23+3,303456"6,&((46,7$+-./&((468,
!"#$%&$'()#$*)+,-./).01"0#,23+3,303456"6,&((46,7$+-./&((468, 9"(1(02)1+(',:.;.4(*.',?9@A,!."2.4B.'#A,C(;.
More informationColorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi
Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi Although a little- bit long, this is an easy exercise
More informationWorking with Basic Linux. Daniel Balagué
Working with Basic Linux Daniel Balagué How Linux Works? Everything in Linux is either a file or a process. A process is an executing program identified with a PID number. It runs in short or long duration
More informationreplace my_user_id in the commands with your actual user ID
Exercise 1. Alignment with TOPHAT Part 1. Prepare the working directory. 1. Find out the name of the computer that has been reserved for you (https://cbsu.tc.cornell.edu/ww/machines.aspx?i=57 ). Everyone
More informationLecture 5. Essential skills for bioinformatics: Unix/Linux
Lecture 5 Essential skills for bioinformatics: Unix/Linux UNIX DATA TOOLS Text processing with awk We have illustrated two ways awk can come in handy: Filtering data using rules that can combine regular
More informationChIP-seq hands-on practical using Galaxy
ChIP-seq hands-on practical using Galaxy In this exercise we will cover some of the basic NGS analysis steps for ChIP-seq using the Galaxy framework: Quality control Mapping of reads using Bowtie2 Peak-calling
More informationJunctionSeq Package User Manual
JunctionSeq Package User Manual Stephen Hartley National Human Genome Research Institute National Institutes of Health March 30, 2017 JunctionSeq v1.5.4 Contents 1 Overview 2 2 Requirements 3 2.1 Alignment.........................................
More informationv0.3.0 May 18, 2016 SNPsplit operates in two stages:
May 18, 2016 v0.3.0 SNPsplit is an allele-specific alignment sorter which is designed to read alignment files in SAM/ BAM format and determine the allelic origin of reads that cover known SNP positions.
More informationSTA 250: Statistics Lab 1
STA 250: Statistics Lab 1 This lab work is intended to be an introduction to the software R. What follows is a description of the basic functionalities of R, along with a series of tasks that ou d have
More informationPackage RNASeqR. January 8, 2019
Type Package Package RNASeqR January 8, 2019 Title RNASeqR: RNA-Seq workflow for case-control study Version 1.1.3 Date 2018-8-7 Author Maintainer biocviews Genetics, Infrastructure,
More informationPre-Workshop Training materials to move you from Data to Discovery. Get Science Done. Reproducibly.
Pre-Workshop Packet Training materials to move you from Data to Discovery Get Science Done Reproducibly Productively @CyVerseOrg Introduction to CyVerse... 3 What is Cyberinfrastructure?... 3 What to do
More informationChIP- seq Analysis. BaRC Hot Topics - Feb 24 th 2015 BioinformaBcs and Research CompuBng Whitehead InsBtute. hgp://barc.wi.mit.
ChIP- seq Analysis BaRC Hot Topics - Feb 24 th 2015 BioinformaBcs and Research CompuBng Whitehead InsBtute hgp://barc.wi.mit.edu/hot_topics/ Before we start: 1. Log into tak (step 0 on the exercises) 2.
More informationVariation among genomes
Variation among genomes Comparing genomes The reference genome http://www.ncbi.nlm.nih.gov/nuccore/26556996 Arabidopsis thaliana, a model plant Col-0 variety is from Landsberg, Germany Ler is a mutant
More information