Galaxy Platform For NGS Data Analyses

Galaxy Platform For NGS Data Analyses Weihong Yan wyan@chem.ucla.edu Collaboratory Web Site http://qcb.ucla.edu/collaboratory

Collaboratory Workshops

Workshop Outline
Day 1
- UCLA Galaxy and user account
- Galaxy web interface and management
- Tools for NGS analyses and their application
- Data formats
- Build/share workflow and history
- Q and A
Day 2
- Galaxy tools for RNA-seq analysis
- Galaxy tools for ChIP-seq analysis
- Galaxy tools for annotation
- Q and A
*** Published datasets/results will be used in the tutorial

UCLA Galaxy http://galaxy.hoffman2.idre.ucla.edu
- Hardware
  Head node (1): 96 GB memory, 12 cores
  Computing nodes (8): 48 GB memory, 12 cores
  Storage: 100 TB disk space
- Galaxy resource management: Hoffman2 grid engine
  Default: 1 core/job
  bowtie, bwa, tophat, cuffdiff, cufflinks, gatk programs: 4 cores/job

UCLA Galaxy http://galaxy.hoffman2.idre.ucla.edu
- Galaxy login account: your email address associated with UCLA
- Disk quota: 250 GB/user

Galaxy Account Management

Installed tools
Launch analysis and view results
History of execution and results

Raw reads: *_qseq.txt, *.fastq
Upload to Galaxy: file transfer protocol (FTP)
Format conversion, demultiplexing: barcode splitter, demultiplex workflow
Quality assessment: fastqc, compute quality statistics, draw quality score boxplot, draw nucleotides distribution
Process reads: trim sequences, sickle, scythe
Alignment to reference: bwa, bowtie, bowtie2, tophat
Results (sam/bam): text manipulation toolkit, BEDTools, SAM Tools, java genomics toolkit, picard toolkit
Downstream analyses: BS-Seeker2, cufflinks, cuffdiff, macs, macs2, GATK, CEAS
Visualization: genome browser, IGV

Repositories of Galaxy Tools https://toolshed.g2.bx.psu.edu

- The history panel contains all uploaded datasets and the results derived from analyses.
- A history can be organized, annotated, and managed as a project.
- A history is sharable.
- A workflow can be extracted and built from a history.
- Each dataset in a history can be viewed, examined, converted to other formats, and annotated.

Getting Data to Galaxy
http://hgdownload.soe.ucsc.edu/goldenpath/danrer7/bigzips/mrna.fa.gz
UCSC Table Browser: allows genome assemblies and annotations to be uploaded to Galaxy.
Data libraries: datasets need to be put on the Galaxy server before they can be uploaded.

(Secure) FTP Clients
FileZilla: http://filezilla-project.org
Host: galaxy.hoffman2.idre.ucla.edu
Username and password: your Galaxy login

(Secure) FTP Clients https://www.bol.ucla.edu/software/mac/cuteftp/

Upload Data to Galaxy
- Data transfer from your Hoffman2 account

File Formats

File Formats
- Formats created by applications: roadmaps from the assembler Velvet; gatk_dbsnp and gatk_recal from GATK; lav and axt from blastz
- Formats used for sequences and sequencing qualities: fasta, fastq, fastqsolexa, fastqillumina, fastqsanger
- Formats used for annotations: BED (bigbed), GFF (general feature format), GFF3, GTF (gene transfer format), GenePred
- Formats used for NGS alignment information: sam (sequence alignment/map), bam (compressed binary version of sam)
- Formats used for displaying continuous-valued data: wig (wiggle), bigwig (indexed binary format of wig), bedgraph
- Formats for variation data: vcf (variant call format), pgsnp (personal genome SNP format)

File Formats http://genome.ucsc.edu/faq/faqformat.html

Retrieving Data from UCSC
Retrieve the knownGene table in two formats from the UCSC genome site.

Genomes Pre-installed in Galaxy

Upload s_1_1_600000_qseq.txt to Galaxy, or use the published QC history.
qseq file format
- A plain-text file format for sequence reads.
- Each line contains: sequencer identifier, run number, lane number, tile number, x coordinate, y coordinate, index, read number (1 for single-end, 1 or 2 for paired-end), sequence, quality, filter.
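The field list above maps directly onto a small parser. The following is a minimal sketch, not the Galaxy qseq_to_fastq tool itself: it assumes the eleven fields are tab-separated, that unknown bases are written as "." in qseq, and that the filter field is "1" for reads that passed. The names QseqRecord, parse_qseq_line and qseq_to_fastq, and the read-name layout, are illustrative choices; the quality string is copied unchanged, so any re-encoding of scores is left out.

```python
from typing import NamedTuple


class QseqRecord(NamedTuple):
    machine: str
    run: str
    lane: str
    tile: str
    x: str
    y: str
    index: str
    read: str
    sequence: str
    quality: str
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRecord:
    """Split one qseq line into its eleven fields (assumed tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    # Assumption: the last field is "1" when the read passed the filter.
    return QseqRecord(*fields[:10], fields[10] == "1")


def qseq_to_fastq(rec: QseqRecord) -> str:
    """Format a record as a four-line FASTQ entry (quality string unchanged)."""
    # The read-name layout below is an illustrative choice, not a standard.
    name = f"{rec.machine}_{rec.run}:{rec.lane}:{rec.tile}:{rec.x}:{rec.y}#{rec.index}/{rec.read}"
    sequence = rec.sequence.replace(".", "N")  # qseq marks unknown bases with "."
    return f"@{name}\n{sequence}\n+\n{rec.quality}\n"
```

A caller would typically skip records whose passed_filter is False before writing FASTQ output.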

FastQ File Format http://en.wikipedia.org/wiki/fastq_format

FastQ Quality Scores

Phred Quality Score
$Q_{\text{sanger}} = -10 \log_{10} p$, where $p$ is the probability that the base is called wrong.

Phred quality score   Probability that the base is called wrong   Accuracy of the base call
10                    1 in 10                                      90%
30                    1 in 1,000                                   99.9%
40                    1 in 10,000                                  99.99%
50                    1 in 100,000                                 99.999%
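The relation between Phred score and error probability can be checked numerically. This is a small sketch; the function names are illustrative, and decode_quality assumes Sanger-style FASTQ encoding, where each quality character is the score plus an ASCII offset of 33.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to integer scores.

    Assumption: Sanger encoding (ASCII offset 33); older Illumina
    files use offset 64 instead.
    """
    return [ord(ch) - offset for ch in quality]
```

For example, a score of 30 corresponds to an error probability of about 0.001, matching the "1 in 1,000" row of the table.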

Expected Sequence Quality
- A good-quality read has all quality scores above 28. Trim reads with lower quality scores.
- Per-base sequence and GC content: ideally, base composition and GC content do not vary along the length of the read.

Quality Control of Raw Sequences
- Upload s_1_1_600000_qseq.txt
- Run the qseq_to_fastq program
- Run the FastQC program
Alternatively, use the programs compute quality statistics -> draw quality score boxplot -> draw nucleotides distribution chart.

FastQ Converter

FastQ Manipulation
Sickle is a sliding-window trimmer that tries to keep the longest high-quality 5' portion of each read. Windows of N bases, moving from the 5' end to the 3' end, are tested for average quality. In the first window whose average quality falls below the threshold Q, the read is trimmed starting at the first base with quality < Q.
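The sliding-window idea can be sketched in a few lines. This is a simplified illustration of the description above, not Sickle's actual implementation: it trims only the 3' side, and the function name and the default window size and threshold are assumptions chosen for the example.

```python
def sliding_window_trim(seq: str, quals: list[int],
                        window: int = 5, threshold: int = 20) -> tuple[str, list[int]]:
    """Trim the 3' side of a read at the first low-quality window.

    Windows of `window` bases move from the 5' end to the 3' end. In the
    first window whose mean quality is below `threshold`, the read is cut
    at the first base of that window with quality below `threshold`.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    n = len(seq)
    w = min(window, n)  # reads shorter than the window are tested as one window
    if w == 0:
        return seq, quals
    for start in range(n - w + 1):
        win = quals[start:start + w]
        if sum(win) / w < threshold:
            # A window with mean < threshold always holds a base < threshold.
            cut = next(start + i for i, q in enumerate(win) if q < threshold)
            return seq[:cut], quals[:cut]
    return seq, quals  # no window failed; keep the whole read
```

A read whose qualities never drop below the threshold is returned unchanged, while a read that is poor from the first base is trimmed to length zero and would normally be discarded.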

FastQ Manipulation Scythe is an adapter trimmer for Illumina reads that employs a Bayesian model to classify contaminant substrings in reads

FastQ Manipulation
Run FASTQ Trimmer with 15 as the offset from the 5' end and 30 as the offset from the 3' end, then run FastQC on the trimmed reads.
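Offset-based trimming amounts to slicing a fixed number of bases off each end of every read. The sketch below uses the offsets from this exercise as defaults; the function name is illustrative, and returning empty strings for reads too short to trim is an assumption of this example rather than documented behaviour of the Galaxy tool.

```python
def trim_by_offsets(seq: str, qual: str,
                    left: int = 15, right: int = 30) -> tuple[str, str]:
    """Remove `left` bases from the 5' end and `right` bases from the 3' end."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    end = len(seq) - right
    if end <= left:
        # Assumption: a read with nothing left after trimming becomes empty.
        return "", ""
    return seq[left:end], qual[left:end]
```

With these offsets a 100-base read keeps its middle 55 bases (positions 16 to 70).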

Mapping Reads to a Genome BWA performs gapped alignments and can be used to detect indels and SNPs. BWA is generally used for DNA projects.

RNA-Seq Aligners
- Bowtie: does not perform gapped alignments. It runs faster and has a smaller memory footprint.
- Bowtie2: fast, and can perform local and gapped alignments. It performs better for reads longer than 50 bp.
  Bowtie and Bowtie2 use an indexed reference genome.
- TopHat: the most popular splice junction mapper for RNA-Seq reads. It first uses Bowtie to align reads, and then analyzes the mapped reads to identify splice junctions between exons.

Bowtie for RNA-Seq
- Select mm10 as the reference genome
- Select the trimmed reads as input for the FASTQ file
- Change "Suppress all alignments for a read" to 1 (-m 1)

Sequence Alignment/Map Format (SAM)
- A generic nucleotide alignment format that describes the alignment of reads to a reference genome in text format.
- It consists of an optional header section and an alignment section.
http://samtools.github.io/hts-specs/samv1.pdf

Alignment Summary
- Best if more than 80% of reads align to the reference
- Good library if 60% of reads align
- Less than 20% aligned suggests an incomplete reference or sample contamination
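The percentage of aligned reads can be estimated directly from SAM text. This is a minimal sketch, not a replacement for the Picard metrics: it relies on the SAM specification's FLAG field (second column), where bit 0x4 marks an unmapped read and bits 0x100 and 0x800 mark secondary and supplementary alignments. The function name is illustrative, and each mate of a pair is counted as a separate read.

```python
def alignment_rate(sam_lines) -> float:
    """Return the percentage of reads in SAM text that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (0x100 | 0x800):
            continue  # secondary/supplementary records would double-count a read
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return 100.0 * mapped / total if total else 0.0
```

It can be called on an open SAM file, for example `alignment_rate(open("aligned.sam"))`, where the file name is a placeholder.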

Picard SAM/BAM Alignment Summary Metrics
Uncheck "Assume the input file is already sorted".

Extract Workflow

Workflow Management
Published workflows and histories are listed under Shared Data.
A new workflow can be created from scratch or imported from a published workflow.

Multiplex Sequencing
- During library preparation, adapters are ligated to the DNA fragments.
  Rd1 SP and Rd2 SP: primer sites
  Index SP: primer site for the index read
  P5 and P7: flow cell attachment sites
- The index (barcode) allows for sample identification.
- Multiplexing increases experimental scalability while reducing time and cost.
- It attenuates lane effects.

Demultiplexing of FastQ Sequences
- Barcode splitter: splits FastQ data whose barcode is included at the 5' or 3' end of the sequence reads.
- Demultiplex workflow: performs demultiplexing of FastQ sequence data when the barcodes and the sequences are in two separate files.

Demultiplexing workflow

Demultiplexing of FastQ Sequences
- Upload s_2_2_1101_cut_qseq.txt, s_2_1_1101_cut_qseq.txt, and barcode.txt to Galaxy
- Convert the qseq files to fastq files
- Run the JoinLine program
- Run the barcode splitter enhanced program
- Rename each dataset to match its sample name
- Run the QC workflow on the split sample sequence datasets as needed
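The join-then-split logic of these steps can be sketched as follows. This is an illustration only, not the Galaxy JoinLine or barcode splitter tools: it assumes the sequence reads and the index (barcode) reads arrive in the same order, represents each read as a hypothetical (name, sequence, quality) tuple, takes the barcode table as a barcode-to-sample dictionary, and accepts exact barcode matches only.

```python
def demultiplex(reads, index_reads, barcodes):
    """Group reads by sample using a separate file of index reads.

    reads, index_reads: iterables of (name, sequence, quality) tuples in
    the same order (assumption). barcodes: dict mapping barcode -> sample.
    """
    bins = {sample: [] for sample in barcodes.values()}
    bins["unmatched"] = []
    # Pairing the two streams record by record plays the role of the join step.
    for read, index_read in zip(reads, index_reads):
        barcode = index_read[1]
        sample = barcodes.get(barcode, "unmatched")  # exact match only
        bins[sample].append(read)
    return bins
```

Each list in the result corresponds to one per-sample dataset, which would then be renamed and passed through the QC workflow; tolerating mismatches in the barcode would need an extra comparison step that is left out here.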