Using the Galaxy Local Bioinformatics Cloud at CARC

Lijing Bu
Sr. Research Scientist, Bioinformatics Specialist
Center for Evolutionary and Theoretical Immunology (CETI)
Department of Biology, University of New Mexico
CARC Galaxy Workshop @ UNM

Outline
- Self-introduction
- Galaxy
- Hands-on activity: Demo 1, Demo 2
- Useful information

Hands up if you have:
- Worked with NGS data?
- Learned the FASTQ format, and what its 4 lines mean?
- Done RNA-Seq?
- Run command lines?
- Created a BLAST database?
- Installed tools on Linux/Unix?
- Used an online bioinformatics platform?
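For readers who have not met the FASTQ format yet, the four lines of one record can be sketched as follows. This is a minimal illustration, not part of the workshop material: the names `FastqRecord` and `parse_fastq_record` and the example read are made up for this sketch.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    read_id: str    # line 1: identifier, stored here without the leading '@'
    sequence: str   # line 2: the base calls
    separator: str  # line 3: starts with '+', optionally repeats the identifier
    quality: str    # line 4: one quality character per base


def parse_fastq_record(lines):
    """Parse exactly four text lines into one FastqRecord (hypothetical helper)."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly 4 lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return FastqRecord(header[1:], sequence, separator, quality)


# Made-up example read, for illustration only.
example = ["@read_1\n", "GATTACA\n", "+\n", "IIIIHHG\n"]
record = parse_fastq_record(example)
print(record.read_id, record.sequence, record.quality)
```

The length check reflects the one-to-one pairing between bases and quality characters, which is the property quality-control tools rely on.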

Self-Introduction
Name: Lijing Bu
Department: CETI @ Biology
Project: RNA-Seq, genome re-sequencing
Tools:
- Shell, Perl
- BLAST, ClustalW
- TopHat, Cufflinks, Trinity
- R, edgeR, DESeq
- ABySS, SOAPdenovo, Velvet

Big Data and Abundant Tools
- NGS data: 2~50 GB of initial data per project
- Analysis involves multiple steps and tools
- Computational challenge
- Command lines make it easy to construct workflows, but they take time to master:
  blastx -query Trinity.fasta -db /home/blast/2015-06-18/swissprot -out blastx.outfmt6 -evalue 1e-20 -num_threads 44 -max_target_seqs 1 -outfmt 6
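The blastx command on this slide requests tabular output (-outfmt 6), which by default has 12 tab-separated columns. A minimal sketch of reading such a file is given below; the function name `read_outfmt6` and the sample hit line are invented for illustration and are not taken from the workshop data.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_outfmt6(handle):
    """Return a list of dicts, one per hit line (hypothetical helper)."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(OUTFMT6_COLUMNS, row))
        # Convert the numeric columns that are most often filtered on.
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


# Made-up hit line; a real run would open the blastx.outfmt6 file instead.
sample = "contig_1\tsp|P12345|EXAMPLE\t85.5\t200\t29\t0\t1\t600\t10\t209\t1e-50\t250\n"
hits = read_outfmt6(io.StringIO(sample))
print(hits[0]["qseqid"], hits[0]["sseqid"], hits[0]["evalue"])
```

In practice the same function could be called with `open("blastx.outfmt6")`, the output name used in the command above.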

Bioinformatics Clouds
- Easy to use
- Share data, analysis steps
- Workflows
- $$$$$
- Fixed workflows
- Few apps

Open source Galaxy @ PSU
- 700 individual tools in 200 packages, 40 categories
- Easy to use
- Highly customizable
Local instance:
- Add almost any bioinformatics tool
- Use customized reference databases
- Can use high-performance computer clusters
- Developers can publish new tools

Galaxy Interface
- Tools Panel
- View Panel
- History Panel

Example Workflow: RNA-Seq

Galaxy @ CARC: UNM Local Cloud for Bioinformatics
- Ulam Cluster: 16 nodes x 8 CPUs / 32 GB
- Xena Cluster: 1 TB ~ 3 TB shared memory
- Galaxy Web Server
- Diagram labels: CARC Manager, Administrator, Users

Agenda of Galaxy @ CARC
Phase I - Sputnik: Proof of concept
- Local Galaxy test run.
- Tools installation.
- Connect to CARC server, submit PBS jobs.
Phase II - Pluto: Internal test
- Hardware connection to cluster; install Linux and Galaxy; set up PBS job submission; main page design.
- Continue to add software; separate cluster jobs (60 s lag) from local jobs.
- For a few tools, run benchmark tests to find the settings that give the best performance.
- For some tools, extend PBS jobs so they can be submitted to the large shared-memory server (1 TB ~ 3 TB).
- Open to a few internal users and workshops.
- Fix and add more tools and local databases based on feedback.
Phase III - Pluto: Open to more users
- Install more tools as requested by users.
- Build workflows from repeatedly used tools.
- Develop tools/workflows for specific purposes, and publish/share them with the whole Galaxy group.
- Possible hardware upgrade.

Register CARC Account
PI applies for a project (approved in 1-2 days)
https://www.carc.unm.edu/getting-started/request-a-project.html
- Name, email, title
- Abstract
Students apply for an account linked to the PI's project (approved in 1-2 days)
https://www.carc.unm.edu/getting-started/request-an-account.html
- Name, email, and the project name to link to
- Select the machines you want to use
Contact Lijing Bu to create a Galaxy account

Recommended Links about Galaxy
- All about Galaxy: https://galaxyproject.org/
- Ask questions on BioStar: https://biostar.usegalaxy.org/
- Videos of various analyses using Galaxy: https://vimeo.com/galaxyproject/videos/page:1/sort:alphabetical/format:thumbnail

Galaxy Pluto @ CARC
http://pluto.alliance.unm.edu
User name: workshop-user# (where # is your seat number)
Password: carcgalaxy
Change your password after login!
Temporary user accounts were created for workshop use only. All data and workflows of temporary user accounts will be deleted one month after the workshop.

Hands On
Demo 1
  Basic dataset management
  1. Shared histories
  2. NGS reads QC
  3. Workflow
  Handle multiple datasets
  1. Select multiple datasets as input
  2. Build datasets collection
Demo 2
  RNA-Seq workflow
  - Copy datasets
  - Upload data with a link
  - View and run workflow
  Datasets management
  - Manage history
  - Delete/hide datasets
  - Share history
A PDF file with detailed instructions is at https://www.carc.unm.edu/education-outreach/workshops--training/workshop-materials/index.html
Derived from the Galaxy Project's online videos at https://vimeo.com/galaxyproject/videos/page:1/sort:alphabetical/format:thumbnail

Demo 1
  Basic dataset management
  1. Shared histories
  2. Reads QC
  3. Workflow
  Handle multiple datasets
  1. Select multiple datasets as input
  2. Build datasets collection

Find Published History

Import History - 1

Import History - 2
- Click to view tools in this category.
- Click to have a brief view.
- Download dataset
- Eye: view dataset
- Pencil: change features
- Cross: delete dataset

Check the Reads Quality
- Click on the link to open the tool.
- Single FastQC on fastq read file 1.
- Delete dataset 3, click to check deleted data, and undelete dataset 3.

FastQC Results
With a single input, FastQC generates 2 output files:
1. HTML webpage report (shown here, data 6)
2. Raw text report (data 7)
Good or bad Illumina data? http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Reads Quality Filtering
Single-end data
- Find Trimmomatic in the tool panel and click on the link.
- Run with the default settings.

Trimmomatic Results

FastQC on Filtered Data
1. Click the re-run button.
2. Switch to the filtered dataset.
3. Run.

Improved Reads Quality by Filtering
Before and after Trimmomatic

Extract Workflow from History
Uncheck datasets 2-5; keep the analysis steps on dataset 1 only.

Extract Workflow from History (continued)
- Save & Run
- Click tools to add them to the current workflow.
- Mark output files to hide the rest in the history.

Select Multiple Files as Input
- Button to select multiple files
- Shift + select: press the Shift key to select a series of files.
- Control or Command key: press to select or deselect multiple files.
- View from individual tool.

Create Datasets Collection for Multiple Step Analysis
Build a list for paired-end read files.

Datasets Collection Created

FastQC - Select Datasets Collection

Results of FastQC on Collection
Instead of two output files, there are two lists of output files. Each list has 4 files.

Manage the History
Share your analysis with another user or with everyone.

Copy Datasets to a New History
- Select fastq datasets 1 to 5
- Name the new history

Demo 2
  RNA-Seq workflow
  - Copy datasets
  - Upload data with a link
  - View and run workflow
  Datasets management
  - Manage history
  - Delete/hide datasets
  - Share history

RNA-Seq Technology
Input: 6 files
- Reads: 2 samples x 2 replicates
- Reference sequences
- General Feature Format (GFF) file
Tools
- NGS aligner: TopHat2
- Reads counter/stats: Cufflinks
http://faculty.ucr.edu/~tgirke/html_presentations/manuals/workshop_dec_12_16_2013/rrnaseq/rrnaseq.pdf

Upload Reference Sequence
1. In a new window, open the following address (UCSC FTP site of human reference genome sequences):
   http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes
2. Right-click on chr19.fa.gz and copy the link address. (To right-click on a Mac, use two fingers.)
Correct link: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr19.fa.gz
Note: Be careful where you get data! The NCBI, UCSC, and ENSEMBL databases store data in slightly different formats (ID system, chromosome label, GFF).
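The note about chromosome labels can be made concrete with a small check, run before uploading, that compares sequence names in a reference FASTA file with those in an annotation file. This is only a sketch: the two helper functions and the tiny in-memory inputs are hypothetical, and the "chr19" versus "19" mismatch is a typical illustration rather than something taken from the workshop files.

```python
def fasta_sequence_names(lines):
    """Collect the first word of every FASTA header line ('>name description')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def gff_sequence_names(lines):
    """Collect column 1 (the sequence ID) of every GFF feature line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and blank lines
        names.add(line.split("\t")[0])
    return names


# Made-up inputs: the reference uses 'chr19', the annotation uses '19'.
fasta_lines = [">chr19\n", "ACGT\n"]
gff_lines = ["##gff-version 3\n", "19\texample\tgene\t1\t4\t.\t+\t.\tID=gene1\n"]

missing = gff_sequence_names(gff_lines) - fasta_sequence_names(fasta_lines)
print("annotation sequences not found in reference:", sorted(missing))
```

A non-empty result suggests that the reference and the annotation came from sources with different naming schemes, which is the situation the note warns about.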

Upload Reference
Galaxy is sensitive to data type! Most tools require the fastqsanger type for fastq files, rather than fastq, fastqcssanger, or fastqillumina.
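As background not stated on the slide: the fastqsanger type corresponds to Sanger-style quality scores encoded with an ASCII offset of 33, while fastqillumina corresponds to the older Illumina encoding with an offset of 64. A rough heuristic for guessing which offset a file uses is sketched below. The function name and the example quality strings are made up, and the result is only a guess, because a file of uniformly high-quality offset-33 reads can look like offset 64.

```python
def guess_quality_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from FASTQ quality lines.

    Heuristic sketch: characters below ';' (ASCII 59) cannot occur in the
    offset-64 encodings, so seeing one implies offset 33. Otherwise offset 64
    is assumed, which may be wrong for very high-quality offset-33 data.
    """
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        raise ValueError("no quality characters given")
    return 33 if min(codes) < 59 else 64


# Made-up quality lines for illustration.
print(guess_quality_offset(["IIIIHHG#"]))  # contains '#', ASCII 35
print(guess_quality_offset(["hhhhggff"]))  # lowest character is 'f', ASCII 102
```

If the guess is 33, setting the dataset type to fastqsanger in the upload dialog is usually appropriate. Otherwise the file probably needs conversion first, which is beyond this sketch.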

Upload Reference Sequence
When pasting the link, make sure the size is not empty. If it is empty, type a space after the pasted link address.

Find Published Workflows

!!! The default input file is the last file that fits the format type. For multiple files with the same format type (here fastq), the input order needs to be checked. !!!

Jobs Running
If the page didn't reload automatically but the circle in the browser tab is spinning, the job is running. Be patient.

Grey box: jobs are waiting

Yellow box: jobs are running

Red box: error messages

Manage Datasets in the History
- Click to show deleted files.
- Click to show hidden files.
- Click again to hide them.
In workflows, you can choose to hide unwanted intermediate files (more details in the workflow build section).

Demo Results
- BAM files: reads aligned to the reference
- Newly found transcripts in GFF format (two samples merged)
- Differential expression analysis results