Using the Galaxy Local Bioinformatics Cloud at CARC
Lijing Bu, Sr. Research Scientist, Bioinformatics Specialist
Center for Evolutionary and Theoretical Immunology (CETI), Department of Biology, University of New Mexico
CARC Galaxy Workshop @ UNM
Outline: Self-introduction; Galaxy; Hands-on activity (Demo 1, Demo 2); Useful information
Hands up if you have: worked with NGS data; learned the FASTQ format and what its 4 lines mean; done RNA-Seq; run command lines; created a BLAST database; installed tools on Linux/Unix; used an online bioinformatics platform.
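For anyone new to the format, the 4 lines of a FASTQ record can be sketched as follows. This is a minimal illustration, not a full parser: the record shown is made up, and the names `FastqRecord` and `parse_fastq` are hypothetical helpers introduced only for this example.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str    # line 1: starts with "@", followed by the read identifier
    sequence: str   # line 2: the base calls
    separator: str  # line 3: starts with "+", optionally repeats the identifier
    quality: str    # line 4: one quality character per base


def parse_fastq(text: str) -> List[FastqRecord]:
    """Split FASTQ text into records of exactly four lines each."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of 4 lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        records.append(FastqRecord(header[1:], seq, sep, qual))
    return records


# A made-up single-read example.
example = "@read1\nGATTACA\n+\nIIIIIII\n"
records = parse_fastq(example)
print(records[0].read_id, len(records[0].sequence))
```

The key property to remember is that lines 2 and 4 always have the same length, because each base has exactly one quality character.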
Self-Introduction
Name: Lijing Bu
Department: CETI @ Biology
Project: RNA-Seq, genome re-sequencing
Tools: Shell, Perl; BLAST, ClustalW; TopHat, Cufflinks, Trinity; R, edgeR, DESeq; ABySS, SOAPdenovo, Velvet
Big Data and Abundant Tools
NGS data: 2~50 GB of initial data per project.
Analysis involves multiple steps and tools, which is a computational challenge.
Command lines make it easy to construct workflows, but they take time to master. For example:
blastx -query Trinity.fasta -db /home/blast/2015-06-18/swissprot -out blastx.outfmt6 -evalue 1e-20 -num_threads 44 -max_target_seqs 1 -outfmt 6
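To show why such command lines take practice, the blastx call on this slide can be assembled option by option. The sketch below only builds and prints the command; the function name `build_blastx_command` is a hypothetical helper, and actually running the command assumes BLAST+ and the database path from the slide exist on your machine.

```python
import shlex


def build_blastx_command(query, db, out, evalue="1e-20", threads=44):
    """Return the blastx call from the slide as an argument list."""
    return [
        "blastx",
        "-query", query,            # translated nucleotide query (Trinity assembly)
        "-db", db,                  # protein database to search
        "-out", out,                # output file
        "-evalue", evalue,          # E-value cutoff
        "-num_threads", str(threads),
        "-max_target_seqs", "1",    # as on the slide: one target per query
        "-outfmt", "6",             # tabular output
    ]


cmd = build_blastx_command(
    "Trinity.fasta",
    "/home/blast/2015-06-18/swissprot",
    "blastx.outfmt6",
)
print(shlex.join(cmd))

# To run it for real (requires BLAST+ and the database):
# import subprocess
# subprocess.run(cmd, check=True)
```

Every option is a separate decision the user has to get right, which is the burden a Galaxy tool form takes over.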
Bioinformatics Clouds: easy to use; share data and analysis steps; workflows; $$$$$; fixed workflows; few apps
Open-source Galaxy @ PSU: 700 individual tools in 200 packages and 40 categories; easy to use; highly customizable.
Local instance: add almost any bioinformatics tool; use customized reference databases; can use high-performance computer clusters; developers can publish new tools.
Galaxy Interface: Tools Panel, View Panel, History Panel
Example Workflow: RNA-Seq
Galaxy @ CARC: UNM Local Cloud for Bioinformatics
[Diagram: User, Administrator, CARC Manager, Galaxy Web Server; Ulam Cluster (16 nodes x 8 CPUs/32 GB); Xena Cluster (1 TB ~ 3 TB shared memory)]
Agenda of Galaxy @ CARC
Phase I - Sputnik: Proof of concept. Local Galaxy test run. Tool installation. Connect to the CARC server and submit PBS jobs.
Phase II - Pluto: Internal test. Connect the hardware to the cluster, install Linux and Galaxy, set up PBS job submission, and design the main page. Continue to add software; separate cluster jobs (60 s lag) from local jobs. For a few tools, run benchmark tests to find the settings that give the best performance. For some tools, extend PBS jobs so they can be submitted to the large shared-memory server (1 TB ~ 3 TB). Open to a few internal users and workshops. Fix and add tools and local databases based on feedback.
Phase III - Pluto: Open to more users. Install more tools as requested by users. Build workflows from repeatedly used tools. Develop tools/workflows for specific purposes, and publish/share them with the whole Galaxy group. Possibly upgrade hardware.
Register for a CARC Account
The PI applies for a project (approved in 1-2 days): https://www.carc.unm.edu/getting-started/request-a-project.html. Provide name, email, title, and abstract.
Students apply for an account linked to the PI's project (approved in 1-2 days): https://www.carc.unm.edu/getting-started/request-an-account.html. Provide name, email, and the project name to link to, and select the machines you want to use.
Contact Lijing Bu to create a Galaxy account.
Recommended Links about Galaxy
All about Galaxy: https://galaxyproject.org/
Ask questions on BioStar: https://biostar.usegalaxy.org/
Videos of various analyses using Galaxy: https://vimeo.com/galaxyproject/videos/page:1/sort:alphabetical/format:thumbnail
Galaxy Pluto @ CARC: http://pluto.alliance.unm.edu
User name: workshop-user#, where # is your seat number
Password: carcgalaxy
Change your password after login!
Temporary user accounts were created for workshop use only. All data and workflows of temporary accounts will be deleted one month after the workshop.
Hands On
Demo 1. Basic dataset management: 1. Shared histories 2. NGS reads QC 3. Workflow. Handle multiple datasets: 1. Select multiple datasets as input 2. Build a dataset collection.
Demo 2. RNA-Seq workflow: copy datasets; upload data with a link; view and run the workflow. Dataset management: manage history; delete/hide datasets; share history.
Detailed instructions (PDF file) are at https://www.carc.unm.edu/education-outreach/workshops--training/workshop-materials/index.html
Derived from the Galaxy Project's online videos at https://vimeo.com/galaxyproject/videos/page:1/sort:alphabetical/format:thumbnail
Demo 1. Basic dataset management: 1. Shared histories 2. Reads QC 3. Workflow. Handle multiple datasets: 1. Select multiple datasets as input 2. Build a dataset collection.
Find Published History
Import History - 1
Import History - 2: Click a category to view the tools in it. Click a dataset for a brief view. Download the dataset. Eye icon: view the dataset. Pencil icon: change features. Cross icon: delete the dataset.
Check on the Reads Quality: Click on the link to open the tool. Single input: run FastQC on a fastq read file. 1. Delete dataset 3, click to show deleted data, and undelete dataset 3.
FastQC Results (Single Input): FastQC generates 2 output files: 1. HTML webpage report (shown here, data 6) 2. Raw text report (data 7). Good or bad Illumina data? See http://www.bioinformatics.babraham.ac.uk/projects/fastqc
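The raw text report is useful when you want the pass/warn/fail verdicts without opening the webpage. The sketch below assumes the usual layout of FastQC's text output, where each module starts with a line such as `>>Module name<TAB>status` and ends with `>>END_MODULE`; the sample text is a made-up excerpt, and `module_statuses` is a hypothetical helper name.

```python
# A made-up excerpt in the layout of a FastQC raw text report.
SAMPLE_REPORT = """##FastQC\t0.11.5
>>Basic Statistics\tpass
#Measure\tValue
Filename\texample.fastq
>>END_MODULE
>>Per base sequence quality\tfail
>>END_MODULE
>>Adapter Content\twarn
>>END_MODULE
"""


def module_statuses(report_text):
    """Map each FastQC module name to its pass/warn/fail status."""
    statuses = {}
    for line in report_text.splitlines():
        # Module headers start with ">>"; ">>END_MODULE" closes a module.
        if line.startswith(">>") and line != ">>END_MODULE":
            name, _, status = line[2:].partition("\t")
            statuses[name] = status
    return statuses


for name, status in module_statuses(SAMPLE_REPORT).items():
    print(f"{status:5s} {name}")
```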
Reads Quality Filtering (single-end data): Find Trimmomatic in the tool panel, click on the link, and run it with the default settings.
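To give a feel for what quality filtering does, here is a simplified sliding-window trim in the spirit of Trimmomatic's SLIDINGWINDOW step. It is an illustration only, not Trimmomatic's actual implementation: the window size of 4 and the mean-quality threshold of 20 are assumed values, and `sliding_window_trim` is a hypothetical helper.

```python
def sliding_window_trim(qualities, window=4, min_mean=20):
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts the read at the start of the first
    window whose mean quality falls below min_mean.
    """
    if len(qualities) < window:
        return len(qualities)  # too short to evaluate a full window
    for start in range(len(qualities) - window + 1):
        win = qualities[start:start + window]
        if sum(win) / window < min_mean:
            return start
    return len(qualities)


# Made-up per-base Phred scores that decay toward the 3' end.
quals = [38, 38, 37, 36, 30, 22, 12, 8, 5, 2]
keep = sliding_window_trim(quals)
print(f"keep the first {keep} of {len(quals)} bases")
```

Reads whose quality drops toward the 3' end get shortened, which is why the FastQC per-base quality plot usually looks better after filtering.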
Trimmomatic Results
FastQC on Filtered Data: 1. Click the re-run button 2. Switch to the filtered dataset 3. Run
Improved Reads Quality by Filtering: before and after Trimmomatic
Extract Workflow from History: Uncheck datasets 2-5; keep the analysis steps on dataset 1 only.
Extract Workflow from History
Extract Workflow from History: Save & Run. Click tools to add them to the current workflow. Mark the output files you want to keep so that the rest are hidden in the history.
Select Multiple Files as Input: Use the button to select multiple files. Shift + select: hold the Shift key to select a series of files. Control (or Command) key: hold it to select or deselect individual files. View from an individual tool.
Create a Dataset Collection for Multi-Step Analysis: Build a list for paired-end read files.
Dataset Collection Created
FastQC - Select Dataset Collection
Results of FastQC on Collection: Instead of two output files, there are two lists of output files. Each list has 4 files.
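The "two lists of 4 files" outcome follows from running a two-output tool once per collection element. The sketch below is a toy model of that behaviour, not Galaxy code: `map_over_collection` and `fake_fastqc` are hypothetical names, and the file names are invented.

```python
def fake_fastqc(name):
    """Stand-in for FastQC: one input gives an HTML report and a text report."""
    return (f"{name}: Webpage", f"{name}: RawData")


def map_over_collection(collection, tool):
    """Run `tool` on every element and regroup the results by output type."""
    outputs_per_item = [tool(item) for item in collection]
    # zip(*...) turns [(html1, txt1), (html2, txt2), ...]
    # into (html1, html2, ...), (txt1, txt2, ...)
    return [list(column) for column in zip(*outputs_per_item)]


# Invented names for a collection of 4 read files.
collection = ["A_1.fastq", "A_2.fastq", "B_1.fastq", "B_2.fastq"]
html_reports, text_reports = map_over_collection(collection, fake_fastqc)
print(len(html_reports), len(text_reports))
```

The number of output lists equals the number of outputs the tool produces, and the length of each list equals the number of elements in the input collection.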
Manage the History: Share your analysis with another user or with everyone.
Copy Datasets to a New History: Select fastq datasets 1 to 5. Name the new history.
Demo 2. RNA-Seq workflow: copy datasets; upload data with a link; view and run the workflow. Dataset management: manage history; delete/hide datasets; share history.
RNA-Seq Technology
Input (6 files): reads (2 samples x 2 replicates), reference sequences, General Feature Format (GFF) file.
Tools: NGS aligner (TopHat2); reads counter/stats (Cufflinks).
Source: http://faculty.ucr.edu/~tgirke/html_presentations/manuals/workshop_dec_12_16_2013/rrnaseq/rrnaseq.pdf
Upload Reference Sequence
1. In a new window, open the following address (UCSC FTP site of human reference genome sequences): http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes
2. Right-click on the chromosome 19 file (chr19.fa.gz) and copy the link address. To right-click on a Mac, tap with two fingers.
Note: Be careful where you get data! The NCBI, UCSC, and ENSEMBL databases store data in slightly different formats (ID system, chromosome labels, GFF).
Correct link: http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr19.fa.gz
Upload Reference: Galaxy is sensitive to data type! Most tools require the fastqsanger type for fastq files, rather than fastq, fastqcssanger, or fastqillumina.
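One reason the data type matters is that the quality line is decoded differently: Sanger-style FASTQ stores Phred scores with an ASCII offset of 33, while the older Illumina encoding uses an offset of 64. The sketch below illustrates the effect of picking the wrong type; the quality strings are made up and `decode_quality` is a hypothetical helper.

```python
# ASCII offsets of the two quality encodings.
OFFSETS = {"fastqsanger": 33, "fastqillumina": 64}


def decode_quality(quality_line, datatype="fastqsanger"):
    """Convert a FASTQ quality line into Phred scores for the given type."""
    offset = OFFSETS[datatype]
    return [ord(ch) - offset for ch in quality_line]


sanger_line = "II5+"
print(decode_quality(sanger_line, "fastqsanger"))    # sensible scores
print(decode_quality(sanger_line, "fastqillumina"))  # negative scores: wrong type

illumina_line = "hhTJ"
print(decode_quality(illumina_line, "fastqillumina"))
```

Negative scores are a quick sign that a file has been labelled with the wrong data type.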
Upload Reference Sequence: When you paste the link, make sure the size is not empty. If it is empty, type a space after the pasted link address.
Find Published Workflows
Caution: The default input file is the last file in the history that matches the required format. When several files share the same format (here, fastq), check the input order!
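The pitfall can be pictured with a toy model of how a default input gets picked. This is not Galaxy's code: the `Dataset` type, the `default_input` function, and the file names are all hypothetical and serve only to illustrate the rule stated on the slide.

```python
from typing import List, NamedTuple, Optional


class Dataset(NamedTuple):
    number: int  # position in the history
    name: str
    fmt: str     # data type, e.g. "fastq" or "fasta"


def default_input(history: List[Dataset], fmt: str) -> Optional[Dataset]:
    """Pick the last dataset in the history that matches the format."""
    matches = [d for d in history if d.fmt == fmt]
    return matches[-1] if matches else None


# Invented history with three fastq files and one reference.
history = [
    Dataset(1, "sampleA_rep1.fastq", "fastq"),
    Dataset(2, "sampleA_rep2.fastq", "fastq"),
    Dataset(3, "chr19.fa", "fasta"),
    Dataset(4, "sampleB_rep1.fastq", "fastq"),
]

chosen = default_input(history, "fastq")
print(chosen.number, chosen.name)
```

In this model every fastq input of a workflow would default to dataset 4, so each input has to be set by hand to the intended sample and replicate.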
Jobs Running: If the page didn't reload automatically but the circle in the browser tab is spinning, the job is running. Be patient.
Grey box: jobs are waiting
Yellow box: jobs are running
Red box: error messages
Manage Datasets in the History: Click to show deleted files. Click to show hidden files. Click again to hide them. In workflows, you can choose to hide unwanted intermediate files (more details in the workflow build section).
Demo Results: BAM files (reads aligned to the reference); newly found transcripts in GFF format (two samples merged); differential expression analysis results.